AI Data Requirements Checklist
A pre-project checklist covering five areas: Data Identification & Inventory, Data Quality Assessment, Privacy & Compliance, AI-Specific Requirements (bias, labeling, representativeness), and Security. It contains 45+ checkpoint items and a sign-off section.
Key Insights
Data is the foundation of AI—and data problems are the most common cause of AI project failure. Models trained on incomplete, biased, or poor-quality data produce unreliable results regardless of algorithmic sophistication. Privacy violations in training data create legal exposure. Security gaps in data pipelines create breach risks.
This data requirements checklist ensures systematic evaluation of data readiness before committing resources to model development. It catches data issues early when they're cheap to fix, rather than late when they derail projects or create production failures.
Overview
Most AI projects fail because of data, not algorithms. Incomplete data leads to model gaps. Biased data leads to biased outputs. Poor-quality data leads to unreliable predictions. And privacy violations in training data create legal exposure that surfaces long after deployment.
The checklist provides a systematic evaluation of data readiness before model development begins. It's designed to surface data issues early, when they're cheap to fix, rather than late, when they derail projects or cause production failures.
What's Inside
- Data Identification & Inventory: Verification of data sources, owners, schemas, volume estimates, refresh frequency, historical availability, and access permissions
- Data Quality Assessment: Evaluation of completeness (missing values), accuracy (source of truth validation), consistency (cross-source conflicts), timeliness (staleness), duplicates, outliers, and quality metrics baseline
- Data Privacy & Compliance: Classification (public/internal/confidential/restricted), PII/PHI identification, legal basis documentation, privacy impact assessment triggers, data minimization, retention requirements, cross-border transfers, anonymization, subject rights, and regulatory requirements (GDPR, CCPA, HIPAA)
- AI-Specific Data Requirements: Training/validation/test split strategy, representativeness assessment, bias identification, label quality verification, feature engineering documentation, data lineage, ground truth availability, versioning, and feedback collection plans
- Data Security: Encryption (rest and transit), access controls, logging, secure transfer, development environment protections, backup/recovery, and disposal procedures
- Assessment Sign-Off Workflow: Structured approval from data owner, privacy/legal, and AI governance
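As an illustration of the "quality metrics baseline" item under Data Quality Assessment, here is a minimal sketch in plain Python. It is not part of the checklist itself: the function name `quality_baseline`, the record fields, and the choice of Tukey's 1.5 × IQR fences for outliers are all assumptions made for this example.

```python
from collections import Counter
from statistics import quantiles


def quality_baseline(rows, key_fields, numeric_field):
    """Compute simple baseline metrics for a list of dict records.

    Hypothetical helper: field names and thresholds are illustrative only.
    """
    total = len(rows)
    fields = sorted({f for r in rows for f in r})

    # Completeness: share of records with a non-missing value, per field.
    completeness = {
        f: (sum(1 for r in rows if r.get(f) not in (None, "")) / total
            if total else 0.0)
        for f in fields
    }

    # Duplicates: records that repeat an already-seen key.
    key_counts = Counter(tuple(r.get(k) for k in key_fields) for r in rows)
    duplicate_count = sum(c - 1 for c in key_counts.values())

    # Outliers: values outside Tukey's fences (1.5 * IQR beyond the quartiles).
    values = [r[numeric_field] for r in rows if r.get(numeric_field) is not None]
    outliers = []
    if len(values) >= 2:
        q1, _, q3 = quantiles(values, n=4)
        spread = 1.5 * (q3 - q1)
        outliers = [v for v in values if v < q1 - spread or v > q3 + spread]

    return {
        "completeness": completeness,
        "duplicate_count": duplicate_count,
        "outliers": outliers,
    }


# Made-up sample records: one missing country, one repeated id, one extreme amount.
rows = [
    {"id": 1, "amount": 10, "country": "DE"},
    {"id": 2, "amount": 11, "country": "FR"},
    {"id": 3, "amount": 12, "country": None},
    {"id": 4, "amount": 12, "country": "DE"},
    {"id": 5, "amount": 13, "country": "US"},
    {"id": 6, "amount": 11, "country": "US"},
    {"id": 7, "amount": 12, "country": "FR"},
    {"id": 7, "amount": 500, "country": "FR"},
]
report = quality_baseline(rows, key_fields=["id"], numeric_field="amount")
print(report)
```

Recording numbers like these before development starts gives later pipeline runs something concrete to compare against. The accuracy, consistency, and timeliness checks need a source of truth or timestamps and are not covered by this sketch.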
Who This Is For
- Data Scientists preparing data for model development
- Data Engineers building AI data pipelines
- AI Project Managers ensuring data readiness before development
- Privacy Officers reviewing AI data usage
- AI Governance Teams establishing data quality gates
Why This Resource
This checklist distills hard-won lessons from AI projects that failed due to data issues. It provides systematic coverage of data requirements that are easy to overlook: not just data quality, but privacy compliance, AI-specific needs like representativeness and bias identification, and security controls.
The structured assessment with sign-off workflow creates accountability and documentation—evidence that data requirements were evaluated before development began.
FAQ
Q: When should this checklist be completed?
A: Before committing significant resources to model development. Ideally during project scoping or proof-of-concept planning. Completing this checklist early catches data issues when there's still time and budget to address them.
Q: What if we can't complete all checklist items?
A: The checklist identifies gaps—addressing every gap isn't always necessary or possible. Some items may be N/A for your use case. Others may reveal risks that need mitigation planning. The goal is informed decision-making, not perfection.
Q: Who should complete this checklist?
A: Typically a collaboration between data scientists (technical requirements), data engineers (infrastructure), and privacy/legal (compliance requirements). The sign-off workflow ensures appropriate stakeholders review before proceeding.
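The answers above can be sketched as a small readiness check: items may be done, not applicable, or open gaps, and the three sign-off roles review before work proceeds. This is only one possible encoding. The `Item` and `Status` types, the `readiness` function, and the rule that every open gap needs a written mitigation are assumptions for this example, not something the checklist prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DONE = "done"
    NOT_APPLICABLE = "n/a"
    GAP = "gap"


@dataclass
class Item:
    area: str
    description: str
    status: Status
    mitigation: str = ""  # expected when status is GAP (assumed policy)


# Sign-off roles named in the checklist's sign-off workflow.
REQUIRED_SIGN_OFFS = ("data owner", "privacy/legal", "AI governance")


def readiness(items, sign_offs):
    """Summarize open gaps without a mitigation plan and missing sign-offs."""
    unmitigated = [
        i for i in items
        if i.status is Status.GAP and not i.mitigation.strip()
    ]
    missing = [role for role in REQUIRED_SIGN_OFFS if role not in sign_offs]
    return {
        "unmitigated_gaps": unmitigated,
        "missing_sign_offs": missing,
        "ready": not unmitigated and not missing,
    }


# Made-up example items.
items = [
    Item("Data Quality", "Missing values assessed", Status.DONE),
    Item("Privacy", "Cross-border transfers reviewed", Status.NOT_APPLICABLE),
    Item("AI-Specific", "Label quality verified", Status.GAP,
         mitigation="Re-label a 5% sample before training"),
    Item("Security", "Disposal procedure defined", Status.GAP),
]
result = readiness(items, sign_offs={"data owner"})
print(result["ready"], result["missing_sign_offs"])
```

Treating "N/A" and "gap with a mitigation plan" as acceptable outcomes mirrors the point that the goal is informed decision-making rather than a perfect score.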