The Data Quality Problem: Garbage In, Garbage Out
Your data pipeline ingests data from 20 sources. It transforms, combines, and loads into your data warehouse. Downstream, analytics break, ML models underperform, reports show wrong numbers.
Why? Data quality issues weren't caught upstream.
- Missing fields: Source system stopped sending email addresses
- Invalid values: Revenue = -$1M (data entry error)
- Duplicates: Same user appeared 3x
- Schema changes: New field added upstream, pipeline breaks
- Anomalies: 100x spike in transactions (real or fake?)
Result: Bad decisions based on bad data. Re-processing takes days. Trust in the data pipeline erodes.
Why Data Quality Is Hard
Challenge 1: Multiple Sources
Data comes from databases, APIs, files, sensors. Each has different formats, schemas, update frequencies. Inconsistent field names, types, precision.
Challenge 2: Schema Evolution
Source systems change. New fields added. Existing fields renamed or removed. Your pipeline must adapt without breaking.
Challenge 3: Silent Failures
Data is silently corrupted. Missing value defaults to 0. Date is parsed wrong. Null becomes "null" (string). Pipeline completes "successfully" but data is garbage.
Pattern 1: Schema Validation
Define expected schema. Validate every record against it. Reject or quarantine invalid records.
Key validation rules:
| Rule | Validates | Example |
|---|---|---|
| Required Fields | No null/missing values | email must be present |
| Type Checking | Correct data type | age must be integer |
| Range Validation | Values within bounds | age 0-150 |
| Format Validation | Correct format | email is valid format |
| Referential Integrity | Foreign keys valid | user_id exists in users table |
Pattern 2: Anomaly Detection
Some data quality issues aren't schema violationsโthey're anomalies. A sale of $1M is valid if it's a special order, but anomalous if typical order is $100.
Statistical Anomalies
Time-Series Anomalies
| Anomaly Type | Detection Method | Example |
|---|---|---|
| Spike | Sudden increase from baseline | Traffic jumps 10x in 1 hour |
| Drop | Sudden decrease from baseline | Revenue drops to 0 at midnight |
| Trend Shift | Sustained change in trajectory | Error rate trending up consistently |
| Seasonal Deviation | Wrong pattern for time of day/week | Heavy traffic on Sunday 3am |
Pattern 3: Quality Metrics
Measure data quality. Track over time. Alert when degrading.
| Metric | What It Measures | Calculation | Alert Threshold |
|---|---|---|---|
| Completeness | % fields with values | rows with value / total rows | < 95% |
| Validity | % values matching schema | valid rows / total rows | < 98% |
| Uniqueness | % unique values where expected | unique IDs / total IDs | < 99% |
| Timeliness | Freshness of data | now - max(timestamp) | > 1 hour |
Pattern 4: Validation Pipeline Design
Structure your pipeline to catch quality issues early and handle them gracefully.
Pattern 5: Data Remediation
Bad data happens. You need processes to detect, fix, and prevent.
Detection โ Investigation โ Remediation
Data quality metric drops below threshold
Alert: "Completeness dropped from 99% to 87%"
Phase 2: Investigation
What field? When did it start? Which source?
Root cause: Email field not sent by source API after schema update
Phase 3: Remediation
Option A: Fix upstream source
Option B: Reload from backup
Option C: Fill with default value
Option D: Exclude affected period
Phase 4: Prevention
Add stricter validation
Add alerting to source system
Add data quality SLA to contract
Complete Data Validation Architecture
โ Schema validation on all incoming data
โ Anomaly detection with statistical methods
โ Quality metrics tracked and alerted
โ Quarantine process for bad data
โ Referential integrity checks
โ Data lineage tracking
โ Root cause analysis process
โ Remediation procedures documented
Key Takeaways
โ Validate schemas on all incoming data
โ Detect anomalies with statistical methods
โ Track quality metrics continuously
โ Quarantine bad data, don't delete
โ Build remediation processes upfront