
Data Quality and Validation

Bad data leads to bad decisions. A dashboard showing incorrect revenue figures, a machine learning model trained on corrupted data, or a report with missing records — all erode trust and cause real business harm. Data quality requires intentional effort at every stage of your data pipeline.

Dimensions of Data Quality

Quality isn't a single metric. It has multiple dimensions you need to monitor; the sketch after this list shows how the most measurable ones can be profiled:

Accuracy — Is the data correct? Does the recorded value match reality?

Completeness — Is data missing? Are required fields populated?

Consistency — Does data agree across systems? If a customer exists in two tables, do the details match?

Timeliness — Is data fresh enough? Yesterday's inventory counts might be useless for real-time decisions.

Uniqueness — Are there duplicates? The same order recorded twice inflates revenue reports.
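Completeness and uniqueness are the easiest dimensions to quantify directly. Here is a minimal pandas sketch (the table and column names are hypothetical) that profiles both:

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
})

# Completeness: share of populated values per column
print(orders.notna().mean())  # email: 0.75, i.e. one value in four is missing

# Uniqueness: duplicate rate on the business key
dup_rate = orders["order_id"].duplicated().mean()
print(f"{dup_rate:.0%} duplicate order_ids")  # 25% - one repeated id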

Implementing Validation

Validation should happen at multiple points: when data enters your system, during transformations, and before it reaches consumers.

import pandas as pd

# Assumes df is the table under validation, customers is a reference table,
# and expected_schema maps column names to dtype strings, e.g. {"price": "float64"}

# Schema validation - structure is correct
# (pandas DataFrames have no .schema attribute; compare dtypes instead)
assert df.dtypes.astype(str).to_dict() == expected_schema, "Schema mismatch"

# Value checks - data makes sense
assert df['price'].min() >= 0, "Prices cannot be negative"
assert df['email'].str.contains('@', na=False).all(), "Invalid emails found"

# Freshness check - data is recent
yesterday = pd.Timestamp.now() - pd.Timedelta(days=1)
assert df['updated_at'].max() > yesterday, "Data is stale"

# Uniqueness - no duplicates
assert df['order_id'].is_unique, "Duplicate orders detected"

# Referential integrity - every order points at a known customer
assert df['customer_id'].isin(customers['id']).all(), "Unknown customer IDs"

These checks should fail loudly. A pipeline that silently processes bad data is worse than one that stops and alerts you to problems.
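One way to guarantee loud failure is to funnel every check through a single gate that raises instead of logging and moving on. A minimal sketch, assuming pandas; the runner and check names are purely illustrative, not from any specific library:

import pandas as pd

def run_checks(df: pd.DataFrame, checks: dict) -> None:
    """Evaluate each named check; raise if any fail so the pipeline stops."""
    failures = [name for name, check in checks.items() if not check(df)]
    if failures:
        raise ValueError(f"Data quality checks failed: {failures}")

orders = pd.DataFrame({"order_id": [1, 2, 2], "price": [9.99, -1.0, 5.0]})
try:
    run_checks(orders, {
        "non_negative_prices": lambda df: (df["price"].dropna() >= 0).all(),
        "unique_order_ids": lambda df: df["order_id"].is_unique,
    })
except ValueError as exc:
    print(exc)  # both checks fail on this sample; the try/except is only for the demo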

Tools for Data Quality

Great Expectations lets you define expectations about your data and validates them automatically. It generates documentation and alerts when expectations fail.
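For flavor, here is roughly what that looks like using the older pandas-shortcut interface. Great Expectations has changed its API substantially across major versions (the 1.x releases use a different "fluent" style), so treat the exact calls below as a sketch:

import great_expectations as ge
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "price": [9.99, 0.0, 24.50]})

# Wrap the DataFrame so expectation methods become available (pre-1.0 API)
orders_ge = ge.from_pandas(orders)

orders_ge.expect_column_values_to_not_be_null("order_id")
orders_ge.expect_column_values_to_be_unique("order_id")
orders_ge.expect_column_values_to_be_between("price", min_value=0)

# Run every registered expectation and report the results
print(orders_ge.validate())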

dbt tests integrate quality checks directly into your transformation layer. Define tests alongside your models in their YAML files, using built-in generic tests such as unique, not_null, accepted_values, and relationships for the common cases, and they run with every build.

Monte Carlo and similar observability tools monitor data quality continuously, detecting anomalies you might not have anticipated.
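To make "detecting anomalies" concrete, a toy version of one technique such tools rely on is flagging a daily row count that lands far outside its recent range. The function and thresholds below are illustrative, not any vendor's API:

import pandas as pd

def volume_anomaly(daily_counts: pd.Series, window: int = 14, z_threshold: float = 3.0) -> bool:
    """Flag the latest count if it deviates sharply from the recent average."""
    history = daily_counts.iloc[:-1].tail(window)
    today = daily_counts.iloc[-1]
    mean, std = history.mean(), history.std()
    if std == 0:
        return today != mean
    return abs(today - mean) / std > z_threshold

counts = pd.Series([10_120, 9_980, 10_050, 10_210, 9_900, 4_300])
print(volume_anomaly(counts, window=5))  # True - the last day's volume collapsed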

Building a Quality Culture

Technical tools matter, but culture matters more. Teams that treat data quality as everyone's responsibility catch problems earlier. Document expected data shapes. Review quality metrics regularly. Treat data bugs with the same urgency as application bugs.

Quality is not a one-time effort — it's an ongoing practice woven into every step of your data work.
