Common metrics for evaluating data quality after ETL (Extract, Transform, Load) processes focus on ensuring data is accurate, complete, consistent, and fit for its intended use. These metrics help identify issues that could impact downstream applications, analytics, or business decisions. Developers and data engineers typically assess data quality through measurable criteria that align with organizational requirements.
First, accuracy and completeness are foundational metrics. Accuracy measures whether data reflects real-world values correctly. For example, if an ETL pipeline processes customer addresses, accuracy could be validated by checking if postal codes match known geographic regions. Completeness evaluates whether required fields contain data. A null value in a mandatory column like user_id or missing rows in a daily sales feed would indicate gaps. Tools like SQL queries (e.g., SELECT COUNT(*) FROM table WHERE column IS NULL) or data profiling scripts can automate these checks. Another key metric is consistency, which ensures data aligns across systems or time. For instance, aggregated revenue in a data warehouse should match source transactional databases after transformations, and values like currency codes should adhere to a standardized format (e.g., "USD" instead of "US Dollar").
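As a minimal sketch, the completeness, accuracy, and consistency checks described above can be expressed as plain SQL. The table and column names here (customers, user_id, postal_code_reference, warehouse.daily_revenue, source.transactions) are hypothetical placeholders rather than a specific schema, and the queries assume a standard ANSI-style engine:

```sql
-- Completeness: count mandatory fields that arrived empty
-- (customers and user_id are hypothetical names used for illustration).
SELECT COUNT(*) AS missing_user_ids
FROM customers
WHERE user_id IS NULL;

-- Accuracy (spot check): flag postal codes absent from a known reference list.
SELECT c.customer_id, c.postal_code
FROM customers c
LEFT JOIN postal_code_reference r
  ON c.postal_code = r.postal_code
WHERE r.postal_code IS NULL;

-- Consistency: compare aggregated revenue in the warehouse against the source system.
SELECT
  (SELECT SUM(amount) FROM warehouse.daily_revenue
    WHERE revenue_date = DATE '2024-01-01') AS warehouse_total,
  (SELECT SUM(amount) FROM source.transactions
    WHERE transaction_date = DATE '2024-01-01') AS source_total;
```

In practice, queries like these would run on a schedule, with results compared against agreed thresholds so that gaps or mismatches trigger alerts rather than silent drift.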
Second, timeliness, validity, and uniqueness address operational and structural concerns. Timeliness measures whether data is available within expected timeframes—a delay in hourly log ingestion could disrupt real-time dashboards. Validity checks if data conforms to defined formats or rules, such as email addresses containing "@" or dates in YYYY-MM-DD format. Uniqueness ensures no duplicates exist; for example, a customer table should have one record per unique identifier unless explicitly allowed. Automated tests using constraints (e.g., UNIQUE indexes) or checksums for duplicate detection are common solutions. These metrics often require collaboration with business stakeholders to define rules (e.g., what constitutes a "valid" product category).
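A similar sketch covers validity, uniqueness, and timeliness. Again, customers, email, hourly_logs, and loaded_at are assumed example names, and the NOW() interval arithmetic follows PostgreSQL conventions:

```sql
-- Validity: flag email addresses that do not contain an "@".
SELECT COUNT(*) AS invalid_emails
FROM customers
WHERE email NOT LIKE '%@%';

-- Uniqueness: surface customer identifiers that appear more than once.
SELECT customer_id, COUNT(*) AS duplicate_count
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Timeliness: measure how stale the most recent load is
-- (NOW() and interval subtraction follow PostgreSQL conventions).
SELECT NOW() - MAX(loaded_at) AS ingestion_lag
FROM hourly_logs;
```

The same rules can also be enforced declaratively where the platform allows it, for example with UNIQUE or CHECK constraints, with the queries above serving as monitoring for data that arrives from systems you do not control.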
Finally, integrity and reliability tie metrics to broader system behavior. Integrity focuses on relationships between datasets, such as foreign keys in related tables (e.g., an order record must reference a valid customer_id). Reliability measures how consistently the ETL process itself performs; unexpected failures or inconsistent row counts between pipeline runs could indicate instability. Tools like data observability platforms or custom logging (e.g., tracking row counts pre- and post-transformation) help monitor these aspects. Choosing metrics depends on context: a healthcare system might prioritize accuracy and validity for patient records, while a marketing team might emphasize timeliness for campaign analytics. Defining clear thresholds (e.g., <1% missing values) and integrating checks into CI/CD pipelines help ensure ongoing quality without manual overhead.
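To illustrate integrity and reliability, the sketch below uses an anti-join to find orphaned foreign keys and a hypothetical pipeline_run_log table (assumed to be populated by the ETL job itself) to compare row counts across runs:

```sql
-- Referential integrity: find orders whose customer_id has no matching customer (anti-join).
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;

-- Reliability: compare row counts recorded before and after transformation for each run,
-- assuming a pipeline_run_log table that the ETL job writes to on every execution.
SELECT run_id,
       rows_extracted,
       rows_loaded,
       rows_extracted - rows_loaded AS dropped_rows
FROM pipeline_run_log
WHERE rows_extracted <> rows_loaded
ORDER BY run_id DESC;
```

Checks of this kind fit naturally into CI/CD or orchestration workflows: a run that produces orphaned records or drops rows beyond the agreed threshold can fail the pipeline before bad data reaches downstream consumers.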
