To verify data integrity after ETL completion, developers use a combination of automated checks, validation rules, and reconciliation processes. The goal is to ensure data accuracy, consistency, and completeness between the source and target systems. This involves comparing metrics, enforcing constraints, and validating business logic post-load.
First, basic validation includes checking row counts and data volumes. For example, if a source table has 10,000 records, the target should contain the same number after transformation, unless the ETL logic explicitly filters or aggregates data; discrepancies here point to issues in extraction or loading. Additionally, checksums can verify that critical columns (e.g., IDs, transaction amounts) remain unchanged during transfer: hash algorithms such as MD5 or SHA-256 produce fingerprints of a dataset that can be compared before and after the load. For transformed data, column-level checks ensure values align with expected patterns (such as email formats, date ranges, or numerical thresholds) using frameworks like Great Expectations or custom SQL queries.
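As a concrete illustration, here is a minimal sketch of a row-count check and a SHA-256 checksum over critical columns. The inline DataFrames and the helper `column_checksum` are hypothetical stand-ins; in a real pipeline the frames would be pulled from the source and target systems.

```python
# Minimal sketch: row-count and checksum comparison after a load.
# The inline DataFrames stand in for data queried from source and target;
# column names ("id", "amount") are illustrative.
import hashlib
import pandas as pd

def column_checksum(df: pd.DataFrame, cols: list[str]) -> str:
    """SHA-256 fingerprint of the selected columns, independent of row order."""
    canonical = (
        df[cols]
        .astype(str)
        .sort_values(by=cols)      # ignore ordering differences between systems
        .to_csv(index=False)
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

source = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})
target = pd.DataFrame({"id": [3, 1, 2], "amount": [7.25, 10.0, 25.5]})

# 1. Row-count check: counts must match unless the ETL filters or aggregates.
assert len(source) == len(target), "Row count mismatch between source and target"

# 2. Checksum check: critical columns must survive the transfer unchanged.
assert column_checksum(source, ["id", "amount"]) == column_checksum(target, ["id", "amount"]), \
    "Checksum mismatch on critical columns"

print("Row-count and checksum checks passed")
```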
Next, structural and relational integrity checks are crucial. These include validating primary keys for uniqueness, foreign keys for referential consistency, and columns for not-null constraints. For instance, a customer table’s "customer_id" should have no duplicates, and an orders table’s "customer_id" should reference valid entries in the customer table. Automated tests can flag violations for review. Data profiling tools like Apache Griffin or Talend Data Quality help compare statistical metrics (e.g., averages, min/max values) between source and target to detect anomalies. For example, if a sales column totals $1M in the source, the aggregated value in the target should match unless business rules dictate otherwise.
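The sketch below shows these structural checks in pandas, mirroring the customer/orders example from the text; in a warehouse they would usually be written as SQL queries or declarative tests instead, and the small inline tables are purely illustrative.

```python
# Minimal sketch: primary-key, foreign-key, and not-null checks on loaded tables.
# Table and column names follow the example in the text (customers, orders, customer_id).
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 4]})

# Primary key uniqueness: customer_id must not repeat.
duplicate_keys = customers[customers["customer_id"].duplicated(keep=False)]

# Referential integrity: every order must point at an existing customer.
orphaned_orders = orders[~orders["customer_id"].isin(customers["customer_id"])]

# Not-null constraint on the key column.
null_keys = customers[customers["customer_id"].isna()]

for name, violations in [("duplicate primary keys", duplicate_keys),
                         ("orphaned foreign keys", orphaned_orders),
                         ("null primary keys", null_keys)]:
    if not violations.empty:
        print(f"Integrity violation ({name}):\n{violations}")
```

Running this flags the order referencing customer_id 4, which has no matching customer row.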
Finally, business logic validation ensures data aligns with domain-specific rules. This might involve reconciling totals (e.g., the sum of individual transactions in the source equals the target’s aggregated daily sales) or verifying derived fields (e.g., a "discount_amount" column correctly reflects 10% off the original price). Tools like dbt (data build tool) allow developers to embed these checks as SQL assertions in the transformation layer. For critical datasets, a sample-based reconciliation process, in which a subset of source and target records is compared field by field, provides additional confidence. Orchestration and monitoring tools such as Airflow, or custom scripts, log errors during ETL runs so that issues are auditable and addressed in subsequent runs. Combining these methods creates a robust safety net for maintaining data trustworthiness.
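To make the reconciliation idea concrete, here is a hedged sketch that checks an aggregated daily total against source transactions and verifies a derived discount field. The frames, column names, and the 10% rule mirror the examples above and are assumptions for illustration; in practice the same logic would typically live as SQL assertions (for example, dbt tests) in the transformation layer.

```python
# Minimal sketch: business-rule reconciliation after the load.
# Data, column names, and the 10% discount rule are hypothetical examples.
import pandas as pd

source_txns = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "sale_date": ["2024-01-01"] * 3,
    "original_price": [100.0, 50.0, 20.0],
})
target_daily = pd.DataFrame({
    "sale_date": ["2024-01-01"],
    "total_sales": [170.0],
    "discount_amount": [17.0],   # expected to be 10% of total_sales
})

# 1. Reconcile totals: sum of individual source transactions vs. aggregated daily sales.
source_total = source_txns.groupby("sale_date")["original_price"].sum()
merged = target_daily.set_index("sale_date").join(source_total.rename("source_total"))
assert (merged["total_sales"] - merged["source_total"]).abs().max() < 0.01, \
    "Daily totals do not reconcile with source transactions"

# 2. Verify a derived field: discount_amount should be 10% of total_sales.
expected_discount = merged["total_sales"] * 0.10
assert (merged["discount_amount"] - expected_discount).abs().max() < 0.01, \
    "Derived discount_amount violates the 10% business rule"

print("Business-rule reconciliation passed")
```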