Automating data quality testing in CI with Datafold

Traditional data testing techniques are insufficient for ensuring data integrity. They require significant setup time and manual effort, and they often miss edge cases: the very scenarios that need to be caught to prevent unexpected issues.

It doesn’t need to be so complicated. I’ll discuss the whys and hows of different data testing paradigms, and show that if you’re git diffing your code to catch unexpected changes in your codebase, you should be data diffing it to catch unexpected changes in your data too.

What is data testing?

Data testing involves verifying the accuracy, consistency, and validity of data throughout its lifecycle, but most critically before pushing anything to production. It’s like unit testing but for data. 

Shifting data left

Shifting data testing to the left means integrating testing early in the development process. This proactive approach allows developers to catch and fix issues before they reach production, reducing the risk of data quality incidents. This seems obvious, but it’s a fairly new paradigm to data engineering workflows, and is one borrowed from our software engineering colleagues. 

Early detection of data anomalies, schema mismatches, and integration issues ensures that problems are resolved when they are easier and less expensive to fix. In short: you can stop breaking your data.

Why is data testing important?

Data engineers can’t build without breaking things along the way. The problem is not that things break; it’s that breakages need to be caught early. When we implement a change, we have a specific goal or product requirement in mind, such as improving performance, adding a feature, or fixing a broken pipeline.

We need to validate that our changes are functioning as expected and don’t cause downstream issues such as broken dashboards or unexpected data and metric changes. This includes making sure any modifications adhere to the business logic and don’t introduce unexpected behavior.

And unexpected behavior can wreak disproportionate levels of havoc. Accurate data is essential for informed decision-making, and any errors can lead to significant business impacts, such as incorrect financial reports or flawed analytics. By identifying and addressing data issues early, data testing helps maintain the integrity of data pipelines and prevents costly downstream errors. 

This is particularly critical in AI modeling work, where data inaccuracies lead to flawed predictions, biased outcomes, and less effective models, ultimately affecting business and strategic operations.

Why is data testing hard? 

Data testing is hard because it involves ensuring the accuracy and consistency of data across complex, interdependent systems that handle large volumes and diverse types of data. Code changes can have unpredictable impacts on data, and traditional tests often miss edge cases and downstream effects, leading to potential issues being discovered only after deployment.

Codebases grow

Our codebases are constantly growing in size and complexity. As new features are added and old ones are updated or refactored, the interdependencies between different parts of the code increase, making it more challenging to ensure that changes do not introduce bugs or data quality issues.

Code diffs don’t show the impact

Understanding how code changes impact the data is tough. Changes might seem innocuous but can lead to unexpected side effects in data transformations, aggregations, or integrations with other systems.

You can’t simply look at a code diff and predict all the ways it might affect the data. But why would you stare at two different cookie recipes and try to understand how the cookies would be different when you could have a proper bake off?

Let’s turn to a simple example to understand what we mean (about code diffs, not bake offs). Let’s say that a developer makes a small change to the order processing logic to include a new field, order_type, which categorizes orders as either "online" or "in-store". Here’s what you’ll see in the pull request’s git diff:


- SELECT order_id, customer_id, order_date, amount
+ SELECT order_id, customer_id, order_date, amount, order_type
  FROM orders

A number of problems could result:

  • There might be aggregations that calculate total sales separately for "online" and "in-store" orders. If these aggregations are not updated to handle the new order_type field correctly, it could lead to incorrect totals.
  • The addition of the order_type field may affect schema validation or cause schema mismatches in downstream systems expecting the previous structure. Any integration that relies on the old schema will break if it doesn’t handle the new field correctly.
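To see how quickly the second problem bites, here is a minimal sketch (using sqlite3 purely for illustration; the table names are hypothetical) of a downstream load built with SELECT * breaking the moment the upstream model gains the new column:

```python
import sqlite3

# Toy illustration, not Datafold code: a downstream table loaded with
# SELECT * breaks when the upstream model gains an order_type column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_model (order_id INT, customer_id INT, order_date TEXT, amount REAL)")
conn.execute("INSERT INTO orders_model VALUES (1, 10, '2024-01-01', 25.0)")

# Downstream consumer created against the old four-column schema.
conn.execute("CREATE TABLE orders_history (order_id INT, customer_id INT, order_date TEXT, amount REAL)")
conn.execute("INSERT INTO orders_history SELECT * FROM orders_model")  # works today

# The PR adds order_type upstream...
conn.execute("ALTER TABLE orders_model ADD COLUMN order_type TEXT DEFAULT 'online'")

# ...and the same downstream load now fails: column counts no longer match.
try:
    conn.execute("INSERT INTO orders_history SELECT * FROM orders_model")
except sqlite3.OperationalError as e:
    print("downstream load failed:", e)
```

Nothing in the pull request’s code diff hints at this failure; it only appears when the data itself is built and compared.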

Too many unknown unknowns

Data tests and unit tests, like those used in dbt, are helpful, but they cover only a small percentage of all possible fields and potential failure scenarios. They typically focus on key metrics and can miss edge cases. In practice, the data team scrambles to add tests for an edge case only after an issue is discovered in production, usually when an unexpected data anomaly or business rule violation causes downstream processes to fail or produce inaccurate reports.

These tests also do not fully account for the downstream impact of changes, meaning that issues can still arise in later stages of the data pipeline or in the final analytics and reporting layers. For example, if the schema changes, such as when a new column is introduced, the current tests will still pass. But if downstream models rely on the presence (or absence) of this new field, the pipeline can fail without being flagged by existing tests.

An effective and scalable approach to testing: Datafold

These three challenges have long been resistant to new tools and workflows designed to patch over gaps in data testing. This is because you fundamentally want to know whether a code change triggered a data change, and if so, when and where a specific value changed.

This is known in Datafold as a data diff. For every code change, Datafold shows two things:

  1. How the data changes in every column, at the distribution and value level: Our Data Diffs allow developers to see exactly how a code change impacts the data at a granular level. They show changes in individual values for every column, which helps identify unexpected changes and ensures that the data remains accurate and consistent.
  2. What downstream assets are impacted: Datafold tracks the impact of changes across the entire data pipeline. It highlights which downstream assets, such as data warehouse tables (including dbt models) and BI assets are affected by a code change before anything is negatively impacted. 
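In spirit, a value-level diff joins the two versions of a table on its primary key and reports every value that changed, every row added, and every row removed. A toy sketch of the idea (illustrative only, not Datafold’s implementation):

```python
# Toy value-level diff: compare prod and staging rows keyed on a primary
# key and report every column whose value changed (not Datafold's code).
prod = {
    1: {"amount": 25.0, "status": "paid"},
    2: {"amount": 40.0, "status": "paid"},
}
staging = {
    1: {"amount": 25.0, "status": "paid"},
    2: {"amount": 40.0, "status": "refunded"},  # value changed by the PR
    3: {"amount": 10.0, "status": "paid"},      # row added by the PR
}

def data_diff(prod, staging):
    changes = []
    for key in prod.keys() | staging.keys():
        if key not in prod:
            changes.append((key, "added", None, staging[key]))
        elif key not in staging:
            changes.append((key, "removed", prod[key], None))
        else:
            for col in prod[key]:
                if prod[key][col] != staging[key][col]:
                    changes.append((key, col, prod[key][col], staging[key][col]))
    return changes

for change in data_diff(prod, staging):
    print(change)
```

Even this toy version surfaces what the git diff cannot: which rows and which values a code change actually touched.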

The best part is that this all runs on an automated basis in CI. Your data team can ensure that every code change undergoes complete data testing without any manual work. Being able to automate your tests is key to a complete data quality strategy: it ensures that data quality checks are consistently applied, reducing the risk of errors and freeing up developers to focus on building features rather than managing tests. 

How proactive data quality testing in Datafold works

Our platform relies on two key technologies: data diff and column-level lineage. These work hand-in-hand to provide a comprehensive view of how code changes impact data across the entire pipeline. 

Data Diff

Data diff compares data between production and staging environments. If you’re unfamiliar with these terms:

  • Production data: Built off the main code branch, representing the current live state.
  • Staging/dev data: Built off the developer branch, reflecting the latest code changes.

Hence, by comparing the two datasets, Datafold provides insight into how the data will be affected by a code change before it is deployed.

Value-level data diffs in Datafold

Column-Level Lineage (CLL)

Datafold then builds on information mined from data warehouse execution logs and BI metadata to map out how data in one column affects other columns and tables throughout the system, no matter how big or small your analytics infrastructure. With CLL, you can understand how these dependencies impact one another.

At a glance, you can see the downstream impact of any code changes, including how they might affect data warehouse tables, dbt models, and BI assets:

Column-level lineage in Datafold
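To make the idea concrete, column-level lineage can be thought of as a directed graph from each column to the columns derived from it. This toy sketch (with hypothetical table and column names, not Datafold’s implementation) walks such a graph to list everything downstream of a changed column:

```python
from collections import deque

# Hypothetical lineage graph: each column maps to the downstream columns
# derived from it, as one might mine from warehouse query logs.
lineage = {
    "orders.amount": ["stg_orders.amount", "fct_sales.revenue"],
    "stg_orders.amount": ["fct_sales.revenue"],
    "fct_sales.revenue": ["dashboard.total_revenue"],
    "orders.order_type": [],
}

def downstream_impact(column):
    # Breadth-first walk collecting every column reachable from `column`.
    seen, queue = set(), deque([column])
    while queue:
        for nxt in lineage.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)

print(downstream_impact("orders.amount"))
# → ['dashboard.total_revenue', 'fct_sales.revenue', 'stg_sales.amount' is not present;
#   the actual result is ['dashboard.total_revenue', 'fct_sales.revenue', 'stg_orders.amount']
```

A change to orders.amount thus flags the staging model, the fact table, and the dashboard metric built on top of it.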

Implementing Datafold in your CI

Adding Datafold to your CI process takes just three steps. 

  1. Connect Datafold to your data warehouse

It’s just like connecting a BI tool to your data warehouse. Once you’ve connected to your data source, Datafold can then access your schemas for comparison. 

  2. Install Datafold’s app in your git repo of choice (e.g. GitHub)

Then, you’ll need to integrate Datafold with your version control system like GitHub. This allows Datafold to see your code changes and run data tests automatically whenever changes are made, thus seamlessly integrating with your development and CI processes. 

  3. Add automation!

Datafold works with any orchestrator for your CI pipeline. You can pick between our no-code or programmatic implementation:

  • No-Code: You can build CI runs within the Datafold platform itself, and the data diff results are automatically pushed into your version control system (e.g. GitHub, GitLab, BitBucket, Azure DevOps). 
  • Programmatic: Datafold also provides an SDK that can be integrated with any CI runner. This allows for more flexibility and control over how Datafold fits into your CI pipeline, enabling you to tailor the integration to your specific needs and workflows.
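As a rough illustration of the programmatic route, a CI step typically runs the diff and returns a non-zero exit code to block the merge when unexpected changes are found. Everything below, including run_data_diff and its result shape, is invented for this sketch and is not the Datafold SDK’s real API:

```python
# Hypothetical CI gate sketch (invented names, not the Datafold SDK).
def run_data_diff(prod_table, staging_table):
    # A real pipeline would invoke the diffing tool here; we stub a
    # result so the control flow is runnable.
    return {"rows_compared": 1000, "values_changed": 2}

def ci_gate(result, allow_changes=False):
    # A non-zero return, passed to sys.exit in a real job, fails the
    # CI check and blocks the merge until the diff is reviewed.
    if result["values_changed"] and not allow_changes:
        print(f"{result['values_changed']} value(s) changed; failing the check")
        return 1
    return 0

result = run_data_diff("prod.orders", "staging.orders")
print("exit code:", ci_gate(result))
```

The key design point is simply that the diff result feeds the job’s exit status, so reviewers see changed values before anything merges.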

For dbt users, you can integrate with Datafold through:

  • dbt Core: Add the Datafold SDK to your dbt Core CI pipeline. 
  • dbt Cloud: Connect directly to Datafold via our App’s guided integration. 

What if I don’t have CI yet?

That’s ok! We can work with you to get your very first CI pipeline up and running before integrating Datafold. There are lots of ways to learn: check out our tutorial on how to build a CI pipeline for your dbt Core project, or download our CI whitepaper below:

Data quality + CI takes you further, faster

We’re big believers in a proactive approach to data quality; whether or not you end up using our platform, this should set the high bar that every data team measures its successes, and failures, against.

If you want an easy plug-and-play solution that hooks into your CI setup, these are the four benefits you can expect.

Preventing data quality incidents

By automating comprehensive data quality tests with Datafold, you can catch discrepancies and anomalies before they reach production. That means zero data incidents, which sounds unheard of, but the data scientists at Petal have shown what happens when you take ownership of data quality during development.

Increasing everyone’s confidence in data

When developers, analysts, and stakeholders can trust that the data they are working with is accurate and reliable, it leads to better decision-making and more effective use of data assets. This confidence is built through consistent validation and transparency, as Datafold highlights any changes and their impacts, ensuring that everyone has a clear understanding of the data's state. And Datafold makes it easy for non-technical staff to participate in the process too: Rocket Money’s accountants were key stakeholders in their new data quality process, which helped them pass their first SOX audit.

Accelerating team velocity

With full visibility into every change, developers spend less time manually validating their work, PR reviewers can quickly understand the implications of changes, and teams can avoid lengthy debates about what has changed and why. Both small updates and complex refactorings become easier to manage, facilitating faster and more agile development cycles. If this sounds a little abstract–how exactly does data quality change developer productivity–consider how FINN was able to scale their headcount by 10x during a period of rapid growth while maintaining data integrity and focus. 

Democratizing data development

With automated and standardized testing, no one has to guess what kinds of testing are necessary or how testing should be carried out. Anyone on the team, regardless of their role or level of expertise, can contribute to the codebase with confidence as their changes are immediately visible and their impacts are clearly understood. No more fears about silent regressions or unintended consequences. This has ripple effects across the company–Patreon’s stakeholders were empowered to consult metrics in Datafold and have their questions answered directly. 

Conclusion

While there are multiple testing techniques including data tests and unit tests, those require significant time investments to achieve code coverage and are always a moving target. Datafold is the only solution that allows you to go from 0 to 100% test coverage in a day.

To get started with data quality testing in CI with Datafold:

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
