The problem of maintaining data quality has persisted throughout the history of analytics work. It remains a hard problem to solve, and I speak from personal experience.
It all started during my time as an on-call data engineer at Lyft. One memorable night, a PagerDuty alarm woke me at 4am to respond to a failed Airflow job. I quickly realized that to get the job running again, I needed to ship a SQL hotfix that would filter some odd data out of our core table, the one that described every ride.
I changed two lines of SQL, did some ad-hoc testing, got a thumbs-up on my PR from a fellow data engineer, and watched the Airflow pipeline go green. I closed my laptop and went back to sleep, not realizing that my change had corrupted the central model and, over the following hours, would cause hundreds of tables, dashboards, reports, and ML models to break or show incorrect numbers.
The worst part was that it took a team of senior data engineers, myself included, six hours in a war room to finally trace the anomalies we were seeing back to the two-line hotfix I had shipped the previous night. It wasn't obvious at all.
Reflecting on this painful experience, I've come to realize how easy it is to break data and impact the business—even if you follow the process—and how hard it is to diagnose data quality issues in production. There should be a better solution for data quality, and any changes to data should be validated before code is deployed to production.
This is how the idea of data diff came about: comparing data in production with data in development or staging to understand how a code change will impact the data before it is deployed.
Looking back, the solution seems obvious, but at the time there weren't any tools available to enable proactive data testing. We were stuck with three bad choices, each with its own set of problems:
Assertion tests: These are similar to dbt tests, such as primary key uniqueness checks. They test for your “known unknowns”: things you know must be true about the data.
Custom SQL tests: You could write hundreds of lines of SQL to cover all your edge cases and make sure your dev and prod tables match, but then you would have to repeat that effort for every table and maintain all of that SQL over time (a sketch of what this looks like follows this list).
Just ship it: Problematic for obvious reasons. This was more common in the past, when your code changes didn't necessarily affect business-critical services like production ML models.
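To make the second option concrete, here is a minimal sketch of the kind of hand-written comparison query a custom SQL test requires. The schemas, table, and column names (analytics_dev.rides, analytics_prod.rides, ride_id, fare_usd) are hypothetical, not an actual schema from Lyft or Datafold.

```sql
-- Hypothetical hand-written regression test comparing a dev build of a
-- rides model against production.

-- 1. Do the row counts even match?
SELECT
    (SELECT COUNT(*) FROM analytics_dev.rides)  AS dev_row_count,
    (SELECT COUNT(*) FROM analytics_prod.rides) AS prod_row_count;

-- 2. For each column, count rows where dev and prod disagree.
--    A thorough test suite needs one block like this per column, per table.
SELECT COUNT(*) AS mismatched_fare_rows
FROM analytics_dev.rides AS dev
JOIN analytics_prod.rides AS prod USING (ride_id)
WHERE dev.fare_usd IS DISTINCT FROM prod.fare_usd;
```

Multiply that second query by dozens of columns and hundreds of tables, and the maintenance burden becomes obvious.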
Back in 2020, I co-founded Datafold to tackle this problem head-on: to give data engineers exactly what they need to fortify their data pipelines against anything that could hurt the business. I wanted to provide a regression testing solution that hooks easily into your existing data setup. (We launched on HackerNews shortly after.) Since then, we have built many more features, including impact analysis with column-level lineage and cross-database diffing for validating migrations and source-target replication.
What we built requires complex algorithms and the features continue to evolve to better serve a range of use cases. But the core philosophy behind it is incredibly simple and has not changed. It relies on an existing process that many of us in data are already familiar with.
Imagine you’re a data engineer shipping some code changes that will affect several dashboards in the company. You create a branch, make some code changes, open a pull request, and ask a colleague to review it. They click on “Files changed” in GitHub and the green-highlighted code shows what you've added, while the red-highlighted code shows deletions. We’ve got our process for reviewing software changes down. But what about data changes?
Well, a proactive data quality approach should follow the same process, and that's exactly what we built with Datafold. This time, you're a data engineer modifying some dbt code, and you're not sure about the downstream data changes, both intended and unintended. You want to see what the modified data looks like before you merge any code to production. So you create a branch, make some code changes, open a pull request, and we give you data diff summary statistics (e.g. the percentage of differing values per column) and value-level data diffs that you and your reviewers can use to approve the PR or request adjustments. This ensures that every time you merge a PR, you take a step toward fortifying your data pipelines instead of exposing the business to mistakes, or worse, adversaries on the horizon.
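To illustrate the kind of summary statistic a data diff surfaces, here is a rough conceptual sketch of the per-column comparison expressed as a query. This is not Datafold's implementation, just the idea; the schema and column names (analytics_prod.rides, analytics_dev.rides, ride_id, fare_usd, status) are hypothetical.

```sql
-- Rough sketch: percentage of rows whose value differs between the
-- production table and the PR's dev/staging build, per column, plus rows
-- that exist on only one side of the diff.
SELECT
    100.0 * AVG(CASE WHEN prod.fare_usd IS DISTINCT FROM dev.fare_usd THEN 1 ELSE 0 END) AS pct_diff_fare_usd,
    100.0 * AVG(CASE WHEN prod.status   IS DISTINCT FROM dev.status   THEN 1 ELSE 0 END) AS pct_diff_status,
    100.0 * AVG(CASE WHEN prod.ride_id IS NULL OR dev.ride_id IS NULL  THEN 1 ELSE 0 END) AS pct_rows_on_one_side_only
FROM analytics_prod.rides AS prod
FULL OUTER JOIN analytics_dev.rides AS dev USING (ride_id);
```

A value-level diff then drills into the specific rows behind each nonzero percentage, which is what you and your reviewers actually inspect in the PR.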
(This part is not meant to be a sales pitch, just an elaboration of how and why Datafold came to be.)
In the following sections of this guide, we'll walk through concretely what it means to implement proactive testing practices in your data pipelines, show where data diffing should be applied throughout your data workflows, and share some stories about data teams doing groundbreaking data quality work.
An aside: The state of data quality testing today
I'm surprised that, four years after my HackerNews post, the state of data quality still lags behind where leading companies should be operating. Proactive data testing remains the exception, even though it should be table stakes for a mature data team. But why haven't even leading teams adopted more sophisticated methods for data quality? Why has it been so hard to find a workable way to catch unintended changes to your data before you merge new code to production? And what does that workflow actually look like?
It's 2024, and we're doubling down on data quality as the foundational defense for everything data teams, and the businesses they serve, do. Just as a moat protects a castle from invaders of all kinds, robust data quality defends your business against errors, inaccuracies, and vulnerabilities in your data infrastructure.
We decided to write this guide as a framework: a standardized approach that any data engineer or analytics engineer can implement in their own business context.