Why create an open source extension of Data Diff?

In this episode of the Data Engineering Podcast, Gleb Mezhanskiy and Simon Eskildsen join host Tobias Macey to talk about Datafold’s open source data-diff tool.

They cover the backstory of Data Diff, its design decisions, the move to open source, and, most importantly, how it works in practice. Below is a summary of the podcast transcript:

Q: How did Data Diff come to be?

The consensus from both Gleb and Simon is that the data space has too many suboptimal, manual processes. One issue stuck out to Gleb a few years ago: how do you assess the impact of a dbt model update on all of the downstream data, dashboards, and notebooks? You have to implement a ton of assertion tests and do spot checks, and even then you are likely to miss unintended changes to the data. Datafold built Data Diff as an automated solution to the repetitive, frustrating task of regression testing data.

“One of the most painful things I’ve observed by being a data engineer is change management.” ~Gleb

Q: Why create an open source extension of Data Diff?

To assess the impact of business logic changes on downstream data, Data Diff needed a rich, opinionated UI and came with particular infrastructure and security requirements. But people want to diff data for more than regression testing: many teams in the data community want to use the tool for replication validation, verifying the accuracy of their data replication or migration efforts. In this use case, the workflow is machine-driven and automated, and the UI is unimportant.

Providing the core functionality of Data Diff to the community made sense because this use case demands a tool that is extensible and embeddable. By making part of Data Diff open source, Datafold let developers contribute connectors for new source systems. So open source data-diff was born! Additionally, since Datafold already had revenue from the original Data Diff use case, it didn’t face the difficult task of figuring out how to monetize the new open source offering.
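For a sense of what that machine-driven workflow can look like, here is a minimal sketch of a replication check using data-diff’s Python API as documented in the project README around the time of the episode. The connection strings, table names, and key columns are placeholders, and exact signatures may differ in later releases:

```python
# A minimal replication-validation sketch using the open source data-diff
# Python API (per the project README; the URIs, table names, and key
# columns below are placeholders, not real systems).
from data_diff import connect_to_table, diff_tables

# The source table and its replica: database URI, table name, key column.
source = connect_to_table("postgresql://user:pass@prod-host/app", "orders", "id")
replica = connect_to_table("snowflake://user:pass@account/db/schema", "orders", "id")

# diff_tables yields only the rows that differ: ('-', row) for rows missing
# or different in the second table, ('+', row) for the reverse direction.
diverging = list(diff_tables(source, replica))

for sign, row in diverging:
    print(sign, row)

if diverging:
    raise SystemExit("Replication check failed: tables have diverged.")
```

Because the output is just a stream of mismatched rows, a check like this drops naturally into a cron job or CI step with no UI involved.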

Q: What went into the design decisions?

Why use Python

A common knee-jerk reaction to an application built in Python is to ask, “Why did you pick a slow language for an application where speed is important?” Simon covers this in some depth, explaining that the main reason for choosing Python is that a tool specializing in data depends on attracting high-quality adopters and contributors. Because Python is ubiquitous in the data engineering space, building data-diff in Python lowers the learning curve for the community. The saving grace concerning speed is that most of the work is done in the data stores themselves, which makes Python’s speed much less of a consideration.

“Approachability from the data engineering community with Python is more important…” ~Simon

Why use md5

Another key point covered by Simon is why they decided to use md5 as the checksum algorithm. For data verification (as opposed to cryptographic applications), hash collisions, a known weakness of md5, are not really a problem to worry about. So the simple answer is that md5, in comparison to other hashing algorithms, is nearly ubiquitous across data stores. A future goal is a more sophisticated approach that could pinpoint changes down to a specific column or row.
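To make the checksum approach concrete, here is a simplified, local Python sketch of the bisection idea: compare cheap segment checksums first, and only descend into (and eventually download) segments whose checksums disagree. In the real tool the md5 aggregation runs inside each database as SQL, and segmentation parameters are configurable; everything below, including the threshold, is illustrative:

```python
import hashlib

def segment_checksum(rows):
    # md5 over a segment of rows; data-diff pushes this aggregation down
    # into the database as SQL rather than computing it client-side.
    h = hashlib.md5()
    for row in rows:
        h.update(repr(row).encode())
    return h.hexdigest()

def find_diffs(a, b, lo, hi, threshold=16):
    # Assumes both tables have the same row count, aligned by key order,
    # purely to keep the sketch short.
    if segment_checksum(a[lo:hi]) == segment_checksum(b[lo:hi]):
        return []  # identical segment: one cheap comparison, no rows moved
    if hi - lo <= threshold:
        # Small enough to fetch and compare row by row.
        return [(i, a[i], b[i]) for i in range(lo, hi) if a[i] != b[i]]
    mid = (lo + hi) // 2
    return find_diffs(a, b, lo, mid, threshold) + find_diffs(a, b, mid, hi, threshold)

# Usage: diffs = find_diffs(rows_a, rows_b, 0, len(rows_a))
```

This is why md5’s ubiquity matters more than its cryptographic strength: the same checksum expression has to be computable natively on both sides of the comparison.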

Q: What lessons were learned?

In terms of interesting challenges, both Gleb and Simon expressed that the challenges tend to be what you’d expect: overcoming manual processes, insufficient operationalization of data engineering, and plain old fear of the results. On a more technical level, Simon covers the interesting challenges of comparing data types consistently across data stores, noting differences in precision on DateTime types, BigInt versus SmallInt, Numeric versus Float, and others.
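A toy illustration of the precision problem: two databases can store the “same” timestamp at different fractional-second precisions, so values have to be rendered in a canonical form before they are checksummed, or identical data will produce mismatched hashes. The truncation rule below is invented for illustration and is not data-diff’s actual normalization logic:

```python
from datetime import datetime, timezone

def normalize_timestamp(ts: datetime, digits: int = 3) -> str:
    # Canonical string form used for hashing: UTC, with fractional seconds
    # truncated to a precision both databases can represent. (Illustrative
    # only; the real tool has per-database normalization rules.)
    ts = ts.astimezone(timezone.utc)
    base = ts.strftime("%Y-%m-%d %H:%M:%S")
    return f"{base}.{ts.microsecond:06d}"[: len(base) + 1 + digits]

# A timestamp(6) value and its datetime(3) replica of the same instant:
fine = datetime(2022, 6, 1, 12, 0, 0, 123456, tzinfo=timezone.utc)
coarse = datetime(2022, 6, 1, 12, 0, 0, 123000, tzinfo=timezone.utc)

# Raw values differ, but their canonical forms (and hence hashes) agree.
assert normalize_timestamp(fine) == normalize_timestamp(coarse)
```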

Q: When is open source data-diff not the right tool?

Simon describes some cases where open source data-diff would not be a good choice:

  • When you have many diverging records. Open source data-diff is best used to compare mostly similar tables with small numbers of mismatches.
  • When working with data stores that don’t have aggregation engines, such that the computation can’t be done in the datastore (with Kafka, for example). For these contexts, you would need another tool that sits on top to do the aggregation.
  • When you have very, very large tables, such as trillions of records or very wide tables. In these situations, open source data-diff will work, but it will likely take a long time.

In Conclusion

The podcast closed with Gleb and Simon reflecting on some of the biggest gaps in the data management technology available today. We still haven’t learned how to make data meaningful for automated processing; a person still has to write SQL to turn raw data into a business metric for decision-making. Open source data-diff seeks to be a step toward automating data processes, ultimately aiming to improve the quality of life of data engineers.

“What I’m most excited about is building tooling that dramatically improves the quality of life of data engineers and analytics engineers, because that will empower more effective use of data.” ~Gleb

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
