3 different ways to diff data
[Guest blog by Kenny Ning, Freelance Analytics Engineer, and Kira Furuichi, Product Marketing at Datafold]
Data diffing involves comparing two datasets to identify the impact of code changes on the datasets' shape and content. It's crucial for any data transformation process, whether the datasets are CSV files on a local computer, database tables, or tables across multiple databases.
In this post, weâll walk through different approaches to diff your datasets depending on what format they are in:
- Local files: Utilize git diff
- Tables in a database: dbt tests, Datafold Cloud's in-database data diffing
- Tables across databases: Datafold Cloud's cross-database data diffing
Yillow: a fictional real estate listings company
Imagine youâre a data engineer at a real estate listings company called Yillow. The company ingests raw listings data from various Multiple Listing Services (MLSs) and unifies it into a standardized schema so that potential homebuyers can easily browse listings in a slick web UI.
Itâs your first day and youâre pointed to an internal python library that normalizes addresses. Different MLSs have different ways they represent addresses, and this function takes in an MLS source and raw address and transforms it into a standard format:
You get word that the California MLS source is going to change their address delimiter to semicolons instead of commas. You adjust the ca_mls clause in the function to the following:
After you merge your changes, someone notices that some California listings are missing from the app. You notice that <span class="code">ca_addresses.csv</span> is getting piped to <span class="code">normalize_address</span> which in turn writes to an intermediate dataset called <span class="code">cleaned_listings.csv</span>
Local files: git diff
You take the following steps to debug:
- Check out main branch
- Run full pipeline
- Compare <span class="code">ca_addresses.csv</span> and <span class="code">cleaned_listings.csv</span> which should return an equal amount of rows
It looks like not all of the addresses in <span class="code">ca_addresses</span> were migrated to semi-colon delimiters after all. In the example above, the first three addresses still have comma delimiters and subsequently were dropped in the pipeline. You submit a small patch and shoot the California MLS an email. In this example, a local data diffing solution like <span class="code">git diff</span> helped us identify a regression introduced by the code refactor.
You can specify the <span class="code">--stat</span> option to get a summary instead:
If you canât assume your data is sorted, then youâll need to first sort your datasets like <span class="code">cat cleaned_listings.csv | sort -s</span> and then pass the output as arguments using process substitution <span class="code"><(commands)</span> syntax. Unfortunately <span class="code">git diff</span> doesnât work with process substitution so youâll have to use the more minimal <span class="code">diff</span> command like so:
The output is not as readable as <span class="code">git diff</span> but you can see how deletions are marked with a <span class="code">XXXdXXX</span> and changes are marked with a <span class="code">XXXcXXX</span>. Experiment with the myriad options for these tools directly from the command line by typing <span class="code">man git-diff</span> or <span class="code">man diff</span>. For example, try pairing vanilla <span class="code">diff</span> with the <span class="code">--suppress-common-lines</span> and <span class="code">--side-by-side</span> commands.
â
Right for you if:
- Your data is in a flat file format
- Your data fits in memory
Youâve outgrown if:
- Your data is stored in a database
- Your pipeline takes a long time to run, so the development workflow described above takes too long to iterate
Tables in a database: dbt test
Now letâs assume Yillow has made the migration from flat files to the â modern data stack â. Youâve traded in your Python scripts, which previously downloaded data from S3, for dbt jobs that run on BigQuery.
The California MLS source has decided that they donât like semicolons after all and want to go back to using commas as a delimiter. You adjust the dbt job to the following:
You make a pull request, but an error is thrown in your CI dbt cloud run. It looks like a test called <span class="code">test_normalize_address.sql</span> failed which has the following code:
You can then run this test using the dbt test
command:
As the dbt test output shows, tests fail if more than one result is returned, so the fact that this test failed means there were some addresses that got lost in the <span class="code">normalize_address</span> step.
The great thing about these kinds of SQL tests is that they are run directly on your data warehouse, so they scale well with data sizes. They can be run during development to test your local changes, but can also be run in a CI environment to catch errors before they reach production.
Right for you if:
- Your data is stored in a relational database
- You want an easy / fast way to test if two datasets are the same
Youâve outgrown if:
- You want to compare differences across multiple databases
- Youâre looking for richer data diff reports than just a binary "yes/no" result
- You want to compare data differences between two versions of a dataset (e.g., staging vs production)
Tables across databases: Datafold's cross-database diffing
A few months pass, and you find out that normalize_address
isnât the problem anymore. The scope of Yillowâs data team has expanded to much more than just managing the data warehouse. Now youâre responsible for the ingestion process as well; you have access to the raw California MLS data that sits in a Postgres instance and have written a pipeline to copy that data over into your BigQuery data warehouse.
You want to make sure that all the data from Oracle is making its way to BigQuery. Datafold Cloud has a slick UI for doing data diffsâvalue-level comparisons of tables and views in a database. Notably, Datafold allows you to do data comparisons on database objects across databases, something that none of the previous solutions can handle.
Let's take a look. Assuming you have a BigQuery project already set up and authenticated, first jump into the Datafold Cloud UI (or, use the Datafold API) and set-up a cross-database data diff:
Here, you can see Iâm running a comparison between an addresses
table that exists in both Postgres and BigQuery. You can additionally set column mappings for columns with slightly renamed versions, and egress limits and diff sampling for extremely large tables.
When the data diff is run, you can see summary and value-level differences that may exist between your Postgres and BigQuery tables, like in the image below.
Before Datafold, Iâd jump between SQL consoles of the source and target database, running a lot of select count(*) from table queries
and praying they lined up. If they didnât, Iâd then try ordering by timestamp for both tables and start checking row by row. Or I'd pull down the latest events for each that can fit into memory on my computer and run git diff. All this to say that having a diff tool that can compare datasets across databases is a huge time saver and a welcome addition to any data engineerâs development workflow.
Right for you if:
- You need to compare data across databases, especially if you need to compare data across databases during ongoing data replication
- You're seeking a comprehensive data-diffing solution with advanced features such as detailed visualizations for reporting
The bottom line
Hopefully youâre inspired by now to incorporate data diffing into your workflow. Start with the basics at your disposal like <span class="code">git diff</span> and <span class="code">dbt test</span>, but once youâre ready to take data quality seriously and are looking for a high quality platform, look no further than Datafold.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.