3 different ways to diff data

[Guest blog by Kenny Ning, Freelance Analytics Engineer]

Data diffing involves comparing two datasets to identify the impact of code changes on the datasets' shape and content. It's crucial for any data transformation process, whether the datasets are CSV files on a local computer, database tables, or tables across multiple databases.

In this post, we’ll walk through different approaches to diff your datasets depending on what format they are in:

  • Local files: Utilize git diff
  • Tables in a database: dbt test, Datafold’s data diff
  • Tables across databases: Datafold's open source data-diff

Yillow: a fictional real estate listings company

Imagine you’re a data engineer at a real estate listings company called Yillow. The company ingests raw listings data from various Multiple Listing Services (MLSs) and unifies it into a standardized schema so that potential homebuyers can easily browse listings in a slick web UI.

It’s your first day and you’re pointed to an internal Python library that normalizes addresses. Different MLSs represent addresses in different ways, and this function takes in an MLS source and a raw address and transforms it into a standard format:

def normalize_address(mls_source, address):
  """
  Example usage: normalize_address("ca_mls", "947 HIGH ST, BURLINGTON, NJ, 08016")
  Example output: "947 High St | Burlington | NJ | 08016"
  """
  # lots of crazy if statements
  return formatted_address

You get word that the California MLS source is going to change their address delimiter to semicolons instead of commas. You adjust the ca_mls clause in the function to the following:

def normalize_address(mls_source, address):
  ...
  # handle ca_mls format
  if mls_source == 'ca_mls':
    return ' | '.join(part.strip() for part in address.split(';'))

After you merge your changes, someone notices that some California listings are missing from the app. You notice that <span class="code">ca_addresses.csv</span> is getting piped to <span class="code">normalize_address</span>, which in turn writes to an intermediate dataset called <span class="code">cleaned_listings.csv</span>.
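The failure mode is easy to reproduce in miniature. The sketch below (hypothetical row values, simplified logic) shows how comma-delimited rows pass through the semicolon-only branch unchanged and then get dropped by a downstream step that expects pipe-delimited addresses:

```python
# Toy reproduction of the bug: the updated ca_mls branch only splits on
# semicolons, so comma-delimited rows pass through unchanged and are
# later filtered out by a downstream step expecting pipe delimiters.
def normalize_address(mls_source, address):
    if mls_source == 'ca_mls':
        return ' | '.join(part.strip() for part in address.split(';'))
    return address

raw_rows = [
    "993 POST ST, IRVINE, CALIFORNIA, 92612",   # still comma-delimited
    "975 POST ST; IRVINE; CALIFORNIA; 92612",   # migrated to semicolons
]

cleaned = [normalize_address('ca_mls', row) for row in raw_rows]
# A hypothetical downstream step keeps only rows that look normalized:
kept = [row for row in cleaned if ' | ' in row]
print(kept)  # only the semicolon-delimited row survives
```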

Local files: git diff

You take the following steps to debug:

  1. Check out main branch
  2. Run full pipeline
  3. Compare <span class="code">ca_addresses.csv</span> and <span class="code">cleaned_listings.csv</span>, which should contain an equal number of rows
> git diff ca_addresses.csv ~/cleaned_listings.csv
diff --git a/ca_addresses.csv b/cleaned_listings.csv
index 648ab5d..7b5653e 100644
--- a/ca_addresses.csv
+++ b/cleaned_listings.csv
@@ -7,23 +7,18 @@ addr
-993 POST ST, IRVINE, CALIFORNIA, 92612
-989 POST ST, IRVINE, CALIFORNIA, 92612
-985 POST ST, IRVINE, CALIFORNIA, 92612
-975 POST ST; IRVINE; CALIFORNIA; 92612
+975 POST ST | IRVINE | CALIFORNIA | 92612
...

It looks like not all of the addresses in <span class="code">ca_addresses</span> were migrated to semicolon delimiters after all. In the example above, the first three addresses still have comma delimiters and were subsequently dropped in the pipeline. You submit a small patch and shoot the California MLS an email. In this example, a local data diffing solution like <span class="code">git diff</span> helped us identify a regression introduced by the code refactor.

You can specify the <span class="code">--stat</span> option to get a summary instead:

> git diff ca_addresses.csv cleaned_listings.csv --stat

ca_addresses.csv => cleaned_listings.csv | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

If you can’t assume your data is sorted, you’ll need to sort each dataset first, e.g. <span class="code">sort -s cleaned_listings.csv</span>, and pass the output as arguments using process substitution <span class="code"><(command)</span> syntax. Unfortunately <span class="code">git diff</span> doesn’t work with process substitution, so you’ll have to use the more minimal <span class="code">diff</span> command like so:

> diff <(cat ca_addresses.csv | sort -s) <(cat cleaned_listings.csv | sort -s)
98681d98680
< 993 POST ST, IRVINE, CALIFORNIA, 92612
99002d99000
< 989 POST ST, IRVINE, CALIFORNIA, 92612
99439c99437
< 985 POST ST, IRVINE, CALIFORNIA, 92612
---
> 975 POST ST | IRVINE | CALIFORNIA | 92612

The output is not as readable as <span class="code">git diff</span>, but you can see how deletions are marked with <span class="code">XXXdXXX</span> and changes with <span class="code">XXXcXXX</span>. Experiment with the myriad options for these tools directly from the command line by typing <span class="code">man git-diff</span> or <span class="code">man diff</span>. For example, try pairing vanilla <span class="code">diff</span> with the <span class="code">--suppress-common-lines</span> and <span class="code">--side-by-side</span> options.
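If you'd rather script this comparison than remember diff flags, Python's standard-library difflib produces similar unified output. This is a sketch with the example rows inlined; in practice you'd read and sort the two CSV files:

```python
import difflib

# Inlined, pre-sorted stand-ins for the two CSV files from the example.
before = sorted([
    "993 POST ST, IRVINE, CALIFORNIA, 92612",
    "975 POST ST; IRVINE; CALIFORNIA; 92612",
])
after = sorted([
    "975 POST ST | IRVINE | CALIFORNIA | 92612",
])

# unified_diff yields ---/+++ headers, @@ hunk markers, and +/- lines,
# much like `git diff` does.
diff = list(difflib.unified_diff(
    before, after,
    fromfile="ca_addresses.csv",
    tofile="cleaned_listings.csv",
    lineterm="",
))
print("\n".join(diff))
```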


Right for you if:

  • Your data is in a flat file format
  • Your data fits in memory

You’ve outgrown if:

  • Your data is stored in a database
  • Your pipeline takes a long time to run, so the development workflow described above takes too long to iterate

Tables in a database: dbt test

Now let’s assume Yillow has made the migration from flat files to the ⭐ modern data stack ⭐. You’ve traded in your Python scripts, which previously downloaded data from S3, for dbt jobs that run on BigQuery.

The California MLS source has decided that they don’t like semicolons after all and want to go back to using commas as a delimiter. You adjust the dbt job to the following:

-- cleaned_listings.sql

with clean as (
    select
        id,
        upper(replace(address, ',', ' |')) as address
    from ca_addresses
)

select *
from clean
where address like '% | % | % | %'

You make a pull request, but your dbt Cloud CI run fails. It looks like a test called <span class="code">test_normalize_address.sql</span> failed, which has the following code:

-- test_normalize_address.sql

select id from ca_addresses
except distinct
select id from cleaned_listings

You can then run this test using dbt test:

$ dbt test

20:31:20  Running with dbt=1.1.0
20:31:20  Found 3 models, 1 test, 0 snapshots, 0 analyses, 191 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
20:31:20
20:31:20  Concurrency: 1 threads (target='dev')
20:31:20
20:31:20  1 of 1 START test test_normalize_address .................................................... [RUN]
20:31:22  1 of 1 FAIL 1 test_normalize_address ........................................................ [FAIL 1 in 1.98s]
20:31:22
20:31:22  Finished running 1 test in 2.35s.
20:31:22
20:31:22  Completed with 1 error and 0 warnings:
20:31:22
20:31:22  Failure in test test_normalize_address (tests/test_normalize_address.sql)
20:31:22    Got 1 result, configured to fail if != 0
20:31:22
20:31:22    compiled SQL at target/compiled/yillow/tests/test_normalize_address.sql
20:31:22
20:31:22  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1

As the dbt test output shows, a test fails if it returns any rows, so this failure means some addresses got lost in the <span class="code">normalize_address</span> step.
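The <span class="code">except distinct</span> test boils down to a set difference on ids: any id present in the source table but missing from the cleaned table is a dropped row. A toy illustration in Python (hypothetical ids, not the real tables):

```python
# Toy equivalent of the SQL test's EXCEPT DISTINCT: ids present in the
# source table but absent from the cleaned table indicate dropped rows.
source_ids = {51, 52, 53, 54}   # ids in ca_addresses (hypothetical)
cleaned_ids = {54}              # ids that survived into cleaned_listings

missing = source_ids - cleaned_ids
print(missing)  # a non-empty result is what makes the dbt test fail
```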

The great thing about these kinds of SQL tests is that they are run directly on your data warehouse, so they scale well with data sizes. They can be run during development to test your local changes, but can also be run in a CI environment to catch errors before they reach production.

Right for you if:

  • Your data is stored in a relational database
  • You want an easy / fast way to test if two datasets are the same

You’ve outgrown if:

  • You want to compare differences across multiple databases
  • You’re looking for richer data diff reports than just a binary “yes/no” result

One final note on diffing tables in a database: <span class="code">dbt test</span> is a very minimal, binary “yes/no” framework for data diffing. If you’re looking for richer data diff reporting, check out these other options:

  • dbt-audit-helper package: helper macros that can generate summaries of differences
  • Datafold’s data diff: rich, web-based, visual experience offering general table statistics as well as easy exploration of differences across schema, primary keys, and specific column values. This is the Rolls Royce of diff reports 🚗

Tables across databases: Datafold's data-diff tool

A few months pass, and you find out that <span class="code">normalize_address</span> isn’t the problem anymore. The scope of Yillow’s data team has expanded to much more than just managing the data warehouse. Now you’re responsible for the ingestion process as well; you have access to the raw California MLS data that sits in a MySQL instance and have written a pipeline to copy that data over into your BigQuery data warehouse.

You want to make sure that all the data from MySQL is making its way to BigQuery. Datafold has a slick UI for doing data diffs as well as a newly released open source version. Notably, this tool allows you to do data comparisons across databases, something that none of the above solutions can handle.

Let's try out the open source version. Assuming you have a BigQuery project already set up and authenticated, first install <span class="code">data-diff</span>:

pip install data-diff

Assuming your data is making its way into a BigQuery project named <span class="code">yillow_ingest</span> and a dataset named <span class="code">mls</span>, you can then use <span class="code">data-diff</span> as follows:

> data-diff \
    mysql://mysql:Password1@127.0.0.1:3306/mysql ca_addresses \
    bigquery://yillow_ingest/mls ca_addresses \
    --key-column=id \
    --columns=address

...
[18:59:30] INFO - . . Diffing segment 28/32, key-range: 526353..527329, size: 976
[18:59:30] INFO - . . Diffing segment 29/32, key-range: 527329..528305, size: 976
[18:59:30] INFO - . . Diffing segment 30/32, key-range: 528305..529281, size: 976
[18:59:30] INFO - . . Diffing segment 31/32, key-range: 529281..530257, size: 976
[18:59:30] INFO - . . Diffing segment 32/32, key-range: 530257..531251, size: 994
- (51, '993 POST ST, IRVINE, CALIFORNIA, 92612')
- (52, '989 POST ST, IRVINE, CALIFORNIA, 92612')
- (53, '985 POST ST, IRVINE, CALIFORNIA, 92612')
- (54, '975 POST ST; IRVINE; CALIFORNIA; 92612')
+ (54, '975 POST ST | IRVINE | CALIFORNIA | 92612')
[18:59:30] INFO - Duration: 0.70 seconds.

The <span class="code">--key-column</span> argument specifies the name of the primary key column. The <span class="code">--columns</span> argument tells data-diff to also compare values in the <span class="code">address</span> column. All other options are detailed on GitHub or by typing <span class="code">data-diff --help</span>. For example, you can get just the diff summary by adding the <span class="code">--stats</span> option:

> data-diff \
    mysql://mysql:Password1@127.0.0.1:3306/mysql ca_addresses \
    bigquery://yillow_ingest/mls ca_addresses \
    --key-column=id \
    --columns=address \
    --stats

Diff-Total: 5 changed rows out of 1000000
Diff-Percent: 0.0005%
Diff-Split: +1  -4
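Under the hood, <span class="code">data-diff</span> avoids pulling whole tables over the network: it checksums key-range segments on each side and only drills into segments whose checksums disagree. Here is a toy sketch of that divide-and-conquer idea (in-memory dicts stand in for the two databases; this is not the actual implementation):

```python
import hashlib

def checksum(rows):
    # Hash a segment's rows so both sides can be compared cheaply.
    h = hashlib.md5()
    for key, value in sorted(rows.items()):
        h.update(f"{key}:{value}".encode())
    return h.hexdigest()

def diff_segments(a, b, lo, hi, min_size=2):
    """Recursively compare key ranges [lo, hi); return differing rows."""
    seg_a = {k: v for k, v in a.items() if lo <= k < hi}
    seg_b = {k: v for k, v in b.items() if lo <= k < hi}
    if checksum(seg_a) == checksum(seg_b):
        return []                      # segment identical, skip it
    if hi - lo <= min_size:            # small enough: compare row by row
        keys = set(seg_a) | set(seg_b)
        return [(k, seg_a.get(k), seg_b.get(k))
                for k in sorted(keys) if seg_a.get(k) != seg_b.get(k)]
    mid = (lo + hi) // 2               # otherwise bisect and recurse
    return (diff_segments(a, b, lo, mid, min_size) +
            diff_segments(a, b, mid, hi, min_size))

# Hypothetical rows keyed by id on each side of the pipeline.
mysql_side = {51: "993 POST ST", 52: "989 POST ST", 53: "985 POST ST"}
bigquery_side = {51: "993 POST ST", 52: "989 POST ST"}
print(diff_segments(mysql_side, bigquery_side, 51, 55))
```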

Before Datafold, I’d jump between the SQL consoles of the source and target databases, running <span class="code">select count(*) from _table_</span> and praying the counts lined up. If they didn’t, I’d order both tables by timestamp and start checking row by row, or pull down as many of the latest events from each as could fit into memory on my computer and run <span class="code">git diff</span>. All this to say that having a diff tool that can compare datasets across databases is a huge time saver and a welcome addition to any data engineer’s development workflow.

Right for you if:

  • You need to compare data across databases.
  • You're seeking a comprehensive data-diffing solution with advanced features such as detailed visualizations for reporting.

The bottom line

Hopefully you’re inspired by now to incorporate data diffing into your workflow. Start with the basics at your disposal like <span class="code">git diff</span> and <span class="code">dbt test</span>, but once you’re ready to take data quality seriously and are looking for a high quality platform, look no further than Datafold.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
