What the heck is data diffing?!
If you’re new to Datafold, welcome! If you’re a Datafold and data diffing connoisseur, welcome back (this will be a nice refresher 😉).
We’re unpacking what you’ve probably seen us talk about countless times: data diffing—what it is, why we support it, and how it’s relevant to you on your data testing journey. In particular, we’ll cover:
- How we define data diffing
- Why data diffing is important
- Use cases for data diffing
- What data diffing in Datafold looks like
So, let’s get into it!
Defining data diffing
In the simplest terms, data diffing is the act of comparing two tables to identify which values have changed, stayed the same, been added, or removed between them. To compare it to something familiar, you can think of a data diff as a git code diff, but for the tables in your data warehouse. You’ll typically perform data diffs between the same table in different environments (development and production) or across data warehouses to confirm parity between two tables.
The output of a data diff from Datafold typically takes one of two forms: a high-level diff overview or a value-level diff.
Data diff overview
In a data diff overview, you receive a high-level summary of how table A (typically your development or staging environment) differs from table B (typically your production environment). Information in your diff overview will include:
- Schema differences (Are there more or fewer columns in one table than the other?)
- Primary key differences (Do tables A and B have the same or a different number of primary keys?)
- Row differences (Do tables A and B have different numbers of rows? Do any rows have different column values between tables A and B?)
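The overview checks above can be sketched in a few lines of Python. This is a toy illustration only, not how Datafold computes diffs (real tools push this work down to the warehouse); the table shapes and column names are hypothetical.

```python
# A minimal sketch of a diff "overview," assuming each table is a dict
# mapping primary key -> row (itself a dict of column -> value).
def diff_overview(table_a, table_b):
    cols_a = {c for row in table_a.values() for c in row}
    cols_b = {c for row in table_b.values() for c in row}
    pks_a, pks_b = set(table_a), set(table_b)
    shared = pks_a & pks_b
    return {
        "columns_only_in_a": sorted(cols_a - cols_b),   # schema differences
        "columns_only_in_b": sorted(cols_b - cols_a),
        "pks_only_in_a": sorted(pks_a - pks_b),          # primary key differences
        "pks_only_in_b": sorted(pks_b - pks_a),
        "rows_with_differences": sorted(                 # row differences
            pk for pk in shared if table_a[pk] != table_b[pk]
        ),
    }

prod = {1: {"org_id": 1, "sub_price": 99}, 2: {"org_id": 2, "sub_price": 49}}
dev  = {1: {"org_id": 1, "sub_price": 89}, 3: {"org_id": 3, "sub_price": 19}}
print(diff_overview(prod, dev))
```

Note that the overview only tells you *which* rows differ, not *how*; that's what value-level diffs (covered below) are for.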
In the image below, you can see the data diff overview for a dim_orgs model (between production and development versions of the model) using the Datafold VS Code Extension for dbt development.
Diff overviews are useful for getting a fast understanding of the important structural differences between two tables and ensuring primary keys match as expected. For dbt developers, you’ll often data diff between the dev and prod versions of a dbt model, so you can clearly understand how your code changes are impacting the underlying data. For teams undergoing data migrations or replication, this high-level diff result is useful for quickly validating parity between tables in different data warehouses.
When there are value-level diffs—meaning at least one row has a column whose value differs between the two tables—you can use value-level diffs to inspect exactly which rows and values changed.
In the image below, you can see the value-level differences between the dev and prod versions of a dim_orgs model. Value-level diffs show you, row by row, how column values differ between two tables. For example, in the scenario below, primary key 100000004 has differing values for the sub_price column between the prod and dev versions of the table.
With value-level data diffs, you can inspect individual rows to understand why a value has changed and identify the code change that led to that difference. It’s important to note that differences are usually expected with code changes! Data diffing is not a way to tell you whether a code change is good or bad (you often want your data to be different or changed!)—it’s up to you and your team to determine whether differences between tables are acceptable and expected, or whether they would become a data quality issue if merged into production.
For teams undergoing data migrations, value-level diffs are a useful way to understand if copied data is as expected in your new data warehouse.
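Conceptually, a value-level diff walks the primary keys shared by both tables and reports each column whose value differs. The sketch below mirrors the article's sub_price example with made-up numbers; the helper and table shapes are ours, not Datafold's implementation.

```python
# A sketch of a value-level diff: for each primary key present in both
# tables, report exactly which columns differ and what the values are.
def value_level_diff(table_a, table_b):
    diffs = []
    for pk in sorted(set(table_a) & set(table_b)):
        row_a, row_b = table_a[pk], table_b[pk]
        for col in sorted(set(row_a) | set(row_b)):
            a_val, b_val = row_a.get(col), row_b.get(col)
            if a_val != b_val:
                diffs.append((pk, col, a_val, b_val))
    return diffs

# Illustrative values only, echoing the article's sub_price scenario.
prod = {100000004: {"sub_price": 99.0, "sub_plan": "pro"}}
dev  = {100000004: {"sub_price": 79.0, "sub_plan": "pro"}}
for pk, col, prod_val, dev_val in value_level_diff(prod, dev):
    print(f"pk={pk} {col}: prod={prod_val} dev={dev_val}")
```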
When we hit the dbt test ceiling, we diff
At this point, you may be wondering: why would I need this level of granularity? Does my existing testing suite not cover data diff use cases? How do I know if I should be diffing my data?
We’ve seen developers and teams adopt data diffing for many reasons:
- dbt tests and packages don’t capture unexpected changes: dbt tests, unit tests, and other test packages are great for asserting expected behavior (e.g., ensuring primary keys are unique and not null, or that downstream tables maintain referential integrity), but they often fail to catch unexpected or unforeseen changes. For example, no dbt test currently exists that will alert you to a changed first_name value or a slightly different event_timestamp.
- Manual checks don’t scale with your data and team: I have personally written more ad hoc queries to check that my prod and dev versions of a table match my expectations than I would like to admit. These queries are fine for quick gut-checks, but don’t catch every scenario and don’t scale with your data and dbt project growth.
- Countless dbt test failures create a noisy data quality landscape: When you have hundreds or thousands of dbt models, each with tests of their own, it can be difficult to separate signal from noise in the test results. With data diffs, you can see both high-level and row-level differences to quickly distinguish the important from the pandemonium.
This is not an exhaustive list of why you may want to data diff, but provides an overview of why many teams are adding data diffing to their data quality testing practices.
Data diffing use cases
The primary use cases for data diffing are during dbt code development, deployment, data migrations, and data replication.
- dbt development: Data diffing on top of your foundational dbt tests allows you to develop dbt models with more speed and confidence. Know exactly how your code changes will impact the underlying data before opening up that PR.
- dbt deployment: To guarantee data quality issues don’t slip through, automate data diffs during your CI process. By adding a data diff check in CI, you and your team can see diff results and only allow approved diffs to enter your production environment.
- Data migrations: During database migrations, data diff across data warehouses to ensure legacy and new systems match as expected. Expedite migrations by diffing thousands of tables between systems instead of manually comparing legacy and new warehouses.
- Data replication: For teams that need to replicate data across databases or data warehouses, ensure parity by running data diffs between the two locations (e.g., Postgres to Snowflake for ingesting product usage data). Diff the replicated and original tables to be immediately aware of any differences between your two systems.
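The cross-system validation behind the migration and replication use cases is often done with checksums rather than row-by-row transfers: hash rows into per-segment digests on each side, compare the digests, and only fetch rows for segments that disagree. The sketch below shows that general idea with toy in-memory tables; the helper names are ours, and real tools compute the hashes inside each database.

```python
import hashlib

def row_digest(row):
    # Stable per-row fingerprint over sorted column/value pairs.
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode()).hexdigest()

def segment_digest(table, pks):
    # Combine row fingerprints for one segment of primary keys.
    h = hashlib.sha256()
    for pk in sorted(pks):
        h.update(row_digest(table[pk]).encode())
    return h.hexdigest()

def mismatched_segments(table_a, table_b, segment_size=2):
    # Compare per-segment digests; only mismatched segments need a
    # row-level follow-up, which keeps cross-warehouse traffic small.
    pks = sorted(set(table_a) & set(table_b))
    segments = [pks[i:i + segment_size] for i in range(0, len(pks), segment_size)]
    return [seg for seg in segments
            if segment_digest(table_a, seg) != segment_digest(table_b, seg)]

legacy = {1: {"v": "a"}, 2: {"v": "b"}, 3: {"v": "c"}, 4: {"v": "d"}}
new    = {1: {"v": "a"}, 2: {"v": "b"}, 3: {"v": "X"}, 4: {"v": "d"}}
print(mismatched_segments(legacy, new))  # only the segment containing pk 3
```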
Data diffing with Datafold
Have we convinced you that data diffing is a robust and intuitive way to test your data? If so, there are three primary ways to diff your data using Datafold:
- Open source data-diff for the CLI: For individual developers who are comfortable with the command line, leverage open source data-diff to get diff overview results between tables in your data warehouse.
- Datafold’s VS Code Extension: For dbt developers who use VS Code, use the free Datafold VS Code Extension to quickly run and diff dbt models between your prod and dev spaces, and view diff results in a clean GUI—all within your IDE. Also leverage the extension's Watch Mode for hot reloading of diffs upon dbt model changes. Users who use the Datafold VS Code Extension also receive free 30-day access to value-level diffs, a Datafold Cloud exclusive feature.
- Datafold Cloud: For teams that want to automate and scale data diffing, leverage Datafold Cloud’s robust diffing features and CI integration. Using Datafold Cloud, you can add diff results (both diff overviews and value-level diffs) as an automated comment on your PR, so you and your team can immediately see the potential impact of dbt code changes. In addition, Datafold Cloud’s diffing capabilities support thorough impact analysis, so you can clearly identify downstream dbt models, BI tools, or data apps that are impacted by your dbt code changes. With Datafold Cloud by your side, only the diffs you and your team approve get merged into production.
Interested in getting started with diffing? Check out the following resources:
- For individual developers, check out the Datafold VS Code Extension or open source data-diff installation guides
- For teams who want to see Datafold Cloud in action, see a live demo to learn more and get all your pressing questions answered
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.