audit_helper vs data-diff for validating dbt model changes

This blog post references the open source data-diff package; as of May 17, 2024, Datafold is no longer actively supporting or developing open source data-diff. For continued access to in-database and cross-database diffing, please check out our free trial of Datafold Cloud.
ā€
Weā€™re grateful to everyone who made contributions along the way. Please see our blog post for additional context on this decision.

When youā€™re updating a dbt model, itā€™s important to check not only that dbt models run and tests pass, but that unexpected changes in values donā€™t occur. A dbt test will often not catch such data impacts!

For example, what if a code change led to the strings in a 'first_name' column being duplicated, so that Leo became LeoLeo? Who is LeoLeo, anyway?

Thereā€™s not a test for that. You simply canā€™t create an accepted values test that contains all the right friends in all the right places.

To catch stuff like this, you need to run a data diff. In general, a data diff compares every value between two tables to search for differences. In the dbt context, that means comparing the values in production models to the values in development models.

Iā€™ll walk you through how to make the most of two fully open source and free options for running data diffs: dbt Labsā€™ audit_helper and Datafoldā€™s data-diff.Ā 

dbtā€™s audit_helper package

audit_helper is a dbt package containing several handy macros that can be used to compare values in dbt models across rows and columns. There is one macro in particular that fits this use case perfectly: the compare_all_columns macro, which compares every value of every column out-of-the-box.

In this example, Iā€™ll compare the values in my development version of 'dim_orgs' to the production version of the same table, after making a few code changes.

{{ 
  audit_helper.compare_all_columns(
    a_relation=api.Relation.create(database='DEMO', schema='CORE', identifier='dim_orgs'),
    b_relation=ref('dim_orgs'),
    primary_key='org_id',
    summarize=true
  ) 
}}

That table identifies important information that helps you see the differences between the two versions of the same table, with A representing production and B representing development:

  • PERFECT_MATCH: How many rows in a given column have the exact same value in both versions of the table.
  • CONFLICTING_VALUES: How many rows in a given column have differing values for the same primary key.
  • NULL_IN_: How many NULL values are in each column.
  • MISSING_FROM_: How many primary keys are missing from each version of the table. This will be the same number for every column.

In this example, there are three columns that have conflicting values for common primary keys:

  • NUM_USERS: 120 conflicting values
  • SUB_PLAN: 81 conflicting values
  • SUB_PRICE: 15 conflicting values

These are data differences that can be explained by your code changes!

To investigate these differences in more detail, you can simply reverse the 'summarize' flag.

{{ 
  audit_helper.compare_all_columns(
    a_relation=api.Relation.create(database='DEMO', schema='CORE', identifier='dim_orgs'),
    b_relation=ref('dim_orgs'),
    primary_key='org_id',
    summarize=false
  ) 
}}

Overwhelming! Yes, itā€™s not the most readable, but the boolean columns create a powerful structure for exploratory analysis. Letā€™s get oriented.

There is one row for every combination of primary key and column name. This makes it easy to summarize and count instances of an occurrence, and even join to a table in your warehouse to pull in values from columns.

Exploratory analysis of a specific columnā€™s data diff with audit_helper

Letā€™s take a look at SUB_PLAN, which has 81 conflicting values. We also see there are 104 NULLs in A, and zero NULL values B.

Is it the case that all of the 81 conflicting values are due to NULLS changing to NOT NULL? It seems possible, but itā€™s not absolutely clear from the table. For example, 37 of the NULL values could have become missing values, which would leave only 67 conflicting values that could be accounted for by NULLs changing to NOT NULLs.

Letā€™s check to see how many of the primary keys that are NULL_IN_A are also counted as CONFLICTING_VALUES. When a PK is counted as both NULL_IN_A and CONFLICTING_VALUES, it means that the value changed from NULL to NOT NULL.

with audit_helper_base as (
  {{ 
    audit_helper.compare_all_columns(
      a_relation=api.Relation.create(database='DEMO', schema='CORE', identifier='dim_orgs'),
      b_relation=ref('dim_orgs'),
      primary_key='org_id',
      summarize=false
    ) 
  }}
),
num_users_conflicting_values as (
  select 
    audit_helper_base.primary_key as org_id,
    audit_helper_base.null_in_a as null_in_a
  from audit_helper_base
  where audit_helper_base.conflicting_values and 
  audit_helper_base.column_name = 'SUB_PLAN'
)
select 
  null_in_a,
  count(*) as num_rows
from num_users_conflicting_values
group by 1

Check it out! All of the 81 primary keys that have CONFLICTING_VALUES are also NULL_IN_A.

Now, letā€™s see what SUB_PLAN value(s) these rows have in Table B. For this, weā€™ll actually join the table built by the `compare_all_columns` macro to the dev table.

with audit_helper_base as (
  {{ 
    audit_helper.compare_all_columns(
      a_relation=api.Relation.create(database='DEMO', schema='CORE', identifier='dim_orgs'),
      b_relation=ref('dim_orgs'),
      primary_key='org_id',
      summarize=false
    ) 
  }}
),
num_users_conflicting_values as (
  select 
    audit_helper_base.primary_key as org_id,
    audit_helper_base.null_in_a as null_in_a,
    dev.sub_plan
  from audit_helper_base
  inner join {{ ref('dim_orgs') }} dev 
    on audit_helper_base.primary_key = dev.org_id -- join the diff results to the dev table
  where audit_helper_base.conflicting_values and 
  audit_helper_base.column_name = 'SUB_PLAN'
)
select 
 sub_plan,
 count(*)
from num_users_conflicting_values
group by 1

Indeed, all 81 SUB_PLAN values that are NULL in production are now ā€œIndividual.ā€ Now, itā€™s up to you, the dbt developer, to decide whether 81 SUB_PLAN values changing from NULL to ā€œIndividualā€ is an acceptable data impact, or whether more updates to the code are needed.

One last thing. Where should you store all this code? For any analyses you want to save and reuse, I suggest creating a folder in your project called audit_helper where you store sql files containing tests that you then run in development. That way, you can execute these diffs early on in the dbt development process.

Diffing all modified models and downstreams with data-diff

As you can see, you can do a lot with audit_helper, but even in its simplest implementation, thereā€™s a fair amount of code management and manipulation you have to do, and itā€™s only set up to investigate one model at a time.

Datafoldā€™s open source data-diff + dbt integration offers a different set of tradeoffs. Itā€™s dead simple to get started, and it automatically picks up any models that were run in your most recent dbt run. You donā€™t have to create or store any files. With just a few simple configurations in your dbt project, you can get started.

This is how data-diff handles the exact same update to 'dim_orgs' that we just walked through. After youā€™ve edited your models, you can then run a dbt command selecting any number of models and add data-diff ā€“dbt to the command, just like this:

dbt run -m dim_orgs+ && data-diff --dbt

And thatā€™s it! This gives you almost instantaneous visibility into the columns that contain non-matching values between the two data sets, as well as any primary key differences that may exist.

Importantly, itā€™s effortless to run data diffs on not only the table(s) you modified, but any downstream tables, or any other tables that you select in your dbt run statement.

In its present form, Datafoldā€™s open source data-diff is limited in its ability to explore individual value differences out-of-the-box.

audit_helper for exploratory analysis vs data-diff for instantaneous diffing all relevant models

For a free, open source tool that allows deeper exploration of individual value differences, audit_helper is your best bet today.

For a quick diff of all modified and downstream models, data-diff is your best bet. The team at Datafold will soon release a free tier that lets users explore individual value-level differences in a rich visual interface.

What youā€™ll need to try out what Iā€™ve walked through on your own:

  • dbt-core so you can run dbt for free on your computer.
  • Datafold's data-diff to see a summary of how code changes impact your data.
  • dbt Labs' audit_helper to audit data differences and dive deep into analysis of specific columns. Be sure to install or upgrade to the latest version: 0.8.0.
  • VS Code with dbt Power User to run dbt models and analysis and see the results right in your text editor.
  • All of these tools are free and open source!

Please feel free to reach out if you have any questions or ideas as you run data diffs to be absolutely confident in the data differences between your development and production dbt environments.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.

Datafold is the fastest way to test dbt code changes