Automate dbt development testing in Snowflake with data-diff

This blog post references the open source data-diff package; as of May 17, 2024, Datafold is no longer actively supporting or developing open source data-diff. For continued access to in-database and cross-database diffing, please check out our free trial of Datafold Cloud.
‍
We’re grateful to everyone who made contributions along the way. Please see our blog post for additional context on this decision.

data-diff is an open-source Python CLI that allows you to efficiently compare two tables in or across relational databases, data warehouses and dbt models. The dbt integration brings data-diff table comparison to dbt models. After building your dbt models with 'dbt run' in development you can now run the 'data-diff --dbt' command to compare dbt models between environments. This will output a data diff report containing changes across dbt models, rows and columns. 

In this tutorial, I will use a demo dbt project to show how to automatically test dbt code changes in Snowflake with data-diff reports. First, I will show how to seed development and production environments with data. Alternatively, you can use your existing Snowflake environment. Then, I will set up the dbt project to work with data-diff. Finally, I will make changes to the SQL code and investigate the data impact with data-diff.

Example output from running data-diff --dbt

Prerequisites

If you’d like to follow along with the dbt project I’m using, start with step 1. If you’d like to bring your own dbt Snowflake project, jump to step 2.

Step 1: Set up a dbt project in Snowflake (optional)

If you don’t want to use an existing Snowflake dbt project you can clone Datafold’s demo dbt project. This will give you sample data and dbt models to work with.

git clone https://github.com/datafold/demo
cd demo

If you don’t have dbt-snowflake installed (you can check via 'dbt –version'), install it with pip.

pip install dbt-snowflake

Now you need to set up a profile in ~/.dbt/profiles.yml with two targets (dev and prod). I am using a Snowflake database named DEMO that my role has OWNERSHIP permissions on.

demo:
 target: dev
 outputs:
   dev:
    type: snowflake
    account: [your_account_name]
    user: [your_user_name]
    password: [your_password]
    role: [your_role]
    warehouse: [your_warehouse]
    database: DEMO
    schema: DEV_DAN
    threads: 4
   prod:
    type: snowflake
    account: [your_account_name]
    user: [your_user_name]
    password: [your_password]
    role: [your_role]
    warehouse: [your_warehouse]
    database: DEMO
    schema: PROD_DAN
    threads: 4

This configuration will allow you to run the dbt project against two dbt targets (dev and prod). Development tables will be created under the DEMO.DEV_DAN schema and production tables will be created under the DEMO.PROD_DAN schema.

Validate that the profile is working by running 'dbt debug'.

Then you need to build dbt models in development and production targets. Build development models with 'dbt build -t dev' and production models with 'dbt build -t prod'.

Step 2: Set up data-diff

Next, you need to install the data-diff CLI for Snowflake:

pip install ‘data-diff[snowflake]'

You need to tell data-diff where your production models will be deployed, this is done through vars in your dbt_project.yml file. Since I set my “prod” target to write models to DEMO.PROD_DAN, add these lines to your dbt_project.yaml. Check the docs if you are using dbt custom schemas.

vars:
  data_diff:
    prod_database: DEMO
    prod_schema: PROD_DAN

The final step to make sure models have columns tagged with ‘primary-key’. A primary key allows data-diff to know whether a row has been added, updated or removed. Behind the scenes data-diff does an outer join between the two tables based on the primary key.

This has already been completed in the demo repo, but note that this is one of the required configuration steps. We plan to detect primary keys automatically from unique column tests (see Github Issue).

Step 3: View data impact reports

When running 'data-diff --dbt' it first reads the run_results.json artifact to find all the models that were built in the last 'dbt run' command. This list of models is then looked for in the manifest.json file to construct the paths for dev and prod models. Then data-diff compares these models one by one to the models in the production environment.

First, let’s build a single model (dim_orgs) without doing any code changes. For example, this can be useful to verify that your dev environment matches the production environment, and all data differences will arise from code changes you are about to make. Depending on your setup you may already see data differences between dev and prod.

dbt run --select dim_orgs

data-diff --dbt

Note that you also can also use zero-copy cloning in Snowflake to clone your sources from production. Alternatively, to assure that there will be no data differences when you start coding your feature you can build the two environments locally (instead of comparing to production data) from the same source data.

Now, let’s change some SQL code to see what is the output of data-diff. Below I added a WHERE clause that will filter out some records from the dim_orgs model.

Rebuild your dbt models in your dev environment to make a comparison with prod.

dbt run --select dim_orgs

Now you can run 'data-diff --dbt' again, and see the impact of adding the where clause. 101 rows have been removed in dev, leaving the table with just 56 rows compared to prod.

data-diff --dbt

This change is not happening in isolation, this small change can have big impacts on models downstream. To see the downstream impact of this change, run this model again, but also include the models downstream of it.

dbt run --select dim_orgs+

And here you can see the dbt lineage graph.

Now when running 'data-diff --dbt', the diffs of the downstream tables will also be included, and you can see the overall impact of this small change.

Here you can see the same output from earlier, this is the direct impact of our change, with data-diff reports for all downstream models.

data-diff --dbt

Above you can see how this change has impacted a downstream fact table FCT_MONTHLY_FINANCIALS. This is an aggregate of the previous table where all of the rows have been updated. The SUM_REVENUE and CNT_SUBSCRIBERS has (presumably) decreased for every month. And that change also happens in the yearly roll-up FCT_YEARLY_FINANCIALS. Another downstream model SALES_SYNC has had all 4 rows deleted, that sounds a little important.

Interpreting what the data changes here mean ultimately depends on the context of the change, but let’s imagine two different scenarios:

  • I was previously over counting subscribers. I want to reduce the subscriber counts in the monthly and yearly models. Then the output of data-diff is intentional so I can validate the code changes.
  • I did not realize that financial models were built from dim_orgs and I only meant to update an internal business definition. Changes to finance numbers and subscriber counts are unintentional. I go back and update this definition to avoid impacting finance models.

Step 4: Investigate value changes

When you find that a dbt model has changed, in some cases you may need to explore the value-level differences across columns to know whether the change is intentional. While you can do this by running some manual SQL queries, we built Datafold Cloud UI to make it easy to explore data differences and collaborate with data consumers to validate changes.

Datafold Cloud users can run the 'data-diff --dbt --cloud' command (docs) to submit data diff computations to Datafold and view data differences on the app.

What’s next?

Now you should know what it takes to automatically test your dbt code changes during development for Snowflake. You can find the latest changes and features in our docs, including support for BigQuery, Redshift, Databricks and DuckDB.

We would love it if you try out the 'data-diff --dbt' integration and let us know how it went. If you find any issue or want to request a new feature you can open a Github issue. If you are curious about how this is done behind the scenes you can check the dbt integration code in Github. Contributions are welcome, our solution engineering team is here to help you get your contributions merged. If you have questions or need help getting setup, our team is active on the dbt Slack #tools-datafold channel.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.

Datafold is the fastest way to test dbt code changes