Automate dbt development testing in Snowflake with data-diff
This blog post references the open source data-diff package; as of May 17, 2024, Datafold is no longer actively supporting or developing open source data-diff. For continued access to in-database and cross-database diffing, please check out our free trial of Datafold Cloud.
â
Weâre grateful to everyone who made contributions along the way. Please see our blog post for additional context on this decision.
data-diff is an open-source Python CLI that allows you to efficiently compare two tables in or across relational databases, data warehouses and dbt models. The dbt integration brings data-diff table comparison to dbt models. After building your dbt models with 'dbt run' in development you can now run the 'data-diff --dbt' command to compare dbt models between environments. This will output a data diff report containing changes across dbt models, rows and columns.Â
In this tutorial, I will use a demo dbt project to show how to automatically test dbt code changes in Snowflake with data-diff reports. First, I will show how to seed development and production environments with data. Alternatively, you can use your existing Snowflake environment. Then, I will set up the dbt project to work with data-diff. Finally, I will make changes to the SQL code and investigate the data impact with data-diff.
Prerequisites
- A Snowflake account with user and password credentials, a role and warehouse.
- dbt-core installed.
If youâd like to follow along with the dbt project Iâm using, start with step 1. If youâd like to bring your own dbt Snowflake project, jump to step 2.
Step 1: Set up a dbt project in Snowflake (optional)
If you donât want to use an existing Snowflake dbt project you can clone Datafoldâs demo dbt project. This will give you sample data and dbt models to work with.
If you donât have dbt-snowflake installed (you can check via 'dbt âversion'), install it with pip.
Now you need to set up a profile in ~/.dbt/profiles.yml with two targets (dev and prod). I am using a Snowflake database named DEMO that my role has OWNERSHIP permissions on.
This configuration will allow you to run the dbt project against two dbt targets (dev and prod). Development tables will be created under the DEMO.DEV_DAN schema and production tables will be created under the DEMO.PROD_DAN schema.
Validate that the profile is working by running 'dbt debug'.
Then you need to build dbt models in development and production targets. Build development models with 'dbt build -t dev' and production models with 'dbt build -t prod'.
Step 2: Set up data-diff
Next, you need to install the data-diff CLI for Snowflake:
You need to tell data-diff where your production models will be deployed, this is done through vars in your dbt_project.yml file. Since I set my âprodâ target to write models to DEMO.PROD_DAN, add these lines to your dbt_project.yaml. Check the docs if you are using dbt custom schemas.
The final step to make sure models have columns tagged with âprimary-keyâ. A primary key allows data-diff to know whether a row has been added, updated or removed. Behind the scenes data-diff does an outer join between the two tables based on the primary key.
This has already been completed in the demo repo, but note that this is one of the required configuration steps. We plan to detect primary keys automatically from unique column tests (see Github Issue).
Step 3: View data impact reports
When running 'data-diff --dbt' it first reads the run_results.json artifact to find all the models that were built in the last 'dbt run' command. This list of models is then looked for in the manifest.json file to construct the paths for dev and prod models. Then data-diff compares these models one by one to the models in the production environment.
First, letâs build a single model (dim_orgs) without doing any code changes. For example, this can be useful to verify that your dev environment matches the production environment, and all data differences will arise from code changes you are about to make. Depending on your setup you may already see data differences between dev and prod.
Note that you also can also use zero-copy cloning in Snowflake to clone your sources from production. Alternatively, to assure that there will be no data differences when you start coding your feature you can build the two environments locally (instead of comparing to production data) from the same source data.
Now, letâs change some SQL code to see what is the output of data-diff. Below I added a WHERE clause that will filter out some records from the dim_orgs model.
Rebuild your dbt models in your dev environment to make a comparison with prod.
Now you can run 'data-diff --dbt' again, and see the impact of adding the where clause. 101 rows have been removed in dev, leaving the table with just 56 rows compared to prod.
This change is not happening in isolation, this small change can have big impacts on models downstream. To see the downstream impact of this change, run this model again, but also include the models downstream of it.
And here you can see the dbt lineage graph.
Now when running 'data-diff --dbt', the diffs of the downstream tables will also be included, and you can see the overall impact of this small change.
Here you can see the same output from earlier, this is the direct impact of our change, with data-diff reports for all downstream models.
Above you can see how this change has impacted a downstream fact table FCT_MONTHLY_FINANCIALS. This is an aggregate of the previous table where all of the rows have been updated. The SUM_REVENUE and CNT_SUBSCRIBERS has (presumably) decreased for every month. And that change also happens in the yearly roll-up FCT_YEARLY_FINANCIALS. Another downstream model SALES_SYNC has had all 4 rows deleted, that sounds a little important.
Interpreting what the data changes here mean ultimately depends on the context of the change, but letâs imagine two different scenarios:
- I was previously over counting subscribers. I want to reduce the subscriber counts in the monthly and yearly models. Then the output of data-diff is intentional so I can validate the code changes.
- I did not realize that financial models were built from dim_orgs and I only meant to update an internal business definition. Changes to finance numbers and subscriber counts are unintentional. I go back and update this definition to avoid impacting finance models.
Step 4: Investigate value changes
When you find that a dbt model has changed, in some cases you may need to explore the value-level differences across columns to know whether the change is intentional. While you can do this by running some manual SQL queries, we built Datafold Cloud UI to make it easy to explore data differences and collaborate with data consumers to validate changes.
Datafold Cloud users can run the 'data-diff --dbt --cloud' command (docs)Â to submit data diff computations to Datafold and view data differences on the app.
Whatâs next?
Now you should know what it takes to automatically test your dbt code changes during development for Snowflake. You can find the latest changes and features in our docs, including support for BigQuery, Redshift, Databricks and DuckDB.
We would love it if you try out the 'data-diff --dbt' integration and let us know how it went. If you find any issue or want to request a new feature you can open a Github issue. If you are curious about how this is done behind the scenes you can check the dbt integration code in Github. Contributions are welcome, our solution engineering team is here to help you get your contributions merged. If you have questions or need help getting setup, our team is active on the dbt Slack #tools-datafold channel.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.