7 dbt Testing Best Practices
- Start with Generic Tests
- Leverage helpers and extensions
- Model with the testing in mind
- Test data sources
- Integrate alerting
- Establish clear ownership of tests
- Create testing documentation
But before we dive in let's start with creating two tests in dbt and an example table we can apply tests to.
Creating a test in dbt
Testing in dbt uses SELECT statements in SQL queries to confirm that the data models, data sources, and other resources in a dbt project work as intended. These tests are designed to return rows or records that fail to meet the assertion criteria specified in the test.
An important concept of dbt tests is that these SELECT statements attempt to find failing records, records that would show a test to be incorrect. This is a slightly different concept from normal software testing, where certain conditions are asserted to be correct or true.
There are two main types of tests provided by dbt:
- Singular tests
- Generic tests
Singular tests are very focused, written as typical SQL statements, and stored in SQL files.
For example, we might write a singular test to ensure there are no NULL amount records in a typical fact table. Note we can include jinja in these singular dbt tests.
Generic tests are written and stored in YML files, with parameterized queries that can be used across different dbt models. This same singular test above could be written as a generic test.
There are several ways to implement testing with dbt depending on which data properties the tester is interested in. The dbt ecosystem helps in this regard through a variety of packages related to data testing that can be found on dbt hub.
Example - eCommerce
Let's pretend we are an Analytics Engineer at an ecommerce store selling widgets. A sales manager comes to us asking which widgets were sold when and for how much to determine revenue by product per month. We have our transaction data in a fact table and our product information in a dimension table but how do we know this data will stay reliable over time?
Let's take a closer look at how dbt testing best practices can detect many common data-related problems to help us calculate revenue accurately.
Downstream impact and analytics
One of the most common data issues solved by dbt testing best practices is incorrect data and analytics cascading downstream from a fact table, like in our example. It would be very common to have a Monthly Product Sales report consumed by multiple business units, using aggregated data points from this report to drive business decisions.
Such a Monthly Product Sales report might be written in SQL as follows.
The result of this SQL query displayed in the Monthly Product Sales report would look something like this:
Data-driven organizations need to protect and test for data bugs and issues that can cause incorrect downstream analytics and reporting.
In the following sections, we will discuss dbt testing best practices, and how in our example case, they can solve data problems causing downstream issues.
#1: Start with generic tests
It’s often tempting in data testing—just as in standard software-development testing—to start with the most complex cases and tests. Before long, test writers get mired in the minute details of these scenarios, and their test writing gets stalled. Instead, it’s best to start simple.
Create tests by starting with the most basic tests before moving to more advanced tests that deal with business logic. Begin by validating the structure and assumptions made of the data. If the basic assumptions of the data are wrong, this will affect the usefulness of any advanced tests built on top of these assumptions.
Let’s look at the following dbt test examples with our previously discussed fact table example.
Beginning with basic tests means beginning with generic tests, which are modular and reusable tests for ensuring data reliability and integrity. dbt has four generic tests (also known as “schema tests”) which come out of the box, and they can be deployed with minimal configuration:
1. not_null: A null value in data indicates that a value in a column is missing or unknown. Sometimes, null values are intentional and desired. Other times, they are an indication of a data issue. This kind of test identifies where null values exist.
It’s clear that our Monthly Product Sales report would be affected by NULL values in our or <span class="code">product_id</span> column. If the <span class="code">product_id</span> column contained NULL values, those sales would be thrown out of our analytics query.
2. unique: This kind of test is used to ensure that there are no duplicate values or recurring rows in a dataset. This is important because duplicate values in data columns like primary keys can lead to misleading data metrics, inaccurate visualizations, and unreliable reports.
In our example tables, our <span class="code">dim_products</span> table is particularly susceptible to having duplicate <span class="code">product_id</span> values, and this would have a serious impact on our analytics query. If duplicate values existed, this would inflate the numbers reported in our Monthly Product Sales because each row in our <span class="code">fact_transactions</span> table would be duplicated during the table join.
Without this example unique test for our <span class="code">dim_products</span> table, we could end up with duplicate <span class="code">product_id</span> records. For example, we might have a scenario like the following, in which <span class="code">product_id</span> 55 is reused in the table.
When our analytics query runs to populate the Monthly Product Sales report, the join between the <span class="code">dim_products</span> and the <span class="code">fact_transactions</span> tables now inflates our sales numbers by duplicating all records in <span class="code">fact_transactions</span> that have a <span class="code">product_id</span> of 55.
3. accepted_values: This kind of test verifies that all column values across all rows are in the set of valid values. For example, our <span class="code">product_category</span> column might only have two valid values (<span class="code">AB100, NB500</span>). If this test uncovers any values in the <span class="code">product_category</span> column which are not one of these two, it will fail.
4. relationships: This kind of test verifies referential integrity, ensuring all of the records in a child table have a corresponding record in the parent table. We could use this test to check that all <span class="code">product_id</span> values in our <span class="code">fact_transactions</span> table exist in the <span class="code">dim_products</span> table.
As an example, it would be a common data bug if new sales were being recorded in our <span class="code">fact_transactions</span> table with a <span class="code">product_id</span> that did not exist in our <span class="code">dim_products</span> table.
In the above example, we know that <span class="code">product_id</span> 01 does not exist in <span class="code">dim_products</span>. As a result, our Monthly Product Sales report will underreport our sales. This is why relational integrity checks are so important for catching those hard-to-detect problems that are introduced into a dataset.
Custom generic tests
When more complex logic or validation criteria is needed, data practitioners can write their own custom generic tests. A custom generic test is written with Jinja, a templating language combined with SQL.
When examining our table <span class="code">fact_transactions</span> and the Monthly Product Sales analytics along with the basic dbt testing we already put in place, there are still data issues and business rules that might not fit well into these simple tests.
For example, let’s say the business requires that we have no orders for over 500 units, as the company cannot fulfill such large orders with current inventory levels. This is the perfect use case for customer generic tests.
We could write the following custom test for verifying that none of the orders coming into our <span class="code">fact_transactions</span> table exceed the 500 unit order <span class="code">quantity</span>:
This custom generic test can now be used in a model:
When a testing scenario is unique to a single case—meaning it would not apply generically across several cases—the data practitioner can write a singular test. A singular test is based on the SQL statement which would return failing records. The dbt documentation provides the following example of a singular test, stored in a .sql file.
In our case, maybe we would like to determine if any orders have a negative order quantity:
#2: Leverage helpers and extensions
The dbt ecosystem is growing at a rapid pace. Because dbt is open source, its users can create packages that address specific needs within dbt. These packages can be found at the dbt Package hub. Examples of useful packages include dbt-expectations and dbt-utils.
The dbt-expectations extension package allows dbt users to deploy tests in the style of Great Expectations directly to their data warehouse. The package includes over 50 tests that are commonly used.
To install and use dbt-expectations, modify your dbt project’s packages.yaml to include the following. See the dbt-expectations GitHub README for more specifics.
With dbt-expectations built-in tests, data teams can take dbt testing to the next level, implementing and checking complex tests and business requirements.
Let’s return to our <span class="code">fact_transactions</span> and Monthly Product Sales report example. We recall that our report includes SQL that sums up and groups by products, by month.
Again, the results of this report would look as follows.
When the sales manager looks at the Monthly Product Sales report in a BI tool such as Looker, the negative monthly sales would jump out to them. Negative sales means something has gone seriously wrong. They might also add if sales for any particular widget were very large, lets say over 100,000 that might also indicate something has gone wrong.
These specific cases could be easily alerted with the dbt-expectations pre-built <span class="code">expect_column_sum_to_be_between</span> test.
Now, we can easily test if our monthly sales meet the criteria given to us by the business, catching any anomalies or errors.
The dbt-utils package was developed by dbt Labs, and it contains additional schema tests and utility methods, extending and enhancing dbt.
By using pre-built packages such as these and others, data practitioners can save time and stand on the shoulders of those who have gone before. These advanced testing capabilities—developed by stakeholders in the data industry—provide an added layer of functionality that data teams can leverage.
Similar to dbt-expectations, dbt-utils comes with a number of out-of-the-box tests we can use on our datasets. For example, we can check to ensure our <span class="code">fact_transactions</span> table doesn’t contain any negative quantity values that would negatively affect the Monthly Product Sales report.
Using dbt-utils <span class="code">expression_is_true</span> test could find any bad records like the one shown above.
#3: Model with the testing in mind
Create data models and tables with the data tests in mind. For example, tables that are related to one another should be designed with the mindset that relationship tests will be run to verify referential integrity.
Again, using dbt-expectations makes it very straightforward to create tests when creating the initial tables to hold our metrics, like monthly product sales.
Specific tests like <span class="code">expect_column_to_exist</span> could check to ensure table structure has not been changed and can hold the values we expect when our Monthly Product Sales report runs. For example, we would want to ensure the <span class="code">monthly_sales</span> column always exists in our table so the downstream analytics will always be available.
We can even test to ensure our <span class="code">fact_transactions</span> table regulates getting new records and that something has not gone wrong upstream. To do this, we would use tests like <span class="code">expect_row_values_to_have_recent_data</span> with dbt-expectations.
By beginning with a test in mind—right when we are creating new data tables, sources, and analytics—the technical details are fresh, and it’s easier to create broad data-testing coverage upfront.
#4: Test data sources
Beyond just testing models, strong data testing includes testing of sources too. Data sources can include CSV files, event data, streams, data from Facebook ads, and even another data warehouse. Testing for data integrity and accurate assumptions should be performed on data at the source level. Doing so helps catch inconsistencies early in the ETL process.
In advanced dbt projects, data sources are named and declared as tables in the warehouse.
Again, by using dbt-expectations, it’s easy to test that source data structures—such as additional database tables—have not changed since the last ingestion. Data source changes like the addition or removal of columns is a common problem in data pipelines.
Using <span class="code">expect_table_column_count_to_equal</span> is a great way to check for changes in source data.
#5: Integrate alerting
A strong testing practice includes an alerting component. If your test runs are automated—meaning they might be scheduled or triggered by certain events—then your data team may not always be aware that tests are running. When failures occur in automated tests, you need to know about them immediately.
In dbt, alerting can be used in the traditional sense: to alert the data team (or a specific person) when tests fail so they know when something is wrong with the code or data. Alert notifications can be delivered by email or via Slack.
Alerts can also be used in coordination with tests that validate data in the ETL process. For example, the data team can be alerted to the following issues:
- Incorrect or unexpected data is coming in from the data pipeline.
- There is a discrepancy in terms of data quality incoming from data sources.
If there is a difference in the data quality, alerts can be given different classifications. For example a “not urgent” alert can signify a data issue that is important but not needing immediate intervention. This type of classification nudges the data team to review the new data at an opportune time and understand where it has changed. Alert classifications help reduce the tendency to abandon current tasks and focus immediate attention on solving test issues irrespective of their severity.
Another option related to alerting is using dbt logs. By gathering and placing them in many existing logging technologies, you are able to aggregate and trigger certain log messages and tags. It’s also possible to enable structure logging with JSON, which of course provides more options for richly processing all types of logs and events.
#6: Establish who investigates test failures
Unclear expectations about triaging tests within a team will lead to failing tests being left ignored or broken code being left unfixed. When end users use broken data they will lose trust across all data. This severely damages a company’s ability to make informed decisions.
Some organizations segment ownership by scheduling, whereby each team member deals with any test failures that occur within a set window of time. This kind of approach spreads out responsibility and encourages all data practitioners to gain a better understanding of why tests fail. In addition, team members not currently scheduled to handle test failures can focus on their other tasks without the threat of being pulled away to deal with test issues.
In a large environment many different people may have created the original dbt tests. While the generic ones are fairly easy to evaluate it can be useful to label who created the test as an owner of it within dbt so that whoever is investigating the failure can reach out to them. While dbt doesn’t (yet) have an owner property that can be associated with models, providing additional context, such as a test author or owner, is possible through the use of key-value pairs in the meta config, like the following:
#7: Create testing documentation
Lastly, it is vital to create good documentation for your data tests so future stakeholders, such as data analysts or engineers can easily understand their purpose and extend them when appropriate. Documentation provides a clear template that others can utilize, and it helps your stakeholders to be autonomous. With documentation in hand, these stakeholders will not need to seek out the author of the tests to ask questions.
Documentation should provide the purpose of the test and why a test would fail. dbt can auto-generate documentation with information about your project and your data warehouse. You can add a description key in your test files to provide context or guidance regarding a test. For example:
Documentation increases development velocity because there are clear cues and pointers as to what the tests are analyzing, along with the expected results of those tests.
Beyond dbt tests: automating testing with data diff
Implementing dbt tests can be an effective way to proactively improve data quality. However, the major pitfall of dbt tests is scalability: each dbt test needs to be manually defined, implemented and tuned, making it virtually impossible to cover all potential failure scenarios. Another risk is blindspots: we write tests to check for failure scenarios that either have occurred or that we anticipate, often missing major potential issues. In other words, we don't know what we don't know when it comes to testing.
This is where data diff – a tool for comparing datasets – can come handy. Similarly how we run dbt tests on staging environment to validate changes to the dbt SQL, we can run data diff to compare the staging data with production data to see how each code change will impact the data. The power of data diff is in its ability to highlight all changes and therefore helping analytics engineers catch all potential issues without having to write tests beforehand.
Datafold Cloud leverages data diff to provide seamless integrations with dbt Cloud and dbt Core that enable dbt developers to:
- Easily compare data between staging and production with powerful UI visualizations
- Implement data diff in CI/CD to automate pull request reviews and test every change to the dbt code base
- Prevent breakages in downstream data apps and BI tools with column-level lineage analysis
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.