How to use dbt-expectations to detect data quality issues
How often are you interrupted by a panicked stakeholder because of a downstream data quality issue? They might be coming to you because one of their key dashboards broke and now they can’t see the KPIs they use to track marketing performance. When your data sources and data models are not being tested, data quality issues tend to sneak through the cracks and manifest downstream.
Luckily, if you use the popular data transformation tool dbt, you get built-in testing for your data sources and models. Not only does dbt ship with basic tests for columns like primary keys, but its ecosystem includes many open-source packages that offer more advanced testing. In this article, I will show you how to use dbt-expectations, what to test, and how to avoid alert fatigue by limiting the scope of your tests.
What is dbt-expectations?
dbt-expectations is a dbt testing package inspired by the Python testing library Great Expectations. Tests in dbt-expectations are written in YAML templates using a combination of SQL, Jinja, and dbt macros. Compared to Great Expectations tests, dbt tests are easier to set up, easier to write, and run faster, because everything happens within your database, whereas Great Expectations is a separate application that pulls data out of the database.
Tests in dbt-expectations can be applied to sources, models, columns, and seeds within a dbt project. Some tests work on all the above while some only work on one. For example, if you use the test expect_table_column_count_to_equal, you will reference it under a source, model, or seed like so:
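For example, assuming a hypothetical model named `users` with five columns, the schema YAML would look like this:

```yaml
models:
  - name: users  # hypothetical model name
    tests:
      - dbt_expectations.expect_table_column_count_to_equal:
          value: 5  # expected number of columns
```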
However, that test actually exists as a macro with the following code:
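The macro in the package is more involved, but a simplified sketch of the idea looks like this: the column count is resolved at compile time via the adapter and then compared against the expected value.

```sql
{%- test expect_table_column_count_to_equal(model, value) -%}
{%- if execute -%}
{#- Count the columns in the relation using the adapter -#}
{%- set number_actual_columns = (adapter.get_columns_in_relation(model) | length) -%}
{#- Build a boolean expression for the test framework to evaluate -#}
{%- set expression -%}
{{ number_actual_columns }} = {{ value }}
{%- endset -%}
{{ dbt_expectations.expression_is_true(expression=expression) }}
{%- endif -%}
{%- endtest -%}
```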
dbt-expectations vs other dbt testing packages
All dbt tests under the hood are SELECT statements that seek to return records that fail your assertion. dbt provides four generic tests that can be applied in a dbt project without having to install any external packages: not_null, unique, relationships, and accepted_values.
If you want to test other, more granular data quality issues like incorrect data types, duplicates, and missing data, then you would need to install an external package like dbt-expectations or write your own tests in SQL. You can also check out the tests available in other dbt testing packages like dbt-utils or dbt-audit-helper, but neither offers as comprehensive a library of data quality tests as dbt-expectations.
In order to get started with dbt-expectations, make sure you have a valid dbt project. Start by adding the package to your packages.yml file:
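The entry looks like this (check dbt Hub for the latest version range; the one below is only an example):

```yaml
packages:
  - package: calogica/dbt_expectations
    version: [">=0.8.0", "<0.9.0"]
```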
Next, add the dbt_date variable to your dbt_project.yml file. This is referenced by some of the tests.
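The package's README documents this variable; the time zone value below is an example you should replace with your own reporting time zone:

```yaml
vars:
  'dbt_date:time_zone': 'America/Los_Angeles'
```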
Lastly, run the `dbt deps` command and you are ready to start testing with dbt-expectations!
Example dbt-expectations tests
dbt-expectations contains plenty of tests to address the most common data quality issues. Incorrect data types, stale data, missing data, and non-unique or duplicate values are just some of the problems that this package can solve.
Incorrect data type
Data types are tricky because they often affect the SQL functions that you can use when writing your dbt data models. Unfortunately, if a column value has an unexpected datatype, it could break your entire model. This is exactly why it’s important to assert expectations for the data type of your column. I particularly recommend focusing on timestamp, date, integer, and string values as these tend to cause the most issues.
For this, dbt-expectations has two tests: expect_column_values_to_be_of_type and expect_column_values_to_be_in_type_list.
You will want to use the first test if your column only has one valid datatype. The second test is a little more forgiving as it allows you to pass a few different data types that would be accepted. If I have a timestamp column called `created_at`, I would test that column to ensure it is a timestamp value by specifying the expect_column_values_to_be_of_type test under the `created_at` column name.
This would look something like this:
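Assuming a hypothetical model named `users`, the schema YAML would be:

```yaml
models:
  - name: users  # hypothetical model name
    columns:
      - name: created_at
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: timestamp
```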
Now, if the `created_at` column is a date, or any datatype other than a timestamp, this test will fail and alert you.
Stale data
Freshness can be a data quality issue that has a major impact on data directly used by the business. If you aren’t familiar with freshness, this refers to how up-to-date your data is. It’s important that you are alerted when your data isn’t as fresh as you expect it to be, as this could signal upstream data pipeline issues.
I’ve personally experienced what it’s like to have stale data without realizing I have stale data, and let me tell you, it’s a problem you don’t want to come across. Freshness tests like the ones written in dbt-expectations can save you a lot of frustration.
My personal favorite test to use is expect_row_values_to_have_recent_data. This one takes a timestamp column as input and allows you to set a time threshold to check for. It accepts any time interval allowed in a typical DATEPART() SQL function, including year, quarter, month, day, hour, and minute.
You can then specify a numerical value for the interval that you set. Let’s add this test to the `created_at` column, to which we also added a datatype test.
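Continuing with the hypothetical `users` model, the test would be configured like so:

```yaml
models:
  - name: users  # hypothetical model name
    columns:
      - name: created_at
        tests:
          - dbt_expectations.expect_row_values_to_have_recent_data:
              datepart: hour
              interval: 3  # most recent row must be at most 3 hours old
```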
This test will ensure that the most recent value in the `created_at` column was created 3 hours ago or sooner. This would be a good time interval for a model that is run every 3 hours. Make sure you don’t choose just any interval, but one that makes sense for the data in the column you are testing. You want this test to be as helpful as possible in identifying root-cause issues in your sources and models.
For another freshness test, you can also check out the expect_grouped_row_values_to_have_recent_data to test aggregate queries.
Missing data
Missing data is one of the scariest data quality issues to run into because you have possibly lost data that you will never be able to get back. I once lost two weeks’ worth of web session data, losing valuable insight into customer behaviors. When data is missing, it often hints at a greater data engineering or event collection issue. That said, you’ll want to catch it as soon as possible.
There are a few different tests you can apply to check for this. A common one is expect_column_to_exist which ensures that a column is ingested into your data. This is great for capturing any potential schema changes.
Meanwhile, expect_column_values_to_not_be_null is a great test for checking the specific values of a column. It is more advanced than the not_null test among dbt's generic tests because of the ability to add a row condition: if you do expect NULLs under certain conditions, you can specify a filter to be applied with the test. The generic test can only check every value in a column for NULLs.
Let’s say I had a table that tracked users and whether or not they signed up for a paid account. If they signed up, they would have a timestamp populated in the `created_at` column and an `account_id`. If they did not sign up, they would not have a timestamp populated in the `created_at` column and no account_id. We want to ensure this account_id column is never NULL when a user has signed up for a paid account, but we must do this considering the `created_at` column.
We could write our test like so:
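Using a hypothetical `users` model with the `account_id` and `created_at` columns described above:

```yaml
models:
  - name: users  # hypothetical model name
    columns:
      - name: account_id
        tests:
          - dbt_expectations.expect_column_values_to_not_be_null:
              row_condition: "created_at is not null"  # only check signed-up users
```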
This row condition excludes from the test all rows where the `created_at` column is NULL. Every remaining row is still checked for NULL values in `account_id`.
Non-unique and duplicate values
Finding values that are not unique, especially when you expect a single column to act as a primary key, could identify a larger issue in your database design or dimensional modeling. It is helpful to have checks in place that notify you when your column values are no longer unique. As with the other data quality issues, this often points to an even larger problem.
While dbt generic tests offer a uniqueness test, so does dbt-expectations with expect_column_values_to_be_unique. The difference is in the ability to add a row condition.
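For instance, reusing the hypothetical `users` model, you could require `account_id` to be unique only among users who have signed up:

```yaml
models:
  - name: users  # hypothetical model name
    columns:
      - name: account_id
        tests:
          - dbt_expectations.expect_column_values_to_be_unique:
              row_condition: "created_at is not null"  # ignore users without accounts
```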
dbt-expectations also has the expect_table_column_count_to_equal test mentioned earlier, which checks the number of columns in a model. So rather than just checking for duplicates across rows, you can catch duplicated or unexpected columns as well. This proves particularly helpful with schema changes.
Lastly, my favorite test in dbt-expectations for checking duplicates is expect_table_row_count_to_equal_other_table. This one is great to use on models where you expect the resulting row count to equal the row count of one of the staging tables used to build it. Ideally, if you build a model with a left join between two tables, its row count should equal that of the left-hand table.
You would write a test like this like so:
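Assuming a hypothetical model `users_joined` built with a left join from a staging model `stg_users`:

```yaml
models:
  - name: users_joined  # hypothetical model name
    tests:
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: ref("stg_users")  # hypothetical staging model
```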
Some tests, like this one, have extra parameters you can add to make your test complete. This one allows you to group by columns in two different models and compare them. You can also add row conditions for both models you are comparing.
Where to run dbt-expectations tests
Like any other dbt tests, you can run tests in development, on your CI pipeline, or after your production dbt runs.
During development, you can manually run tests with the `dbt test` command and see results in your command line. dbt also allows you to run a subset of tests, such as source tests only, with more specific selectors like `dbt test --select "source:*"`.
You can also automate running tests after you push code changes to a pull request. For example, dbt Cloud provides automation to update pull requests in GitHub, GitLab, or Azure DevOps with a status message indicating the results of the run. The status message will state whether the models and tests ran successfully or not.
Alert fatigue often arises from running dbt tests after production dbt runs and collecting every failing test result in, for example, a Slack channel.
How to avoid alert fatigue
If you are getting alerts all the time from failing dbt tests, you’ll eventually start ignoring them. Nobody has time to spend every day figuring out what caused a test to fail. This is why it’s important to correctly identify the tests that are most important to your business; these are the ones you want to spend the most time implementing. And, while implementing them, it’s important that you understand the data and the nuances that come with it. If NULL values in a column are common and expected, you probably shouldn’t be testing that column for NULLs. Data observability tools for dbt like re_data and elementary-data can collect metadata and provide anomaly detection, so you can avoid writing too many manual tests.
Make sure you fully understand each use case before implementing a test for it.
I recommend always beginning testing at the source. This way you will detect any data quality problems at the first layer where data is brought into dbt. Testing at the source will help you identify any source data issues if models downstream are failing. Otherwise, these source data problems can be hard to detect. After source tests are written, you can focus on your core data models that are used by the business. Tests on these models paired with source tests will help you identify data quality issues directly affecting the business.
Other tools like Datafold, which do automated testing in development and deployment, reduce the overhead of writing and maintaining tests. For example, Datafold allows you to see data changes on your dbt models as you develop and from your pull request. By validating data changes you can be confident that code changes will have the desired impact on data and avoid data regressions. dbt-expectations can then focus on catching data issues that originate at the ingestion layer and trickle down to the data models.
dbt-expectations allows you to implement an entire suite of pre-built tests to detect data quality issues. Just be careful to minimize the number of tests you add in order to avoid alert fatigue. Start with testing your data sources and core models in order to limit the scope. You can complement generic dbt tests and dbt-expectations tests with automatic testing tools like Datafold to validate dbt code changes by seeing the impact on data.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.