How to Create dbt Freshness Alerts

dbt freshness tests are a fantastic way to monitor source data and ensure that new records arrive at an expected cadence. However, these tests require you to define that cadence up front, which can be difficult to manage at scale.

So how can we turn dbt tests into alerts that are constantly monitoring production?

For example, one might configure an event source to expect new records every day, like so:


version: 2

sources:
  - name: events
    database: analytics
    schema: events
    freshness:
      error_after:
        count: 1
        period: day
    tables:
      - name: user_signed_up
      - name: page_viewed
      - name: new_feature_configured
      

Great! We’d definitely expect page view events every day, probably expect new users to sign up every day, and maybe expect our new feature configuration event to fire every couple of days.

After digging around in the events data, we realize that the daily period needs a little tuning, and we update the freshness thresholds to match reality: users sign up every day, page view events should arrive every 12 hours, and our new feature is configured about once a week.


version: 2

sources:
  - name: events
    database: analytics
    schema: events
    tables:
      - name: user_signed_up
        freshness:
          error_after:
            count: 1
            period: day
      - name: page_viewed
        freshness:
          error_after:
            count: 12
            period: hour
      - name: new_feature_configured
        freshness:
          error_after:
            count: 7
            period: day
            

There’s nothing inherently wrong with this, but it’s incredibly manual, and the right thresholds can change over time. What if adoption of our new feature increases dramatically? What if it’s a flop and usage declines? Either scenario would require us to manually update the freshness period, and could result in false alarms or silent failures.

Just the other day, I woke up to a Slack alert that a source table was stale. 😱

My work was interrupted as I frantically scrambled to investigate. I ran some ad hoc queries, only to discover that the freshness period on this event simply needed to be updated. To close the loop, I opened a pull request to update the test, which required a review from a colleague. None of this added real value to the business; it only distracted from higher-priority tasks. I’ve personally had this scenario play out at least a dozen times.

A potential downside of rigorous testing is alert fatigue. An overwhelming number of alerts, most of them false alarms, trains the people responsible for responding to become desensitized and, ultimately, to ignore alerts.

What if you could dynamically configure expected thresholds, and never have to deal with noisy alerts or manually determine thresholds again? Datafold alerts to the rescue!

Datafold alerts utilize an ML-based forecasting model that adapts to seasonality and trends in your data to create dynamic thresholds. You can adjust the sensitivity or add custom anomaly detection to optimize your signal-to-noise ratio.

 

Ok! So Datafold has a cool chart thingy that can detect anomalies - how can we use this to detect stale data?

High level steps:

  1. Create a dbt snapshot of the information schema to record table metadata over time
  2. Write a SQL query to summarize table changes
  3. Put query results into a Datafold alert

Create a dbt Snapshot of the Information Schema

Add the information schema as a source. In this example, I’m targeting our `analytics` database.


version: 2

sources:
  - name: information_schema
    database: analytics
    schema: information_schema
    tables:
      - name: tables
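
Before snapshotting, it can be worth spot-checking the source directly (a Snowflake-specific sketch; `row_count` and `last_altered` are columns of Snowflake’s `information_schema.tables` view):

```sql
-- sanity-check the information schema source before snapshotting it
select table_name, row_count, last_altered
from analytics.information_schema.tables
where table_schema = 'ANALYTICS'
order by last_altered desc
limit 10;
```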
      

Create a snapshot of the information schema tables source. This will allow us to see table metadata over time.


{% snapshot information_schema__tables_history %}
{{
    config(
        target_schema='analytics',
        unique_key='unique_key',
        strategy='check',
        check_cols='all',
    )
}}

select *
    -- concatenate with a separator so schema/table pairs can't collide
    , table_schema || '.' || table_name as unique_key
from {{ source('information_schema', 'tables') }}
{% endsnapshot %}


Each update to a table’s metadata will be captured in our snapshot.
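
For example, once the snapshot job has run a few times, you can inspect a single table’s history (the `dbt_valid_from` and `dbt_valid_to` columns are added by dbt’s snapshot materialization; each row is one validity window):

```sql
-- inspect the metadata history captured for one table
select table_name, row_count, dbt_valid_from, dbt_valid_to
from analytics.analytics.information_schema__tables_history
where table_name = 'PAGEVIEWS'
order by dbt_valid_from;
```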

Write a SQL Query to Summarize Table Changes

I was interested in changing row counts per table, so this query focuses on day-to-day changes in each table’s record count.



with dates as ( -- date spine
  select date as _date
  from analytics.analytics.dim_date
  where _date between current_date - 90 and current_date
)

, metadata as (
  select
      table_name
      , row_count
      , dbt_valid_from
      , dbt_valid_to
  from analytics.analytics.information_schema__tables_history
  where table_schema = 'ANALYTICS'
    and table_type = 'BASE TABLE'
    and table_name in (
        'GRAPHQL_REQUESTS'
        , 'CI_RUNS'
        , 'DIM_DATE'
        , 'ORGANIZATIONS'
        , 'PAGEVIEWS'
    )
)

, daily_row_counts as (
  select
      _date
      , table_name
      , max(row_count) as daily_row_count
  from dates
  inner join metadata
      -- cast both ends of the validity window to dates for a clean comparison
      on _date between metadata.dbt_valid_from::date
          and ifnull(metadata.dbt_valid_to, current_timestamp)::date
  group by 1, 2
)

, new_row_counts as (
  select
      *
      , lag(daily_row_count, 1) over (partition by table_name order by _date) as prior_day_rows
      , daily_row_count - prior_day_rows as row_diff
  from daily_row_counts
)

select
    _date
    , table_name
    , sum(row_diff) as daily_row_diff
from new_row_counts
where row_diff is not null
group by _date, table_name
;

Example output with date, table name, and the daily change in record counts:

Put Query Results into a Datafold Alert

Finally, creating a Datafold alert is as simple as dropping the above query into the Datafold alert UI. To scale this out, it might make sense to park the query in a dbt model.
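
A minimal sketch of what that model could look like (the model name `daily_row_diffs` is hypothetical; the `ref` points at the snapshot created earlier):

```sql
-- models/daily_row_diffs.sql (hypothetical model name)
{{ config(materialized='table') }}

with metadata as (
    select table_name, row_count, dbt_valid_from, dbt_valid_to
    from {{ ref('information_schema__tables_history') }}
    where table_schema = 'ANALYTICS'
      and table_type = 'BASE TABLE'
)

-- ...followed by the same daily_row_counts / new_row_counts logic as the query above...
select * from metadata
```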

Voila! 

Table with thousands of new daily records:

Table with few new daily records:

Table that should never change:

You now have alerts with dynamically determined thresholds, and can subscribe to Slack alerts or emails with just a few clicks.

Notes:

  • This method only applies to materialized tables. Views don’t have row counts in the information schema, so we can’t detect row-count changes for views.
  • This guide is Snowflake-specific, but it could easily be adapted to the information schema of another database.
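
For instance, on BigQuery the rough equivalent would query the `__TABLES__` metadata view instead (a sketch; the project and dataset names are placeholders):

```sql
-- BigQuery equivalent of the Snowflake information schema query
select
    table_id as table_name
    , row_count
    , timestamp_millis(last_modified_time) as last_altered
from `my_project.analytics.__TABLES__`;
```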

Tired of urgent Slack messages? Blindsided by broken dashboards and silent bugs? Spending countless hours on PR reviews? Datafold can help. Get Started today or Book a Demo with our data experts.
