dbt Exposures: What are they and how to use them

The output of our work as data practitioners is data products – datasets, dashboards, reports, ML models. No matter how complex or lengthy the pipelines behind them are, it's the final data products that make an impact on the business.

Therefore, it’s essential to understand how data consumers use the data produced by dbt pipelines, usually through some form of data lineage. dbt ships with automatic table-level lineage (and dbt Cloud adds column-level lineage) for the dbt project via dbt docs, but this lineage automatically tracks only dbt source tables and models, not how the models are used by the business.

dbt Exposures extend dbt's native docs and allow dbt developers to document end-user data products and their dependencies within the dbt DAG (directed acyclic graph). Exposures help answer questions such as:

  1. What dbt models are upstream of dashboard X?
  2. Who owns dashboard X?
  3. What downstream applications may break if we update model Y?
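
Questions 1 and 3 can be answered directly from the command line with dbt's node selection syntax. A minimal sketch, assuming an exposure named report_daily_kpi like the one defined below:

# List everything upstream of the dashboard (question 1)
dbt ls --select +exposure:report_daily_kpi

# List exposures downstream of a model you are about to change (question 3)
dbt ls --select dim_customers+ --resource-type exposure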

dbt exposures: a usage example

Exposures are defined in YAML files nested under the "exposures" key.

Code example of a BI dashboard as a dbt Exposure


version: 2

exposures:

  - name: report_daily_kpi
    label: Daily KPI Dashboard
    type: dashboard
    maturity: high
    url: https://bi.tool/dashboards/100
    description: >
      Daily KPI dashboard that the execs look at every day

    depends_on:
      - ref('fct_transactions')
      - ref('dim_customers')
      - source('gsheets', 'goals')
      - metric('revenue')

    owner:
      name: Terry Soulcounter
      title: Accountant
      email: terry@greatbeyond.com


Exposures in the dbt DAG

Once added to your dbt project via YAML, exposures appear in your dbt DAG (both on the local docs site and the dbt Cloud documentation site) as orange nodes. As with most lineage graphs, you can click on an exposure's nodes and paths to clearly understand its upstream and downstream dependencies.
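
If you're browsing the docs locally, regenerate and serve them after adding the exposure YAML so the new nodes show up; a minimal sketch using standard dbt CLI commands:

# Rebuild the manifest and catalog, then browse the DAG locally
dbt docs generate
dbt docs serve  # serves the docs site on http://localhost:8080 by default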

An exposure node in the dbt DAG. Source: dbt Labs

Exposures best practices

As with any governance feature, it’s important to think about best practices when implementing exposures.

#1: Start small with business-critical exposures

While exposures are a powerful feature, adding exposure tracking to a mature project with thousands of BI and other dependencies can be overwhelming. When you start using exposures, begin with the top 10 most important data products. These are usually well known, e.g., an executive KPI dashboard or a reverse-ETL sync into the CRM. Starting small lets you and the team get familiar with the exposures framework and facilitates wider adoption.
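
For a non-dashboard asset like that reverse-ETL sync, the same YAML structure applies with a different type. A minimal sketch, with hypothetical names (hightouch_crm_sync, fct_accounts) used purely for illustration:

version: 2

exposures:

  - name: hightouch_crm_sync
    label: CRM Account Sync (Hightouch)
    type: application
    maturity: high
    description: >
      Reverse-ETL sync that pushes account attributes into the CRM.

    depends_on:
      - ref('fct_accounts')

    owner:
      name: Data Platform Team
      email: data@greatbeyond.com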

#2: Establish team guidelines

Once you’ve added exposure tracking for the essential assets, it may be a good time to establish team guidelines, e.g., "every data/analytics engineer should maintain exposures for the BI assets they own" or "when creating a dashboard for stakeholders, always add an exposure."

Having clear guidelines makes it easy to maintain and enforce team-wide curation of exposures.

#3: Keep exposures healthy

As with dbt tests, it’s essential to keep exposures up to date. Once the information in an exposure goes stale, e.g., the owner is no longer with the company, the BI tool URL is broken, or the dashboard was deprecated but is still tracked, data team members and business users will eventually lose trust in exposures and stop using them, which is the opposite of what we want. Returning to best practice #1: it’s better to have a few high-quality exposures that stay up to date than hundreds that are stale and untrusted.

dbt exposures limitations

While exposures are a simple and powerful way to document downstream data applications in a dbt project, they have two fundamental limitations.

dbt exposures must be manually created and maintained with YAML, which does not scale effectively

The more widely data is adopted in the organization (a good thing), the harder it is for the data team to keep track of all downstream uses of the data they produce. In my data engineering days at Lyft, we had over 100 major dashboards across Looker and Tableau and over 10,000 reports in Mode.

Exposures don’t detect potential breakages during code changes

One big reason to have visibility into downstream data uses is to avoid breaking data products when changing dbt code upstream. While exposures make defined dependencies visible in dbt docs, someone still has to walk through a (sometimes giant) graph of dependencies to identify potential breaking changes.
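
dbt's state-based selection can partially automate this triage in CI. A sketch, assuming you've stashed the production run's manifest in ./prod-artifacts:

# List exposures downstream of any model modified on this branch,
# comparing against the production manifest
dbt ls --select state:modified+ --resource-type exposure --state ./prod-artifacts

This tells you which exposures might be affected, but not whether the data feeding them actually changes, which is where the next section comes in.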

Automating exposures with dbt + Datafold

Datafold complements dbt with automated column-level lineage that integrates with all major BI tools. Unlike exposures (which must be defined manually) and dbt’s own data lineage (which is limited to dbt-project assets), Datafold semantically parses the SQL logs from your data warehouse and combines them with metadata from BI tools to form a complete dependency graph that covers the entire data warehouse, including, but not limited to, data models and BI assets.

Lineage in Datafold from source tables to data app assets, in this case a reverse-ETL sync in Hightouch

Furthermore, when integrated into CI, Datafold automatically computes data diffs that show how the data changes when dbt code is modified, and identifies impacted downstream applications, such as Looker or Tableau dashboards, directly in the pull request.

Datafold automatically adds a comment to your dbt PR indicating data differences between your prod and dev tables, as well as potentially impacted downstream data app dependencies

Conclusion

dbt Exposures, code-defined extensions of your dbt project that identify downstream data assets (like a reverse-ETL sync, BI dashboard, or data science model), are a useful way to extend your default dbt DAG. Each exposure must be manually defined and maintained in YAML. While exposures are an effective way to understand the downstream use of your dbt models, they can be challenging to implement at scale.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
