Datafold + dbt: The Perfect Stack for Reliable Data Pipelines

Gleb Mezhanskiy

February 5, 2021

Whether the decision is made by an executive looking at a dashboard or by a machine learning model, one can no longer ignore the quality of the data that feeds those decisions at a modern organization: too much is at stake. Data teams are facing extraordinary complexity and volumes of data on the one hand, and increasing reliability expectations on the other. This reality is impossible to manage without the right tools that monitor and control data quality.

Software engineers faced a similar challenge a decade ago amid the explosion of cloud infrastructure and distributed applications. The tools for continuous integration, automated testing, and ubiquitous observability that make modern software systems possible are still new to the Data world. Yet it is precisely these ideas and processes that can enable Data teams to tame the complexity and move fast with confidence.

We are building Datafold to 10x the productivity of Data professionals across all industries and company sizes by giving them full visibility into their data assets and automating toil tasks that currently consume most of their time. One of our first features – Data Diff – helps data developers to quickly verify the changes introduced to the data pipelines, effectively automating one of the most time-consuming and high-risk workflows.

At the same time, robust engineering practices are being introduced in other parts of the data stack: the dbt team has been leading the Analytics Engineering movement and has given Data developers a user-friendly approach to building SQL data pipelines that elegantly incorporates some of the most important principles of agile software engineering, such as:

  1. Unified structure for data transformations (SQL + Jinja templating)
  2. Version control for source code
  3. Automated data "unit testing"
  4. Documentation that lives with code

Now Data teams can take their workflows to the next level with the one-click Datafold integration with dbt. Three Datafold features help them boost productivity and move faster without risking data quality:

  • Data Diff empowers developers to see how code changes impact the data in the modified table and downstream dependencies.
  • Data Catalog provides full-text search, a data profiler, and metadata that syncs with dbt docs.
  • Column-level lineage for all dbt models maps dependencies between tables and columns to show how data is produced, transformed, and consumed.

It’s a one-click integration with dbt Cloud, and for teams hosting dbt themselves, we expose an API for connecting with their CI. Follow the discussion in dbt Slack or watch the demo to learn more!

Data Diff shows how a change in the code affects produced data

With Datafold Data Diff, you can see how a change in your SQL code affects the data in the modified table as well as in its downstream dependencies.

  • Spend less time on QA, and see the full scope of changes without writing a single line of code
  • Eliminate accidental mistakes
  • Accelerate code reviews
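
To make the idea of a data diff concrete, here is a minimal sketch in Python. It is not Datafold's implementation or API: the `diff_tables` function, the sample rows, and the `user_id` key are all hypothetical, and the tables are small in-memory lists rather than warehouse tables. It compares the version of a table built from the current code with the version built from the modified code and reports added rows, removed rows, and per-column value changes.

```python
from typing import Any

Row = dict[str, Any]


def diff_tables(before: list[Row], after: list[Row], key: str) -> dict[str, Any]:
    """Compare two versions of a table, matching rows on a primary key."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys present in only one of the two versions.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    # For keys present in both versions, record each column whose value differs.
    changed: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for k in before_by_key.keys() & after_by_key.keys():
        old, new = before_by_key[k], after_by_key[k]
        columns = {
            col: (old.get(col), new.get(col))
            for col in old.keys() | new.keys()
            if old.get(col) != new.get(col)
        }
        if columns:
            changed[k] = columns

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical data: the table as built by the current code (prod)
# and as built by the modified code (dev).
prod = [
    {"user_id": 1, "plan": "free", "revenue": 0},
    {"user_id": 2, "plan": "pro", "revenue": 20},
    {"user_id": 3, "plan": "pro", "revenue": 20},
]
dev = [
    {"user_id": 1, "plan": "free", "revenue": 0},
    {"user_id": 2, "plan": "pro", "revenue": 25},
    {"user_id": 4, "plan": "free", "revenue": 0},
]

report = diff_tables(prod, dev, key="user_id")
print(report)
```

In this toy example the report shows one added key, one removed key, and one changed `revenue` value, which is the kind of summary a reviewer would want to see before approving a code change.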

Gain full visibility into your pipelines with column-level lineage

Data engineers spend countless hours manually mapping their data flows. When they aren’t digging into old spreadsheets or reading source code files, they are asking their colleagues for help. Often the need for lineage comes with the urgency of resolving a data incident that requires an immediate reaction to avoid costly damage.

We realize how stressful and mundane this process is, which is why we’ve released column-level lineage. Using SQL files and metadata from the data warehouse, Datafold constructs a global dependency graph for all your data, from events to BI reports.

Detailed lineage can help you reduce incident response time, prevent breaking changes, and optimize your infrastructure. Say goodbye to spending late nights answering questions such as:

  1. Where does the data used in this BI report come from?
  2. What will be the impact of changing a given column?
  3. Which columns are unused, and which are used the most?
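
The first two questions reduce to graph traversal once column dependencies are known. The sketch below is an illustration only, not how Datafold builds or queries lineage: the `UPSTREAM` mapping and all column names are made up, and in practice the edges would be derived from SQL files and warehouse metadata. It walks the dependency graph upstream to find where a BI column comes from, and downstream to find what a change to a raw column would affect.

```python
from collections import deque

# Hypothetical column-level edges: each column maps to the columns it is computed from.
UPSTREAM = {
    "stg_orders.amount": ["raw.orders.amount_cents"],
    "stg_refunds.amount": ["raw.refunds.amount_cents"],
    "fct_revenue.total": ["stg_orders.amount", "stg_refunds.amount"],
    "bi.revenue_dashboard.total": ["fct_revenue.total"],
}


def invert(edges: dict[str, list[str]]) -> dict[str, list[str]]:
    """Turn 'column -> its sources' into 'column -> its consumers'."""
    inverted: dict[str, list[str]] = {}
    for column, parents in edges.items():
        for parent in parents:
            inverted.setdefault(parent, []).append(column)
    return inverted


def reachable(start: str, edges: dict[str, list[str]]) -> set[str]:
    """Breadth-first search: every column reachable from `start` along `edges`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


DOWNSTREAM = invert(UPSTREAM)

# Question 1: where does the data used in this BI report come from?
sources = reachable("bi.revenue_dashboard.total", UPSTREAM)

# Question 2: what will be the impact of changing a given column?
impact = reachable("raw.orders.amount_cents", DOWNSTREAM)

print(sorted(sources))
print(sorted(impact))
```

Keeping the edges at column granularity rather than table granularity is what makes the impact analysis precise: a change to `raw.orders.amount_cents` flags only the columns derived from it, not every column in every downstream table.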

Discover your data through Datafold Catalog

Finding and understanding the data for every task is often a time-consuming process, considering that nowadays it's not uncommon for a data warehouse to have 5,000+ tables and 100,000+ columns.

With Datafold Data Catalog, you can keep your data documentation close to your code (e.g. a dbt model) and serve it in a responsive interface with full-text search and a per-column profiler. Alternatively, you can use a Notion-like editor to document your tables and columns.

Want to give it a try? Schedule a demo here to get early access.