Datafold + dbt: The Perfect Stack for Reliable Data Pipelines
Whether a decision is made by an executive looking at a dashboard or by a machine learning model, a modern organization can no longer ignore the quality of the data that feeds those decisions: too much is at stake. Data teams face extraordinary complexity and volumes of data on the one hand, and rising reliability expectations on the other. This reality is impossible to manage without the right tools for monitoring and controlling data quality.
Software engineers faced a similar challenge a decade ago amid the explosion of cloud infrastructure and distributed applications. The tools for continuous integration, automated testing, and ubiquitous observability that make modern software systems possible are still new to the data world. But adopting these ideas and processes is what enables a top data team to tame the complexity and move fast with confidence.
We are building Datafold to 10x the productivity of data professionals across all industries and company sizes by giving them full visibility into their data assets and automating the toil that currently consumes most of their time. One of our first features – Data Diff – helps data developers quickly verify changes introduced to their data pipelines, automating one of the most time-consuming and highest-risk workflows.
At the same time, robust engineering practices are taking hold in other parts of the data stack: dbt Labs has been leading the Analytics Engineering movement, giving data developers a user-friendly approach to building SQL data pipelines that elegantly incorporates some of the most important principles of agile software engineering, such as:
- Unified structure for data transformation (SQL + Jinja templating)
- Version control for source code
- Automated data "unit testing"
- Documentation that lives with code
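To make the "unit testing" point concrete, here is a sketch of what dbt-style data tests such as `unique` and `not_null` actually check, expressed as plain SQL run from Python against an in-memory SQLite table (the `orders` table and its columns are made up for illustration; this is the concept, not dbt's implementation):

```python
# Conceptual illustration of dbt-style data tests, run against SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, status TEXT);
    INSERT INTO orders VALUES (1, 'placed'), (2, 'shipped'), (3, 'returned');
""")

# A `unique` test fails if any key value appears more than once
dupes = conn.execute("""
    SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1
""").fetchall()
assert dupes == [], f"duplicate keys: {dupes}"

# A `not_null` test fails if the column contains NULLs
nulls = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_id IS NULL"
).fetchone()[0]
assert nulls == 0, "order_id contains NULLs"
```

In dbt, the same assertions are declared once in a model's YAML schema file and run automatically on every build.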
Now data teams can take their workflows to the next level with the one-click Datafold integration for dbt, boosting their productivity and moving faster without degrading data quality, thanks to three Datafold features:
- Data Diff empowers the data engineer or analyst to see how code changes impact the data in the modified table and downstream dependencies.
- Data Catalog with full-text search, a data profiler, and metadata that syncs with dbt docs.
- Column-level lineage for all dbt models maps dependencies between tables and columns, showing how data is produced, transformed, and consumed.
It’s a one-click integration with dbt Cloud; for teams hosting dbt Core themselves, we expose an API for connecting with your CI.
Data Diff shows how a change in the code affects produced data
With Datafold Data Diff, you can now see how a change in your SQL code affects the data in the modified table as well as its downstream dependencies.
- Spend less time on QA, and see the full scope of changes without writing a single line of code
- Eliminate accidental mistakes
- Accelerate code reviews
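The core idea behind a data diff can be sketched in a few queries: given the production version of a table and a dev build of the same model, compare them by primary key to find rows that were added, removed, or changed. The table and column names below are illustrative, and this is a conceptual sketch, not Datafold's actual implementation:

```python
# Minimal data-diff sketch: compare two versions of a table by primary key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE dev_orders  (order_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO prod_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    -- the dev build drops row 3, changes row 2, and adds row 4
    INSERT INTO dev_orders  VALUES (1, 10.0), (2, 25.0), (4, 40.0);
""")

def diff(conn):
    # keys present in prod but missing from dev
    removed = conn.execute("""
        SELECT order_id FROM prod_orders
        EXCEPT SELECT order_id FROM dev_orders
    """).fetchall()
    # keys present in dev but missing from prod
    added = conn.execute("""
        SELECT order_id FROM dev_orders
        EXCEPT SELECT order_id FROM prod_orders
    """).fetchall()
    # keys in both, but with differing values (IS NOT is NULL-safe in SQLite)
    changed = conn.execute("""
        SELECT p.order_id FROM prod_orders p
        JOIN dev_orders d USING (order_id)
        WHERE p.amount IS NOT d.amount
    """).fetchall()
    return {"removed": removed, "added": added, "changed": changed}

result = diff(conn)
print(result)  # → {'removed': [(3,)], 'added': [(4,)], 'changed': [(2,)]}
```

A production diff tool also has to do this efficiently at scale and summarize value-level distribution shifts, but the added/removed/changed decomposition above is the essence of the report you review in a pull request.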
Gain full visibility into your pipelines with column-level lineage
Data engineers spend countless hours manually mapping their data flows. When they aren’t digging through old spreadsheets or reading source code, they are asking colleagues for help. Often the need for lineage arises with the urgency of resolving a data incident that requires an immediate reaction to avoid costly damage.

We know how stressful and tedious this process is, which is why we’ve released column-level lineage. Using SQL files and metadata from the data warehouse, Datafold constructs a global dependency graph for all your data, from events to your BI tool and reports:
Detailed lineage can help you reduce incident response time, prevent breaking changes, and optimize your infrastructure. Goodbye to spending late nights answering questions such as:
- Where does the data used in this BI report come from?
- What will be the impact of changing a given column?
- Which columns are used the most, and which are not used at all?
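The second question above is an impact query over a dependency graph: starting from one column, walk every edge downstream. Here is a toy sketch with a hypothetical lineage graph (the model and column names are invented for illustration; Datafold builds this graph for you from SQL and warehouse metadata):

```python
# Toy column-level lineage graph and a downstream-impact query.
from collections import deque

# edge: upstream column -> columns derived from it
LINEAGE = {
    "raw_orders.amount": ["stg_orders.amount_usd"],
    "stg_orders.amount_usd": ["fct_revenue.daily_revenue"],
    "fct_revenue.daily_revenue": ["dashboard.revenue_chart"],
    "raw_orders.status": ["stg_orders.status"],
}

def downstream(column):
    """Breadth-first walk: every column affected if `column` changes."""
    seen, queue = [], deque([column])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

impacted = downstream("raw_orders.amount")
print(impacted)
# → ['stg_orders.amount_usd', 'fct_revenue.daily_revenue', 'dashboard.revenue_chart']
```

Answering "where does this report's data come from?" is the same traversal with the edges reversed.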
Discover your data through Datafold Catalog
Finding and understanding the data for every task is often a time-consuming process: nowadays it's not uncommon for a data warehouse to have 5,000+ tables and 100,000+ columns. With Datafold Data Catalog, you can keep your data documentation close to your code (e.g., your dbt models) and serve it in a responsive interface with full-text search and a per-column profiler. You can also use the catalog to understand data freshness and combine metadata from across your warehouse with your dbt documentation.

Alternatively, you can use a Notion-like editor to document your tables and columns:
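At its simplest, catalog search is full-text matching over table and column metadata. The sketch below uses hypothetical entries hard-coded in a list; in a real catalog these descriptions would be synced from dbt docs and warehouse metadata rather than written inline:

```python
# Minimal sketch of full-text search over catalog metadata.
CATALOG = [
    {"table": "fct_revenue", "column": "daily_revenue",
     "description": "Revenue per day in USD, net of refunds"},
    {"table": "stg_orders", "column": "status",
     "description": "Order lifecycle stage: placed, shipped, returned"},
]

def search(query):
    """Return catalog entries whose column name or description mentions the term."""
    q = query.lower()
    return [e for e in CATALOG
            if q in e["description"].lower() or q in e["column"].lower()]

print([e["table"] for e in search("revenue")])  # → ['fct_revenue']
```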
Want to give it a try? Schedule a demo here to see it in action.
If you want to take your data quality skills to the next level, or if your stakeholders have simply refused to QA your data for you, consider trying Datafold.