We had a fear of the unknown. When you're responsible for review of data assets and artifacts, it's hard to be thorough and get a sense of how underlying data is impacted by changes. We could only rely on a best-effort approach to reviewing PRs.
Effective use of data has been essential to Dutchie's remarkable growth. The modern data stack enabled a relatively small team to build core pipelines and data products extremely fast. However, as the business grew, and with it the volume and complexity of data, the fear of silent data failures started to affect the Data Team's ability to support the data-driven organization.
Moving fast meant making lots of changes to the ever-growing SQL/dbt ELT codebase. With every new change to schemas or business logic, the chance of accidentally introducing regressions rose. Worst of all were the regressions that creep in silently and remain undetected for a long time.
The Data Team knew that some other organizations attempted to solve data quality with alerting systems focused on anomaly detection in production, but this was antithetical to Dutchie’s philosophy. With proactive approaches as a norm across the organization, setting up an alert on a SQL threshold was seen as an anti-pattern. By the time you catch something with an alert, the Dutchie team considered it too late. The Data Team wanted a solution that detected data quality issues before they affected production to gain full confidence in what they were shipping to the data users.
The Data Team was primarily interested in Datafold's Data Diff product, which was the only solution on the market that met Dutchie’s data quality requirements.
Dutchie first implemented Datafold using its turnkey integration with dbt Cloud and then, after migrating to a self-hosted dbt installation, the team leveraged Datafold's Python SDK to implement regression testing in the internal CI/CD pipeline.
Today, Datafold serves as a gatekeeper to every change made to the growing dbt/SQL repository: every time a team member opens a pull request, Datafold produces a detailed impact analysis report showing how the code change affects the produced data in the current model and in its downstream dependencies.
This helps the Dutchie Data Team identify three core types of issues that would otherwise go undetected:
- Schema and type changes. While dbt can occasionally catch these, Datafold's Data Diff quickly shows the team if any data types change. For example, a created_at field might not include a time zone, but a code update could add one. A change like this could cause inaccuracies in reporting on timed promotions and revenue allocation, and even incorrect tax collection or charges.
- High-level statistical changes. Sometimes updates to a CASE WHEN statement look correct stylistically, but the changes to the business logic still alter the data. It’s incredibly difficult to detect the effect of these changes without hours of manual testing. The statistical changes shown in Data Diff make it immediately apparent when, for example, 20% of your data has moved from one field to another as a result of your proposed change.
- Primary key comparison. The team may not always test every model for primary key uniqueness. This out-of-the-box section of the Data Diff will warn you that the changes you’re making to a model will cause keys to be duplicated, sending you back to check your work before pushing it to production.
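The three checks above can be illustrated with a minimal Python sketch. This is not Datafold's implementation or API. It is a toy comparison over in-memory rows, and the table contents, column names, type names, and helper functions (`schema_changes`, `duplicate_keys`, `changed_fraction`) are all hypothetical.

```python
from collections import Counter

# Hypothetical schemas: "prod" is the current model, "dev" is the same model
# built from the pull request branch.
prod_schema = {"order_id": "integer", "created_at": "timestamp", "total": "numeric"}
dev_schema = {"order_id": "integer", "created_at": "timestamp with time zone", "total": "numeric"}

# Hypothetical sample rows for the two versions of the model.
prod = [
    {"order_id": 1, "created_at": "2022-01-05 10:00:00", "total": 20.0},
    {"order_id": 2, "created_at": "2022-01-05 11:30:00", "total": 35.5},
    {"order_id": 3, "created_at": "2022-01-06 09:15:00", "total": 12.0},
]
dev = [
    {"order_id": 1, "created_at": "2022-01-05 10:00:00+00:00", "total": 20.0},
    {"order_id": 2, "created_at": "2022-01-05 11:30:00+00:00", "total": 35.5},
    {"order_id": 2, "created_at": "2022-01-05 11:30:00+00:00", "total": 35.5},
    {"order_id": 3, "created_at": "2022-01-06 09:15:00+00:00", "total": 12.0},
]


def schema_changes(old_schema, new_schema):
    """Return {column: (old_type, new_type)} for added, dropped, or retyped columns."""
    changes = {}
    for column in sorted(set(old_schema) | set(new_schema)):
        old_type = old_schema.get(column)
        new_type = new_schema.get(column)
        if old_type != new_type:
            changes[column] = (old_type, new_type)
    return changes


def duplicate_keys(rows, key):
    """Return the sorted primary key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def changed_fraction(old_rows, new_rows, key, column):
    """Return the share of keys present in both versions whose value differs."""
    old_by_key = {row[key]: row[column] for row in old_rows}
    new_by_key = {row[key]: row[column] for row in new_rows}
    shared = old_by_key.keys() & new_by_key.keys()
    if not shared:
        return 0.0
    changed = sum(1 for k in shared if old_by_key[k] != new_by_key[k])
    return changed / len(shared)


print(schema_changes(prod_schema, dev_schema))
print(duplicate_keys(dev, "order_id"))
print(changed_fraction(prod, dev, "order_id", "created_at"))
```

In this toy data, the branch build retypes created_at, duplicates one order_id, and changes every created_at value, which are the kinds of signals a reviewer would want surfaced on the pull request before merging.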
For example, an analyst was changing the business logic in the CASE WHEN statements that put customers into tiers. The change was supposed to create a new tier that would replace an old one, but the implementation was wrong: the new tier had no one in it and the old tier still existed. All the tests passed, but Datafold caught it.
This was the last chance to catch the issue before a stakeholder would have seen it in their Looker dashboard, potentially days or even weeks later. Data Diff saved the team from looking bad and kept data consumers from losing confidence in data quality over a nuanced issue that would have been apparent to stakeholders.
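A small Python sketch shows how this kind of mistake can pass a typical test yet stand out once the per-tier counts are inspected. The tier names, the sample rows, and the `migration_problems` helper are hypothetical, and the sketch does not reproduce how Datafold reports the issue.

```python
from collections import Counter

# Hypothetical output of the branch build: the "silver" tier was meant to be
# replaced by a new "plus" tier, but the CASE WHEN logic never assigns "plus".
dev_rows = [
    {"customer_id": 1, "tier": "gold"},
    {"customer_id": 2, "tier": "silver"},
    {"customer_id": 3, "tier": "silver"},
    {"customer_id": 4, "tier": "bronze"},
]


def tiers_not_null(rows):
    """A stand-in for a typical column test: every customer has some tier."""
    return all(row["tier"] is not None for row in rows)


def migration_problems(rows, retired_tier, new_tier):
    """Compare per-tier counts against the intent of the change."""
    counts = Counter(row["tier"] for row in rows)
    problems = []
    if counts[new_tier] == 0:
        problems.append(f"new tier {new_tier!r} has no customers")
    if counts[retired_tier] > 0:
        problems.append(
            f"retired tier {retired_tier!r} still has {counts[retired_tier]} customers"
        )
    return problems


print(tiers_not_null(dev_rows))
for problem in migration_problems(dev_rows, retired_tier="silver", new_tier="plus"):
    print(problem)
```

The not-null style test passes because every row still has a valid tier. Only a look at the distribution of values shows that the intended change never took effect.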
While Data Diff has been helpful in ensuring the quality of code changes, the Data Team has also found a lot of value in the overall increase in data observability that the Datafold platform brought, including Catalog and column-level lineage. These provide an immediate understanding of the underlying data without digging through SQL to see where data is coming from and what it represents.
- Zero data incidents. Since the team implemented Datafold in its pipeline, there have been zero production breakages or outages.
- Saved 100+ hours with streamlined regression testing. Before, the thoroughness of testing varied, if changes were tested at all. While small changes might have gone into production without manual checks, regression testing for fundamental changes to core business logic could take a week. With Datafold, the scale of the change doesn’t matter: you don’t need to parse through the whole analysis, because it will let you know which Diff looks unusual and should be checked.
- Improved data democratization and team productivity. As Datafold is being rolled out across the distributed team at Dutchie, stakeholders can now answer their own questions about what data means and where it comes from using the data catalog and column-level lineage. With the team 90% distributed, stakeholders often have questions throughout the day and night. The data catalog and lineage allow them to find answers themselves without asking the data team for support or digging through SQL code.
- Data confidence as the team grows. The Dutchie team is preparing to grow at a rapid pace, and Datafold becomes more important as the team gets bigger. If a team of 30+ data analysts starts submitting 10+ dbt PRs per day, that’s too much to review while maintaining a high level of quality in production without Datafold.
While Datafold is still young and the tool is in its early stage, the foundation of the business is super sound. The core platform is so valuable. Datafold is solving a problem that no one else is trying to solve.