The day you stopped breaking your data

In a recent edition of the Analytics Engineering Roundup, dbt Labs CEO Tristan Handy writes, “Rather than building systems that detect and alert on breakages, build systems that don’t break.”

Every vendor that has something to do with data quality will tell you that it’s precisely their solution that will finally deliver you this peace of mind. It won’t – at least not on its own and not entirely. Data quality is a complex problem that involves technologies, tools, processes, culture, and cosmic rays.

But we can certainly do better, especially in the area of preventing data quality issues from occurring in the first place.

I am biased because I started Datafold to solve precisely this pain point, but I am also credible because I’ve broken way too many data pipelines in my data engineering days.

The case for proactive data quality

It’s that simple.

We break it

Someone in our organization, maybe even on the same team, maybe even you, made a breaking change.

Examples:

  • Analytics engineer changes the definition of tax_charge in a dbt model per the CFO’s request. Result: the Sales team over-reports revenue for three months. Apparently, they relied on the data synced from dbt into Salesforce to track revenue.
  • Software engineer removes the promo_code field from the user_signup event because “this context is now provided by another microservice.” Result: Marketing’s reporting is broken. Fixing it takes two weeks since it requires new instrumentation. The same scenario applies to changes to transactional tables replicated from OLTP databases into the warehouse for analytics.
  • Analytics engineer renames a field in a dbt model “for consistency.” An online machine learning model that powers search is no longer online.

They break it

Data breaks for reasons [mostly] outside of our control.

Examples:

  • Airflow scheduler errors out; a task never ran but shows as completed.
  • Event was duplicated in the streaming pipeline, causing a fanout in the warehouse.
  • Vendor providing financial data shipped us a dataset omitting three markets.
  • Massive drop in signups today. Could be us pausing ads, could be a delay in the data ingestion pipeline, or it could be a real problem, but we won't know until someone spends a couple of hours digging into it.

In my experience (Daniel Kahneman rolls his eyes), we significantly overestimate the percentage of externally-caused data issues and dramatically underestimate how often we (people) break things for ourselves and others.

Having caused massive data incidents by making 2-lines-of-SQL “hotfixes”, I’ve come to believe that:

  1. The vast majority of data quality issues are caused by people when they make changes to the code that touches data. From event schemas to dbt/SQL to BI/ML apps.
  2. The vast majority of those issues can be prevented before they get to production by implementing a better change management process.

Again – preventing issues from getting into production, as opposed to detecting them in production (which is the focus of data monitoring).

In that same Roundup, Tristan continues: “Today, we’re catching far too many error states in production. This is why there is so much focus on “observability” today—error states show up in production and we need tooling to understand/diagnose/fix them. But instead of investing in ever-more-robust monitoring of our tightly-coupled and often-fragile production pipelines, what if we made our pipelines more robust?!”

Data monitoring is valuable, but it should not be relied upon for catching bugs introduced during development – those should be identified and fixed before they do damage in production: in staging, CI/CD, and during PR review (more on that below). In software, shift-left testing has long been practiced.

One of the most effective ways to make a pipeline more robust is to make it less susceptible to human error. Indeed, we should raise our expectations for the change management workflow.

Per Tristan, “Probably, one of those expectations should be: don’t make changes that break downstream stuff.”

We make mistakes because we don’t know our data well enough

Of course, no one wants to blow up data for their stakeholders. But this is a complicated problem because:

  1. Data is complex (duh) – to write bug-free code, we need to know the data’s properties well (e.g., the distributions of values in each column). We often don’t, so we make assumptions, and those often bite us in unexpected ways much later.
  2. Data products contain massive amounts of complex business logic (e.g., at Lyft, calculating the “driver utilization” metric alone involved 1,000+ lines of pure SQL).
  3. Data products are highly interdependent. Tracking those dependencies becomes hard, if not impossible, even at a much smaller scale than Shopify’s 10,000 DAGs. Those dependencies evolve spontaneously – others build new work on top of our work all the time without telling us.
  4. Data products have to evolve fast. Data-driven organizations put a lot of pressure on data teams to build more data products faster to support essential decision-making and product automation.

To tackle these challenges, we need a proactive data quality approach focused on preventing preventable issues, such as those introduced when we make changes to the code that touches data.

Principles of proactive data quality

1 – Focus on improving the workflow

If bugs get introduced in the developer workflow, we need to improve that workflow rather than add a new one. Make it easy and fast to do the right thing!

One of the biggest challenges with implementing monitoring solutions (in both software and data) is that they sit outside of existing workflows. Shipping a new feature involves talking to people, writing code, and deploying the code. How does monitoring fit in? If anything, dozens of Slack notifications about data anomalies require prioritization, investigation, and fixing – activities that don’t exist in the original developer workflow. Furthermore, they are disruptive. Of course, we still want monitoring for external issues, but handling as many data quality issues as possible within the developer workflow is the key to minimizing noise and making monitoring more effective.

2 – Know what you ship

When you propose a change by opening a pull/merge request, it’s easy to assume you know what you are doing. But do you, really?

To “stop breaking downstream stuff”, every time we make a change anywhere in the data pipeline, we need to do three things:

Ensure data correctness

Make sure the change has the correct effect on the exact thing we’re modifying. For example, if we’re changing an event schema, everything is named and typed correctly; if we’re changing a dbt model, the data in the updated model is correct.

Check for downstream impact

  • Have we changed schemas/interfaces/contracts? Changes as benign as switching INT to FLOAT that don’t affect the actual numbers can have disastrous consequences downstream.
  • Have we changed any business definitions/calculations? Any modification to how user sessions are defined, for example, is almost guaranteed to blow up lots of key metrics.

Check for impact on infrastructure

  • What’s the impact of our change on data timeliness and SLAs? Can this new fancy column that we just added via a beefy JOIN increase the runtime of the model and delay the entire pipeline?
  • What’s the impact of our change on the computing resources? Maybe it’s time to resize our Snowflake warehouse / Redshift cluster?

3 – Automate everything

I know staff-level data engineers at top tech companies who manually QA every change to financial data pipelines for 4-5 days, methodically filling out “acceptance” spreadsheets. An entire week to test a single change! 

I also know data people at respectable companies who… substitute testing with broadcasting Slack messages along the lines of “made some updates to tables X&Y, let me know if you notice anything wrong.” When I ask them why, they say they deliberately optimize for speed.

The good news is that we don’t need to choose between robust data pipelines and speed of development. 

Knowing what you ship means making these checks an automated process. To fully improve the workflow, we should run those checks without requiring human input. In other words, no matter who changes what, those checks should be performed automatically.

Tools for proactive data quality

There are five critical elements to a robust change management process.

Version control

First, we should have version control for everything that touches data, from event schemas to SQL code to BI dashboards to data activation. Version control is fundamental to reliable change management, nuff said. It all happens in the pull request. If you could only improve one thing – improve how your team handles pull/merge requests. Pull requests tie together version control, automated tests, and team and cross-team collaboration.

You should deal with data quality in the pull request as much as possible (simply because whatever bugs you miss in the pull request will blow up production).

Staging environment

Staging a change to a data pipeline means building a portion or the entire pipeline using the new code in an environment separate from production. This enables us to test, validate, QA, taste, and otherwise make sure our change is doing what we expect it to. dbt is worth adopting just for its ability to build staging environments with a single command!
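
As a rough sketch, here is what a per-pull-request staging build can look like when driven from CI. The schema override via an environment variable and the artifact path are assumptions about your project setup, not a prescription:

```python
# Sketch of a per-PR staging build with dbt, run from a CI job.
# Assumes profiles.yml reads the target schema from the DBT_SCHEMA env var
# and that production artifacts (manifest.json) were downloaded to ./prod-state.
import os
import subprocess

pr_number = os.environ.get("PR_NUMBER", "123")  # provided by the CI runner (illustrative)
env = {**os.environ, "DBT_SCHEMA": f"analytics_pr_{pr_number}"}

# Build only the models modified in this PR plus everything downstream of them,
# deferring to production for unmodified upstream models.
subprocess.run(
    ["dbt", "build", "--select", "state:modified+", "--defer", "--state", "./prod-state"],
    env=env,
    check=True,
)
```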

While the current mainstream approach is to build staging using new code on production data, some use synthetic datasets to build staging because of privacy and/or cost concerns. It’s essential that we build a staging environment for every change (every pull/merge request) to isolate changes from one another cleanly. Once we have version control and staging, we can finally run some automated checks!

Assertions

(aka data tests or expectations)

Assertions are to data what unit tests are to software. They are simple checks designed to verify our expectations, e.g., the customer_id field is always unique and the user_email field is never NULL.

dbt comes with an integrated testing framework, and packages such as great_expectations make it easy to write assertions for any data in a relational database. Tip: Assertions can also be run regularly against production to monitor the data!
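
As an illustration, here is roughly what a couple of such assertions look like with great_expectations’ classic pandas-based API (newer releases restructure the API, so treat this as a sketch; the example data is made up):

```python
# Minimal assertion sketch using great_expectations' classic pandas-based API.
# The DataFrame stands in for a table pulled from the warehouse; data is made up.
import great_expectations as ge
import pandas as pd

users = pd.DataFrame(
    {"customer_id": [1, 2, 3], "user_email": ["a@example.com", "b@example.com", None]}
)
dataset = ge.from_pandas(users)

# customer_id must be unique; user_email must never be NULL
print(dataset.expect_column_values_to_be_unique("customer_id"))
print(dataset.expect_column_values_to_not_be_null("user_email"))  # fails on the None above
```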

The major disadvantage of assertions is that they are not scalable: test coverage is always a moving target, and assertions must be written or at least maintained manually. Therefore, assertions are most effective when applied to the most critical data, but we can’t rely on them alone for data testing.

Impact analysis with data lineage

As we saw earlier, the lack of visibility into how data is used leads to us “breaking downstream stuff”. Data lineage solves precisely that problem: it shows how data moves through the entire stack. End-to-end column-level lineage is the most powerful kind, since it can tell how a given field propagates from a data source (events/extracts) through the transformation layer (dbt/Airflow/Dagster) to data apps (e.g., BI, ML, data activation).

For instance, Datafold computes the column-level lineage graph by crawling metadata throughout the data stack and performing static analysis of all SQL statements executed in the warehouse. The resulting lineage graph can be explored visually (below) or queried through a GraphQL API.

The lineage API can be leveraged in CI to check the downstream impact of every change. For example:

A software engineer needs to remove an event field. How do they minimize the probability of breaking something important downstream?

  • They can post in the #analytics channel: “Planning to remove field X from event Y – any concerns?”
  • Better: they search the Airflow/dbt code base for Y.X references, run git blame, and notify owners. But important events/fields may have been referenced hundreds of times across the code base or not be searchable at all.
  • Best: once the software engineer opens a pull request, the CI runner (e.g., a GitHub Action) makes a call to the lineage API: datafold_sdk.get_downstream_dependencies(table_name, column_name) – see the sketch below.
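
A minimal sketch of what that CI step could look like, assuming the call above returns a list of downstream assets. The return shape, the example table and column, and the fail-the-build policy are illustrative, not the actual Datafold API:

```python
# Illustrative CI check: surface (and optionally block) changes to fields that
# have downstream consumers. The SDK call is the one referenced above; the
# return shape assumed here (a list of asset names) is an assumption.
import sys

import datafold_sdk

downstream = datafold_sdk.get_downstream_dependencies(
    table_name="analytics.user_signup",  # hypothetical table touched by this PR
    column_name="promo_code",            # hypothetical column being removed
)

if downstream:
    print("This change affects downstream assets:")
    for asset in downstream:
        print(f"  - {asset}")
    sys.exit(1)  # fail the check so the PR cannot be merged without a review
```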

Datafold can then surface those dependencies back to the pull request for everyone (including the change author and all relevant stakeholders) to see:

Given that lineage data is programmatically accessible, we can implement sophisticated rules to fortify our change management process further:

  • Automatically request reviews from owners of affected data assets downstream
  • Trigger tests in downstream apps and repos. For example, lineage can detect that some Looker dashboard indirectly (through multiple layers of dbt models) depends on an event column we are about to modify. We can then use Spectacles to dry-run Looker dashboards that draw data from the event to make sure nothing breaks.

Data Diff

Now, tell me: so you wrote/modified some SQL/Python/you-name-it code. How do you know it’s correct?

OK yeah, SELECT * LIMIT 100. All right, you wrote/ran some assertions. Oh – SELECT COUNT(*), COUNT(DISTINCT primary_key). Of course.

But, seriously, how do you know that your code is doing the right thing? What gives you confidence that you accounted for all possible regressions and the long tail of issues, not just in the table you modified but in every table, dashboard, or Hightouch sync downstream? Human brains shine at many things, but running 300-line SQL statements across multiple tables is not one of them.

Per the Knowing What You Ship principle, we need to understand:

Data Profile: the properties of the dataset that our code produces

Data Diff: if this is not the first time we push this code (i.e., we changed some existing code), how does the data change when the code changes?

The basic properties we need to understand include the distributions of data (histograms) for every column and quality metrics such as the % of unique and null values.

Quickly sanity-checking those indicators along with the data sample can save us from making all sorts of mistakes and make the development much more efficient and enjoyable.

For Python/Pandas workflows in Data Science, the pandas-profiling package provides an excellent overview of such metrics.
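
For example, a two-line profile of a model’s output (the CSV path is just a placeholder for however you load the data):

```python
# Generate an HTML report with distributions, % unique / null values, etc.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("beers.csv")  # placeholder: any DataFrame with the model's output
ProfileReport(df, title="beers profile").to_file("beers_profile.html")
```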

For SQL-centric workflows with/without dbt, we built Data Diff. The Data Diff Cloud also provides interactive profiling and slicing in the UI, as well as a turnkey integration with your code repo, CI, and dbt.

For example, let’s suppose we’re running a craft beer online store and need to modify bitterness categorization for our beers:

How will this dbt model change if we merge the new code?

While data diffs can be run ad-hoc, in the spirit of improving the workflow, it’s best to have them automatically run for every pull request/commit:

In this example, data diff compares production and staging (built with new code) datasets and tells us that the total number of rows, primary keys, and schema are the same, and 1 column has been modified (as expected), affecting 4.2% of all values in the model.

We can further drill down into the diff to see how our change to the categorization logic affected the column:

Or see the changes on row-level for individual beers:

E.g., for BEER_ID 1802, BITTERNESS will change from “Malty” to “Extra Malty”.
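
Conceptually, the row-level part of such a diff boils down to joining the production table with the PR’s staging build on the primary key and keeping the rows whose values differ. A minimal sketch under that assumption – the staging schema name and the warehouse connection are placeholders, and this is not how Datafold computes diffs at scale:

```python
# Conceptual row-level data diff between production and a PR's staging build.
# connect() stands in for your warehouse driver; schema names are illustrative.
DIFF_SQL = """
select
    coalesce(prod.beer_id, staging.beer_id) as beer_id,
    prod.bitterness    as prod_bitterness,
    staging.bitterness as staging_bitterness
from analytics.beers as prod
full outer join analytics_pr_123.beers as staging
    on prod.beer_id = staging.beer_id
where prod.bitterness is distinct from staging.bitterness
"""

with connect() as conn:  # e.g. snowflake.connector.connect(...) -- placeholder
    cur = conn.cursor()
    cur.execute(DIFF_SQL)
    for beer_id, before, after in cur.fetchall():
        print(f"BEER_ID {beer_id}: {before} -> {after}")  # e.g. 1802: Malty -> Extra Malty
```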

This is cool, but how do we know the change in categorization won’t break anything downstream? Impact analysis with lineage! What if we combine the two tools and compute data diffs for the models downstream from the one we modified?

Wait, changing the BITTERNESS column in the BEERS table will cause PROMO_DELIVERIES to have fewer rows. What? 🤯

Turns out the Marketing team launched a campaign to deliver samples of beers based on the bitterness profiles people like. Our change to the categorization logic threw this segmentation completely off, putting malty beer aficionados at risk of receiving some triple-hop-IPA stuff 😬

But hey – it’s all cool because we discovered this in the pull request before the change was merged! 😅

#AutomateEverything

Data pipelines don’t usually break when we anxiously expect them to. But they blow up spectacularly when we do “just a quick hotfix”.

If you find these kinds of checks useful, the only way to make them effective at preventing breakages is to run them for every single change, no matter how small, no matter how senior the person making it.

Fortunately, we can leverage CI/CD concepts and tools to achieve complete automation of proactive data quality checks, including building staging, running lineage and data diff, and publishing the results back to the users.

Putting it all together:

Brilliance as a side effect

Eliminating breaking changes with proactive testing doesn’t just improve the data quality. It significantly levels up the Data team and everyone working with data:

Automated data testing shortens the time we spend on QA.

As a result:

  • We ship with higher confidence
  • We ship faster and more often
  • We spend less time on QA and firefighting
  • We spend more time doing great engineering
  • We shine 💎

It's no longer enough to talk about building data systems; we have to build resilient data systems that don't break. All it takes for any Data team to get on top of proactive data quality is adopting three simple principles and the proper tooling. Datafold can help you with the tooling.

Anyone can build data systems that don’t break. All it takes is a few solid tools and the willingness to change the status quo. I can help you with the tools.

Thanks to Emilie Schario, Matt David, and Will Sweet for reading drafts of this.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
