What to Look for with Data Diffs

[Guest blog by Sarah Krasnik, Lead Data Engineer at Perpay]

The best part about my Saturday morning is usually the cappuccino I finally have time to enjoy.

However, that time disappears if I get a text about a bug in a query already in production. Wouldn’t it be nice if someone discovered the issue before it made its way into highly exposed dashboards?

Let’s consider an alternative.

An analytics engineer implements a query change to address some new metric definition and opens a pull request for the new logic. The opportunity is right in this step! Thoroughly QAing data before it gets into production results directly in enjoying that cappuccino.

However, “thorough QA” is an amorphous task. A dataset could contain hundreds of columns and billions of records, so a holistic and programmatic approach is the only path forward.

While stakeholders accept the current version of the dataset in production, it's vital to compare any proposed changes existing in a test or staging environment to this source of truth. Putting two cross-environment datasets alongside each other to highlight differences between them is what I’ll refer to as “Data Diff”.

How to prevent surprises in production dashboards

In the data quality world, surprises are never a good thing.

That’s why teams develop and test queries in a test (or better yet, staging) environment. Instead of testing in production, see the impact of changes to the dataset before deploying to production.

In software engineering, code reviews are only part of the process of getting code into production. Good engineering teams will view the result of a code change to ensure the impact to the application is the intended impact.

In the analytics world, reviewing the impact on the datasets is also just as important as reviewing code. The output dataset can be queried, sliced, and diced against assumptions by several peers. Generating a Data Diff before code actually gets to production creates a space to find bugs before they’re exposed to stakeholders. Understanding how dashboards will change from the get-go allows the analytics team to communicate downstream so no one even needs to ask “why did yesterday’s numbers change from when I looked this morning?”.

Let’s dive into what to actually look for in a Data Diff. I’ll be optimistic and consider the Data Diff a comparison between production and staging (ie: production-like data that no one actually looks at) environments. However, a comparison between production and test environments would yield similar benefits.

Audit type 1: Identify the total number of rows and primary keys changed

Consider one column to represent the metric change after the engineering team deploys the product update to the user funnel.

In the Data Diff, identify the total number of rows between the production and staging datasets where the column in question has changed. If the new feature isn’t yet launched, in theory, no existing row values should change. If they do, the new logic implementation might be altering historical data unintentionally.

Furthermore, looking at primary key differences between datasets could shed light on buggy SQL joins. Are there new user IDs in the staging dataset that aren’t in production? Potentially, data changed since the query was last run in production, which would be completely reasonable. However, the staging dataset could include an outer join when it shouldn’t.

Takeaway: Compare the total number of rows changed between datasets to make sure that the impact size is reasonable considering the business context of the implemented change.

Audit type 2: Analyze value distributions across environments

The total number of rows changed doesn’t tell the whole story. If fixing a bug, consider the bug to be the way the number of actions taken by a user is calculated. 

For example, in production, the value is null in places where it shouldn’t be. In the staging dataset, good news - the expected number of rows has changed! Diving in a bit deeper, the value for the column in question in those rows went to 1 instead of the null value that exists in production. Even though the correct number of rows changed, the values are not guaranteed to be correct.

In the illustrated example, the new values are all suspiciously the same. A double check on whether the new number is correct will ensure that the calculation is the correct one.

Takeaway: Value distributions add a layer of context to the volume of changes, which provide an opportunity to sanity check against business context.

Audit type 3: Run ad-hoc business logic checks

I’m all for automating everything and anything, especially when it comes to QA. Analytics engineers bring unique value in their ability to put data into business context, which should be leveraged while analyzing Data Diffs.

Each change will have unique circumstances around it. The new user funnel about to be deployed might only be exposed to users that meet certain criteria. With that context, the metric in question shouldn’t change for anyone that doesn't meet the requirements for the new flow.

If QAing pre-deployment of the product change, this won’t show up in the Data Diff. In the case when engineering deployed the new feature yesterday and analytics is playing catch up (no shame, I’ve certainly been there), only rows for users meeting the particular criteria should show up in the Data Diff.

In an analyst’s position, querying the Data Diff for all users not meeting the criteria could expose one of two things: either, the SQL logic doesn’t properly account for the special subset of users; or, the feature implemented by engineering isn’t actually filtering for the subset properly. Considering the latter is a possibility, Data Diffs are an essential part of data quality and cross-team QA.

Takeaway: QA should reflect the unique business context of a proposed change to ensure data makes sense to stakeholders.

More Analysis and Datafold

I’ve outlined the baseline audits that I do for any pull request - there are so many ways to actually go about doing a Data Diff. No matter the approach, generating and analyzing a Data Diff before merging code proactively improves data quality (thus resulting in more cappuccinos).

Detailed pull request reviews by peers are invaluable. Someone else on your team may have context from another part of the organization that you simply don’t.

When auditing, any of the checks can certainly be written in SQL. Directly comparing and joining datasets across environments may come with complexity, especially if environments are separated across databases.

Datafold has a mission of making proactive data quality easy for teams. The Data Diff feature exposes high-level metrics as well as row-level comparisons across datasets. Additionally, a holistic view of data lineage can expose changes to downstream assets that are unexpected.

Identify primary keys changed between datasets

Identify value distributions, which can be analyzed in business context

Compare value changes at the most discrete level

Interested in seeing Datafold in action? Book a demo here.

Have any other approaches to Data Diffs? I’m happy to chat on Twitter or LinkedIn.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.

Datafold is the fastest way to test dbt code changes