Datafold catches unintended changes to immutable data

As data engineers, we talk a lot about what it takes to optimize our complex data pipelines to efficiently process and transform data at scale. We often focus on improving performance and streamlining workflows to create data pipelines that enable our colleagues in analytics, operations, and strategy to work better together. 

But the issue of data quality often gets overlooked because it’s a messy problem. How do you validate, at scale, the accuracy of data produced across all your data pipelines? Popular methods include dbt tests, unit tests, and manual SQL queries to verify ground truth, or, perhaps more commonly than we like to admit, just shipping changes to production without any validation because it’s too hard and time-consuming to figure out.

You can set up your pipeline meticulously to ensure everything is extracted, transformed, and loaded efficiently, but if you don’t have a way to compare the two versions of your data during deployment, you have no way of knowing whether your immutable data changed when it shouldn’t have.

It doesn’t have to be this way, and we have a pretty simple solution for proactively catching unintended changes to immutable data before anything gets deployed to production.

When immutable data changes

First, a word on immutable data. What is it? It’s data that, once created, should not be modified because it’s expected to remain constant over time:

  • Names
  • Birthdays
  • Event timestamps
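
To keep the later examples concrete, here’s a minimal sketch of a hypothetical dim_users table (the table and column names are ours, purely for illustration) in which the name, birthdate, and signup timestamp should never change once a row is written:

```sql
-- Hypothetical dimension table used in the sketches below.
-- full_name, birth_date, and created_at are treated as immutable:
-- once a row exists, these values should never change.
create table dim_users (
    user_id    bigint primary key,
    full_name  text      not null,  -- immutable
    birth_date date      not null,  -- immutable
    created_at timestamp not null,  -- immutable event timestamp
    plan       text,                -- mutable attribute
    updated_at timestamp            -- mutable bookkeeping column
);
```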

But immutable data can and does change, and that’s not as rare as you might imagine. There are four main reasons why immutable data can inadvertently change:

1. Coding errors: An error in an automated data processing script may overwrite existing timestamps or names with incorrect values, leading to unexpected changes in the data.

2. Data integration failure: When data from multiple sources is merged or synchronized, discrepancies can arise. For example, if conflicting birthdate information is received from different data sources, it could result in an incorrect update to the birthdate field.

3. Data transformation errors: The process of data cleaning or normalization can inadvertently modify immutable data. For instance, if a data transformation rule incorrectly updates name formatting or date representations, it could result in unintended changes to the immutable data (a minimal sketch of this failure mode follows this list).

4. Data migration errors: During data migration, data is transferred between systems or platforms, which increases the likelihood of data integrity issues. We’ve seen how inaccurate mapping or transformation of data during migration can result in unintended changes to immutable data fields.
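
To make this concrete, here’s a minimal sketch, continuing the hypothetical dim_users example from above, of a transformation that rebuilds the table and quietly clobbers immutable columns without raising a single error:

```sql
-- A rebuild of dim_users that looks harmless but silently mutates
-- immutable data: name capitalization is "normalized" and created_at
-- is recomputed from the load time instead of carried over.
create table dim_users_rebuilt as
select
    u.user_id,
    initcap(u.full_name)  as full_name,   -- rewrites existing names (e.g. McDonald -> Mcdonald)
    u.birth_date,
    current_timestamp     as created_at,  -- BUG: overwrites the original signup timestamp
    u.plan,
    current_timestamp     as updated_at
from dim_users u;
```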

Typical strategies don’t protect your data quality

So there are plenty of opportunities for immutable data to start changing. What are your options for catching this before bad data hits production?

  • dbt tests. What they do: automated tests that are easy to add to your dbt project to validate your data transformations, with assertions about data schema, data types, row counts, and other criteria the transformed data must meet. Where they fail: dbt tests don’t compare two versions of the data, so they can’t catch unintended changes to immutable data.
  • Data “unit tests”. What they do: similar to dbt tests, these try to uncover “known unknowns” such as the correctness of calculations, the accuracy of aggregations, or the integrity of certain data fields. Where they fail: they’re great for validating data transformations, but they also don’t compare two versions of the data, so they aren’t designed to detect changes in immutable data.
  • Custom SQL tests. What they do: bespoke SQL queries designed to catch your edge cases against predefined rules. Where they fail: they’re best suited to catching anomalies you anticipated, not to finding changes in immutable data fields, unless you add extra logic or a data diffing mechanism (see the sketch below).
  • “Just ship it”. What they do: deploy data changes to production without thorough validation or testing. Where they fail: this one’s pretty self-explanatory! As data pipelines increasingly serve critical production services like ML models, not knowing whether bad code is impacting your data can have serious repercussions, such as compliance violations.
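
To illustrate why predefined assertions miss this, here’s a hypothetical custom SQL test in the dbt singular-test style (a query that passes when it returns zero rows). It enforces only the rules its author thought to write down, so the buggy rebuild sketched earlier sails through it: the rewritten names and refreshed timestamps are still perfectly plausible values.

```sql
-- Hypothetical custom assertion in the dbt singular-test style:
-- the test passes when this query returns no rows. The rebuilt table
-- from the earlier sketch satisfies every rule below, so its silently
-- changed names and timestamps go unnoticed.
select *
from dim_users_rebuilt
where birth_date is null
   or birth_date > current_date           -- no birthdays in the future
   or birth_date < date '1900-01-01'      -- no implausibly old birthdays
   or created_at > current_timestamp;     -- no event timestamps from the future
```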

You may have guessed it already: there’s one clear reason these approaches all fail to guarantee data quality. None of them compares the two versions of the data to detect unintended changes.

Value-level data diffs guarantee peace of mind

You need value-level data diffs, which compare individual data points or records between two versions of a dataset to find differences. This is critical for detecting unintended changes to immutable data because it allows a granular examination of the data at its most fundamental level.

Without a diff, the only changes you can check for are the ones you’ve predefined and scripted in tests upfront, and it’s just not possible for data engineers to provide complete custom test coverage for the hundreds or thousands of models they’re responsible for.

Even if there were only one data model, it would not be possible to write a test for every value.

Through a value-level comparison, data engineers can pinpoint exactly which immutable data fields have been altered, allowing for targeted investigation and remediation. You’ll be able to tell whether it was a single-character change in a name field or a minute adjustment in a timestamp: changes that might be small, but which could have significant implications for data integrity.

Also, unlike other testing methods that focus on validating data transformations or integrity rules, value-level diffs compare the entire dataset or selected subsets across two versions. This comprehensive comparison ensures that no changes to immutable data go unnoticed, regardless of the scale or complexity of the dataset.
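
To make the idea tangible, here’s a rough, hand-rolled sketch of a value-level diff over the hypothetical tables from earlier: join the current and rebuilt versions on the primary key and return every row where an immutable column differs. (This is the kind of comparison Datafold automates; the query below is just ours, for illustration.)

```sql
-- Hand-rolled value-level diff: one row per primary key whose
-- immutable columns differ between the two versions of the table.
select
    coalesce(cur.user_id, reb.user_id) as user_id,
    cur.full_name  as old_full_name,   reb.full_name  as new_full_name,
    cur.birth_date as old_birth_date,  reb.birth_date as new_birth_date,
    cur.created_at as old_created_at,  reb.created_at as new_created_at
from dim_users cur
full outer join dim_users_rebuilt reb
    on cur.user_id = reb.user_id
where cur.user_id is null                                 -- row only exists in the new version
   or reb.user_id is null                                 -- row only exists in the old version
   or cur.full_name  is distinct from reb.full_name
   or cur.birth_date is distinct from reb.birth_date
   or cur.created_at is distinct from reb.created_at;
```

Every row this query returns is a changed value a reviewer should see. The hard part is running a comparison like this automatically, for every model, on every pull request, which is where the next section comes in.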

Automating value-level data diffs with Datafold Cloud

Hopefully, we’ve now convinced you that value-level data diffs serve an important need that no existing testing tools can meet. 

Setting up value-level diffs to run during each deployment cycle can be tricky, unless you automate the process. Datafold Cloud makes it easy to do so with its integration into your development workflow. 

Whenever you open a new pull request with some code changes, our Datafold bot comments with diff summary statistics and value-level diffs that you and any reviewers can glance at:

Our friendly Datafold bot shares key insights on data changes with each pull request

Similarly, the highlighted values make it easy to see how the data changed at a very granular level:

The Datafold Cloud App makes it easy to see where the values have changed, down to the row level

If you’re curious to learn more about how Datafold’s data diffing in CI can help your team prevent shipping code that breaks production data, here’s the short version:

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
