To Diff or not to Diff: The cloud migration testing question

To diff, or not to diff: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous data inconsistencies,
Or to take arms against a sea of discrepancies
And by asserting, end them. To migrate: to test
No more; and by a sleep to say we end
The heart-ache and the thousand downstream consumers
That legacy systems are heir to: 'tis a consummation
Devoutly to be wish'd. To migrate, to diff;
To diff: perchance to dream: ay, there's the rub;
For in that diff, a test result what dreams may come,
When we have shuffled off this hosted instance,
Must give us pause: there's the data context
That makes cloud migrations so prolong life.

When Shakespeare first wrote about cloud migration testing, he knew there was more than one way to ensure parity between origin and destination. In his day, however, the technology wasn’t quite ready.

Thankfully, a mere four centuries later, we have everything we need to successfully test and complete a cloud data migration. Is there a clear-cut best way to do it? Well, that depends on the answers to a few questions:

  • Are you replacing existing data pipelines?
  • What types of queries are needed to extract data from the source?
  • How different are your legacy and next-gen destinations?
  • How important is long-term data quality for the datasets?

Let us not bear the whips and scorns of time. We shall proceed henceforth into the unknown—a small hamlet heretofore known as cloud migration testing. (Okay, that’s enough of that. I promise.) We’re going to cover three types of tests you can use for your cloud migration testing: assertion-based, summary statistics, and data diffing (obvs).

When to use assertion-based tests for cloud migrations

There’s a time and a place for everything, including assertion-based testing. It’s the simplest and most "naïve" form of testing, meaning it requires the least amount of sophistication and complex technical work.

Assertion-based testing in cloud data migrations is a matter of defining a suite of tests and conditions to validate your assumptions about the origin. Or, put more simply: make sure everything you think you know about your data destination is actually correct. These types of tests require:

  1. A data team to learn about the origin data, and then write tests to confirm that knowledge
  2. Sufficient time to do that learning, testing, and validation
  3. Tolerance for mistakes, missed use cases, and unexpected anomalies

Assertions are typically written in SQL or Python against the destination data and check for things like primary key and foreign key counts.

Assertion-based SQL test example

Assume you're migrating your user data from a legacy system to a new cloud-based database. One of your primary concerns is ensuring that no user records are lost or duplicated during the migration. To validate this, you implement an assertion-based test to compare the count of unique user IDs (primary keys) in both the source and destination databases.

-- Assertion test to validate user record counts match between source and destination databases

-- Count of unique user IDs in the source database
SELECT COUNT(DISTINCT user_id) AS source_count
FROM legacy_system.users;

-- Count of unique user IDs in the destination database
SELECT COUNT(DISTINCT user_id) AS destination_count
FROM cloud_database.users;

This simplified test runs these SQL queries separately in your legacy and new cloud databases to fetch the counts of unique user IDs. Then, you assert that `source_count` should equal `destination_count`. If the assertion fails, it indicates a discrepancy in the user records between the two databases, prompting further investigation to identify and correct the issue before proceeding with the migration.’

Again, this is a very naive approach to testing the parity of two tables. With results from tests like these, are you ever to say, with confidence, to your stakeholder, "The data remains unchanged between our old database and our new database" ? Probably not. Assertion-based tests against the source and destination tables are effective at getting a pulse check on core metrics (row counts, PKs, FKs), but lack the detailedness to capture edge cases and changing values. Not to mention they require considerable manual time writing and checking test results, slowing down your migration efforts.

When to use summary statistics tables for cloud migrations

There will be times when you need a bit more sophistication than assertion-based tests. Enter summary statistics tables, which are probably just what you’d expect: generative tables that allow you to see and compare summary statistics between your legacy data source and cloud-based data destination.

Here’s what a summary statistics table could look like:

MetricSource DataDestination DataNotes
Total Records1,000,000999,990Consistent record counts => BAD
Unique Customers250,000250,000Check for duplicate or missing customers => NONE
Null Values in Key Fields00Simple data quality check => GOOD
Average Transaction Value$150$150Check for financial data accuracy => GOOD
Records with Missing Values5,0004,990Check for data transformation issues => UH OH
Date Range of Transactions2015-01-01 to 2023-12-312015-01-01 to 2023-12-30Check for data completeness => HMM…

As you can see, there are a couple of data issues. It looks like we’re missing a day’s worth of data, which might account for the 10 records that are missing in the destination. Looks like there’s nothing wrong with the customer data, but there’s something else happening.

This is good information to have, but it’s not very specific, is it? You can’t just hand this to a hypothetical junior data engineer and say, "Go fix this before you go to lunch." You don’t really know where the 10 records are or why they’re missing.

For that, you need an even more sophisticated method: the data diff.

When to use data diffs for cloud migrations

Okay, it’s time to get to the cold, hard truth: data diffs are the best way to test. They tell you exactly which rows are missing or problematic and from what tables. If you’re missing three rows in your replication, a data diff report will show you which three rows are missing and from what tables. 

The only reason people don’t use data diffs more often is because: a) they think it’s more work than necessary (or technically challenging across databases), and b) they don’t know about Datafold’s migration solution.

Data diffs are so good, the hypothetical junior data engineer might actually be able to solve the data discrepancy issues before lunch. Not only are they incredibly detailed and helpful, but they can be fully automated. 

Let’s take the table above as an example of why summary statistics might fail you during a migration, and where a data diff might save you. In the summary statistics table above, the average transaction value ($150) is the same across both table versions. However, upon looking at the data diff, you discover that individual value-levels changed considerably: Order 1’s transaction value went from 200→100 and Order 2’s transaction value went from 100→200, and so on and so on. Therefore, the average transaction value did not change, but the individual row values changed considerably. This is the kind of information an assertion-based test or summary statistic table will never be able to show you, but will always be caught with a data diff from Datafold.

Datafold gives you data diffs—and confidence in your migrations

Perhaps you’ve never seen a data diff in action. Data diffing is like having a data migration wizard at your side. If you need to automatically identify parity of tables across database objects, all the way down to the row- and value-level, this is how you get it.

Here’s what a data diff between a MySQL and Snowflake table looks like Datafold:

Value-level differences of a DIM_ORGS table across MySQL and Snowflake

Now, your team knows exactly how your source and target versions of a table differ, all the way down to the row-level. In addition, Datafold’s efficient data diffing algorithm works at scale, so you can diff tables at scale (millions/billions of rows) in seconds/minutes. Data diffing allows your team to drastically reduce the time spent on validating a migration, as well as increase the confidence your team has in the migrated data. Sounds like a win-win, right?

The best way to test for data migrations: data diffs

Historians may disagree as to Shakespeare’s intent with his famous "to diff, or not to diff" soliloquy, but one thing is clear: the most important thing to test for during a migration is that data is unchanged between legacy and new systems. There are multiple ways of doing these tests, but only data diffs will give you the specifics you need to truly ensure parity.

Data diffs illuminate the path to verifying data integrity and empower teams to present concrete evidence of data parity to stakeholders. While assertion-based tests and summary statistics offer a broad view of data health, they often lack the granularity necessary to identify and rectify specific discrepancies at scale. This is where data diffs shine, providing a meticulous account of data discrepancies all the way down to the value-level.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.

Datafold is the fastest way to test dbt code changes