How to diff your data during a data migration

We know—data migrations are hard. They’re lengthy, time-consuming, and repetitive projects that, for some organizations, have no end in sight 🔮. But, they’re a necessary step to guarantee the scalability, governability, and maintainability of your data stack.

In this blog post, we’re going to unpack:

  • How data diffing and Datafold Cloud enable data teams to migrate their data stacks at an accelerated pace and with greater confidence.
  • How to actually (!!) diff your data during a migration using Datafold Cloud.

Why you should data diff during a data migration

Backing up a bit, how does data diffing—a value-level comparison between two tables to identify critical changes and guarantee data quality—accelerate a migration for an organization? 

‍

  1. Automatically validate parity between databases: When migrating between data systems, an essential question that must be answered is: “Does the data match between legacy and new systems?” With data diff, you can answer this question rigorously, with absolute confidence. This is instead of running manual queries or calculating checksums which, by definition, do not check every value–issues can slip through the cracks.
  2. Receive stakeholder sign-off: As anyone who has completed a migration knows all too well, a migration is only really complete once you get that final, glorious, “LGTM 👍,” from your stakeholders. By validating the migration with data diffs, you can provide clear, convincing evidence that data matches exactly between the two systems. Data diffing is the fastest way to build trust in the new system and establish consensus that the migration is complete, so that the old system can be switched off.
  3. Accelerate #1 and #2 with cloud features and enterprise scale: A migration is not usually just a few tables–it’s typically hundreds or even thousands of tables with complex structures. By accelerating your data diffing with scheduling, automation, hosting, a smooth UI, and support for massive scale, you can save your team countless hours validating migration efforts, and ultimately receive stakeholder sign-off.

Ok, I’ve convinced you to data diff. Amazing! But what tool should you use to actually diff your data during a migration?

How to diff your data

Datafold superpowers migrations with data diffing through Datafold Cloud, which is built for automation, scale, hosting, and expanded database and data type support, storage of results in Datafold Cloud, and a smooth visual UI.

We’ll walk through how to data diff in both, and highlight how Datafold Cloud provides an enterprise-ready solution.

Datafold Cloud is the data diffing solutions for teams. With Datafold Cloud, you can run data diffs at any size and cadence using the Datafold Cloud UI, scheduler, or API, so your data diffs scale with your migration.

Datafold Cloud’s automation, scheduling, detailed value-level diffs, optimization efforts, support for long-running diffs at massive scale, expanded data types for primary keys, diff history, and programmatic API are how teams unlock the power, stability, and finesse needed to successfully deliver a migration.

Running a data diff across tables in Datafold Cloud UI or using our REST API for programmatic orchestration is incredibly simple, and you can begin today, with no need to build orchestration or infrastructure.

In the Cloud UI, simply hit the “New Data Diff” button in the right-hand corner of the UI, input the Data Source (data warehouse), the table, and any filters you want set for each database object.

Then, you’ll need to let Datafold know what the primary keys are for the tables you're diffing. You can also enable sampling if you don’t want to run the diff across the entire table. Datafold will then compute your data diff for you—even if your diff is billions of rows.

While the visual UI is easy and accessible to stakeholders of varying technical skills, for diffing at scale, leverage the Datafold Cloud API and Datafold’s out-of-the-box scheduler to automatically test your migrated objects as you migrate.

With Datafold Cloud, during your migration:

  • You can investigate at the row and value level to understand why certain rows might have disappeared or changed during a migration. This also allows you to easily filter on specific primary keys or column values that require further investigation.
  • Data diff results are stored in the Datafold Cloud application. This unlocks a centralized diff history to understand how your migration is progressing over time. You gain incredible insight and access into those important changes that might go unnoticed during a large-scale migration. 
  • Finally, any data diff can be easily shared by simply sending a link to diff results in the Datafold application to a key stakeholder.

Information stored in Datafold Cloud can also be queried programmatically using the Datafold Cloud REST API.

Interested in learning more about how Datafold Cloud can help accelerate your migration timeline? Talk to our team of data migration experts today.

Conclusion

Data diffing is a powerful method to guarantee your migration is correct—both to your internal team and your important stakeholders. 

With Datafold, you can:

  • Use the Datafold Cloud UI, API, and scheduler to validate the parity of tables (of any size and volume) within and between databases.
  • Support a wide range of primary key types (not only numeric!) and a wide range of database connectors.
  • Leverage cloud infrastructure so that even long-running diff jobs on the largest tables complete successfully, every time, without the need to internally host complex jobs.
  • Display and persist results in the Datafold Cloud UI to share interactive, clear results with stakeholders to gain buy-in and declare your data migration complete.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.

Datafold is the fastest way to test dbt code changes