How to diff your data during a data migration
We know—data migrations are hard. They’re lengthy, time-consuming, and repetitive projects that, for some organizations, have no end in sight 🔮. But, they’re a necessary step to guarantee the scalability, governability, and maintainability of your data stack.
In this blog post, we’re going to unpack:
- How data diffing and Datafold Cloud enable data teams to migrate their data stacks at an accelerated pace and with greater confidence.
- How to actually (!!) diff your data during a migration using Datafold Cloud.
Why you should data diff during a data migration
Backing up a bit, how does data diffing—a value-level comparison between two tables to identify critical changes and guarantee data quality—accelerate a migration for an organization?
- Automatically validate parity between databases: When migrating between data systems, an essential question that must be answered is: “Does the data match between legacy and new systems?” With data diff, you can answer this question rigorously, with absolute confidence. This is instead of running manual queries or calculating checksums which, by definition, do not check every value–issues can slip through the cracks.
- Receive stakeholder sign-off: As anyone who has completed a migration knows all too well, a migration is only really complete once you get that final, glorious, “LGTM 👍,” from your stakeholders. By validating the migration with data diffs, you can provide clear, convincing evidence that data matches exactly between the two systems. Data diffing is the fastest way to build trust in the new system and establish consensus that the migration is complete, so that the old system can be switched off.
- Accelerate #1 and #2 with cloud features and enterprise scale: A migration is not usually just a few tables–it’s typically hundreds or even thousands of tables with complex structures. By accelerating your data diffing with scheduling, automation, hosting, a smooth UI, and support for massive scale, you can save your team countless hours validating migration efforts, and ultimately receive stakeholder sign-off.
Ok, I’ve convinced you to data diff. Amazing! But what tool should you use to actually diff your data during a migration?
How to diff your data
Datafold superpowers migrations with data diffing through two interconnected tools:
- data-diff, an open-source python package built and maintained by Datafold and the user community.
- Datafold Cloud, which is built for automation, scale, hosting, and expanded database and data type support, storage of results in Datafold Cloud, and a smooth visual UI.
We’ll walk through how to data diff in both, and highlight how Datafold Cloud provides an enterprise-ready solution that leverages the core capabilities of open source data-diff.
data-diff, the open source Python package
data-diff is the open source offering from Datafold. It’s a powerful utility, maintained by Datafold and the user community, that diffs tables within and between databases.
To diff a traffic_speeds table between PostgreSQL and Snowflake using open source data-diff, you can run this command in the CLI:
The returned results of a data diff between traffic_speeds in PostgreSQL vs Snowflake would look like this:
The result is a high-level overview of how these two tables in your different databases differ: the number of rows, exclusive rows, and an overall difference score.
Open source data-diff also can be configured to display value-level differences, with the results displayed as a formatted list of tuples. This can be incredibly powerful for connecting data-diff to external systems. (In fact, this is how Datafold Cloud works – supercharging data-diff into an enterprise-ready solution.)
Here, data-diff compares the same two tables, with value-level output:
Hopefully, you are clearly seeing the power–as well some limitations–of open source data-diff. With data-diff, you can see the summary and value-level differences between databases at unprecedented scale and speed.
However, the output is not exactly human-readable. And without building a system around it, it’s only usable for one-off jobs. data-diff is designed to plug into other tools that enable enterprise scalability.
Open source data-diff is great for pulse checks, one-off investigations, prototyping, and, if you’re up for it, significant infrastructure work to achieve scale and automation.
Datafold Cloud is the data diffing solutions for teams. With Datafold Cloud, you can run data diffs at any size and cadence using the Datafold Cloud UI, scheduler, or API, so your data diffs scale with your migration.
Datafold Cloud’s automation, scheduling, detailed value-level diffs, optimization efforts, support for long-running diffs at massive scale, expanded data types for primary keys, diff history, and programmatic API are how teams unlock the power, stability, and finesse needed to successfully deliver a migration.
Running a data diff across tables in Datafold Cloud UI or using our REST API for programmatic orchestration is incredibly simple, and you can begin today, with no need to build orchestration or infrastructure.
In the Cloud UI, simply hit the “New Data Diff” button in the right-hand corner of the UI, input the Data Source (data warehouse), the table, and any filters you want set for each database object.
Then, you’ll need to let Datafold know what the primary keys are. You can enable sampling of the diff if you don’t want to run the diff across the entire table. Datafold will then compute your data diff for you—even if your diff is billions of rows.
While the visual UI is easy and accessible to stakeholders of varying technical skills, for diffing at scale, leverage the Datafold Cloud API and Datafold’s out-of-the-box scheduler to automatically test your migrated objects as you migrate.
Similar to the open source data-diff output, you’ll receive a high-level diff summary in Datafold Cloud:
Unlike open source data-diff, which displays value-level results in a machine-readable interface, with Datafold Cloud, you can explore Datafold Cloud’s rich UI, which includes the individual value-level differences between two tables in different databases, as well as visual distribution differences.
Why is this so valuable? With Datafold Cloud’s UI:
- You can investigate at the row and value level to understand why certain rows might have disappeared or changed during a migration. This also allows you to easily filter on specific primary keys or column values that require further investigation.
- Data diff results are stored in the Datafold Cloud application. This unlocks a centralized diff history to understand how your migration is progressing over time. You gain incredible insight and access into those important changes that might go unnoticed during a large-scale migration.
- Finally, any data diff can be easily shared by simply sending a link to diff results in the Datafold application to a key stakeholder.
Information stored in Datafold Cloud can also be queried programmatically using the Datafold Cloud REST API.
Interested in learning more about how Datafold Cloud can help accelerate your migration timeline? Talk to our team of data migration experts today.
Data diffing is a powerful method to guarantee your migration is correct—both to your internal team and your important stakeholders.
While open source data-diff is a powerful utility engine for diffing, teams diffing at enterprise scale and with the highest standards of data integrity and operational efficiency choose Datafold Cloud.
With Datafold Cloud, you can:
- Use the Datafold Cloud UI, API, and scheduler to validate the parity of tables (of any size and volume) within and between databases.
- Support a wide range of primary key types–not only numeric!–and a wide range of database connectors.
- Leverage cloud infrastructure so that even long-running diff jobs on the largest tables complete successfully, every time, without the need to internally host complex jobs.
- Display and persist results in the Datafold Cloud UI to share interactive, clear results with stakeholders to gain buy-in and declare your data migration complete.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.