Data migrations typically run over budget and schedule because of these three challenges, which underscore the tight margins for error during each stage of the migration process.
What are common data migration challenges?
1. Validating parity between legacy and new systems
Validating one table with another is hard enough; trying to show that hundreds or thousands of tables match up to the very last value, is possibly one of the few things in tech that might be considered Sisyphean. Things get exponentially more complicated if you’re migrating between different databases, as you’ll need to also accurately translate SQL dialects to ensure that any data transformations are ported over with the same logic and business rules.
2. Freezing or reducing new analytics work
Data migrations displace existing or new analytics projects by taking up valuable time and resources. Migrations are typically months or years long projects, and the opportunity cost of resources used this way means that fewer people are available for new projects. This can create conflict between competing priorities, and is one reason why companies choose to engage external consultants to take on the migration work instead.
3. Achieving stakeholder sign-off
A migration is only finished when all your stakeholders decide it’s done. This goes beyond validating parity between your systems, as you’ll need to prove that the data remained unchanged despite the complex transformations. If the data did change, which isn’t necessarily a bad thing, you’ll also need to convince them why these changes were acceptable. If the data pipeline lacks visibility, it’ll be hard to investigate why the data changed, and few outside of the core migration team will feel confident about signing off on a data migration.
How they are typically solved
A good data migration involves plenty of prior planning around data mapping, transformation logic, validation procedures, and contingency measures. A great data migration involves four best practices:
1. Prioritizing asset migration
Once you have an exhaustive list of all the data assets you have, you’ll need to prioritize which ones get migrated first, and which ones won’t make the cut at all. Column-level lineage is a popular tool for helping identify critical assets (e.g. those with lots of dependencies) and deprecated ones (e.g. assets that aren’t being used anymore or are dead-ends in your DAG). Then, you’ll want to move all your data consumption endpoints before any production pipelines to the new system. Why? This is a nifty tip that does three things: Reduces the load on the old system, creates more leeway in the migration as users will start to query from the new system, and you can start to validate how your migration is going sooner than later.
2. Lift and shift the data
There are two approaches to a data migration: Refactor and remodel or lift-and-shift data in its current state. While your approach will be determined by your specific business context, we have seen that the best approach for almost all cases is to simply lift-and-shift the data because it best preserves data quality. The task of migration is difficult enough that extending the scope to include comprehensive rearchitecting introduces error into an already fragile process. Instead, choose the simplest and most elegant approach, and move everything in the legacy database (minus deprecated assets) to the new one.
3. Document and distribute your strategy
Getting your migration strategy and assumptions onto paper makes sure that everyone involved has clarity and confidence about what’s necessary for success. Stakeholders often have concerns that should be voiced as soon as possible, and involving them early in the process reduces the risk of misalignment between the technical and business stakeholders. As a bonus, if your migration is successful, you’ll have a playbook for replicating the same outcome when you embark on another one.
4. Automate validation between legacy and new systems
Validating data parity between the databases is the biggest challenge in a data migration because most approaches remain stubbornly manual and tedious. Data practitioners often resort to various workarounds like: Exporting data samples into spreadsheets for line by line comparisons; writing custom SQL queries for each migrated asset; or just going on a hunch that everything is fine and ready for sign-off.
But an automated, proactive validation system is a more comprehensive and safer approach. What a tool like Datafold can do in minutes is always going to be better than an error-prone manual process. Because a value-level, row-by-row data diff of the two databases is the only way to know for sure whether the data’s been changed, and how its been changed.