Managing Zero-Downtime Migrations
How Emmanuel Ogunwede, a Data Engineer at Brooklyn Data, tackled entity resolution, schema drift, and zero-downtime requirements in an ETL-based migration.

For Emmanuel Ogunwede, successful data migrations are as much about understanding the business as they are about executing the technical details. “I think my biggest mistake in the early days—which is something a lot of junior engineers do—is diving into the technicals as quickly as possible,” he shares. “It’s very important to first understand the business process. It might sound boring or like you might not necessarily see the value there upfront, but over time, you find that a lot of projects fail because of a poor understanding of business requirements.”
That philosophy guided Emmanuel’s approach to a recent ETL-based migration for a live transactional database with zero-downtime requirements. In this interview, he shares how his team tackled schema drift, navigated the unexpected challenges of entity resolution, and made the trade-offs necessary to keep operations running smoothly.
Testing and validation
This migration was a complex, multi-step operation that required aligning hundreds of disparate data sources while maintaining data integrity in a live transactional database. The data, primarily semi-structured JSON files, was sourced from web platforms and had to be transformed into a unified, domain-specific format. Emmanuel’s team established a canonical schema to standardize the data, using Snowflake to map incoming column metadata and dbt models to automate transformations at scale.
Datafold: Were there any other transformations needed during the migration, aside from cleaning up the data?
Emmanuel: On the surface, if I were to give a TL;DR, we were just renaming columns and then doing the transformation to model the data around the domain. But in reality the technical bits were much more complex.
We were dealing with hundreds of sources, which could potentially grow into thousands, right? And each source had slightly different representations of the same data.
For example, one source might represent a full name as a single column, while another might have separate columns for first name, last name, and middle name. We had to union all of this data, so we first had to agree on what the canonical schema (our final schema) would look like. Every time we encountered data from a new source, we had to extract the column metadata and map it to the canonical schema. This metadata mapping was stored in Snowflake, and we used dbt to automate the process of continuously handling new data and mapping it to the canonical schema.
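To make that mapping step concrete, here is a minimal sketch of what one per-source dbt staging model could look like, assuming a hypothetical Source A that splits names into three columns. The table, column, and source names are illustrative, not the team's actual schema:

```sql
-- stg_source_a.sql (hypothetical): rename and reshape Source A's columns
-- into the agreed canonical schema before the cross-source union.
select
    id                as professional_id,
    trim(
        first_name
        || ' ' || coalesce(middle_name || ' ', '')
        || last_name
    )                 as full_name,        -- canonical schema keeps a single name column
    credential_code   as credential,
    loaded_at         as source_loaded_at,
    'source_a'        as source_name
from {{ source('raw', 'source_a_professionals') }}
```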
Datafold: On the testing side of things, what sort of testing strategies did you use to ensure the data quality was good and that the data was validated to match the source?
Emmanuel: There were two kinds of testing we did, at two different layers. We were leveraging an orchestrator, Prefect, to handle the data coming in, and the system was built to follow an event-driven architecture: as soon as data landed, Lambda would make an API call to Prefect. So the first thing we usually did was schema validation, because we were dealing with a large number of sources. This involved comparing the current schema structure for a particular source against what it looked like when we first saw it, or when we had last seen it.
For example, let’s say we’re seeing data from Source A for the first time: we’d have a metadata database where we would register it and say, "Oh, this is the first time we’re seeing Source A, what does the structure of the data look like?" Then, the next time we received data from Source A, we’d compare the current structure to what we’d previously seen. That way, we could tell if something had changed, if there was drift in the incoming columns or in the structure of the data.
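As a rough illustration of that drift check, a query along these lines could compare the columns just received against what was registered the first time the source was seen. The metadata table and schema names here are hypothetical:

```sql
-- Hypothetical drift check for Source A: compare incoming columns against
-- the structure registered when the source was first seen.
with registered as (
    select column_name, data_type
    from metadata.source_schema_registry
    where source_name = 'source_a'
),
incoming as (
    select column_name, data_type
    from information_schema.columns
    where table_schema = 'LANDING'
      and table_name   = 'SOURCE_A_LATEST'
)
select
    coalesce(r.column_name, i.column_name) as column_name,
    r.data_type                            as registered_type,
    i.data_type                            as incoming_type,
    case
        when i.column_name is null then 'column missing'
        when r.column_name is null then 'new column'
        else 'type changed'
    end                                    as drift
from registered r
full outer join incoming i
    on r.column_name = i.column_name
where r.column_name is null
   or i.column_name is null
   or r.data_type <> i.data_type;
```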
Another thing we did, which was very important for the client, was anomaly detection around the size of the data. There were thresholds on how much data they expected, whether at least a few million rows or a few thousand. Whenever there was a large difference, like the data exceeding a threshold or falling under it, we had alert systems in place.
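A volume check like that can be as simple as a scheduled query against load statistics; the tables and columns below are illustrative rather than the team's actual setup:

```sql
-- Hypothetical row-count anomaly check: any row returned triggers an alert.
select
    s.source_name,
    s.load_id,
    s.row_count,
    case
        when s.row_count < t.expected_min then 'below expected volume'
        else 'above expected volume'
    end as anomaly
from metadata.load_stats s
join metadata.source_thresholds t
    on s.source_name = t.source_name
where s.row_count < t.expected_min
   or s.row_count > t.expected_max;
```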
The second layer was the usual suspect—dbt. Referential integrity was very important here because we were dealing with a live database, and we didn’t want to crash it while pushing data. So we had to test for referential integrity, null values, and other constraints. Some columns in Postgres had custom types, so we used dbt tests to validate that the values we were pushing in matched what we had in Postgres.
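dbt's built-in generic tests (not_null, relationships, accepted_values) cover most of those constraints; anything bespoke can be a singular test, which is just a SQL file that returns the failing rows. A hypothetical example of the referential-integrity idea, with illustrative model names:

```sql
-- tests/credentials_reference_existing_professionals.sql (hypothetical)
-- The test fails if any credential about to be loaded points at a
-- professional_id that doesn't exist in the professionals model.
select c.credential_id
from {{ ref('int_credentials_to_load') }} c
left join {{ ref('int_professionals_to_load') }} p
    on c.professional_id = p.professional_id
where p.professional_id is null
```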
Ultimately, it was a combination of dbt and custom testing in the extraction layer.
Datafold: So with regards to tools, you’ve mentioned dbt tests and custom scripts. Was there anything else, like vendor data quality software, ETL tools, or anything else that you found was critical to your success?
Emmanuel: Something that was also very critical for the last mile—sending data from Snowflake into Postgres—was the custom user-defined functions we had in Snowflake. These were helping us with an interesting requirement: we had to format the updates as CDC-style messages. So, basically, we had to work out upfront what changed, what columns were affected, and all of those things, and then send that off as metadata.
We eventually ended up using a custom UDF in Snowflake to handle the formatting of the data in the specific format that the other consultants needed, so they could complete shipping the data into Postgres.
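The exact message format wasn't shared, but a Snowflake SQL UDF for that kind of CDC-style wrapping might look roughly like this; the function name, arguments, and field names are all illustrative:

```sql
-- Hypothetical Snowflake SQL UDF: wrap a changed row and its change metadata
-- into a single CDC-style message for the downstream load into Postgres.
create or replace function format_cdc_message(
    op              varchar,   -- 'insert', 'update', or 'delete'
    changed_columns array,     -- names of the columns whose values changed
    payload         variant    -- the row itself as a JSON object
)
returns variant
as
$$
    object_construct(
        'operation',       op,
        'changed_columns', changed_columns,
        'payload',         payload,
        'emitted_at',      current_timestamp()
    )
$$;
```

Each changed row could then be passed through such a function (for example, building the payload with object_construct(*)) to produce the change metadata the downstream team needed.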
Entity resolution challenges
Data migrations often involve reconciling data from multiple sources, each with its own unique structure and representation of the same entities. In Emmanuel’s case, this complexity was compounded by strict constraints on using external tools and the critical requirement for zero downtime. Below, Emmanuel shares how they tackled unexpected issues like entity resolution, schema drift, and the limitations of traditional data contracts.
Datafold: The custom testing required you to have very deep knowledge of the domain area and the specific types of data that you were trying to migrate, right? What were some unexpected challenges that you faced during the migration?
Emmanuel: Two challenges were particularly interesting. The first one was around entity resolution, which wasn’t something we saw coming. This was because, like I said, we had multiple sources providing information about the same subset of professionals, but how these sources represented the data was usually slightly different.
We had to be creative in determining if, let’s say, two different credentials belonged to the same professional, or if they were two separate professionals with some similarities in their names or other identifying information. The tricky part was that we had constraints—we couldn’t use external tools for entity resolution. So, everything was designed around writing SQL queries to figure out what the entity was.
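As a simplified illustration of that SQL-only approach, a matching pass could pair up credential records whose normalized names and another identifier agree; the real rules were more involved, and the columns here are hypothetical:

```sql
-- Hypothetical matching pass: candidate pairs of credential records that
-- likely belong to the same professional.
select
    a.record_id as record_a,
    b.record_id as record_b
from staged_credentials a
join staged_credentials b
    on  a.record_id < b.record_id                          -- skip self- and mirror pairs
    and a.license_number = b.license_number
    and upper(trim(a.full_name)) = upper(trim(b.full_name))
```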
Another challenge was related to tooling and the zero-downtime requirement from the business. At first, we picked a tool that ended up locking the database for too long. Once you lock those rows, there will be issues upstream in the application layer. We had to be creative with how we handled bulk operations in Postgres, ensuring that we weren’t crashing the application.
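One common way to keep bulk writes from holding locks too long is to break them into small batches of upserts. A rough Postgres sketch, with illustrative table names and a hypothetical batch_id column populated upstream:

```sql
-- Hypothetical batched upsert into the live table: each statement touches
-- only one small batch, so row locks are held briefly and the application
-- keeps serving traffic.
insert into professionals (professional_id, full_name, credential, updated_at)
select professional_id, full_name, credential, now()
from staging_professionals
where batch_id = 42                      -- loop over batches of a few thousand rows each
on conflict (professional_id) do update
set full_name  = excluded.full_name,
    credential = excluded.credential,
    updated_at = excluded.updated_at;
```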
Advice for migrations
Migrations are as much about strategy as they are about execution. Emmanuel reflects on lessons learned from his early projects, emphasizing the importance of understanding the business context before diving into technical details.
Datafold: Do you have any advice you wish someone had given you before your very first migration?
Emmanuel: For me, I think my biggest mistake in the early days—which is something a lot of junior engineers do—is diving into the technicals as quickly as possible. I think it’s very important to first understand the business process. It might sound boring or like you might not necessarily see the value there upfront, but over time, you find that a lot of projects fail because of a poor understanding of business requirements. So, something I think is very important is to start by understanding the business processes around the data you’re migrating.

You also need to spend a lot of time understanding your source system, your sink system, and the limitations of whatever tool you choose for your migration. A common issue with data migration projects is dealing with data types between systems: a type in one system might abstract away details that another system represents differently, and you need to understand those differences.
A good example is precision for continuous values. You need to understand what the differences will be between the two systems, and the limitations of whatever tool you are choosing to use to do your migration. For the example I cited, initially we picked a tool for a migration, and then we realized it was locking rows in the database for too long, which was causing problems downstream. Eventually, we had to switch to a different tool. On the surface, it’s easy to run a quick test and confirm that a tool works, but when you start to factor in the scale of what you’re dealing with, things can be slightly different, so you have to spend a lot of time understanding the limitations.
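One way to catch that precision problem before it bites is to check, on the source side, which values would silently change when forced into the target column's type. A hypothetical Snowflake-side check against an illustrative table:

```sql
-- Hypothetical pre-flight check: flag values that would be rounded when
-- loaded into a Postgres column declared as numeric(12, 2).
select
    professional_id,
    fee_amount
from int_professionals_to_load
where fee_amount <> round(fee_amount, 2);
```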