The task for Faire was to migrate from Redshift to Snowflake. However, Faire also needed to maintain its data pipelines based on Stitch and Kinesis while having minimal impact on their end-user analytic tools provided by Mode. In addition, tons of machine-learning tooling was created in-house using Airflow to orchestrate the workflows.
They needed to move thousands of tables and seamlessly reconnect hundreds of dashboards that hundreds of Faire employees use. The team supporting the data stacks consisted of roughly 35 data scientists and ML engineers, along with about 80 data and infrastructure engineers.
Faire needed a solution that could do all of the following:
- Validate the data was moved over accurately
- Be able to clearly explain any differences in data between Redshift and Snowflake
They needed an existing solution that would make the migration and validation more manageable for the data team while ensuring the right level of accuracy for data users.
During the planning phase, Faire considered the level of effort to develop validation tooling which would allow them to compare the data in Redshift and Snowflake. They determined that developing that tooling would have put even more pressure on an already aggressive timeline, and it would have left their engineers stuck in a manual process involving unrewarding work.
Lastly, Faire recognized that skipping the data validation altogether was not an option. Without this validation, there was the risk of creating systemic issues in the data and a reduction in data user confidence, which would have naturally impacted the business and its performance.
Datafold was the solution that Faire decided on during their migration planning phase. It surfaced as a clear leader in the space due to a robust feature set that supports the data migration objective, and advanced data insights like their column-level lineage feature.
Having Datafold in a situation where you need 1 to 1 migration is critical
The process for applying Datafold’s Data Diff capability on the data models was heavily based on ownership, collaboration, and accountability. Faire followed a series of steps to provide consistent data deliverables at a high level of accuracy.
Step 1: Review data models
Data models were identified and categorized by their business group. This enabled a “divide and conquer” approach to the migration while also providing a clear set of data deliverables.
Step 2: Set targets
Each business group chose a business owner to represent their needs with the data engineering team. Likewise, the data engineering team assigned a data engineer responsible for one or several business groups. Together, they would agree on a target level of accuracy and provide detailed explanations for any discrepancies.
Step 3: Workflow with Data Diff
Data Diff was integrated into Airflow, which enabled diffs to be computed after each table refresh. An Airflow DAG would perform the following workflow:
- Fetch a table from Redshift
- Export the table into an S3 bucket
- Pull the table into a special schema in Snowflake
- Run Data Diff for comparison
Configuration of the table exports was done via YAML configuration so columns could be excluded as needed. The result was a series of Mode reports which showed the table, iterations, and correctness percentages.
Step 4: Iterate
The business owner and data engineer would iterate the models through Data Diff until a sufficient level of correctness was achieved, correcting or documenting discrepancies along the way. Data Diff was used heavily throughout the iterations due to its ability to show differences across rows, columns, and schemas, surfacing discrepancies through detailed drill-downs.
Step 5: Accept
After the business group iterated enough times to reach the agreed-upon level of accuracy, the business owner would sign off on the migration, indicating that a satisfactory level of correctness was achieved through the process.
Step 6: Terminate
The assigned data engineer would turn off access to the data model in Redshift. This would ensure that data drift did not occur between Redshift and Snowflake.
Faire was able to execute a full migration from Redshift to Snowflake through their use of Datafold. In addition to bringing a successful migration, leveraging the Data Diff feature from Datafold helped Faire to save the several months it would have taken to develop validation tools in-house.
Post migration, Faire continues to use Datafold for further migration projects as well as refactoring projects centered around data models and integrity. Their plans include incorporating Datafold into their continuous integration pipeline. That way, they can validate and interrogate models as changes are made before graduating the models to higher environments.
No question. I would recommend Datafold for any large-scale migration
Is your organization facing a massive migration endeavor or data validation challenge? Contact Datafold for a demo of Data Diff today.