Faire Case Study - Migration with Validation

Key metrics

6 month

Faster migration

Data Quality issues

5000+

Tables migrated

Introduction

Founded in 2017, Faire is an online wholesale marketplace for retail and brands. Faire is located in San Francisco, California, and falls into the mid-sized company category with 1000+ employees. At the time of writing, Faire operates with around $1 billion in funding with a valuation near $10 billion.

Customer quote

We front loaded the validation to be on the model level, when those things are reliable, everything downstream becomes trustworthy. This made it much easier to debug if a report looks weird. It means that the problem is most certainly on the Mode side or in the code generating the report instead of the datasets.

Jon Medwig

Staff Data Engineer

The Task

The task for Faire was to migrate from Redshift to Snowflake. However, Faire also needed to maintain its data pipelines based on Stitch and Kinesis while having minimal impact on their end-user analytic tools provided by Mode. In addition, tons of machine-learning tooling was created in-house using Airflow to orchestrate the workflows.

They needed to move thousands of tables and seamlessly reconnect hundreds of dashboards that hundreds of Faire employees use. The team supporting the data stacks consisted of roughly 35 data scientists and ML engineers, along with about 80 data and infrastructure engineers.

The Challenges

Faire needed a solution that could do all of the following:

Validate the data was moved over accurately
Be able to clearly explain any differences in data between Redshift and Snowflake

They needed an existing solution that would make the migration and validation more manageable for the data team while ensuring the right level of accuracy for data users.

During the planning phase, Faire considered the level of effort to develop validation tooling which would allow them to compare the data in Redshift and Snowflake. They determined that developing that tooling would have put even more pressure on an already aggressive timeline, and it would have left their engineers stuck in a manual process involving unrewarding work.

Lastly, Faire recognized that skipping the data validation altogether was not an option. Without this validation, there was the risk of creating systemic issues in the data and a reduction in data user confidence, which would have naturally impacted the business and its performance.

The Solution

Datafold was the solution that Faire decided on during their migration planning phase. It surfaced as a clear leader in the space due to a robust feature set that supports the data migration objective, and advanced data insights like their column-level lineage feature.

Having Datafold in a situation where you need 1 to 1 migration is critical

The process for applying Datafold’s Data Diff capability on the data models was heavily based on ownership, collaboration, and accountability. Faire followed a series of steps to provide consistent data deliverables at a high level of accuracy.

Step 1: Review data models

Data models were identified and categorized by their business group. This enabled a “divide and conquer” approach to the migration while also providing a clear set of data deliverables.

Step 2: Set targets

Each business group chose a business owner to represent their needs with the data engineering team. Likewise, the data engineering team assigned a data engineer responsible for one or several business groups. Together, they would agree on a target level of accuracy and provide detailed explanations for any discrepancies.

Step 3: Workflow with Data Diff

Data Diff was integrated into Airflow, which enabled diffs to be computed after each table refresh. An Airflow DAG would perform the following workflow:

Fetch a table from Redshift
Export the table into an S3 bucket
Pull the table into a special schema in Snowflake
Run Data Diff for comparison

Configuration of the table exports was done via YAML configuration so columns could be excluded as needed. The result was a series of Mode reports which showed the table, iterations, and correctness percentages.

Step 4: Iterate

The business owner and data engineer would iterate the models through Data Diff until a sufficient level of correctness was achieved, correcting or documenting discrepancies along the way. Data Diff was used heavily throughout the iterations due to its ability to show differences across rows, columns, and schemas, surfacing discrepancies through detailed drill-downs.

Step 5: Accept

After the business group iterated enough times to reach the agreed-upon level of accuracy, the business owner would sign off on the migration, indicating that a satisfactory level of correctness was achieved through the process.

Step 6: Terminate

The assigned data engineer would turn off access to the data model in Redshift. This would ensure that data drift did not occur between Redshift and Snowflake.

The Result

Faire was able to execute a full migration from Redshift to Snowflake through their use of Datafold. In addition to bringing a successful migration, leveraging the Data Diff feature from Datafold helped Faire to save the several months it would have taken to develop validation tools in-house.

Post migration, Faire continues to use Datafold for further migration projects as well as refactoring projects centered around data models and integrity. Their plans include incorporating Datafold into their continuous integration pipeline. That way, they can validate and interrogate models as changes are made before graduating the models to higher environments.

No question. I would recommend Datafold for any large-scale migration

Is your organization facing a massive migration endeavor or data validation challenge? Contact Datafold for a demo of Data Diff today.

Faire migrated from Redshift to Snowflake 6 months faster using Datafold’s Data Diff feature