Eventbrite wanted to scale beyond its existing tech stack to a new stack that would enable it to democratize data access and facilitate performant ad-hoc analytics. However, migrating to a new stack was complex.
Eventbrite is currently running two parallel data tech stacks, with the legacy stack based on self-managed Hive/Spark/Presto, and the future stack based on fully-managed services such as Snowflake, Airflow, dbt, and Datafold.
One of the primary challenges in moving from Presto to Snowflake was that the existing SQL couldn’t simply be copied over. Each model had to be rebuilt in Snowflake using dbt.
With roughly 300 models across 20-30 data sources, Eventbrite needed to validate that the models rebuilt in Snowflake produced the same data transformations as the original models. This raised two concerns.
First, manual validation that each migrated model produced the same transformations would be inefficient and increase the risk of human error.
Second, taking each migrated model back to the stakeholders to check for correctness would have added weeks—or likely months—of lag time.
Eventbrite decided it wanted a tool that could give its data team and stakeholders confidence that the migrated models were correct, and provide metrics to help the data team meet its stakeholders’ SLAs.
Early in the migration process, Eventbrite integrated Data Diff to help automatically validate that the migrated models were achieving parity with the pre-migration models.
For any particular model or set of models, Eventbrite would agree on an SLA with stakeholders. For example, an acceptable result might be 99.5% of column values and 98% of rows matching between pre-migration and post-migration models. With an agreement in place, the data team could begin migrating models to Snowflake by copying over the model and the data from Presto.
The data team used two primary operations to ensure quality during the migration: nightly builds and ad hoc builds. After a model build occurred, the data team would use Data Diff to spot-check the specific values and see any differences in distributions.
They could see what percentage of the values matched exactly between the pre- and post-migration model data. This was a quick and easy way to determine whether the accuracy achieved was within the agreed-upon SLA. Instead of constantly checking in with stakeholders, the data team had hard numbers from Datafold to confirm that results were within an acceptable range.
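To make the SLA check concrete, here is a minimal sketch of the kind of comparison described above: it computes the share of matching rows and matching column values between a pre- and post-migration model and tests them against the example thresholds (98% of rows, 99.5% of column values). This is not Datafold's API or a description of how Data Diff works internally. The names `compare_models`, `within_sla`, and `ParityReport` are hypothetical, and the sketch assumes each model fits in memory as a dictionary keyed by primary key.

```python
from dataclasses import dataclass


@dataclass
class ParityReport:
    row_match_rate: float    # share of rows identical in both models
    value_match_rate: float  # share of individual column values that match


def compare_models(legacy, migrated, columns):
    """Compare two models given as {primary_key: {column: value}} dicts."""
    all_keys = set(legacy) | set(migrated)
    if not all_keys or not columns:
        raise ValueError("need at least one row and one column to compare")

    matching_rows = 0
    matching_values = 0
    for key in all_keys:
        old_row, new_row = legacy.get(key), migrated.get(key)
        if old_row is None or new_row is None:
            continue  # a row missing on either side counts as a full mismatch
        same = sum(1 for c in columns if old_row.get(c) == new_row.get(c))
        matching_values += same
        if same == len(columns):
            matching_rows += 1

    total_rows = len(all_keys)
    return ParityReport(
        row_match_rate=matching_rows / total_rows,
        value_match_rate=matching_values / (total_rows * len(columns)),
    )


def within_sla(report, min_row_rate=0.98, min_value_rate=0.995):
    """Return True when both match rates meet the agreed thresholds."""
    return (report.row_match_rate >= min_row_rate
            and report.value_match_rate >= min_value_rate)


if __name__ == "__main__":
    # Hypothetical sample data: one ticket count differs after migration.
    legacy = {1: {"event": "A", "tickets": 10}, 2: {"event": "B", "tickets": 5}}
    migrated = {1: {"event": "A", "tickets": 10}, 2: {"event": "B", "tickets": 6}}
    report = compare_models(legacy, migrated, ["event", "tickets"])
    print(report, within_sla(report))
```

Rows that exist in only one model are counted against both rates, so dropped or duplicated keys lower the score rather than being silently ignored. A real migration would run this kind of comparison inside the warehouse, not in application memory.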
Without Data Diff, Eventbrite would have spent time writing a solution from scratch to test and validate the output of every new dbt model. By integrating Data Diff into its pipeline, Eventbrite saved time and achieved accuracy during migration.
Seeing statistics and visualizations on how new model data compares to existing model data makes it easy to verify our migration SLAs with stakeholders. This has saved incredible amounts of time and driven efficiency for us.
By using Data Diff with agreed-upon data SLAs and Datafold’s UI to check model accuracy, Eventbrite saved an estimated several months of work. Eventbrite also noted significant resource savings from using Datafold and not having to do the following:
- Bring each migration to stakeholders for manual review
- Build custom testing and migration validation scripts for each model
In addition, Data Diff allowed Eventbrite to reduce the effort needed to onboard other team members during the migration effort. With simple and automated validation of models, Eventbrite could quickly see that new team members were performing the migration correctly, building trust and confidence.
The increase in migration speed and accuracy effectively democratized data by reducing the amount of business knowledge needed to create data transformations. It also made evaluating accuracy between pre- and post-migration models easier and less risky.
Datafold will continue to be a key component in Eventbrite’s daily data operations, ensuring continued data quality confidence after the migration effort concludes. Eventbrite also plans to adopt more features from Datafold as its data set grows and requires more advanced functionality.
We’re very excited to use Datafold in phase two of our migration. Using Datafold as our guardrails to get downstream impact information into our pull requests allows us to be proactive, not reactive.
Is your organization facing a massive migration endeavor or data validation challenge? Contact Datafold for a demo of Data Diff today.