Data Quality Management According to Lyft, Shopify, and Thumbtack
Data quality management is the process of building, implementing, and maintaining systems that keep your company's data accurate and usable, especially at a large scale. According to Jamie Quint, General Partner at Uncommon Capital, a current challenge for data teams is ensuring data quality functionality. He describes functionality as being able to answer the question, “Is your data in good shape?” It may seem simple, but given the diversity of data sources at most companies, the tremendous volume of data, and the rate at which data becomes outdated, that question is tough to answer.
Given the complexity of real-world data quality management, it’s helpful to look at actual teams that have managed to establish and scale strong management practices. This article explores the data challenges faced by Shopify, Lyft, and Thumbtack and the systems they built to conquer them.
Lyft: Solving data quality with a self-serve data testing framework
Lyft is a ride-sharing company that offers rides, vehicles for hire, and even food and car part delivery. Millions of people use it every month, primarily in the U.S. and Canada. When Lyft realized that its stakeholders lacked trust in its data, its Data Team of about 600 people confronted the problem head-on by building a service to reliably assure data quality and win the users' trust.
Jason Carey, a data platform technical lead at Lyft, said that their data warehouse holds about 110k datasets totaling 40 petabytes. Even though this large amount of data gets into the data warehouse reliably, there was “a gap around the semantic correctness of the data,” for example, checking for the tolerance and completeness of a primary key column. Maintaining these checks on an ongoing basis was hard.
Lyft built a proprietary data quality tool called Verity to tackle their data quality management and testing problem. Verity is a system that monitors and checks the quality of offline data by paying attention to the semantic correctness of the data. Verity consists of a three-step model:
- Check: The Check model configures a query and a condition.
- Schedule: The Schedule model orchestrates quality checks depending on changes in the data.
- Notification: The Notification model determines who gets notified.
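To make the three parts concrete, here is a minimal sketch in Python. It is not Verity's actual API, which this article does not show. The class names, the `on_table_changed` function, the `rides` table, and the email address are all hypothetical, and SQLite stands in for the data warehouse.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    """Hypothetical check: a SQL query plus a condition on its single result value."""
    name: str
    query: str
    condition: Callable[[float], bool]


@dataclass
class Schedule:
    """Hypothetical schedule: run the check whenever this table changes."""
    trigger_table: str


@dataclass
class Notification:
    """Hypothetical notification target: who hears about a failure."""
    recipients: List[str]


def on_table_changed(conn, table, check, schedule, notification):
    """Run the check if the changed table matches the schedule.

    Returns the messages that would be sent (an empty list if nothing failed).
    """
    if table != schedule.trigger_table:
        return []
    value = conn.execute(check.query).fetchone()[0]
    if check.condition(value):
        return []
    return [f"{r}: check '{check.name}' failed on {table}"
            for r in notification.recipients]


# Assumed example data: a rides table with a duplicated ride_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (ride_id INTEGER, fare REAL)")
conn.executemany("INSERT INTO rides VALUES (?, ?)",
                 [(1, 9.5), (2, 12.0), (2, 12.0)])

unique_ids = Check(
    name="ride_id_is_unique",
    query="SELECT COUNT(*) - COUNT(DISTINCT ride_id) FROM rides",
    condition=lambda duplicates: duplicates == 0,
)
schedule = Schedule(trigger_table="rides")
notification = Notification(recipients=["data-oncall@example.com"])

messages = on_table_changed(conn, "rides", unique_ids, schedule, notification)
```

Keeping the three concerns in separate objects means the same check can be rescheduled or rerouted to a different owner without touching its query.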
An interesting feature of Verity is the ability to run data quality checks in either a blocking or a non-blocking manner. Let’s break that down. When a blocking check fails, Verity stops downstream consumption of the data, which limits the blast radius of bad data in the ETL process. When a non-blocking check fails, downstream consumption continues and Verity only sends notifications about the data change or the failed check. By implementing Verity, Lyft was able to run 1,000+ data quality checks in production and achieve 65% unique and volume coverage for their datasets.
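The difference between the two modes can be sketched in a few lines of Python. This is an illustration of the general pattern, not Verity's implementation. `QualityCheck`, `BlockingCheckFailed`, and `run_pipeline_step` are hypothetical names, and the `publish` and `notify` callbacks stand in for the downstream consumer and the alerting channel.

```python
from dataclasses import dataclass
from typing import Callable, List


class BlockingCheckFailed(Exception):
    """Raised when a blocking check fails, so downstream steps never run."""


@dataclass
class QualityCheck:
    name: str
    passes: Callable[[List[dict]], bool]
    blocking: bool


def run_pipeline_step(rows, checks, publish, notify):
    """Run every check; notify on any failure, stop only on blocking failures."""
    for check in checks:
        if check.passes(rows):
            continue
        notify(f"check '{check.name}' failed")
        if check.blocking:
            raise BlockingCheckFailed(check.name)
    publish(rows)  # reached only if no blocking check failed


# Assumed example data: one row is missing its primary key.
rows = [{"ride_id": 1}, {"ride_id": None}]
alerts, published = [], []

no_null_ids = QualityCheck(
    "ride_id_not_null",
    lambda rs: all(r["ride_id"] is not None for r in rs),
    blocking=True,
)
min_volume = QualityCheck(
    "at_least_three_rows",
    lambda rs: len(rs) >= 3,
    blocking=False,
)

# Non-blocking failure: an alert is recorded, but the data is still published.
run_pipeline_step(rows, [min_volume], published.append, alerts.append)

# Blocking failure: an alert is recorded and nothing is published.
try:
    run_pipeline_step(rows, [no_null_ids], published.append, alerts.append)
    blocked = False
except BlockingCheckFailed:
    blocked = True
```

A reasonable rule of thumb under this design is to make a check blocking only when publishing bad data would cost more than delaying the data.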
Shopify: Ensuring trust of 1M+ data users with SQL unit testing
Shopify is an eCommerce platform for businesses with a large number of data consumers and data producers. They implemented SQL unit testing on top of their dbt modeling tool to rapidly detect unexpected changes in data and to reduce computing costs. Data quality testing matters because you need quality data to make data-driven decisions.
Shopify introduced a modeling tool built on dbt and Google BigQuery to keep up with their rapid growth. The tool empowers the data team to build reporting data pipelines efficiently. However, over time, this approach stopped scaling: dbt tests that ran against production data became slow and incurred more expenses. Another issue was that different collaborators were making frequent updates and changes to the modeling tool.
The data team took a proactive approach to improving the dbt modeling tool by implementing an in-house framework called Seamster. This tool enabled their data developers to run SQL unit tests. A SQL unit test runs a model's SQL against mock datasets and compares the output with the expected results.
Code can be written so that it works a lot like LEGO blocks that snap together to make the finished product. Unit tests are run on each LEGO block before code changes are approved for the final version that gets rolled out to production. This is all fairly standard; the innovation here is that Seamster runs those unit tests against a small subset of curated test data that accurately simulates production data. Running against test data rather than production data makes unit tests fast enough to be practical for continuous integration. One key benefit of SQL unit testing is that it allows you to easily detect and fix flaws in the initial development phase of your code.
Shopify was able to test unexpected changes (edge cases) quickly, and increased regression testing prevented collaborators from breaking the production environment by checking in bad code. Their data team has 100+ models submitted by data scientists and 300+ unit tests; a full run takes 22 minutes, and the average run in CI takes 3 minutes.
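As an illustration of the idea, here is a minimal SQL unit test in Python. It does not use Seamster, dbt, or BigQuery. SQLite stands in for the warehouse, and the `orders` table, the model query, and the function names are all assumptions made for the example.

```python
import sqlite3

# Hypothetical model under test: total paid order amount per shop per day.
MODEL_SQL = """
SELECT shop_id, order_date, SUM(amount) AS total_amount
FROM orders
WHERE status = 'paid'
GROUP BY shop_id, order_date
ORDER BY shop_id, order_date
"""


def run_model(mock_orders):
    """Load mock rows into an in-memory database and run the model's SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(shop_id INTEGER, order_date TEXT, amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", mock_orders)
    result = conn.execute(MODEL_SQL).fetchall()
    conn.close()
    return result


def test_refunded_orders_are_excluded():
    # A handful of curated rows, including the edge case we care about.
    mock_orders = [
        (1, "2021-01-01", 10.0, "paid"),
        (1, "2021-01-01", 5.0, "paid"),
        (1, "2021-01-01", 99.0, "refunded"),
        (2, "2021-01-02", 7.5, "paid"),
    ]
    expected = [
        (1, "2021-01-01", 15.0),
        (2, "2021-01-02", 7.5),
    ]
    assert run_model(mock_orders) == expected


test_refunded_orders_are_excluded()
```

Because the test touches four rows instead of a production table, it runs in milliseconds and costs nothing to execute on every pull request.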
Thumbtack: Automating the Data Quality Check Process with Data Diff
Thumbtack is a company that helps local professionals and customers find each other. Its data models are full of complex business logic encoded in SQL, and without an easy way to test the output, data developers (analysts) often introduced errors when changing them.
Thumbtack required all data developers to produce manual diff reports showing the impact of every code change on the data. To do this, Thumbtack’s data analysts built spreadsheets comparing the rows changed by every pull request. This manual process meant that even a simple change in the data warehouse required a long review.
The process did not scale: it was expensive and cumbersome, which also made it hard to enforce. One of the takeaways from our State of Data Quality report is that manual work is the primary reason for low productivity in data teams.
Thumbtack took a different approach by onboarding Datafold’s Data Diff tool. With one click, Data Diff shows how the data changes when the code that produces it is modified, even across many rows. Thumbtack integrated Data Diff into their GitHub repository, CI, and code review process. As a result, the data QA process was automated and the codebase was protected from breaking changes.
Datafold's Diff tool now conducts automated regression checks on data changes for 100+ pull requests per month. Automating their data QA process has unleashed the productivity of their data warehouse.
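To show what a data diff reports, here is a toy version in Python that compares two versions of a table by primary key. It is a sketch of the general technique and not how Datafold's Data Diff is implemented. The `diff_tables` function and the sample rows are hypothetical.

```python
def diff_tables(before, after, key):
    """Compare two lists of row dicts by primary key.

    Returns which keys were added, which were removed, and which columns
    changed for keys present in both versions.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    changed = {}
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        columns = sorted(
            c for c in old.keys() | new.keys() if old.get(c) != new.get(c)
        )
        if columns:
            changed[k] = columns

    return {"added": added, "removed": removed, "changed": changed}


# Assumed example: the same model built from the main branch and from a pull request.
production = [
    {"id": 1, "revenue": 100},
    {"id": 2, "revenue": 50},
    {"id": 3, "revenue": 75},
]
pull_request = [
    {"id": 1, "revenue": 100},
    {"id": 2, "revenue": 55},
    {"id": 4, "revenue": 20},
]

report = diff_tables(production, pull_request, key="id")
```

Here `report` lists id 4 as added, id 3 as removed, and the `revenue` column of id 2 as changed, which is the kind of summary a reviewer would otherwise assemble by hand in a spreadsheet.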
Data Quality Management Acts Like a Flywheel
One key takeaway from Lyft, Shopify, and Thumbtack is that data quality management is a journey, not a destination. Each of them moved from cumbersome, manual processes to more automated, streamlined ones. But it took time.
Data quality management requires continuous, incremental improvements. A good first step is to build a data culture in your organization. This involves building data literacy and understanding within your organization and promoting tools that improve data quality and data governance. Another action you can take is to adopt a proactive approach to the change management process. You can start by ensuring that the code that transforms and processes your data is version controlled, and by creating a transparent review process for how your data is transformed.
We have a unique approach to data quality and have built a platform to simplify data observability. Contact us to discover how Datafold fits into your organization’s data quality management journey.