Founded in 2009, Thumbtack facilitates five million projects every year across nearly 1,100 unique categories, covering every zip code in the United States. With more than 10 million users and growing, Thumbtack has raised more than $698.2 million and is valued at $3.2 billion.
Pioneers in the home services gig economy
Google BigQuery
Director, Product Analytics at Thumbtack
Thumbtack, an online marketplace that connects local professionals with customers requiring services, built a highly successful data-driven product. Thumbtack’s data team consists of 50+ analysts and five data engineers, who together were submitting over 100 pull requests to the SQL pipelines per month.
The business was scaling quickly, and the product could evolve as fast as data became available. However, data quality was a risk for the business: a single bug could corrupt an entire table, and because that table might feed other tables, the damage could cascade downstream. This wasn’t just a matter of breaking the CEO’s dashboards; it could have serious business implications, because many parts of the product, such as search, are powered by ML models trained on analytical data.
Whenever a data outage happened, the entire data team had to drop everything to find and fix the broken data, which compounded their existing workloads.
Thumbtack's manual testing: To minimize data outages, and the stress over their potential fallout, Thumbtack implemented a proactive manual process, guided by a code review playbook, to stop data from breaking in the first place. Data issues were a natural consequence of empowering so many analysts to submit so many SQL code changes.
Analysts would write SQL queries to check each pull request (PR) and confirm that it impacted only the expected rows and columns, loading the details of each change into spreadsheets for tracking. The process typically took between one and two hours per PR, although some PRs could take as long as half a day, and some analysts simply skipped the process or did not track changes adequately. Ensuring data quality and enforcing the change management workflow became increasingly difficult as the team grew. The manual review process helped reduce data outages but kept the Thumbtack team from moving fast and at scale.
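To make the manual check concrete, here is a minimal Python sketch of a keyed table comparison: it reports which rows were added, removed, or changed, and which columns differ. It is an illustration only, not Thumbtack's playbook or Datafold's implementation; the function name `diff_tables`, the `id` key, and the sample rows are all hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def diff_tables(before: List[Row], after: List[Row], key: str) -> Dict[str, Any]:
    """Compare two versions of a table, matching rows on the column `key`."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys present in only one version are added or removed rows.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    # For keys present in both versions, collect the columns whose values differ.
    changed_rows = []
    changed_columns = set()
    for k in sorted(before_by_key.keys() & after_by_key.keys()):
        old, new = before_by_key[k], after_by_key[k]
        cols = {c for c in old.keys() | new.keys() if old.get(c) != new.get(c)}
        if cols:
            changed_rows.append(k)
            changed_columns |= cols

    return {
        "added": added,
        "removed": removed,
        "changed_rows": changed_rows,
        "changed_columns": sorted(changed_columns),
    }


# Hypothetical data: the production table vs. the same table built from a PR branch.
prod = [
    {"id": 1, "city": "Austin", "price": 100},
    {"id": 2, "city": "Boston", "price": 80},
    {"id": 3, "city": "Denver", "price": 60},
]
branch = [
    {"id": 1, "city": "Austin", "price": 100},
    {"id": 2, "city": "Boston", "price": 85},
    {"id": 4, "city": "Miami", "price": 70},
]

report = diff_tables(prod, branch, key="id")
print(report)
```

A reviewer expecting the PR to change only `price` would see the added and removed rows in this report as a signal to look closer. In a real warehouse the same comparison would be expressed in SQL over full tables rather than in-memory lists.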
The team first tested Datafold’s Data Diff after a two-hour deployment in Thumbtack’s own cloud environment. Based on the successful results of using Data Diff ad hoc, the team built Datafold into the continuous integration (CI) pipeline in GitHub. Now every change to SQL code is validated automatically through the Datafold API, and a detailed impact analysis report is published to the pull request discussion for every change. Besides using Data Diff to test code changes, the Thumbtack team has been using Datafold’s column-level lineage feature to identify the downstream implications of changes, a task that would otherwise take hours of chasing dependencies in the massive codebase.
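The idea behind column-level lineage can be sketched as a graph traversal: given a map from each column to the columns derived from it, a breadth-first walk lists everything downstream of a change. The sketch below is a simplified assumption, not Datafold's API or data model; the `LINEAGE` graph, the column names, and the `downstream` function are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical lineage graph: each column maps to the columns derived from it.
LINEAGE: Dict[str, List[str]] = {
    "raw.requests.zip_code": ["analytics.requests.region"],
    "analytics.requests.region": [
        "ml.search_features.region",
        "dash.ceo.requests_by_region",
    ],
    "raw.requests.price": ["analytics.requests.price_usd"],
    "analytics.requests.price_usd": ["ml.search_features.avg_price"],
}


def downstream(column: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return every column reachable from `column`, using breadth-first search."""
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # visit each column once, even if the graph has cycles
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream("raw.requests.zip_code", LINEAGE)
print(impacted)
```

Here a change to the hypothetical `raw.requests.zip_code` column reaches an analytics table, an ML feature, and a dashboard, which mirrors the cascading risk described earlier in the case study.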
Product Analyst at Thumbtack