Data Quality Meetup #3

About the Event

One of the biggest challenges for Data teams today is monitoring and managing analytical data quality. We are bringing together practitioners and leaders from data-driven teams and the open-source and vendor communities to share and learn best practices around data quality & governance.

Our third Data Quality Meetup took place on March 11th, 2021, with 9 speakers and over 160 live participants. We are excited to bring you the talks from our expert speakers and the panel discussion, along with key takeaways from each segment, in the digest below.

Also, feel free to check out the digest and recordings from Data Quality Meetup #2 and RSVP for the upcoming Data Quality Meetup session here.

Lightning Talks

  1. Automating Data Quality at Thumbtack
  2. The Purpose Meeting: Aligning Data Producers & Data Consumers
  3. Managing Modeling Debt using Metadata
  4. 7 Habits of Highly Effective Business Intelligence Engineers
  5. Alphabet Soup: ETL vs ELT for Data Pipelines
  6. Panel Discussion: lessons from Carta, Thumbtack, Shopify & Clari

Automating Data Quality at Thumbtack

By John Lee, Director, Product Analytics @ Thumbtack

By empowering its Data Analysts to build the data warehouse, Thumbtack has been able to rapidly scale up analytics while maintaining a lean team of Data Engineers. The rapid surge in analyst-built data models, however, brings its own challenge: maintaining data quality at scale.

Stack:

Airflow + Spark + BigQuery + Tableau + Mode

Data Quality vs. Speed

Data models encode complex business logic in SQL, and without an easy way to test the output of a change, data developers (analysts) often introduce errors that lead to the following problems:

  • Cascading Failures: one bug can propagate downstream and corrupt the entire data pipeline, requiring a lengthy cleanup process
  • Eroding trust: broken or inaccurate dashboards cause stakeholders to lose faith in data
  • Lost $$$: direct bottom-line impact on product as well as marketing

Automated Data Quality Checks with Datafold

  • At first, Thumbtack required all developers to produce manual diff reports showing the impact of every code change on data. That was an expensive and cumbersome process.
  • Then Thumbtack onboarded Datafold’s Data Diff tool, which computes the diff out of the box, and integrated it into the Continuous Integration (CI) pipeline in GitHub.
  • Datafold Diff now runs automated regression checks on data changes for every pull request to the ELT code (100+ pull requests per month!); see the sketch below.
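
To make this concrete, below is a minimal Python sketch of the kind of row-level regression check a data diff performs in CI. The table contents, the "gmv" column, and the pass/fail policy are all hypothetical; this illustrates the concept, not Datafold’s actual implementation.

```python
# Hypothetical sketch: compare the "production" and "pull request" versions
# of a table and summarize row-level differences, the way a data diff in CI
# would before a merge. Not Datafold's actual implementation.
from typing import Dict, List

Row = Dict[str, object]

def diff_tables(prod: List[Row], pr: List[Row], key: str) -> Dict[str, int]:
    """Summarize added, removed, and changed rows between two table versions."""
    prod_by_key = {r[key]: r for r in prod}
    pr_by_key = {r[key]: r for r in pr}
    added = pr_by_key.keys() - prod_by_key.keys()
    removed = prod_by_key.keys() - pr_by_key.keys()
    changed = [k for k in prod_by_key.keys() & pr_by_key.keys()
               if prod_by_key[k] != pr_by_key[k]]
    return {"added": len(added), "removed": len(removed), "changed": len(changed)}

prod = [{"id": 1, "gmv": 100.0}, {"id": 2, "gmv": 250.0}]
pr = [{"id": 1, "gmv": 100.0}, {"id": 2, "gmv": 275.0}, {"id": 3, "gmv": 50.0}]
print(diff_tables(prod, pr, key="id"))  # {'added': 1, 'removed': 0, 'changed': 1}
# In CI, a non-empty diff on a protected model would block the merge until
# a human reviews the report.
```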

The Purpose Meeting: Aligning Data Producers & Data Consumers

By Stefania Olafsdottir, Co-Founder & CEO @ Avo – developing next-gen analytics governance

The biggest challenge in data quality is the lack of alignment between the two major groups of internal stakeholders:

  • Data Consumers: users of data, i.e. PMs, Data Scientists, BI Analysts, and Management
  • Data Producers: generators of data, i.e. Developers

The first step in solving the alignment problem is to start treating data like a product. By adopting a product management mindset, we can then narrow the issues down into four major product risks:

  1. Value risk
  2. Usability risk
  3. Feasibility risk
  4. Business viability risk

To effectively address and mitigate these risks, we introduce the Purpose Meeting, which brings together Data Producers and Data Consumers to align on 3 key ideas:

  1. Goals: What would success look like?
  2. Metrics: How can you quantitatively measure success?
  3. Data: Design the event structure. What analytics events are required for the success metrics?

Once these three key elements are defined, the entire organization will develop data products in a more effective and aligned fashion.
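
As an illustration, here is a hypothetical Python sketch of how a Purpose Meeting’s output could be captured as a machine-checkable spec: the goal, its success metrics, and the analytics events (with required properties) that feed those metrics. All names are invented for this example; this is not Avo’s actual schema format.

```python
# Hypothetical spec capturing a Purpose Meeting's three outputs:
# goal, metrics, and the event structure required by the metrics.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EventSpec:
    name: str                  # e.g. "signup_completed"
    required_props: List[str]  # properties every instance must carry

@dataclass
class PurposeSpec:
    goal: str                # what success looks like
    metrics: List[str]       # how success is measured, quantitatively
    events: List[EventSpec]  # events required to compute the metrics

def validate_event(spec: PurposeSpec, name: str, props: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the event matches the spec."""
    matches = [e for e in spec.events if e.name == name]
    if not matches:
        return [f"unknown event: {name}"]
    return [f"missing property: {p}"
            for p in matches[0].required_props if p not in props]

spec = PurposeSpec(
    goal="Grow activated signups",
    metrics=["signup -> activation conversion rate"],
    events=[EventSpec("signup_completed", ["user_id", "plan", "referrer"])],
)
print(validate_event(spec, "signup_completed", {"user_id": 42, "plan": "pro"}))
# -> ['missing property: referrer']
```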

Managing Modeling Debt using Metadata

By David Wallace, Senior Data Engineer @ Dutchie (previously @ GoodEggs)

Throughout his career, David has extensively relied on dbt as a framework for organizing data transformations with SQL.

At GoodEggs, the main challenge in rapidly scaling up dbt was that a large share (30%+) of production dbt models were outdated, duplicated, or unused, leading to a very bulky system. This “atrophy” – deterioration due to lack of use – also affected downstream data assets such as Mode reports and Jupyter Notebooks, creating bloat and hampering team productivity.

The GoodEggs team built a special pipeline in Dagster (an open-source data orchestrator) that identified vestigial data assets and deprecated them in an auditable manner.

Identifying and Removing Vestigial dbt Models

The automated process for deprecating stale data assets consists of four main steps (a sketch of step 2 follows the list):

  1. Link dbt models to Mode reports through query metadata
  2. Identify and remove dbt models with no downstream dependencies (vestigial)
  3. Deprecate data assets in a reversible manner with version control to allow restoring if needed
  4. Extend the same approach to identify other vestigial data assets
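
Below is a minimal sketch of step 2, assuming dbt’s target/manifest.json is available and that step 1 has already produced the set of model names referenced by Mode report queries. Exact manifest fields vary by dbt version ("nodes" and "child_map" are the relevant pieces here), and this is an illustration rather than the GoodEggs team’s actual Dagster pipeline.

```python
# Flag dbt models with no downstream dependencies: no children in the dbt
# DAG (per the manifest's child_map) and no references from Mode reports.
import json

def find_vestigial_models(manifest_path: str, used_by_mode: set) -> list:
    with open(manifest_path) as f:
        manifest = json.load(f)
    vestigial = []
    for unique_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue
        has_dbt_children = bool(manifest.get("child_map", {}).get(unique_id))
        if not has_dbt_children and node["name"] not in used_by_mode:
            vestigial.append(node["name"])
    return vestigial

# Hypothetical usage: model names parsed from Mode query metadata in step 1.
used = {"orders", "daily_revenue"}
print(find_vestigial_models("target/manifest.json", used))
```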

7 Habits of Highly Effective Business Intelligence Engineers

By Josh Temple, Analytics Engineer @ Spotify, Co-founder @ Spectacles

The 7 Habits

  1. Version Control: maintaining history for rollbacks; non-destructive experimentation and collaboration; version control of both code and content
  2. Testing: validating the contract between the data warehouse and BI tools; sanity checks based on historical values; ensuring data uniqueness; logic tests (see the sketch after this list)
  3. CI (Continuous Integration): automating tests before deployment (CD); code style governance / best practices for consistency; simplification of code reviews
  4. Code Review: continuous learning and knowledge sharing; providing constructive and empathetic feedback; catching bugs not caught by CI
  5. Documentation
  6. Extensibility
  7. Access Control
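
To make habit 2 concrete, here is a hedged sketch of two simple automated tests: a uniqueness check and a sanity check based on historically plausible values. sqlite3 stands in for the warehouse so the example runs anywhere; the table and column names are hypothetical.

```python
# Minimal data tests: uniqueness and a historical sanity range.
import sqlite3

def assert_unique(conn, table: str, column: str) -> None:
    dupes = conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()
    assert not dupes, f"{table}.{column} has duplicates: {dupes[:5]}"

def assert_in_range(conn, table: str, column: str, lo: float, hi: float) -> None:
    # Sanity check: values should fall within historically plausible bounds.
    mn, mx = conn.execute(f"SELECT MIN({column}), MAX({column}) FROM {table}").fetchone()
    assert lo <= mn and mx <= hi, f"{table}.{column} outside [{lo}, {hi}]: ({mn}, {mx})"

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 99.5)])
assert_unique(conn, "orders", "id")
assert_in_range(conn, "orders", "amount", 0.0, 10_000.0)
print("all checks passed")
```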

Alphabet Soup: ETL vs ELT for Data Pipelines

By Gary Sahota, Head of Analytics @ Clari – revenue operations platform

ETL: Extract, Transform, Load. The current industry norm. Data is sent through a series of transformations (cleaning, aggregating, blending), and the output is stored in the data warehouse for use in BI tools.

Pros

  • “Clean” data
  • Less data loaded into the warehouse
  • More cost-efficient

Cons

  • Rigid data pipelines
  • High-maintenance
  • Higher upfront costs

ELT: Extract, Load, Transform. A newer industry trend that has emerged in the last 4-5 years. Raw data is first stored in the warehouse and transformed afterwards, prior to downstream use in BI. Transformed tables live alongside the raw data, providing concurrent access to both (see the sketch after the pros and cons below).

Pros

  • Easy access to raw data
  • Low maintenance
  • Flexibility to adapt to future changes

Cons

  • “Dirty” data is present alongside the cleaned/transformed data
  • Large volume of data in the warehouse
  • Compliance challenges
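
To make the distinction concrete, here is a minimal Python sketch contrasting the two patterns. The “warehouse” is just a dict and the transform is trivial; the point is where the transform happens relative to the load, not the specifics.

```python
raw_rows = [{"amount": "12.5"}, {"amount": "99.0"}, {"amount": "bad"}]
warehouse = {}

def transform(rows):
    # Clean: drop unparseable amounts, cast the rest to float.
    out = []
    for r in rows:
        try:
            out.append({"amount": float(r["amount"])})
        except ValueError:
            continue
    return out

# ETL: transform first, then load only the cleaned output. Less data stored,
# but the raw rows are gone; a new transform requires re-extracting from source.
warehouse["orders_etl"] = transform(raw_rows)

# ELT: load raw data as-is, then build transformed tables alongside it.
# Raw rows stay queryable, so future transforms need no re-extraction.
warehouse["orders_raw"] = raw_rows
warehouse["orders_clean"] = transform(warehouse["orders_raw"])

print(len(warehouse["orders_etl"]), len(warehouse["orders_raw"]))  # 2 3
```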

Panel Discussion

Panelists

  • Gary Sahota, Head of Analytics @ Clari
  • Jillian Corkin, Developer Relations Advocate @ Fishtown Analytics
  • Zeeshan Qureshi, Engineering Manager @ Shopify
  • George Xing, ex-Head of Analytics @ Lyft
  • John Lee, Director, Product Analytics @ Thumbtack
  • Julia King, VP, Data & Analytics @ Carta
  • Gleb Mezhanskiy [Moderator], Co-founder & CEO @ Datafold

Questions

What is the biggest challenge that you have been recently facing as a Data Leader, and how has that changed over the last 3-4 years?

  • Hiring: expectations and skillsets around data tools have been rapidly evolving, and finding the right match has been a challenge on the people side
  • Managing organizational scalability: rapidly growing data teams need to leverage different skillsets while maintaining high efficiency across technical and non-technical roles
  • Complex customer demands: a better understanding of data by end users has led to greater demand for in-depth, complex, and customized data insights for decision-making

What are some of the hacks that your teams have implemented for reliable and high-quality self-serve analytics in your organizations?

  • Align data consumers and data producers: empower data users with the skillsets and toolkits for regular, standardized analytical problems, with Data Scientists serving as the support function for more unique and complex challenges
  • Alignment: leverage a robust data governance system to align the definitions and understanding of metrics across all teams. The challenge remains striking a balance between centralizing information and staying flexible enough to serve data users in the tools and interfaces they prefer.

For standardized analytics like funnels, what are some of the approaches and toolkits implemented by each of the teams?

  • Finding the right balance: there are lots of options, tools overlap in scope, and data teams are always eager to try new ones. Striking a balance between optimizing existing toolkits/resources and finding efficiency with new ones is a key aspect of growth.
  • Training: teaching non-technical team leads to navigate dashboards and BI tools for simpler, out-of-the-box, standardized information, while relying on data teams for newer and more complex insights and requests.

Given the modular nature of today’s data stacks, what are some of the tools that you decided to build in-house within your stack?

  • Value-based approach: take on a project if it adds significant value over a commercially available alternative, or if it needs significant customization to suit users’ needs. Also, look for opportunities to eventually “productize” tools initially built for internal use. The challenge remains accurately assessing and quantifying the bottom-line benefit of building in-house versus buying a commercially available tool.

RSVP for the upcoming Data Quality Meetup sessions here

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
