Data Quality Meetup #3

About the Event

One of the biggest challenges for Data teams today is monitoring and managing analytical data quality. We are bringing together practitioners and leaders from data-driven teams and the open-source and vendor communities to share and learn best practices around data quality & governance.

Our third Data Quality Meetup took place on March 11th, 2021, with 9 speakers and over 160 live participants. We are excited to bring you the talks from our expert speakers and the panel discussion, along with key takeaways from each segment, in the digest below.

Also, feel free to check out the digest and recordings from Data Quality Meetup #2 and RSVP for the upcoming Data Quality Meetup session here.

Lightning Talks

  1. Automating Data Quality at Thumbtack
  2. The Purpose Meeting: Aligning Data Producers & Data Consumers
  3. Managing Modeling Debt using Metadata
  4. 7 Habits of Highly Effective Business Intelligence Engineers
  5. Alphabet Soup: ETL vs ELT for Data Pipelines
  6. Panel Discussion: lessons from Carta, Thumbtack, Shopify & Clari

Automating Data Quality at Thumbtack

By John Lee, Director, Product Analytics @ Thumbtack

By empowering its Data Analysts to build the data warehouse, Thumbtack has been able to rapidly scale up analytics while maintaining a lean team of Data Engineers. The rapid surge in analyst-built data models, however, brings its own challenge: maintaining data quality at scale.

Stack:

Airflow + Spark + BigQuery + Tableau + Mode

Data Quality vs. Speed

Data models encode complex business logic in SQL, and without an easy way to test the output of a change, data developers (analysts) often introduce errors that lead to the following problems:

  • Cascading Failures: one bug can propagate downstream and corrupt the entire data pipeline, requiring a lengthy cleanup process
  • Eroding trust: broken or inaccurate dashboards cause stakeholders to lose faith in data
  • Lost $$$: direct bottom-line impact on product as well as marketing

Automated Data Quality Checks with Datafold

  • At first, Thumbtack required all developers to produce manual diff reports showing the impact of every code change on data. That was an expensive and cumbersome process.
  • Then Thumbtack onboarded Datafold’s Data Diff tool, which computes the diff out of the box, and integrated it into the Continuous Integration (CI) pipeline in GitHub.
  • Datafold Diff now runs automated regression checks on data changes for every pull request to the ELT code (100+ pull requests per month!); see the sketch below.
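
To make this concrete, below is a minimal Python sketch of the kind of row-level regression check a data diff performs in CI. The table contents, the "gmv" column, and the pass/fail policy are all hypothetical; this illustrates the concept, not Datafold’s actual implementation.

```python
# Hypothetical sketch: compare the "production" and "pull request" versions
# of a table and summarize row-level differences, the way a data diff in CI
# would before a merge. Not Datafold's actual implementation.
from typing import Dict, List

Row = Dict[str, object]

def diff_tables(prod: List[Row], pr: List[Row], key: str) -> Dict[str, int]:
    """Summarize added, removed, and changed rows between two table versions."""
    prod_by_key = {r[key]: r for r in prod}
    pr_by_key = {r[key]: r for r in pr}
    added = pr_by_key.keys() - prod_by_key.keys()
    removed = prod_by_key.keys() - pr_by_key.keys()
    changed = [k for k in prod_by_key.keys() & pr_by_key.keys()
               if prod_by_key[k] != pr_by_key[k]]
    return {"added": len(added), "removed": len(removed), "changed": len(changed)}

prod = [{"id": 1, "gmv": 100.0}, {"id": 2, "gmv": 250.0}]
pr = [{"id": 1, "gmv": 100.0}, {"id": 2, "gmv": 275.0}, {"id": 3, "gmv": 50.0}]
print(diff_tables(prod, pr, key="id"))  # {'added': 1, 'removed': 0, 'changed': 1}
# In CI, a non-empty diff on a protected model would block the merge until
# a human reviews the report.
```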

The Purpose Meeting: Aligning Data Producers & Data Consumers

By Stefania Olafsdottir, Co-Founder & CEO @ Avo – developing next-gen analytics governance

The biggest challenge in data quality is the lack of alignment between the two major groups of internal stakeholders:

  • Data Consumers: users of data, i.e. PMs, Data Scientists, BI Analysts, and Management
  • Data Producers: generators of data, i.e. Developers

The first step in solving the alignment problem is to start treating data like a product. By adopting a product management mindset, we can then narrow the issues down into four major product risks:

  1. Value risk
  2. Usability risk
  3. Feasibility risk
  4. Business viability risk

To effectively address and mitigate these risks, we introduce the Purpose Meeting, which brings together Data Producers and Data Consumers to align on 3 key ideas:

  1. Goals: What would success look like?
  2. Metrics: How can you quantitatively measure success?
  3. Data: Design the event structure. What analytics events are required for the success metrics?

Once these three key elements are defined, the entire organization will develop data products in a more effective and aligned fashion.
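
As an illustration, here is a hypothetical Python sketch of how a Purpose Meeting’s output could be captured as a machine-checkable spec: the goal, its success metrics, and the analytics events (with required properties) that feed those metrics. All names are invented for this example; this is not Avo’s actual schema format.

```python
# Hypothetical spec capturing a Purpose Meeting's three outputs:
# goal, metrics, and the event structure required by the metrics.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class EventSpec:
    name: str                  # e.g. "signup_completed"
    required_props: List[str]  # properties every instance must carry

@dataclass
class PurposeSpec:
    goal: str                # what success looks like
    metrics: List[str]       # how success is measured, quantitatively
    events: List[EventSpec]  # events required to compute the metrics

def validate_event(spec: PurposeSpec, name: str, props: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the event matches the spec."""
    matches = [e for e in spec.events if e.name == name]
    if not matches:
        return [f"unknown event: {name}"]
    return [f"missing property: {p}"
            for p in matches[0].required_props if p not in props]

spec = PurposeSpec(
    goal="Grow activated signups",
    metrics=["signup -> activation conversion rate"],
    events=[EventSpec("signup_completed", ["user_id", "plan", "referrer"])],
)
print(validate_event(spec, "signup_completed", {"user_id": 42, "plan": "pro"}))
# -> ['missing property: referrer']
```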

Managing Modeling Debt using Metadata

By David Wallace, Senior Data Engineer @ Dutchie (previously @ GoodEggs)

Throughout his career, David has extensively relied on dbt as a framework for organizing data transformations with SQL.

At GoodEggs, the main challenge in rapidly scaling up dbt was that a large share (30%+) of production dbt models were outdated, duplicated, or unused, leading to a very bulky system. This “atrophy” – deterioration due to lack of use – also affected downstream data assets such as Mode reports and Jupyter Notebooks, creating bloat and hampering team productivity.

The GoodEggs team built a special pipeline in Dagster (an open-source data orchestrator) that identified vestigial data assets and deprecated them in an auditable manner.

Identifying and Removing Vestigial dbt Models

The automated process for deprecating stale data assets consists of four main steps (a sketch of step 2 follows the list):

  1. Link dbt models to Mode reports through query metadata
  2. Identify and remove dbt models with no downstream dependencies (vestigial)
  3. Deprecate data assets in a reversible manner with version control to allow restoring if needed
  4. Extend the same approach to identify other vestigial data assets
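
Below is a minimal sketch of step 2, assuming dbt’s target/manifest.json is available and that step 1 has already produced the set of model names referenced by Mode report queries. Exact manifest fields vary by dbt version ("nodes" and "child_map" are the relevant pieces here), and this is an illustration rather than the GoodEggs team’s actual Dagster pipeline.

```python
# Flag dbt models with no downstream dependencies: no children in the dbt
# DAG (per the manifest's child_map) and no references from Mode reports.
import json

def find_vestigial_models(manifest_path: str, used_by_mode: set) -> list:
    with open(manifest_path) as f:
        manifest = json.load(f)
    vestigial = []
    for unique_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue
        has_dbt_children = bool(manifest.get("child_map", {}).get(unique_id))
        if not has_dbt_children and node["name"] not in used_by_mode:
            vestigial.append(node["name"])
    return vestigial

# Hypothetical usage: model names parsed from Mode query metadata in step 1.
used = {"orders", "daily_revenue"}
print(find_vestigial_models("target/manifest.json", used))
```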

7 Habits of Highly Effective Business Intelligence Engineers

By Josh Temple, Analytics Engineer @ Spotify, Co-founder @ Spectacles

The 7 Habits

  1. Version Control: maintaining history for rollbacks; non-destructive experimentation and collaboration; version control of both code and content
  2. Testing: validating the contract between the data warehouse and BI tools; sanity checks based on historical values; ensuring data uniqueness; logic tests (see the sketch after this list)
  3. CI (Continuous Integration): automating tests before deployment (CD); code style governance / best practices for consistency; simplification of code reviews
  4. Code Review: continuous learning and knowledge sharing; providing constructive and empathetic feedback; catching bugs not caught by CI
  5. Documentation
  6. Extensibility
  7. Access Control
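
To make habit 2 concrete, here is a hedged sketch of two simple automated tests: a uniqueness check and a sanity check based on historically plausible values. sqlite3 stands in for the warehouse so the example runs anywhere; the table and column names are hypothetical.

```python
# Minimal data tests: uniqueness and a historical sanity range.
import sqlite3

def assert_unique(conn, table: str, column: str) -> None:
    dupes = conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()
    assert not dupes, f"{table}.{column} has duplicates: {dupes[:5]}"

def assert_in_range(conn, table: str, column: str, lo: float, hi: float) -> None:
    # Sanity check: values should fall within historically plausible bounds.
    mn, mx = conn.execute(f"SELECT MIN({column}), MAX({column}) FROM {table}").fetchone()
    assert lo <= mn and mx <= hi, f"{table}.{column} outside [{lo}, {hi}]: ({mn}, {mx})"

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 99.5)])
assert_unique(conn, "orders", "id")
assert_in_range(conn, "orders", "amount", 0.0, 10_000.0)
print("all checks passed")
```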

Alphabet Soup: ETL vs ELT for Data Pipelines

By Gary Sahota, Head of Analytics @ Clari – revenue operations platform

ETL: Extract, Transform, Load. The current industry norm. Data is sent through a series of transformations (cleaning, aggregating, blending), and the output is stored in the data warehouse for use in BI tools.

Pros

  • “Clean” data
  • Less data loaded into the warehouse
  • More cost-efficient

Cons

  • Rigid data pipelines
  • High-maintenance
  • Higher upfront costs

ELT: Extract, Load, Transform. A newer industry trend that has emerged in the last 4-5 years. Raw data is first stored in the warehouse and transformed afterwards, prior to downstream use in BI. Transformed tables live alongside the raw data, providing concurrent access to both (see the sketch after the pros and cons below).

Pros

  • Easy access to raw data
  • Low maintenance
  • Flexibility to adapt to future changes

Cons

  • “Dirty” data is present alongside the cleaned/transformed data
  • Large volume of data in the warehouse
  • Compliance challenges
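
To make the distinction concrete, here is a minimal Python sketch contrasting the two patterns. The “warehouse” is just a dict and the transform is trivial; the point is where the transform happens relative to the load, not the specifics.

```python
raw_rows = [{"amount": "12.5"}, {"amount": "99.0"}, {"amount": "bad"}]
warehouse = {}

def transform(rows):
    # Clean: drop unparseable amounts, cast the rest to float.
    out = []
    for r in rows:
        try:
            out.append({"amount": float(r["amount"])})
        except ValueError:
            continue
    return out

# ETL: transform first, then load only the cleaned output. Less data stored,
# but the raw rows are gone; a new transform requires re-extracting from source.
warehouse["orders_etl"] = transform(raw_rows)

# ELT: load raw data as-is, then build transformed tables alongside it.
# Raw rows stay queryable, so future transforms need no re-extraction.
warehouse["orders_raw"] = raw_rows
warehouse["orders_clean"] = transform(warehouse["orders_raw"])

print(len(warehouse["orders_etl"]), len(warehouse["orders_raw"]))  # 2 3
```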

Panel Discussion

Panelists

  • Gary Sahota, Head of Analytics @ Clari
  • Jillian Corkin, Developer Relations Advocate @ Fishtown Analytics
  • Zeeshan Qureshi, Engineering Manager @ Shopify
  • George Xing, ex-Head of Analytics @ Lyft
  • John Lee, Director, Product Analytics @ Thumbtack
  • Julia King, VP, Data & Analytics @ Carta
  • Gleb Mezhanskiy [Moderator], Co-founder & CEO @ Datafold

Questions

What is the biggest challenge that you have been recently facing as a Data Leader, and how has that changed over the last 3-4 years?

  • Hiring: expectations and skillsets around data tools have been rapidly evolving, and finding the right match has been a challenge on the people side
  • Managing organizational scalability: rapidly growing data teams need to leverage different skillsets while maintaining high efficiency across technical and non-technical roles
  • Complex customer demands: a better understanding of data by end users has led to greater demand for in-depth, complex, and customized data insights for decision-making

What are some of the hacks that your teams have implemented for reliable and high-quality self-serve analytics in your organizations?

  • Align data consumers and data producers: empower data users with the skillsets and toolkits for regular, standardized analytical problems, with Data Scientists serving as the support function for more unique and complex challenges
  • Alignment: leverage a robust data governance system to align the definitions and understanding of metrics across all teams. The challenge remains striking a balance between centralizing information and staying flexible enough to serve data users in the tools and interfaces they prefer.

For standardized analytics like funnels, what are some of the approaches and toolkits implemented by each of the teams?

  • Finding the right balance: there are lots of options, tools overlap in scope, and data teams are always eager to try new ones. Striking a balance between optimizing existing toolkits/resources and finding efficiency with new ones is a key aspect of growth.
  • Training: teaching non-technical team leads to navigate dashboards and BI tools for simpler, out-of-the-box, standardized information, while relying on data teams for newer and more complex insights and requests.

Given the modular nature of today’s data stacks, what are some of the tools that you decided to build in-house within your stack?

  • Value-based approach: take on a project if it adds significant value over a commercially available alternative, or if it needs significant customization to suit users’ needs. Also, look for opportunities to eventually “productize” tools initially built for internal use. The challenge remains accurately assessing and quantifying the bottom-line benefit of building in-house versus buying a commercially available tool.

RSVP for the upcoming Data Quality Meetup sessions here

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
