Digest of Data Quality Meetup #2

About the event

One of the biggest challenges for analytical teams today is monitoring and managing analytical data quality. We are bringing together professionals from data-driven teams and the open-source community to share and learn best practices around data quality & governance. The digest and recordings of the previous event are available here.

Our second online event took place on November 19th, and we are excited to share the key takeaways in the digest below. With 9 expert speakers & 155 live participants, we covered a lot of tooling & hard questions on the topic of Data Quality in just one hour.

Topics

  1. Automated testing of data pipelines: the best investment to make in 2021
  2. Version control for datasets and ML models: a new standard or just hype?
  3. ETL orchestration: can we do better than Airflow?
  4. Data catalogs: cost and impact analysis
  5. Lean Data Science: applying proven principles to overdeliver data projects

Lightning Talks

Data Testing

By Gleb Mezhanskiy, Co-founder & CEO @ Datafold

Slides

Three principles of effective data testing

  • Embed testing in existing workflows
  • Automate everything
  • Cut the noise

Data testing in production

Goal: detect issues as early and as far upstream as possible

Types of tests: Assertions & Metric monitoring

Assertions are "hard rules" for testing data at the value level or for verifying integrity, e.g. primary key uniqueness (see the example after the list below). Assertions can be:

  • Embedded in ETL tools: dbt for SQL & Dagster for general ETL
  • Standalone: great_expectations for SQL & deequ for Spark
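
As a rough illustration, here is what a primary-key assertion might look like with great_expectations' pandas API (a minimal sketch; the file and column names are hypothetical):

    import great_expectations as ge

    # wrap a pandas dataframe so that expectation methods become available
    orders = ge.read_csv("orders.csv")  # hypothetical extract of an orders table

    # assert primary-key integrity: order_id must be unique and non-null
    orders.expect_column_values_to_be_unique("order_id")
    orders.expect_column_values_to_not_be_null("order_id")

    # evaluate every expectation registered above in one pass
    print(orders.validate())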

Metric monitoring is useful for tracking metrics such as row count, totals, or averages in a dataset. Given the natural variance and seasonality in data, it is best to use time-series ML models to reduce the noise. Tools for metric monitoring include Prophet by Facebook and Datafold Alerts, which provides a turnkey experience.
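
To make the idea concrete, here is a sketch of metric monitoring with Prophet (the metric history and observed value are made up; the package shipped as fbprophet at the time and is published as prophet in newer releases):

    import pandas as pd
    from fbprophet import Prophet  # published as "prophet" in newer releases

    # daily row counts of a table; Prophet expects columns named ds and y
    history = pd.DataFrame({
        "ds": pd.date_range("2020-01-01", periods=90, freq="D"),
        "y": [100_000 + i * 50 for i in range(90)],  # hypothetical counts
    })

    model = Prophet()  # fits trend plus seasonality to the time series
    model.fit(history)

    # forecast one day ahead and flag the observed value if it falls
    # outside the model's uncertainty interval
    forecast = model.predict(model.make_future_dataframe(periods=1)).iloc[-1]

    observed = 80_000  # today's actual row count (hypothetical)
    if not (forecast["yhat_lower"] <= observed <= forecast["yhat_upper"]):
        print(f"Anomalous row count: {observed}, expected ~{forecast['yhat']:.0f}")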

Data testing in development

Goal: do no harm – prevent breaking things that work

Types of tests: Assertions & Data diff

Data diff is a type of test that compares two versions of a dataset on both values and distributions. It is helpful for automating regression testing since it provides full visibility into the changes made. Tools for data diffing include dbt-audit-helper, Big Diffy by Spotify (both CLI-based), and Datafold Diff (a turnkey experience with a rich UI).
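
The underlying idea can be shown with a toy pandas comparison of two table versions (this is only an illustration of the concept, not how the tools above are implemented; all names are made up):

    import pandas as pd

    # two versions of the same table: production vs. a development branch
    before = pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 5.00]})
    after = pd.DataFrame({"order_id": [2, 3, 4], "amount": [24.50, 6.00, 12.00]})

    # align the two versions on the primary key
    merged = before.merge(after, on="order_id", how="outer",
                          suffixes=("_before", "_after"), indicator=True)

    removed = merged[merged["_merge"] == "left_only"]   # rows lost by the change
    added = merged[merged["_merge"] == "right_only"]    # rows introduced
    changed = merged[(merged["_merge"] == "both")
                     & (merged["amount_before"] != merged["amount_after"])]

    print(f"{len(removed)} removed, {len(added)} added, {len(changed)} changed")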

Data Discoverability @ SpotHero

By Maggie Hays, Senior Product Manager, Data Services @ SpotHero

Slides

Data discoverability problem at SpotHero

  • Data lineage is difficult to discover and navigate, regardless of role or tenure
  • Difficult to discover what data exists and/or what it represents
  • Confidence in data accuracy is neutral, with room for improvement

Data Catalog Evaluation

DataHub POC effort

  • Research & Tool Evaluation: 180 hrs
  • Initial Rollout of DataHub POC: 300 hrs
  • Looker & Kafka Metadata Ingestion & Lineage: est. 160 hrs

Building Reliable Data Apps with Dagster

By Max Gasner, Co-author of Dagster – a data orchestration framework for ETL

Slides

Data Apps – graphs of computations that consume and produce data assets. They are complex and heterogeneous, and developing and managing them is hard.

Data Application Lifecycle

  1. Develop & Test
  2. Deploy & Execute
  3. Operate & Observe

Dagster design principles

Functional data engineering

  • DAGs should be typed for faster development cycles (see the sketch after this list)
  • External resources should be isolated from business logic, so they can be mocked in test
  • Composable, configurable and reusable units of business logic
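
A minimal sketch of what these principles look like in Dagster's solid/pipeline API from around the time of the talk (later releases renamed these concepts to op/job; the solids below are hypothetical):

    from typing import List
    from dagster import execute_pipeline, pipeline, solid

    @solid
    def load_order_amounts(context) -> List[float]:
        # a real solid would pull from an external resource (warehouse, API),
        # which Dagster lets you swap out for a mock in tests
        return [9.99, 24.50, 5.00]

    @solid
    def total_revenue(context, amounts: List[float]) -> float:
        return sum(amounts)

    @pipeline
    def revenue_pipeline():
        # the type annotations give the DAG typed edges, so mis-wired
        # inputs fail fast instead of deep inside a production run
        total_revenue(load_order_amounts())

    if __name__ == "__main__":
        assert execute_pipeline(revenue_pipeline).success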

The orchestrator is a platform

  • Pluggable deployment locally and in a wide range of production environments
  • User code isolation so that individual errors can't take down prod
  • Fine-grained control over scheduling and execution

Use the graph

  • Unified logging and monitoring
  • Link assets to pipeline runs in catalog
  • Longitudinal views of asset metadata

Achieving Reproducibility with Unstructured Data Using DVC

By Dmitry Petrov, Co-author of Data Version Control framework & CEO @ Iterative.ai

Slides

Principles of Reproducibility in ML pipelines

  • Central storage for all data artifacts (data files, models, metrics)
  • Decouple data from code. Use dataset metafiles rather than reading data files from code directly (see the sketch after this list).
  • Be metrics-driven. Version metrics in Git alongside the data and the code.
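
As a sketch of what decoupling data from code can look like with DVC's Python API (the repo URL, file path, and tag are hypothetical):

    import dvc.api

    # read a dataset by Git revision instead of a hard-coded local path;
    # the tag pins code, data, and metrics versions together
    with dvc.api.open(
        "data/train.csv",
        repo="https://github.com/example/ml-project",
        rev="v1.0",
    ) as f:
        train_csv = f.read()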

Lean Data Science

By Michael Kaminsky, Data Strategist & Entrepreneur, Co-author of Locally Optimistic – blog on data organizations

Slides

Lean Data Science is a simple technique data teams can apply to consistently deliver high-impact projects amid naturally high uncertainty. It is based on 3 things:

  • Measure business outcomes, not model performance
  • Ship early and often
  • Embrace failure and accept good enough

Further reading on the topic can be found in Michael's post.

Panel Discussion

Panelists

  • Gleb Mezhanskiy [Moderator], Co-founder & CEO @ Datafold
  • Maggie Hays, Senior Product Manager, Data Services @ SpotHero
  • Mars Lan, Co-author of DataHub – open-source data catalog from LinkedIn
  • Tobias Macey, Engineering Manager, MIT, Author of Data Engineering Podcast
  • Scott Breitenother, Co-author of Locally Optimistic blog, founder of Brooklyn Data Co
  • Emilie Schario, Senior Engineering Manager @ Netlify

Questions

How should a data team manage data quality? Where to start?
  • Understand your data: where it comes from and what people are doing with it. That is where you will find the information you need to start data testing.
  • Agree on how to define data across the team.
  • Build a culture of caring about data quality, and validate data quality as far upstream as possible.

How do you properly assign ownership of data quality? Is organizational structure important for ensuring data quality?
  • Try things as you build your data team: see how something works; if it stops working, change it; see how long that works, and change it again. Feel free to iterate and find out what works for your team.
  • Define what the different functions are. Establish a clear definition of what a Data Engineer does on your team and how it differs from what a Data Scientist does, so that experts can focus on the things they are good at.
  • Distinguish what the actual roles and responsibilities are and what the scope of their work is. It is also interesting to watch how the interface between these roles evolves.

Assuming that the world of data transformations and ETL converges on tools that cover the end-to-end process (e.g. Dagster), is there still a place for a data catalog? And is it most helpful to very large teams or, vice versa, to small teams?
  • ETL tools will always have limited scope: they provide lineage, but only within their own boundaries. We need observability solutions that cover the entire data stack.
  • We are at a stage where creating and storing data keeps getting easier and cheaper. It is easy to get more data, but then what are you doing with it? There are new costs associated with it – intellectual overhead. We need a solution that quickly ties all of these separate pieces together so that you capture them early. A data catalog is therefore a critical piece.

All slides are available on SlideShare.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
