Digest of Data Quality Meetup #2

Gleb Mezhanskiy

November 26, 2020

About the event

One of the biggest challenges for analytical teams today is monitoring and managing analytical data quality. We are bringing together professionals from data-driven teams and the open-source community to share and learn best practices around data quality & governance. The digest and recordings of the previous event are available here.

Our second online event happened on November 19th, and we are excited to share the key takeaways in the digest below. With 9 expert speakers & 155 live participants, in just one hour, we covered a lot of tooling & hard questions on the topic of Data Quality.

Topics

  1. Automated testing of data pipelines: the best investment to make in 2021
  2. Version control for datasets and ML models: a new standard or a hype?
  3. ETL orchestration: can we do better than Airflow?
  4. Data catalogs: cost and impact analysis
  5. Lean Data Science: applying proven principles to overdeliver data projects

Lightning Talks

Data Testing

By Gleb Mezhanskiy, Co-founder & CEO @ Datafold (Slides)

Three principles of effective data testing

  • Embed testing in existing workflows
  • Automate everything
  • Cut the noise

Data testing in production

Goals: detect issues as early and as upstream as possible

Types of tests: Assertions & Metric monitoring

Assertions are "hard rules" for testing data at the value level or for verifying integrity, e.g. primary key uniqueness. Assertions can be:

  • Embedded in ETL tools: dbt for SQL & Dagster for general ETL
  • Standalone: great_expectations for SQL & deequ for Spark

Metric monitoring is useful for tracking metrics such as row counts, totals or averages in a dataset. Given the natural variance and seasonality in data, it's best to use time series ML models to reduce the noise. Tools for metric monitoring include Prophet by Facebook and Datafold Alerts, which provides a turnkey experience.
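To make the two test types concrete, here is a minimal sketch in plain Python. It is not taken from the talk and does not use any of the tools named above. The function names, the `order_id` column and the sample numbers are all hypothetical. The metric check uses a plain z-score against recent history as a stand-in; it is not a time series model and ignores seasonality.

```python
from collections import Counter
from statistics import mean, stdev


def find_duplicate_keys(rows, key):
    """Assertion helper: return key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def is_anomalous(history, latest, threshold=3.0):
    """Metric check: flag `latest` if it lies more than `threshold` standard
    deviations from the mean of `history` (simple z-score, no seasonality)."""
    if len(history) < 2:
        return False  # not enough history to estimate variance
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical sample data: the second and third rows share a primary key.
rows = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 2, "amount": 35.5},
]
duplicates = find_duplicate_keys(rows, "order_id")

# Hypothetical daily row counts for the last week.
daily_row_counts = [1000, 1020, 980, 1010, 995, 1005, 990]
normal_day = is_anomalous(daily_row_counts, 1003)
broken_day = is_anomalous(daily_row_counts, 400)
```

The assertion either passes or fails on exact rules, while the metric check tolerates day-to-day variance and only fires on large deviations, which is what keeps the noise down.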

Data testing in development

Goal: do no harm – prevent breaking things that work

Types of tests: Assertions & Data diff

Data diff is a tool that compares datasets on both values and distributions. It can be helpful for automating regression testing since it provides full visibility into the changes made. Tools for data diffing include dbt-audit-helper, Big Diffy by Spotify (both CLI-based) and Datafold Diff (turnkey experience with rich UI).
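As an illustration of the idea, the sketch below diffs two small in-memory datasets by primary key. It is an assumption-laden toy, not how any of the tools above work: `diff_datasets`, the `id` key and the sample rows are hypothetical, and it covers only the value-level comparison, leaving out distributions.

```python
def diff_datasets(before, after, key):
    """Compare two lists of row dicts, matching rows on the `key` column."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}

    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())

    # For keys present on both sides, list the columns whose values differ.
    changed = {}
    for k in sorted(old.keys() & new.keys()):
        columns = sorted(
            c for c in old[k].keys() | new[k].keys()
            if old[k].get(c) != new[k].get(c)
        )
        if columns:
            changed[k] = columns

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical example: the production table vs. the output of a dev branch.
prod = [
    {"id": 1, "status": "paid", "amount": 10.0},
    {"id": 2, "status": "open", "amount": 5.0},
    {"id": 3, "status": "paid", "amount": 7.5},
]
dev = [
    {"id": 1, "status": "paid", "amount": 10.0},
    {"id": 2, "status": "closed", "amount": 5.0},
    {"id": 4, "status": "open", "amount": 1.0},
]
report = diff_datasets(prod, dev, "id")
```

Running a comparison like this on every code change shows exactly which rows and columns a change touches before it reaches production.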

Data Discoverability @ SpotHero

By Maggie Hays, Senior Product Manager, Data Services @ SpotHero (Slides)

Data discoverability problem at SpotHero

  • Data lineage is difficult to discover and navigate, regardless of role or tenure
  • Difficult to discover what data exists and/or what it represents
  • Confidence in data accuracy is neutral, with room for improvement

Data Catalog Evaluation

DataHub POC effort

  • Research & Tool Evaluation: 180 hrs
  • Initial Rollout of DataHub POC: 300 hrs
  • Looker & Kafka Metadata Ingestion & Lineage: est. 160 hrs

Building Reliable Data Apps with Dagster

By Max Gasner, Co-author of Dagster – data orchestration framework for ETL (Slides)

Data Apps – graphs of computations that consume and produce data assets. They are complex and heterogeneous, so developing and managing them is hard.

Data Application Lifecycle

  1. Develop & Test
  2. Deploy & Execute
  3. Operate & Observe

Dagster design principles

Functional data engineering

  • DAGs should be typed for faster development cycles
  • External resources should be isolated from business logic, so they can be mocked in test
  • Composable, configurable and reusable units of business logic
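The second principle, isolating external resources from business logic, can be sketched in plain Python. This deliberately does not use Dagster's API. `Warehouse`, `FakeWarehouse` and `total_revenue` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Protocol


class Warehouse(Protocol):
    """Hypothetical interface for an external resource (e.g. a database)."""

    def fetch_orders(self) -> List[Dict]: ...


def total_revenue(warehouse: Warehouse) -> float:
    """Business logic: depends only on the interface, not on a real connection."""
    return sum(order["amount"] for order in warehouse.fetch_orders())


class FakeWarehouse:
    """In-memory stand-in that replaces the production resource in tests."""

    def __init__(self, orders: List[Dict]) -> None:
        self._orders = orders

    def fetch_orders(self) -> List[Dict]:
        return list(self._orders)


# In a test, the fake is passed in where production code would pass a real client.
fake = FakeWarehouse([{"amount": 10.0}, {"amount": 2.5}])
revenue = total_revenue(fake)
```

Because the business logic receives its resource as an argument, the same function runs unchanged against a fake in tests and a real warehouse client in production.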

The orchestrator is a platform

  • Pluggable deployment locally and in a wide range of production environments
  • User code isolation so that individual errors can't take down prod
  • Fine-grained control over scheduling and execution

Use the graph

  • Unified logging and monitoring
  • Link assets to pipeline runs in catalog
  • Longitudinal views of asset metadata

Achieving Reproducibility With Unstructured Data using DVC

By Dmitry Petrov, Co-author of Data Version Control framework & CEO @ Iterative.ai (Slides)

Principles of Reproducibility in ML pipelines

  • Central storage for all data artifacts (data files, models, metrics)
  • Decouple data from code. Use dataset metafiles. Do not read data from code directly.
  • Be metrics driven. Version metrics in Git alongside the data and the code.
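The metafile idea can be illustrated with a small sketch: a tiny text file records a content hash of the data, and that file (not the data itself) is what would be committed to Git. This is not DVC's file format or API. The `.meta.json` suffix, the field names and both helper functions are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_metafile(data_path: Path) -> Path:
    """Write a small JSON metafile that identifies the data by content hash."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    meta = {"path": data_path.name, "sha256": digest}
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path


def matches_metafile(data_path: Path, meta_path: Path) -> bool:
    """Check whether the data on disk is the version the metafile describes."""
    meta = json.loads(meta_path.read_text())
    current = hashlib.sha256(data_path.read_bytes()).hexdigest()
    return current == meta["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("id,label\n1,cat\n2,dog\n")

    meta_file = write_metafile(data)  # this small file is what Git would track
    unchanged = matches_metafile(data, meta_file)

    data.write_text("id,label\n1,cat\n2,bird\n")  # the data changes silently
    change_detected = not matches_metafile(data, meta_file)
```

Since the metafile is small and text-based, it can be versioned with the code, so each commit pins the exact data version it was run against.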

Lean Data Science

By Michael Kaminsky, Data Strategist & Entrepreneur, Co-author of Locally Optimistic – blog on data organizations (Slides)

Lean Data Science is a simple technique that data teams can apply to deliver high-impact projects consistently despite naturally high uncertainty. It is based on three principles:

  • Measure business outcomes, not model performance
  • Ship early and often
  • Embrace failure and accept good enough

Further reading on the topic in Michael's post.

Panel Discussion

Panelists (from left to right)

  • Gleb Mezhanskiy [Moderator], Co-founder & CEO @ Datafold
  • Maggie Hays, Senior Product Manager, Data Services @ SpotHero
  • Mars Lan, Co-author of DataHub – open-source data catalog from LinkedIn
  • Tobias Macey, Engineering Manager, MIT, Author of Data Engineering Podcast
  • Scott Breitenother, Co-author of Locally Optimistic blog, founder of Brooklyn Data Co
  • Emilie Schario, Senior Engineering Manager @ Netlify

Questions

How to manage data quality for a data team? Where to start?

  • Understand your data: where it comes from and what people are doing with it. This gives you the information you need to start data testing.
  • Agree on how to define data across the team
  • Build a culture of caring about data quality, and validate the quality of the data as far upstream as possible

How to properly instrument ownership of data quality? Do you think that organization structure is important for ensuring data quality?

  • Try something as you build your data team and see how it works. If it stops working, change it, see how long that works, then change it again. Feel free to iterate and find out what works for your team.
  • Better define what the different types of functions are. Establish a clear definition of what a Data Engineer does in your team and how it differs from what a Data Scientist does. That way, experts can focus on the things they are good at.
  • Distinguish what the actual roles and responsibilities are and what the scope of their work is. It is also interesting to see how the interface between these different roles evolves.

Assuming that the world of data transformations and ETL is going to converge on tools that cover the end-to-end process (e.g. Dagster), is there a place for a data catalog? And is it most helpful to very large teams or, vice versa, to small teams?

  • ETL tools will always have limited scope: they will provide lineage but only within their own scope. We need observability solutions for the entire data stack.
  • We are at a stage where the creation and storage of data is only getting easier and cheaper. It is easy to get more data, but then what are you doing with it? There are new costs associated with it – intellectual overhead. We need a solution that helps us quickly tie all of these separate pieces together so that you capture them early. A data catalog is, therefore, a critical piece.

All slides are available on SlideShare.