One of the biggest challenges for analytical teams today is monitoring and managing the quality of analytical data. We bring together professionals from data-driven teams and the open-source community to share and learn best practices around data quality & governance. A digest and recordings of the previous event are available here.
Our second online event took place on November 19th, and we are excited to share the key takeaways in the digest below. With 9 expert speakers and 155 live participants, we covered a lot of tooling and hard questions on the topic of data quality in just one hour.
Automated testing of data pipelines: the best investment to make in 2021
Version control for datasets and ML models: a new standard or a hype?
ETL orchestration: can we do better than Airflow?
Data catalogs: cost and impact analysis
Lean Data Science: applying proven principles to overdeliver data projects
By Gleb Mezhanskiy, Co-founder & CEO @ Datafold | Slides
Three principles of effective data testing
Embed testing in existing workflows
Cut the noise
Data testing in production
Goals: detect issues as early and as upstream as possible
Types of tests: Assertions & Metric monitoring
Assertions are "hard rules" for testing data at the value level or for verifying integrity, e.g. primary key uniqueness. Assertions can be:
• Embedded in ETL tools: dbt for SQL & Dagster for general ETL
• Standalone: great_expectations for SQL & deequ for Spark
Metric monitoring is useful for tracking metrics such as row count, totals, or averages in a dataset. Given the natural variance and seasonality in data, it is best to use time-series ML models to reduce the noise. Tools for metric monitoring include Prophet by Facebook and Datafold Alerts, which provides a turnkey experience.
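As a minimal illustration of the two test types (plain Python, not tied to any of the tools above; all function names are hypothetical), an assertion might fail hard on a broken invariant, while a metric monitor flags statistical outliers:

```python
def assert_primary_key_unique(rows, key):
    """Assertion: fail hard if any primary-key value repeats."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    if duplicates:
        raise AssertionError(f"duplicate {key} values: {sorted(duplicates)}")

def row_count_anomaly(history, today, threshold=3.0):
    """Metric monitoring: flag today's row count if it deviates from the
    historical mean by more than `threshold` standard deviations.
    (Real tools use time-series models to handle trend and seasonality.)"""
    mean = sum(history) / len(history)
    variance = sum((x - mean) ** 2 for x in history) / len(history)
    std = variance ** 0.5 or 1.0  # avoid division by zero on flat history
    return abs(today - mean) / std > threshold
```

Note the difference in failure mode: the assertion stops the pipeline, while the monitor only raises a signal for a human (or alerting tool) to triage.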
Data testing in development
Goal: do no harm – prevent breaking things that work
Types of tests: Assertions & Data diff
Data diff is a tool that compares datasets on both values and distributions. It is helpful for automating regression testing since it provides full visibility into the changes made. Tools for data diffing include dbt-audit-helper and Big Diffy by Spotify (both CLI-based), as well as Datafold Diff (a turnkey experience with a rich UI).
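The core idea of a value-level data diff can be sketched in a few lines of plain Python (illustrative names only; the tools above do this at scale, in SQL or on Spark): join two row sets on a key and report which keys were added, removed, or changed.

```python
def data_diff(before, after, key):
    """Compare two row sets keyed by `key` and summarize the differences."""
    b = {row[key]: row for row in before}
    a = {row[key]: row for row in after}
    return {
        "added": sorted(a.keys() - b.keys()),      # keys only in `after`
        "removed": sorted(b.keys() - a.keys()),    # keys only in `before`
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Run against the output of the current pipeline and the output of a proposed change, an empty diff is evidence the change is a safe refactor; a non-empty diff tells you exactly which rows to inspect.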
Data Discoverability @ SpotHero
By Maggie Hays, Senior Product Manager, Data Services @ SpotHero | Slides
Data discoverability problem at SpotHero
Data lineage is difficult to discover and navigate, regardless of role or tenure
Difficult to discover what data exists and/or what it represents
Confidence in data accuracy is neutral, with room for improvement
Data Catalog Evaluation
DataHub POC effort
Research & Tool Evaluation: 180 hrs
Initial Rollout of DataHub POC: 300 hrs
Looker & Kafka Metadata Ingestion & Lineage: est. 160 hrs
Building Reliable Data Apps with Dagster
By Max Gasner, Co-author of Dagster, a data orchestration framework for ETL | Slides
Data apps are graphs of computations that consume and produce data assets. They are complex and heterogeneous, and developing and managing them is hard.
Data application lifecycle:
1. Develop & Test
2. Deploy & Execute
3. Operate & Observe
Dagster design principles
Functional data engineering
DAGs should be typed for faster development cycles
External resources should be isolated from business logic, so they can be mocked in test
Composable, configurable and reusable units of business logic
The orchestrator is a platform
Pluggable deployment locally and in a wide range of production environments
User code isolation so that individual errors can't take down prod
Fine-grained control over scheduling and execution
Use the graph
Unified logging and monitoring
Link assets to pipeline runs in catalog
Longitudinal views of asset metadata
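The resource-isolation principle above can be sketched in plain Python (illustrative names, not Dagster's actual API): business logic depends only on an injected resource, so production code receives a real warehouse client while tests substitute an in-memory fake.

```python
class InMemoryWarehouse:
    """Fake resource for tests: records writes instead of hitting a real warehouse."""
    def __init__(self):
        self.tables = {}

    def load(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)

def daily_totals(rows, warehouse):
    """Business logic: aggregate amounts per day, then hand the result to
    whatever `warehouse` resource was injected (real in prod, fake in test)."""
    totals = {}
    for row in rows:
        totals[row["day"]] = totals.get(row["day"], 0) + row["amount"]
    warehouse.load("daily_totals",
                   [{"day": d, "total": t} for d, t in sorted(totals.items())])
    return totals
```

Because the aggregation never imports a database driver, an individual logic error can be caught in a fast unit test rather than in production, which is the point of keeping external resources out of business logic.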
Achieving Reproducibility With Unstructured Data using DVC
How do you manage data quality for a data team? Where should you start?
Understand your data: where it comes from and what people are doing with it. There you will find the information you need to start data testing.
Agree on how to define data across the team
Build a culture of caring about data quality. Validate the quality of the data as far upstream as possible.
How do you properly assign ownership of data quality? Do you think organizational structure is important for ensuring data quality?
As you are building your data team, try something and see how it works; if it stops working, change it, see how long that works, and then change it again. Feel free to iterate and find out what works for your team.
Clearly define the different functions. Establish a clear definition of what a Data Engineer does on your team and how it differs from what a Data Scientist does, so you can have experts focused on the things they are good at.
Distinguish what the actual roles and responsibilities are and what the scope of their work is. It is also interesting to watch how the interface between these different roles evolves.
Assuming that the world of data transformations and ETL converges on tools that cover the end-to-end process (e.g. Dagster), is there still a place for a data catalog? And is it most helpful to very large teams or, vice versa, to small teams?
ETL tools will always have limited scope: they provide lineage, but only within their own boundaries. We need observability solutions that span the entire data stack.
We are at a stage where creating and storing data keeps getting easier and cheaper. It is easy to get more data, but then what do you do with it? There are new costs associated with it: intellectual overhead. We need a solution that quickly ties all of these separate pieces together so that you capture them early. A data catalog is, therefore, a critical piece.