Data quality is your moat; this is your guide.

Fortify your data, fortify your business: Why high-quality data is your ultimate defense.

You need data quality metrics, but which ones?

Published May 28, 2024

Data folks know that the central challenge, in whatever domain they work in, is not necessarily acquiring the right tools or having the perfect data stack. The hard part is knowing how to connect all of those tools, dashboards, and processes to produce high-quality, actionable data.

Let’s leave questions about the right tools and processes aside for now and consider instead: how do data and analytics engineers connect the dots between data quality efforts and bottom-line results? 

To measure how raising data quality impacts, well, everything else in the business, from revenue to customer satisfaction to operational efficiency, you need some kind of metric to measure any changes from your baseline. Which metric? How many? What threshold do you consider success or failure? How do you know what’s normal variation, versus something to be investigated?

You can use the eight data quality dimensions to help understand the different ways in which your data is of high quality (or not), but leveraging quantitative data quality measures is essential to assess the quality and usability of your data.

Data quality management practices and data quality tools that don't provide meaningful data quality metrics don't undermine your process. They do something far worse: they leave you without a measurable path to improvement. The vast majority of data quality tools produce metrics that are aggregates. They provide a great high-level overview of data quality, but they don't drill down to something granular enough to identify specific discrepancies or to create meaningful data quality KPIs (key performance indicators) that capture the impact on business intelligence.

A measurable path to improvement requires tracking the right metrics for the right purposes. In our search for impactful data quality metrics, we found four that every organization should prioritize. Let's now turn to how each one sets a concrete benchmark against which you can evaluate both your data quality and its direct impact on a company's data analytics work.

Common data quality issues

Below are some of the issues data teams experience that lead to poor data quality. (This is not a complete list by any means; every organization has data, and data quality problems, that are unique to it, but these are common ones we see across many data teams.)

Bad source data

Data teams face incredible frustration when the data they receive from their raw data sources (e.g., ad platforms, backend databases, CRMs) contains incorrect records or is altered during the data replication/movement process. When ETL pipelines break or source data has considerable issues, this often looks like missing records, columns with excessive or unexpected null values, or rows with incorrect data values.
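As a quick illustration, a simple SQL check can surface two of those symptoms, excessive null values and missing records, in a freshly loaded source table. The table, column, and threshold below are hypothetical; adjust them to your own warehouse.

```sql
-- Hypothetical raw table loaded from a CRM source.
-- Flags load days where more than 5% of email values are null (the threshold is arbitrary),
-- and reports that day's row count alongside so unexpected drops stand out.
select
    date_trunc('day', loaded_at) as load_date,
    count(*)                     as row_count,
    count(*) - count(email)      as null_emails
from raw.crm_contacts
group by 1
having (count(*) - count(email)) * 100.0 / count(*) > 5
order by load_date desc;
```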

Poor data freshness

A key indicator of good data quality and overall data health is freshness: is the data in the warehouse when teams need to transform it and eventually expose it to business users? A lack of timely data can occur for several reasons (a simple freshness check is sketched after this list):

  • Delayed or broken ETL pipelines responsible for loading data into your data warehouse
  • Backend databases or source systems that are experiencing outages or downtime
  • Simply not running your data pipelines enough (which, to be fair, can be a costly thing to do)
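A minimal freshness check might look like the query below. It assumes the loading tool stamps each row with a loaded_at timestamp; the table name and the two-hour threshold are illustrative.

```sql
-- Returns a row only when the newest record is older than the freshness threshold,
-- which makes the query easy to wire into an alert.
select
    max(loaded_at)                     as latest_load,
    current_timestamp - max(loaded_at) as staleness
from raw.orders
having max(loaded_at) < current_timestamp - interval '2 hours';
```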

Bad data transformations

Data and analytics engineers spend much of their time transforming raw data, using software like dbt or stored procedures, to produce usable data models and metrics for different use cases (e.g., BI reporting, business decisions, machine learning models). It's at this stage that data engineers use SQL to clean up their data and apply business logic to it, creating meaningful datasets that can be consumed downstream. However, because humans are imperfect, data quality standards are often lacking, and data is often messy and complicated (see pain points 1 and 2 above), data people make errors too. Introducing bad code changes that accidentally lead to data quality problems, and the inevitable firefighting that happens afterward, is a major time sink for many practitioners.
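For a sense of where those errors creep in, consider the hypothetical dbt-style model below. A subtle mistake in the business logic, say the wrong join type or a mis-stated discount rule, would quietly propagate bad values downstream.

```sql
-- Hypothetical dbt-style model that applies business logic to raw orders.
-- Swapping the left join for an inner join would silently drop orders without a
-- matching customer record, exactly the kind of error that slips into production.
select
    o.order_id,
    o.customer_id,
    o.ordered_at,
    o.amount_usd,
    case
        when c.segment = 'enterprise' then o.amount_usd * 0.90  -- assumed 10% contractual discount
        else o.amount_usd
    end as billed_amount_usd
from raw.orders as o
left join raw.customers as c
    on o.customer_id = c.customer_id
```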

Data quality testing methodologies

Common data quality pain points 1 and 2 above are often out of the hands of data teams, so the focus of these proactive data quality metrics is on pain point 3 (where data teams have the most control): creating data quality practices and rules that prevent data quality issues from arising in your data transformation process.

When a data engineer is modifying a dbt SQL model to reflect updated business rules, how do they ensure their code changes won't lead to inaccurate data or broken downstream dashboards? There's usually a mix of methods used to perform a data quality assessment on proposed code changes.

Ad hoc data quality testing

Write and run manual, ad hoc SQL tests (e.g., naive row counts or aggregates compared between the production and development versions of your tables). Data teams using dbt often rely on manually defined dbt tests on their models to assess important data quality checks like uniqueness, non-nullness, referential integrity, and more. Testing like this usually only finds issues with your "known unknowns" (things you already know must be true about the data) and often misses edge cases or untested assertions.
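A typical ad hoc check might compare naive row counts between the production and development builds of a model, something like the sketch below (schema and table names are placeholders).

```sql
-- Quick-and-dirty comparison of row counts between the prod and dev builds of a model.
-- Catches gross differences, but says nothing about row-level value changes.
select 'prod' as build, count(*) as row_count from analytics_prod.dim_customers
union all
select 'dev'  as build, count(*) as row_count from analytics_dev.dim_customers;
```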

No testing, ship it and pray!

The scariest, and surprisingly common, testing method: do nothing! Teams may eyeball the data for any glaring issues, then ship it and hope a stakeholder doesn't come knocking on their door soon. This is obviously a very risky way of incorporating code changes into a central database, but it is often done for small and seemingly innocent code changes.

Implement testing during CI

Implement robust data quality rules and practices directly in the continuous integration (CI) process for all code changes. This establishes a baseline for data governance by ensuring every proposed code change undergoes the same level of data quality testing and review. We think this is a strong way to ensure high data accuracy, save developer time, and prevent data quality issues from arising in your transformation process in the first place.
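One common pattern, sketched below under the assumption that your CI job runs assertion queries that must return zero rows, is to block a merge when the development build violates a rule such as primary-key uniqueness. The schema, table, and column names are placeholders.

```sql
-- Assertion query for CI: it should return zero rows.
-- If any order_id appears more than once in the dev build, the CI job fails the pull request.
select
    order_id,
    count(*) as occurrences
from analytics_dev.fct_orders
group by order_id
having count(*) > 1;
```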

Wait for data observability tools to catch issues

Some teams may opt to merge code changes into production, and wait to see if a data quality monitoring system like a data observability tool picks up on any unexpected anomalies or issues. Many data observability tools will support basic data quality dashboards that give an overview of your data quality, like data freshness or anomalies identified.

Aggregates or value-based data quality metrics?

Good data quality depends on being able to predict and fix problems before they hit production data and systems. Being proactive means not just being watchful, but also knowing what metrics to keep an eye on that serve as early warning signals of data quality issues.

As you can see, there are a variety of ways data teams ensure high-quality data. Which is the most proactive way to avoid data quality issues in the first place? What key metrics should a data team be looking at to understand the impact of their code changes and the quality of their data? In considering the challenge of building useful data quality metrics, we often draw on one of two primary approaches: aggregate metrics or value-based metrics.

Aggregates don't prevent data quality problems

A proactive approach to data quality requires first examining your data quality measures before merging any code or data changes to production. But data teams often rely on manual calculations or aggregates, which don't usually cut it. Common data quality KPIs or measures look at things like the following (a sketch of how such aggregates are typically computed follows this list):

  • Data completeness (percent of missing/incomplete records)
  • Data accuracy (percent of accurate data entries)
  • Data consistency (degree to which data values are consistent)
  • Data timeliness (time elapsed between data capture and data use)
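Each of these rolls up into a single number for the whole table. Here is one way such aggregate KPIs are often computed in SQL; the table and column names are illustrative.

```sql
-- Aggregate data quality KPIs for a hypothetical customer dimension.
-- Each metric collapses into one percentage for the entire table.
select
    round(100.0 * count(email) / count(*), 2) as completeness_pct,  -- share of rows with a non-null email
    round(100.0 * count(case when updated_at >= current_date - interval '1 day' then 1 end)
                / count(*), 2)                as timeliness_pct     -- share of rows refreshed in the last day
from analytics_prod.dim_customers;
```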

There's no question that they form an important baseline for comparison when it comes to data quality monitoring. While they certainly good start to any data quality assessment, they hide far too much and bury the signal within a lot of noise.

Lack of granularity

These aggregate data quality KPIs may highlight a data quality issue but fail to pinpoint the exact nature or location of the problem. Without granularity, the problem remains opaque and resistant to any targeted intervention. Aggregated metrics might indicate a data completeness rate of 90%, suggesting high quality, but it's impossible to know whether the missing 10% comprises crucial data like customer contact details or financial transaction amounts, which would impact decision-making.

Limited insight into (unwanted) data transformations

Aggregates also don't reveal unwanted changes that data transformations make to individual data values. Unwanted transformations may distort the original meaning or context of individual data values, or worse, lead to non-compliance with regulatory requirements or internal data governance policies. What if a 95% data accuracy rate masks a transformation error? For instance, a decimal point shift in financial data might result in inaccurate profit calculations, leading to erroneous financial reports despite seemingly accurate aggregated metrics.

Inadequate for root cause analysis

Aggregated data quality metrics lack the depth required for thorough root cause analysis of data quality issues. Without detailed insights into individual data points, identifying the underlying causes of discrepancies becomes challenging. Your data might have a 99% consistency rate, but what if that figure fails to uncover inconsistencies in how a column is coded due to an outdated database synchronization process?

Inhibits efforts to repair and restore data quality

Aggregates mask varying severity levels of data errors. Imagine if your customer profiles database has a data completeness rate of 98%, which seems pretty good. But what if the 2% missing data only affects critical demographic information like age and income due to a glitch in the data import process? Without granularity, it's difficult to pinpoint specific data anomalies or errors, making it hard to take corrective action quickly and potentially prolonging any data quality issues.

Data quality metrics should be precise

You need a more granular data quality KPI to find the problems: value-level metrics. For data engineering's impact to be visible and understandable to the rest of the organization, you need to show what the individual data points look like and how they changed when you pushed new code changes (a minimal SQL sketch of such a value-level comparison follows this list):

  • Did the number of rows change?
  • Did any columns change type?
  • Did any rows with exclusive primary keys change? (rows with PKs in one table but not the other)
  • Did any of the individual data values of my rows change?
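Here's a minimal sketch of how you might answer most of these questions in plain SQL, assuming the production and development builds of a model share a customer_id primary key. The schema, table, and column names are placeholders, and dedicated data-diff tooling does this across every column at much larger scale.

```sql
-- Value-level diff between the prod and dev builds of the same model, joined on the primary key.
-- Classifies every key as added, removed, changed, or identical instead of reporting one aggregate.
select
    case
        when p.customer_id is null then 'added in dev'
        when d.customer_id is null then 'removed in dev'
        when p.email is distinct from d.email
          or p.lifetime_value is distinct from d.lifetime_value then 'values changed'
        else 'identical'
    end      as diff_status,
    count(*) as key_count
from analytics_prod.dim_customers as p
full outer join analytics_dev.dim_customers as d
    on p.customer_id = d.customer_id
group by 1
order by key_count desc;
```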

Why these four data quality metrics? Think of them as four vigilant knights patrolling the perimeter of your business' moat. Together, they monitor each individual data point to keep anomalies from slipping through and to support high data quality standards.

Value changes

These capture the overall impact of a code change on the dataset so you can quickly identify unexpected deletions, additions, or other modifications between the before and after versions of the data. Sudden fluctuations serve as early warning signals of poor data quality, prompting further investigation into potential issues with the underlying code or data processing pipelines. For instance, a sudden increase in deleted values may signal a data deletion issue, while unexpected additions could indicate data duplication problems.

Changes in column types

Alterations in column types can significantly affect data integrity and downstream processes. For example, a change from a numerical to a categorical data type could impact analytical models or reporting accuracy. By tracking changes in column types, data teams can proactively identify and address any discrepancies that may arise.
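One lightweight way to catch this, assuming your warehouse exposes the standard information_schema views, is to compare column types between the two builds directly. The schema and table names below are placeholders.

```sql
-- Compares column data types between the prod and dev builds of the same model.
-- Returns a row for any column whose type differs or that exists in only one build.
select
    coalesce(p.column_name, d.column_name) as column_name,
    p.data_type                            as prod_type,
    d.data_type                            as dev_type
from (
    select column_name, data_type
    from information_schema.columns
    where table_schema = 'analytics_prod' and table_name = 'dim_customers'
) as p
full outer join (
    select column_name, data_type
    from information_schema.columns
    where table_schema = 'analytics_dev' and table_name = 'dim_customers'
) as d
    on p.column_name = d.column_name
where p.data_type is distinct from d.data_type;
```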

Changes in rows with exclusive primary keys

Primary keys (PKs) serve as unique identifiers within a dataset, so rows whose PKs are exclusive to one version of a table are vital to catching discrepancies between datasets. You can flag missing or mismatched PKs, which suggest data synchronization issues or errors in data transformations. Addressing these discrepancies promptly is essential to preventing data inconsistencies and ensuring the accuracy and reliability of downstream analyses and decision-making processes.
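A simple way to list exclusive primary keys between two builds of the same table is a pair of anti-joins, as in the sketch below (names are again placeholders).

```sql
-- Lists primary keys that exist in exactly one of the two builds of the model.
select p.customer_id, 'only in prod' as location
from analytics_prod.dim_customers as p
left join analytics_dev.dim_customers as d
    on p.customer_id = d.customer_id
where d.customer_id is null

union all

select d.customer_id, 'only in dev' as location
from analytics_dev.dim_customers as d
left join analytics_prod.dim_customers as p
    on p.customer_id = d.customer_id
where p.customer_id is null;
```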

Conclusion

If you can diff data at the value level, it's easy to build and track the right metrics, laying the foundation for better data infrastructure. Just as a castle's defenses require constant vigilance (and maintenance) to remain effective, so too must organizations actively manage and monitor their data quality.

In the next section, we'll look at case studies from leading data engineering teams that have put the pieces together to build a culture of proactive data quality. Their examples showcase how implementing a cohesive data quality management strategy, with the right tools and metrics, leads to tangible results for the business.