Data quality is your moat; this is your guide.

Fortify your data, fortify your business: Why high-quality data is your ultimate defense.

What is data quality?

Published May 28, 2024

Google “what is data quality” and you’ll encounter plenty of definitions that are easy to understand but hard to translate to concrete implementations. 

As a concept, data quality is pretty abstract. We define data quality as a quantifiable measure of data's fitness for use.

At its core, data quality is about ensuring that data meets the needs and expectations of its users. People are at the heart of data quality, not tools, so it’s a mistake to think of it as only a matter of quantifiable metrics like accuracy without also considering the context in which the data will be used.

In conversations with data practitioners and data managers, we’ve narrowed it down to 8 dimensions that capture both the technical and non-technical aspects of producing high-quality data.

Why does data quality matter?

Data quality is often conflated with better decision-making. Having high quality data does not guarantee good insights, but having bad data almost certainly guarantees flawed conclusions and misguided decisions.

In many teams, data quality often takes a back seat because the process of testing can feel daunting and time-consuming. It's not that data practitioners don't recognize its importance, but rather that the prospect of setting up comprehensive testing frameworks can be overwhelming (or frankly, tedious), or not worth it when there are other fires to put out.

This is an unfortunate outcome since data drives everything for a business. Getting the data right is like investing in a well-constructed moat around your business operations; your dashboards, decisions, machine learning models, and reverse ETL syncs are protected on all sides. If you do this right, you can enable every other function of the business to focus more on growth than fending off your adversaries at the castle gates. 

Key dimensions of data quality

Data quality isn't just one thing; much like software quality, it's a multi-faceted problem.

You have accuracy—is the data correct? Completeness—do you have all the pieces you need? Consistency—is everything uniform and tidy? Timeliness—is your data up-to-date? And validity—does it meet the rules you’ve set? Depending on who you ask, there could be anywhere from 3 to 20 more dimensions to consider.

That’s quite a few things to worry about. To simplify, we’ve organized our approach to data quality around these 8 dimensions.

Let’s go over what each of them means with a sales-related example and a common misconception.

Accuracy

Accuracy is the dimension most people equate with data quality: it ensures that the data faithfully represents the real-world entities or events it describes. Accurate data can be trusted to provide reliable insights and support informed decision-making.

Example: In a sales dataset with one transaction per row, accuracy means that every value in the "amount" column equals the actual amount the customer was charged.

Common misconception: Perfect accuracy may not always be attainable or necessary, and the level of accuracy required should be balanced based on the specific needs of the analysis or decision-making process.
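
Where a trusted source exists, accuracy can be expressed as the share of values that match that ground truth. Here is a minimal pandas sketch; the `billing` export and the column names are hypothetical, made up for the example:

```python
import pandas as pd

# Hypothetical sales data: what the warehouse says vs. what the billing
# system (the assumed ground truth) actually charged.
sales = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "amount": [100.00, 250.00, 80.00, 40.00],
})
billing = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "charged_amount": [100.00, 250.00, 75.00, 40.00],
})

# Join on the shared key and measure how often the reported amount
# matches the ground-truth charge.
joined = sales.merge(billing, on="transaction_id", how="inner")
accuracy = (joined["amount"] == joined["charged_amount"]).mean()

print(f"Accuracy: {accuracy:.0%}")  # 75% — one of four rows disagrees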

Completeness

Data completeness ensures that all necessary data points are present, preventing gaps in analysis and enabling comprehensive insights. It goes beyond simply having a large volume of data to encompass having all the required data to answer a question or solve a problem effectively. 

Example: For a sales reporting team, completeness would involve ensuring that each sales transaction includes essential details such as product information, customer details, salesperson information, and transaction timestamps.

Common misconception: Completeness should be evaluated based on the specific requirements of the analysis or decision-making process, rather than striving for exhaustive data collection.
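
One simple way to put a number on completeness is the share of non-NULL values per column (the same measure used in the cheat sheet below). A minimal sketch, with hypothetical column names:

```python
import pandas as pd

# Hypothetical sales transactions with some missing details.
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "product_id": ["A", "B", None, "C"],
    "salesperson": ["kim", None, None, "lee"],
    "transaction_ts": pd.to_datetime(
        ["2024-05-01", "2024-05-02", "2024-05-02", "2024-05-03"]
    ),
})

# Completeness per column: share of non-NULL values.
completeness = transactions.notna().mean()
print(completeness)
# transaction_id    1.00
# product_id        0.75
# salesperson       0.50
# transaction_ts    1.00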

Consistency

Consistency in data quality ensures uniformity across datasets or measurements, avoiding contradictions or discrepancies that can undermine data reliability and interpretability (e.g., standard currency, naming conventions, data formats, and encoding standards across different datasets and databases).

Example: A very common inconsistency is referring to something as "user_id" and then "userid" a few months later. Inconsistencies in naming conventions or data formats can lead to confusion when integrating or analyzing sales data from multiple sources, especially as your data and team grow.

Common misconception: Despite what non-technical business users may think, raw source data is rarely consistent on its own! Because of the variability in formatting, timestamp types, and data types in raw source data, much of the work that creates consistency happens in the data transformation process.
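
Naming drift like "user_id" vs. "userid" can be caught with a simple convention check. A minimal sketch, assuming hypothetical table schemas and an agreed snake_case convention:

```python
import re

# Hypothetical column inventories for two tables that should share a
# naming convention for the user identifier.
schemas = {
    "orders":   ["order_id", "user_id", "amount", "created_at"],
    "sessions": ["session_id", "userid", "started_at"],
}

# Flag any column that looks like a user identifier but doesn't follow
# the agreed convention ("user_id").
pattern = re.compile(r"^user_?id$", re.IGNORECASE)
for table, columns in schemas.items():
    for col in columns:
        if pattern.match(col) and col != "user_id":
            print(f"Inconsistent naming in {table!r}: {col!r} (expected 'user_id')")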

Timeliness

Timeliness involves delivering data to the right audience, in the right format, and at the right time. This enables optimal decision-making and proactive responses to changing conditions.

Example: In sales dashboards, timely data updates may ensure that stakeholders have access to real-time or near-real-time sales insights, facilitating timely decision-making.

Common misconception: Timeliness does not mean you need to have real-time data. There are practical trade-offs between data freshness, cost, and complexity to consider. (To be honest, we haven’t encountered that many instances in the real world where real-time data won over a regular 30-minute refresh.)
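
In practice, timeliness is often enforced as a freshness check against an agreed refresh window. A minimal sketch, with a hypothetical 30-minute SLA and hard-coded timestamps standing in for a warehouse query:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness check: the team expects the sales table to be
# refreshed at least every 30 minutes.
expected_max_lag = timedelta(minutes=30)

# In practice these would come from the warehouse (e.g. the table's latest
# load timestamp); hard-coded here to keep the sketch self-contained.
last_loaded_at = datetime(2024, 5, 28, 8, 15, tzinfo=timezone.utc)
now = datetime(2024, 5, 28, 9, 0, tzinfo=timezone.utc)

lag = now - last_loaded_at
if lag > expected_max_lag:
    print(f"Stale data: last load was {lag} ago (SLA is {expected_max_lag}).")
else:
    print(f"Fresh enough: last load was {lag} ago.")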

Uniqueness

Uniqueness in data quality prevents data duplication, ensuring that each data point represents a distinct entity or event, thus maintaining a single source of truth. This single source of truth is crucial for avoiding friction and maintaining trust, even if the data is accurate in multiple places.

Example: Uniqueness constraints on identifiers or primary keys in sales databases are used to prevent duplicate sales records or entries.

Common misconception: Uniqueness goes beyond your primary keys. It’s important to build a data warehouse with unique tables and models, and to make sure work isn’t duplicated across your warehouse, creating a lot of "Wait, why do we have DIM_ORGS_FINAL and DIM_ORGS_FINAL_FINAL?" moments.
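
At the table level, uniqueness can be checked by counting rows whose primary key appears more than once. A minimal pandas sketch with a hypothetical `transaction_id` key:

```python
import pandas as pd

# Hypothetical sales records where transaction_id should be unique.
sales = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3],
    "amount": [100.00, 250.00, 250.00, 80.00],
})

# Count rows whose primary key appears more than once.
duplicates = sales[sales.duplicated(subset=["transaction_id"], keep=False)]
duplicate_rate = len(duplicates) / len(sales)

print(f"{len(duplicates)} duplicated rows ({duplicate_rate:.0%} of the table)")
print(duplicates)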

Reliability

Reliable data not only accurately reflects the underlying reality it represents but also ensures continuous data availability and uptime. This translates to business confidence.

Example: In sales performance reporting, reliable data would accurately depict the required metrics (e.g., sales transactions, salesperson performance, and revenue figures) and be delivered consistently and on time.

Common misconception: It’s easy to assume that reliability is a purely technical problem, but many non-technical factors, such as organizational processes, governance, and culture, also play a part.
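
The organizational side of reliability can’t be captured in code, but the measurable side, the percentage of time data meets expectations, can be tracked from a simple delivery log. A minimal sketch with hypothetical daily records:

```python
# Hypothetical delivery log: for each day, did the sales report land on
# time and pass its quality checks?
delivery_log = [
    {"date": "2024-05-01", "on_time": True,  "checks_passed": True},
    {"date": "2024-05-02", "on_time": True,  "checks_passed": True},
    {"date": "2024-05-03", "on_time": False, "checks_passed": True},
    {"date": "2024-05-04", "on_time": True,  "checks_passed": False},
    {"date": "2024-05-05", "on_time": True,  "checks_passed": True},
]

# Reliability as the share of days the data met expectations on both counts.
reliable_days = sum(d["on_time"] and d["checks_passed"] for d in delivery_log)
reliability = reliable_days / len(delivery_log)
print(f"Reliability this period: {reliability:.0%}")  # 60%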

Usefulness

Usefulness ensures that data serves its intended purpose and provides valuable insights for decision-making, optimizing the efficiency and effectiveness of analytical processes. It involves aligning data with specific business objectives and priorities to deliver actionable insights. One way to gauge usefulness is by considering whether the data influences important decisions: Would the decision be made the same way with or without this data? Another way is to use frameworks like Value of Information, which provide structured methodologies for quantifying the impact and significance of data in decision-making processes.

Example: This is admittedly a harder dimension to measure. Where possible, we recommend looking at the usage or consumption of core reports or tables (e.g., the number of queries against a table or bookmarks on a report) to help determine whether data work is actually being used by the business. Useful sales data may be as simple as a one-tile dashboard showing new ARR gained per quarter, or something more complex like an interactive data app the finance team can plug numbers into; as always, it depends on the use case and end-user needs.

Common misconception: Usefulness can seem like a more subjective measure, dependent on individual preference, than the other dimensions. But it can be set as an explicit benchmark between data practitioners and stakeholders, based on your company’s goals and priorities. This is the dimension most often left out of the data quality discussion, but we think it’s worth investing in understanding for your business: after all, there’s nothing more frustrating than spending days or weeks building a model or analysis only for it to go unused by your stakeholders.
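
One rough proxy for usefulness is how often core tables or reports actually get queried. A minimal sketch using a made-up query-log sample and naive substring matching; a real implementation would read from your warehouse’s query history and parse the SQL properly:

```python
# Hypothetical query-log sample; in practice this would be pulled from the
# warehouse's query history. Usefulness proxy: how often is each core table
# actually referenced?
query_log = [
    "SELECT SUM(amount) FROM analytics.fct_sales WHERE quarter = '2024-Q1'",
    "SELECT * FROM analytics.dim_orgs LIMIT 100",
    "SELECT region, SUM(amount) FROM analytics.fct_sales GROUP BY region",
]

core_tables = ["analytics.fct_sales", "analytics.dim_orgs", "analytics.dim_unused"]
usage = {table: sum(table in query for query in query_log) for table in core_tables}
print(usage)
# {'analytics.fct_sales': 2, 'analytics.dim_orgs': 1, 'analytics.dim_unused': 0}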

Differences

This wouldn’t be a blog by Datafold if we didn’t talk about data differences. Data differences highlight variations or discrepancies in the data, helping identify anomalies, errors, or inconsistencies that may affect data quality and decision-making. They provide insights into changes between different versions of a table (think: a staging and production version of a table, or a DIM_ORGS table in Oracle and Snowflake), facilitating data validation and reconciliation processes. 

Example: Data diffing tools like Datafold can compare sales tables (within and across databases) to detect additions, deletions, or modifications in sales datasets, enabling sales data professionals to understand how sales data changes over time.

Common misconception: The goal is often not to eliminate data differences, since there will be situations where you want the data to change based on new sources and refined metrics. Instead, look for a way to assess whether a change is expected and acceptable, or unexpected and requiring further investigation. 
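
Dedicated tools handle diffing at scale and across databases, but the core idea of a row-level diff can be sketched in a few lines of pandas, using hypothetical staging and production versions of a table keyed on `transaction_id`:

```python
import pandas as pd

# Hypothetical production vs. staging versions of a small sales table.
prod = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "amount": [100.00, 250.00, 80.00],
})
staging = pd.DataFrame({
    "transaction_id": [1, 2, 4],
    "amount": [100.00, 260.00, 40.00],
})

# Full outer join on the primary key, then classify each row.
diff = prod.merge(
    staging, on="transaction_id", how="outer",
    suffixes=("_prod", "_staging"), indicator=True,
)
removed = diff[diff["_merge"] == "left_only"]
added = diff[diff["_merge"] == "right_only"]
changed = diff[
    (diff["_merge"] == "both") & (diff["amount_prod"] != diff["amount_staging"])
]

print(f"{len(added)} added, {len(removed)} removed, {len(changed)} changed rows")
# 1 added, 1 removed, 1 changed rows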

If you need a cheat sheet on these dimensions, here’s a handy table:

| Dimension | Definition | How it's measured |
| --- | --- | --- |
| Accuracy | How well the data represents reality | Percentage of correctness against ground truth |
| Completeness | All the required data is present | Percentage of non-NULL values for each column, or percentage of features tracked with instrumentation |
| Consistency | Data is consistent across different datasets and databases | Occurrences of changes in definitions, schemas, or naming |
| Reliability | The data is trustworthy and credible | Percentage of time data does not conform to quality expectations (e.g., inaccurate, inconsistent, incomplete, delayed) |
| Timeliness | Data is delivered to the right audience, in the right format, and at the right time | Meets users' expectations (e.g., if the user expects a certain table to be refreshed by 9 a.m. and it is, it's timely) |
| Uniqueness | There are no data duplications | Percentage of columns, rows, and datasets that present duplicate information |
| Usefulness | Data is applicable and relevant to problem-solving and decision-making | How many important decisions were made with the data? How many users or applications use the data? |
| Differences | Users know exactly how and where data differs | Percentage of different rows, columns, and values across two datasets; often represented with a data diff |

Beyond the 8 dimensions of data quality lies trust

While the 8 dimensions provide valuable insights into the various aspects of data quality, none of them truly matter if the data users, whether internal or external stakeholders, don't trust the data they're working with.

Think of the 8 dimensions as leading indicators of trust. For instance, you might receive an immediate alert or angry DM if your data exhibits inconsistencies or inaccuracies. As the saying (loosely) goes, "Trust is hard to win, and easy to lose"; all it takes is one bad data quality incident to erode the trust of the business.

If your data consistently proves to be unreliable over time, trust will inevitably decline. And it's a delicate asset that can take considerable effort to (re)build and maintain, yet it can be quickly eroded by data inconsistencies or inaccuracies. As data practitioners, we know that without trust, the entire data infrastructure and decision-making processes are vulnerable to failure—much like a castle without its defenses.