The 8 dimensions of data quality
"Data quality" is one of those terms that people talk about without much specificity. Everyone wants high-quality data—why wouldn’t they? High-quality data is the stuff of "data-driven decision making" and "building a data-driven culture," goals every company in the world is pursuing.
Yet, data quality is more than "timely, accurate, and complete" data. And even these terms warrant further investigation. For example, how timely does the data need to be? Do you actually need it in real time, or is that just something that sounds good? What are the business needs and justifications for data to be that timely? Do the costs and complexity warrant real-time data?
We need a more complete way of understanding and measuring data quality. Let’s go beyond the buzzwords and get specific. In this article, we'll walk through a practical framework that'll help you define and measure data quality in a way that actually makes sense for your business.
The eight dimensions of data quality
Data quality solves problems, and not every problem requires the same level of data quality. So, we need ways to articulate data quality and compare it to the requirements of solving the problem.
We propose these eight dimensions for evaluating data quality:
- Accuracy: The data represents reality
- Completeness: All the required data is present
- Consistency: Data is consistent across different datasets and databases
- Reliability: The data is trustworthy and credible
- Timeliness: Data is up-to-date for its intended use
- Uniqueness: There are no duplicate records
- Usefulness: Data is applicable and relevant to problem-solving and decision-making
- Differences: Users know exactly how and where data differs
Accuracy

Accuracy in data quality means the data is correct and reliable. What you see in the data matches the real-world facts. If your data is accurate, you can trust it to help you make good decisions. Inaccurate data can make you think something is true when it's not.
Data accuracy issues can arise at any time, from collection at the data source to its final transformation at the data destination. Issues at the source can be anything from a fat-finger typo from manual entry to a malfunctioning sensor to someone manipulating data to be intentionally inaccurate. Issues can also arise when data is transferred or transformed between different systems or formats.
Data can also go stale, which erodes its accuracy. When people change email addresses, for example, your old records become useless and incorrect.
People usually measure accuracy as a percentage. If your data is 95% accurate, 95 out of every 100 records are correct, meaning you can use the data with a 95% level of confidence.
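As a sketch, you can score accuracy by checking records against a trusted source of truth. The `accuracy_pct` helper and the CRM data below are hypothetical, just to make the percentage concrete:

```python
def accuracy_pct(records, reference):
    """Share of records whose value matches a trusted reference, keyed by id."""
    if not records:
        return 0.0
    correct = sum(1 for key, value in records.items()
                  if reference.get(key) == value)
    return 100.0 * correct / len(records)

# 20 CRM emails, one of which has gone stale -> 95% accurate.
truth = {i: f"user{i}@example.com" for i in range(20)}
crm = dict(truth)
crm[3] = "old-address@example.com"  # outdated entry
print(accuracy_pct(crm, truth))  # 95.0
```

In practice the hard part is obtaining that trusted reference, which is why accuracy checks often sample records for manual verification rather than scoring every row.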
Completeness

It’s easy to mistake completeness as “having all the data” rather than “having all the data you need” to answer a question or solve a problem. You may find that you only need half of the rows or columns in a dataset, or you might need to join data from a dozen tables across several data sources. Think of it as having all the pieces you need to complete a puzzle.

Completeness isn't about hoarding data, but about gathering just what's needed. For example, a specific business issue may require only customer IDs, transaction dates, and zip codes. The focus isn't on the accuracy of this data, but on ensuring all essential data points are collected to tackle the issue.
Completeness is typically measured as a percentage, and an accurate measurement means answering “Do we have all the data we need?” from both the column and the row perspective. If you lack the columns you need, you're unlikely to have the rows you need; if you have all the columns, you must still determine whether there are gaps in the rows.
To test for completeness, begin by identifying the questions you're trying to answer or the problems you want to solve. List the fields essential to your objectives and compare them to the data you actually have. For instance, if your list includes customer names, emails, and phone numbers, then any missing field in your actual data counts as a gap.
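A minimal completeness check under those terms might look like the sketch below. The `completeness_report` helper and the sample customer rows are hypothetical; it simply measures, per required field, the share of rows with a non-empty value:

```python
def completeness_report(rows, required_fields):
    """Per-field completeness: percentage of rows with a non-empty value."""
    report = {}
    for field in required_fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        report[field] = 100.0 * filled / len(rows) if rows else 0.0
    return report

customers = [
    {"name": "Ada", "email": "ada@example.com", "phone": "555-0100"},
    {"name": "Grace", "email": "", "phone": "555-0101"},
    {"name": "Alan", "email": "alan@example.com", "phone": None},
    {"name": "Edsger", "email": "edsger@example.com", "phone": "555-0103"},
]
print(completeness_report(customers, ["name", "email", "phone"]))
# {'name': 100.0, 'email': 75.0, 'phone': 75.0}
```

The key point is that `required_fields` comes from the business question, not from whatever columns happen to exist.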
Consistency

Data consistency means applying uniform practices across your data warehouse and dbt models. It covers everything from naming conventions to timestamp formatting to how currency values are represented. Inconsistencies might look like using CamelCase for one table name and snake_case for another. Or, you might have “24.99USD” in one table and “$24.99” in another.
These inconsistencies often start right at the data source and can be a headache to manage. Take timezones, for example—there's no universal rule on how they should be formatted or included in timestamps. For data pros, fixing this mess means writing code that gives you the same results, no matter who's doing it or where it's being done.
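For instance, a tiny normalizer can coerce inconsistently formatted currency strings into one canonical form so every downstream consumer sees the same value. The `normalize_currency` helper below is a hypothetical sketch (it assumes a single currency and ignores locale issues like comma decimal separators):

```python
import re

def normalize_currency(raw):
    """Coerce messy currency strings ('$24.99', '24.99USD') to integer cents."""
    digits = re.sub(r"[^0-9.]", "", raw)  # strip symbols and currency codes
    return round(float(digits) * 100)     # integers avoid float rounding drift

# Three inconsistent source values, one consistent canonical form:
for value in ("$24.99", "24.99USD", "24.99"):
    print(normalize_currency(value))  # 2499 each time
```

Centralizing logic like this in one place (for example, a shared dbt macro) is what makes the results the same "no matter who's doing it."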
Managing data quality consistency requires establishing governable and scalable processes for your data teams. dbt allows you to easily rebuild tables and use data references. Datafold Cloud improves on dbt by plugging into your CI/CD pipeline and ensuring your data is always tested in the same way, using data diffs every time a pull request is submitted.
To measure consistency, you might compare data from different sources or at different times to see if they match. If your sales numbers are the same in two different reports, they’re consistent.
Reliability

Reliable data will stay consistent every time you measure or use it. It’s like weighing yourself on a scale multiple times and getting the same result each time. If your data is reliable, you feel confident using it over and over.
Reliability issues can arise if your data storage or collection methods change—which is inevitable over time. If one person collects data differently than another, or if your system has glitches, your data might not be reliable. There could be gaps, inconsistencies, or even changes in how certain data should be interpreted.
You can measure reliability with a couple of methods. One way is to run repeated tests over time to check for consistent results. Another is to compare data from different sources, like cross-referencing Google Analytics with server logs. For example, if a survey gives you the same results today as it did last week, that's a good sign of reliability. Both approaches help ensure your data stays consistent across different situations.
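A repeated-measurement check can be sketched in a few lines. The `is_reliable` helper and the 1% tolerance below are assumptions for illustration; the right threshold depends on your metric:

```python
def is_reliable(measurements, tolerance_pct=1.0):
    """True if repeated measurements of the same metric stay within
    tolerance_pct of each other (hypothetical threshold)."""
    lo, hi = min(measurements), max(measurements)
    if lo == 0:
        return lo == hi
    return (hi - lo) / lo * 100 <= tolerance_pct

# Daily subscriber counts, measured by two independent pipelines:
print(is_reliable([10432, 10431, 10440]))  # True: spread is well under 1%
print(is_reliable([10432, 9800]))          # False: sources disagree by ~6%
```

The same comparison works for cross-source checks: feed it the Google Analytics number and the server-log number for the same day and see whether they agree within tolerance.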
Timeliness

If you get data when you need it, your data is timely. Imagine needing the weather forecast for tomorrow but getting it a week later. That's not timely!
Most data problems aren’t about weather forecasts, but about vaguely knowable future events that require predictive models based on historical data. Highly-competitive markets, like stock trading, often require as-real-time-as-possible data. Other problems, like monthly subscriber churn, may only require data from recent weeks or days.
Many practitioners think they need real-time data, but in many cases it’s unnecessary. Data needs to be timely, yes, but it doesn’t need to be live right up to this very moment. Timeliness is about supporting a decision, and that timeframe is different for every company and every problem. A small company might need data very quickly, while a large company might evaluate that same data over the course of a year.
To measure timeliness, look at the time it takes for data to transition from collection to a usable state. You might set a goal, like wanting sales data within 24 hours after a sale. If you consistently get it later, you're not hitting the timeliness mark.
You can use deadlines or timestamps to keep track, or build data freshness tests for your data sources with dbt. Try setting severity criteria, like “Salesforce data must be no more than 4 hours old or a warning goes to the data team.”
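As a sketch of that severity rule, a dbt source freshness block might look like the following (the source, table, and column names are hypothetical; `warn_after`/`error_after` and `loaded_at_field` are dbt's own freshness configuration keys):

```yaml
# models/sources.yml — hypothetical Salesforce source with freshness checks.
# `dbt source freshness` warns after 4 hours of staleness, errors after 24.
sources:
  - name: salesforce
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 4, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: opportunities
```

Running `dbt source freshness` on a schedule then turns your timeliness requirement into an automated, alertable check.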
Uniqueness

Uniqueness in data quality means that each piece of data is different from the rest and shows up only once. It's like having a class where every kid has a different name; it makes roll call a lot easier. If your data isn't unique, you might count the same thing twice or mix stuff up.
Uniqueness is similar to completeness, but focuses on identifying and removing duplicate data. Duplicate data can show up in primary keys, as duplicate rows in one or more tables, or even as entire tables. It’s not uncommon to see tables called “customers” and “customers_v1” in the same database.
Issues with uniqueness usually occur when merging data from different places or adding new data. Combining two email newsletter lists, for example, might include the same email address twice in two rows.
Measuring uniqueness requires checking for duplicates. You can sort the data to see if anything shows up more than once or use software to find duplicates for you. If you find that 5 out of 100 records are duplicates, your data could be considered 95% unique. The goal is to make sure each piece of data is a "one and only" so you can trust what it's telling you.
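That duplicate check fits in a few lines of code. The `uniqueness_pct` helper and the sample email list below are hypothetical; it counts every occurrence of a value beyond the first as a duplicate:

```python
from collections import Counter

def uniqueness_pct(records):
    """Share of records that are not duplicates of an earlier record."""
    if not records:
        return 100.0
    counts = Counter(records)
    duplicates = sum(n - 1 for n in counts.values())  # extra occurrences only
    return 100.0 * (len(records) - duplicates) / len(records)

# 100 newsletter emails where one address was merged in 5 extra times:
emails = [f"user{i}@example.com" for i in range(95)] + ["user0@example.com"] * 5
print(uniqueness_pct(emails))  # 95.0
```

In a warehouse, the equivalent is typically a `GROUP BY ... HAVING COUNT(*) > 1` query on the primary key, or a dbt `unique` test on the column.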
Usefulness

Usefulness is more of a heuristic than a measurable data quality dimension. It’s about whether the data actually helps people do what they need to do. If data is useful, it gives people what they need to make good decisions or solve problems. What’s useful for one person might not be for another, which makes usefulness tricky.
Measuring usefulness is a matter of looking at your data goals. Ask yourself, “Does this data help me answer my questions or achieve what I'm aiming for?” If yes, it's useful. If not, you might need to find different data or ask different questions.
You can also get feedback from people who use the data to see if it’s helping them out. If most folks say it's useful, you're on the right track. If not, figure out what's missing or what could be better. This requires more of a product mindset than a typical data analyst approach. Data practitioners should work with stakeholders to ensure they're aligned on what the data needs to do.
And remember to start off by keeping things simple. Many times, people just need a pivot table and not a sophisticated ML model that takes weeks to develop.
Differences

Data differences usually mean variations or discrepancies in the data you’re looking at. Think of it like getting different answers from two calculators; it's confusing and makes you question which one's right.
These differences can happen when data is collected, stored, or processed in different ways. Sometimes it's just human error, like writing an incorrect data transformation, and sometimes it's a matter of how data sources collect and store information.
To measure data differences, compare the varying data points to see how far off they are from each other. You might look at averages, ranges, or specific examples where the data doesn't match up. We built data diff, a testing solution that helps data professionals understand the impact of data changes before pushing code to production—or between any two tables in the same (or different) databases. It provides specific metrics, down to individual columns and rows, showing exactly how data differs between two environments.
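As a drastically simplified illustration of the idea (not how Datafold's data diff is implemented), a diff between two tables keyed by primary key boils down to finding added, removed, and changed rows:

```python
def diff_tables(before, after):
    """Toy table diff: rows added, removed, and changed between two
    snapshots, where each snapshot maps primary key -> row tuple."""
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = sorted(k for k in before.keys() & after.keys()
                     if before[k] != after[k])
    return {"added": added, "removed": removed, "changed": changed}

# Hypothetical production table vs. the staging build of a code change:
prod = {1: ("Ada", 24.99), 2: ("Grace", 19.99), 3: ("Alan", 9.99)}
staging = {1: ("Ada", 24.99), 2: ("Grace", 21.99), 4: ("Edsger", 4.99)}
print(diff_tables(prod, staging))
# {'added': [4], 'removed': [3], 'changed': [2]}
```

Real diffing tools do this at warehouse scale without pulling full tables into memory, but the output is the same shape: which keys appeared, disappeared, or changed value.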
To learn more about data diffing and Datafold Cloud, make sure to join us at our next live Datafold Cloud Demo Day!
Getting a full picture of your data quality
Getting the full picture and improving data quality can be challenging, especially for large datasets and demanding data-driven operations. It requires building and maintaining a data quality framework, using automation to keep everything on track. Managing your data quality helps you prevent issues before they happen while also giving you data that’s accurate, useful, and trustworthy.
If you're looking to choose a data quality tool, think about what problems you're trying to solve. Need to catch errors early? Want it to work well with other tools? Make sure you're not just grabbing the first tool you find. Take time to pick one that suits your needs.
Datafold has both the open source data-diff and the Datafold Cloud platform, which is purpose-built for demanding data analytics work. It’ll spot exactly what’s different between datasets, plug into your CI/CD pipeline, and give you every meaningful metric to keep tabs on these eight data quality dimensions.
Don't wait for a data mess to happen. Be proactive: monitor your data quality dimensions with Datafold Cloud to keep your data clean and useful. Your future data-driven decisions will thank you.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.