Demystifying data quality management
The data world has a lot of buzzwords, largely because we work with abstractions. Our work is complex and intangible, so we need a simple way to broadly describe what we’re doing.
For example, data engineers talk about "data stacks" and "data ecosystems" to describe the technologies involved in creating, moving, storing, transforming, testing, and using data. "Stacks" and "ecosystems" are descriptive terms that are well understood because they describe the tools and interactions we work with every day. There’s no literal stack of anything, nor will you find an actual ecosystem anywhere. You can ask a hundred data engineers, "What is a data stack?" and get fairly consistent answers.
But then we have terms like "data quality" and "data quality management." You’d think they’d be well-understood, but ask a hundred engineers and you might get 20 different answers. You can’t ask Google or you’ll be inundated with million-dollar enterprise solutions that solve almost zero real-world data problems.
It’s time to draw some lines in the sand. Let’s get specific about "data quality" and "data quality management" with concrete definitions and examples.
Data quality is quantifiable and specific
Data quality is a quantifiable measure of data’s fitness for use. It’s not abstract or ambiguous. We can measure data quality across eight dimensions: accuracy, completeness, consistency, reliability, timeliness, uniqueness, usefulness, and differences.
High data quality typically means:
- Accuracy is approaching 100%
- Completeness is approaching 100%
- Consistency is approaching 100%
- Reliability has 0 failed tests
- Timeliness is approaching the appropriate number of seconds, minutes, hours, or days for the dataset
- Uniqueness is approaching 100%
- Differences between datasets meet the expected value (e.g., I expect my dev and prod versions of DIM_ORGS to be exactly the same, or I expect DEV.DIM_ORGS to have 1000 more rows than PROD.DIM_ORGS)
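To make a few of these dimensions concrete, here is a minimal sketch in plain Python. The sample rows, the column names (`org_id`, `name`, `updated_at`), and the helper functions are all hypothetical; in practice you would more likely compute the same ratios in SQL against your warehouse.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample rows standing in for a table such as DIM_ORGS.
rows = [
    {"org_id": 1, "name": "Acme", "updated_at": datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)},
    {"org_id": 2, "name": None, "updated_at": datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc)},
    {"org_id": 2, "name": "Globex", "updated_at": datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)},
    {"org_id": 4, "name": "Initech", "updated_at": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)},
]


def completeness(rows, column):
    """Share of rows where `column` is not null."""
    if not rows:
        return 1.0  # assumption: an empty table has nothing missing
    filled = sum(1 for row in rows if row[column] is not None)
    return filled / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among the non-null values of `column`."""
    values = [row[column] for row in rows if row[column] is not None]
    if not values:
        return 1.0  # assumption: no values means no duplicates
    return len(set(values)) / len(values)


def freshness_lag(rows, column, now):
    """Time elapsed between `now` and the most recent value of `column`."""
    return now - max(row[column] for row in rows)


# A fixed "now" keeps the example reproducible.
now = datetime(2024, 1, 2, 13, 0, tzinfo=timezone.utc)
max_lag = timedelta(hours=24)  # hypothetical timeliness target for this dataset

print(f"completeness(name): {completeness(rows, 'name'):.0%}")
print(f"uniqueness(org_id): {uniqueness(rows, 'org_id'):.0%}")
print(f"fresh enough: {freshness_lag(rows, 'updated_at', now) <= max_lag}")
```

With this sample, one of four names is missing and one `org_id` is duplicated, so both ratios land at 75%, well short of "approaching 100%."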
“Usefulness” is a bit squishy and open to interpretation. In short, you should build metrics into your data usage and report on those metrics in a way that’s meaningful to your organization. It might be something like daily active users of a data product or the percentage of known users measured against the number of queries against the data.
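As a rough illustration, the two usage metrics mentioned above could be computed from a query log. The log format, the user names, and the reading of "percentage of known users" as "share of known users who ran at least one query" are all assumptions made for this sketch.

```python
from collections import defaultdict
from datetime import date

# Hypothetical query log: one (day, user) pair per query against the data product.
query_log = [
    (date(2024, 1, 1), "ana"),
    (date(2024, 1, 1), "ben"),
    (date(2024, 1, 1), "ana"),
    (date(2024, 1, 2), "ana"),
]

# Hypothetical set of people who are expected to use the data product.
known_users = {"ana", "ben", "cy", "dee"}


def daily_active_users(log):
    """Count distinct users per day."""
    users_by_day = defaultdict(set)
    for day, user in log:
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}


def adoption_rate(log, known_users):
    """Share of known users who ran at least one query."""
    if not known_users:
        return 0.0
    active = {user for _, user in log} & known_users
    return len(active) / len(known_users)


print(daily_active_users(query_log))
print(f"adoption: {adoption_rate(query_log, known_users):.0%}")
```

Whichever definition you choose, the point is to pick one, write it down, and report it consistently.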
The main point is: yes, you can objectively measure the quality of your data. And you absolutely should be measuring it because you can’t manage data quality unless you are measuring it. Once you are measuring it, you can make efforts to improve it with data quality management.
Set and maintain measures to manage data quality
Every organization should have its own set of standards and practices for data quality and trustworthiness—data quality management, one might say. There are points at which data quality is high enough to be useful, too low to be trusted, or somewhere in between. Of course, this depends on the use case and the risk involved with the data and its use. If you’re a public company reporting on quarterly financial performance, you can’t rely on low-quality data. It must be accurate, complete, consistent, etc.
But what do you do if, say, the SEC runs an investigation on those quarterly numbers? You certainly can’t tell the SEC the data “looked good enough to us at the time.” No, you’d need to demonstrate that you knew exactly what the data quality looked like at the time of the report and that you had an overarching strategy in place for measuring and managing data quality.
Managing data quality is like any other continuous improvement effort. It requires goals, standards, and accountability. Here’s what each of those might look like:
Goals are typically metric-driven, standards are typically rule-driven, and accountability measures are typically people- or responsibility-driven.
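One way to picture how the three fit together is a small sketch in which each goal is a metric with a target, the standard is the rule that compares measured values to that target, and accountability is a named owner. The `QualityGoal` class, the targets, the measured values, and the team name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QualityGoal:
    dataset: str
    metric: str
    target: float  # goal: minimum acceptable value, between 0 and 1
    owner: str     # accountability: who is notified when the goal is missed


# Hypothetical goals for a single dataset.
goals = [
    QualityGoal("DIM_ORGS", "completeness", 0.99, "data-platform-team"),
    QualityGoal("DIM_ORGS", "uniqueness", 1.00, "data-platform-team"),
]

# Hypothetical measurements, e.g. produced by a scheduled job.
measured = {
    ("DIM_ORGS", "completeness"): 0.995,
    ("DIM_ORGS", "uniqueness"): 0.98,
}


def evaluate(goals, measured):
    """Standard: a goal fails if its metric is unmeasured or below target."""
    failures = []
    for goal in goals:
        value = measured.get((goal.dataset, goal.metric))
        if value is None or value < goal.target:
            failures.append((goal, value))
    return failures


failures = evaluate(goals, measured)
for goal, value in failures:
    print(f"{goal.dataset}.{goal.metric}: measured {value}, target {goal.target}, notify {goal.owner}")
```

Treating an unmeasured metric as a failure is a deliberate choice in this sketch: if you can't show what the quality was, you can't claim the goal was met.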
Data-mature organizations have goals, standards, and accountability measures in place to demonstrate effective tracking and maintenance of data quality standards. It’s not enough to occasionally meet a data freshness target when your business is actually making data-driven decisions. You need to offer assurances that both your data and your data quality processes are setting data consumers up for success.
Read more on why we think implementing data quality tests during your CI process is the best way to set up your team and data for success.
Common misunderstandings of data quality management
Managing data quality is a crucial business activity that is often misunderstood in a few key ways:
- It’s a one-time fix: Some think once you clean your data, you're done. Nope! Data quality management is ongoing. As new data and use cases come in, they’ll need regular checks and maintenance to remain clean and useful.
- It can be managed with one product or solution: Yes, there are many data quality management “tools” marketed to large enterprises as “one-size-fits-all” solutions. This software may offer data cleaning, deduplication, lineage, data parsing, and other features that are often very useful individually. But unless data quality software is paired with automation and reinforced by a culture that is actively working towards higher data quality, this (often very expensive) software can fall short of expectations, fail to scale with data or organizational growth, or be overkill for your organization’s data quality pain points. Instead, we recommend adopting testing during your development, deployment, and replication processes using data diffs to ensure data pipelines are tested with consistent, thorough standards.
- It’s only about removing errors: Yes, it’s vital to remove one-off errors, but data quality management is also about ensuring data is regularly consistent, complete, accurate, and in a format that's usable for its intended purpose. This comes through defining and enforcing data quality standards like ensuring every dbt model has appropriate tests set.
- It doesn’t need a strategy: Effective data quality management requires a clear strategy. This includes defining what good quality data looks like for your specific needs, how you’ll maintain it, and who’s responsible for what. Larger organizations often hire data product managers or data governance managers whose responsibilities revolve around creating practices and processes for effective data management.
- It’s expensive and time-consuming: This is partially correct, but the cost of not managing data quality (for example, making bad decisions based on poor data) is usually much higher. Data quality testing also does not have to be manual or retroactive if you use a tool like Datafold to automatically find data quality issues before they happen.
- All data needs the same level of quality: The level of quality control depends on how you’ll use the data. Some data needs to be super accurate, while for other purposes, a bit of inaccuracy might be okay. Again, what counts as “acceptable” data quality will vary from industry to industry, organization to organization, and team to team.
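The data diffs recommended in the list above can be illustrated with a deliberately naive sketch: compare two versions of a table by primary key and report which keys were added, removed, or changed. This is an in-memory toy with hypothetical rows and a hypothetical `diff_tables` helper, not how Datafold or any other tool implements diffing at warehouse scale.

```python
def diff_tables(prod, dev, key):
    """Compare two lists of row dicts by primary key `key`."""
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    added = sorted(dev_by_key.keys() - prod_by_key.keys())
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())
    # A key present in both versions counts as changed if any column differs.
    changed = sorted(
        k for k in prod_by_key.keys() & dev_by_key.keys()
        if prod_by_key[k] != dev_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical prod and dev versions of DIM_ORGS.
prod = [
    {"org_id": 1, "name": "Acme"},
    {"org_id": 2, "name": "Globex"},
    {"org_id": 3, "name": "Initech"},
]
dev = [
    {"org_id": 1, "name": "Acme"},
    {"org_id": 2, "name": "Globex Corp"},
    {"org_id": 4, "name": "Umbrella"},
]

result = diff_tables(prod, dev, key="org_id")
print(result)
```

If you expected dev and prod to be identical, any non-empty list here is a failed expectation worth investigating before the change ships.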
When you have an effective and meaningful data quality management strategy, you’ll have a better sense of what’s acceptable and what isn’t. Every organization is different, so standards will vary. They can affect risk tolerance, speed of decision making, and more.
High data quality is no accident
If you ask 100 data engineers about data quality and data quality management, they should ideally tell you:
- Data quality is quantifiable and demonstrates whether data can or should be used
- Data quality management is a continuous and strategic process for ensuring data is usable
Sure, they may vary on the data quality dimensions (we use eight; other organizations may use a different number) or on what is acceptable. But at the end of the day, every data team should be striving for quantifiably high-quality data.
Managing data is messy, so high data quality is never an accident. You need to set quality standards and expectations, measure across the dimensions, and validate your assumptions. Put simply, all you really need are tests on specific parts of your pipelines.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.