Demystifying data quality management

The data world has a lot of buzzwords, largely because we work with abstractions. Our work is complex and intangible, so we need a simple way to broadly describe what we’re doing. 

For example, data engineers talk about "data stacks" and "data ecosystems" to describe the technologies involved in creating, moving, storing, transforming, testing, and using data. "Stacks" and "ecosystems" are descriptive terms that are well understood because they describe the tools and interactions we work with every day. There’s no literal stack of anything, nor will you find an actual ecosystem anywhere. You can ask a hundred data engineers, "What is a data stack?" and get fairly consistent answers. 

But then we have terms like "data quality" and "data quality management." You’d think they’d be well-understood, but ask a hundred engineers and you might get 20 different answers. You can’t ask Google or you’ll be inundated with million-dollar enterprise solutions that solve almost zero real-world data problems.

It’s time to draw some lines in the sand. Let’s get specific about "data quality" and "data quality management" with concrete definitions and examples.

Data quality is quantifiable and specific

Data quality is a quantifiable measure of data’s fitness for use. It’s not abstract or ambiguous. We can look at and measure data quality across eight different dimensions:

Dimension | Definition | How it’s measured
Accuracy | How well the data represents reality | Percentage of correctness and trustworthiness
Completeness | All the required data is present | Percentage across rows and columns
Consistency | Data is consistent across different datasets and databases | Percentage across multiple tables and data sets
Reliability | The data is trustworthy and credible | Successful pass/fail tests in search of repeatable results
Timeliness | Data is up-to-date for its intended use | The amount of time it takes for data to transition from collection to a usable state (e.g. seconds, minutes, hours, days); often measured or held accountable with SLAs
Uniqueness | There are no data duplications | Counts/percentage of duplicates across tables
Usefulness | Data is applicable and relevant to problem-solving and decision-making | Pass/fail or percentage-based tests measuring whether and how people are using data to achieve their goals
Differences | Users know exactly how and where data differs | Comparing two tables to determine the row-by-row differences that may exist between the two; often represented with a data diff


High data quality typically means:

  • Accuracy is approaching 100%
  • Completeness is approaching 100%
  • Consistency is approaching 100%
  • Reliability has 0 failed tests
  • Timeliness is approaching the appropriate number of seconds, minutes, hours, or days for the dataset
  • Uniqueness is approaching 100%
  • Differences between datasets meet the expected value (e.g. I expect my dev and prod versions of DIM_ORGS to be exactly the same, or I expect DEV.DIM_ORGS to have 1000 more rows than PROD.DIM_ORGS)

“Usefulness” is a bit squishy and open to interpretation. In short, you should build metrics into your data usage and report on those metrics in a way that’s meaningful to your organization. It might be something like daily active users of a data product, or the number of queries run against the data relative to the number of known users.

The main point is: yes, you can objectively measure the quality of your data. And you absolutely should be measuring it because you can’t manage data quality unless you are measuring it. Once you are measuring it, you can make efforts to improve it with data quality management.
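As a rough illustration of what "measuring it" can look like, here is a minimal Python sketch that computes completeness and uniqueness percentages over a small in-memory table. The function names, the `customers` sample rows, and the choice to treat `None` and empty strings as missing are all assumptions made for this example, not a standard definition or any particular tool's API.

```python
from typing import Any, Mapping, Sequence

Row = Mapping[str, Any]


def completeness_pct(rows: Sequence[Row], required: Sequence[str]) -> float:
    """Percentage of required cells that are populated.

    Assumption: None and "" both count as missing.
    """
    total = len(rows) * len(required)
    if total == 0:
        return 100.0
    filled = sum(
        1
        for row in rows
        for col in required
        if row.get(col) not in (None, "")
    )
    return 100.0 * filled / total


def uniqueness_pct(rows: Sequence[Row], key: str) -> float:
    """Percentage of rows that carry a distinct value in the key column."""
    if not rows:
        return 100.0
    distinct = len({row.get(key) for row in rows})
    return 100.0 * distinct / len(rows)


# Hypothetical sample data: one missing email, one blank country, one repeated id.
customers = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": None, "country": "DE"},
    {"id": 2, "email": "c@example.com", "country": ""},
    {"id": 4, "email": "d@example.com", "country": "FR"},
]

print(f"completeness: {completeness_pct(customers, ['email', 'country']):.1f}%")
print(f"uniqueness:   {uniqueness_pct(customers, 'id'):.1f}%")
```

In practice these measurements would usually run as queries against the warehouse rather than over Python lists, but the arithmetic is the same: count what is present or distinct, and divide by what was expected.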

Set and maintain measures to manage data quality 

Every organization should have its own set of standards and practices for data quality and trustworthiness—data quality management, one might say. There are points at which data quality is high enough to be useful, too low to be trusted, or somewhere in between. Of course, this depends on the use case and the risk involved with the data and its use. If you’re a public company reporting on quarterly financial performance, you can’t rely on low-quality data. It must be accurate, complete, consistent, etc. 

But what do you do if, say, the SEC runs an investigation on those quarterly numbers? You certainly can’t tell the SEC the data “looked good enough to us at the time.” No, you’d need to demonstrate that you knew exactly what the data quality looked like at the time of the report and that you had an overarching strategy in place for measuring and managing data quality.

Managing data quality is like any other continuous improvement effort. It requires goals, standards, and accountability. Here’s what each of those might look like:

Data management goals
  • Achieve at least 98% accuracy in customer data for marketing campaigns.
  • Ensure data completeness where no more than 1% of records in the CRM system are missing critical information.
  • Maintain data freshness with all sales data in reports being updated within 24 hours of transaction completion.
  • Limit data duplication in the customer database to less than 0.5%.
  • Validate that 100% of the data in regulatory compliance reports aligns with legal standards.
Data management standards
  • Data must be verified for accuracy and completeness before being entered into the production database.
  • All financial reporting data must adhere to GAAP (Generally Accepted Accounting Principles) standards.
  • Data used for customer profiling must comply with GDPR and other relevant privacy regulations.
  • Data for health and safety reports must be validated against industry benchmarks and guidelines.
  • All dbt models must have baseline tests in place (e.g. all primary keys tested for null values and uniqueness).
  • All code changes must be tested during CI before the PR can be reviewed and merged into production.
  • Any data used for strategic decision-making must undergo a peer-review process for validity.
Data management accountability measures
  • The Data Governance Committee is responsible for overseeing data compliance with regulatory standards.
  • The IT department must ensure system uptime of 99.9% for all data storage and processing infrastructure.
  • Marketing team members are accountable for verifying the accuracy of customer data they collect and use.
  • The finance team is held responsible for the integrity of all financial data and reporting.
  • The customer service department is charged with maintaining and updating customer information records accurately.


Goals are typically metric-driven, standards are typically rule-driven, and accountability measures are typically people- or responsibility-driven. 
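Because goals are metric-driven, they can be written down as thresholds and checked automatically. The sketch below encodes four of the example goals above and reports which ones are unmet. The `Goal` class, the metric names, and the measured values are hypothetical and chosen only for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Goal:
    metric: str                              # name of the measured metric
    compare: Callable[[float, float], bool]  # how the measurement must relate to the target
    target: float


# Thresholds taken from the example goals; metric names are made up.
GOALS = [
    Goal("customer_accuracy_pct", operator.ge, 98.0),     # at least 98% accurate
    Goal("crm_missing_critical_pct", operator.le, 1.0),   # no more than 1% missing
    Goal("sales_freshness_hours", operator.le, 24.0),     # updated within 24 hours
    Goal("customer_duplicate_pct", operator.lt, 0.5),     # less than 0.5% duplicates
]


def failed_goals(measured: Dict[str, float], goals: List[Goal]) -> List[str]:
    """Return the metrics whose goal is unmet or that have no measurement."""
    failures = []
    for goal in goals:
        value = measured.get(goal.metric)
        if value is None or not goal.compare(value, goal.target):
            failures.append(goal.metric)
    return failures


# Hypothetical measurements for one reporting period.
measured = {
    "customer_accuracy_pct": 98.6,
    "crm_missing_critical_pct": 1.4,
    "sales_freshness_hours": 20.0,
    "customer_duplicate_pct": 0.5,
}

print(failed_goals(measured, GOALS))
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: a goal nobody measured cannot be shown to have been met.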

Data-mature organizations have goals, standards, and accountability measures in place to demonstrate effective tracking and maintenance of data quality standards. It’s not enough to occasionally meet a data freshness target when your business is actually making data-driven decisions. You need to offer assurances that both your data and your data quality processes are setting data consumers up for success.

Read more on why we think implementing data quality tests during your CI process is the best way to set up your team and data for success.

Common misunderstandings of data quality management

Managing data quality is a crucial business activity that is often misunderstood in a few key ways:

  • It’s a one-time fix: Some think once you clean your data, you're done. Nope! Data quality management is ongoing. As new data and use cases come in, they’ll need regular checks and maintenance to remain clean and useful.
  • It can be managed with one product or solution: Many data quality management “tools” are marketed to large enterprises as “one-size-fits-all” solutions. They may bundle data cleaning, deduplication, lineage, data parsing, and other features that are often very useful individually. But unless data quality software is paired with automation and reinforced by a culture that actively works toward higher data quality, this (often very expensive) software can fall short of expectations, fail to scale with data or organizational growth, or be overkill for your organization’s data quality pain points. Instead, we recommend testing during your development, deployment, and replication processes using data diffs, so that data pipelines are tested against consistent, thorough standards.
  • It’s only about removing errors: Yes, it’s vital to remove one-off errors, but data quality management is also about ensuring data is consistent, complete, accurate, and in a format that's usable for its intended purpose. That comes from defining and enforcing data quality standards, such as ensuring every dbt model has appropriate tests.
  • It doesn’t need a strategy: Effective data quality management requires a clear strategy. This includes defining what good quality data looks like for your specific needs, how you’ll maintain it, and who’s responsible for what. Larger organizations will often adopt data product managers or data governance managers whose responsibilities often revolve around creating practices and processes for effective data management.
  • It’s expensive and time-consuming: This is partially correct, but the cost of not managing data quality (like making bad decisions based on poor data) is usually much higher. Data quality testing also does not have to be manual or retroactive if you use a tool like Datafold to automatically catch data quality issues before they reach production.
  • All data needs the same level of quality: The level of quality control depends on how you’ll use the data. Some data needs to be super accurate, while for other purposes, a bit of inaccuracy might be okay. What counts as “acceptable” data quality will vary from industry to industry, organization to organization, and team to team.
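To make the data diff idea from the list above concrete, here is a deliberately naive, in-memory sketch that compares two versions of a table row by row on a primary key. It is not how Datafold or any other tool implements diffing. The `diff_tables` function, the `org_id` key, and the sample rows standing in for the prod and dev versions of DIM_ORGS are assumptions for illustration, and the sketch assumes key values are unique and sortable.

```python
from typing import Any, Dict, List, Mapping, Sequence

Row = Mapping[str, Any]


def diff_tables(left: Sequence[Row], right: Sequence[Row], key: str) -> Dict[str, List[Any]]:
    """Compare two tables on a primary key.

    Returns the keys found only on one side and the keys whose rows differ.
    Assumes each key value appears at most once per table.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    shared = left_by_key.keys() & right_by_key.keys()
    return {
        "only_in_left": sorted(left_by_key.keys() - right_by_key.keys()),
        "only_in_right": sorted(right_by_key.keys() - left_by_key.keys()),
        "changed": sorted(
            k for k in shared if dict(left_by_key[k]) != dict(right_by_key[k])
        ),
    }


# Hypothetical prod and dev versions of DIM_ORGS.
prod_dim_orgs = [
    {"org_id": 1, "name": "Acme", "plan": "free"},
    {"org_id": 2, "name": "Globex", "plan": "pro"},
    {"org_id": 3, "name": "Initech", "plan": "pro"},
]
dev_dim_orgs = [
    {"org_id": 1, "name": "Acme", "plan": "free"},
    {"org_id": 2, "name": "Globex", "plan": "enterprise"},
    {"org_id": 4, "name": "Umbrella", "plan": "free"},
]

print(diff_tables(prod_dim_orgs, dev_dim_orgs, "org_id"))
```

If the expectation is that dev and prod match exactly, all three lists should be empty; anything else is a difference someone should be able to explain before the change ships.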

When you have an effective and meaningful data quality management strategy, you’ll have a better sense of what’s acceptable and what isn’t. Every organization is different, so standards will vary. They can affect risk tolerance, speed of decision making, and more.

High data quality is no accident

If you ask 100 data engineers about data quality and data quality management, they should ideally tell you:

  • Data quality is quantifiable and demonstrates whether data can or should be used
  • Data quality management is a continuous and strategic process for ensuring data is usable

Sure, they may vary on the data quality dimensions (we use eight; other organizations may use a different number) or on what is acceptable. But at the end of the day, every data team should be striving for quantifiably high-quality data.

Managing data is messy, so high data quality is never an accident. You need to set quality standards and expectations, measure across the dimensions, and validate your assumptions. Put simply, all you really need are tests on specific parts of your pipelines.
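A test on a specific part of a pipeline can be very small. The sketch below shows the kind of baseline primary-key check mentioned earlier (no null keys, no duplicate keys), written as plain Python for illustration. In a dbt project this is normally declared with the built-in `not_null` and `unique` tests instead. The function names and the `orders` sample rows are hypothetical.

```python
from collections import Counter
from typing import Any, List, Mapping, Sequence

Row = Mapping[str, Any]


def null_key_rows(rows: Sequence[Row], key: str) -> List[int]:
    """Indexes of rows whose key is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(key) is None]


def duplicate_keys(rows: Sequence[Row], key: str) -> List[Any]:
    """Non-null key values that appear more than once."""
    counts = Counter(row.get(key) for row in rows if row.get(key) is not None)
    return [value for value, n in counts.items() if n > 1]


def primary_key_passes(rows: Sequence[Row], key: str) -> bool:
    """Pass/fail result: the key column has no nulls and no duplicates."""
    return not null_key_rows(rows, key) and not duplicate_keys(rows, key)


# Hypothetical model output with one duplicated key and one null key.
orders = [
    {"order_id": 100, "amount": 25.0},
    {"order_id": 101, "amount": 40.0},
    {"order_id": 101, "amount": 40.0},
    {"order_id": None, "amount": 12.5},
]

print(null_key_rows(orders, "order_id"))
print(duplicate_keys(orders, "order_id"))
print(primary_key_passes(orders, "order_id"))
```

Returning the offending rows and values, not only a pass/fail flag, makes a failed test easier to act on.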

Learn how Datafold is standardizing the way data teams govern and test their data using data diffing during the CI process.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
