How to improve data quality: Practical strategies

Published May 28, 2024

Summer’s coming, so let’s imagine it’s a beautiful Saturday and you’re going to one of your local big-box home improvement stores. They’ve got a killer deal on a patio furniture set on clearance for 75% off — sweet! Unfortunately, they only have one left and it’s hot pink. But there’s a photo of a nice turquoise set on the price tag. You flag down an employee to ask if they have any turquoise in stock. They check an app on their phone. “We’re all out, but the store 5 miles away has 10 of them.”

You speed-walk back to your car, drive the 5 miles, and guess what — they’re also out of turquoise. You pull up their website on your phone. It says the store another 10 miles away has 2 left. You drive another 10 miles. No turquoise in sight. You ask another employee who says, “Oh yeah. We sold out of those a month ago.”

Suddenly, your recreational weekend shopping has become an exercise in frustration. Your heart rate is high enough that your smart watch is asking if you need a moment of mindfulness. You take a deep breath and drive 15 miles back home with no patio furniture.

The eight pillars of data quality

Low data quality is a frustrating experience for downstream data consumers and patio furniture shoppers alike. Depending on the context, it can lead to anything from mild weekend frustration to stock price downturns and SEC investigations.

Our patio furniture example is a simple illustration of problems with data accuracy and timeliness. Those are just two of the eight dimensions of data quality:

  • Accuracy: The data represents reality
  • Completeness: All the required data is present
  • Consistency: Data is consistent across different datasets and databases
  • Reliability: The data is trustworthy and credible
  • Timeliness: Data is delivered in the right format, at the right time, to the right audience
  • Uniqueness: There are no duplicate records
  • Usefulness: Data is applicable and relevant to problem-solving and decision-making
  • Differences: Users know exactly how and where data differs

When you pursue perfection in each of these dimensions, you’re going to make big strides in the overall quality of your data. Some of the common outcomes include:

  • Increased revenue and profitability
  • Better reporting and analytics
  • Higher-quality and timely decision-making
  • Happier customers and consumers
  • Better chance of success with AI

Data quality management is a discipline that’s difficult to master. It requires a comprehensive understanding of how and where data originates, what happens when you ingest and store it, and how it’s used. 

At first, data quality management seems fairly straightforward: extract data at the source, load it into your data warehouse, transform it for each use case, and preserve its meaning and value every step of the way. But in reality, a lot can happen between extraction and usage, leading to poor-quality data.

Current challenges in data quality

Let’s get into some of the everyday things that can easily bungle up your data quality: data silos, inconsistent standards, large data volumes, complex data sources, and an undeveloped data culture.

These problems are practically inevitable for any growing organization. They’re not always a signal that you don’t know what you’re doing (though they sometimes are). They can often be the consequence of business decisions and technological constraints. And they can lead to other consequences:

  • Data pipeline failures
  • Data engineers wasting a lot of time on low-level work
  • Risk of going out of compliance
  • Erroneous reports and inaccurate analytics that lead to bad or poorly timed decisions

Data silos isolate data from broader accountability and create data quality problems

Data silos occur when different parts of an organization store their information separately and can’t share it effectively with one another. Silos are a problem for data quality because the data becomes isolated from the broader accountability and data practices of the organization, like leftovers in a Tupperware container, way in the back of the fridge. Silos happen for various reasons, like departments using different systems (e.g., SQL databases in finance, NoSQL in IT) or data management practices evolving independently in different parts of the company.

Inconsistent standards create compatibility issues and poor quality data

Inconsistent data standards can be a real headache for data quality. Issues arise when different groups use different formats or rules for the same data, making it difficult to combine or compare information accurately. As a trivial example, one department might record dates in DD-MM-YYYY format while another uses MM-DD-YYYY, leading to errors, confusion, and inconsistent data.

This usually happens organically because different people use (and store) data differently (we see you, backend engineers). It also happens when there are no common data standards or, if there are, nobody’s enforcing them.
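To make that concrete, here’s a minimal sketch of normalizing mixed date formats at the staging layer. The table and column names (raw_orders_eu, raw_orders_us, order_date_text) are hypothetical, and the functions are Snowflake-style SQL, so adjust for your own warehouse:

```sql
-- Hypothetical example: two teams load the same orders with different date formats.
-- raw_orders_eu stores dates as DD-MM-YYYY, raw_orders_us as MM-DD-YYYY.
-- Normalizing both to a real DATE column up front prevents downstream confusion.
with normalized as (
    select order_id, try_to_date(order_date_text, 'DD-MM-YYYY') as order_date
    from raw_orders_eu
    union all
    select order_id, try_to_date(order_date_text, 'MM-DD-YYYY') as order_date
    from raw_orders_us
)
select *
from normalized
where order_date is null;  -- rows that failed to parse surface here instead of corrupting reports
```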

Large data volumes complicate processing and analytics

Managing small amounts of data is relatively straightforward. However, as data volumes scale up to thousands or even millions of records, the challenge becomes significantly more complex. Large data volumes make it much harder to keep track of what’s going on in your data. If you can’t manage the scale of your data, you’re probably going to make compromises on data quality.

Complex data sources make data usage and integration difficult

Dealing with data from different sources can also be tricky for data quality. There are no industry-standard schemas, so Facebook’s data schema is totally different from Google Analytics’, which differs from Marketo’s. When you have all this data coming from different places at once, it’s hard to manage it all while also seeing the big picture.

Every data source has its own way of doing things. It’s up to you to normalize and format that data so you can store it and prepare it for downstream usage. Plus, you have to cross your fingers hoping that none of your data sources change schemas over time.
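As a rough illustration, a staging model can map differently shaped sources onto one common schema so downstream models only ever see a single shape. The tables and columns below are made up (not real Facebook or Google Analytics schemas), and the SQL is Snowflake/Postgres-flavored:

```sql
-- Hypothetical staging query that reshapes two ad platform exports into one schema.
select
    'facebook'           as source,
    campaign_id::varchar as campaign_id,
    report_date          as activity_date,
    spend                as cost_usd,
    link_clicks          as clicks
from raw_facebook_ads

union all

select
    'google_analytics'   as source,
    campaignid::varchar  as campaign_id,
    ga_date              as activity_date,
    ad_cost              as cost_usd,
    ad_clicks            as clicks
from raw_google_analytics_ads;
```

If a source changes its schema, the breakage is contained to this one staging query instead of rippling through every downstream model.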

A poor data culture can hurt data quality

When a company isn’t focused on being data-driven, data quality is extraordinarily difficult to manage. If people across different teams aren’t really sure how to use data or why it's important, it can lead to data entry and cleanliness issues. Plus, if there’s no big push from the top to make data a priority, these bad habits just keep going, creating considerable data tech debt.

High data quality results from a culture where everyone is invested in maintaining it. This means teaching everyone — from the top executives to entry-level staff — about how high-quality data can make their jobs easier and help the company succeed. It also means making sure that using data right is a part of everyone’s job description.

Strategies for improving data quality

While many of the challenges we’ve just described can rear their heads at companies of any size, we have a few effective antidotes. We’ll mostly recommend automation because, really, you don’t want to spend your time manually reviewing and maintaining data quality. It’s a lot of work, and it’s not feasible for a human to manage gigabytes and terabytes of complex data.

Use data quality testing in your CI/CD pipelines

When you put data quality testing into your CI/CD pipelines, you’re automatically checking the quality of data anytime there are changes in data or code. By automating these data quality checks, you catch problems early — before they can do any real harm or spread too far.

This proactive approach helps ensure that only clean, reliable data moves through your systems, and it reinforces your data quality standards and data governance practices. Whether it’s a minor code update or a big data migration, having these automated checks in place means you can trust that your data is accurate and up to standards, helping to prevent major headaches down the line.
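One common pattern is an assertion query that returns zero rows when the data is healthy and fails the build when anything comes back. This is a sketch against a hypothetical orders model; in a dbt project, a query like this can live under tests/ and run automatically on every pull request:

```sql
-- Hypothetical CI check: the pipeline fails if this query returns any rows.
select
    order_id,
    order_total,
    ordered_at
from analytics.orders
where order_id is null            -- primary key must be populated
   or order_total < 0             -- totals should never be negative
   or ordered_at > current_date;  -- orders shouldn't arrive from the future
```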

Use data diffs to monitor changes in your data

Using data diffs to identify changes in your data is a practical way to keep tabs on what's happening with your information. You can compare data before and after changes to get a clear picture of how those changes potentially affect your data. Data diffs are especially helpful in spotting any unintended modifications that could disrupt your operations or downstream data usage (we call these your unknown unknowns).

Data diffs allow you to take quick corrective actions if something doesn't look right. Whether you're updating schema configuration or replicating data from one system to another, having the ability to immediately see and address discrepancies helps maintain the integrity of your data and prevents small errors from becoming big problems.
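A dedicated diffing tool does this automatically and at scale, but the underlying idea can be sketched by hand: join the production table to its rebuilt development version on the primary key and surface anything added, removed, or changed. The database, table, and column names below are hypothetical:

```sql
-- Rough, hand-rolled diff between production and development builds of the same table.
select
    coalesce(p.order_id, d.order_id) as order_id,
    case
        when p.order_id is null then 'added in dev'
        when d.order_id is null then 'missing in dev'
        else 'value changed'
    end as diff_type,
    p.order_total as prod_total,
    d.order_total as dev_total
from prod_db.analytics.orders as p
full outer join dev_db.analytics.orders as d
    on p.order_id = d.order_id
where p.order_id is null
   or d.order_id is null
   or p.order_total is distinct from d.order_total;
```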

Implement column-level lineage for transparency

Column-level lineage creates transparency in how data moves and changes across your systems. When you know who’s using what columns from each of your tables, you can better maintain quality and ensure the data meets downstream requirements. When there’s a problem downstream, you can trace it right back to the data source or table.
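To illustrate what column-level lineage actually records, consider a hypothetical downstream model like the one below. A lineage tool parses the query and maps each output column back to the exact upstream columns it depends on, so a bad revenue_per_seat value can be traced straight to payments.amount or events.capacity:

```sql
-- Hypothetical downstream model; the comments show the column-level lineage a tool would infer.
--   event_month       <-  analytics.events.starts_at
--   revenue_per_seat  <-  analytics.payments.amount, analytics.events.capacity
select
    date_trunc('month', e.starts_at)            as event_month,
    sum(p.amount) / nullif(sum(e.capacity), 0)  as revenue_per_seat
from analytics.events as e
join analytics.payments as p
    on p.event_id = e.event_id
group by 1;
```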

Shift left on data quality

Shifting left on data quality means catching and fixing data quality issues early in the development process. Instead of discovering errors after the data has been deployed, you check for problems right from the start of the data handling and development phases. Doing this early greatly reduces the risk of affecting end users or influencing critical business decisions later.

Doing this gets you two things: high data integrity and minimized disruptions in day-to-day data work. Early, proactive detection of errors means fewer fixes later, which can save time and reduce costs associated with post-deployment corrections. By the time data reaches your users or influences decisions, it's already been vetted for quality and accuracy, supporting smoother operations and more reliable outcomes.
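In practice, shifting left often starts with checks against the raw source tables, before anything is modeled on top of them. Here’s a minimal sketch in Snowflake-style SQL with hypothetical names, checking freshness and volume at the very first step of the pipeline:

```sql
-- Hypothetical "shift-left" check on raw source data: returns a row only when something is wrong.
select
    max(loaded_at)                                         as last_load,
    datediff('hour', max(loaded_at), current_timestamp())  as hours_since_load,
    count(*)                                               as row_count
from raw.ecommerce.orders
having datediff('hour', max(loaded_at), current_timestamp()) > 24  -- feed looks stale
    or count(*) = 0;                                                -- feed looks empty
```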

Other tips for data quality improvement

Here are some quick tips for improving your data quality:

  • Standardize and validate to create consistent data
  • Clean your data to reduce inaccuracy
  • Eliminate and prevent data silos through automated governance and data accessibility
  • Train people on data quality
  • Test for data quality issues at the source and correct them through automation
  • Implement data quality tools like proactive testing tools or data monitoring tools

Quick data quality assessment

To understand whether your team has opportunities to improve its data quality and overall data management, take a look at the short checklist below. These questions can help identify where you might have gaps in your data quality strategy:

  • Do your data team members have a consistent methodology for testing during their development phases?
  • Is there any data validation, testing, or data monitoring on source-level data for detecting data quality problems earlier in the workflow? How far left can you shift your data quality testing?
  • If your team uses dbt, do all dbt models have core generic tests for key columns (see the sketch after this checklist)? Are data engineers running dbt tests for these models? If not, how can you enforce that with data quality rules or data governance policies?
  • Do data transformation models and tests have clear owners?
  • Are key data quality metrics like data freshness or overall data accuracy captured in any data quality dashboards?
  • Does your data team rely on any testing automation to create data governance policies for data quality testing?
  • Does your team need to leverage data observability tools to monitor data quality in your production environment?
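If you’re wondering what those core generic tests actually buy you, they roughly boil down to assertion queries like the ones below (the model and column names here are hypothetical). In a dbt project you’d declare not_null and unique in a YAML file rather than writing the SQL yourself:

```sql
-- Roughly what dbt's built-in not_null and unique tests check for a key column.

-- not_null: any rows returned mean missing keys
select *
from analytics.orders
where order_id is null;

-- unique: any rows returned mean duplicate keys
select order_id, count(*) as n
from analytics.orders
group by order_id
having count(*) > 1;
```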

Case studies: Success stories of data quality improvement

Datafold has helped a lot of companies improve their data quality and establish better data quality management practices. Here are a couple of quick examples from the data teams at Eventbrite and Thumbtack on how they leveraged Datafold for their data quality improvement efforts.

Eventbrite

Eventbrite migrated from Presto to Snowflake with Datafold. They faced challenges in model validation across roughly 300 models from 20-30 data sources, requiring a solution that could verify the accuracy of these migrations efficiently. Our data diff feature allowed Eventbrite to automate validation, meet SLAs, and increase confidence in their data operations.

Thumbtack

Thumbtack used Datafold to enhance their data quality strategy, saving over 200 hours per month and boosting productivity by 20%. Facing challenges with a large volume of SQL pull requests and the potential for data quality issues affecting their product, they integrated Datafold into their continuous integration pipeline for automated validation.

Data quality is best done with automation

Data quality is a big, complicated topic that involves understanding and quantifying what’s happening across your data ecosystem. You need to know what data you’re getting, what’s happening to it between source and destination, and how it’s being used downstream. The larger the ecosystem, the harder it is to maintain data quality.

The strategies we’ve shared above aren’t a guarantee that you’ll solve the most common challenges in data quality management. But we’re pretty sure that anyone who gets serious about automation and about establishing practices that standardize data quality testing will get there. There will always be stumbling blocks on the road toward high data quality, and the teams that automate have the greatest chance of overcoming them quickly.

Ultimately, the onus is on data practitioners to build and maintain automation that keeps data quality in check. We need to strive for 100% data quality in every dimension we can measure, knowing that it may never truly happen because sources and business requirements evolve over time. It’s not quite a game of whack-a-mole, but it’s not that far off, either.