Data quality is your moat; this is your guide.

Fortify your data, fortify your business: Why high-quality data is your ultimate defense.

A practical roadmap for creating a robust data quality system

Published May 28, 2024

So far, we’ve cleared up confusion around what data quality is (and isn’t), the state of data quality testing today, and how a proactive data quality approach should guide the modern data team. 

We’ve also drilled down into specific data quality metrics that ground our conceptual understanding of data quality, since progress should be measured along a quantifiable path of improvement, and presented real case studies from leading data engineering teams who have figured it out.

Finally, we’ve also looked at where we should think more carefully about ensuring better data quality during the data lifecycle and how companies can best nurture a data quality mindset across functions and levels of responsibility. 

We want our guide to be comprehensive and easy to map back to your context, no matter your technical stack or business model. In our final section, we’ll pull everything together and explain what you need to know to implement these new concepts.

What are the components of a data quality system?

Just as a castle relies on various elements for defense, a data quality system encompasses several components to safeguard the integrity and reliability of data. Your data quality system can be broken down into two major components: the depth of your CI system and which software engineering best practices you choose to integrate into your workflow. 

Stone walls, towers, and moats

Often, we see that teams new to CI are overwhelmed by the technical and conceptual complexity of the setups they encounter, and end up putting off implementing even the most basic one, even though a basic setup that improves data quality is better than having nothing in place at all.

Start with a wall

Teams expect systems to be perfectly optimized from day 1, but we’re all at different levels of readiness, capacity, and maturity. We think it’s ok to start small, like with a stone wall for basic defense.

Your foundational wall: Ad hoc SQL tests, dbt tests, and a whole lot of "eyeballing the data"—we all have to start somewhere!

A stone wall for your castle isn’t sufficient against more sophisticated forms of attack, but it provides good enough protection and a framework for future iteration. Early-stage teams can often get away with ad hoc SQL tests that validate basic data integrity, such as row counts and summary statistics. This lets them establish foundational practices and gain confidence in their data processes, even though these tests are still a rudimentary form of protection.
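To make this concrete, here’s a minimal sketch of the kind of ad hoc check we mean, assuming a hypothetical raw.orders table (the table and column names are placeholders, not a prescribed schema):

```sql
-- Ad hoc integrity check on a hypothetical raw.orders table:
-- compare row counts and basic summary statistics against what you expect to see.
SELECT
    COUNT(*)                                              AS row_count,         -- roughly the right volume?
    COUNT(DISTINCT order_id)                              AS distinct_orders,   -- should equal row_count if order_id is unique
    SUM(CASE WHEN order_total IS NULL THEN 1 ELSE 0 END)  AS null_order_totals, -- expect 0
    MIN(order_date)                                       AS earliest_order,
    MAX(order_date)                                       AS latest_order       -- is the data fresh?
FROM raw.orders;
```

Running a query like this by hand after each load is exactly the "eyeballing the data" stage: useful, but dependent on someone remembering to run it.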

Then, a tower

When you’re looking for a more fortified structure and a way to coordinate your counterattack, the next upgrade is a tower, which offers an elevated vantage point for surveillance and defense.

Your watchtower: Data quality and governance checks built into your CI pipeline

As teams grow and mature, they can progressively advance to a CI setup. It might even be better this way, since you’ll get a more accurate sense of which tools make sense for your specific data quality workflow before you start automating it. 

CI pipelines are meant to handle compilation and data quality testing whenever triggered by an action, like a pull request. This enforces automation and consistency in data quality testing across the organization. 
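As a hedged sketch of what one of those checks might look like: an assertion-style query, in the spirit of dbt tests, that a CI job runs on every pull request and that returns offending rows, so the job fails if the result set isn’t empty. The staging.orders table and its columns are hypothetical, and the orchestration itself (the job that compiles the project and runs the query) lives in your CI tool’s configuration.

```sql
-- Assertion-style data quality check run by CI on every pull request
-- (hypothetical staging.orders table): any row returned fails the check.
SELECT
    order_id,
    COUNT(*) AS duplicate_count
FROM staging.orders
GROUP BY order_id
HAVING COUNT(*) > 1;   -- order_id should be unique; duplicates are violations
```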

Finally, a moat

The moat: The ultimate defense for safeguarding your data—and your business

A moat is an underrated defensive innovation, taken for granted because it looks so simple. After all, it’s just digging a trench around your castle and flooding it to create a physical barrier. If you wanted to, you could add crocodiles, though they’re completely optional. But it works because it’s a broad, catch-all defense, and even when something does make it through, the moat slows its advance enough that you have time to respond.

In the most advanced stage of a proactive data quality system, CI checks are augmented with value-level data diffs. Performing value-level comparisons between your staging and production datasets is a similar catch-all defense: it surfaces both the expected and the unexpected changes in your data before they reach production.
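As a rough illustration of the idea, a value-level diff can be approximated in plain SQL. This sketch assumes hypothetical prod.orders and staging.orders tables keyed on order_id and compares a couple of columns; dedicated tooling such as Datafold’s Data Diff does this at scale and across every column, but the principle is the same.

```sql
-- Value-level diff between production and staging (hypothetical tables and columns):
-- returns one row per key where the two environments disagree.
SELECT
    COALESCE(p.order_id, s.order_id) AS order_id,
    p.order_total AS prod_order_total,
    s.order_total AS staging_order_total,
    p.status      AS prod_status,
    s.status      AS staging_status
FROM prod.orders AS p
FULL OUTER JOIN staging.orders AS s
    ON p.order_id = s.order_id
WHERE p.order_id IS NULL                               -- row exists only in staging
   OR s.order_id IS NULL                               -- row exists only in production
   OR p.order_total IS DISTINCT FROM s.order_total     -- value changed
   OR p.status      IS DISTINCT FROM s.status;
```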

Let’s recap the right CI setups for your team in this summary table:

| Team maturity | CI system type | What it is | Purpose | How it measures up |
| --- | --- | --- | --- | --- |
| Early stage | Basic | Ad hoc SQL tests | Basic checks using SQL queries for row counts and summary statistics. | Least mature: relies on manual intervention and lacks systematic testing procedures; prone to oversights and inconsistencies. |
| Growing/Scaling | Intermediate | CI checks | Compilation and testing of data projects triggered by pull requests. | Moderately mature: enhances automation and consistency but may still require manual intervention and lacks comprehensive testing. |
| Scaling/Mature | Advanced | CI checks with value-level data diffs | Integration of Datafold’s Data Diffs for standardized and automated data quality testing in CI pipelines. | Sophisticated and preventative: promotes standardization and full automation of data quality testing, enhancing reliability and scalability. |

Software engineering best practices

The data team, a.k.a. the modern equivalent of knights at the round table

Retrospectives

Once a solution has been found for a data quality incident, it’s tempting to jump onto the next ticket. It’s uncomfortable to dwell on why and how a problem occurred, but it’s the only way to actually engage in continuous improvement.

Retrospectives in data quality can be as simple as the team contributing to a shared memo discussing recent incidents and brainstorming solutions, or as complex as a structured, multi-day workshop involving stakeholders. We’re agnostic on which format works best, as long as it covers some key software engineering practices:

  • Identify and document specific data quality incidents, including any and all anomalies, inconsistencies, and failures that have impacted data integrity.
  • Perform root cause analysis to understand why these issues occurred. Was it the data source, a transformation, a workflow, or human error?
  • Discuss the impact of the data quality incidents on business operations, decision-making, and stakeholder trust. 
  • Brainstorm potential solutions to prevent similar issues in the future. If you’re stuck, experiment with how you can improve processes, swap tools, automate where you can, improve technical understanding through training, or update data governance practices.
  • Assign action items to implement the agreed-upon solutions. Establish ownership with explicit deadlines to create accountability.
  • Before the meeting ends, put a follow-up date on everyone’s calendar to review progress on action items. Retrospectives are not a one-time event but a continuous improvement process.

Shifting data quality to the left

Proactively finding data quality issues starts with moving testing to the left

As we discussed earlier in the Data Quality Guide, the concept of shift-left testing is pretty critical to a modern approach to data quality, and it owes its origins to the shift-left philosophy in software development. It emphasizes the importance of moving testing processes earlier in the development lifecycle, allowing teams to detect and address issues sooner rather than later.

Shifting data quality to the left lies at the heart of our approach. By integrating testing into the initial stages of data pipeline development, shift-left testing helps identify and mitigate data quality issues closer to their source, reducing the risk of downstream impacts and fostering a culture of proactive quality assurance.

It’s a simple and seemingly obvious idea, yet it’s still rarely seen in practice, often because organizations assume that prioritizing data quality means slower deployment. Experience has disproven this: FINN was able to scale with speed without sacrificing data quality standards, and memorably described this as breaking through the quality-speed frontier. It was made possible by implementing an automated CI pipeline with Datafold to validate data quality.

Metadata management

In software engineering, metadata provides essential information about code structure, function, and dependencies. As data engineers, we use metadata similarly, to capture information about data structures, schemas, lineage, usage, and ownership. It surfaces otherwise buried context: the purpose of each dataset, its relationships with other datasets, and its lineage from source to consumption.

So we know why metadata is valuable, but how do we actually put it to work to support our data quality framework? While there are many tools that promise to extract value from your metadata, we’ve found you really only need these three practices:

1. Assess the quality of your data using metadata attributes such as data quality scores, validation rules, and anomaly detection metrics.

How? Integrate data quality metrics into your metadata repository to track and monitor data quality over time (a minimal sketch of this follows the list).

2. Track your data lineage to trace the origin, transformations, and consumption of your data assets. This allows you to identify potential points of data quality degradation or integrity issues and take proactive measures to address them.

How? Column-level lineage tools are quickly becoming a standard feature among data quality tools, and they are essential for analyzing the impact of changes on both upstream data sources and downstream datasets and BI assets.

3. Enforce data governance policies to ensure that data assets are managed, protected, and used responsibly. This acts as a gatekeeper around data quality. 

How? Define metadata ownership, stewardship roles, and governance rules within the metadata repository. Establishing ownership is key, since those individuals or teams are ultimately responsible for the metadata and who can access it.
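To make the first practice concrete, here’s a minimal sketch of what integrating data quality metrics into a metadata repository could look like when that repository is simply a table in your warehouse. The metadata.data_quality_metrics table, its columns, and the metric names are all hypothetical; most catalog and observability tools expose an equivalent through their own interfaces.

```sql
-- Hypothetical metadata table for tracking data quality metrics over time.
CREATE TABLE IF NOT EXISTS metadata.data_quality_metrics (
    dataset_name  VARCHAR,          -- e.g. 'raw.orders'
    metric_name   VARCHAR,          -- e.g. 'row_count', 'null_rate_order_total'
    metric_value  DOUBLE PRECISION,
    measured_at   TIMESTAMP,
    owner         VARCHAR           -- who is accountable for this dataset
);

-- Record today's null rate for a column so it can be trended and alerted on.
INSERT INTO metadata.data_quality_metrics
SELECT
    'raw.orders',
    'null_rate_order_total',
    AVG(CASE WHEN order_total IS NULL THEN 1.0 ELSE 0.0 END),
    CURRENT_TIMESTAMP,
    'data-platform-team'
FROM raw.orders;
```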

Ownership and accountability

Establishing clear ownership for different parts of the data pipeline helps nurture a culture of proactive data quality testing. But practitioners know that assigning ownership doesn’t magically translate to accountability. This is where attempts to introduce accountability around maintaining data quality fail. 

But there are ways to nudge accountability in the right direction. The leading data engineering teams support ownership efforts by institutionalizing these two practices:

1. Provide training on data quality best practices. You don’t need to splurge on expensive courses or send data folks to conferences to upskill a team. We learn best from our peers and by getting our hands dirty. This can be as informal as internal lunch-and-learns, having the team read industry white papers and identify where improvements can be made, or encouraging regular pair-programming sessions where people see how senior data practitioners actually incorporate data quality checks into their workflow.

2. Establish metrics to assess data quality and hold data owners accountable. This is where you can mix and match KPIs based on your context and priorities, choosing from metrics such as data accuracy, completeness, timeliness, and consistency. If you’re already doing this and want to be more ambitious, a next step is to construct KPIs around meeting data quality targets (like SLAs) and resolving incidents in a timely manner.