The Data Quality Flywheel
If there’s one thing that we can all agree on, it’s that data quality is important (even if we don’t all necessarily agree on what “data quality” means exactly). And while efforts to “improve data quality” are pretty much universally unobjectionable, I think that we’ve spent insufficient time thinking about what data quality means practically and how we can systematically improve the quality of the data we’re working with.
Getting more concrete, I believe that data quality is a good thing to the extent that it makes our decisions better. Those might be business decisions made after an executive looks at a dashboard, or automated decisions made by a machine learning or AI product that is trained on the data. So investments in data quality pay off in the long run through better organization-wide decision-making and more effective data products.
The biggest misconception about “data quality” is that there’s a finish line -- that we can make an investment today and move from having low-quality data to having high-quality data. It’s critical to recognize that data quality lies on a spectrum, and that data quality is a journey, not a destination. That is, data quality is going to require continued investment over time.
We want the organization as a whole to recognize that:
- The data we have are not perfect now and never will be
- But, we can all work together to improve the data quality over time
The first point can be a little challenging for some people to accept. However, in any organization that is generating, processing, and analyzing new data over time (pretty much every non-academic organization) this is going to be the case. Data quality has a natural entropy to it that can never be completely controlled, and the best we can do as data practitioners is accept that reality and attempt to build systems that are as robust as possible to that natural and persistent data quality degradation.
The Data Quality Flywheel
Data quality can be an insidious problem in any organization. Once the organization believes that its data quality is bad, it has lost faith in its data. Losing faith in the data means losing a data-driven culture (or preventing one from developing) and allowing non-data-driven decision-making procedures to take root (coin-flipping, HiPPO, etc.).
This data learned-helplessness can feel impossible to overcome. However, I like to use the idea of the Data Quality Flywheel to describe how we can kick-start the data quality journey within an organization and begin to build trust in the data and improve the data quality in a self-reinforcing cycle. This data quality flywheel, when constructed correctly and kept well-oiled, can generate self-sustaining data quality improvements over time that counteract the effects of natural data entropy.
The premise of the data quality flywheel is this:
The more people that are looking at the data, and the more apps that are using the data, the faster data quality issues will be identified and resolved.
That is, the more value people and the business get out of the data, the more they are incentivized to work to improve data quality. If no one is using the data, then no one is incentivized to point out errors or maintain the data quality. Conversely, if lots of people are using the data and getting value out of it, then they are strongly incentivized to contribute to quality improvements.
Once the flywheel starts spinning, data quality is not a thing to be “fixed” by the data team, but rather a shared source-of-value that the organization as a whole is committed to improving. It is critical to reach this point in the organization -- if data quality is simply something everyone complains about without contributing to improving, then the game is already lost. The data team will never be able to identify and fix all of the data quality issues alone. The road to data quality must be traveled together!
Building and Starting the Flywheel
Getting started can be the hardest part. If you’re building a data practice in a startup or in an organization that is new to data, then you have it (relatively) easy as you can start out building trust in the data. If you’re working in an organization that already has data-quality-trust issues then you will have to be more strategic in how you approach these data quality problems.
The first thing to do is get to a minimum baseline of data quality. If you’re working in a large organization, that might mean choosing one department or business vertical to focus on first just to make the task feel manageable. What the minimum data quality level needs to be will vary widely by organization and focus area, but in general the goal should be to focus on the most important KPIs and datasets first and then work to match whatever the generally-accepted source of truth for each metric might be. Those most important KPIs tend to be at the “core” of the business (e.g., Bookings/Users/Hosts for Airbnb, Rides/Passengers/Drivers for Lyft), and those are where you should focus your attention first.
Once we are at a place where we have a baseline level of data quality that is generally accepted, then we want to get people in the organization using the data. Whether that’s via a BI tool or via a data science project, we want people using the data and getting value out of it -- the goal should be to provide additional value to the data consumers beyond what they’re able to do currently. Once people feel like they’re getting value out of the data, then it’s easy to convince them to contribute to keeping the data quality high (or lifting it even higher).
It’s important to clarify here what we mean by “keeping the data quality high”. For less-technical folks, that might just mean being willing to report discrepancies or anomalies to the data team. Since these business owners often have more context than the dedicated data folks, just pointing out potential issues can actually add a lot of value!
Keeping the Flywheel Spinning
Once we have a baseline level of data quality and we have people actually using our data, we need to keep the flywheel spinning. This is the most critical aspect of a data quality system, and the one that many people forget to build!
First, we need a process for reporting and responding to data quality issues. If the head of the marketing department sees some data in their dashboard that looks wrong, they need a way of reporting that issue to someone who can do something about it. Just as importantly, we need to make sure that those data issues get addressed quickly. If the head of the marketing department doesn’t feel like their reports are handled in a timely manner, they will stop reporting data quality issues and stop trusting the data (breaking the flywheel).
There are lots of ways to build processes that can handle these types of issues efficiently, but in my experience they tend to have the following qualities:
- Clear receptacle or point-person for reporting data quality issues. This could be the dedicated analyst assigned to a business unit or an online ticketing system that routes issues to the appropriate person on the data team
- Clear communication policies for what happens when issues are identified -- this could range from a company-wide email announcing that there’s an ongoing issue that needs to be addressed to a quick note to the person who reported the issue letting them know that there’s not an issue and the company actually did have a great sales day yesterday.
- Clarity around how long it might take an issue to be fixed. Ideally this length of time should be very short, but it should always be communicated back to the person who reported the issue.
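To make those three qualities a bit more concrete, here is a minimal sketch of what a reported issue might look like as a record. Everything in it is an assumption for illustration: the `DataQualityIssue` class, the severity levels, and the response targets are hypothetical, and in practice most of this would live inside whatever ticketing system the organization already uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical response targets per severity level. The actual numbers are
# an organizational choice; the point is that they exist and are communicated.
RESPONSE_TARGETS = {
    "high": timedelta(hours=4),
    "medium": timedelta(days=1),
    "low": timedelta(days=3),
}


@dataclass
class DataQualityIssue:
    reporter: str                     # who noticed the problem
    dataset: str                      # which dataset or dashboard looks wrong
    description: str
    severity: str                     # one of the keys in RESPONSE_TARGETS
    reported_at: datetime
    owner: Optional[str] = None       # point-person on the data team
    resolution: Optional[str] = None  # filled in once the issue is closed

    def respond_by(self) -> datetime:
        """Deadline that should be communicated back to the reporter."""
        return self.reported_at + RESPONSE_TARGETS[self.severity]

    def is_overdue(self, now: datetime) -> bool:
        """True when the issue is still open past its response target."""
        return self.resolution is None and now > self.respond_by()


def acknowledgement(issue: DataQualityIssue) -> str:
    """Build the note sent back to the person who reported the issue."""
    deadline = issue.respond_by().strftime("%Y-%m-%d %H:%M")
    return (
        f"Thanks {issue.reporter}, we are looking into '{issue.dataset}' "
        f"and will get back to you by {deadline}."
    )
```

Note that a resolution of “not actually an issue, sales really were that good” still counts as a resolution. What matters is that the reporter hears back.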
Greasing the Bearings
There are a number of tools and processes that we can put in place to help ensure that the data quality flywheel spins as smoothly as possible.
The first thing we want to have set up is automated data monitoring so that we can proactively catch potential data issues before our users do. If our data team gets an alert that the number of sales yesterday was unusually low, they can proactively reach out to the rest of the organization to let them know that they’re already looking into the issue to see if there’s a problem -- this sort of proactive engagement builds trust with the data users.
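As a rough illustration of the “unusually low sales” alert, here is a minimal sketch of a monitoring check. It assumes a simple z-score rule against recent history, which is only one of many possible approaches. The function name `is_anomalous`, the threshold of three standard deviations, and the sample numbers are all made up for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` when it is more than `threshold` sample standard
    deviations away from the mean of `history`.

    This is a deliberately simple rule: it ignores seasonality, trends,
    and day-of-week effects that a production monitor would likely need.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any deviation from a constant series is notable
    return abs(latest - mu) / sigma > threshold


# Hypothetical daily sales counts for the past week.
daily_sales = [102, 98, 105, 99, 101, 97, 103]

if is_anomalous(daily_sales, latest=40):
    print("Sales yesterday look unusual; the data team is investigating.")
```

The useful part is less the statistics and more the workflow: the alert goes to the data team first, so they can tell the rest of the organization that they are already on it.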
The second thing we want to do is to make sure that we have a good change-management process for our data pipelines so that we don’t accidentally deploy changes that reduce data quality. A good recipe for this sort of change management looks like:
- Data testing and assertion suites (e.g., via dbt or great_expectations)
- Reviewing data diffs showing the full scope of changes between dev and prod (dbt-audit-helper, Diffy by Spotify, Data Diff by Datafold)
- Code review that considers the outputs of the first two steps (and potentially additional types of automated review using the outputs of those tools)
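The first two steps of that recipe can be sketched in plain Python. This is not how dbt, great_expectations, or any of the diffing tools above are actually used; it is a toy version of the underlying ideas, with tables represented as lists of dictionaries and every function name (`check_not_null`, `check_unique`, `diff_tables`) invented for the example.

```python
def check_not_null(rows, column):
    """Assertion: return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def check_unique(rows, column):
    """Assertion: return the values of `column` that appear more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row[column]
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)


def diff_tables(prod, dev, key):
    """Data diff: compare two versions of a table by primary key."""
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}
    shared = prod_by_key.keys() & dev_by_key.keys()
    return {
        "added": sorted(dev_by_key.keys() - prod_by_key.keys()),
        "removed": sorted(prod_by_key.keys() - dev_by_key.keys()),
        "changed": sorted(k for k in shared if prod_by_key[k] != dev_by_key[k]),
    }


# Hypothetical production table and the same table built from a dev branch.
prod = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
dev = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}, {"id": 4, "amount": 40}]

assert check_not_null(dev, "amount") == []  # no missing amounts
assert check_unique(dev, "id") == []        # primary key is unique
print(diff_tables(prod, dev, key="id"))
# {'added': [4], 'removed': [3], 'changed': [2]}
```

A reviewer looking at that diff can immediately ask the right questions: was row 3 supposed to disappear, and was the amount on row 2 supposed to change?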
There’s been lots of development in this type of tooling over the last few years, and I feel very confident that we’ll continue to see innovation in tools that help us “productionalize” our data pipelines.
Investing in data quality in your organization can pay dividends for many years into the future. However, once an organization loses trust in the quality of the data, it can be very difficult to win it back.
By utilizing the self-propelling nature of the data quality flywheel, you can help your organization improve your data quality over time while simultaneously building trust in the data.