In earlier chapters of our guide, we’ve focused on building your technical proficiency in implementing data quality across different types of data architectures and data workflows. But you need more than just better continuous integration (CI) pipelines and data validation tools to deliver quality data.
The challenge at the heart of data quality is that there are too many unknown unknowns out there: no checklist, however well thought out, can cover every undetected data issue. Building data quality systems that persist and improve over time requires fostering a culture that prizes the craft of data quality as a goal in itself.
The data quality mindset
In the last decade, embattled data practitioners have all heard about becoming more data driven. Usually, this means one of two things: a push to use data to inform decision-making across an organization, or a push to encourage data literacy among all employees and set up self-serve analytics so that non-analytics folks can learn data-driven thinking.
Data quality is usually treated as immediately relevant only to data teams (we know from personal experience that data quality is often the thing that keeps us up at night). But the data quality mindset should sit alongside being data driven as a first-class mental model guiding an organization’s success. The two are complementary and mutually reinforcing, with data quality serving as the foundation upon which data-driven success is built.
Nurturing a data quality mindset
Building this mindset requires five core building blocks. We’ll run through them as questions that you can use to roughly benchmark how data quality is viewed in your organization.
Again, not all of these core principles and practices may be fully applicable to your business; maybe your data team is hundreds of people in a decentralized org, maybe your data team is just you, trying to make the best of what you have. Instead, think of this section as a set of things to eventually work up to, and practices to customize for your data team and business.
1. What are your tech processes to help developers deploy with speed and confidence?
Least acceptable: Manual deployment processes with limited or no automated testing. Developers manually deploy code without adequate testing procedures in place, leading to frequent errors and data quality issues in production.
Best practice: Implementing CI checks for your data transformation work from (ideally) day one (but realistically, when you have bandwidth to add it). Top-performing organizations have CI pipelines that automate testing and deployment, enabling developers to deploy code quickly, confidently, and in a standard fashion while ensuring data quality is maintained throughout the deployment lifecycle.
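To make the idea of a CI check on transformation work concrete, here is a minimal sketch in Python. It is not how any particular tool works: it assumes a hypothetical setup where your pull request builds a `dev_orders` table next to the production `prod_orders` table, and it uses an in-memory SQLite database purely so the example runs on its own. The function names (`diff_tables`, `ci_exit_code`) and table names are ours, for illustration only.

```python
import sqlite3


def diff_tables(conn, prod_table, dev_table, key, columns):
    """Compare two tables by primary key and report added, removed, and changed keys.

    Table and column names are interpolated directly into SQL, so this sketch
    assumes they come from trusted config, never from user input.
    """
    cols = ", ".join([key] + columns)
    prod = {row[0]: row[1:] for row in conn.execute(f"SELECT {cols} FROM {prod_table}")}
    dev = {row[0]: row[1:] for row in conn.execute(f"SELECT {cols} FROM {dev_table}")}
    return {
        "added": sorted(dev.keys() - prod.keys()),
        "removed": sorted(prod.keys() - dev.keys()),
        "changed": sorted(k for k in prod.keys() & dev.keys() if prod[k] != dev[k]),
    }


def ci_exit_code(diff, allow_changes=False):
    """Return 0 to let the pipeline proceed, 1 to fail the CI job."""
    has_diff = any(diff.values())
    return 1 if has_diff and not allow_changes else 0


# Hypothetical demo data standing in for a production table and a PR build.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE dev_orders (order_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO prod_orders VALUES (1, 10.0), (2, 20.0), (3, 30.0);
    INSERT INTO dev_orders VALUES (1, 10.0), (2, 25.0), (4, 40.0);
""")

diff = diff_tables(conn, "prod_orders", "dev_orders", "order_id", ["amount"])
print(diff)  # {'added': [4], 'removed': [3], 'changed': [2]}
```

In a real pipeline, you would likely end the script with `sys.exit(ci_exit_code(diff))` so that unexpected differences block the merge, and flip `allow_changes` only when the reviewer has confirmed the differences are intended.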
2. How do you document and assign ownership of data quality issues? Are owners empowered to resolve incidents?
Least acceptable: Lack of documentation and unclear ownership of data quality issues. Data quality issues are not adequately documented, and there’s confusion over who owns which issue, leading to delays in resolution and unresolved incidents.
Best practice: Establishing clear documentation and assigning ownership of data quality issues. This empowers designated owners to resolve incidents promptly and efficiently, ensuring accountability and timely resolution. Borrowing from how software engineers handle incidents, we recommend appointing issue “commanders” or point-people, defining severity levels for issues, and writing post-mortems for large data quality incidents.
3. What is the ratio of manual to automated data validation tasks?
Least acceptable: Heavy reliance on manual data quality management tasks. Data quality management processes are predominantly manual, leading to inefficiencies, inconsistencies, and increased risk of errors. This may look like an analytics engineer spending hours (or even days) manually checking the validity of a proposed PR.
Best practice: Automating tasks such as data validation, monitoring, and anomaly detection wherever possible. This reduces reliance on manual processes, improves efficiency, and ensures consistency and accuracy in data quality assurance efforts.
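As a rough illustration of what automating these tasks can look like, the sketch below pairs two simple checks: a validation pass for null and duplicate values, and a basic anomaly check that flags a daily row count far from its recent average. Everything here is an assumption for the sake of the example: the function names, the `user_id` and `email` columns, and the three-standard-deviation threshold are ours, and a production setup would more likely lean on a dedicated testing or monitoring tool.

```python
from statistics import mean, stdev


def validate_rows(rows, key, required):
    """Return human-readable problems (duplicate keys, null required columns)."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        k = row.get(key)
        if k in seen:
            problems.append(f"row {i}: duplicate {key}={k!r}")
        seen.add(k)
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: {col} is null")
    return problems


def is_anomalous(history, today, threshold=3.0):
    """Flag today's value if it sits more than `threshold` standard deviations
    away from the historical mean. The threshold is an illustrative default."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = stdev(history)
    if sigma == 0:
        return today != history[0]
    return abs(today - mean(history)) / sigma > threshold


# Hypothetical sample data.
rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},
    {"user_id": 1, "email": "c@example.com"},
]
problems = validate_rows(rows, key="user_id", required=["email"])
print(problems)  # ['row 1: email is null', 'row 2: duplicate user_id=1']

daily_row_counts = [1000, 1020, 980, 1010, 990]
print(is_anomalous(daily_row_counts, today=400))   # True
print(is_anomalous(daily_row_counts, today=1005))  # False
```

Checks like these can run on a schedule or on every pull request, which frees up the hours an analytics engineer would otherwise spend eyeballing results.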
4. Are your stakeholders provided with visibility into data quality issues? Is communication about resolution efforts proactive or retroactive?
Least acceptable: Limited transparency and reactive communication about data quality issues. Stakeholders are not provided with visibility into data quality issues, and communication about resolution efforts is reactive, leading to distrust and uncertainty regarding data quality.
Best practice: Proactive communication and transparency about data quality issues. This means giving stakeholders visibility into data quality issues and communicating resolution efforts before they have to ask, which fosters trust and confidence in the reliability of data-driven decisions. We recommend, again, leaning on some software engineering best practices to help facilitate regular communication between your data team and stakeholders. In practice, this could look like adopting:
- Post-mortems for large data quality incidents
- “Changelogs” for big updates to dbt models or BI dashboards
- Enablement and educational sessions for your stakeholders on data fundamentals
- A #data-alerts channel for key stakeholders to be notified about data pipeline failures
- A ticketing system, like the ones software engineers use, so stakeholders have clear visibility into the work being done by the data team
5. To what extent is your team able to confidently interpret the data?
Least acceptable: Stakeholders lack the knowledge to interpret data effectively. If this sounds familiar, you’re not alone. Most data-driven organizations don’t actually have data-literate stakeholders; a recent piece by Commoncog details the ways in which businesses with sophisticated analytics stacks and professionals often find that their KPIs and charts don’t build a solid intuition for which levers actually affect their business model. Many struggle with relatively simple concepts, such as knowing when to use the mean instead of the median, or how to tell if the data is correct.
Best practice: Investing in comprehensive data literacy training and education programs for stakeholders (this can look like a productivity tool or educational allowance) to build confidence and ensure that understanding data quality and becoming data-driven aren’t just buzzwords. This doesn’t need to mean fancy (and expensive!) courses. We recommend starting off simply by holding regular data office hours, or creating a “set” curriculum around specific data topics (e.g., core data metrics and how they’re calculated, creating dashboards in your BI tool, contributing a small update to dbt) that can help lower the barrier to understanding and contributing to analytics work.
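The mean-versus-median question is a good candidate for a short worked example in a data literacy session. The sketch below uses made-up order values (an assumption on our part, not real data) to show how a single outlier drags the mean far away from what a typical order looks like, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical order values in dollars; the last one is a single bulk order.
order_values = [20, 22, 25, 24, 21, 23, 5000]

typical_by_mean = mean(order_values)
typical_by_median = median(order_values)

print(round(typical_by_mean, 2))  # 733.57
print(typical_by_median)          # 23
```

A stakeholder reading "average order value: $733.57" on a dashboard would come away with a very different picture than one reading "median order value: $23", which is why the choice deserves a sentence of explanation wherever the metric is shown.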