Data integrity vs. data quality

We’re living in a time when data integrity and data quality are part of our everyday conversations, even if we don’t use the terms explicitly. Any time we talk about ChatGPT or other generative AI tools, we’re ultimately having a conversation about data integrity and data quality. In the case of LLMs, when we hear about “hallucinations,” we’re actually talking about high data integrity and questionable data quality.

For example, we asked ChatGPT to write some data jokes:

  • Why don't databases make good comedians? They can't handle any tables turning on them!
  • Why was the computer cold at the office? It left its Windows open.
  • What did the data analyst say to the large dataset? "I've got a 'bit' of a crush on you."

All the elements of a joke are right there: the curiosity-inducing setup and the punchline. But these jokes fall flat. They neatly illustrate the difference between data integrity and data quality: they have the right structure and delivery (data integrity), but they’re not actually funny (data quality).

It’s easy to confuse data integrity with data quality. They “sound” like similar things and there is some overlap in their use. However, they are different and often misunderstood. So let’s talk about what they are, what they aren’t, and how to make the most of them.

What is data integrity?

When data is created at the source and ingested, it should be stored as unaltered as possible to maintain its integrity. Modern data warehouses and storage platforms are pretty good at storing raw data in its original format, regardless of structure. That’s why we have data lakes that can store PDFs, YAML files, and tabular data. 

Still, it’s not uncommon to need to transform the data at least a little. You might need to standardize date formats (e.g. YYYY-MM-DD vs. MM-DD-YYYY), use consistent naming conventions (e.g. “Customer_ID” vs. “Cust_ID”), or ensure compliance and privacy (e.g. anonymize names and other PII).
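
As a minimal sketch (the table names, column names, and hashing choice are assumptions, not a prescription), a light staging view can handle that kind of cleanup without touching the raw data underneath:

```sql
-- Hypothetical staging view: standardize formats and mask PII without
-- altering the raw table itself
CREATE VIEW stg_customers AS
SELECT
    cust_id                   AS customer_id,   -- consistent naming convention
    CAST(signup_date AS DATE) AS signup_date,   -- normalize to a single date format
    MD5(LOWER(email))         AS email_hash     -- anonymize PII before downstream use
FROM raw_customers;
```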

The key components of data integrity are:

  • Accuracy: Data is correct and error-free
  • Consistency: Data remains unchanged across all systems and over time
  • Reliability: Data can be trusted for critical decision-making
  • Completeness: All necessary data is present and available for use

Misunderstandings about data integrity

It’s easy to get confused about data integrity. Let’s dive into some common misconceptions.

“Data integrity is simply about preventing data loss.” This misconception likely stems from a narrow interpretation of the term “integrity” in its most basic sense—keeping data whole and intact. It’s overly focused on preservation, which is indeed a crucial part of data integrity but not the whole picture.

“Once data is verified, data integrity is permanent.” Nope. Maintaining data integrity requires continuous monitoring because new errors can be introduced through updates, migrations, or external integrations.

“Data integrity issues are obvious.” Many assume that data integrity problems are self-evident. Much as we wish this were true, issues like subtle data corruption or gradual database degradation can go unnoticed for a long time.

How to enforce data integrity

The best time to enforce data integrity is at the time of collection or at the source. Here are some suggestions:

  • Input validation: Ensure only valid data is entered into the system
  • Data constraints: Use primary keys, foreign keys, and unique constraints to maintain consistency and avoid duplication (see the SQL sketch after this list)
  • Use transactions: Ensure a series of operations either all succeed or fail together (i.e. “atomicity”)
  • Audit trails: Keep logs of data changes to track and verify data manipulation over time
  • Clean data regularly: Periodically review and correct data to remove inaccuracies or inconsistencies
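
Here’s a minimal sketch of the constraint and transaction suggestions in Postgres-flavored SQL. The tables are hypothetical, and note that many analytical warehouses accept these constraints but don’t fully enforce them:

```sql
-- Hypothetical tables showing constraints that enforce integrity at the source
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,                  -- no duplicate customers
    email       TEXT NOT NULL UNIQUE,                 -- required and unique
    status      TEXT NOT NULL
        CHECK (status IN ('active', 'inactive'))      -- input validation at write time
);

CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
        REFERENCES customers (customer_id),           -- no orphaned orders
    amount      NUMERIC NOT NULL CHECK (amount >= 0)  -- reject impossible values
);

-- Transactions: both inserts succeed together or not at all (atomicity)
BEGIN;
INSERT INTO customers (customer_id, email, status) VALUES (1, 'a@example.com', 'active');
INSERT INTO orders (order_id, customer_id, amount) VALUES (100, 1, 49.99);
COMMIT;
```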

What is data quality?

Whereas data integrity is about the properties of data across its lifespan, data quality is about preparing data for downstream usage. Data quality isn’t so concerned with how the data is stored or whether it’s been massaged; it’s concerned with whether the data is suitable for its final destination(s) and consumption use cases.

There are eight dimensions of data quality:

  • Accuracy: The data represents reality
  • Completeness: All the required data is present
  • Consistency: Data is consistent across different datasets and databases
  • Reliability: The data is trustworthy and credible
  • Timeliness: Data is delivered in the right format, at the right time, to the right audience
  • Uniqueness: There are no duplicate records
  • Usefulness: Data is applicable and relevant to problem-solving and decision-making
  • Differences: Users know exactly how and where data differs

Quality data ensures decisions and outcomes are based on the most accurate, complete, and relevant information. It means you’re serving up data in exactly the way downstream consumers need, whether they’re people or automated systems.

Misunderstandings about data quality

“Data quality” can be ambiguous at times, since it’s used to describe both qualitative and quantitative aspects of data, and that ambiguity breeds confusion. Let’s dig into some of the more common misconceptions.

“More data means more quality.” Sorry, that’s a whole pile of nope. Simply having a large volume of data doesn't mean diddly. A big dataset full of outdated or duplicate records is less valuable than a smaller, well-maintained one.

“Data quality is primarily about accuracy.” First of all, data quality is about all eight dimensions, not just one. For example, you can’t use sales data from 2010 for a 2024 market analysis, no matter how accurate it is.

“Once data is clean, it’ll stay that way.” Wishful thinking. A million things can change your data over time.

“Automated tools fix all data quality issues.” Relying solely on software to address data quality overlooks complex issues that require human judgment, such as contextual misinterpretations.

How to enforce data quality

There are two opportunities to enforce data quality: when data is dumped into your warehouse and when you actually use and model it for analytics. You can enforce some data quality standards at collection using the data integrity checks we mentioned above. However, there’s no guarantee that the data you get will be high quality. Even false information can have high data integrity.

Let’s take a look at the two scenarios and what you can do in each.

When data is dumped into your warehouse:

  • Cross-database validation: Ensure parity as soon as data enters your analytics warehouse by verifying that it matches across different databases and systems
  • Freshness checks: Also known as “timeliness checks,” these verify that data is up-to-date and reflects the most current information available (sketched in SQL below)
  • Source-level checks: Used for completeness and consistency, source-level checks validate that all expected data is present and correctly formatted at the point of entry from the source
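
As a rough sketch (the raw_events and expected_sources tables and the 24-hour threshold are assumptions for illustration), a freshness check and a source-level completeness check can be as simple as:

```sql
-- Freshness check: flag the table as stale if nothing has loaded in the last 24 hours
SELECT
    MAX(loaded_at) AS latest_load,
    CASE
        WHEN MAX(loaded_at) < CURRENT_TIMESTAMP - INTERVAL '24 hours' THEN 'stale'
        ELSE 'fresh'
    END AS freshness_status
FROM raw_events;

-- Source-level completeness check: list expected sources that sent nothing today
SELECT e.source_name
FROM expected_sources AS e
LEFT JOIN raw_events AS r
    ON  r.source_name = e.source_name
    AND r.loaded_at >= CURRENT_DATE
WHERE r.source_name IS NULL;
```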

When you’re actually using data downstream:

  • Classic data quality tests: Ad hoc SQL, unit testing, and dbt tests ensure that data transformations, calculations, and aggregations are correct and reliable. This set of tools can help identify discrepancies, validate business logic, and maintain integrity throughout the data pipeline (a dbt-style test is sketched after this list)
  • Integrate automated testing in your CI/CD process: Testing during your CI process is one of the best ways to guarantee bad data never enters production analytics environments. It also forces you to create a standardized process for data quality testing

  • Dynamic data modeling and anomaly detection: Automatically spot unexpected changes or unusual patterns in data as it flows through your systems, so you can address issues in real time and prevent errors from propagating to downstream processes
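
For instance, a dbt singular test is just a SQL file that selects rows that should never exist; the test fails if any rows come back. The model and column names below are hypothetical:

```sql
-- tests/assert_orders_are_valid.sql
-- A dbt singular test: the test fails if this query returns any rows
SELECT o.order_id
FROM {{ ref('fct_orders') }} AS o
LEFT JOIN {{ ref('dim_customers') }} AS c
    ON o.customer_id = c.customer_id
WHERE o.revenue < 0            -- business logic: revenue can't be negative
   OR c.customer_id IS NULL    -- referential integrity: every order needs a known customer
```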

|  | Data Integrity | Data Quality |
| --- | --- | --- |
| Purpose | To ensure data is accurate, consistent, and reliable. | To ensure data meets the needs of its specific use cases. |
| Why it matters | Critical for security, compliance, and decision-making. | Affects user satisfaction, decision accuracy, and insights. |
| How it’s used | Through controls like encryption, audits, and constraints. | By enhancing, cleaning, and standardizing data. |
| Key components | Accuracy, consistency, reliability, completeness. | The eight dimensions listed above. |
| Scope | Throughout the data lifecycle. | Focuses on user needs and data usability. |
| What goes wrong | Data breaches, loss, corruption. | Misleading insights, poor decisions, user dissatisfaction. |

Best practices for maintaining data integrity and data quality

It’s important to remember that you want both data integrity and data quality in your analytics ecosystem. Don’t prioritize one over the other. The end goal is having the best data possible.

So, we recommend you automate as much as humanly possible. Here are some tips:

  • Automate data validation to ensure consistency and accuracy across datasets
  • Run continuous integration (CI) pipelines for data models to catch errors early
  • Use automated monitoring and alerting to track data anomalies or integrity issues in real time
  • Adopt data observability tools to automatically track the health of data systems
  • Implement schema change detection to automatically identify and address unintended alterations (a sample check is sketched below)
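
For example, schema change detection can be as lightweight as diffing the live schema against an approved snapshot. This sketch assumes a hypothetical expected_schema table holding that snapshot and a warehouse that exposes the standard information_schema:

```sql
-- Detect columns that were added, dropped, or retyped since the last approved snapshot
WITH live AS (
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'fct_orders'
),
expected AS (
    SELECT column_name, data_type
    FROM expected_schema
    WHERE table_name = 'fct_orders'
)
SELECT
    COALESCE(l.column_name, e.column_name) AS column_name,
    e.data_type AS expected_type,
    l.data_type AS current_type
FROM live AS l
FULL OUTER JOIN expected AS e
    ON l.column_name = e.column_name
WHERE l.column_name IS NULL            -- column was dropped
   OR e.column_name IS NULL            -- column was added
   OR l.data_type <> e.data_type;      -- column changed type
```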

Also, do other things that aren’t automation:

  • Regularly train staff on data handling procedures
  • Establish clear data governance policies
  • Conduct manual audits and reviews of data for accuracy
  • Foster a culture of data accountability where team members are encouraged to report discrepancies
  • Maintain detailed documentation on data sources, transformations, and lineage to track data's journey and ensure its integrity and quality at every stage

Data integrity and data quality matter for every company

No matter where you work in your company, no matter how far removed from your CEO — if you influence data quality and integrity, you are playing a big part in your company’s success. Data integrity and quality are foundational to trust and decision-making in every industry. High standards in data can drive better business outcomes, innovation, and customer satisfaction.

You’re helping to mitigate risks, including financial losses, reputational damage, and regulatory penalties. You’re providing a competitive edge in a world where data matters more than ever. And you’re making a commitment to data excellence.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
