Where Do Data Quality Issues Come From?

  1. Third-Party Data
  2. Data Transformation Bugs
  3. Changing Business Requirements
  4. Data Type and Schema Inaccuracies
  5. Systemic Pipeline Breakages

Let's explore each and how to prevent them from impacting production data.

#1: Third-Party Data

One of the most common causes of data quality issues is the ingestion of third-party data sources. This is because you have no control over the underlying data; it can be reliable at one moment but suddenly change without notice. These issues may also be difficult to fix immediately, thereby persisting for some time without resolution.

Common issues stemming from third-party data ingestion

Let’s consider some concrete examples of data quality issues that can stem from third-party data sources.

File formats changing without warning

Consider a Python data pipeline that ingests CSV files that are gzipped, given the .gz extension, and sourced from a third-party SFTP server.

# Python
import gzip

# Decompress and read the daily extract from the third-party drop location
with gzip.open('third-party-files/daily_extract.gz', 'rb') as f:
    data = f.read()
    # do something with the decompressed bytes


This is a common pattern in pipelines that ingest third-party data. Unfortunately, this sort of approach will immediately break when that third party decides, intentionally or not, to stop gzipping its files.


A simple solution, such as checking file types before ingestion, can catch file type issues and alert on them before they reach production.

# Python
from glob import glob

files = glob('third-party-files/*')
for file in files:
    if not file.endswith('.gz'):
        # Alert and stop before ingesting an unexpected file type
        raise ValueError(f'unexpected file type: {file}')


The above example incorporates alerting, which, while still resulting in a stopped pipeline, will help engineers quickly determine and resolve the issue. Another approach would be to implement code that can handle several common file types, as sketched below.
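
If the pipeline needs to keep running rather than halt on every format change, a more tolerant loader can sniff each file before deciding how to read it. This is a minimal sketch, assuming the same daily extract path as above and that the payload is CSV whether or not it is gzipped:

# Python
import csv
import gzip

def open_extract(path):
    """Open a third-party extract whether or not it is actually gzipped."""
    # Gzip files always begin with the magic bytes 0x1f 0x8b
    with open(path, 'rb') as f:
        is_gzipped = f.read(2) == b'\x1f\x8b'
    if is_gzipped:
        return gzip.open(path, 'rt', newline='')
    return open(path, 'rt', newline='')

with open_extract('third-party-files/daily_extract.gz') as f:
    rows = list(csv.reader(f))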

Column names changing without warning

Changing column names is another typical issue. On the surface, such changes may seem small, but the results can range from immediately broken data pipelines to missing data that goes undetected for an extended period of time.

Some forward-looking data teams might try to build code and pipelines to ignore new columns while alerting on missing columns. However, in practice, this is rarely implemented or thought about strategically. Also, the less frequently data is received from third-party sources, the more likely it is for this problem to occur. This is especially true when the third-party source uses manual processes to generate its data.
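
As a minimal sketch of that idea (the required column list and the alerting behavior are assumptions, not part of any particular pipeline), a check could compare each file's header row against the columns the pipeline needs:

# Python
import csv

REQUIRED_COLUMNS = {'order_id', 'product_id', 'product_desc', 'order_quantity', 'amount'}

def check_header(path):
    """Compare a file's header row against the columns the pipeline requires."""
    with open(path, newline='') as f:
        header = set(next(csv.reader(f)))
    missing = REQUIRED_COLUMNS - header
    extra = header - REQUIRED_COLUMNS
    if missing:
        # A missing column should stop the load and alert the team
        raise ValueError(f'missing expected columns: {missing}')
    if extra:
        # New columns can be ignored, but they are worth logging
        print(f'ignoring unexpected columns: {extra}')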

Malformed data values and the addition or removal of data values

Third-party data sources can leave you with malformed data values or the addition or removal of data points without warning. Malformed data values can be difficult to catch and address, as each data type and normal value range can vary widely depending on the data source. The addition and removal of data points is easier to spot and deal with programmatically in an automated fashion.

Let’s take the example of a third-party vendor that sends order files that are ingested by a data pipeline. This is the first row of a normal file:

order_id | product_id | product_desc | order_quantity | amount
1234     | 5678       | widget       | 1              | 10.00


However, another file comes in later on, and it differs without warning. It looks like this:

order_id | product_description | order_quantity | amount
1234     | widget              | 1              | 1e+1


It’s common for description columns to be renamed unwittingly, from product_desc to product_description. In addition, the amount field now appears in scientific notation due to a source data type error, and the product_id field has disappeared entirely.

When undetected, these column name changes can create pipeline breakages. Even worse, a pipeline may ingest the data incorrectly, resulting in a widespread population of NULL values. In many cases, these issues may go undetected for weeks or months. The same is true for malformed data points. Without in-depth data quality monitoring and alerting, a data team may be relying on pure luck to be made aware of such changes.

These sorts of data quality issues are much harder to overcome, as they require the datasets to be processed in part or in full to find them. Tools like Great Expectations or Soda Core can easily catch such issues, but many teams have yet to adopt such dedicated data quality tooling.
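
Even without adopting one of those tools, a lightweight check can approximate the same idea. Here is a minimal sketch, assuming a pipe-delimited order file with the columns from the example above and an illustrative rule that amount must be a positive number:

# Python
import pandas as pd

df = pd.read_csv('third-party-files/orders.csv', sep='|', skipinitialspace=True)

errors = []

# Required columns must be present
required = ['order_id', 'product_id', 'product_desc', 'order_quantity', 'amount']
missing = [col for col in required if col not in df.columns]
if missing:
    errors.append(f'missing columns: {missing}')

# amount should parse as a number and fall in a plausible range
if 'amount' in df.columns:
    amounts = pd.to_numeric(df['amount'], errors='coerce')
    if amounts.isna().any() or (amounts <= 0).any():
        errors.append('amount contains malformed or non-positive values')

if errors:
    # Alert and stop the load rather than publish bad data downstream
    raise ValueError(f'data quality checks failed: {errors}')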

#2: Data Transformation Bugs

The transform step is the next place that often gives rise to data quality issues. This is arguably the part of development and ongoing maintenance where data engineers spend most of their time, and it involves the most human interaction.

Because data transformations can be such a high-touch area, data quality issues commonly arise here. These issues most often stem from the following two causes:

Cause 1: Incorrect implementation of technology

Some data transform bugs can be introduced from something as simple as the incorrect use of a particular technology. This is a code implementation problem.

Let’s consider a column that holds a STRING data type that needs to be parsed to obtain a particular value. For example, we have an order_information column from which we need to extract a date.

# Python
from pyspark.sql.functions import substring

df = spark.createDataFrame(
    [('2022/01/02 order placed.',)], ['order_information'])
df.select(
    substring(df.order_information, 1, 10).alias('order_date')
).collect()

[Row(order_date='2022/01/02')]


What if the incorrect implementation simply misses the correct substring location by one character?

df.select(
    substring(df.order_information, 2, 10).alias('order_date')
).collect()

[Row(order_date='022/01/02 ')]


We now have a malformed order_date that truncates the year and is not valid. These types of code implementation transform bugs are very common.

Cause 2: Misapplied or misunderstood requirements

At the other end is a correct implementation of the technology (correct code) paired with a misunderstanding of the requirements, which also leads to data quality issues. For example, consider a CASE statement in an SQL transformation that looks like this:

SELECT DISTINCT
  customer_id,
  customer_name,
  CASE
    WHEN customer_status = 1 THEN 'active'
    ELSE 'inactive'
  END AS customer_status
FROM default.customer_orders
WHERE order_date >= CAST(current_timestamp() AS DATE)


In the above example, the data transformation is using a CASE statement to determine the status of a customer. Common data quality issues arise when logic such as this is simply misinterpreted. For example, it’s possible that a customer_status value of 1 actually means the customer has deleted their account.

Catching data quality issues during data transformation

Data teams can put checks and processes in place to catch data quality issues that could be introduced during transformation. The classic approach—which still has yet to be widely adopted by data teams—is unit testing. In the software engineering world, unit testing has been a best practice for decades. Unit tests have their value, in that they enable teams to test many scenarios and prevent the introduction of obvious bugs that might cause downstream data quality issues. However, unit tests don’t catch everything.
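
For example, if the substring logic shown earlier were factored into a small helper, a unit test could pin down the expected behavior. This is a minimal sketch, assuming a hypothetical extract_order_date helper and pytest as the test runner:

# Python
# test_transforms.py (hypothetical module, run with pytest)

def extract_order_date(order_information):
    """Return the leading YYYY/MM/DD date from an order_information string."""
    return order_information[0:10]

def test_extract_order_date():
    assert extract_order_date('2022/01/02 order placed.') == '2022/01/02'

def test_extract_order_date_is_not_off_by_one():
    # Guards against the off-by-one substring bug shown earlier
    assert not extract_order_date('2022/01/02 order placed.').startswith('022/')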

In our example of misinterpreting values for customer_status, simple unit tests will not catch issues where requirements are misunderstood. It’s likely that the unit test will also make incorrect assumptions about requirements, resulting in validation of implementation but not business logic.

The importance of monitoring tooling and alerting

As we’ve stepped through typical data quality issues, we’ve moved from simple, easy-to-catch data ingestion bugs to more complex implementation bugs introduced during transformation. As data quality issues become more nuanced, the need for automated data quality alerting and monitoring increases, because it becomes impossible to manually anticipate and prepare for every issue that could arise in every dataset.

#3: Changing Business Requirements

Having touched on misapplied or misunderstood requirements, we arrive at the larger topic of stale or changing business requirements, another significant source of data quality issues. Perhaps one of the most significant challenges to data quality is the translation between the various business departments and the engineering organizations supporting them.

Often because of departmental silos and a lack of communication, organizations may have poor processes for managing ever-changing business definitions, even though those definitions usually map directly to specific logic in the data platform.

Once business requirements or definitions change, data quality issues arise if there is no corresponding change to the code and platform managed by engineering. This can have disastrous consequences.

An example of this kind of data quality problem could be as simple as changing the definition of “new customer”. Perhaps the original definition of “new customer” was someone who ordered five items or more in the last 30 days. Meanwhile, a new decision from the C-suite has changed that threshold to three items instead of five.

SELECT customer_id, SUM(order_quantity) as ordered_items_cnt
FROM default.customer_orders
WHERE order_date >= CAST(current_timestamp() as DATE) - INTERVAL 30 DAYS
GROUP BY customer_id
HAVING ordered_items_cnt >= 5;


A seemingly simple query and definition that feeds executive dashboards would immediately be incorrect, feeding the wrong information to downstream consumers until someone updates the HAVING threshold from five items to three.

As business requirements change or go stale, the lack of information flow between business units can have a serious impact. In addition, the data quality issues that arise from this cause can be very hard to find and manage. Even classic data monitoring and alerting tools would have almost no way of catching such errors; the implementation details aren’t necessarily “wrong” so much as they are out of date.

Solutions to changing business requirements

The normal tooling—unit testing and alerts—will have little effect on data quality issues arising from business requirement changes. To help with these problems, the following areas can be improved upon or implemented:

  • Updated documentation and dissemination of the documentation
  • Regularly scheduled interdepartmental meetings and updates
  • Cross-functional standups
  • Scheduled cross-functional collaboration

These are non-technical ways to deal with many types of data quality issues, and they can be instrumental in preventing more serious but harder-to-catch problems.

#4: Data Type and Schema Inaccuracies

Data quality issues also come from incorrect data types and schemas, but these can be relatively simple to address. This set of problems is much easier to catch through properly implemented monitoring and alerting. However, some cases within this cause can be less obvious.

Data type problems can be both straightforward and hard to identify, depending on the context. For example, if there is a column called amount that has a data type of DECIMAL or FLOAT, trying to insert a CHAR value of some sort will lead to obvious errors that will likely surface immediately.

On the opposite end of the spectrum, if we have a column called product_description that inadvertently gets swapped for a product_id column that is normally an INTEGER, the technology might just automatically infer and store the integer id as a CHAR without warning or error. However, this depends on the tooling and programming languages involved.
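
A minimal illustration in pandas (the file and column names are hypothetical): when columns are swapped, type inference quietly accepts the bad data, so an explicit dtype check is needed to surface the problem:

# Python
import pandas as pd

df = pd.read_csv('third-party-files/orders.csv', sep='|', skipinitialspace=True)

# If product_id now contains description text, pandas silently infers an
# object (string) dtype instead of the integer dtype we expect.
expected_dtypes = {'product_id': 'int64', 'order_quantity': 'int64'}
for column, expected in expected_dtypes.items():
    actual = str(df[column].dtype)
    if actual != expected:
        # Alert rather than letting the mistyped column flow downstream
        raise TypeError(f'{column}: expected {expected}, got {actual}')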

One example of how a lax approach to schemas can lead to data quality issues is the failure to use constraints. Let’s say we have the following DDL to hold information about our customers:

CREATE TABLE IF NOT EXISTS default.customer_information (
  customer_id INTEGER,
  first_name STRING,
  last_name STRING,
  ...
);


It’s clear to most data professionals that the customer_id in our example is an extremely important column with many dependencies. However, nothing protects such an important data point from being filled with NULL values should upstream issues arise.

Simple changes to the schema of this dataset would protect customer_id from bad values that would otherwise cascade to other datasets; adding a NOT NULL constraint guards against the most obvious of them.

CREATE TABLE IF NOT EXISTS default.customer_information (
  customer_id INTEGER NOT NULL,
  first_name STRING,
  last_name STRING,
  ...
);


In addition, most data teams can automate the process of checking and validating data values and schemas with relatively little effort.
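
As a minimal sketch of that kind of automation (assuming a Spark environment with an active spark session and the customer_information table above; the expected schema would normally live in version control), a scheduled job could compare the table's current schema against the expected one and alert on drift:

# Python
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

expected_schema = StructType([
    StructField('customer_id', IntegerType(), nullable=False),
    StructField('first_name', StringType(), nullable=True),
    StructField('last_name', StringType(), nullable=True),
])

actual_schema = spark.table('default.customer_information').schema

# Compare field by field so any alert names the exact drift
if len(actual_schema) != len(expected_schema):
    raise ValueError('schema drift detected: column count changed')
for expected, actual in zip(expected_schema, actual_schema):
    if (expected.name, expected.dataType, expected.nullable) != (
            actual.name, actual.dataType, actual.nullable):
        # Alert, open a ticket, block downstream jobs, etc.
        raise ValueError(f'schema drift detected: expected {expected}, got {actual}')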

#5: Systemic Pipeline Breakages

The last, and perhaps least obvious, source of data quality issues is the operation of the data platform and architecture on which the data is stored and processed. Even if an organization uses the newest tooling in a Modern Data Stack, poor implementation can lead to systemic breakages, resulting in an overall reduction in data quality.

Why do constantly breaking pipelines and architecture lead to a degradation of data quality?

  • Rushed fixes to transforms or pipeline breakages introduce new bugs.
  • Constant fire-fighting leaves no time to focus on data quality.
  • Over time, the team focuses solely on keeping the pipelines running.

Data quality issues can also come from poor tooling choices that lead to complex or sloppy implementations, resulting in inaccurate and malformed data. If a data architecture has become overly complex, then, as with any engineering solution, adding new features or data quality monitoring and alerting becomes too much to tackle and never gets enough support to be implemented.

Long-term decisions about data quality and technology go hand in hand: wise tool choices, coupled with proper implementation, will lead to clean and accurate data.

Conclusion

The Modern Data Stack has brought a rich set of tools and solutions to most data teams, but there is also a dark side to this ever-expanding toolbox. As more tools and touchpoints are brought on board, more frequent and more challenging data quality issues may arise. This reality has compelled teams to focus on data quality, monitoring, and alerting in recent years. 

In summary, we’ve reviewed five of the most common data quality issues found in most data teams, regardless of their technology stack.

  1. The ingestion of third-party data sources
  2. Bugs introduced during data transformation
  3. Stale or changing business requirements
  4. Data type and schema inaccuracies
  5. Systemic pipeline breakages

As the Modern Data Stack grows in complexity, along with the range of types and sizes of data being ingested, proactive data quality solutions are the future. The automated implementation of end-to-end data observability and testing tooling is the only way for data teams to finally stop playing catch-up and actually get ahead of data quality.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
