Data reconciliation challenges and solutions

The tricky business of data reconciliation in three all-too-familiar scenarios: 

  • "I'm trying to replicate data from MySQL to PostgreSQL and keep running into data type mismatches. How do I handle different date and time formats between these databases?"
  • We’re migrating data from Oracle to Snowflake and facing collation issues leading to different sorting orders. Any tips on ensuring consistent text data comparison across these systems?”
  • “There are significant performance bottlenecks in our replication pipelines due to large data volumes. Is there anything we can do to optimize the process?”

Validating your data replication pipelines is a technical challenge beyond what most imagine. These three scenarios are not necessarily independent questions from different data teams–they could all occur within one organization. 

But because of the complexities involved in validating tables across databases, companies often only end up just migrating or replicating the data, and push off testing until something breaks down. 

In part two of our three-part series on data reconciliation, we’re here to help by looking at the top five technical challenges involved in data reconciliation, including speed, efficiency, detail, data types, and collations. Then we’ll turn to three categories of solutions: manual, rule-based, and data diffs. 

To recap, part one looked at data reconciliation use cases, techniques, and challenges, while in part three we will turn to four technical best practices, including selecting validation metrics, efficient resource management, and automating data quality testing. 

Five data reconciliation challenges

There are five data reconciliation challenges that data teams will encounter during a replication or migration. We’ll go over the technical hurdles and then share an effective solution. 

Challenge #1: Speed

When you're comparing data between two databases, speed is crucial. Imagine you have a huge amount of data to check – if this process takes too long, it can slow down your entire system.

That’s obvious. What does it take to diff your data between databases fast?

The challenge here is balancing speed with accuracy. You want the comparison to be fast, but you also need it to be precise. If you rush the process, you might miss differences or errors in the data. For instance, when synchronizing customer databases across two servers, a quick comparison might overlook small but critical differences in customer addresses or transaction dates. 

On the other hand, being too meticulous by checking every single value can make the process unbearably slow. While this is a comprehensive way to validate the data, ironically it might proceed too slowly to catch important discrepancies on a timely basis, which might lead to data integrity issues if not caught in time.

Datafold uses stochastic checksumming to speed up the process. It’s a way to compare databases without needing to look at every single row and value.

Instead of comparing every piece of data between two databases, stochastic checksumming takes random samples and creates small, unique “fingerprints” (checksums) of the data samples. By checking these samples, you can get a good idea of whether the data is generally consistent without the heavy lifting of a full comparison. 

And, those checksums are also compared between the source and target databases. If the checksums match, the data is likely the same. If they don't, it indicates a difference. 

This method is much faster than line-by-line comparisons because it reduces the data to much smaller samples and checksums. You can quickly verify data replication accuracy without getting bogged down in a full, detailed comparison of every single data point.

Challenge #2: Efficiency

When comparing data between databases, one factor affecting efficiency is the amount of data you need to send over the network to conduct the comparison. Sending large amounts of data over the network can quickly use up available bandwidth. This can slow down the comparison process significantly and impact the performance of other network-dependent operations.

Transferring data over long distances also introduces latency. The more data you send, the longer it takes, which can slow down the entire replication testing process. Reducing the amount of data sent helps keep the process fast and responsive.

Cost is also top of mind for many data teams, and data replication and its consequential testing can be an expensive endeavor if you don’t pay attention. Many cloud services and data centers charge based on the amount of data transferred. Sending large volumes of data can become expensive. By minimizing data transfer, you can reduce costs.

Challenge #3: Detail

To ensure data integrity/parity between systems, it's not enough to know that differences exist; you need to pinpoint exactly which rows and values are different. This precision allows you to understand the scope and nature of any discrepancies.

Oftentimes, manual SQL queries between two databases can give you a general estimate of accuracy. But, when something goes wrong, it can be challenging to pinpoint the exact problem using this method. Knowing the exact differences makes it easier to fix errors. If you can see exactly which rows and values don't match, you can correct them directly without having to re-check the entire dataset.

Detailed comparisons also help maintain consistency across databases. By identifying and correcting specific discrepancies, you ensure that both databases are in sync, which is crucial for applications relying on consistent data.

In regulated industries, you need to know which rows and values differ to create audit trails and ensure compliance.

Challenge #4: Data types

Dealing with different data types can be really tricky when comparing data across databases. Databases often have their own unique set of data types and definitions. For example, one database might store dates as a DATE type, while another might use DATETIME or TIMESTAMP. These differences can lead to inconsistencies when comparing data. 

It’s important to pay attention to these three differences during replication testing:

Non-aligned behavior

Even when databases have similar data types, they might behave differently. For instance, a VARCHAR in one database might have different maximum length limits or character encoding compared to another database. These small differences can cause discrepancies that are hard to spot.

Numbers and null values

Numeric data types, such as FLOAT or DECIMAL, can have different precision and scale (number of digits and decimal places) across databases. This can lead to rounding errors or mismatched values when comparing datasets.

Databases might handle default values and nulls differently. One database might treat a missing value as NULL, while another might use a default value like 0 or an empty string. This can result in apparent differences that are actually due to how each database handles missing data.

Conversion challenges

When data types differ between the source and target databases, the data often needs to be converted from one type to another. For example, converting a VARCHAR to a TEXT, or a DATETIME to a TIMESTAMP. If these conversions are not handled properly, they can lead to errors such as truncated data, rounding issues, or even data loss.

Challenge #5: Collations

Collations are rules that define how data is sorted and compared in a database. Different databases might use different collation rules, leading to discrepancies when comparing data between them. There are many reasons why collations matter in replication testing, but we’ll look at these four:

Sorting order

Collations determine the order in which text data is sorted. For example, one collation might sort characters case-insensitively (treating 'a' and 'A' as equal), while another might sort them case-sensitively (treating 'a' and 'A' as different). This means that the same dataset can appear in a different order depending on the collation used.

Comparing text data 

Different collations might cause two identical strings to be considered different due to case differences or accent marks. This can lead to false positives in data comparison, where the data looks different even though it should be considered the same.

Indexing and searching

Collations also affect how text indexes are built and how search queries are processed. If the source and target databases use different collations, the same search query might return different results. This impacts the consistency and reliability of replicated data.

Multi-language data

Collations are especially important in databases that store multi-language data. Different languages have different sorting rules. For example, sorting rules for Spanish differ from those for English. If the source and target databases use different collations for multi-language data, this can lead to sorting and comparison issues.

Data reconciliation solutions

When you search the web for “data reconciliation solutions”, it’s hard to know when you're getting the most effective tools for your specific needs or merely the most marketed ones. Let’s take a look at three categories of solutions to see their respective strengths and weaknesses:

Manual data reconciliation

This is the simplest form of reconciliation, often used in smaller or less complex environments where data volumes are manageable without automated tools. 

In manual reconciliation, data entries are compared side by side, often using simple tools like spreadsheets or manual SQL queries. This process is highly dependent on the skills and attentiveness of the person performing the reconciliation and is prone to human error. It’s only suitable for organizations with minimal data integration needs or where automated solutions are not cost-effective.

Rule-based data reconciliation

This involves defining specific rules that data must conform to when being compared across systems. These rules can include checks for data type, format, consistency, duplication, and completeness. 

Rule-based systems can automate the reconciliation process to a large extent, reducing the time and error associated with manual methods. This approach is more scalable and can handle large data volumes. It is often used in financial reconciliations, customer data integrations, and situations where data integrity is critical for compliance and operational effectiveness.

Data diffs

Data diffs involve comparing two sets of data to find discrepancies or changes at the value-level. (If you’ve never heard of the term, it’s basically git diffs but for your data.) This has particular strengths that make it highly effective under certain conditions and often outperforms other data reconciliation solutions. 

Because data diffs are designed to precisely detect changes at a granular level, you can identify exactly what has changed, been added, or been removed. And because it’s so customizable to specific business rules or data structures, you can apply it for specialized use cases whether between the same or different types of databases. 

Data diffs also scale to large datasets for any SQL-based databases, making data comparison and reconciliation a faster process for your team.

An added benefit is that it inherently provides a clear audit trail of what has changed, which is crucial for compliance and regulatory needs. It allows organizations to maintain detailed logs of data changes, which can be critical for audits, historical analysis, and legal compliance.

Getting started with data reconciliation

If you’re ready to learn more about how Datafold is changing the way data teams test their data during data reconciliation, there are a few options to getting started:

  • Request a personalized demo with our team of data engineering experts. Tell us about your data replication or migration pipelines, tech stack, scale, and concerns. We’re here to help you understand if data diffing is a solution to your data reconciliation efforts.
  • For those who are ready to start playing with cross-database diffing today, we have a free trial experience of Datafold Cloud, so you can start connecting your databases as soon as today.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.

Datafold is the fastest way to test dbt code changes