The Data Engineering Podcast: Reconciling the data in your databases

Tobias Macey, host of the Data Engineering Podcast, recently sat down with our founder, Gleb Mezhanskiy, to chat about the complexities of managing reconciliation, its different failure modes and error conditions, and how different systems interoperate. They discussed:

  • The challenges of validating and reconciling data at scale
  • Common patterns that data teams encounter in data replication, and how that changes between continual and one-time replication
  • Lessons learnt from building a solution that can understand the differences between datasets, at scale, and across different databases
  • Surprising and innovative applications from Datafold’s customers 
  • Where Gleb sees the future of data engineering tooling

Check out the podcast, or read on to learn about the highlights!

Data reconciliation is more than matching data

Data reconciliation means identifying and resolving differences in data across (or within) database environments, and it gets complicated fast. Gleb frames data reconciliation as one dimension of data quality, so it’s easy to see how it applies to a number of workflows that data practitioners commonly encounter:

  1. Change management: comparing data between staging and production environments to ensure that changes to data processing code are accurately reflected in production.

Data processing code is complex and changes frequently, which makes it hard to track those changes and confirm they land correctly in production. Large data volumes in both staging and production environments further complicate identifying and reconciling differences, especially with real-time data processing.

  2. Migrations: assessing whether data integrity was maintained when moving between databases.

A migration isn’t over until you’ve proven to your data users that the new data matches what they had in the legacy environment. Yet too many teams still rely on manual checks such as sampling or ad hoc GROUP BY aggregates to validate that post-migration data matches the source (a simple version of this kind of check is sketched after this list). This drags down engineering productivity and puts the entire migration at risk.

  3. Data replication: verifying the accuracy of replicated data across different systems or databases.

Replicating data across different database systems is a hard problem because of the high throughput and scale of OLTP systems. Network latency, data format differences, and schema changes all make perfect replication difficult, yet getting it right at scale is critical for data governance.
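All three workflows reduce to the same underlying question: do two datasets actually agree? As a rough illustration of the kind of aggregate check described above (not Datafold’s implementation), here is a minimal Python sketch that compares row counts and per-column sums between a source and a target table. The table name, columns, and in-memory SQLite connections are hypothetical stand-ins; in practice the two connections would point at the legacy and target systems.

```python
# Minimal sketch: check that a table matches across two databases by
# comparing row counts and per-column aggregates ("fingerprints").
# Table and column names are hypothetical; swap the SQLite stand-ins
# for real DB-API connections to your source and target systems.
import sqlite3

def table_fingerprint(conn, table, numeric_columns):
    """Return the row count plus a SUM per numeric column for one table."""
    aggregates = ", ".join(f"SUM({col})" for col in numeric_columns)
    cursor = conn.execute(f"SELECT COUNT(*), {aggregates} FROM {table}")
    return cursor.fetchone()

def reconcile(source_conn, target_conn, table, numeric_columns):
    """Compare fingerprints and report whether the two tables agree."""
    source = table_fingerprint(source_conn, table, numeric_columns)
    target = table_fingerprint(target_conn, table, numeric_columns)
    if source == target:
        print(f"{table}: row counts and aggregates match")
    else:
        print(f"{table}: MISMATCH source={source} target={target}")

if __name__ == "__main__":
    # In-memory stand-ins so the sketch runs end to end; real reconciliation
    # would connect to the legacy and target databases instead.
    legacy, migrated = sqlite3.connect(":memory:"), sqlite3.connect(":memory:")
    for conn in (legacy, migrated):
        conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)",
                         [(1, 10.0), (2, 25.5), (3, 7.25)])
    migrated.execute("UPDATE orders SET amount = 7.20 WHERE id = 3")  # simulate drift
    reconcile(legacy, migrated, "orders", ["amount"])
```

Aggregate checks like this catch gross discrepancies cheaply, but they can miss offsetting row-level differences, which is why purpose-built diffing tools go further and compare data down to individual rows and values at scale.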

Data reconciliation and accounting?

Gleb started Datafold to solve chronic pain points he experienced early in his career as a data engineer at Lyft. Many of these were embedded in workflows that data practitioners will encounter at some point in their careers, and that remain challenging to execute: change management to compare data between different environments, orchestrating database migrations, and validating replicated data across different systems.

He was surprised to find that clients were coming to Datafold Cloud for help with a fourth workflow: compliance and auditing. Here, the users went beyond data engineers and heads of data teams to include accountants.

In hindsight, it’s clear why: the financial services industry prepares for regular audits. Traditionally, important data resided solely within ERP systems and was accessed exclusively through those systems. Today, organizations increasingly rely on metrics computed from analytical sources or blend transactional data from ERP systems with analytical sources for reporting to stakeholders, customers, and regulatory authorities. This shift has introduced a new level of complexity and necessitates robust controls to ensure data accuracy and integrity. 

Gleb shared that these clients have successfully used Datafold to demonstrate to auditors that changes made to key metrics and data pipelines do not compromise data integrity. By reconciling data across database environments, data engineers could work collaboratively with the accounting team to provide auditors with comprehensive insights into their data workflows, facilitating smoother audits, saving time, and reducing audit-related costs.

What’s missing in the data tooling stack

“Old school” data engineering is making a comeback. As more businesses integrate AI and LLM applications into their business models, many companies will need to get the data fundamentals right: how can data engineering teams build more reliable pipelines that deliver more data, more quickly?

Data engineers are still waiting for their GitHub Copilot. To help data practitioners, a tool has to deeply understand their specific context. Current AI-enabled IDEs and other data engineering tooling simply aren’t able to incorporate business context, which is essential in anything to do with data. And because that context is so specific to each company, it’s hard to create a universal solution. But not impossible.

Check out the full interview on the Data Engineering Podcast for more data engineering insights!

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
