Folding Data #36

Open source data-diff

Last week we launched an open source package for diffing data between databases, e.g. PostgreSQL <> Snowflake. Check it out here: https://github.com/datafold/data-diff/

Every data platform involves moving data between databases, so it's no surprise that there are over a dozen major solutions for data replication. Some are open source, some are SaaS, some are older, some newer – but given the complexity of synchronizing datasets across databases quickly and reliably, what always puzzled me was that there was no convenient tooling to validate the consistency of replicated data. Every data team I spoke to had encountered the problem, but the solutions tend to be painful and manual (e.g. run COUNT(*) in both source and destination and dig into discrepancies). So we built a tool that helps every data team verify the correctness of their data replication between databases, fast and at scale.

data-diff matrix

data-diff is cool because it:

🚀  Is very, very fast!

- PostgreSQL <> Redshift check of 1B rows in under 2 minutes 🤯

- Speed comes from using binary search on checksums instead of sending all rows across the network for comparison (see the sketch after this list)

🔎  Identifies discrepancies down to individual rows and values

🤝  Handles tricky column type differences between DBs (e.g. Double ⇆ Float ⇆ Decimal)

🐍  Has CLI and Python API – just pip install data-diff
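
To make the binary-search-on-checksums claim concrete, here's a minimal, self-contained Python sketch of the idea – not data-diff's actual implementation. The two tables are simulated as in-memory dicts; checksum() stands in for an aggregate query pushed down to each database, and fetch_rows() stands in for a plain SELECT over a small key range:

```python
import hashlib

def checksum(table: dict, lo: int, hi: int) -> str:
    """Checksum of all rows whose key falls in [lo, hi).
    In a real diff this would be a single aggregate query
    (e.g. an MD5 over concatenated rows) run inside the database."""
    h = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        h.update(f"{key}:{table[key]}".encode())
    return h.hexdigest()

def fetch_rows(table: dict, lo: int, hi: int) -> dict:
    """Fetch raw rows for a small key range (SELECT ... WHERE key in range)."""
    return {k: v for k, v in table.items() if lo <= k < hi}

def diff_segment(src: dict, dst: dict, lo: int, hi: int, threshold: int = 4):
    """Recursively compare [lo, hi): skip matching segments, split
    mismatching ones, and only materialize rows for tiny segments."""
    if checksum(src, lo, hi) == checksum(dst, lo, hi):
        return  # segment is identical -- no rows cross the network
    if hi - lo <= threshold:
        a, b = fetch_rows(src, lo, hi), fetch_rows(dst, lo, hi)
        for k in sorted(a.keys() | b.keys()):
            if a.get(k) != b.get(k):
                yield ("-", k, a.get(k)), ("+", k, b.get(k))
        return
    mid = (lo + hi) // 2
    yield from diff_segment(src, dst, lo, mid, threshold)
    yield from diff_segment(src, dst, mid, hi, threshold)

# Two "tables" that differ in one row: only the mismatching
# segments are ever fetched; everything else is skipped by checksum.
source = {i: f"row-{i}" for i in range(1_000)}
target = dict(source, **{512: "row-512-corrupted"})
for minus, plus in diff_segment(source, target, 0, 1_000):
    print(minus, plus)
```

Because matching segments are dismissed after a single checksum comparison, network traffic scales with the number of discrepancies rather than the table size – which is where numbers like 1B rows in minutes come from.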

This open source package enables many new use cases for Datafold:

  • Verifying data migrations. Confirm that all data was copied during a critical migration, e.g. from Heroku PostgreSQL to Amazon RDS.
  • Verifying data pipelines. Check that data arrived intact when moving from a relational database to a warehouse/data lake with Fivetran, Airbyte, Debezium, or another pipeline.
  • Alerting and maintaining data integrity SLOs. Create and monitor an SLO of e.g. 99.999% data integrity, and alert your team when data is missing.
  • Debugging complex data pipelines. When data gets lost in a pipeline that spans half a dozen systems, it's extremely difficult to track down where a row went missing without verifying each intermediate datastore.
  • Detecting hard deletes in updated_at-based pipelines. If you copy data to your warehouse based on an updated_at-style column, you'll miss hard deletes – data-diff can find them for you.
  • Making your replication self-healing. Use the diff output to write or update rows in the target database, so replication repairs itself (see the sketch below).
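
Here's a rough sketch of that last use case using the Python API. The connection strings, table names, and key columns are illustrative, the sign convention is my reading of the diff output ('-' for rows only in the first table, '+' for rows only in the second), and apply_missing_row / remove_extra_row are hypothetical placeholders for your own upsert/delete logic:

```python
from data_diff import connect_to_table, diff_tables

# Illustrative connections -- substitute your own source and target.
source = connect_to_table("postgresql://user:pass@source-db/app", "orders", "id")
target = connect_to_table("snowflake://user:pass@account/db/schema", "ORDERS", "ID")

def apply_missing_row(row):
    """Hypothetical helper: INSERT/UPSERT the row into the target."""
    ...

def remove_extra_row(row):
    """Hypothetical helper: DELETE the row from the target."""
    ...

for sign, row in diff_tables(source, target):
    if sign == "-":
        apply_missing_row(row)   # present in source, missing in target
    elif sign == "+":
        remove_extra_row(row)    # present in target, not in source
```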

If you want to read more about why we built and open-sourced data-diff, check out our blog post here: https://www.datafold.com/blog/open-source-data-diff

Also, if you've ever encountered the problem of validating data replication, give it a star on GitHub!


Interesting read: Motif Analytics

As someone who has built dashboards to understand user flows, I can't overstate the frustration of only seeing total conversion rates at each step instead of the actual paths customers took, in order. Motif Analytics argues this constraint comes from SQL itself not being set up for sequence analysis, and I couldn't agree more. It makes me curious what other types of questions SQL may be blinding us from asking.

Everything is a Funnel

Tool of the week (again): Hex

If you have used a BI tool, you are painfully familiar with slow queries and watching load timers spin as you wait to build the next part of your analysis. Hex takes a novel approach here: it keeps compute in the warehouse and is smart about combining queries that reference data from other cells, so data keeps moving fast. It supports Python the same way and has made charts infinitely scalable. If you need a seriously powerful notebook for analytics and data science, look no further.

Data Notebooks Built for the Cloud

Before You Go

suchDiff

Love this newsletter? Forward it to a friend! 🚀
Received this newsletter from a friend? Subscribe for yourself! ✨

Get Started

To get Datafold to integrate seamlessly with your data stack, we need a quick onboarding call to get everything configured properly.

Want to see Datafold in action?

Find out how Datafold can help your team deliver better data products, faster.