Data Diff gets faster and simpler: One algorithm, better performance

Nick Carchedi

The problem with choice

For the past year, Datafold offered two algorithms for cross-database comparisons:

Hashdiff: Pushes computation down to the individual databases
In-memory: Reads the data into a central location and diffs it there
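
To make the distinction concrete, here is a minimal Python sketch of the two strategies. It is an illustration only, not Datafold's implementation: the function names segment_checksum and in_memory_diff, and the row layout (tuples whose first element is the primary key), are assumptions made for this example.

```python
import hashlib


def segment_checksum(rows):
    """Hashdiff-style idea: reduce a segment of rows to a single checksum,
    so only checksums need to be compared across systems."""
    digest = hashlib.sha256()
    for row in sorted(rows):  # sort so row order does not affect the result
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def in_memory_diff(rows_a, rows_b):
    """In-memory idea: load both sides, index them by primary key, compare."""
    a = {row[0]: row for row in rows_a}
    b = {row[0]: row for row in rows_b}
    return {
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }


# Hypothetical sample data: (primary key, name, amount)
source = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]
target = [(1, "alice", 10.0), (2, "bob", 25.0), (4, "dave", 40.0)]

# A checksum comparison only reveals that the segments differ, not where.
segments_match = segment_checksum(source) == segment_checksum(target)

# The in-memory diff pinpoints the differing rows directly.
result = in_memory_diff(source, target)
```

In a checksum-based scheme, a mismatch is usually followed by splitting the segment into smaller pieces to locate the differing rows, whereas the in-memory comparison identifies them in a single pass.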

This created an unnecessary burden. Users had to understand technical trade-offs—which approach would perform best for their specific databases, datasets, and data types—when they simply wanted to validate their data.

One algorithm, optimized for everything

After analyzing hundreds of real-world deployments, we've deprecated hashdiff entirely. Our new unified approach uses in-memory diffing for all cross-database comparisons.

Here's why this is better:

Diff any data source: Works equally well with all relational data sources, including both analytical and transactional databases. If we don't support a source you need, we can generally add it quickly. Just ask.
Diff files: Compare CSV, Excel, and Parquet files to other files, or even to relational data like tables and views.
Handle any data volume: Efficiently diff datasets of up to 10M rows. For larger datasets, we provide filtering and sampling tools to balance accuracy with performance.
Diff any data type: Compare virtually any data type found in relational databases, including text, float, JSON, and more.
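
One common way to sample a large comparison is to pick rows by hashing the primary key, so that both sides select exactly the same keys. The sketch below shows that general technique under stated assumptions; it is not a description of how Datafold's sampling works, and the names in_sample and sample_rows are hypothetical.

```python
import hashlib


def in_sample(primary_key, sample_percent):
    """Return True if this key falls inside the sample.

    Hashing the key, rather than picking rows at random, means both sides
    of the comparison keep the same keys.
    """
    digest = hashlib.sha256(str(primary_key).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < sample_percent


def sample_rows(rows, sample_percent):
    """Keep only rows whose primary key (first element) is in the sample."""
    return [row for row in rows if in_sample(row[0], sample_percent)]


# Hypothetical data: two sides holding the same 1,000 keys.
source = [(i, f"user_{i}") for i in range(1000)]
target = [(i, f"user_{i}") for i in range(1000)]

# Roughly 10% of rows are kept, and the kept keys are identical on both sides.
sampled_source = sample_rows(source, 10)
sampled_target = sample_rows(target, 10)
```

Because the selection depends only on the key, a row that is sampled on one side is always compared against its counterpart on the other, which avoids false "missing row" results caused by the sampling itself.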

Setting up new cross-database comparisons is now straightforward: just connect your sources and run the diff. No algorithm selection, performance tuning, or technical decisions to research. Plus, we've automatically upgraded existing diff monitors to use the new algorithm without changing your settings.
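
As a rough illustration of what comparing a file to a table involves, the following sketch diffs a CSV against a SQLite table using only the Python standard library. The CSV content, the payments table, and the normalize helper are all hypothetical, and this is not Datafold's API. The point is that values must be cast to common types before comparison, since everything read from a CSV arrives as a string.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be read from a file.
csv_text = "id,name,amount\n1,alice,10.0\n2,bob,25.0\n4,dave,40.0\n"

# Hypothetical table in an in-memory SQLite database standing in for a real source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, name TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)],
)


def normalize(record):
    """Cast values to common types so the CSV string "10.0" equals the REAL 10.0."""
    return (int(record[0]), str(record[1]), float(record[2]))


reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
file_rows = {int(rec[0]): normalize(rec) for rec in reader}

cursor = conn.execute("SELECT id, name, amount FROM payments")
table_rows = {rec[0]: normalize(rec) for rec in cursor}
conn.close()

# Compare the two sides by primary key.
only_in_file = sorted(file_rows.keys() - table_rows.keys())
only_in_table = sorted(table_rows.keys() - file_rows.keys())
changed = sorted(
    k for k in file_rows.keys() & table_rows.keys() if file_rows[k] != table_rows[k]
)
```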

Built for scale

This isn't just about simplification—it's about handling the real-world complexity of modern data infrastructure. Teams are diffing between increasingly diverse systems with larger datasets, and need results they can trust.

The unified algorithm makes diffing easy wherever your data lives and regardless of its shape or size. And that's exactly how it should be.

As always, Happy Diffing.
