September 9, 2025
Data quality best practices, data testing

Data Diff gets faster and simpler: One algorithm, better performance

Nick Carchedi

The problem with choice

For the past year, Datafold offered two algorithms for cross-database comparisons:

  • Hashdiff: Pushes computation down to the individual databases
  • In-memory: Reads the data into a central location and diffs it there

This created an unnecessary burden. Users had to understand technical trade-offs—which approach would perform best for their specific databases, datasets, and data types—when they simply wanted to validate their data.
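
To make the in-memory idea concrete, here is a minimal sketch in Python. It is an illustration only, not Datafold's implementation: two in-memory SQLite databases stand in for two separate systems, and the `users` table, the `fetch_rows` and `in_memory_diff` helpers, and the sample data are all hypothetical. The point is the shape of the approach: read both sides into one place, then compare them by primary key.

```python
import sqlite3


def fetch_rows(conn, table, key):
    """Read every row of `table` into a dict keyed by the primary-key column."""
    cur = conn.execute(f"SELECT * FROM {table}")
    cols = [d[0] for d in cur.description]
    key_idx = cols.index(key)
    return {row[key_idx]: dict(zip(cols, row)) for row in cur}


def in_memory_diff(source_rows, target_rows):
    """Compare two keyed row sets that have already been loaded into memory."""
    src_keys, tgt_keys = set(source_rows), set(target_rows)
    return {
        "only_in_source": sorted(src_keys - tgt_keys),
        "only_in_target": sorted(tgt_keys - src_keys),
        "changed": sorted(
            k for k in src_keys & tgt_keys if source_rows[k] != target_rows[k]
        ),
    }


def make_db(rows):
    """Hypothetical stand-in for a real database connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


source = make_db([(1, "a@example.com"), (2, "b@example.com"), (3, "c@example.com")])
target = make_db([(1, "a@example.com"), (2, "B@example.com"), (4, "d@example.com")])

result = in_memory_diff(
    fetch_rows(source, "users", "id"),
    fetch_rows(target, "users", "id"),
)
print(result)
```

Because the comparison happens outside the databases, the same logic works no matter which engines sit on either side. A pushdown approach like hashdiff depends instead on each database computing compatible results on its own.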

One algorithm, optimized for everything

After analyzing hundreds of real-world deployments, we've deprecated hashdiff entirely. Our new unified approach uses in-memory diffing for all cross-database comparisons.

Here's why this is better:

  • Diff any data source: Supports all relational data sources equally well, including both analytical and transactional databases. If we don't support a source you need, we can generally add it quickly—just ask.
  • Diff files: Seamlessly compare CSV, Excel, and Parquet files to other files, or even to relational data like tables and views.
  • Handle any data volume: Efficiently diffs datasets of up to 10M rows. For larger datasets, we provide sophisticated filtering and sampling tools to balance accuracy with performance.
  • Consistent data type support: Supports diffing virtually any data type found in relational databases, including text, float, JSON, and more.
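
For a sense of how sampling can trade accuracy for speed on large datasets, here is a rough sketch of one common technique: deterministic sampling by hashing the primary key. This post doesn't describe how Datafold's sampling works, so treat this as an assumption-laden illustration. The `in_sample` and `sample_rows` helpers and the generated data are hypothetical.

```python
import hashlib


def in_sample(key, rate):
    """Return True if `key` falls in the sample.

    The decision depends only on the key, so both sides of a diff
    select the same rows without coordinating with each other.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def sample_rows(rows, rate):
    """Keep only the rows whose key falls in the sample."""
    return {k: v for k, v in rows.items() if in_sample(k, rate)}


# Hypothetical data: same keys on both sides, a few changed values in target.
source = {i: ("row", i) for i in range(10_000)}
target = {i: ("changed" if i % 1000 == 0 else "row", i) for i in range(10_000)}

sampled_source = sample_rows(source, 0.10)
sampled_target = sample_rows(target, 0.10)

# Both samples contain exactly the same keys, so they can be diffed directly.
print(len(sampled_source), set(sampled_source) == set(sampled_target))
```

Hashing the key, rather than picking rows at random on each side, is what keeps the two samples aligned. The trade-off is that a difference in a row outside the sample goes undetected, which is why sampling balances accuracy against performance rather than replacing a full diff.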

Setting up new cross-database comparisons is now straightforward: just connect your sources and run the diff. No algorithm selection, performance tuning, or technical decisions to research. Plus, we've automatically upgraded existing diff monitors to use the new algorithm without changing your settings.

Built for scale

This isn't just about simplification—it's about handling the real-world complexity of modern data infrastructure. Teams are diffing between increasingly diverse systems with larger datasets, and need results they can trust.

The unified algorithm makes diffing easy wherever your data lives and regardless of its shape or size. And that's exactly how it should be.

As always, happy diffing.
