data-diff and MotherDuck: Better Together

We’re elated to announce the Datafold + MotherDuck open source integration with data-diff. 

It solves problems you care about everyday:

  • How do I verify our extract and load process match raw/replicated data exactly?
  • How do I verify transformed dev/prod data change exactly as expected before I merge new code?

What does that mean for you and the problems you face everyday as a data engineer? Well let’s walk through a story you live through everyday to see how you can get started for free!

You built a data pipeline that does the following:

  1. Replicate product usage data from Postgres into a database like Snowflake, Databricks, BigQuery, or MotherDuck.
  2. You transform the data in MotherDuck using dbt into something more human-readable and usable for other tools and people to query directly (think: are people using the new feature as expected and increasing usage revenue?).
  3. You make a pull request for a quick and “innocent” change (think: slightly adjust a data type for better revenue reporting).
  4. You merge dbt code into production and let it automatically run with a CI/CD trigger.
  5. Then something fiery happens. Your product/sales team sends you DMs and tags you in channels you didn’t know exist and you’re left scrambling to put out fires without a decent starting point. And your new KPI becomes the number of slack notifications resolved. 

Your story, on the other hand, may look a bit more like this:

The several hidden fires that lurk behind the visible fire

Our teams don’t think this should be normal for you anymore. Let’s see what this story can look and feel like when we tackle this problem with the new open source data-diff integration with MotherDuck!

If you want to follow along with this workflow, check out the repo here: https://github.com/datafold/demo-motherduck.

Step 1: Replication Validation

Just as you did before, replicate product usage data from Postgres into a database like Snowflake, Databricks, BigQuery, or MotherDuck. 

Now, instead of just assuming that the replicated data is correct, you run a `data-diff` check between Postgres and MotherDuck, and see that there are lagging records, so you rerun your replication job to update MotherDuck with the latest records. Rerun `data-diff` and now there are no differences!

Example code can be found here on how to do this: here.
A sample data-diff output showing zero diffs...fast!

Step 2: Transformation Validation

You then transform the data in MotherDuck using dbt into something more human-readable and usable for other tools and people to query directly. Instead of fiddling with query previews across dev and prod versions of your SQL code, you can now run data-diff against dev and prod to validate that the changes are looking as expected.

Get started with open source data-diff today here!
The --dbt data-diff extension makes diffing dbt models in MotherDuck fast, easy, & clear

You merge dbt code into production and let it automatically run with a CI/CD trigger. Then something delightful happens: nothing. People have this visceral trust that the "numbers look good." When people doubt your work, you have the easiest audit trail possible: data-diff. And you get to move on with your life!

Data Diffing, a simple mechanic, automates away a lot of boring toil when building data pipelines across the ELT lifecycle. Motherduck makes it easier to evolve DuckDB into a multiplayer experience. We make sure you have a fast, multiplayer experience without bad data. Here’s to doing more work you love as a result.

Cheers!

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.

Datafold is the fastest way to test dbt code changes