Changelog

20240221
February 21, 2024

February Product Newsletter — Impact analysis that matters

February is usually the month people forget about—that 28-or-29-day drop-off after the high of the new year. But at Datafold, this February was an extraordinary one; one filled with exciting new energy (energy induced by not only love-themed confections, but our love of data quality and automated data testing) and product updates for you.

Here’s an overview of what’s new:

1️⃣ Downstream Impact Tab of Data Diff results

2️⃣ Datafold is now available on Azure marketplace

3️⃣ MySQL 🐬 support for cross-database diffing and data reconciliation

4️⃣ Coming soon: Replication testing 👀

And make sure to join us at our next Datafold Demo Day on February 28th! Our team of data engineering experts will walk through some of these newest product updates, and demonstrate how Datafold is elevating the data quality game.

🌍 Downstream Impact Tab

We’ve been there: You’ve opened a PR for your dbt project with some code changes you just know are going to impact your head of finance’s core reports. You don’t necessarily know how (or even why), but your domain experience (and gut) tell you to merge with extreme caution.

Now, with the Datafold Downstream Impact tab, your fear of merging (and breaking) is removed. In one singular view, understand all potentially modified downstream impacts of a code change—from your downstream dbt models to that one dashboard your CFO is refreshing every 10 minutes.

The Downstream Impact tab leverages Column-Level Lineage so that (potentially very many) table-level downstreams are purposefully excluded when the specific columns feeding them are unchanged in the PR. Talk about less noise, more signal, am I right?💡
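To make the column-level filtering concrete, here is a minimal sketch of the idea, not Datafold's implementation. The `LINEAGE` mapping, the asset names, and the `downstream_impact` function are all hypothetical. It walks only the lineage edges that start at a changed column, so downstreams fed by untouched columns never show up, and it records the dependency depth at which each asset is first reached.

```python
from collections import deque

# Hypothetical column-level lineage: each key is an upstream "asset.column",
# each value lists the downstream columns (or BI fields) that read from it.
LINEAGE = {
    "orders.amount": ["revenue_daily.total", "finance_dashboard.revenue"],
    "orders.status": ["order_status_report.status"],
    "revenue_daily.total": ["cfo_dashboard.revenue_chart"],
}

def downstream_impact(changed_columns, lineage):
    """Return {asset_name: depth} for assets reachable from the changed columns."""
    impacted = {}
    seen = set(changed_columns)
    queue = deque((column, 0) for column in changed_columns)
    while queue:
        column, depth = queue.popleft()
        for child in lineage.get(column, []):
            if child in seen:
                continue
            seen.add(child)
            asset = child.split(".")[0]
            # Breadth-first order means the first visit is the shallowest depth.
            impacted.setdefault(asset, depth + 1)
            queue.append((child, depth + 1))
    return impacted

# Only orders.amount changed, so order_status_report is left out.
print(downstream_impact(["orders.amount"], LINEAGE))
```

Sorting the returned dictionary by value gives a "closest dependencies first" ordering.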

Quickly search and sort by dependency depth, type, and name, so you never have to experience a bad dbt deploy again.

The Downstream Impact Tab will populate for any data diff in Datafold Cloud triggered by a CI job, manual data diff run, or an API call.

The Downstream Impact tab in Datafold for faster impact analysis

🛒 Azure marketplace listing

We’re excited to announce that Datafold Cloud is now available on the Azure marketplace. Data teams looking to automate testing for their dbt projects, migrations, and ongoing data reconciliation efforts can now pay for Datafold with their pre-committed Azure spend, making data quality testing more accessible than ever.

🐬 New MySQL integration

Datafold Cloud is proud to support a new integration for MySQL, so you can leverage Datafold’s fast cross-database diffing to validate parity for migrations or ongoing replication between MySQL and our 13+ other database integrations.

A data diff result between a DIM_ORGS model in MySQL and Snowflake

👀 Coming soon: Data replication testing 👀

Perhaps the thing I am most excited to share with you all over the coming months: Datafold’s approach to monitoring and testing an often overlooked (but critical) part of the stack, data replication pipelines.

We know how important it is for the data you’re replicating across databases to be right. We know this data is often mission-critical—powering core analytics work and machine learning models, and guaranteeing data reliability and accessibility. We recognize that broken replication pipelines and consequential data quality issues have been a persistent, unsolved pain for data engineers.

Datafold’s solution to validating ongoing source-to-target replication is going to continue what we do best: data diffing…but pairing it with net new scheduling and alerting functionality from our end.

If your team is interested in gaining transparency into your replication efforts, feel free to email kira@datafold.com to be included in our beta waitlist.

📖 Resources

Check out some new great blogs we’ve seen floating around in the data ether!

Happy diffing!

-Kira, Product Marketing Manager

20240125
January 25, 2024

Datafold January Product Newsletter — New Year, New Integrations 🎉

The Datafold team has kicked off the new year with some exciting new product updates. Here’s an overview of what’s new:

  • 1️⃣ Azure DevOps + Bitbucket integrations
  • 2️⃣ Tabular lineage view
  • 3️⃣ Diff metadata now visible in diff UI
  • 4️⃣ Support for Tableau Server
  • 5️⃣ Coming soon: Data diff Columns tab

P.S. Make sure to join us at our next Virtual Hands-On Lab on February 8th! Our team of data engineering experts will walk through how to build your own CI pipeline for your dbt project. If you’ve wanted to have a CI process, but didn’t know where to start, we’ve got you covered 🙂.

🆕 Azure DevOps + Bitbucket integrations

Datafold Cloud now supports code repository integrations with Azure DevOps (ADO) and Bitbucket. Similar to our GitHub and GitLab integrations, upon a PR open in ADO or Bitbucket, Datafold will automatically add a comment providing an overview of the data diff between your branch and production tables and identify potential impact on downstream tables and data apps.

🗺️ Lineage at scale: Introducing Tabular Lineage view

We get it: When your DAG contains thousands of models and downstream BI assets, it can be hard to wade through it in a graphical format. (Spaghetti lines, who?)

We’re excited to share that Datafold Cloud now supports a Tabular Lineage view, so you can filter, sort, and explore lineage in a columnar format (the way data people usually like to, well, interact with data 😂).

Column-level lineage is now viewable and explorable in a columnar format

🤑 More data, less problems: Diff metadata now visible in diff UI

Metadata about diffs (diff start/end time, creator, and runtime) is now visible within a diff result in Datafold Cloud. Using these easily accessible data points, you can immediately see who to ask about a specific diff or dig into diff performance.

Diff metadata now quickly visible in diff result (see red box)

🤖 Now supported: Tableau Server

Datafold now supports integrating Tableau Server-hosted assets in column-level lineage and within the Datafold CI impact analysis comment, so you can:

  • Understand how your data works its way from source —> workbook
  • Prevent breaking data changes to your core Tableau assets

Datafold's integration with Tableau also works with Tableau Cloud.

A preview of Datafold's column-level lineage integrated with Snowflake, dbt, and Tableau Server assets

🔜 Coming soon: Data diff Columns tab

We've heard loud and clear that users want to see the information they need at a glance: one summary, no clicking around. In particular, we want you to see which columns are different (and by how much), and which are the same. All of that will soon be available in one place: the Columns tab of a data diff's results.

You can clearly see the differences and similarities between the two versions of the table being diffed. Not overly general results; not too much detail (though you can get into the weeds in the Values tab). The Columns tab is like the bowl of porridge Goldilocks finds on her walk through the forest: just right. This feature will be rolled out to customers over the next couple of weeks.

Please reach out if you want to be an early user, or with any feedback!

A preview of the new Columns tab which displays differences (or lack thereof) of columns in a quick overview

Happy diffing!

20231220
December 20, 2023

December Product Updates — Azure, better diff UX, migrations toolkit, and more!

Just because it's the end of the year doesn't mean the Datafold Team is slowing down! There have been a lot of great product updates lately, so let's jump right into them:

  • 1️⃣ Support for Azure deployments in Datafold Cloud
  • 2️⃣ ICYMI: The 3-in-1 migrations toolkit from Datafold
  • 3️⃣ Column remapping for cross-database diffs
  • 4️⃣ NEW: Delete diffs and set a data retention policy
  • 5️⃣ Tableau workbooks are now visible in Datafold lineage and CI impact analysis report

P.S. Make sure to join us at our next Virtual Hands-On Lab on January 11th! Our team of data engineering experts will walk through how to use open source data-diff and Datafold Cloud to test your data during development and deployment.

👋 Hello, Azure

Datafold Cloud now supports deployment options in Azure, so you can run your data diffs wherever you see fit. As a reminder, Datafold Cloud also supports single-tenant deployment options in Google Cloud and AWS.

📦 The 3-in-1 product toolkit for accelerated migrations

At Datafold, we think data migrations shouldn't suck. Which is why we support a 3-part product experience to plan, translate, and validate your migration with speed. Using Datafold, you can use column-level lineage to identify assets to migrate and deprecate, our SQL translator to move scripts over from one SQL dialect to another, and cross-database diffing to validate migration efforts, at any scale.

✨ Better diff UX!

Smarter diff with partially null primary keys

Previously in Datafold, composite primary keys with a null column would be identified as a null primary key. Now, you can set a composite primary key that includes a column that can sometimes be null in Datafold Cloud. Talk about a small but mighty quality of life improvement for those more complex tables!
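As a rough illustration of why a partially null composite key can still identify a row, here is a small sketch. It is not how Datafold handles keys internally. The column names and the rule that a key is only rejected when every part is null are assumptions made for this example.

```python
def composite_key(row, key_columns):
    """Build a tuple key; None is kept as a legitimate key component."""
    return tuple(row[column] for column in key_columns)

def index_rows(rows, key_columns):
    """Index rows by composite key, rejecting only keys that are entirely null."""
    indexed = {}
    for row in rows:
        key = composite_key(row, key_columns)
        # Assumption for this sketch: a key is unusable only if every part is null.
        if all(part is None for part in key):
            raise ValueError(f"row has an entirely null key: {row}")
        indexed[key] = row
    return indexed

rows = [
    {"org_id": 1, "region": "EU", "seats": 10},
    {"org_id": 1, "region": None, "seats": 3},  # region is sometimes null
]
indexed = index_rows(rows, ["org_id", "region"])
print(sorted(indexed, key=str))
```

Both rows stay distinct because `(1, "EU")` and `(1, None)` are different tuples.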

Column remapping in cross-database diffing

If you’re diffing across databases, Datafold Cloud can now diff tables that have changed column names with user-provided mapping. For example, you can now indicate that ORG_ID in Oracle is ORGID in Snowflake so that Datafold does not interpret them as different columns.
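Conceptually, a column mapping renames the source columns before any comparison happens. The sketch below shows that idea with plain dictionaries. The `remap_columns` helper and the sample rows are hypothetical, and this is not the Datafold configuration format.

```python
def remap_columns(row, mapping):
    """Rename source columns to their target names; unmapped columns pass through."""
    return {mapping.get(name, name): value for name, value in row.items()}

# User-provided mapping from the source (Oracle) name to the target (Snowflake) name.
COLUMN_MAPPING = {"ORG_ID": "ORGID"}

oracle_row = {"ORG_ID": 42, "ORG_NAME": "Acme"}
snowflake_row = {"ORGID": 42, "ORG_NAME": "Acme"}

# Without the mapping the column sets differ; with it the rows compare equal.
print(set(oracle_row) == set(snowflake_row))
print(remap_columns(oracle_row, COLUMN_MAPPING) == snowflake_row)
```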

🔒 More flexible deletion and retention policies

Now in Datafold Cloud, you can easily delete diffs and create custom retention policies for your diffs. In addition to deleting individual diffs, you can configure Datafold to automatically delete all diffs older than X days. What does this mean for you? Greater control of your data and (more importantly), keeping your legal and security teams happy 😉.
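The "older than X days" rule boils down to a cutoff timestamp. Here is a minimal sketch of that logic, assuming each diff record carries a `created_at` timestamp. The record shape and the `expired_diffs` function are hypothetical, not a Datafold API.

```python
from datetime import datetime, timedelta, timezone

def expired_diffs(diffs, retention_days, now=None):
    """Return the diffs created more than `retention_days` days before `now`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [diff for diff in diffs if diff["created_at"] < cutoff]

# Hypothetical diff records; only the creation timestamp matters here.
now = datetime(2023, 12, 20, tzinfo=timezone.utc)
diffs = [
    {"id": 101, "created_at": datetime(2023, 9, 1, tzinfo=timezone.utc)},
    {"id": 102, "created_at": datetime(2023, 12, 15, tzinfo=timezone.utc)},
]

# With a 30-day policy, only the September diff is past the cutoff.
print([diff["id"] for diff in expired_diffs(diffs, retention_days=30, now=now)])
```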

📖 Workbooks now supported in Datafold Cloud Tableau integration

Tableau workbooks are now visible in Datafold Cloud column-level lineage and the CI impact analysis report! If your team is struggling with the noise of Sheets in lineage or the Datafold CI comment, make sure to check this out.

And that's a wrap for 2023—see everyone in the new year!

20231205
December 5, 2023

Product Launch: Datafold for Migrations

Datafold is slinging updates to support data migrations. Data migrations tend to be high stakes, mission critical, and downright gnarly.

With cross-database diffing for data reconciliation, SQL translation, and column-level lineage, the daunting endeavor of data migration can be a success instead of over budget, delayed, and never quite complete.

Cross-Database Diffing between Legacy and New Databases

Diffing between databases is critical to ensuring consistency between old and new data. Datafold has been shipping new database connectors at a rapid pace. Critically, Datafold Cloud users can now diff between different databases at scale (we’re talking billions of rows).

SQL Translator

With Datafold’s SQL Translator, you can efficiently and accurately convert SQL from the old dialect to the new:

It’s like Google Translate, but for your SQL.

Datafold’s SQL Translator can be used to translate thousands of lines of legacy code (such as stored procedures, DDL, and DML) into the dialect of your new data system. Oh, you can also use it for quick syntax checks as you write ad hoc queries.

Putting it all together

These new capabilities add to Datafold’s existing suite of tools, including our Column-Level Lineage graph, which can be used to identify what to migrate.

Datafold's column-level lineage of a Redshift database and the downstream data app assets

All of this significantly increases your chance of gaining stakeholder sign-off within budget and timeline, and ultimately, unplugging the old data system—the true sign of a successful migration.

20231113
November 13, 2023

Product Launch: Downstream Tableau Assets Now Accessible in Pull Request and Lineage

We’re excited to announce the new Tableau integration for Datafold Cloud that shows users the Tableau Data Sources, Sheets, and Dashboards that could be impacted by your dbt models.

These Tableau assets will be visible in the Column-Level Lineage explorer in Datafold Cloud…

…as well as right within your pull request:

So your team has complete visibility into the Tableau assets that will be potentially changed with your code updates.

With the Tableau Integration for Datafold Cloud, users can now have a robust look at how their data travels through their stack, and prevent data quality issues from entering one of the most important tools of their business.

FAQ

What about dbt Exposures?

dbt Exposures require manual configuration, which is not scalable or automated. With Datafold Cloud’s Tableau Integration, your column-level lineage and impact analysis just work out of the box.

Is this only for dbt models?

Nope — Tableau assets that are downstream of any data warehouse object will appear in Datafold Cloud Column Level Lineage.

What about Tableau's native lineage?

Tableau's cataloging and lineage are built directly within Tableau. With the Datafold Cloud Tableau Integration, we support lineage and impact analysis within Tableau, as well as across your data warehouse, dbt project, or other transformation code, so you have a full view into how your data goes from source to BI dashboard.

What about other BI tools?

Datafold Cloud currently supports BI tool integrations with Looker, Tableau, and Mode. If you’re interested in another integration, please reach out to us at support@datafold.com.

I’m convinced—how can I get started today?

To get started with the Datafold Cloud Tableau integration, please reach out to a Datafold Solutions Engineer who can get you set up.

20231030
October 30, 2023

Datafold Changelog — October 2023

What a month it’s been! ICYMI, the Datafold team has been on our world tour of data conferences, meeting folks like yourself in person and learning about your data pain points. It’s been so great to hear about the exciting (and challenging) data quality work and projects your organizations have been undertaking.

Datafold team at Coalesce San Diego!

On the product front, the Datafold team has been hard at work improving your experiences with data diffing, Datafold Cloud, and new product innovations. Here’s an overview of what’s new:

1️⃣ Microsoft SQL Server and Oracle support in Datafold Cloud

2️⃣ Cyclic dependency identifier

3️⃣ Auto-type matching

4️⃣ New and improved ✨ Datafold docs ✨

5️⃣ E X C I T I N G new product betas 👀

🏘️ You asked for more database connectors? We said, “How many?”

Don’t let your data stack prevent you from high quality data. Leverage Datafold Cloud’s new connectors with both Microsoft SQL Server and Oracle to data diff where you need it.

🔀 Improved UX when cyclic dependencies appear in Datafold Lineage

🤚 Raise your hand if you’ve ever created a cyclic dependency 🙈? Now, when you’ve created this data modeling no-no, Datafold Cloud will alert you of the cyclical loop as well as identify the impacted dependencies in that loop. This can help your team quickly identify any bad practices or incorrectly modeled data.
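For anyone curious how a cycle like this can be found, here is a generic depth-first search sketch over a model dependency graph. It is a textbook technique, not a description of Datafold's detection logic, and the model names are made up.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated at the end), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for child in graph.get(node, []):
            state = color.get(child, WHITE)
            if state == GRAY:
                # The child is already on the current path, so we closed a loop.
                return path[path.index(child):] + [child]
            if state == WHITE:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# Hypothetical DAG gone wrong: fct_orders and dim_customers depend on each other.
models = {
    "stg_orders": ["fct_orders"],
    "fct_orders": ["dim_customers"],
    "dim_customers": ["fct_orders"],
}
print(find_cycle(models))
```

The recursive form is fine for a sketch; a graph with many thousands of nodes would want an explicit stack instead.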

✨ New and improved Datafold Docs

The Datafold docs have been given a facelift! Our new docs are easily searchable and organized by use case so you can get the most out of Datafold Cloud.

New Datafold docs design

⚡ Automatic type matching

Now, if there are two columns in the tables being diffed with the same column name, but with differing types of one of the following...

  • int <-> string
  • decimal <-> string
  • int <-> decimal

...Datafold will automatically cast and compare — no more unhelpful type mismatches. This means you get actual useful diff results instead of a generic "type mismatch" error. Datafold is all about diffing, and we don’t want type mismatches to get in your way!
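Here is a small sketch of what "cast and compare" can look like for the three pairs above. The casting rule used here (compare numerically when both sides parse as numbers, otherwise compare string forms) is an assumption for illustration; the exact rules Datafold applies are not spelled out in this post.

```python
from decimal import Decimal, InvalidOperation

def values_match(left, right):
    """Compare two values, casting across int / decimal / string pairs first."""
    if type(left) is type(right):
        return left == right
    try:
        # Assumption: when both sides parse as numbers, compare them numerically.
        return Decimal(str(left)) == Decimal(str(right))
    except InvalidOperation:
        # Not numeric on both sides: fall back to comparing string forms.
        return str(left) == str(right)

print(values_match(42, "42"))               # int <-> string
print(values_match(Decimal("1.50"), "1.5")) # decimal <-> string
print(values_match(7, Decimal("7.0")))      # int <-> decimal
print(values_match(1, "abc"))               # genuinely different values
```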

👀 Just around the corner: Exciting new product launches!

The Datafold team is excited for fall and winter. And not because of the plethora of holidays, but because of the insanely exciting new product launches that are coming to your Datafold instance very soon:

  • 📈 Tableau BI integration in column-level lineage and impact analysis
  • ⚔️ Cross-database diffing in Datafold Cloud for accelerated migrations and replication validation. Take a sneak peek of this new feature here.
  • 🪣 Bitbucket git support
  • …and more?!?!

If you’re interested in receiving early access for any of these new features, please don’t hesitate to email me with your name, organization, and product feature interest list.

Happy diffing!

20230920
September 20, 2023

Datafold Product Newsletter September 2023 — 🍂 Falling into data conferences 🍂

As we say goodbye to the last few days of summer, we start saying hello to #conferenceseason 🎃. This fall, you can expect to see the Datafold team IRL at 3️⃣ upcoming conferences!

On the product front, the Datafold team has been hard at work improving your experiences with data diffing, Datafold Cloud, and new product innovations. Here’s an overview of what’s new:

  • 1️⃣ Datafold Cloud Looker Integration
  • 2️⃣ Automatic primary key inference for incremental and snapshot dbt models
  • 3️⃣ Less noise, more signal in the enhanced Datafold Cloud CI printout

🐦❤️👁️ Datafold Cloud Looker Integration

ICYMI we launched the Datafold Cloud Looker Integration: bringing enhanced lineage and impact analysis to your dbt project and beyond. Using the Looker integration, you can:

  • Visualize Looker assets (Explores, Views, Dashboards, and Looks) in Datafold’s column-level lineage
  • See potentially impacted Looker assets from your dbt code change in the Datafold CI comment

The Datafold Cloud Looker integration shows potentially impacted Looker assets in the Datafold Cloud CI printout and in column-level lineage

Yes, we think this is some very cool tech (what can we say, we’re a bit biased 😂). But more importantly, we think that this means you stop getting those “you broke my dashboard” DMs 😉.

⚡Automatic primary key inference for incremental and snapshot models

Previously, Datafold Cloud identified primary keys from an additional YAML config or the dbt uniqueness test. Now, when you define a unique_key in your dbt model config, Datafold Cloud will automatically infer that it is the primary key to be used for Datafold’s diffing. Unique keys defined in dbt can be either single or composite keys. This is particularly useful for more complex incremental and snapshot models, where you may want to define a unique key, but not test uniqueness in dbt.
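Because a dbt unique_key can be a single column name or a list of columns, any tool reading it has to normalize both shapes. The sketch below shows that normalization on a plain dictionary. The `infer_primary_key` helper and the sample configs are hypothetical and are not Datafold's code.

```python
def infer_primary_key(model_config):
    """Return the primary key columns as a list, or [] if no unique_key is set."""
    unique_key = model_config.get("unique_key")
    if unique_key is None:
        return []
    if isinstance(unique_key, str):
        # A single-column key is written as a plain string in the model config.
        return [unique_key]
    # A composite key is written as a list of column names.
    return list(unique_key)

incremental = {"materialized": "incremental", "unique_key": "order_id"}
snapshot = {"materialized": "snapshot", "unique_key": ["org_id", "valid_from"]}

print(infer_primary_key(incremental))
print(infer_primary_key(snapshot))
```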

Datafold Cloud will automatically infer primary keys configured as unique_keys in dbt

🔊 Enhanced Datafold Cloud CI printout: Goodbye noise, hello signal

Datafold CI comment will soon highlight which values are different between dev and prod by pulling them to the top of the comment. This will reduce alert fatigue and make it much easier to see whether your code changes will change the data (and how), or keep it the same.

Rows, columns and PKs that are not different will be grouped together under the NO DIFFERENCES dropdown.

Please note that this feature is currently being rolled out to existing customers over the next few weeks.

New collapsible section of NO DIFFERENCES for columns, rows, and PKs with no differences between tables

👋 Come see the Datafold team IRL at upcoming conferences!

We’d love to hear about your data quality pain points and wins face-to-face at some upcoming data conferences. The Datafold team will be present at the following:

  • 🇬🇧 Big Data London (Sept. 20-21) — get your FREE tickets here! Our team will be at booth #552 ready to talk all things data quality, UK football, and best pubs in London 😉
  • 🏖️ Coalesce Conference San Diego (Oct. 16-19) — tickets can be found here. We’ll be at booth #108 offering tips on data quality and some cool refreshments 🍹 And don’t forget to join us at our Coalesce After Party with Hightouch, Airbyte, Databricks, Secoda, and Hex 🥳!
  • 🎡 …and we couldn’t get enough of London, so we’ll be back again in the UK for Coalesce London (Oct. 17). Get your tickets here.

Oh, and did we mention we’ll have some fun swag there for folks who come by and say hi 😎 We can’t wait to see you there!

20230829
August 29, 2023

Downstream Looker assets in Pull Requests and Lineage

When you make a change to your dbt project, how do you make sure Looker Views, Explores, Looks, and Dashboards don’t unexpectedly change—breaking data pipelines, business processes, and stakeholder trust?

🙁 Opening many tabs and fiddling with dashboards

😭 CTRL+F-ing your LookML repository

🗯️ Asking teammates on Slack

🤔 🤷‍♀️ 🥲

Starting today, your answer to that can be simple: “Datafold.”

We’ve launched a Looker integration that shows Datafold Cloud users the Looker Views, Explores, Looks, and Dashboards that could be impacted by your dbt models.

These Looker assets will be visible in Column-Level Lineage in the Datafold Cloud UI …

… as well as right within your pull request:

What about dbt Exposures? What about Spectacles?

  • dbt Exposures require manual configuration, which is not scalable or automated. Datafold Cloud’s Looker integration Just Works™️.
  • Spectacles in CI will tell you if your LookML is broken, but not if the data changed. This is like a dbt build in CI which is successful, but the data is wrong. Datafold and Spectacles work great as side-by-side partners to ensure you’re only allowing the highest quality data into your BI tool.

Is this only for dbt models?

Nope — Looker assets that are downstream of any data warehouse object will appear in Datafold Cloud Column Level Lineage.

Enough! How can I get started today?

To get started with the Datafold Cloud Looker integration, please reach out to a Datafold Solutions Engineer who can get you set up. You can also check out our docs 📚 to see how simple it is to begin.

20230822
August 22, 2023

VS Code extension, improved Datafold Cloud CI, and upcoming launches

💡 Did you know that Datafold Cloud’s column-level lineage includes assets outside of your dbt project? Because Datafold’s column-level lineage is built based on your data warehouse’s query history (and not your dbt project’s manifest.json), you can have a full view of how your data moves its way through your ecosystem, including through all those dbt models, ad hoc tables built by analysts, and BI tool assets.

Now back to our scheduled program… The Datafold team has been hard at work improving your experiences with data diffing, Datafold Cloud, and new product innovations. Here’s an overview of what’s new:

1️⃣ Datafold VS Code Extension

2️⃣ Quality of life product improvements in Datafold Cloud (intuitive column remapping in CI and saying goodbye 👋 to stuck CI)

3️⃣ Some very exciting product launches on the horizon (hint: BI tool integrations in Datafold Cloud)

🚀 Datafold VS Code extension

ICYMI we launched the Datafold VS Code extension: a powerful new developer tool bringing data diffing directly to your dev environment. Use the Datafold VS Code extension to quickly run and diff dbt models in a clean GUI, and develop dbt models with confidence and speed.

In addition, by installing the Datafold VS Code extension, you’ll receive free 30-day trial access to value-level differences—a Datafold Cloud exclusive (❗) feature. Join us in the #tools-datafold channel in the dbt Community Slack for feedback and any questions about this 🙂.

☁️ Datafold Cloud improvements

Column remapping in CI comments

If your PR includes updates to column names, you can specify these updates in your git commit message using the following syntax: datafold remap old_column_name new_column_name. That way, Datafold will understand that the renamed column should be compared to the column in the production data with the original name 🙏.

By specifying column remapping in the commit message, when you rename a column, instead of thinking one column has been removed, and another has been added, Datafold will recognize that the column has been renamed.

In the example above, the column sub_plan is renamed to plan, and Datafold recognizes these are the same column with this commit message. This feature is particularly useful if there are changes to upstream data sources that impact many downstream models.
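To show the shape of the directive, here is a sketch that pulls remap pairs out of a commit message. Only the `datafold remap old_column_name new_column_name` syntax comes from the text above. The regular expression, the `parse_remaps` helper, and the sample message are assumptions, and this is not Datafold's parser.

```python
import re

# Matches "datafold remap <old> <new>" with single spaces between the parts.
REMAP_PATTERN = re.compile(r"datafold remap (\S+) (\S+)")

def parse_remaps(commit_message):
    """Return {old_column_name: new_column_name} for each remap directive found."""
    return dict(REMAP_PATTERN.findall(commit_message))

message = "Rename subscription plan column\n\ndatafold remap sub_plan plan"
print(parse_remaps(message))
```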

Faster, leaner, and smarter Datafold in CI

Datafold is all about giving you the information you need, where and when you need it, as soon as possible. That includes getting out of the way quickly when it's not yet time to data diff. Now, when your dbt PR job does not complete for any reason, Datafold will detect that right away and cancel itself, allowing your CI checks to complete. Everyone loves faster (unstuck) CI!

A Datafold CI run being cancelled upon the dbt PR job skipping/failing

👀 Coming soon - betas and upcoming launches

Keep an eye out for exciting developments on:

  • 📈 Evolved lineage with Looker and Tableau integrations in Datafold Cloud. If your team is interested in seeing the Looker integration live, come join us at an upcoming Datafold Cloud Demo!
  • 🔀 Cross-data warehouse diffing for accelerated database migrations and validating data replication.
  • ...and more!

Happy diffing!

20230801
August 1, 2023

🆕 Announcing the Datafold VS Code Extension

We’ve launched the Datafold VS Code Extension—a new developer experience tool that’s integrating data quality testing, data diffing, and Datafold into your development workflow.

The VS Code extension is an enhancement of the open source data-diff product from Datafold. Using the extension, you can easily install open source data-diff, run your dbt models, and see immediate diffing results between your dev and prod environments in a clean GUI—all within your VS Code IDE.

⬇️ Install the Datafold VS Code Extension

You can install the Datafold VS Code Extension by using the VS Code Extension tab.

💻 Data diff a dbt model using the GUI

Once you’ve followed the simple steps in our documentation to get started, you’ll be able to diff any dbt model or set of models using the simple GUI.

First, open the Datafold Extension by clicking on the Datafold bird icon on the left-hand side of your VS Code window. Then, click on any model's "play" button to run a data diff between the development and production version of that model.

💡 Be sure to dbt build or dbt run any models that you plan to edit or diff, to ensure relevant development data models and dbt artifacts exist.

⚒️ Data diff your most recent dbt run or build

You can also use the “Datafold: Diff latest dbt run results” command in the VS Code command palette. This enables you to automatically diff a group of models that were built in the last dbt build or dbt run.
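If you want a feel for which models such a command would pick up, dbt records the outcome of the last invocation in `target/run_results.json`. The sketch below lists the models that built successfully, relying only on the `results`, `unique_id`, and `status` fields of that artifact. The helper names are hypothetical, and this is not how the extension itself is implemented.

```python
import json

def load_run_results(path="target/run_results.json"):
    """Read the dbt artifact written by the last dbt run or dbt build."""
    with open(path) as handle:
        return json.load(handle)

def models_in_last_run(run_results):
    """Return the unique_ids of models that were built successfully."""
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        if result["unique_id"].startswith("model.")
        and result.get("status") == "success"
    ]

# A trimmed-down, made-up example of the artifact's structure.
sample = {
    "results": [
        {"unique_id": "model.shop.dim_orgs", "status": "success"},
        {"unique_id": "model.shop.fct_orders", "status": "error"},
        {"unique_id": "test.shop.not_null_dim_orgs_org_id", "status": "pass"},
    ]
}
print(models_in_last_run(sample))
```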

🔎 Explore value-level data diff results

By installing the Datafold VS Code extension, you’ll receive free 30-day trial access to value-level differences, a Datafold Cloud exclusive feature (❕). To see and interact with value-level differences, click on the blue "Explore values diff" next to the "Values" section.

👁️ Data diff in real time as you develop with Watch Mode

In the settings of the Datafold VS Code extension, you can enable "Diff watch mode." With watch mode on, the Datafold VS Code Extension will automatically run diffs after each dbt invocation that changes the run_results.json of your dbt project. Turn on this setting if you want diffs to be automatically run between changed dbt models.

🎥 Demo video

Watch Datafold Solutions Engineer Sung Won Chung install and use the Datafold VS Code extension!

📖 Resources

For additional resources, please check out the following:

Happy diffing!

20230701
July 1, 2023

Skip diffs, advanced filters, and a beta Looker integration!

We’re excited to share some new product updates that give you greater control over what gets diffed, let you interact with diffs and their results through advanced filtering, and help you identify how code changes impact your BI tools.

Here’s an overview of what’s new:

  • Skip diffs with commit messages
  • More powerful values filtering
  • Datafold’s new Looker integration pre-release

Skip diff functionality with commit message

We get it—not every commit needs a diff! Now, you can choose to skip a diff generated by a commit by adding this string (datafold-skip-ci) in your commit message. By adding this string anywhere in your commit message, your commit will not trigger a Datafold CI run.

This feature is particularly useful if you’re adding in hotfix commits, committing many commits back-to-back in a short timeframe, or looking to reduce compute costs from unnecessary diff runs.
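The rule itself is a simple substring check. As a tiny sketch (the `should_skip_diff` helper is hypothetical, and only the `datafold-skip-ci` string comes from the text above):

```python
SKIP_MARKER = "datafold-skip-ci"

def should_skip_diff(commit_message):
    """Return True when the marker appears anywhere in the commit message."""
    return SKIP_MARKER in commit_message

print(should_skip_diff("Hotfix: correct currency rounding (datafold-skip-ci)"))
print(should_skip_diff("Add revenue model"))
```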

Skip diff CI runs by adding datafold-skip-ci in your commit message

Negative filtering in search

Never has filtering been more intuitive! We recently added functionality for negative search in Lineage data explorer. With negative search, you add a dash (-) before the term to exclude, which makes it easier to filter on specific schema patterns than deselecting everything that doesn’t meet your criteria. We’ve additionally added support for * and ? wildcards, where * matches any number of characters and ? matches any single character.

Examples:

  • ORG_ACTIVITY -DEV will match any asset name that contains ORG_ACTIVITY and does not include the string DEV
  • RUDDERSTACK*MARKETING will match any asset name that contains RUDDERSTACK followed by MARKETING at any point in the string
  • PR_???_ will match any asset name that contains PR_ followed by any 3 characters and _. For example, PR_???_ will return PR_123_ and exclude PR_12_ from search results

Use negative filters to quickly find assets in Lineage explorer
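The three examples above can be reproduced with a short matcher. This is a sketch of the described behavior, not the search implementation in Datafold. It assumes terms are separated by spaces, matching is case-sensitive, and a term matches when it appears anywhere in the asset name. The asset names are made up.

```python
import re

def term_to_regex(term):
    """Translate a search term into a regex: * is any run, ? is any one character."""
    parts = []
    for char in term:
        if char == "*":
            parts.append(".*")
        elif char == "?":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts))

def matches(query, asset_name):
    """Apply every include term and every dash-prefixed exclude term to the name."""
    include, exclude = [], []
    for term in query.split():
        if term.startswith("-") and len(term) > 1:
            exclude.append(term_to_regex(term[1:]))
        else:
            include.append(term_to_regex(term))
    return (
        all(pattern.search(asset_name) for pattern in include)
        and not any(pattern.search(asset_name) for pattern in exclude)
    )

print(matches("ORG_ACTIVITY -DEV", "ANALYTICS.ORG_ACTIVITY"))
print(matches("ORG_ACTIVITY -DEV", "DEV.ORG_ACTIVITY"))
print(matches("RUDDERSTACK*MARKETING", "RUDDERSTACK_RAW_MARKETING_EVENTS"))
print(matches("PR_???_", "PR_123_ORDERS"))
print(matches("PR_???_", "PR_12_ORDERS"))
```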

💥 More powerful diffs and values filtering

We’ve added new filtering capabilities in your diffs log and values-level diffs to make searching for diffs (and potential errors) faster and easier.

Quickly filter for diffs with differences

Using the Result filter in your log of Data Diffs, simply filter on Different to find only diffs where there were differences.

Quickly filter for diffs with differences using the Different Result

Filter columns in the UX

For Data Diff results with many differing columns, you’re now able to search and filter columns at scale—no more never-ending scrolling to the right to find the column you need!

To use, open the “Show Filters” menu to select and sort across your diff results at scale.

Stop endlessly scrolling to the right! Filter on relevant diff columns using this new filtering ability.

Filter columns by value

For easier value-level diff navigation, you can filter on specific column-values. Simply click on the gray filter symbol to the right of a column name and input the value you want to filter for. For example, look for a diff based on a primary key that’s giving you an issue!

Filter on specific row-level values by using the filter icon in each column

👭 Join our Looker integration beta!

We’re very excited to share a pre-release of Datafold's new Looker integration! If your business uses Looker for reporting, you can enable Looker Views in Datafold’s lineage explorer and see potentially impacted Looker Views in Datafold’s CI comment—bringing impact analysis beyond your dbt project.

If you have any interest in trying out the new Datafold Cloud Looker integration early, please sign-up here.

👀 Coming soon

Keep an eye out for exciting developments on:

  • 💻 Enhanced developer experience with our ✨new✨ VSCode Extension—click here if you would like to be a beta tester 🧑‍🔬
  • 🔀 Cross-data warehouse diffing (pssst if you're interested in trying the alpha for this, please respond to the product newsletter email or email gleb@datafold.com)
  • 📈 More BI tool integrations
  • ...and more!

20230601
June 1, 2023

Data Diff Management + Version Control Integration

We’ve increased the amount of context available from your GitHub & GitLab integrations to the Datafold user interface so you can more clearly understand the relationship between your diffs and specific commits and pull requests.

Filter Data Diffs by the pull request creator

Easily filter Data Diffs by Github or Gitlab username, and trace each diff back to the specific pull request or commit that triggered it.

Data Diffs grouped by pull request

This makes it easier to navigate between your pull requests, commit history, and the associated diffs, tracking changes and validation over time.

Grouped diff deletion and cancellation

You can now select a group of diffs and click the Delete Data Diff or Cancel Data Diff options in the top right section of the page.


Streamlined Data Diff Results

We’ve shortened the feedback loop in our results pages to rapidly show more relevant information. For example, you’ll now see column-level metrics earlier in the CI/CD and command-line data-diff results. We’ll also show downstream app dependencies for each Data Diff within the UI, allowing you to quickly get the appropriate lineage for a given downstream dependency.

20230501
May 1, 2023

data-diff -- dbt

We have released the dbt integration for our open-source data-diff tool. Data-diff helps to quantify the difference between any two tables in your database. You can now see the data impact of dbt code changes directly from your command line interface. No more ad-hoc SQL queries or aimlessly clicking through thousands of rows in spreadsheet dumps.

If you use dbt with Snowflake, BigQuery, Redshift, Postgres, Databricks, or DuckDB, try it out and share your feedback. It only takes a couple of settings and a one-line command to see your data diffs in development:

dbt run --select <model(s)> && data-diff --dbt

This shows the state of your data before and after your proposed code change:

Use this dbt + data-diff integration to quickly ensure your code changes have the intended effect before opening up a pull request.

Try it out today!

We’re excited to hear your feedback via the project’s GitHub project page or the #tools-datafold channel in the dbt Slack community.

20230331
March 31, 2023

Diffing Hightouch models in CI/CD

We’ve all been there: accidentally breaking downstream dependencies that don’t live in your warehouse. How could you possibly have known what another team’s pesky filter was going to be? Well, your business intelligence, marketing & operations teams can sleep more soundly knowing that your data team has full visibility into these types of breaking changes, and can prevent them before they happen.

Ship faster and more confidently after Datafold compiles and materializes Hightouch models based on each branch of a change, and then diffs them to flag any potential changes in the query output. We see teams moving towards faster and actionable data pipelines every day, so confidence in every change is vital to keeping your team humming along.

Diffing Hightouch models will show up alongside your standard diff results in Github comments and within the Datafold app.

To get started, first configure your Hightouch account within Datafold. Then, since this feature is still in beta, opt in here to enable it!

Hightouch Diff in CI/CD Pipeline
20230324
March 24, 2023

CI Jobs Management

It’s now easier to manage CI runs at scale within Datafold, thanks to several navigation improvements:

  1. First, find and filter for your CI job quickly via the Datafold Jobs tab, which is now visible to all users.
  2. Status Page - We’ve added a more detailed CI job status page with a breakdown of individual steps and results
  3. Cancel + Rerun CI Jobs - You can now easily cancel running jobs, or rerun jobs within the Datafold user interface.

All users now have access to the Jobs user interface, where they can open a CI Job results page and view the associated Data Diffs. Each CI Job results page shows the status of all data diffs and intermediate steps, and lets you cancel an active CI Job run. We’re excited to hear your feedback.

Clearer Diff Sampling Logic

Sampling diff results is helpful for speedy and efficient checks of extremely large data sets. However, sometimes you need to ensure 100% test coverage of every single row, even for large data sets. To assist, we’ve added more clarity to when and where data will be sampled during the in-app diff creation workflow.

You can now explicitly disable sampling for a diff. This makes things clearer and easier for users running data migrations, where diffing the entirety of a dataset is required for user acceptance testing.

20230116
January 16, 2023

Introducing Slim Diff in CI/CD

  • Slim Diff helps teams prioritize business-critical models in CI/CD workflows by giving them control over exactly which models to diff on each pull request. When enabled, Slim Diff runs data diffs only for specified models based on dbt metadata, and skips models that aren’t explicitly tagged or are excluded from data diffing.
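As a rough illustration of the selection idea behind Slim Diff, the sketch below keeps only changed models that carry a tag and are not excluded. The `Model` class, the `diff` tag name, and the `excluded` flag are hypothetical stand-ins for dbt metadata, not Datafold’s actual configuration keys.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """Hypothetical stand-in for a dbt model and its metadata."""
    name: str
    tags: set[str] = field(default_factory=set)
    excluded: bool = False  # assumed per-model opt-out flag


def select_models_to_diff(changed: list[Model], diff_tag: str = "diff") -> list[str]:
    """Return names of changed models that are tagged for diffing and not excluded."""
    return [m.name for m in changed if diff_tag in m.tags and not m.excluded]


changed_models = [
    Model("fct_orders", {"diff"}),                 # tagged: diffed
    Model("stg_events"),                           # untagged: skipped
    Model("dim_users", {"diff"}, excluded=True),   # tagged but excluded: skipped
]
selected = select_models_to_diff(changed_models)
print(selected)
```

Only `fct_orders` survives the filter here, which is the point: untagged or excluded models never trigger a diff.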

Column Remapping in Data Diff creation flow

  • Quickly remap columns within the Data Diff UI or API creation flow for known column name changes to ensure all columns are compared correctly.
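To make the idea of column remapping concrete, here is a minimal sketch that renames columns on one side before comparing rows. The function name, column names, and row shapes are assumptions for illustration; the real remapping is configured in the Data Diff UI or API.

```python
def remap_columns(row: dict, column_map: dict[str, str]) -> dict:
    """Rename keys of `row` according to `column_map`; unmapped columns keep their name."""
    return {column_map.get(col, col): value for col, value in row.items()}


# Hypothetical example: `user_name` was renamed to `username` in the new table.
old_row = {"id": 1, "user_name": "ada", "amount": 10}
new_row = {"id": 1, "username": "ada", "amount": 12}
column_map = {"user_name": "username"}

aligned = remap_columns(old_row, column_map)
# Compare column by column once both sides share the same names.
differing = sorted(col for col in aligned if aligned[col] != new_row.get(col))
print(differing)
```

Without the remap, `user_name` and `username` would each look like a column that exists on only one side; with it, only the real value change in `amount` is reported.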

Schema Comparison Sorting

  • Faster schema comparisons to see what changed inline, especially when column order has changed.

Cancel In-Progress Data Diffs

  • You can now quickly cancel currently running diffs from both the Data Diff results page and the administrator interface. As before, you can also cancel all diffs within a CI run from the same administrator interface.

Globally exclude tables from CI/CD diffs

v1.60
December 8, 2022

Lightning-fast in-database comparisons for the data-diff library + DuckDB support

  • Have you ever wanted to quickly and easily get a diff comparison of two tables in your dbt development workflow? Now you can! Our wonderful Solutions Engineers spun up a tutorial on how to use our open-source data-diff library to find potential bugs that unit testing or monitoring would have missed.
  • Additionally, our data-diff community contributors have continued to improve the product - including adding DuckDB support. We appreciate the support @jardayn!
  • The latest release of Datafold’s free, open-source data-diff library is optimized for even faster Data Diffs within the same database. Compare any two tables within a warehouse and receive a detailed breakdown of schema, row and column differences.

Improved Diff Results Sorting and Filtering

  • We’ve added improved sorting and filtering interfaces to the Data Diffs analysis workflow, making it easy to find specific rows within your diff results. For example, if you’re trying to confirm that the values for a particular primary key in your sea of modified data changed exactly as expected, filter for the specific primary key or changed column value you’re looking for.

CSV Export

  • You can now export CSVs of Data Diff results and primary keys that are exclusive to one of the datasets in your comparison! This is perfect for debugging and reconciling missing data between two data sets, and sharing that information across your organization.

  • Don’t forget you can always materialize your Data Diff results to a table in your database and natively join your results to your source data, or do a deeper analysis on those differences. Enabling this setting in the Data Diff creation flow via our API or the Datafold app will create a table in your temporary schema with matched rows, values, and flags for which columns.

Materialize diff results to table is an option within the Data Diff creation workflow in both the Datafold App and our REST API.

v1.50
September 19, 2022

Lineage Usage Metrics

  • Column and Table-level query metrics in Lineage - right-click on any table or column reference within the Datafold Lineage UI to view how many times a particular user account has read or written to a particular table, allowing you to identify commonly or infrequently used data points.
  • Popularity metrics now include all cumulative downstream usage of a column or table, showing the total downstream reads for a particular client.
  • Popularity Filters - Filter lineage nodes by their relative popularity compared to all indexed tables in Lineage

Data Diff Improvements

  • Cancel CI Job button via the Datafold Jobs UI - Admin users are now able to cancel CI/CD diff tasks via the Jobs UI in Admin Settings.
  • Copy Data Diff Configuration JSON to Clipboard - the info button within the diff results page now contains a button to copy the JSON payload required to create a diff via the REST API.
  • Set diff time travel logic at the dbt-model level. For example, if your dev and production tables have known differences due to timing of incremental source data, you can add a time-travel configuration to ignore the most recent data, preventing false positives in CI/CD. Learn more about time travel here and more about dbt metadata configuration here.

Other Improvements

  • Catalog search improvements to weight exact-text matches more aggressively, and hide less relevant results.
  • Datafold CI/CD integration now populates a list of deleted dbt models within the pull request comments.
  • Improve lineage support for dbt-based Hightouch models
v1.42
July 7, 2022

Popularity counters in Lineage

To help you understand how frequently the assets in your warehouse are used, Lineage now displays an absolute access count per table and column for the last 7 days. To help you interpret that information, a popularity rating from 0 to 4 is also assigned, indicating how popular a particular database object is relative to others.
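The changelog does not say how the 0 to 4 rating is derived from access counts. Purely as an illustration of a relative rating, the sketch below scales each count against the busiest object; the linear scaling and the function name are assumptions, not Datafold’s formula.

```python
def popularity_ratings(access_counts: dict[str, int]) -> dict[str, int]:
    """Map absolute access counts to a 0-4 rating relative to the most-used object.

    Assumed scheme: integer-scale each count against the maximum count.
    """
    if not access_counts:
        return {}
    max_count = max(access_counts.values())
    if max_count == 0:
        return {name: 0 for name in access_counts}
    return {name: (4 * count) // max_count for name, count in access_counts.items()}


# Hypothetical 7-day access counts per table.
ratings = popularity_ratings({"orders": 200, "users": 50, "tmp_scratch": 0})
print(ratings)
```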


Other changes

  • For on-premise deployments, we now support data diff in CI for Github on-premise servers. To use your own private Github server instead of the cloud version (https://github.com), set the `GITHUB_SERVER` environment variable to your Github on-prem URL.
  • In the app, the BI Settings section has been renamed to “Data Apps” and now includes both Mode and Hightouch integrations.
  • Performance improvements to lineage.
  • In the Lineage UI, Hightouch models and syncs now link to Hightouch App. This can be configured using the “workspace URL" field in the Hightouch integration settings.
  • Visual improvements to data source names and logos in Catalog and Lineage.
  • Updated display of long names of tables in Lineage.
  • Popularity is now a general filter in Catalog. It can be applied to both tables and columns.
  • Data Source and Data App source filters in Catalog are now merged for better search experience.
  • Users can now add, remove, and query tags for Mode dashboards, Hightouch models, and Hightouch syncs using GraphQL API.
  • Added usage info for tables and columns to GraphQL API.
  • CI configurations can now be paused, preventing them from running checks on pull requests.
  • Added support for BigQuery’s bignumeric and bigdecimal data types.
  • The data source mapper field in the Data Apps create/edit form is now validated after all the data sources are mapped.
  • In the Data App settings, we’ve added direct links to our documentation.
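For the `GITHUB_SERVER` setting mentioned in the list above, a deployment helper might resolve the server URL along these lines. This is a sketch of the fallback behavior described (private server if set, otherwise https://github.com); the helper name and the example URL are hypothetical, and this is not Datafold’s own code.

```python
import os
from typing import Mapping, Optional

DEFAULT_GITHUB_SERVER = "https://github.com"


def github_server_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Github server to use, preferring the GITHUB_SERVER variable."""
    env = os.environ if env is None else env
    # Strip a trailing slash so URLs can be joined consistently.
    return env.get("GITHUB_SERVER", DEFAULT_GITHUB_SERVER).rstrip("/")


cloud = github_server_url({})
on_prem = github_server_url({"GITHUB_SERVER": "https://github.example.internal/"})
print(cloud, on_prem)
```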

Bug fixes

  • In some cases, data diffs were not canceled after CI run cancellation. These diffs were stuck in a WAITING status forever.
v1.41
June 23, 2022

Multidimensional Alerts (Beta)

Users can use `GROUP BY` in alert queries to dynamically produce several time series at once. Each dimension is named after the values of the dimensional/categorical field(s) of `GROUP BY`; its thresholds and anomaly detection can be configured separately. New time series will appear (and disappear over time) as the data changes, without the need to maintain a plethora of alerts with `WHERE` filters.
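To picture how one grouped query fans out into several time series, here is a small sketch. The rows mimic what a query grouped by day and country might return; the column layout and the data are made up for illustration.

```python
from collections import defaultdict

# Hypothetical result rows of (day, dimension value, metric), as a query
# with GROUP BY day, country might produce.
rows = [
    ("2022-06-01", "US", 120),
    ("2022-06-01", "DE", 40),
    ("2022-06-02", "US", 130),
    ("2022-06-02", "DE", 42),
    ("2022-06-02", "FR", 7),  # a new dimension value appears on day two
]


def split_into_series(rows):
    """Group rows into one time series per dimension value."""
    series = defaultdict(list)
    for day, dimension, value in rows:
        series[dimension].append((day, value))
    return dict(series)


series = split_into_series(rows)
for name, points in sorted(series.items()):
    print(name, points)
```

The `FR` series shows up on its own once data for it exists, with no extra alert and no hand-written `WHERE country = 'FR'` filter.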

This feature is currently in Beta and is available upon request — please reach out to support@datafold.com to enable it for your organization.


Datafold <> Hightouch Integration

Hightouch models and syncs are now discoverable through the Datafold Catalog and visible in Datafold’s Column-Level Lineage - making it possible to trace data from source to activation.

This feature is currently available upon request — please reach out to support@datafold.com to enable it for your organization.


See downstream data applications in PR Comments

Datafold now shows downstream data applications, e.g. Mode reports and Hightouch syncs, that might be affected by a code change.


Data Diff results materialization

Users can now save Data Diff results in their databases for further analysis. Current support is limited to PK duplicates, exclusive PKs, and all value level differences.


Other changes

  • Significantly improved CI-based Data Diff performance for large warehouses with many tables, schemas, etc.
  • Expandable metric graphs to make comparison more convenient.
  • For On-Premises Implementations - If the environment variable `DATAFOLD_AUTO_VERIFY_SAML_USERS` is set to "true", then users created during SAML sign-up will not have to verify their emails.
  • Better display for values match indicator in Data Diff -> Values tab.
  • Reformatted long alert names in the filter popup for readability.

Bug fixes

  • Resolved the issue where the Datafold-sdk failed to perform a primary keys check for manifest.json if there were some tables in the manifest that had not yet been created in DB.
  • Jobs request fails when filters are cleared.
v1.40
June 9, 2022

Databricks support

You can now add Databricks as a data source, with full support for Data Diff, table profiling, and column-level lineage.

Other changes

  • Data Diff sampling thresholds are no longer limited to hardcoded defaults and can now be configured from the UI.
  • We updated the Jobs page to make connection types, table names, and runtimes easier to read.

Bug fixes

  • Slack and email alert notifications were not delivered for some customers between 2022-05-31 18:00 UTC and 2022-06-07 11:00 UTC (SaaS)
  • Profile histograms and completeness info did not render immediately on load.
  • Job Source filter did not contain all the possible values that our API can return.
  • “Created time” and “last updated time” were not displayed in the list of Jobs.
  • Incorrect status in gitlab CI pipelines. Datafold App will no longer block a merge if something is wrong with the Datafold App.
v1.39
May 26, 2022

Lineage UI filters

Navigating large lineage graphs is now easier with filters that help filter out the noise. Datasource/database/schema filters allow you to control the amount of information displayed.


User group mapping between Datafold and SAML Identity Providers

Organizations using a SAML Identity Provider (Okta, Duo, and others) to authenticate users to Datafold via Single Sign-On can now set up a mapping between SAML and Datafold user groups. Users will be automatically assigned to the desired Datafold groups according to the pre-configured mapping when using SAML login.
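Conceptually, the mapping works like a lookup from identity-provider groups to Datafold groups, applied at login. The sketch below uses made-up group names and a hypothetical helper; the actual mapping is configured in Datafold, not in code.

```python
def datafold_groups_for(saml_groups: list[str], mapping: dict[str, set[str]]) -> set[str]:
    """Collect the Datafold groups for a user's SAML groups; unmapped groups are ignored."""
    assigned: set[str] = set()
    for group in saml_groups:
        assigned.update(mapping.get(group, ()))
    return assigned


# Hypothetical pre-configured mapping: SAML group -> Datafold groups.
mapping = {
    "okta-data-eng": {"Editors"},
    "okta-analysts": {"Viewers"},
}

groups = datafold_groups_for(["okta-data-eng", "okta-sales"], mapping)
print(groups)
```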

This feature is available on request — please get in touch with Datafold to enable it for your organization.


Other changes

  • Added a special method to our SDK to check the correctness of dbt artifacts submitted to Datafold when using the dbt Core integration. Now Data Diff can finish even if something is wrong with uploading dbt artifacts. See the documentation for details.
  • Now Datafold shows Slack users/groups with the conventional @-form, like in the Slack App.
  • SAML validation & configuration errors are now exposed to users so that they can debug their setup.

Bug fixes

  • Sometimes the job status is displayed as `notAvailable`.
  • BI reports with special characters in names (slashes, hashes, etc) are not displayed or routed correctly.
  • When BI report's preview is downloaded with an error, the loading indicator is displayed forever.
  • Multi-word search requests were squashed, omitting spaces.
  • Inviting a user that was already in Datafold caused an error with an unclear message. Now it says explicitly that the problem is with the user being already invited.
v1.38
May 12, 2022

Data Diff sampling for small tables disabled by default

To avoid unnecessary overhead, Data Diff sampling is disabled for smaller tables. For now, the table-size thresholds are hardcoded defaults; a configuration UI is coming. See the documentation for more details.

Other changes

  • Alert query columns are automatically classified to time dimension and metric columns; there is no more need to put the time column first.
  • Datafold no longer uses labels on GitLab to track the status of the Data Diff process; the status can now be tracked from the CI pipelines functionality.

Bug fixes

  • Issue with include and exclude columns in diffs
  • Off-charts dependencies of the in-focus table in Lineage are now displayed (and act) correctly as "Show more" → Change direction of Lineage
  • The Settings menu item in the Admin section is sometimes not rendered correctly
  • Catalog search by one- and two-letter words does not work
  • Rows with NULL primary keys always got filtered out during data diff if sampling had been enabled
v1.37
April 28, 2022

Data Diff filters can be configured in the dbt model YAML

Now you can configure Data Diff filter defaults in dbt model YAML. Filtering can be used to force Data Diff to compare only a subset of data; e.g., you may want to compare just the latest week to save DWH resources and reduce diff execution time. See the documentation for details.

Other changes

  • Selecting a column and its connected nodes in Lineage now shows an indicator that also lets you exit the selected-path mode. Clicking on empty space to exit is deprecated.
  • Fold sections of Github / Gitlab printouts to save screen space. They can be easily unfolded to check verbose diff information.
  • Show actual Slack error codes on test notifications, so that users can debug their Slack-Datafold integration.
  • Datafold now sends a confirmation email when SAML users are auto-created.
  • Lineage now shows all columns of a table that are in the database, not only the ones that have connections detected by Lineage.
  • Improvement to the autocomplete feature in Data Diff.

Bug fixes

  • API key not copied into clipboard with input built-in tool
  • Cell data in Data Diff Sampling tab is not copied from the popover
  • Sometimes NaN appears instead of alert weekly estimates.
  • Disabled users logging in through OAuth no longer raise an error.
v1.36
April 20, 2022

You can now receive Alert notifications at arbitrary webhooks with arbitrary payloads (including but not limited to JSON) — in addition to Slack & email notifications. See the documentation for details.

This feature is available only on request — please contact Datafold to enable it for your organization.

For API-first users, all errors from all API endpoints are now unified as per RFC 7807 with the same structured JSON payload, and the 4xx HTTP status codes are normalized for most cases. This should simplify parsing error messages, such as those caused by invalid input or incompatible configuration. UI error messages will also be more descriptive in some cases where they previously were not.
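For clients that want to handle these errors uniformly, a minimal parser for an RFC 7807 problem document could look like the sketch below. The five fields are the standard members defined by the RFC; the example body is invented for illustration and is not a recorded Datafold response, which may also carry extension members.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemDetails:
    """Standard RFC 7807 members; all are optional in the spec."""
    type: str = "about:blank"  # the RFC's default when "type" is absent
    title: Optional[str] = None
    status: Optional[int] = None
    detail: Optional[str] = None
    instance: Optional[str] = None


def parse_problem(body: str) -> ProblemDetails:
    """Parse a JSON error body into ProblemDetails, ignoring extension members."""
    data = json.loads(body)
    return ProblemDetails(
        type=data.get("type", "about:blank"),
        title=data.get("title"),
        status=data.get("status"),
        detail=data.get("detail"),
        instance=data.get("instance"),
    )


# Hypothetical error payload.
example_body = '{"title": "Unprocessable Entity", "status": 422, "detail": "Invalid input."}'
problem = parse_problem(example_body)
print(problem.status, problem.title, problem.detail)
```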

Other changes

  • A new API endpoint, `/api/v1/dbt/check_artifacts/{ci_id}`, checks for dbt artifacts after uploading. This endpoint can be called during a CI process, for example, in Github actions or Gitlab CI, to help Datafold understand the status of downstream tasks.
  • Improved performance of dataset suggestions in Data Diff, now search-based.

Bug fixes

  • Lineage off-chart dependencies for upstream nodes not displayed
  • Snowflake table/column casing issues are resolved
  • Special characters are now properly handled on the data source names
  • Table profiling will not be done for disabled data sources
  • Lineage column selection dropped after table expansion
  • Jobs UI now shows main jobs instead of result sub-jobs for profiling and data diffs
  • Off-chart edge switches lineage direction for primary table
  • Redirect to lineage from profile was sometimes broken
v1.35
April 4, 2022

Refactored navigation design

Other changes

  • Improved formatting of integers for column profiles in Data Diff
  • Profile now displays the list of columns with their descriptions and tags, even if profiling is disabled
  • Added excludes/includes support to GraphQL search endpoint

Bug fixes

  • Fix: lineage not expanding for the second time
  • Fix: last run filter in search showing numbers instead of days/weeks
  • Fix: expanding lineage showing incomplete list of tables
  • Fix: incorrect sorting in a primary key block in the Data Diff UI
  • Fix: ability to navigate to data source creation dialog with non-confirmed e-mail
v1.34
March 29, 2022

SAML

Organizations can now use any SAML Identity Provider to authenticate users to Datafold via Single Sign-On. This includes Google, Okta, Duo, and many others, including private/corporate identity providers.

Other changes

  • During CI runs, data diff jobs will automatically select a created_at or updated_at column with an appropriate timestamp type as the time dimension
  • Catalog search has been improved in both performance and result ranking
  • Tags automatically created during dbt processes that have been superseded are periodically removed
  • A custom database can be specified for Lineage metadata in Snowflake sources
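The automatic time-dimension selection described in the list above can be pictured as follows. The set of timestamp type names and the preference for `created_at` over `updated_at` are assumptions made for this sketch; the changelog only states that one of the two is picked when it has an appropriate timestamp type.

```python
from typing import Optional

# Assumed type names that count as "an appropriate timestamp type".
TIMESTAMP_TYPES = {"timestamp", "timestamptz", "datetime"}
# Assumed preference order between the two candidate columns.
CANDIDATE_COLUMNS = ("created_at", "updated_at")


def pick_time_dimension(columns: dict[str, str]) -> Optional[str]:
    """Return the first candidate column whose type is a timestamp, else None.

    `columns` maps column name -> column type name.
    """
    for name in CANDIDATE_COLUMNS:
        if columns.get(name, "").lower() in TIMESTAMP_TYPES:
            return name
    return None


picked = pick_time_dimension({"id": "integer", "created_at": "varchar", "updated_at": "TIMESTAMP"})
print(picked)
```

Here `created_at` is skipped because it is stored as text, so `updated_at` is chosen.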

Bug fixes

  • Masked fields in Snowflake data sources could cause errors when materializing temporary tables
  • Disabled users could not be re-enabled
  • Posting labels to Gitlab triggered notifications when there were no changes
  • Table profiling failing for views in PostgreSQL data sources
v1.33
March 8, 2022

New Lineage UI

The lineage UI was updated to improve the performance for large graphs and to make exploring dependencies more intuitive. Among other changes, the view now distinguishes between upstream and downstream graph directions, and filter settings have moved to the top to provide a larger area for the lineage canvas.

Improved Slack alert messages

To make the anomaly notifications more actionable, the notifications now include the alert name, the actual value and provide more context to the anomaly that occurred.

Reduced verbosity for new tables in the Data Diff CI output

When new tables are created in a PR, the block has been reduced to only show the number of rows and number of columns, and a link to the table profile is inserted.

Other changes

  • Automatically created tags from ETL are now cleaned up automatically after their initial use to reduce tag clutter

Bug fixes

  • BI dashboards stopped displaying in the catalog
  • Added missing icons of BI data sources
  • Lineage paging stopped loading off-chart dependencies
  • Github refresh button didn’t work correctly
  • dbt metadata synchronization for dbt older than 1.0.0 in combination with Snowflake didn’t work correctly
v1.32
February 21, 2022

Fine-grained control of what data assets show up in Datafold

There is such a thing as too much data observability. To help you separate signal from the noise and only see tables that actually matter, we added fine-grained settings that allow you to define which databases, schemas, and tables should show up in Datafold Catalog and Lineage and which should be hidden (e.g. dev/temp tables). The filtered out data assets can still be found by their full name (e.g. “db.schema.table”)

Alert subscriptions for Slack user groups

Slack user groups can now be subscribed to alerts (e.g. all members of team X, on-call engineers, incident commanders). The special handles @channel & @here can also be notified of alerts, reaching all or currently online members of a channel respectively.

Pausing data source in the UI

You can temporarily disable or pause a data source in the UI.

Other changes

  • Subscribed users will be notified in case an alert has an execution error (e.g. database permission/connection failure) — not only on actual anomalies
  • Improved alert texts in Slack
  • Dramatic speedup of schema download from Snowflake
  • For Data Diff in CI, unchanged tables are grouped at the top of the report
  • For manually created Data Diffs, the primary key case is automatically inferred
  • Data diffs on Snowflake are now running much, much faster

Bug fixes

  • Fix: Notifications were sent to deleted integrations/destinations for some time after the deletion. No more
  • Fix: Slack App integrations were sometimes not showing users & channels if reinstalled from Slack, not from Datafold
  • Fix: Plain CI configuration could not be saved/edited when the template variables section was empty.
  • Fix: Setting update time for Alerts
  • Fix: Proper DB types mapping for the new Snowflake schema downloader
  • Fix: non-existing Slack users are filtered from Alerts
  • Fix: A lot of upstream deps take too much space in the layout. Now we're showing the first 3, and the rest are available in Lineage UI
  • Fix: Multiple tables in a CI diff were too large for a single comment post. The tables are now paginated across multiple comments
  • Fix: Hours jump in Alerts time picker
v1.31
February 4, 2022

Data Diff can now compare VARIANT type in Snowflake

Other changes

  • Added the ability to pause a data source in the API. When a data source is paused, all its data is retained in the system but schema indexing, profiling, and lineage processing are disabled
  • Improved error reporting for Redshift data sources when Datafold does not have permissions to access the table
  • Lineage speed improvements

Bug fixes

  • Fixed a bug where spaces in Data Diff values tab were missing
  • Fixed an issue where a Github integration didn't show an error message when it cannot be deleted
  • Fixed a bug where the user invite link for organizations that have Okta enabled did not work
  • Fixed a bug where BI reports could appear orphaned, not having any links to tables
  • Fixed a bug where a CI run could fail if the dbt manifest didn’t contain the raw relation name
  • Fixed a bug where the CI reported booleans instead of numbers for the number of mismatched columns
  • Fixed a bug in CI where, when a table has no differences, the link to the table profile malfunctioned
  • Fixed testing Github repository connections
  • Fixed Slack notifications where the integration could not be deleted if currently used in alerts. In the new behavior, it will unsubscribe all related notification methods from alerts as the integration is deleted
v1.30
January 25, 2022

Allow CI to continue if Data Diff fails

When you integrate Data Diff into the CI flow, you can control whether an error during Data Diff processing causes the CI flow to fail or continue. This allows you to configure Datafold to be non-blocking in your CI which can be helpful when introducing Data Diff in your development process initially.


Support for key-pair authentication for Snowflake

In our effort to support the most secure practices possible, we’ve added the ability to configure a Snowflake data source to use key-pair authentication. This is more secure than password authentication alone. See Datafold’s Snowflake documentation for details.

Other changes

  • Visually collapse Data Diff reports if no changes are detected to save users time
  • Optimized schema fetching during a data diff to reduce the runtime of a single diff, as well as the load on the data warehouse
  • Irrelevant diff views are now hidden if the primary key was not specified
  • The “time dimension” field in the Data Diff view now suggests only date/time columns

Bug fixes

  • Integrations could not be deleted if they were used in any alerts
  • Minor rendering issue with Datafold logo on the login page
  • In the Data Diff view, each of the Dataset text entry fields had its input blocked while its loading indicator was active
v1.29.1
January 12, 2022

Data Diffs without primary keys

Now you can run data diffs without specifying primary keys to compare table schemas and column profiles. Specifying primary keys is required for value-level comparison.

Other changes

  • GitLab CI integrations now respect the file ignore lists (previously, it was supported only for GitHub)
  • Improved filters autocomplete performance

Bug fixes

  • Alert deletion could sometimes be slow or time out
  • An unnecessary expand icon in the data source tree filter is not shown anymore
  • UI could break if you had more than 500 tags in the organization
v1.28.6
December 29, 2021

Data Diff improvements

Sum and Average diff metrics

Data Diff now also compares sums and averages for numerical columns which can be helpful for analyzing changes in distributions:
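As a small worked example of why sums and averages are useful, the sketch below compares one numeric column across two versions of a table. The values and helper names are invented; Data Diff computes these metrics in the warehouse, not in Python.

```python
def numeric_summary(values: list[float]) -> dict[str, float]:
    """Sum and average of a non-empty numeric column."""
    return {"sum": sum(values), "avg": sum(values) / len(values)}


def summary_delta(a: list[float], b: list[float]) -> dict[str, float]:
    """Difference (b minus a) of each summary metric."""
    sa, sb = numeric_summary(a), numeric_summary(b)
    return {metric: sb[metric] - sa[metric] for metric in sa}


# Hypothetical `amount` column in production vs. a development branch.
prod_amounts = [10, 20, 30]
dev_amounts = [10, 20, 60]

delta = summary_delta(prod_amounts, dev_amounts)
print(delta)
```

Both tables have three rows, so a row count alone would show no change, while the sum and average reveal that the distribution shifted.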

Improved handling of long values

When browsing value-level diffs, overflowing values can be explored and compared by hovering over them. The long values can now be copied to clipboard for further analysis.

Ignoring certain files in Data Diff CI

A new setting for CI integrations allows users to selectively ignore files modified in a PR and skip running Datafold for irrelevant changes. Files can be excluded, re-included, and re-excluded again, thus allowing complex patterns for cases like “only run datadiffs if any dbt files have changed, except for the .txt and .md files in that folder”.
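The exclude, re-include, re-exclude behavior can be modeled as an ordered rule list where the last matching rule wins. The sketch below assumes a gitignore-like syntax with `!` marking a re-include; that syntax, the rule list, and the file paths are illustrative assumptions, so check the documentation for the real pattern format.

```python
from fnmatch import fnmatchcase


def is_ignored(path: str, rules: list[str]) -> bool:
    """Apply rules in order; the last rule that matches decides.

    A plain pattern excludes matching files; a pattern prefixed with "!"
    re-includes them (assumed syntax).
    """
    ignored = False
    for rule in rules:
        if rule.startswith("!"):
            if fnmatchcase(path, rule[1:]):
                ignored = False
        elif fnmatchcase(path, rule):
            ignored = True
    return ignored


def should_run_diff(changed_files: list[str], rules: list[str]) -> bool:
    """Run only if at least one changed file is not ignored."""
    return any(not is_ignored(path, rules) for path in changed_files)


# "Only run if dbt files changed, except .txt and .md files in that folder."
rules = ["*", "!dbt/*", "dbt/*.md", "dbt/*.txt"]

print(should_run_diff(["README.md", "dbt/notes.txt"], rules))
print(should_run_diff(["dbt/models/orders.sql", "README.md"], rules))
```

Note that `fnmatchcase` lets `*` match across `/`, so `dbt/*` covers nested files such as `dbt/models/orders.sql`.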

Lineage Improvements

Original SQL queries

You can see the SQL query that was used to create/update a table or refresh a BI report in both the Datafold Catalog and Lineage views:

BI report filtering

BI reports in Lineage can now be filtered by popularity and freshness:

Mode dashboard previews

You can see a preview screenshot for any Mode report on the Datafold Lineage graph:

Other changes

  • Timepicker in Alerts schedule now has a correct “Now” button that converts current time to UTC using the time zone from the browser
  • Now you can use Cmd/Ctrl + Click to open a data diff or an alert in a new tab
  • You can now see “Last run datetime” in the list of alerts.

Bug fixes

  • SQL queries are now visible again in Profile and Lineage for tables
  • Multiple lineage UI improvements
v1.27.2
December 13, 2021

New Datafold Slack App and alert subscriptions

Adding Slack channel destinations is easier with the new Slack App. Users can subscribe to alerts and get mentioned in the designated channels allowing for more targeted alerting and collaborative incident resolution. Documentation is available here.

Single sign-on through Okta

Single sign-on through Okta is now available for Datafold Cloud.

Datafold <> Mode Integration now in beta

Mode reports are now discoverable through Datafold Catalog and appear in Lineage which enables tracing data flows on a field level all the way to Mode reports and dashboards. Let us know if you would like to enable it for your account.

Other changes

  • Fix for faux-off-chart-deps in Lineage
  • Added a UTC notation to Last Run in Catalog results
  • Row counts in Diff now take time travel specifiers into account
  • Improved refreshes for the GitHub app to use the app authentication token instead of the user-to-server token
  • Added the database name to all Redshift and PostgreSQL tables. This allows for use of dbt integration for those databases, and lineage in case of Redshift if cross-database queries are used in the ETL process.

v1.26.1
November 29, 2021

Diffing for advanced data types

Data Diff can now compare Snowflake's VARIANT and ARRAY types. Profiling information won't be generated for those columns, but they will show up in overall statistics, and in the Values tab. Previously VARIANT and ARRAY types were ignored during comparisons.

Improved diff sampling

When comparing tables (for example, Staging and Prod versions of your dbt model), Data Diff provides a sample of divergent values for every column that doesn’t fully match between tables. Previously Diff would select ~15 rows for every column that had differences. If there were just a few such columns, the overall sample size could be quite small. The algorithm now selects ~1,000 rows regardless of the number of columns that are different.

Bug fixes

  • Fixed an issue where the “$” character was not accepted in a password
  • Improved integer formatting throughout the app
  • Improved performance in the Catalog search input
  • Fixed 5+ smaller UI issues
v1.25.3
November 20, 2021

Mode reports in Lineage & Catalog

Mode is now available as an integration in Datafold in alpha testing mode. Once enabled, Datafold will index all reports in your Mode account to make them available in the Datafold Catalog search and Lineage.

You can now discover relevant Mode reports alongside datasets in the same search experience. It’s also possible to filter Mode reports based on popularity and freshness.

You can trace field-level data lineage to Mode reports in the Datafold Lineage view to see which tables and columns feed what report, making it easy to perform refactorings and troubleshoot issues:

New Jobs UI

With the new Jobs UI you can check what tasks are currently running in your Datafold account and easily troubleshoot various integrations such as Diff in CI as well as audit the use of Datafold.

Bug fixes

  • Fixed displaying of Alert schedules when an hourly interval is selected.
v1.24.1
November 11, 2021

Automatic inference of primary keys for dbt models + CLI tool to check primary key settings for Data Diff

For Data Diff to work in CI, it needs to know the primary key for each table it analyzes. Datafold provides a few options for defining primary keys in the dbt model:

  • Define it as meta.primary_key in dbt YAML
  • Define it as a table or column-level tag in dbt YAML
  • Automatically infer primary keys based on uniqueness tests

To help you ensure that Data Diff can look up or infer primary keys for all tables in your dbt project, we added the check-primary-keys command to the Datafold CLI.
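The three options above can be thought of as a lookup that tries each source in turn. The following is a minimal sketch of that idea over a parsed dbt model entry, not the logic Datafold or the CLI actually runs. The dictionary layout, the tag name `primary-key`, the order in which the options are tried, and the function name `resolve_primary_key` are all assumptions for illustration, and only column-level tags are handled.

```python
from typing import Optional

# Hypothetical tag name; check the Datafold docs for the exact spelling.
PK_TAG = "primary-key"


def resolve_primary_key(model: dict) -> Optional[list[str]]:
    """Try explicit meta first, then column tags, then uniqueness tests."""
    # Option 1: meta.primary_key defined on the model.
    meta_pk = model.get("meta", {}).get("primary_key")
    if meta_pk:
        return [meta_pk] if isinstance(meta_pk, str) else list(meta_pk)

    columns = model.get("columns", [])

    # Option 2: columns carrying the primary-key tag.
    tagged = [c["name"] for c in columns if PK_TAG in c.get("tags", [])]
    if tagged:
        return tagged

    # Option 3: infer the key from the first column with a `unique` test.
    for column in columns:
        if "unique" in column.get("tests", []):
            return [column["name"]]

    # Nothing found: this is the case a primary key check would flag.
    return None
```

A model for which this returns `None` is the kind of gap that a primary key check is meant to surface before Data Diff runs in CI.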

Quickly navigate to columns using Go To search bar in Diff UI

Now you can quickly jump to any column in the Diff Values tab which can be helpful when diffing especially wide tables:

v1.23.0
October 14, 2021

Run Data Diffs only with the Datafold label

There are situations where you don't want to run Data Diff in your CI unconditionally. Running it on every change is the recommended way to make sure that no unintended changes slip through. As with unit and integration tests in CI, you generally don't want to disable the checks, because a change could then break something without you knowing it.

When you're integrating Data Diff, you sometimes want to try it on a select number of changes. This is why we added a new option to the CI integration:


When this box is checked, opening a new pull request won't start a Data Diff right away. The actual diff starts once the Datafold label is set on the pull request in GitHub/GitLab.

Improvements for Postgres data sources

Postgres lets the currently logged-in user switch to a selected role and hold only that role's privileges. This is done using the SET ROLE command. SET ROLE effectively drops all the privileges assigned directly to the session user and to the other roles it is a member of, leaving only the privileges available to the named role. This is now supported for both PostgreSQL and Aurora PostgreSQL as an optional extra parameter in the data source configuration.

For Aurora PostgreSQL data sources, we've also added an optional keep-alive setting that lets you turn on keep-alives for very long-running queries. The parameter is specified in seconds. Leave the option empty to disable keep-alives.

Tooltips added to data source fields to avoid confusion

To provide more context for the options available in the data source configuration screen, we have added tooltips. We hope this makes configuration a little easier, without switching back and forth between the app and our documentation pages.

Optimization for GraphQL

Our new GraphQL API is also becoming more mature. We applied a performance optimization for loading database and schema info. Previously it was required to load the tables first, but those can now be queried separately.

Bug fixes

We have also added a couple of bug fixes:

  • Fixes bug where a CI configuration could not be created without the require_label set
  • Fixes selected suggestion id flashing in search autocomplete
  • Fixes page size navigation in the Data Diff's Values tab
  • Fixes error that was thrown when empty sampling results arrived in the Table Profile sample tab
  • Fixes the frontend being flooded with 500 errors when alert estimates encountered an error
  • Fixes sampling table not being re-rendered when new results come in after reload
v1.22.0
October 2, 2021

Improved messaging on the GitHub integration

This update is based on customer requests for more meaningful feedback during the Data Diff process. We added more information to the GitHub statuses shown while Data Diff is running.


For example, we include the git hash of the job that it is waiting for. After the job starts, it will show a link to the actual job:


This can be either the job building the pull request or the job building the main branch. It helps you understand what’s going on while Data Diff runs, and what it is waiting for.

v1.21.0
September 20, 2021

datafold-sdk upload-and-wait

The datafold-sdk is used to synchronize information into Datafold after a dbt run. Datafold extracts the table and column information and uses it for Data Diff when running on a pull request.

It is common practice to clean up the tables once a run on a pull request has finished. But Datafold might still need these tables to run the Data Diff. That is why we added the upload-and-wait command. Instead of starting the Data Diff asynchronously, it blocks until the Data Diff completes. This makes sure that you don’t drop the tables before the Data Diff has finished.
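The blocking behaviour can be pictured as a polling loop that only returns once the diff has reached a final state. The sketch below shows that pattern in Python; it is not the datafold-sdk's code. The `get_status` callback, the status strings `"running"`, `"done"` and `"failed"`, and the default intervals are hypothetical placeholders.

```python
import time
from typing import Callable


def wait_for_diff(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Block until the diff reports a terminal status, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()  # hypothetical status lookup
        if status in ("done", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Data Diff did not finish in time")
        sleep(poll_seconds)
```

In a CI script, the cleanup step that drops the pull request's tables would run only after a call like this has returned.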

Catalog support for dbt sources and seeds

Datafold works seamlessly with dbt. With the latest version of Datafold, we support synchronizing the metadata from dbt’s sources and seeds. Sources are tables that are external to dbt, often tables in the landing zone. When declaring a source, you can annotate it with additional information, which is also synchronized to Datafold.

Smart scheduler

New Smart Scheduler service to manage data source concurrency when scheduling table profiling tasks.

We’ve implemented a new scheduler that we call the smart scheduler. Most users know that certain tasks can impose some load on the data warehouse. The smart scheduler gives us more control over which tasks run at the same time, resulting in a more predictable load. We built it together with our Redshift users, because Redshift doesn’t handle concurrency very well, and it provides a way to run these tasks gently.
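To give a feel for what managing data source concurrency means, here is a minimal sketch that caps the number of tasks running against one data source with a semaphore. It illustrates the general technique only and says nothing about how the Smart Scheduler is built internally; `SourceLimiter`, `run_all`, and the limit of two concurrent tasks in the usage note are assumptions for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SourceLimiter:
    """Caps how many tasks may run against one data source at a time."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest concurrency observed, kept for inspection

    def run(self, task, *args):
        with self._slots:  # blocks while all slots are taken
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return task(*args)
            finally:
                with self._lock:
                    self._running -= 1


def run_all(limiter, task, items, workers=8):
    """Submit every item to a thread pool; the limiter throttles execution."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: limiter.run(task, item), items))
```

With `SourceLimiter(max_concurrent=2)`, eight worker threads can be queued up, but no more than two profiling queries hit the warehouse at the same moment.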

Descriptive errors on profiling errors

It can happen that a query against the data warehouse results in an error. Maybe the database is offline? Maybe the table is huge and the query takes a very long time? Or, as in the example below, a divide-by-zero occurs at runtime. We now show more informative errors when the profiling job fails.


Lineage edges are now hoverable showing source and target nodes, which are highlighted on edge click.

Improved Lineage navigation: when switching central table origin, also switch table for Profiling and Sampling tabs.

v1.20.0
September 8, 2021

Add GraphQL API for lineage

GraphQL is an increasingly popular method for retrieving information. It gives the developer more control over the desired entities and which specific fields they want to access. We now support a GraphQL API for our lineage information. Read more about it in this technical blog.
We’re continuously adding more information to the GraphQL API. For the latest state, please refer to the documentation.

Support dbt_utils for inferring Primary Keys

To run Data Diff, Datafold uses the primary key of the table to see what changed. One popular way of checking this constraint is the unique_combination_of_columns test from dbt_utils. Datafold now detects the use of these tests and infers the primary key from them. This allows you to get started with Data Diff easily. You can also always set the primary key explicitly if desired.
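As a rough illustration of this inference, the sketch below reads a model's parsed `tests:` list and returns the columns named by a `dbt_utils.unique_combination_of_columns` entry. The test name and its `combination_of_columns` argument come from dbt_utils; the function name `infer_pk_from_tests` and the input shape (the YAML already loaded into Python lists and dicts) are assumptions, and this is not Datafold's own code.

```python
from typing import Optional

TEST_NAME = "dbt_utils.unique_combination_of_columns"


def infer_pk_from_tests(model_tests: list) -> Optional[list[str]]:
    """Return the composite key declared by the first matching test, if any."""
    for test in model_tests:
        # Tests without arguments are plain strings; tests with arguments
        # are single-key dicts, which is the form this test always takes.
        if isinstance(test, dict) and TEST_NAME in test:
            return list(test[TEST_NAME]["combination_of_columns"])
    return None
```

For a model whose test lists `country_code` and `order_date` as the unique combination, those two columns would be used together as the primary key.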

Revamped the signup flow with a new UI to create a better user experience, and simplified dbt configurations.

v1.19.0
August 20, 2021

Data Diff time travel in BigQuery and Snowflake

Time travel is a useful feature of some modern data warehouses that allows querying a table as it was at a particular point in time. Using that feature in combination with Data Diff can be very helpful for detecting data drift in a table by diffing it against its older version. When testing changes in prod vs. dev environments, time travel can also help align both environments on the state of source data.
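For reference, this is roughly what time travel looks like in the warehouses' own SQL: Snowflake uses an `AT(OFFSET => ...)` clause and BigQuery uses `FOR SYSTEM_TIME AS OF`. The small helpers below only build such query strings for illustration. The function names and table names are made up, and the way a time travel point is entered in Data Diff itself may differ from these raw clauses.

```python
def snowflake_time_travel(table: str, seconds_ago: int) -> str:
    """Query a Snowflake table as it was `seconds_ago` seconds in the past."""
    if seconds_ago < 0:
        raise ValueError("seconds_ago must be non-negative")
    return f"SELECT * FROM {table} AT(OFFSET => -{int(seconds_ago)})"


def bigquery_time_travel(table: str, hours_ago: int) -> str:
    """Query a BigQuery table as it was `hours_ago` hours in the past."""
    if hours_ago < 0:
        raise ValueError("hours_ago must be non-negative")
    return (
        f"SELECT * FROM `{table}` FOR SYSTEM_TIME AS OF "
        f"TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {int(hours_ago)} HOUR)"
    )


# Hypothetical table names, used only to show the generated SQL.
print(snowflake_time_travel("analytics.orders", 3600))
print(bigquery_time_travel("my_project.analytics.orders", 1))
```

Diffing the current table against the result of such a query is one way to see how much the data drifted over the chosen window.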

v1.18.8
August 12, 2021

GitLab support for Data Diff

Now it’s possible to automate full impact analysis of every PR to ETL code in GitLab repositories. See how a change in the code will impact the data produced in the current and downstream tables.

More information on how to set it up can be found in the docs.

Added support for alerts on scalar values

While the true power of ML-aided alerts comes from monitoring metrics over time, sometimes it may be helpful to check a single value against a set threshold.

v1.18.0
August 5, 2021

Catalog learns about your data from everywhere

Datafold will now automatically populate Catalog with column and table descriptions & tags from dbt, Snowflake, BigQuery, Redshift and other systems, creating a unified view.

Additional descriptions can be added using Datafold’s built-in rich text editor.

v1.17.1
July 27, 2021

Primary keys for dbt models for Data Diff CI integration can now be specified on a table level

  • Errors and warnings are now collapsed in GitHub/GitLab comments to avoid bloat
  • Improved performance of the Catalog search filter
  • Improved handling of dbtCloud retries: after receiving 500 errors from the dbtCloud service, Datafold now retries 4 times, for up to 4 seconds
  • Data source log extractor for lineage can now be done on a cron schedule
  • Alerts now show the modified at timestamp
  • Improved crontab validation: removed once-an-hour restrictions on scheduling
  • It is now possible to disable alert query notifications
  • Catalog now shows the timestamp when the dataset was last modified
v1.16.0
July 16, 2021

Customizable Tags

Since tags became a really popular way to document tables, columns, and alerts in Catalog, many of you have requested a better way to manage them including the ability to customize their color to enhance readability. Now all tags can be created, edited and deleted in the Settings menu.

Improvements

  • Improved profiler reliability
v1.15.0
June 29, 2021

Interactive external dependencies

Lineage graphs can often get very complex and messy with all dependencies plotted at once. That’s why by default, Datafold shows a slice of the full lineage graph centered on a particular table (“dim_businesses” in the image below). That means that the graph will show tables and columns directly upstream or downstream of the chosen table.

At the same time, downstream tables (“report_hourly_bysiness_pageviews”) may have other upstream dependencies unrelated to the table on which the lineage view is centered. To avoid bloat, those dependencies are shown as dashed lines. Clicking on them will center the lineage graph on the chosen table.

v1.13.0
May 28, 2021

Per-column Data Diff Tolerances

Sometimes it may be helpful to compare columns with a threshold instead of strict equality. For instance, when a database column is a FLOAT computed as a division of aggregates (e.g. COUNT(*) / SUM(someFloatCol)), the results of the computation are not strictly deterministic, resulting in differences that are irrelevant from the business standpoint but would be flagged by diff if strict equality is used: 1.1200033 vs. 1.1200058. Diff tolerance allows you to specify an absolute or relative threshold below which differences in values would be considered equal.
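A minimal sketch of such a comparison is shown below, using the two values from the example above. How Datafold defines the relative threshold (for instance, relative to which of the two values) is not stated here, so scaling by the larger magnitude is an assumption, as are the function and parameter names.

```python
def within_tolerance(
    a: float,
    b: float,
    *,
    absolute: float = 0.0,
    relative: float = 0.0,
) -> bool:
    """Treat a and b as equal if they differ by no more than a threshold."""
    diff = abs(a - b)
    if diff <= absolute:
        return True
    # Assumption: the relative threshold scales with the larger magnitude.
    scale = max(abs(a), abs(b))
    return diff <= relative * scale


old_value, new_value = 1.1200033, 1.1200058  # differ by about 2.5e-6

print(within_tolerance(old_value, new_value))                 # strict equality
print(within_tolerance(old_value, new_value, absolute=1e-5))  # absolute threshold
print(within_tolerance(old_value, new_value, relative=1e-5))  # relative threshold
```

With both thresholds left at zero the check falls back to strict equality, so the two values are flagged as different. An absolute threshold of 1e-5 or a relative threshold of 1e-5 is enough to treat them as equal.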

Tags autocomplete

When entering tags, you can rely on autocomplete to avoid creating semantically similar tags:

Improvements

  • Fixed a bug that prevented admins from sending password reset emails
  • "Discourage manual profiling" flag added to data source settings. If the flag is set, when the user tries to refresh a data profile, a warning popup will appear.
v1.12.8
May 18, 2021

Fixed saving datasources and CI integrations with empty cron schedule.

v1.12.0
May 13, 2021

On-prem deployments now require an install password at first install used to check the state of the CI process.

v1.11.0
May 7, 2021

New Data Diff UI & Landing Page

Streamlined UI with more settings

Improvements

  • The application now posts update messages when waiting for dbt runs to finish.
  • Added an API endpoint to get the status of CI runs. It can be used to check the state of a CI process.
  • Use standard notation for crontab format
  • Fixed a bug where the dbt meta schedule stopped working
v1.10.0
May 1, 2021

New application root page

  • The CI config ID is now visible in the CI settings screen
  • Allow using the dbt CLI to post the manifests to Datafold, so that Datafold can run diffs in a similar way as in the dbtCloud integration
  • Documentation is now available from the header in the app
  • Fixes a bug where the dbt cloud account number was passed as a string
v1.9.2
April 23, 2021

The dbt configuration now presents a list of accounts to choose from instead of requiring the account name to be entered manually.

v1.9.0
April 20, 2021

Automatic dbt docs sync to Datafold Catalog

  • Fixed a bug where Snowflake timezone-aware fields were compared against timezone-naive instances
  • Search: added Select all / Deselect all to data source filter
  • Updated loading indication when loading data source schema
  • Search: the user is redirected to the /search page when no results are found in as-you-type mode
  • Updated usage of URL params for search
  • Search: tree and sider are now responsive (expand if schema names don't fit into width)
  • Updated scrolling UX
  • Profiler: removed experimental_ guards from new profiling and sampling UIs
  • Profiler: fixed an issue with DATE & DATETIME for Snowflake table profiles
v1.8.9
April 15, 2021

  • Lineage: fixed hanging PostgreSQL query due to query planner misoptimization
  • Lineage: hotfix for Snowflake + dbt
v1.8.8
April 13, 2021

  • Lineage: multiple small bugfixes

v1.8.7
April 12, 2021

  • Lineage: support for Snowflake semistructured data
  • Lineage: fixed a bug where some parts of the graph were not displayed
  • Profiling: bugfix in settings
  • Data Diff: fixed handling of time datatype
  • Data Diff: soft-fail on inf and NaN float values
  • Made sure that CI data diffs are resilient to server-side interruptions
v1.8.6
April 9, 2021

  • Correctly display arrays and maps in profiler sample
  • Several bugfixes in lineage UI
  • Fixes in the color scheme
  • Added support for incremental SQL log fetching to build column-level lineage
  • Several fixes in the lineage query parser
v1.8.5
April 7, 2021

Incremental Column-level Lineage

Instead of querying the entire SQL query history, Datafold now looks only at new queries and updates the lineage graph incrementally. This currently works for Snowflake and BigQuery.
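The general idea of incremental processing can be sketched with a watermark: remember the timestamp of the newest query already handled, and on each refresh parse only the queries that came after it. The code below is an illustration of that pattern under stated assumptions, not Datafold's lineage engine. `LineageState`, `update_lineage`, the `ts` and `sql` keys, and the `extract_edges` callback (a stand-in for a real SQL parser) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LineageState:
    watermark: int = 0  # timestamp of the newest query already processed
    edges: set = field(default_factory=set)  # (source_column, target_column)


def update_lineage(state, query_log, extract_edges):
    """Process only queries newer than the watermark, then advance it.

    Returns the number of queries that were actually parsed.
    """
    new_queries = [q for q in query_log if q["ts"] > state.watermark]
    for query in sorted(new_queries, key=lambda q: q["ts"]):
        state.edges |= set(extract_edges(query["sql"]))
        state.watermark = query["ts"]
    return len(new_queries)
```

Calling `update_lineage` twice on the same log parses nothing the second time, which is where the saving over re-reading the full history comes from.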

Faster Column Profiler

The column profiler now supports browsing super-wide (100+ column) tables without any interface lag.