Changelog
Introducing Monitors in Datafold
We're excited to announce that Datafold Monitors are now in GA. Monitors complement our existing features—such as data testing in CI, migration validation, and column-level lineage—to provide our customers with a more comprehensive data quality platform.
We support the following monitor types, each of which addresses a different set of problems:
- Data Diff: Identify discrepancies between source and target databases on an ongoing basis.
- Metric: Use machine learning to flag anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric.
- Schema Change: Get notified immediately when a table’s schema changes.
- Data Test: Validate your data with custom business rules, either in production or as part of your CI/CD workflow.
All of these monitor types can be created and managed in our application. However, for customers who prefer a more programmatic approach, our Monitors as Code feature allows you to configure monitors via version-controlled YAML.
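For illustration only, a version-controlled monitor definition might look roughly like the sketch below. The field names here are hypothetical, not Datafold's actual Monitors as Code schema; see the docs for the real spec.

```yaml
# Illustrative sketch only; consult the Monitors as Code documentation for the real schema.
monitors:
  orders_replication_check:
    type: diff                  # a scheduled cross-database data diff monitor
    schedule: "0 6 * * *"       # run daily at 06:00 UTC
    source: oracle.prod.orders
    target: snowflake.analytics.orders
    notify:
      slack: "#data-alerts"     # where alerts go when results deviate from expectations
```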
This is just the beginning. Keep an eye out for many exciting updates to come.
July 2024 Changelog — Enhanced lineage & Netezza integration
Here’s an overview of what’s new:
1️⃣ Enhanced Looker integration
2️⃣ dbt Exposures now available in the Data Explorer
3️⃣ Netezza integration
4️⃣ Materialize monitor diff results
Enhanced Looker integration
Datafold users can now import field labels and descriptions from Looker, enabling them to view and interact with this metadata directly in the Data Explorer.
dbt Exposures are now available in Datafold lineage
dbt Exposures are now accessible in Datafold's column-level lineage. For downstream applications or use cases defined as exposures, you can now automatically explore their lineage and understand the downstream impact of code changes on them.
Netezza integration is now available
Netezza is now available as a data source in Datafold. Datafold users can now perform in-database data diffs for tables in Netezza, or diff across Netezza and another data source (say, during a data migration) to access the power of cross-database data diffing.
Materialize full data diff results for Monitors
Teams using cross-database monitors—scheduled cross-database data diffs—can now materialize the full data diff result into their selected database. Quickly analyze discrepancies, log diff results, or perform ad-hoc analysis—faster! Simply turn on the "Materialize diff results" toggle in a monitor's settings to enable this functionality.
Introducing No-Code CI in Datafold
Setting up automated and efficient CI pipelines can be a challenge.
With our new No-Code CI integration, data teams can easily incorporate data diffing into their code review process—regardless of their data transformation or orchestration tooling—so they can deploy faster and more confidently.
As long as you’re version controlling your data pipeline code, this is for you.
Check out our recent blog post for more information.
May 2024 Changelog
Here’s an overview of what’s new:
1️⃣ No-Code CI Testing
2️⃣ Improved Downstream Impact Preview for Pull Requests
3️⃣ Case Insensitivity for String Diffing
No-code CI testing
We know not every data team uses dbt to transform and model their data. Now, Datafold Cloud users can incorporate data diffing into any CI workflow, regardless of your orchestrator. You can use the new No-Code CI integration to tell Datafold which tables to diff for each pull request via the Datafold UI. For teams that want to programmatically send Datafold a list of tables to diff for each PR, use the new API functionality.
Now available: Improved downstream impact for pull requests
Datafold extends the concept of version control, similar to tools like GitHub for code, to data itself with Data Diffs. Datafold's CI testing also allows you to see how potential changes in your data will impact other dependent assets, such as downstream dbt models, tables, views, and BI dashboards—which we think is awesome! However, we know for PRs that change many data models, this information can potentially be challenging to digest.
Very soon, we’ll introduce a combined Impact Preview that shows the full set of downstream data assets that may be impacted based on the changes in your pull request—all in a single view.
Case insensitivity for string diffing
Datafold users can now choose to ignore string case sensitivity when diffing; this can be useful if you’ve purposefully changed the casing of a string value in a dbt change, or are okay with "Coffee" and "coffee" being identified as the same!
Happy diffing!
Kira, PMM
April 2024 Changelog
What’s new in Datafold:
1️⃣ Dremio support in Datafold Cloud
2️⃣ Sample for cross-database diffs
3️⃣ Improved CI impact view
Dremio integration in Datafold
We are excited to announce the launch of our new integration with Dremio in Datafold Cloud. This integration will enable current (and future!) Dremio users to:
- Accelerate migrations to Dremio with automated data reconciliation
- Enhance dbt data quality with automated testing of dbt models in CI/CD
- Validate data replication into Dremio with Datafold’s cross-database diffing and Monitors
Dremio is the Unified Lakehouse Platform designed for self-service analytics, combining flexibility and performance at an affordable cost. With Datafold, the data testing automation platform, Dremio lakehouse users can benefit from higher data development velocity while having full confidence in their data products.
Now supported: Sampling in cross-database diffs
Datafold now supports sampling for cross-database diffs. For large cross-database diffs, leverage sampling to compare a subset of your data instead of the full dataset. Sampling supports a sampling tolerance, which dictates the acceptable level of primary key errors before sampling is disabled; a sampling confidence, which ensures that the sample accurately reflects the entire dataset; and a sampling threshold.
Consolidated impact previews for pull requests
Datafold extends the concept of version control, similar to tools like GitHub for code, to data itself with Data Diffs. Datafold's CI testing also allows you to see how potential changes in your data will impact other dependent assets, such as downstream dbt models, tables, views, and BI dashboards—which we think is awesome! However, we know for PRs that change many dbt models, this information can potentially be challenging to digest.
Very soon, we'll introduce a combined Impact Preview that shows the full set of downstream data assets that may be impacted based on the changes in your pull request—all in a single view.
Happy diffing!
New: Introducing Data Replication Testing in Datafold
Monitors: Scheduled cross-database data diffs
Monitors in Datafold are the best way to identify parity of tables across systems on a continuous basis. With Monitors, your team can:
- Run data diffs for tables across databases on a scheduled basis.
- Set error thresholds for the number or percent of rows with differences between source and target.
- Receive alerts via webhooks, Slack, PagerDuty, or email when data diff results deviate from your expectations.
- Investigate data diffs—all the way to the value-level—to quickly troubleshoot data replication issues.
Datafold will also keep a historical record of your Monitor's data diff results, so you can look back on replication pipeline performance and keep an auditable trail of success.
Demo
Watch Solutions Engineer Leo Folsom walk through how data replication testing works in Datafold.
Happy diffing!
March 2024 Changelog
Here’s an overview of what’s new:
1️⃣ Cross-database diffing 2.0
2️⃣ Improved column selection
3️⃣ Datafold API enhancements
4️⃣ Just around the corner: Data replication testing
Cross-database diffing improvements
As we continue to improve this product experience and evolve the testing of data reconciliation efforts, we are invested in making cross-database validation as efficient and impactful as possible for data teams.
I’m excited to share three major improvements in Datafold to make source-to-target validation faster and more impactful:
- Faster cross-database data diffing: Up to 10x faster data diffing for cross-database diffs, reducing time-to-insight and compute costs against your warehouses.
- Real-time diff results: See data differences live as Datafold identifies them, rather than waiting for the entire diff to complete.
- Representative samples: Establish difference "thresholds" to stop diffs once the set number of differences has been found per column, saving your compute costs and time.
Improved column selection
Want to complete a data diff, but only care about a handful of the columns? Save time and compute costs by selecting only specific columns to be compared during a data diff. This is very useful for larger tables where there are known (and acceptable) differences for certain columns.
Datafold API enhancements
Datafold's API now allows you to easily fetch the db.schema.table of materialized diffs. For teams that are materializing diff results and want to programmatically query those tables, having this table path easily accessible via the Datafold API makes it straightforward to automate your custom pipeline.
Happy diffing!
March Changelog: How we’re evolving cross-database diffing in Datafold
Performance improvements
Our engineering team has spent months fine-tuning and adjusting our data diffing algorithm and we’re excited to share that teams can now experience up to 10x faster cross-database diffing.
Whether you’re undergoing a migration or performing ongoing data replication, leverage Datafold’s proprietary and performant cross-database diffing algorithm to validate parity across databases faster than manual testing ever could. Not only does this save data teams valuable time, but it reduces compute load and costs on their warehouses.
Real-time diff results
One of the innovations I’m personally most excited to share is real-time diff results. For large data diffs, we understand that you can act on partial information before the entire table diff is complete.
Now, with real-time diff results, the Overview and Value Tabs will populate as Datafold finds differences. How does this impact you?
- If you start seeing real-time value-level differences that you know are wrong, you can stop a diff in its tracks, and identify and fix the problem sooner.
- Leveraging the Overview tab in Datafold, quickly understand the magnitude of differences. For many teams, we recognize that there is often an error threshold/acceptable lack of parity. With real-time diff results, find out sooner if the diff you’re running is meeting those error expectations, and stop a diff if it’s exceeding it.
No more waiting for a longer-running diff to complete. Simply start seeing differences as we identify them.
Find differences faster with representative samples
With Datafold's new Per-Column Diff Limit, you can now automatically stop a running data diff once a configurable threshold of differences has been found per column. Like all of these new cross-database diffing improvements, the goal of this feature is to enable your team to find data quality issues that arise during data reconciliation faster by providing a representative sample of your data differences, while reducing load on your databases.
In the screenshot below, we see that exactly 4 differences were found in user_id, but "at least 4,704 differences" were found in total_runtime_seconds. user_id has a number of differences below the Per-Column Diff Limit, and so we state the exact number. On the other hand, total_runtime_seconds has a number of differences greater than the Per-Column Diff Limit, so we state "at least." Note that due to our algorithm's approach, we often find significantly more differences than the limit before diffing is halted, and in that scenario, we report the value that was found, while stating that more differences may exist.
Happy diffing!
February Changelog
Here’s an overview of what’s new:
1️⃣ Downstream Impact Tab of Data Diff results
2️⃣ Datafold is now available on Azure marketplace
3️⃣ MySQL 🐬 support for cross-database diffing and data reconciliation
4️⃣ Coming soon: Replication testing 👀
And make sure to join us at our next Datafold Demo Day on February 28th! Our team of data engineering experts will walk through some of these newest product updates, and demonstrate how Datafold is elevating the data quality game.
Downstream Impact Tab
We’ve been there: You’ve opened a PR for your dbt project with some code changes you just know are going to impact your head of finance’s core reports. You don’t necessarily know how (or even why), but your domain experience (and gut) tell you to merge with extreme caution.
Now, with the Datafold Downstream Impact tab, your fear of merging (and breaking) is removed. In one singular view, understand all potentially modified downstream impacts of a code change—from your downstream dbt models to that one dashboard your CFO is refreshing every 10 minutes.
The Downstream Impact tab leverages Column-Level Lineage so that (potentially very many) table-level downstreams are purposefully not included, if the specific columns connected to those downstreams are unchanged in the PR. Talk about less noise, more signal, am I right?💡
Quickly search and sort by dependency depth, type, and name, so you never have to experience a bad dbt deploy again.
The Downstream Impact Tab will populate for any data diff in Datafold Cloud triggered by a CI job, manual data diff run, or an API call.
Azure marketplace listing
We’re excited to announce that Datafold Cloud is now available on the Azure marketplace. This enables data teams looking to automate testing for their dbt projects, migrations, and ongoing data reconciliation efforts to do so using their pre-committed Azure spend—making data quality testing more accessible than ever.
New MySQL integration
Datafold Cloud is proud to support a new integration for MySQL, so you can leverage Datafold’s fast cross-database diffing to validate parity for migrations or ongoing replication between MySQL and our 13+ other database integrations.
Coming soon: Data replication testing
Perhaps the thing I am most excited about to share with you all over the coming months: Datafold’s approach to monitoring and testing an often overlooked (but critical) part of the stack—data replication pipelines.
We know how important it is for the data you’re replicating across databases to be right. We know this data is often mission-critical—powering core analytics work and machine learning models, and guaranteeing data reliability and accessibility. We recognize that broken replication pipelines and consequential data quality issues have been a persistent, unsolved pain for data engineers.
Datafold’s solution to validating ongoing source-to-target replication is going to continue what we do best: data diffing…but pairing it with net new scheduling and alerting functionality from our end.
If your team is interested in gaining transparency into your replication efforts, feel free to email [email protected] to be included in our beta waitlist.
Happy diffing!
January 2024 Changelog
The Datafold team has kicked off the new year with some exciting new product updates. Here’s an overview of what’s new:
- 1️⃣ Azure DevOps + Bitbucket integrations
- 2️⃣ Tabular lineage view
- 3️⃣ Diff metadata now visible in diff UI
- 4️⃣ Support for Tableau Server
- 5️⃣ Coming soon: Data diff Columns tab
Azure DevOps + Bitbucket integrations
Datafold Cloud now supports code repository integrations with Azure DevOps (ADO) and Bitbucket. Similar to our GitHub and GitLab integrations, when a PR is opened in ADO or Bitbucket, Datafold will automatically add a comment providing an overview of the data diff between your branch and production tables and identify potential impact on downstream tables and data apps.
Lineage at scale: Introducing Tabular Lineage view
We get it: When your DAG contains thousands of models and downstream BI assets, it can be hard to wade through it in a graphical format. (Spaghetti lines, who?)
We’re excited to share that Datafold Cloud now supports a Tabular Lineage view, so you can filter, sort, and explore lineage in a columnar format (the way data people usually like to, well, interact with data 😂).
Diff metadata now visible in diff UI
Metadata about diffs (diff start/end time, creator, and runtime) is now visible within a diff result in Datafold Cloud. Using these new, easily accessible data points, you can immediately know who to ask questions about a specific diff or dig into diff performance.
Now supported: Tableau Server
Datafold now supports integrating Tableau Server-hosted assets in column-level lineage and within the Datafold CI impact analysis comment, so you can:
- Understand how your data works its way from source —> workbook
- Prevent breaking data changes to your core Tableau assets
Datafold's integration with Tableau also works with Tableau Cloud.
Coming soon: Data diff Columns tab
We've heard loud and clear that users want to see the information they need at a glance: one summary, no clicking around. In particular, we want you to see which columns are different (and by how much), and which are the same. All that is available now in one place: the Columns tab of a data diff's results.
You can clearly see the differences and similarities between the two versions of the table being diffed. Not overly general results; not too much detail (though you can get into the weeds in the Values tab). The Columns tab is, like the bowl of porridge Goldilocks finds on her walk through the forest, just right. This feature will be rolled out to customers over the next couple of weeks.
Please reach out if you want to be an early user, or with any feedback!
Happy diffing!
Azure, better diff UX, migrations toolkit, and more!
New product updates include:
- 1️⃣ Support for Azure deployments in Datafold Cloud
- 2️⃣ ICYMI: The 3-in-1 migrations toolkit from Datafold
- 3️⃣ Column remapping for cross-database diffs
- 4️⃣ NEW: Delete diffs and set a data retention policy
- 5️⃣ Tableau workbooks are now visible in Datafold lineage and CI impact analysis report
Azure support
Datafold Cloud now supports deployment options in Azure, so you can run your data diffs wherever you see fit. As a reminder, Datafold Cloud also supports single-tenant deployment options in Google Cloud and AWS.
The 3-in-1 product toolkit for accelerated migrations
At Datafold, we think data migrations shouldn't suck, which is why we support a 3-part product experience to plan, translate, and validate your migration with speed. Using Datafold, you can use column-level lineage to identify assets to migrate and deprecate, our SQL translator to move scripts from one SQL dialect to another, and cross-database diffing to validate migration efforts—at any scale.
Better diff UX!
Smarter diff with partially null primary keys
Previously in Datafold, composite primary keys with a null column would be identified as a null primary key. Now, you can set a composite primary key that includes a column that can sometimes be null in Datafold Cloud. Talk about a small but mighty quality of life improvement for those more complex tables!
Column remapping in cross-database diffing
If you’re diffing across databases, Datafold Cloud can now diff tables that have changed column names with user-provided mapping. For example, you can now indicate that ORG_ID in Oracle is ORGID in Snowflake so that Datafold does not interpret them as different columns.
More flexible deletion and retention policies
Now in Datafold Cloud, you can easily delete diffs and create custom retention policies for your diffs. In addition to deleting individual diffs, you can configure Datafold to automatically delete all diffs older than X days. What does this mean for you? Greater control of your data and (more importantly), keeping your legal and security teams happy 😉.
Workbooks now supported in Datafold Cloud Tableau integration
Tableau workbooks are now visible in Datafold Cloud column-level lineage and the CI impact analysis report! If your team is struggling with the noise of Sheets in lineage or the Datafold CI comment, make sure to check this out.
Datafold's 3-in-1 Data Migration Toolkit
Datafold is slinging updates to support data migrations. With cross-database diffing for data reconciliation, SQL translation, and column-level lineage, the daunting endeavor of data migration can be a success instead of over budget, delayed, and never quite complete.
Cross-Database Diffing between Legacy and New Databases
Diffing between databases is critical to ensuring consistency between old and new data. Datafold has been shipping new database connectors at a rapid pace. Critically, Datafold Cloud users can now diff between different databases at scale (we’re talking billions of rows).
SQL Translator
With Datafold’s SQL Translator, you can efficiently and accurately convert SQL from the old dialect to the new:
It’s like Google Translate, but for your SQL.
Datafold’s SQL Translator can be used to translate thousands of lines of legacy code (such as stored procedures, DDL, and DML) into the dialect of your new data system. Oh, you can also use it for quick syntax checks as you write ad hoc queries.
Putting it all together
These new capabilities add to Datafold’s existing suite of tools, including our Column-Level Lineage graph, which can be used to identify what to migrate.
Product Launch: Downstream Tableau Assets Now Accessible in Pull Request and Lineage
We’re excited to announce the new Tableau integration for Datafold Cloud that shows users the Tableau Data Sources, Sheets, and Dashboards that could be impacted by your dbt models.
These Tableau assets will be visible in the Column-Level Lineage explorer in Datafold Cloud…
…as well as right within your pull request:
So your team has complete visibility into the Tableau assets that may be changed by your code updates.
With the Tableau Integration for Datafold Cloud, users can now have a robust look at how their data travels through their stack, and prevent data quality issues from entering one of the most important tools of their business.
FAQ
What about dbt Exposures?
dbt Exposures require manual configuration, which is not scalable or automated. With Datafold Cloud’s Tableau Integration, your column-level lineage and impact analysis just works out-of-the-box.
Is this only for dbt models?
Nope — Tableau assets that are downstream of any data warehouse object will appear in Datafold Cloud Column Level Lineage.
Datafold Changelog — October 2023
The Datafold team has been hard at work improving your experiences with data diffing, Datafold Cloud, and new product innovations. Here’s an overview of what’s new:
1️⃣ Microsoft SQL Server and Oracle support in Datafold Cloud
2️⃣ Cyclic dependency identifier
3️⃣ Auto-type matching
4️⃣ New and improved ✨ Datafold docs ✨
5️⃣ E X C I T I N G new product betas 👀
🏘️ New database connectors: SQL Server & Oracle
Don’t let your data stack stand in the way of high-quality data. Leverage Datafold Cloud’s new connectors for both Microsoft SQL Server and Oracle to data diff where you need it.
🔀 Improved UX when cyclic dependencies appear in Datafold Lineage
🤚 Raise your hand if you’ve ever created a cyclic dependency 🙈? Now, when you’ve created this data modeling no-no, Datafold Cloud will alert you of the cyclical loop as well as identify the impacted dependencies in that loop. This can help your team quickly identify any bad practices or incorrectly modeled data.
✨ New and improved Datafold Docs
The Datafold docs have been given a facelift! Our new docs are easily searchable and organized by use case so you can get the most out of Datafold Cloud.
⚡ Automatic type matching
Now, if the tables being diffed have two columns with the same column name but differing types in one of the following combinations...
- int <-> string
- decimal <-> string
- int <-> decimal
...Datafold will automatically cast and compare — no more unhelpful type mismatches. This means you get genuinely useful diff results instead of a generic "type mismatch" error. Datafold is all about diffing, and we don’t want type mismatches to get in your way!
👀 Just around the corner: Exciting new product launches!
The Datafold team is excited for fall and winter. And not because of the plethora of holidays, but because of the insanely exciting new product launches that are coming to your Datafold instance very soon:
- 📈 Tableau BI integration in column-level lineage and impact analysis
- ⚔️ Cross-database diffing in Datafold Cloud for accelerated migrations and replication validation. Take a sneak peek of this new feature here.
- 🪣 Bitbucket git support
- …and more?!?!
Happy diffing!
New: Datafold Looker integration, PK inference, and improved CI printout
🐦❤️👁️ Datafold Cloud Looker Integration
ICYMI we launched the Datafold Cloud Looker Integration: bringing enhanced lineage and impact analysis to your dbt project and beyond. Using the Looker integration, you can:
- Visualize Looker assets (Explores, Views, Dashboards, and Looks) in Datafold’s column-level lineage
- See potentially impacted Looker assets from your dbt code change in the Datafold CI comment
Yes, we think this is some very cool tech (what can we say, we’re a bit biased 😂). But more importantly, we think this means you’ll stop getting those “you broke my dashboard” DMs 😉.
⚡Automatic primary key inference for incremental and snapshot models
Previously, Datafold Cloud identified primary keys from an additional YAML config or the dbt uniqueness test. Now, when you define a unique_key in your dbt model config, Datafold Cloud will automatically infer it as the primary key to be used for Datafold’s diffing. Unique keys defined in dbt can be either singular or composite keys. This is particularly useful for more complex incremental and snapshot models, where you may want to define a unique key, but not test uniqueness in dbt.
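For example, an incremental model configured like the sketch below (the model and column names are illustrative) would have its unique_key picked up as the primary key for diffing:

```sql
-- models/fct_orders.sql (illustrative example)
{{
  config(
    materialized='incremental',
    unique_key='order_id'  -- inferred by Datafold as the primary key for diffing
  )
}}

select * from {{ ref('stg_orders') }}
```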
🔊 Enhanced Datafold Cloud CI printout: Goodbye noise, hello signal
The Datafold CI comment will soon highlight which values are different between dev and prod by pulling them to the top of the comment. This will reduce alert fatigue and make it much easier to see whether your code changes will change the data (and how), or keep it the same.
Rows, columns and PKs that are not different will be grouped together under the NO DIFFERENCES dropdown.
Please note that this feature is currently being rolled out to existing customers over the next few weeks.
Downstream Looker assets in Pull Requests and Lineage
We’ve launched a Looker integration that shows Datafold Cloud users the Looker Views, Explores, Looks, and Dashboards that could be impacted by your dbt models.
These Looker assets will be visible in Column-Level Lineage in the Datafold Cloud UI …
… as well as right within your pull request:
VS Code extension, improved Datafold Cloud CI, and upcoming launches
The Datafold team has been hard at work improving your experiences with data diffing, Datafold Cloud, and new product innovations. Here’s an overview of what’s new:
1️⃣ Datafold VS Code Extension
2️⃣ Quality of life product improvements in Datafold Cloud (intuitive column remapping in CI and saying goodbye 👋 to stuck CI)
3️⃣ Some very exciting product launches on the horizon (hint: BI tool integrations in Datafold Cloud)
🚀 Datafold VS Code extension
ICYMI we launched the Datafold VS Code extension: a powerful new developer tool bringing data diffing directly to your dev environment. Use the Datafold VS Code extension to quickly run and diff dbt models in a clean GUI, and develop dbt models with confidence and speed.
In addition, by installing the Datafold VS Code extension, you’ll receive free 30-day trial access to value-level differences—a Datafold Cloud exclusive (❗) feature. Join us in the #tools-datafold channel in the dbt Community Slack for feedback and any questions about this 🙂.
☁️ Datafold Cloud improvements
Column remapping in CI comments
If your PR includes updates to column names, you can specify these updates in your git commit message using the following syntax: datafold remap old_column_name new_column_name. That way, Datafold will understand that the renamed column should be compared to the column in the production data with the original name 🙏.
By specifying column remapping in the commit message, when you rename a column Datafold will recognize that the column has been renamed, instead of thinking one column has been removed and another has been added.
In the example above, the column sub_plan is renamed to plan, and Datafold recognizes these are the same column with this commit message. This feature is particularly useful if there are changes to upstream data sources that impact many downstream models.
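As a rough sketch of the example above (the commit message text is illustrative; the key part is the datafold remap directive), the commit could look like this:

```shell
# Rename sub_plan to plan and tell Datafold to treat the two as the same column.
# git joins the two -m flags into one commit message, with the remap directive in the body.
git commit -m "Rename sub_plan to plan" -m "datafold remap sub_plan plan"
```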
Faster, leaner, and smarter Datafold in CI
Datafold is all about giving you the information you need, where and when you need it, as soon as possible. That includes getting out of the way quickly when it's not yet time to data diff. Now, when your dbt PR job does not complete for any reason, Datafold will detect that right away and cancel itself, allowing your CI checks to complete. Everyone loves faster (unstuck) CI!
👀 Coming soon - betas and upcoming launches
Keep an eye out for exciting developments on:
- 📈 Evolved lineage with Looker and Tableau integrations in Datafold Cloud. If your team is interested in seeing the Looker integration live, come join us at an upcoming Datafold Cloud Demo!
- 🔀 Cross-data warehouse diffing for accelerated database migrations and validating data replication.
- ...and more!
Happy diffing!
🆕 Announcing the Datafold VS Code Extension
We’ve launched the Datafold VS Code Extension—a new developer experience tool that integrates data quality testing, data diffing, and Datafold into your development workflow.
The VS Code extension is an enhancement of the open source data-diff product from Datafold. Using the extension, you can easily install open source data-diff, run your dbt models, and see immediate diffing results between your dev and prod environments in a clean GUI—all within your VS Code IDE.
⬇️ Install the Datafold VS Code Extension
You can install the Datafold VS Code Extension from the Extensions tab in VS Code.
💻 Data diff a dbt model using the GUI
Once you’ve followed the simple steps in our documentation to get started, you’ll be able to diff any dbt model or set of models using the simple GUI.
First, open the Datafold Extension by clicking on the Datafold bird icon on the left-hand side of your VS Code window. Then, click on any model's "play" button to run a data diff between the development and production versions of that model.
💡 Be sure to dbt build or dbt run any models that you plan to edit or diff, to ensure relevant development data models and dbt artifacts exist.
⚒️ Data diff your most recent dbt run or build
You can also use the “Datafold: Diff latest dbt run results” command in the VS Code command palette. This enables you to automatically diff a group of models that were built in the last dbt build or dbt run.
🔎 Explore value-level data diff results
By installing the Datafold VS Code extension, you’ll receive free 30-day trial access to value-level differences—a Datafold Cloud exclusive feature (❕). Click on the blue "Explore values diff" button next to the "Values" section to see and interact with value-level differences.
👁️ Data diff in real time as you develop with Watch Mode
In the settings of the Datafold VS Code extension, you can enable "Diff watch mode." With watch mode on, the Datafold VS Code Extension will automatically run diffs after each dbt invocation that changes the run_results.json of your dbt project. Turn on this setting if you want diffs to be automatically run between changed dbt models.
🎥 Demo video
Watch Datafold Solutions Engineer Sung Won Chung install and use the Datafold VS Code extension!
📖 Resources
For additional resources, please check out the following:
- Detailed docs on the Datafold VS Code functionality
- Blog post on why we built this extension, and where we see it going in the future
Happy diffing!
Skip diffs, advanced filters, and a beta Looker integration
We’re excited to share some new product updates that give you greater control over what gets diffed, let you interact with diffs and their results using advanced filtering, and help you identify how code changes impact your BI tools.
Here’s an overview of what’s new:
- Skip diffs with commit messages
- More powerful values filtering
- Datafold’s new Looker integration pre-release
⏩ Skip diff functionality with commit message
We get it—not every commit needs a diff! Now, you can skip the diff for a commit by adding the string (datafold-skip-ci) anywhere in your commit message; that commit will then not trigger a Datafold CI run.
This feature is particularly useful if you’re adding in hotfix commits, committing many commits back-to-back in a short timeframe, or looking to reduce compute costs from unnecessary diff runs.
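For instance, a hotfix commit that should not kick off a Datafold CI run might look like this (the message text is illustrative; only the datafold-skip-ci string matters):

```shell
# The datafold-skip-ci string anywhere in the message prevents a Datafold CI run
git commit -m "hotfix: bump package version (datafold-skip-ci)"
```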
➖ Negative filtering in search
Never has filtering been more intuitive! We recently added functionality for negative search in the Lineage data explorer. Using negative search, by adding a dash (-) before the term to exclude, you can more easily filter on specific patterns of schemas (compared to deselecting those that don’t meet your criteria). We’ve additionally added support for * and ? wildcards, where * matches any number of characters and ? matches any single character.
Examples:
- ORG_ACTIVITY -DEV will match any asset name that contains ORG_ACTIVITY and does not include the string DEV
- RUDDERSTACK*MARKETING will match any asset name that contains RUDDERSTACK followed by MARKETING at any point in the string
- PR_???_ will match any asset name that contains PR_ followed by any 3 characters and _ . For example, PR_???_ will return PR_123_ and exclude PR_12_ from search results
💥 More powerful diffs and values filtering
We’ve added new filtering capabilities in your diffs log and values-level diffs to make searching for diffs (and potential errors) faster and easier.
Quickly filter for diffs with differences
Using the Result filter in your log of Data Diffs, simply filter on Different to find only diffs where there were differences.
Filter columns in the UX
For Data Diff results with many differing columns, you’re now able to search and filter columns at scale—no more never-ending scrolling to the right to find the column you need!
To use, open the “Show Filters” menu to select and sort across your diff results at scale.
Filter columns by value
For easier value-level diff navigation, you can filter on specific column-values. Simply click on the gray filter symbol to the right of a column name and input the value you want to filter for. For example, look for a diff based on a primary key that’s giving you an issue!
👭 Join our Looker integration beta!
We’re very excited to share a pre-release of Datafold's new Looker integration! If your business uses Looker for reporting, you can enable Looker Views in Datafold’s lineage explorer and see potentially impacted Looker Views in Datafold’s CI comment—bringing impact analysis beyond your dbt project.
If you have any interest in trying out the new Datafold Cloud Looker integration early, please sign-up here.
👀 Coming soon
Keep an eye out for exciting developments on:
- 💻 Enhanced developer experience with our ✨new✨ VSCode Extension—click here if you would like to be a beta tester 🧑🔬
- 🔀 Cross-data warehouse diffing (pssst if you're interested in trying the alpha for this, please respond to the product newsletter email or email [email protected])
- 📈 More BI tool integrations
- ...and more!
Data Diff Management + Version Control Integration
We’ve increased the amount of context available from your Github & Gitlab integrations to the Datafold user interface so you can more clearly understand the relationship between your diffs and specific commits and pull requests.
Filter Data Diffs by the pull request creator
Easily filter data diffs by the GitHub or GitLab username of the pull request creator, and trace each one back to the specific pull request or the commit that triggered it.
Data Diffs grouped by pull request
This makes it easier to navigate between your pull request, commit history and the associated diffs, tracking changes and validation over time.
Grouped diff deletion and cancellation
You can now select a group of diffs and click the Delete Data Diff or Cancel Data Diff options in the top right section of the page.
Streamlined Data Diff Results
We’ve shortened the feedback loop in our results pages to rapidly show more relevant information. For example, you’ll now see column-level metrics earlier in the CI/CD and command-line data-diff results. We’ll also show downstream app dependencies for each Data Diff within the UI, allowing you to quickly get the appropriate lineage for a given downstream dependency.
data-diff -- dbt
We have released the dbt integration for our open-source data-diff tool. Data-diff helps to quantify the difference between any two tables in your database. You can now see the data impact of dbt code changes directly from your command line interface. No more ad-hoc SQL queries or aimlessly clicking through thousands of rows in spreadsheet dumps.
If you use dbt with Snowflake, BigQuery, Redshift, Postgres, Databricks, or DuckDB, try it out and share your feedback. It only takes a couple of settings and a one-line command to see your data diffs in development: `dbt run --select <model(s)> && data-diff --dbt`
This shows the state of your data before and after your proposed code change:
Use this dbt + data-diff integration to quickly ensure your code changes have the intended effect before opening up a pull request.
We’re excited to hear your feedback via the project’s GitHub project page or the #tools-datafold channel in the dbt Slack community.
Diffing Hightouch models in CI/CD
We’ve all been there - accidentally breaking downstream dependencies that don’t live in your warehouse. How could you possibly have known what another team’s pesky filter was going to be? Well, your business intelligence, marketing & operations teams can sleep more soundly knowing that your data team has full visibility into these types of breaking changes, and can prevent them before they happen.
Ship faster and more confidently after Datafold compiles and materializes Hightouch models based on each branch of a change, and then diffs them to flag any potential changes in the query output. We see teams moving towards faster, more actionable data pipelines every day, so confidence in every change is vital to keeping your team humming along.
Diffing Hightouch models will show up alongside your standard diff results in Github comments and within the Datafold app.
To get started, first configure your Hightouch account within Datafold. Then, since this feature is still in beta, opt-in here to enable it!
CI Jobs Management
It’s now even easier to manage CI runs within Datafold, and we’ve added several navigation improvements. The goal is to make managing your CI jobs at scale simpler.
- First, find and filter for your CI job quickly via the Datafold Jobs tab, which is now visible to all users.
- Status Page - We’ve added a more detailed CI job status page with a breakdown of individual steps and results
- Cancel + Rerun CI Jobs - You can now easily cancel running jobs, or rerun jobs within the Datafold user interface.
All users now have access to the Jobs user interface and can open a CI Job results page to view the individual Data Diffs associated with it. Each CI Job results page contains the status of all data diffs and intermediate steps, and gives you the ability to cancel an active CI Job run. We’re excited to hear your feedback.
Clearer Diff Sampling Logic
Sampling diff results is helpful for speedy and efficient checks of extremely large data sets. However, sometimes you need to ensure 100% test coverage of every single row, even for large data sets. To assist, we’ve added more clarity to when and where data will be sampled during the in-app diff creation workflow.
You can now explicitly disable sampling for a diff. For users running data migrations, where running a data diff against the entirety of a dataset is required for user acceptance testing, this is now clearer and easier.
Introducing Slim Diff in CI/CD
- Slim Diff helps teams prioritize business-critical models in CI/CD workflows - it gives teams control over exactly which models to diff on each pull request. When enabled, Slim Diff runs data diffs for only specified models based on dbt metadata, and skips models that aren’t explicitly tagged or are excluded from data diffing.
Column Remapping in Data Diff creation flow
- Quickly remap columns within the Data Diff UI or API creation flow for known column name changes to ensure all columns are compared correctly.
Schema Comparison Sorting
- Faster schema comparisons to see what changed inline, especially when column order has changed.
Cancel In-Progress Data Diffs
- Now you can quickly cancel currently running diffs in both the Data Diff results, as well as the administrator interface. As always, you can cancel all diffs within CI run as before from the same administrator interface.
Globally exclude tables from CI/CD diffs
- Use your dbt metadata to exclude particular folders or models from being tested against in CI/CD workflows. Use cases vary from excluding sensitive tables to unsupported downstream usages. Your data team can configure Datafold to be aligned with their priorities.
Lightning-fast in-database comparisons for the data-diff library + DuckDB support
- Have you ever wanted to quickly and easily get a diff comparison of two tables in your dbt development workflow? Now you can! Our wonderful Solutions Engineers spun up a tutorial on how to use our open-source data-diff library to find potential bugs that unit testing or monitoring would have missed.
- Additionally, our data-diff community contributors have continued to improve the product - including adding DuckDB support. We appreciate the support @jardayn!
- The latest release of Datafold’s free, open-source data-diff library is optimized for even faster Data Diffs within the same database. Compare any two tables within a warehouse and receive a detailed breakdown of schema, row and column differences.
Improved Diff Results Sorting and Filtering
- We’ve added improved sorting and filtering interfaces to the Data Diffs analysis workflow, making it easy to find specific rows within your diff results. For example, if you’re trying to confirm that the values for a particular primary key in your sea of modified data changed exactly as expected, filter for the specific primary key or changed column value you’re looking for.
CSV Export
- You can now export CSVs of Data Diff results and primary keys that are exclusive to one of the datasets in your comparison! This is perfect for debugging and reconciling missing data between two data sets, and sharing that information across your organization.
- Don’t forget you can always materialize your Data Diff results to a table in your database and natively join your results to your source data, or do a deeper analysis on those differences. Enabling this setting in the Data Diff creation flow via our API or the Datafold app will create a table in your temporary schema with matched rows, values, and flags indicating which columns differ.
"Materialize diff results to table" is an option within the Data Diff creation workflow in both the Datafold App and our REST API.
Lineage Usage Metrics
- Column and Table-level query metrics in Lineage - right-click on any table or column reference within the Datafold Lineage UI to view how many times a particular user account has read or written to a particular table, allowing you to identify commonly or infrequently used data points.
- Popularity metrics now include all cumulative downstream usage of column or table, showing the total downstream reads for a particular client.
- Popularity Filters - Filter lineage nodes by their relative popularity compared to all indexed tables in Lineage
Data Diff Improvements
- Cancel CI Job button via the Datafold Jobs UI - Admin users are now able to cancel CI/CD diff tasks via the Jobs UI in Admin Settings.
- Copy Data Diff Configuration JSON to Clipboard - the info button within the diff results page now contains a button to copy the JSON payload required to create a diff via the REST API.
- Set diff time travel logic at the dbt-model level. For example, if your dev and production tables have known differences due to timing of incremental source data, you can add a time-travel configuration to ignore the most recent data, preventing false positives in CI/CD. Learn more about time travel here and more about dbt metadata configuration here.
Other Improvements
- Catalog search improvements to weight exact-text matches more aggressively, and hide less relevant results.
- Datafold CI/CD integration now populates a list of deleted dbt models within the pull request comments.
- Improved lineage support for dbt-based Hightouch models
Popularity counters in Lineage
To help you understand how frequently the assets in your warehouse are used, Lineage now displays an absolute access count per table and column for the last 7 days. To help you interpret that information, a popularity rating from 0 to 4 is assigned, indicating how popular a particular database object is relative to others.
Other changes
- For on-premise deployments, we now support data diff in CI for GitHub on-premise servers. To use your own private GitHub server instead of the cloud version (https://github.com), set the `GITHUB_SERVER` environment variable to your GitHub on-prem URL (see the sketch after this list).
- In the app, the BI Settings section has been renamed to “Data Apps” and now includes both Mode and Hightouch integrations.
- Performance improvements to lineage.
- In the Lineage UI, Hightouch models and syncs now link to Hightouch App. This can be configured using the “workspace URL" field in the Hightouch integration settings.
- Visual improvements to data source names and logos in Catalog and Lineage.
- Updated display of long names of tables in Lineage.
- Popularity is now a general filter in Catalog. It can be applied to both tables and columns.
- Data Source and Data App source filters in Catalog are now merged for better search experience.
- Users can now add, remove, and query tags for Mode dashboards, Hightouch models, and Hightouch syncs using GraphQL API.
- Added usage info for tables and columns to GraphQL API.
- CI configurations can now be paused, preventing them from running checks on pull requests.
- Added support for BigQuery’s bignumeric and bigdecimal data types.
- The data source mapper field in the Data Apps create/edit form is now validated after all the data sources are mapped.
- In the Data App settings, we’ve added direct links to our documentation.
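As a quick sketch of the on-premise GitHub configuration mentioned above (the URL is a placeholder for your own server):

```shell
# Point Datafold CI at a self-hosted GitHub server instead of https://github.com
export GITHUB_SERVER=https://github.mycompany.internal
```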
Bug fixes
- In some cases, data diffs were not canceled after CI run cancellation. These diffs were stuck in a WAITING status forever.
Multidimensional Alerts (Beta)
Users can use `GROUP BY` in alert queries to dynamically produce several time series at once. Each dimension is named after the values of the dimensional/categorical field(s) in the `GROUP BY`; its thresholds and anomaly detection can be configured separately. New time series will appear (and disappear) over time according to changes in the data, without the need to modify a plethora of alerts with `WHERE` filters.
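As a sketch (the table and column names below are illustrative, not from Datafold), a single alert query like this would produce one monitored time series per country, each with its own thresholds and anomaly detection:

```sql
-- One time series per country appears (or disappears) as the data changes
select
    order_date,          -- time dimension
    country,             -- categorical field used by GROUP BY to create dimensions
    count(*) as orders   -- metric tracked per dimension
from analytics.orders
group by order_date, country
```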
This feature is currently in Beta and is available upon request — please reach out to [email protected] to enable it for your organization.
Datafold <> Hightouch Integration
Hightouch models and syncs are now discoverable through the Datafold Catalog and visible in Datafold’s Column-Level Lineage - making it possible to trace data from source to activation.
This feature is currently available upon request — please reach out to [email protected] to enable it for your organization.
See downstream data applications in PR Comments
Datafold now shows downstream data applications, e.g. Mode reports and Hightouch syncs, that might be affected by a code change.
Data Diff results materialization
Users can now save Data Diff results in their databases for further analysis. Current support is limited to PK duplicates, exclusive PKs, and all value level differences.
Other changes
- Significantly improved CI-based Data Diff performance for large warehouses with many tables, schemas, etc.
- Expandable metric graphs to make comparison more convenient.
- For On-Premises Implementations - If the environment variable `DATAFOLD_AUTO_VERIFY_SAML_USERS` is set to "true", then users created during SAML sign-up will not have to verify their emails.
- Better display for values match indicator in Data Diff -> Values tab.
- Reformatted long alert names in the filter popup for readability.
Bug fixes
- Resolved an issue where the Datafold SDK failed to perform a primary key check for manifest.json if some tables in the manifest had not yet been created in the database.
- Jobs requests failed when filters were cleared.
Databricks support
You can now add Databricks as a data source, with full support for Data Diff, table profiling, and column-level lineage.
Other changes
- Data Diff sampling thresholds are no longer limited to hardcoded defaults and can now be configured from the UI.
- We updated the Jobs page to make connection types, table names, and runtimes easier to read.
Bug fixes
- Slack and email alert notifications were not delivered for some customers between 2022-05-31 18:00 UTC and 2022-06-07 11:00 UTC (SaaS)
- Profile histograms and completeness info did not render immediately on load.
- Job Source filter did not contain all the possible values that our API can return.
- “Created time” and “last updated time” were not displayed in the list of Jobs.
- Incorrect status in GitLab CI pipelines: the Datafold App will no longer block a merge if something is wrong with the Datafold App itself.
Lineage UI filters
Navigating large lineage graphs is now easier with filters that help cut out the noise. Data source/database/schema filters allow you to control the amount of information displayed.
User group mapping between Datafold and SAML Identity Providers
Organizations using a SAML Identity Provider (Okta, Duo, and others) to authenticate users to Datafold via Single Sign-On can now set up a mapping between SAML and Datafold user groups. Users will be automatically assigned to the desired Datafold groups according to the pre-configured mapping when using SAML login.
This feature is available on request — please get in touch with Datafold to enable it for your organization.
Other changes
- Added a special method to our SDK to check the correctness of dbt artifacts submitted to Datafold when using the dbt Core integration. Now Data Diff can finish even if something is wrong with uploading dbt artifacts. See the documentation for details.
- Now Datafold shows Slack users/groups with the conventional @-form, like in the Slack App.
- SAML validation & configuration errors are now exposed to users so that they can debug their setup.
Bug fixes
- Sometimes the job status was displayed as `notAvailable`.
- BI reports with special characters in their names (slashes, hashes, etc.) were not displayed or routed correctly.
- When a BI report's preview failed to download, the loading indicator was displayed forever.
- Multi-word search requests were squashed, omitting spaces.
- Inviting a user who was already in Datafold caused an error with an unclear message. Now it says explicitly that the problem is that the user has already been invited.
Data Diff sampling for small tables disabled by default
To avoid unnecessary overhead, Data Diff sampling is now disabled for smaller tables. At this point the table-size thresholds are hardcoded defaults; a configuration UI is coming. See the documentation for more details.
Other changes
- Alert query columns are automatically classified into time dimension and metric columns; there is no longer a need to put the time column first.
- Datafold no longer uses labels on GitLab to track the status of the Data Diff process; the status can now be tracked via the CI pipelines functionality.
Bug fixes
- Issue with include and exclude columns in diffs
- Off-chart dependencies of the in-focus table in Lineage are now displayed (and act) correctly with "Show more" → "Change direction of Lineage"
- The Settings menu item in the Admin section was sometimes not rendered correctly
- Catalog search by one- and two-letter words did not work
- Rows with NULL primary keys were always filtered out during a data diff if sampling had been enabled
Data Diff filters can be configured in the dbt model YAML
Now you can configure Data Diff filter defaults in dbt model YAML. Filtering can be used to force Data Diff to compare only a subset of data, e.g. you may want to compare just the latest week to save DWH resources and reduce diff execution time. See the documentation for details.
Other changes
- Selecting a column and its connected nodes in Lineage is now followed by an indicator that also allows you to exit the selected-path mode. Clicking on empty space to exit is deprecated.
- Sections of GitHub/GitLab printouts are now folded to save screen space. They can be easily unfolded to check verbose diff information.
- Show actual Slack error codes on test notifications, so that users can debug their Slack-Datafold integration.
- Datafold now sends a confirmation email when SAML users are auto-created.
- Lineage now shows all columns of a table that exist in the database, not only the ones with connections detected by Lineage.
- Improvement to the autocomplete feature in Data Diff.
Bug fixes
- The API key was not copied into the clipboard with the built-in copy tool
- Cell data in the Data Diff Sampling tab was not copied from the popover
- Sometimes NaN appeared instead of alert weekly estimates.
- Disabled users logging in through OAuth no longer raise an error.
You can now receive Alert notifications at arbitrary webhooks with arbitrary payloads (including but not limited to JSON) — in addition to Slack & email notifications. See the documentation for details.
This feature is available only on request — please contact Datafold to enable it for your organization.
For API-first users, all API errors from all API endpoints are now unified per RFC 7807 with the same structured JSON payload, and the 4xx HTTP status codes are normalized for most cases. This should simplify parsing error messages caused, for example, by invalid input or incompatible configuration. UI error messages will also be more descriptive in some cases where they previously were not.
Other changes
- A new API endpoint `/api/v1/dbt/check_artifacts/{ci_id}` to check for dbt artifacts after uploading. This endpoint might be triggered during a CI process, for example, in GitHub Actions or GitLab CI, to help Datafold understand the status of downstream tasks (see the sketch after this list).
- Improved performance of dataset suggestions in Data Diff, now search-based.
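As a rough sketch of what this looks like from a CI script: only the endpoint path comes from this changelog; the base URL, auth header format, and response handling are assumptions, so consult the API documentation for the specifics.

```python
# Sketch of calling the artifact-check endpoint and handling a unified RFC-7807 error.
import requests

API_KEY = "..."                          # your Datafold API key
BASE_URL = "https://app.datafold.com"    # adjust for dedicated/on-prem deployments
ci_id = 123                              # hypothetical CI configuration id

resp = requests.get(
    f"{BASE_URL}/api/v1/dbt/check_artifacts/{ci_id}",
    headers={"Authorization": f"Key {API_KEY}"},  # assumed auth header format
)

if resp.ok:
    print("Artifacts status:", resp.json())
else:
    # Per RFC-7807, errors share one structured JSON shape across endpoints.
    problem = resp.json()
    print(f"{problem.get('status')} {problem.get('title')}: {problem.get('detail')}")
```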
Bug fixes
- Lineage off-chart dependencies for upstream nodes not displayed
- Snowflake table/column casing issues are resolved
- Special characters are now properly handled in data source names
- Table profiling will not be done for disabled data sources
- Lineage column selection dropped after table expansion
- Jobs UI now shows main jobs instead of result sub-jobs for profiling and data diffs
- Off-chart edge switches lineage direction for primary table
- Redirect to lineage from profile was sometimes broken
Refactored navigation design
Other changes
- Improved formatting of integers for column profiles in Data Diff
- The column list, with descriptions and tags, is now displayed in Profile even when profiling is disabled
- Added excludes/includes support to GraphQL search endpoint
Bug fixes
- Fix: lineage not expanding for the second time
- Fix: last run filter in search showing numbers instead of days/weeks
- Fix: expanding lineage showing incomplete list of tables
- Fix: incorrect sorting in a primary key block in the Data Diff UI
- Fix: ability to navigate to data source creation dialog with non-confirmed e-mail
SAML
Organizations can now use any SAML Identity Provider to authenticate users to Datafold via Single Sign-On. This includes Google, Okta, Duo, and many others, including private/corporate identity providers.
Other changes
- During CI runs, data diff jobs will automatically select a created_at or updated_at column with an appropriate timestamp type as the time dimension
- Catalog search has been improved in both performance and result ranking
- Tags automatically created during dbt processing are periodically removed once they have been superseded
- A custom database can be specified for Lineage metadata in Snowflake sources
Bug fixes
- Masked fields in Snowflake data sources could cause errors when materializing temporary tables
- Disabled users could not be re-enabled
- Posting labels to Gitlab triggered notifications when there were no changes
- Table profiling failing for views in PostgreSQL data sources
New Lineage UI
The lineage UI was updated to improve the performance for large graphs and to make exploring dependencies more intuitive. Among other changes, the view now distinguishes between upstream and downstream graph directions, and filter settings have moved to the top to provide a larger area for the lineage canvas.
Improved Slack alert messages
To make anomaly notifications more actionable, they now include the alert name and the actual value, and provide more context about the anomaly that occurred.
Reduced verbosity for new tables in the Data Diff CI output
When new tables are created in a PR, their block now shows only the number of rows and columns, plus a link to the table profile.
Other changes
- Tags automatically created from ETL are now cleaned up after their initial use to reduce tag clutter
Bug fixes
- BI dashboards stopped displaying in the catalog
- Added missing icons of BI data sources
- Lineage paging stopped loading off-chart dependencies
- Github refresh button didn’t work correctly
- dbt metadata synchronization for dbt older than 1.0.0 in combination with Snowflake didn’t work correctly
Fine-grained control of what data assets show up in Datafold
There is such a thing as too much data observability. To help you separate signal from noise and see only the tables that actually matter, we added fine-grained settings that let you define which databases, schemas, and tables should show up in Datafold Catalog and Lineage and which should be hidden (e.g. dev/temp tables). Filtered-out data assets can still be found by their full name (e.g. “db.schema.table”).
Alert subscriptions for Slack user groups
Slack user groups can now be subscribed to alerts — e.g. all members of team X, on-call engineers, incident commanders. The special handles @channel & @here can also be notified in case of alerts — for all or currently online members of a channel, respectively.
Pausing data source in the UI
You can now temporarily disable or pause a data source in the UI.
Other changes
- Subscribed users will be notified in case an alert has an execution error (e.g. database permission/connection failure) — not only on actual anomalies
- Improved alert texts in Slack
- Dramatic speedup of schema download from Snowflake
- For Data Diff in CI, unchanged tables are grouped at the top of the report
- For manually created Data Diffs, the primary key case is automatically inferred
- Data diffs on Snowflake are now running much, much faster
Bug fixes
- Fix: Notifications were sent to deleted integrations/destinations for some time after the deletion. No more
- Fix: Slack App integrations were sometimes not showing users & channels if reinstalled from Slack, not from Datafold
- Fix: Plain CI configuration could not be saved/edited when the template variables section was empty.
- Fix: Setting update time for Alerts
- Fix: Proper DB types mapping for the new Snowflake schema downloader
- Fix: non-existing Slack users are filtered from Alerts
- Fix: A lot of upstream deps take too much space in the layout. Now we're showing the first 3, and the rest are available in Lineage UI
- Fix: Multiple tables in a CI diff were too large for a single comment post. The tables are now paginated across multiple comments
- Fix: Hours jump in Alerts time picker
Data Diff can now compare VARIANT type in Snowflake
Other changes
- Added the ability to pause a data source in the API. When a data source is paused, all its data is retained in the system but schema indexing, profiling, and lineage processing are disabled
- Improved error reporting for Redshift data sources when Datafold does not have permissions to access the table
- Lineage speed improvements
Bug fixes
- Fixed a bug where spaces in Data Diff values tab were missing
- Fixed an issue where a GitHub integration didn't show an error message when it could not be deleted
- Fixed a bug where the user invite link for organizations that have Okta enabled did not work
- Fixed a bug where BI reports could appear orphaned, not having any links to tables
- Fixed a bug where a CI run could fail if the dbt manifest didn’t contain the raw relation name
- Fixed a bug where the CI reported booleans instead of numbers for the number of mismatched columns
- Fixed a bug in CI where, when a table has no differences, the link to the table profile malfunctioned
- Fixed testing Github repository connections
- Fixed Slack integrations that could not be deleted while in use by alerts. Deleting the integration now unsubscribes all related notification methods from those alerts
Allow CI to continue if Data Diff fails
When you integrate Data Diff into the CI flow, you can control whether an error during Data Diff processing causes the CI flow to fail or continue. This allows you to configure Datafold to be non-blocking in your CI, which can be helpful when first introducing Data Diff into your development process.
Support for key-pair authentication for Snowflake
In our effort to support the most secure practices possible, we’ve added the ability to configure a Snowflake data source to use key-pair authentication. This is more secure than password authentication alone. See Datafold’s Snowflake documentation for details.
Other changes
- Visually collapse Data Diff reports if no changes are detected to save users time
- Optimized schema fetching during a data diff to reduce the runtime of a single diff, as well as the load on the data warehouse
- Irrelevant diff views are now hidden if the primary key was not specified
- The “time dimension” field in the Data Diff view now suggests only date/time columns
Bug fixes
- Integrations could not be deleted if they were used in any alerts
- Minor rendering issue with Datafold logo on the login page
- In the Data Diff view, each of the Dataset text entry fields had its input blocked while its loading indicator was active
Data Diffs without primary keys
Now you can run data diffs without specifying primary keys to compare table schemas and column profiles. Specifying primary keys is required for value-level comparison.
Other changes
- GitLab CI integrations now respect the file ignore lists (previously, it was supported only for GitHub)
- Improved filters autocomplete performance
Bug fixes
- Alert deletion could sometimes be slow or time out
- An unnecessary expand icon in the data source tree filter is not shown anymore
- UI could break if you had more than 500 tags in the organization
Data Diff improvements
Sum and Average diff metrics
Data Diff now also compares sums and averages for numerical columns, which can be helpful for analyzing changes in distributions.
Improved handling of long values
When browsing value-level diffs, overflowing values can be explored and compared by hovering over them. Long values can now be copied to the clipboard for further analysis.
Ignoring certain files in Data Diff CI
A new setting for CI integrations allows users to selectively ignore files modified in a PR and skip running Datafold for irrelevant changes. Files can be excluded, re-included, and re-excluded again, allowing complex patterns for cases like “only run data diffs if any dbt files have changed, except for the .txt and .md files in that folder”.
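As a rough sketch of that ordering logic (the gitignore-style `!` negation and glob syntax here are illustrative; see the CI integration documentation for the exact pattern syntax Datafold accepts):

```python
import fnmatch

# Later patterns override earlier ones; "!" re-includes a previously excluded path.
# Example intent: only run diffs when dbt model files change, except for
# documentation-only edits (.txt/.md) in that folder.
patterns = [
    "*",             # exclude everything by default
    "!models/*",     # ...but re-include dbt model files
    "models/*.txt",  # ...then exclude docs-only changes again
    "models/*.md",
]

def is_ignored(path: str) -> bool:
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        if fnmatch.fnmatch(path, pattern.lstrip("!")):
            ignored = not negated
    return ignored

for changed_file in ["models/stg_orders.sql", "models/README.md", "dashboards/report.json"]:
    print(changed_file, "->", "skipped" if is_ignored(changed_file) else "triggers a diff")
```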
Lineage Improvements
Original SQL queries
You can now see the SQL query that was used to create or update a table, or to refresh a BI report, in both the Datafold Catalog and Lineage views.
BI report filtering
BI reports in Lineage can now be filtered by popularity and freshness.
Mode dashboard previews
You can now see a preview screenshot for any Mode report on the Datafold Lineage graph.
Other changes
- Timepicker in Alerts schedule now has a correct “Now” button that converts current time to UTC using the time zone from the browser
- Now you can use Cmd/Ctrl + Click to open a data diff or an alert in a new tab
- You can now see “Last run datetime” in the list of alerts.
Bug fixes
- SQL queries are now visible again in Profile and Lineage for tables
- Multiple lineage UI improvements
New Datafold Slack App and alert subscriptions
Adding Slack channel destinations is easier with the new Slack App. Users can subscribe to alerts and get mentioned in the designated channels, allowing for more targeted alerting and collaborative incident resolution. Documentation is available here.
Single sign-on through Okta
Single sign-on through Okta is now available for Datafold Cloud.
Datafold <> Mode Integration now in beta
Mode reports are now discoverable through Datafold Catalog and appear in Lineage which enables tracing data flows on a field level all the way to Mode reports and dashboards. Let us know if you would like to enable it for your account.
Other changes
- Fix for faux-off-chart-deps in Lineage
- Added a UTC notation to Last Run in Catalog results
- Row counts in Diff now take time travel specifiers into account
- Improved refreshes for the GitHub app to use the app authentication token instead of the user-to-server token
- Added the database name to all Redshift and PostgreSQL tables. This enables the dbt integration for those databases, and lineage for Redshift when cross-database queries are used in the ETL process.
Diffing for advanced data types
Data Diff can now compare Snowflake's VARIANT and ARRAY types. Profiling information won't be generated for those columns, but they will show up in overall statistics, and in the Values tab. Previously VARIANT and ARRAY types were ignored during comparisons.
Improved diff sampling
When comparing tables (for example, Staging and Prod versions of your dbt model), Data Diff provides a sample of divergent values for every column that doesn’t fully match between tables. Previously Diff would select ~15 rows for every column that had differences. If there were just a few such columns, the overall sample size could be quite small. The algorithm now selects ~1,000 rows regardless of the number of columns that are different.
Bug fixes
- Fixed an issue where the “$” character was not accepted in a password
- Improved integer formatting throughout the app
- Improved performance in the Catalog search input
- Fixed 5+ smaller UI issues
Mode reports in Lineage & Catalog
Mode is now available as an integration in Datafold, in alpha. Once enabled, Datafold will index all reports in your Mode account to make them available in Datafold Catalog search and Lineage.
You can now discover relevant Mode reports alongside datasets in the same search experience. It’s also possible to filter Mode reports based on popularity and freshness.
You can trace field-level data lineage to Mode reports in the Datafold Lineage view to see which tables and columns feed which report, making it easy to perform refactorings and troubleshoot issues.
New Jobs UI
With the new Jobs UI, you can check which tasks are currently running in your Datafold account, easily troubleshoot integrations such as Diff in CI, and audit the use of Datafold.
Bug fixes
- Fixed displaying of Alert schedules when an hourly interval is selected.
Automatic inference of primary keys for dbt models + CLI tool to check primary key settings for Data Diff
For Data Diff to work in CI, it needs to know the primary key for each table it analyzes. Datafold provides a few options for defining primary keys in the dbt model:
- Define it as meta.primary_key in dbt YAML
- Define it as a table or column-level tag in dbt YAML
- Automatically infer primary keys based on uniqueness tests
To help you ensure that Data Diff can look up or infer primary keys for all tables in your dbt project, we added the check-primary-keys command to the Datafold CLI.
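For the first option, a dbt schema file might look like the sketch below. `meta.primary_key` is the key named above; the model, column, and surrounding structure are otherwise illustrative:

```python
import yaml  # requires PyYAML

# Illustrative schema.yml declaring the primary key via meta.primary_key.
schema_yml = """
models:
  - name: dim_orgs
    meta:
      primary_key: org_id
    columns:
      - name: org_id
        description: Unique organization identifier
"""

model = yaml.safe_load(schema_yml)["models"][0]
print(model["name"], "->", model["meta"]["primary_key"])  # dim_orgs -> org_id
```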
Quickly navigate to columns using Go To search bar in Diff UI
Now you can quickly jump to any column in the Diff Values tab, which can be helpful when diffing especially wide tables.
Run Data Diffs only with the Datafold label
There are situations where you don't want to run Data Diff in your CI unconditionally. Running it on every change is the recommended way to make sure no unintended changes slip through. As with unit and integration tests in CI, you don't want to disable them, because something could break without you knowing it.
Still, when you're first integrating Data Diff, you sometimes want to try it on a select number of changes. This is why we added a new option to the CI integration.
When this box is checked, a Data Diff will not start right away when a new pull request is opened; the diff starts only after the Datafold label is set in GitHub/GitLab.
Improvements for Postgres data sources
Postgres has a feature where the currently logged-in user can switch to acquire only the privileges of a selected role. This is done with the `SET ROLE` command. `SET ROLE` effectively drops all the privileges assigned directly to the session user and to the other roles it is a member of, leaving only the privileges available to the named role. This is now implemented for both PostgreSQL and PostgreSQL Aurora as an extra optional parameter in the data source configuration.
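To illustrate the underlying Postgres mechanism (this is plain psycopg2 against Postgres, not Datafold's internal code; the DSN and role name are made up):

```python
import psycopg2

# Connect as the login user, then restrict the session to a limited role.
conn = psycopg2.connect("dbname=analytics user=datafold_service")  # illustrative DSN

with conn, conn.cursor() as cur:
    cur.execute("SET ROLE reporting_readonly")  # drop to the named role's privileges
    # This query now succeeds or fails based on reporting_readonly's grants,
    # not on the grants of the login user.
    cur.execute("SELECT count(*) FROM public.orders")
    print(cur.fetchone()[0])
```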
For Aurora PostgreSQL data sources, we've also added an optional keep-alive setting that lets you turn on keep-alives for very long-running queries. The parameter is specified in seconds; leave it empty to disable keep-alives.
Tooltips added to data source fields to avoid confusion
To provide more context for the options available in the data source configuration screen, we have added tooltips. We hope this makes configuration a little easier without having to switch back and forth to our documentation pages.
Optimization for GraphQL
Our new GraphQL API is also becoming more mature. We applied a performance optimization for loading database and schema info: previously the tables had to be loaded first, but they can now be queried separately.
Bug fixes
We have also shipped several bug fixes:
- Fixes bug where a CI configuration could not be created without the require_label set
- Fixes selected suggestion id flashing in search autocomplete
- Fixes page size navigation in the Data Diff's Values tab
- Fixes error that was thrown when empty sampling results arrived in the Table Profile sample tab
- Fixes the frontend flooded with 500 errors when alert estimates encountered an error
- Fixes sampling table not being re-rendered when new results come in after reload
Improved messaging on the GitHub integration
This update is based on customer feedback asking for more meaningful feedback during the Data Diff process. We added more information to the GitHub statuses shown while the Data Diff runs.
For example, we include the git hash of the job that Datafold is waiting for. After the job starts, the status shows a link to the actual job.
This can be either the job building the pull request or the one building the main branch. It helps you understand what's going on while the Data Diff runs and what it is waiting for.
datafold-sdk upload-and-wait
The datafold-sdk is used to synchronize information into Datafold after a dbt run. Datafold extracts the table and column information and uses it for Data Diff when running on a pull request.
It is common practice to clean up tables after a pull-request run has finished, but Datafold might still need those tables to run the Data Diff. That is why we added the upload-and-wait command: instead of starting the Data Diff asynchronously, it blocks until the Data Diff completes, ensuring you don't drop the tables before the diff has finished.
Catalog support for dbt sources and seeds
Datafold works seamlessly with dbt. With the latest version of Datafold, we support synchronizing the metadata from dbt’s sources and seeds. Sources are tables that are external to dbt, often tables in the landing zone. When declaring a source, you can annotate it with additional information, which is also synchronized to Datafold.
Smart scheduler
New Smart Scheduler service to manage data source concurrency when scheduling table profiling tasks.
We've implemented a new scheduler that we call the smart scheduler. Certain tasks can impose a noticeable load on the data warehouse; the smart scheduler gives us more control over which tasks are running, resulting in a more predictable load. We built this together with our Redshift users, because Redshift doesn't handle high concurrency well, and it provides a way to run tasks more gently.
Descriptive errors on profiling errors
A query against the data warehouse can result in an error: maybe the database is offline, maybe the table is huge and the query takes a very long time, or a divide-by-zero occurs at runtime. We now show more informative errors when a profiling job fails.
Lineage edges are now hoverable, showing source and target nodes, which are highlighted when the edge is clicked.
Improved Lineage navigation: when switching the central table, the Profiling and Sampling tabs now switch to that table as well.
Add GraphQL API for lineage
GraphQL is an increasingly popular method for retrieving information. It gives developers more control over which entities and fields they want to access. We now support a GraphQL API for our lineage information. Read more about it in this technical blog.
We’re continuously adding more information to the GraphQL API. For the latest state, please refer to the documentation.
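The request mechanics are standard GraphQL over HTTP; the sketch below is only that — the endpoint path, auth header, and query fields are hypothetical, so refer to the documentation and blog post above for the real schema.

```python
import requests

API_KEY = "..."  # your Datafold API key

# Hypothetical lineage query -- field names are illustrative, not the real schema.
query = """
{
  table(name: "analytics.orders") {
    downstream { name }
  }
}
"""

resp = requests.post(
    "https://app.datafold.com/api/graphql",       # assumed endpoint path
    headers={"Authorization": f"Key {API_KEY}"},  # assumed auth header format
    json={"query": query},
)
resp.raise_for_status()
print(resp.json())
```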
Support dbt_utils for inferring Primary Keys
When running Datafold, we use the table's primary key to determine what changed. One popular way of checking this constraint is the unique_combination_of_columns test from dbt_utils. Datafold now detects the use of these tests and infers the primary key from them, making it easy to get started with Data Diff. You can still set the primary key explicitly if desired.
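As a conceptual sketch of that inference (not Datafold's actual implementation), the combination of columns can be read straight out of a model's schema file and treated as a composite primary key:

```python
import yaml  # requires PyYAML

# A model tested with dbt_utils.unique_combination_of_columns (names illustrative).
schema_yml = """
models:
  - name: fct_pageviews
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - session_id
            - page_url
"""

model = yaml.safe_load(schema_yml)["models"][0]
inferred_pk = None
for test in model.get("tests", []):
    if isinstance(test, dict) and "dbt_utils.unique_combination_of_columns" in test:
        inferred_pk = test["dbt_utils.unique_combination_of_columns"]["combination_of_columns"]

print(inferred_pk)  # ['session_id', 'page_url'] -> used as a composite primary key
```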
Revamped the signup flow with a new UI for a better user experience, and simplified dbt configuration.
Data Diff time travel in BigQuery and Snowflake
Time travel is a useful feature of some modern data warehouses that allows querying a table at a particular point in time. Using it in combination with Data Diff can be very helpful for detecting data drift in a table by diffing it against an older version of itself. When testing changes in prod vs. dev environments, time travel can also help align both environments on the state of source data.
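For example, Snowflake exposes time travel through the AT clause (BigQuery has FOR SYSTEM_TIME AS OF). A minimal sketch of building a query against yesterday's state of a table (the table name is illustrative):

```python
from datetime import datetime, timedelta, timezone

# Query a table as it looked 24 hours ago using Snowflake's time-travel AT clause.
as_of = datetime.now(timezone.utc) - timedelta(days=1)

historical_query = f"""
SELECT *
FROM analytics.orders
  AT (TIMESTAMP => '{as_of:%Y-%m-%d %H:%M:%S}'::timestamp_tz)
"""
print(historical_query)  # diffing this snapshot against the current table reveals drift
```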
Gitlab support for Data Diff
Now it’s possible to automate full impact analysis of every PR to ETL code in GitLab repositories. See how a change in the code will impact the data produced in the current and downstream tables.
More information on how to set it up can be found in the docs.
Added support for alerts on scalar values
While the true power of ML-aided alerts comes from monitoring metrics in time, sometimes it may be helpful to check a single value against a set threshold.
Catalog learns about your data from everywhere
Datafold will now automatically populate Catalog with column and table descriptions & tags from dbt, Snowflake, BigQuery, Redshift and other systems, creating a unified view.
Additional descriptions can be added using Datafold’s built-in rich text editor.
Primary keys for dbt models used in the Data Diff CI integration can now be specified at the table level
- Errors and warnings are now collapsed in Github/Gitlab comments to avoid bloat
- Improved performance of the Catalog search filter
- Improved handling of dbtCloud retries: Datafold now retries up to 4 times, for up to 4 seconds, after receiving 500 errors from the dbtCloud service
- Data source log extraction for lineage can now run on a cron schedule
- Alerts now show the modified-at timestamp
- Improved crontab validation: removed the once-an-hour restriction on scheduling
- It is now possible to disable alert query notifications
- Catalog now shows the timestamp when the dataset was last modified
Customizable Tags
Since tags became a really popular way to document tables, columns, and alerts in Catalog, many of you have requested a better way to manage them, including the ability to customize their color for readability. All tags can now be created, edited, and deleted in the Settings menu.
Improvements
- Improved profiler reliability
Interactive external dependencies
Lineage graphs can often get very complex and messy with all dependencies plotted at once. That’s why, by default, Datafold shows a slice of the full lineage graph centered on a particular table (e.g. “dim_businesses”). This means the graph shows tables and columns directly upstream or downstream of the chosen table.
At the same time, downstream tables (e.g. “report_hourly_business_pageviews”) may have other upstream dependencies unrelated to the table on which the lineage view is centered. To avoid bloat, those dependencies are shown as dashed lines. Clicking on them centers the lineage graph on the chosen table.
Per-column Data Diff Tolerances
Sometimes it may be helpful to compare columns with a threshold instead of strict equality. For instance, when a database column is a FLOAT computed as a division of aggregates (e.g. COUNT(*) / SUM(someFloatCol)), the results of the computation are not strictly deterministic, resulting in differences that are irrelevant from a business standpoint but would be flagged by diff under strict equality: 1.1200033 vs. 1.1200058. Diff tolerance allows you to specify an absolute or relative threshold below which differing values are considered equal.
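As a worked illustration of the two threshold types (using Python's math.isclose rather than Datafold's implementation), the two values from the example above compare equal under a small relative or absolute tolerance:

```python
import math

a, b = 1.1200033, 1.1200058   # the example values above; they differ by 2.5e-06

print(a == b)                            # False -- strict equality flags a difference
print(math.isclose(a, b, rel_tol=1e-5))  # True  -- within a 0.001% relative tolerance
print(math.isclose(a, b, abs_tol=1e-5))  # True  -- within an absolute tolerance of 1e-5
```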
Tags autocomplete
When entering tags, you can rely on autocomplete to avoid creating semantically similar tags.
Improvements
- Fixed a bug that prevented admins from sending password reset emails
- "Discourage manual profiling" flag added to data source settings. If the flag is set, when the user tries to refresh a data profile, a warning popup will appear.
Fixed saving data sources and CI integrations with an empty cron schedule.
On-prem deployments now require an install password at first install.
New Data Diff UI & Landing Page
Streamlined UI with more settings
Improvements
- The application now posts update messages when waiting for dbt runs to finish.
- Added an API endpoint to get the status of CI runs. It can be used to check the state of a CI process.
- Use standard notation for crontab format
- Fixed a bug where the dbt meta schedule stopped working
New application root page
- The CI config ID is now visible in the CI settings screen
- Allow using the dbt CLI to post the manifests to Datafold, so that Datafold can run diffs in a similar way as in the dbtCloud integration
- Documentation is now available from the header in the app
- Fixes a bug where the dbt cloud account number was passed as a string
The dbt configuration now presents a list of accounts instead of requiring the account name to be entered manually.
Automatic dbt docs sync to Datafold Catalog
- Fixed a bug where Snowflake timezone-aware fields were compared against timezone-naive instances
- Search: added “Select all” / “Deselect all” to the data source filter
- Updated loading indication when loading data source schema
- Search: the user is redirected to the `/search` page when no results are found in as-you-type mode
- Updated usage of URL params for search
- Search: tree and sider are now responsive (expand if schema names don't fit into width)
- Updated scrolling UX
- Profiler: removed `experimental_` guards from the new profiling and sampling UIs
- Profiler: fixed an issue with DATE & DATETIME for Snowflake table profiles
- Lineage: fixed hanging PostgreSQL query due to query planner misoptimization
- Lineage: hotfix for Snowflake + dbt
- Lineage: multiple small bugfixes
- Lineage: support for Snowflake semistructured data
- Lineage: fixed a bug where some parts of the graph were not displayed
- Profiling: bugfix in settings
- Data Diff: fixed handling of the `time` datatype
- Data Diff: soft-fail on `inf` and `NaN` float values
- Made sure that CI data diffs are resilient to server-side interruptions
- Correctly display arrays and maps in profiler sample
- Several bugfixes in lineage UI
- Fixes in the color scheme
- Added support for incremental SQL log fetching to build column-level lineage
- Several fixes in the lineage query parser
Incremental Column-level Lineage
Instead of querying the entire SQL query history, Datafold now looks at only new queries and updates the lineage graph incrementally. This currently works for Snowflake and BigQuery.
Faster Column Profiler
It now supports browsing super-wide (100+ column) tables without any interface lag.