Changelog
Introducing Slim Diff in CI/CD
- Slim Diff helps teams prioritize business-critical models in CI/CD workflows by giving them control over exactly which models to diff on each pull request. When enabled, Slim Diff runs data diffs only for models specified through dbt metadata, and skips models that aren’t explicitly tagged or are excluded from data diffing.
Column Remapping in Data Diff creation flow
- Quickly remap columns within the Data Diff UI or API creation flow for known column name changes to ensure all columns are compared correctly.

Schema Comparison Sorting
- Faster schema comparisons to see what changed inline, especially when column order has changed.

Cancel In-Progress Data Diffs
- You can now cancel running diffs from both the Data Diff results page and the administrator interface. As before, you can also cancel all diffs within a CI run from the same administrator interface.

Globally exclude tables from CI/CD diffs
- Use your dbt metadata to exclude particular folders or models from data diffing in CI/CD workflows. Use cases range from excluding sensitive tables to excluding unsupported downstream usages. Your data team can configure Datafold to align with their priorities.
Lightning-fast in-database comparisons for the data-diff library + DuckDB support
- Have you ever wanted to quickly and easily get a diff comparison of two tables in your dbt development workflow? Now you can! Our wonderful Solutions Engineer Leo spun up a tutorial on how to use our open-source data-diff library to find potential bugs that unit testing or monitoring would have missed.
- Additionally, our data-diff community contributors have continued to improve the product - including adding DuckDB support. We appreciate the support @jardayn!
- The latest release of Datafold’s free, open-source data-diff library is optimized for even faster Data Diffs within the same database. Compare any two tables within a warehouse and receive a detailed breakdown of schema, row and column differences.
Improved Diff Results Sorting and Filtering
- We’ve added improved sorting and filtering interfaces to the Data Diffs analysis workflow, making it easy to find specific rows within your diff results. For example, if you’re trying to confirm that the values for a particular primary key in your sea of modified data changed exactly as expected, filter for the specific primary key or changed column value you’re looking for.

CSV Export
- You can now export CSVs of Data Diff results and primary keys that are exclusive to one of the datasets in your comparison! This is perfect for debugging and reconciling missing data between two data sets, and sharing that information across your organization.

- Don’t forget you can always materialize your Data Diff results to a table in your database and natively join your results to your source data, or do a deeper analysis on those differences. Enabling this setting in the Data Diff creation flow via our API or the Datafold app will create a table in your temporary schema with matched rows, values, and flags for which columns differ.

Materialize diff results to table is an option within the Data Diff creation workflow in both the Datafold App and our REST API.
Lineage Usage Metrics
- Column and Table-level query metrics in Lineage - right-click on any table or column reference within the Datafold Lineage UI to view how many times a particular user account has read or written to a particular table, allowing you to identify commonly or infrequently used data points.

- Popularity metrics now include all cumulative downstream usage of column or table, showing the total downstream reads for a particular client.
- Popularity Filters - Filter lineage nodes by their relative popularity compared to all indexed tables in Lineage
Data Diff Improvements
- Cancel CI Job button via the Datafold Jobs UI - Admin users are now able to cancel CI/CD diff tasks via the Jobs UI in Admin Settings.
- Copy Data Diff Configuration JSON to Clipboard - the info button within the diff results page now contains a button to copy the JSON payload required to create a diff via the REST API.

- Set diff time travel logic at the dbt-model level. For example, if your dev and production tables have known differences due to timing of incremental source data, you can add a time-travel configuration to ignore the most recent data, preventing false positives in CI/CD. Learn more about time travel here and more about dbt metadata configuration here.
Other Improvements
- Catalog search improvements to weight exact-text matches more aggressively, and hide less relevant results.
- Datafold CI/CD integration now populates a list of deleted dbt models within the pull request comments.
- Improve lineage support for dbt-based Hightouch models
Popularity counters in Lineage
To help you understand how frequently the assets in your warehouse are used, Lineage now displays an absolute access count per table and column for the last 7 days. To help you interpret that information, a popularity rating from 0 to 4 is assigned, indicating how popular a particular database object is compared to others.

Other changes
- For on-premise deployments, we now support Data Diff in CI for GitHub on-premise servers. To use your own private GitHub server instead of the cloud version (https://github.com), set the <span class="code">GITHUB_SERVER</span> environment variable to your GitHub on-prem URL.
- In the app, the BI Settings section has been renamed to “Data Apps” and now includes both Mode and Hightouch integrations.
- Performance improvements to lineage.
- In the Lineage UI, Hightouch models and syncs now link to the Hightouch App. This can be configured using the “workspace URL” field in the Hightouch integration settings.
- Visual improvements to data source names and logos in Catalog and Lineage.
- Updated display of long names of tables in Lineage.
- Popularity is now a general filter in Catalog. It can be applied to both tables and columns.
- Data Source and Data App source filters in Catalog are now merged for better search experience.
- Users can now add, remove, and query tags for Mode dashboards, Hightouch models, and Hightouch syncs using GraphQL API.
- Added usage info for tables and columns to GraphQL API.
- CI configurations can now be paused, preventing them from running checks on pull requests.
- Added support for BigQuery’s bignumeric and bigdecimal data types.
- The data source mapper field in the Data Apps create/edit form is now validated after all the data sources are mapped.
- In the Data App settings, we’ve added direct links to our documentation.
Bug fixes
- In some cases, data diffs were not canceled after CI run cancellation. These diffs were stuck in a WAITING status forever.
Multidimensional Alerts (Beta)
Users can use <span class="code">GROUP BY</span> in alert queries to dynamically produce several time series at once. Each dimension is named after the values of the dimensional/categorical field(s) of the <span class="code">GROUP BY</span>, and its thresholds and anomaly detection can be configured separately. New time series will appear (and disappear over time) as the data changes, without the need to maintain a plethora of alerts with <span class="code">WHERE</span> filters.
This feature is currently in Beta and is available upon request — please reach out to support@datafold.com to enable it for your organization.
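To illustrate the idea, here is a minimal Python sketch of how one grouped query result splits into several independent time series, one per dimension value. The rows, the column names (`day`, `country`, `orders`), and the `split_into_series` helper are hypothetical and only stand in for an alert query result; they are not part of Datafold's API.

```python
from collections import defaultdict

# Hypothetical rows, shaped like the result of a query such as:
#   SELECT day, country, COUNT(*) AS orders FROM ... GROUP BY day, country
rows = [
    ("2022-07-01", "US", 120),
    ("2022-07-01", "DE", 45),
    ("2022-07-02", "US", 131),
    ("2022-07-02", "DE", 40),
    ("2022-07-02", "FR", 7),  # a new dimension value simply starts a new series
]


def split_into_series(rows):
    """Group (time, dimension, value) rows into one time series per dimension."""
    series = defaultdict(list)
    for day, dimension, value in rows:
        series[dimension].append((day, value))
    return dict(series)


series = split_into_series(rows)
for name, points in sorted(series.items()):
    print(name, points)
```

Each key of `series` corresponds to one dimension, which is why thresholds can be set per dimension instead of writing one alert per `WHERE` filter.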


Datafold <> Hightouch Integration
Hightouch models and syncs are now discoverable through the Datafold Catalog and visible in Datafold’s Column-Level Lineage - making it possible to trace data from source to activation.
This feature is currently available upon request — please reach out to support@datafold.com to enable it for your organization.

See downstream data applications in PR Comments
Datafold now shows downstream data applications, e.g. Mode reports and Hightouch syncs, that might be affected by a code change.

Data Diff results materialization
Users can now save Data Diff results in their databases for further analysis. Current support is limited to PK duplicates, exclusive PKs, and all value level differences.


Other changes
- Significantly improved CI-based Data Diff performance for large warehouses with many tables, schemas, etc.
- Expandable metric graphs to make comparison more convenient.
- For On-Premises Implementations - If the environment variable <span class="code">DATAFOLD_AUTO_VERIFY_SAML_USERS</span> is set to "true", then users created during SAML sign-up will not have to verify their emails.
- Better display for values match indicator in Data Diff -> Values tab.
- Reformatted long alert names in the filter popup for readability.
Bug fixes
- Resolved the issue where the Datafold-sdk failed to perform a primary keys check for manifest.json if there were some tables in the manifest that had not yet been created in DB.
- Jobs request fails when filters are cleared.
Databricks support
You can now add Databricks as a data source, with full support for Data Diff, table profiling, and column-level lineage.

Other changes
- Data Diff sampling thresholds are no longer limited to hardcoded defaults and can now be configured from the UI.
- We updated the Jobs page to make connection types, table names, and runtimes easier to read.
Bug fixes
- Slack and email alert notifications were not delivered for some customers between 2022-05-31 18:00 UTC and 2022-06-07 11:00 UTC (SaaS)
- Profile histograms and completeness info did not render immediately on load.
- Job Source filter did not contain all the possible values that our API can return.
- “Created time” and “last updated time” were not displayed in the list of Jobs.
- Incorrect status in gitlab CI pipelines. Datafold App will no longer block a merge if something is wrong with the Datafold App.
Lineage UI filters
Navigating large lineage graphs is now easier with filters that help filter out the noise. Datasource/database/schema filters allow you to control the amount of information displayed.

User group mapping between Datafold and SAML Identity Providers
Organizations using a SAML Identity Provider (Okta, Duo, and others) to authenticate users to Datafold via Single Sign-On can now set up a mapping between SAML and Datafold user groups. Users will be automatically assigned to the desired Datafold groups according to the pre-configured mapping when using SAML login.
This feature is available on request — please get in touch with Datafold to enable it for your organization.


Other changes
- Added a special method to our SDK to check the correctness of dbt artifacts submitted to Datafold when using the dbt Core integration. Now Data Diff can finish even if something is wrong with uploading dbt artifacts. See the documentation for details.
- Now Datafold shows Slack users/groups with the conventional @-form, like in the Slack App.
- SAML validation & configuration errors are now exposed to users so that they can debug their setup.
Bug fixes
- Sometimes the job status is displayed as `notAvailable`.
- BI reports with special characters in names (slashes, hashes, etc) are not displayed or routed correctly.
- When BI report's preview is downloaded with an error, the loading indicator is displayed forever.
- Multi-word search requests were squashed, omitting spaces.
- Inviting a user that was already in Datafold caused an error with an unclear message. Now it says explicitly that the problem is with the user being already invited.
Data Diff sampling for small tables disabled by default
To avoid unnecessary overhead, Data Diff sampling is disabled for smaller tables. For now, the table size thresholds are hardcoded defaults; a configuration UI is coming. See the documentation for more details.
Other changes
- Alert query columns are automatically classified to time dimension and metric columns; there is no more need to put the time column first.
- Datafold no longer uses labels on GitLab to track the status of the Data Diff process; the status can now be tracked from the CI pipelines functionality.
Bug fixes
- Issue with include and exclude columns in diffs
- Off-charts dependencies of the in-focus table in Lineage are now displayed (and act) correctly as "Show more" → Change direction of Lineage
- The Settings menu item in the Admin section is sometimes not rendered correctly
- Catalog search by one- and two-letter words does not work
- Rows with NULL primary keys always got filtered out during data diff if sampling had been enabled
Data Diff filters can be configured in the dbt model YAML
You can now configure Data Diff filter defaults in dbt model YAML. Filtering can be used to force Data Diff to compare only a subset of data. For example, you may want to compare just the latest week to save DWH resources and reduce diff execution time. See the documentation for details.

Other changes
- Selecting a column and its connected nodes in Lineage now shows an indicator that also lets you exit the selected path mode. Clicking on empty space to exit is deprecated.

- Fold sections of Github / Gitlab printouts to save screen space. They can be easily unfolded to check verbose diff information.
- Show actual Slack error codes on test notifications, so that users can debug their Slack-Datafold integration.
- Datafold now sends a confirmation email when SAML users are auto-created.
- Lineage now shows all columns of a table that exist in the database, not only the ones that have connections detected by Lineage.
- Improvement to the autocomplete feature in Data Diff.
Bug fixes
- API key not copied into clipboard with input built-in tool
- Cell data in Data Diff Sampling tab is not copied from the popover
- Sometimes NaN appears instead of alert weekly estimates.
- Disabled users logging in through OAuth no longer raise an error.
You can now receive Alert notifications at arbitrary webhooks with arbitrary payloads (including but not limited to JSON) — in addition to Slack & email notifications. See the documentation for details.
This feature is available only on request — please contact Datafold to enable it for your organization.

For API-first users, errors from all API endpoints are now unified as per RFC 7807 with the same structured JSON payload, and the 4xx HTTP status codes are normalized for most cases. This should simplify parsing error messages caused by, for example, invalid input or incompatible configuration. The UI error messages are also more descriptive in some cases where they previously were not.
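As a rough sketch of what this makes possible on the client side, the Python snippet below reads the standard RFC 7807 members (`type`, `title`, `status`, `detail`, `instance`) from an error body. The example body and its values are made up for illustration; any additional members that Datafold may include in its payloads are not shown here.

```python
import json


def parse_problem(body: str) -> dict:
    """Extract the standard RFC 7807 members from a JSON error body."""
    problem = json.loads(body)
    return {
        # RFC 7807: when "type" is absent, it is assumed to be "about:blank".
        "type": problem.get("type", "about:blank"),
        "title": problem.get("title"),
        "status": problem.get("status"),
        "detail": problem.get("detail"),
        "instance": problem.get("instance"),
    }


# Hypothetical error body, not copied from a real Datafold response.
example_body = json.dumps({
    "type": "about:blank",
    "title": "Bad Request",
    "status": 400,
    "detail": "Primary key column is missing.",
})

problem = parse_problem(example_body)
print(f"{problem['status']} {problem['title']}: {problem['detail']}")
```

Because every endpoint shares the same shape, one helper like this can serve all API calls instead of per-endpoint error handling.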

Other changes
- A new API endpoint <span class="code">`/api/v1/dbt/check_artifacts/{ci_id}`</span> checks for dbt artifacts after uploading. This endpoint can be called during a CI process, for example in GitHub Actions or GitLab CI, to help Datafold understand the status of downstream tasks.
- Improved performance of dataset suggestions in Data Diff, now search-based.
Bug fixes
- Lineage off-chart dependencies for upstream nodes not displayed
- Snowflake table/column casing issues are resolved
- Special characters are now properly handled on the data source names
- Table profiling will not be done for disabled data sources
- Lineage column selection dropped after table expansion
- Jobs UI now shows main jobs instead of result sub-jobs for profiling and data diffs
- Off-chart edge switches lineage direction for primary table
- Redirect to lineage from profile was sometimes broken
Refactored navigation design

Other changes
- Improved formatting of integers for column profiles in Data Diff

- Profile now displays the list of columns with their descriptions and tags, even if profiling is disabled

- Added excludes/includes support to GraphQL search endpoint
Bug fixes
- Fix: lineage not expanding for the second time
- Fix: last run filter in search showing numbers instead of days/weeks
- Fix: expanding lineage showing incomplete list of tables
- Fix: incorrect sorting in a primary key block in the Data Diff UI
- Fix: ability to navigate to data source creation dialog with non-confirmed e-mail
SAML
Organizations can now use any SAML Identity Provider to authenticate users to Datafold via Single Sign-On. This includes Google, Okta, Duo, and many others, including private/corporate identity providers.

Other changes
- During CI runs, data diff jobs will automatically select a created_at or updated_at column with an appropriate timestamp type as the time dimension
- Catalog search has been improved in both performance and result ranking
- Tags automatically created during dbt processes that have been superseded are periodically removed
- A custom database can be specified for Lineage metadata in Snowflake sources
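The automatic time dimension selection mentioned in the first item above can be pictured with a small Python sketch. It assumes the table schema is available as a mapping of column name to type name; the set of type names, the preference for `created_at` over `updated_at`, and the `pick_time_dimension` helper are illustrative assumptions, not Datafold's actual implementation.

```python
from typing import Optional

# Assumed type names; real warehouses spell their timestamp types differently.
TIMESTAMP_TYPES = {"timestamp", "timestamptz", "timestamp_ntz", "datetime"}
# Assumed preference order when both columns exist.
PREFERRED_NAMES = ("created_at", "updated_at")


def pick_time_dimension(columns: dict) -> Optional[str]:
    """Return the first preferred column that has a timestamp-like type."""
    lowered = {
        name.lower(): (name, col_type.lower())
        for name, col_type in columns.items()
    }
    for candidate in PREFERRED_NAMES:
        if candidate in lowered:
            name, col_type = lowered[candidate]
            if col_type in TIMESTAMP_TYPES:
                return name
    return None  # no suitable column; no time dimension is selected


print(pick_time_dimension({"id": "integer", "created_at": "timestamp"}))
```

A column named `created_at` with a non-timestamp type (for example a string) is skipped, which is the role the "appropriate timestamp type" condition plays.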
Bug fixes
- Masked fields in Snowflake data sources could cause errors when materializing temporary tables
- Disabled users could not be re-enabled
- Posting labels to Gitlab triggered notifications when there were no changes
- Table profiling failing for views in PostgreSQL data sources
New Lineage UI
The lineage UI was updated to improve the performance for large graphs and to make exploring dependencies more intuitive. Among other changes, the view now distinguishes between upstream and downstream graph directions, and filter settings have moved to the top to provide a larger area for the lineage canvas.

Improved Slack alert messages
To make the anomaly notifications more actionable, they now include the alert name and the actual value, and provide more context about the anomaly that occurred.

Reduced verbosity for new tables in the Data Diff CI output
When new tables are created in a PR, the block has been reduced to only show the number of rows and number of columns, and a link to the table profile is inserted.

Other changes
- Automatically created tags from ETL are now cleaned up automatically after their initial use to reduce tag clutter
Bug fixes
- BI dashboards stopped displaying in the catalog
- Added missing icons of BI data sources
- Lineage paging stopped loading off-chart dependencies
- Github refresh button didn’t work correctly
- dbt metadata synchronization for dbt older than 1.0.0 in combination with Snowflake didn’t work correctly
Fine-grained control of what data assets show up in Datafold
There is such a thing as too much data observability. To help you separate signal from the noise and only see tables that actually matter, we added fine-grained settings that allow you to define which databases, schemas, and tables should show up in Datafold Catalog and Lineage and which should be hidden (e.g. dev/temp tables). The filtered out data assets can still be found by their full name (e.g. “db.schema.table”)

Alert subscriptions for Slack user groups
Slack user groups can now be subscribed to alerts, e.g. all members of team X, on-call engineers, or incident commanders. The special handles @channel and @here can also be notified of alerts, reaching all members or only the currently online members of a channel, respectively.

Pausing data source in the UI
You can now temporarily disable or pause a data source in the UI.


Other changes
- Subscribed users will be notified in case an alert has an execution error (e.g. database permission/connection failure) — not only on actual anomalies
- Improved alert texts in Slack
- Dramatic speedup of schema download from Snowflake
- For Data Diff in CI, unchanged tables are grouped at the top of the report
- For manually created Data Diffs, the primary key case is automatically inferred
- Data diffs on Snowflake are now running much, much faster
Bug fixes
- Fix: Notifications were sent to deleted integrations/destinations for some time after the deletion. No more
- Fix: Slack App integrations were sometimes not showing users & channels if reinstalled from Slack, not from Datafold
- Fix: Plain CI configuration could not be saved/edited when the template variables section was empty.
- Fix: Setting update time for Alerts
- Fix: Proper DB types mapping for the new Snowflake schema downloader
- Fix: non-existing Slack users are filtered from Alerts
- Fix: A lot of upstream deps take too much space in the layout. Now we're showing the first 3, and the rest are available in Lineage UI
- Fix: Multiple tables in a CI diff were too large for a single comment post. The tables are now paginated across multiple comments
- Fix: Hours jump in Alerts time picker
Data Diff can now compare VARIANT type in Snowflake

Other changes
- Added the ability to pause a data source in the API. When a data source is paused, all its data is retained in the system but schema indexing, profiling, and lineage processing are disabled
- Improved error reporting for Redshift data sources when Datafold does not have permissions to access the table
- Lineage speed improvements
Bug fixes
- Fixed a bug where spaces in Data Diff values tab were missing
- Fixed an issue where a Github integration didn't show an error message when it cannot be deleted
- Fixed a bug where the user invite link for organizations that have Okta enabled did not work
- Fixed a bug where BI reports could appear orphaned, not having any links to tables
- Fixed a bug where a CI run could fail if the dbt manifest didn’t contain the raw relation name
- Fixed a bug where the CI reported booleans instead of numbers for the number of mismatched columns
- Fixed a bug in CI where, when a table has no differences, the link to the table profile malfunctioned
- Fixed testing Github repository connections
- Fixed Slack notifications where the integration could not be deleted if currently used in alerts. In the new behavior, it will unsubscribe all related notification methods from alerts as the integration is deleted
Allow CI to continue if Data Diff fails
When you integrate Data Diff into the CI flow, you can control whether an error during Data Diff processing causes the CI flow to fail or continue. This allows you to configure Datafold to be non-blocking in your CI which can be helpful when introducing Data Diff in your development process initially.

Support for key-pair authentication for Snowflake
In our effort to support the most secure practices possible, we’ve added the ability to configure a Snowflake data source to use key-pair authentication. This is more secure than password authentication alone. See Datafold’s Snowflake documentation for details.

Other changes
- Visually collapse Data Diff reports if no changes are detected to save users time
- Optimized schema fetching during a data diff to reduce the runtime of a single diff, as well as the load on the data warehouse
- Irrelevant diff views are now hidden if the primary key was not specified
- The “time dimension” field in the Data Diff view now suggests only date/time columns
Bug fixes
- Integrations could not be deleted if they were used in any alerts
- Minor rendering issue with Datafold logo on the login page
- In the Data Diff view, each of the Dataset text entry fields had its input blocked while its loading indicator was active
Data Diffs without primary keys
Now you can run data diffs without specifying primary keys to compare table schemas and column profiles. Specifying primary keys is required for value-level comparison.

Other changes
- GitLab CI integrations now respect the file ignore lists (previously, it was supported only for GitHub)
- Improved filters autocomplete performance
Bug fixes
- Alert deletion could sometimes be slow or time out
- An unnecessary expand icon in the data source tree filter is not shown anymore
- UI could break if you had more than 500 tags in the organization
Data Diff improvements
Sum and Average diff metrics
Data Diff now also compares sums and averages for numerical columns, which can be helpful for analyzing changes in distributions.
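As a rough illustration of what comparing these aggregates involves, the following Python sketch computes the sum and average of the same numerical column in two versions of a table and reports the difference. The sample values and the `aggregate_diff` helper are hypothetical; NULLs are skipped here, as SQL aggregates would do, and each column is assumed to contain at least one non-NULL value.

```python
from statistics import mean


def aggregate_diff(left, right):
    """Compare the sum and average of one numerical column across two tables."""
    # Skip NULLs (None), mirroring how SQL SUM/AVG ignore them.
    left_vals = [v for v in left if v is not None]
    right_vals = [v for v in right if v is not None]
    left_stats = {"sum": sum(left_vals), "avg": mean(left_vals)}
    right_stats = {"sum": sum(right_vals), "avg": mean(right_vals)}
    return {
        metric: {
            "left": left_stats[metric],
            "right": right_stats[metric],
            "delta": right_stats[metric] - left_stats[metric],
        }
        for metric in ("sum", "avg")
    }


prod_amounts = [10, 20, 30, None]  # hypothetical production column
dev_amounts = [10, 25, 30, 5]      # hypothetical development column

result = aggregate_diff(prod_amounts, dev_amounts)
for metric, values in result.items():
    print(metric, values)
```

In this made-up case the sum goes up while the average goes down, the kind of distribution shift that a row count alone would not reveal.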

Improved handling of long values
When browsing value-level diffs, overflowing values can be explored and compared by hovering over them. The long values can now be copied to clipboard for further analysis.

Ignoring certain files in Data Diff CI
A new setting for CI integrations allows users to selectively ignore files modified in a PR and skip running Datafold for irrelevant changes. Files can be excluded, re-included, and re-excluded again, allowing complex patterns for cases like “only run data diffs if any dbt files have changed, except for the .txt and .md files in that folder”.
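The exclude, re-include, re-exclude behaviour can be sketched in Python as an ordered pattern list where the last matching pattern wins. The `!` prefix for re-inclusion, the example patterns, and the helper names are assumptions made for this sketch (modelled on gitignore-style rules); see the documentation for the syntax Datafold actually accepts.

```python
from fnmatch import fnmatchcase


def is_ignored(path: str, patterns: list) -> bool:
    """Apply patterns in order; the last pattern that matches decides."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")  # assumed re-include marker
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):        # "*" here also matches "/"
            ignored = not negated
    return ignored


def should_run_diff(changed_files: list, patterns: list) -> bool:
    """Run the diff only if at least one changed file is not ignored."""
    return any(not is_ignored(path, patterns) for path in changed_files)


# "Only run if dbt files changed, except .md and .txt files in that folder."
patterns = ["*", "!dbt/*", "dbt/*.md", "dbt/*.txt"]

print(should_run_diff(["dbt/README.md", "dbt/models/orders.sql"], patterns))
print(should_run_diff(["dbt/README.md", "docs/index.md"], patterns))
```

The first call has a relevant dbt file and would trigger a diff; the second touches only ignored files and would be skipped.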

Lineage Improvements
Original SQL queries
You can see the SQL query that was used to create/update a table or refresh a BI report in both the Datafold Catalog and Lineage views.


BI report filtering
BI reports in Lineage can now be filtered by popularity and freshness:

Mode dashboard previews
You can see a preview screenshot for any Mode report on the Datafold Lineage graph:

Other changes
- Timepicker in Alerts schedule now has a correct “Now” button that converts current time to UTC using the time zone from the browser
- Now you can use Cmd/Ctrl + Click to open a data diff or an alert in a new tab
- You can now see “Last run datetime” in the list of alerts.
Bug fixes
- SQL queries are now visible again in Profile and Lineage for tables
- Multiple lineage UI improvements
New Datafold Slack App and alert subscriptions
Adding Slack channel destinations is easier with the new Slack App. Users can subscribe to alerts and get mentioned in the designated channels allowing for more targeted alerting and collaborative incident resolution. Documentation is available here.


Single sign-on through Okta
Single sign-on through Okta is now available for Datafold Cloud.

Datafold <> Mode Integration now in beta
Mode reports are now discoverable through Datafold Catalog and appear in Lineage which enables tracing data flows on a field level all the way to Mode reports and dashboards. Let us know if you would like to enable it for your account.


Other changes
- Fix for faux-off-chart-deps in Lineage
- Added a UTC notation to Last Run in Catalog results
- Row counts in Diff now take time travel specifiers into account
- Improved refreshes for the GitHub app to use the app authentication token instead of user to server token
- Added the database name to all Redshift and PostgreSQL tables. This allows for use of dbt integration for those databases, and lineage in case of Redshift if cross-database queries are used in the ETL process.
Diffing for advanced data types
Data Diff can now compare Snowflake's VARIANT and ARRAY types. Profiling information won't be generated for those columns, but they will show up in overall statistics, and in the Values tab. Previously VARIANT and ARRAY types were ignored during comparisons.

Improved diff sampling
When comparing tables (for example, Staging and Prod versions of your dbt model), Data Diff provides a sample of divergent values for every column that doesn’t fully match between tables. Previously Diff would select ~15 rows for every column that had differences. If there were just a few such columns, the overall sample size could be quite small. The algorithm now selects ~1,000 rows regardless of the number of columns that are different.
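The effect of this change on the overall sample size is simple arithmetic, sketched below in Python. The figures come from the approximate numbers above (~15 rows per differing column before, ~1,000 rows in total now); the function names are illustrative, and the old figure is an upper bound because the same row could be sampled for several columns.

```python
OLD_ROWS_PER_DIFFERING_COLUMN = 15  # approximate, per the description above
NEW_TOTAL_ROWS = 1000               # approximate, per the description above


def old_sample_size(differing_columns: int) -> int:
    """Previous behaviour: the sample grew with the number of differing columns."""
    return OLD_ROWS_PER_DIFFERING_COLUMN * differing_columns


def new_sample_size(differing_columns: int) -> int:
    """New behaviour: a fixed total, as long as anything differs at all."""
    return NEW_TOTAL_ROWS if differing_columns > 0 else 0


for n in (1, 3, 50):
    print(n, old_sample_size(n), new_sample_size(n))
```

With only three differing columns the old approach yielded roughly 45 sampled rows, while the new one yields roughly 1,000.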

Bug fixes
- Fixed an issue where the “$” character was not accepted in a password
- Improved integer formatting throughout the app
- Improved performance in the Catalog search input
- Fixed 5+ smaller UI issues
Mode reports in Lineage & Catalog
Mode is now available as an integration in Datafold in alpha testing mode. Once enabled, Datafold will index all reports in your Mode account to make them available in the Datafold Catalog search and Lineage.
You can now discover relevant Mode reports alongside datasets in the same search experience. It’s also possible to filter Mode reports based on popularity and freshness.
You can trace field-level data lineage to Mode reports in the Datafold Lineage view to see which tables and columns feed what report, making it easy to perform refactorings and troubleshoot issues:

New Jobs UI
With the new Jobs UI you can check what tasks are currently running in your Datafold account and easily troubleshoot various integrations such as Diff in CI as well as audit the use of Datafold.

Bug fixes
- Fixed displaying of Alert schedules when an hourly interval is selected.
Automatic inference of primary keys for dbt models + CLI tool to check primary key settings for Data Diff

For Data Diff to work in CI, it needs to know the primary key for each table it analyzes. Datafold provides a few options for defining primary keys in the dbt model:
- Define it as meta.primary_key in dbt YAML
- Define it as a table or column-level tag in dbt YAML
- Automatically infer primary keys based on uniqueness tests
To help you ensure that Data Diff can look up or infer primary keys for all tables in your dbt project, we added the check-primary-keys command to the Datafold CLI.
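The three lookup options listed above can be pictured as a fallback chain. The Python sketch below is a simplified illustration under several assumptions: the model is a plain dictionary resembling a parsed dbt YAML entry, the tag name `primary-key` is a placeholder, only column-level tags are handled, and the order of precedence is a guess. It does not reproduce the logic of the check-primary-keys command.

```python
from typing import Optional

PK_TAG = "primary-key"  # assumed tag name, for illustration only


def find_primary_key(model: dict) -> Optional[list]:
    """Look up a primary key: explicit meta, then tags, then uniqueness tests."""
    # 1. Explicit definition under meta.primary_key
    explicit = model.get("meta", {}).get("primary_key")
    if explicit:
        return [explicit] if isinstance(explicit, str) else list(explicit)

    columns = model.get("columns", [])

    # 2. Columns carrying the (assumed) primary key tag
    tagged = [c["name"] for c in columns if PK_TAG in c.get("tags", [])]
    if tagged:
        return tagged

    # 3. Inference: the first column with a "unique" test
    unique = [c["name"] for c in columns if "unique" in c.get("tests", [])]
    if unique:
        return unique[:1]

    return None  # nothing found; value-level diffing would not be possible


model = {
    "name": "orders",
    "columns": [
        {"name": "order_id", "tests": ["unique", "not_null"]},
        {"name": "amount"},
    ],
}
print(find_primary_key(model))
```

A check across a whole project would then amount to flagging every model for which such a lookup returns nothing.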
Quickly navigate to columns using Go To search bar in Diff UI
Now you can quickly jump to any column in the Diff Values tab which can be helpful when diffing especially wide tables:

Run Data Diffs only with the Datafold label
There are situations where you don't want to run Data Diff in your CI unconditionally. Running it on every change is the recommended way to make sure that you don't let any unintended changes slip through. As with unit and integration tests in CI, you generally don't want to disable the checks, since something is then likely to break without you knowing it.
When you're integrating Data Diff, you sometimes want to try it on a select number of changes. This is why we added a new option to the CI integration:

When this box is checked, opening a new pull request won't start a Data Diff right away. The actual diff starts once the Datafold label is set in GitHub/GitLab.

Improvements for Postgres data sources
Postgres has a feature that lets the currently logged-in user switch to holding only the privileges of a selected role. This is done using the <span class="code">SET ROLE</span> command. <span class="code">SET ROLE</span> effectively drops all the privileges assigned directly to the session user and to the other roles it is a member of, leaving only the privileges available to the named role. This is now implemented for both PostgreSQL and PostgreSQL Aurora as an extra optional parameter in the data source configuration.

For Aurora PostgreSQL data sources, we've also added an optional keep-alive setting that will allow you to turn on keep-alives for very long running queries. This is a parameter specified in seconds. Leave the option empty to disable keep alives.
Tooltips added to data source fields to avoid confusion
To provide more context for the options available on the data source configuration screen, we have added tooltips. We hope this makes configuration a little easier, without switching back and forth to our documentation pages.

Optimization for GraphQL
Our new GraphQL API is also becoming more mature. We applied a performance optimization for loading database and schema info. Previously the tables had to be loaded first; database and schema info can now be queried separately.
Bug fixes
We have also shipped several bug fixes:
- Fixes bug where a CI configuration could not be created without the require_label set
- Fixes selected suggestion id flashing in search autocomplete
- Fixes page size navigation in the Data Diff's Values tab
- Fixes error that was thrown when empty sampling results arrived in the Table Profile sample tab
- Fixes the frontend being flooded with 500 errors when alert estimates encountered an error
- Fixes sampling table not being re-rendered when new results come in after reload
Improved messaging on the GitHub integration
This update is based on customer requests for more meaningful feedback during the Data Diff process. We added more information to the GitHub statuses shown while a Data Diff runs:

For example, we include the git hash of the job that it is waiting for. After the job starts, it will show a link to the actual job:

This can be either the job building the pull-request or the main branch. This helps to understand what’s going on when running the Data Diff, and what it is waiting for.
datafold-sdk upload-and-wait
The datafold-sdk synchronizes information from a dbt run into Datafold. Datafold extracts the table and column information and uses it for Data Diff when running on a pull request.
It is common practice to clean up the tables after a run on a pull request has finished. But Datafold might still need these tables to run the Data Diff. For this we have the upload-and-wait command. Instead of starting the Data Diff asynchronously, it blocks until the Data Diff completes. This makes sure that you don’t drop the tables before the Data Diff has finished.
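The difference between fire-and-forget and blocking can be sketched as a simple polling loop. This is not the datafold-sdk code: `wait_for_diff`, the `get_status` callable, and the state names `done` and `failed` are all hypothetical and only illustrate the idea.

```python
import time


def wait_for_diff(get_status, poll_seconds=5.0, timeout_seconds=3600.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Block until get_status() reports a terminal state, then return it.

    Hypothetical helper: `get_status` is any callable returning a string;
    the state names "done" and "failed" are illustrative.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("done", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError("Data Diff did not finish in time")
        sleep(poll_seconds)
```

A CI job would run its table cleanup step only after a call like this returns, so the tables still exist while the diff is reading them.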
Catalog support for dbt sources and seeds
Datafold works seamlessly with dbt. With the latest version of Datafold, we support synchronizing the metadata from dbt’s sources and seeds. Sources are tables that are external to dbt, often tables in the landing zone. When declaring a source, you can annotate it with additional information, which is also synchronized to Datafold.

Smart scheduler
New Smart Scheduler service to manage data source concurrency when scheduling table profiling tasks.

We’ve implemented a new scheduler that we call the smart scheduler. Most users know that certain tasks can impose some load on the data warehouse. The smart scheduler gives us more control over which tasks run at the same time, resulting in a more predictable load. We built this together with our Redshift users because Redshift doesn’t handle concurrency very well. The scheduler provides a way to run the tasks gently.
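One common way to cap concurrency per data source is a semaphore. The sketch below shows that general technique under stated assumptions; it is not the smart scheduler itself, and `run_with_limit` and `fake_profile` are hypothetical names.

```python
import asyncio


async def run_with_limit(tasks, max_concurrent):
    """Run coroutine factories with at most `max_concurrent` in flight.

    Returns (results, peak), where peak is the highest number of tasks
    observed running at the same time.
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def guarded(factory):
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            try:
                return await factory()
            finally:
                running -= 1

    results = await asyncio.gather(*(guarded(f) for f in tasks))
    return results, peak


async def fake_profile(table):
    # Stand-in for a profiling query against the warehouse.
    await asyncio.sleep(0.01)
    return f"profiled {table}"
```

Queuing five profiling tasks with a limit of two means the warehouse never sees more than two of them at once, which is the kind of predictable load described above.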
Descriptive errors on profiling errors
It can happen that a query against the data warehouse results in an error. Maybe the database is offline? Maybe the table is huge and the query takes a very long time? Or, as in the example below, there is a divide by zero at runtime. We now show more informative errors when the profiling job fails.

Lineage edges are now hoverable showing source and target nodes, which are highlighted on edge click.
Improved Lineage navigation: when switching central table origin, also switch table for Profiling and Sampling tabs.
Add GraphQL API for lineage
GraphQL is an increasingly popular method for retrieving information. It gives the developer more control over the desired entities and which specific fields they want to access. We now support a GraphQL API for our lineage information. Read more about it in this technical blog.
We’re continuously adding more information to the GraphQL API. For the latest state, please refer to the documentation.
Support dbt_utils for inferring Primary Keys
To run Data Diff, Datafold uses the primary key of the table to see what changed. One popular way of checking this constraint is the unique_combination_of_columns test from dbt_utils. Datafold now detects the use of these tests and infers the primary key from them. This allows you to get started with Data Diff easily. In addition, you can always set the primary key explicitly if desired.
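The inference can be pictured with a small sketch. The `tests` structure below is a simplified, hypothetical stand-in and not the dbt manifest format; `combination_of_columns` is the argument name that the dbt_utils test takes. Preferring the combination test over a single-column `unique` test is an assumption of this example, not documented Datafold behavior.

```python
def infer_primary_key(tests, explicit_key=None):
    """Return the primary key columns for a model, or None if unknown.

    `tests` is a simplified, hypothetical list of dicts describing the
    dbt tests attached to one model.
    """
    if explicit_key:
        # An explicitly configured key always wins over inference.
        return list(explicit_key)
    for test in tests:
        if test["name"] == "dbt_utils.unique_combination_of_columns":
            return list(test["combination_of_columns"])
    for test in tests:
        if test["name"] == "unique":
            return [test["column"]]
    return None
```

A model with no uniqueness tests and no explicit key yields `None`, which is the case where a primary key would have to be set by hand.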

Revamped the signup flow with a new UI to create a better user experience, and simplified dbt configurations.
Data Diff time travel in BigQuery and Snowflake
Time travel is a useful feature of some modern data warehouses that allows querying a table as it was at a particular point in time. Using that feature in combination with Data Diff can be very helpful to detect data drift in a table by diffing it against its older version. When testing changes in prod vs. dev environments, time travel can also help align both environments on the state of source data.
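For reference, the two warehouses expose time travel through different SQL clauses: Snowflake uses `AT(TIMESTAMP => ...)` and BigQuery uses `FOR SYSTEM_TIME AS OF`. The helper below is a hypothetical sketch that builds such queries; it is not how Data Diff generates SQL, and it assumes the table name comes from a trusted source.

```python
from datetime import datetime, timezone


def time_travel_query(dialect, table, as_of):
    """Build a SELECT that reads `table` as it was at `as_of` (aware datetime)."""
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
    if dialect == "snowflake":
        return f"SELECT * FROM {table} AT(TIMESTAMP => '{stamp}'::timestamp_tz)"
    if dialect == "bigquery":
        return f"SELECT * FROM {table} FOR SYSTEM_TIME AS OF TIMESTAMP '{stamp}'"
    raise ValueError(f"time travel not sketched for dialect: {dialect}")
```

Diffing the current table against the result of such a query for an earlier timestamp is one way to spot drift between two points in time.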
GitLab support for Data Diff

Now it’s possible to automate full impact analysis of every PR to ETL code in GitLab repositories. See how a change in the code will impact the data produced in the current and downstream tables.
More information on how to set it up can be found in the docs.
Added support for alerts on scalar values
While the true power of ML-aided alerts comes from monitoring metrics in time, sometimes it may be helpful to check a single value against a set threshold.
Catalog learns about your data from everywhere

Datafold will now automatically populate Catalog with column and table descriptions & tags from dbt, Snowflake, BigQuery, Redshift and other systems, creating a unified view.
Additional descriptions can be added using Datafold’s built-in rich text editor.
Primary keys for dbt models for Data Diff CI integration can now be specified on a table level

- Errors and warnings are now collapsed in GitHub/GitLab comments to avoid bloat
- Improved performance of the Catalog search filter
- Improved handling of dbt Cloud retries: Datafold now retries 4 times after receiving 500 errors from the dbt Cloud service for up to 4 seconds
- Data source log extractor for lineage can now be done on a cron schedule
- Alerts now show the modified at timestamp
- Improved crontab validation: removed once-an-hour restrictions on scheduling
- It is now possible to disable alert query notifications
- Catalog now shows the timestamp when the dataset was last modified
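The retry behavior mentioned in the list above (up to 4 retries on a 500 error, within about 4 seconds) could look roughly like this. The function name is hypothetical, and the even spacing of retries across the time budget is an assumption of the sketch; the changelog does not say how the retries are spaced.

```python
import time


def call_with_retries(request, max_retries=4, budget_seconds=4.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call request() and retry on HTTP 500 within a retry/time budget.

    `request` is any callable returning an HTTP status code. Spacing the
    retries evenly across the budget is an assumption of this sketch.
    """
    delay = budget_seconds / max_retries
    deadline = clock() + budget_seconds
    status = request()
    retries = 0
    while status == 500 and retries < max_retries and clock() < deadline:
        sleep(delay)
        status = request()
        retries += 1
    return status
```

If the service keeps returning 500, the caller gets the last 500 back after the retries are used up and can report the failure.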
Customizable Tags

Since tags became a really popular way to document tables, columns, and alerts in Catalog, many of you have requested a better way to manage them including the ability to customize their color to enhance readability. Now all tags can be created, edited and deleted in the Settings menu.
Improvements
- Improved profiler reliability
Interactive external dependencies

Lineage graphs can often get very complex and messy with all dependencies plotted at once. That’s why by default, Datafold shows a slice of the full lineage graph centered on a particular table (“dim_businesses” in the image below). That means that the graph will show tables and columns directly upstream or downstream of the chosen table.
At the same time, downstream tables (“report_hourly_bysiness_pageviews”) may have other upstream dependencies unrelated to the table on which the lineage view is centered. To avoid bloat, those dependencies are shown as dashed lines. Clicking on them will center the lineage graph on the chosen table.
Per-column Data Diff Tolerances

Sometimes it may be helpful to compare columns with a threshold instead of strict equality. For instance, when a database column is a FLOAT computed as a division of aggregates (e.g. COUNT(*) / SUM(someFloatCol)), the results of the computation are not strictly deterministic, resulting in differences that are irrelevant from the business standpoint but would be flagged by diff if strict equality is used: 1.1200033 vs. 1.1200058. Diff tolerance allows you to specify an absolute or relative threshold below which differences in values would be considered equal.
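The two kinds of threshold can be illustrated with a short sketch. The function name and the exact relative formula (scaling by the larger magnitude) are assumptions chosen for the example; Datafold's own definition may differ.

```python
def within_tolerance(a, b, tolerance, mode="absolute"):
    """Return True if a and b differ by no more than the tolerance.

    mode="absolute": |a - b| <= tolerance
    mode="relative": |a - b| <= tolerance * max(|a|, |b|)
    The relative formula is one common convention, assumed here.
    """
    diff = abs(a - b)
    if mode == "absolute":
        return diff <= tolerance
    if mode == "relative":
        return diff <= tolerance * max(abs(a), abs(b))
    raise ValueError(f"unknown mode: {mode}")
```

With the values from the example above, 1.1200033 and 1.1200058 differ by about 0.0000025, so an absolute tolerance of 0.00001 treats them as equal while a tolerance of 0.000001 still flags them.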
Tags autocomplete
When entering tags, you can rely on autocomplete to avoid creating semantically similar tags:
Improvements
- Fixed a bug that prevented admins from sending password reset emails
- "Discourage manual profiling" flag added to data source settings. If the flag is set, when the user tries to refresh a data profile, a warning popup will appear.
- Fixed saving datasources and CI integrations with empty cron schedule.
- On-prem deployments now require an install password at first install used to check the state of the CI process.
New Data Diff UI & Landing Page

Streamlined UI with more settings
Improvements
- The application now posts update messages when waiting for dbt runs to finish.
- Added an API endpoint to get the status of CI runs. It can be used to check the state of a CI process.
- Use standard notation for crontab format
- Fixed a bug where the dbt meta schedule stopped working
New application root page

- The CI config ID is now visible in the CI settings screen
- Allow using the dbt CLI to post the manifests to Datafold, so that Datafold can run diffs in a similar way as in the dbt Cloud integration
- Documentation is now available from the header in the app
- Fixes a bug where the dbt Cloud account number was passed as a string
- The dbt configuration now presents a list of accounts to choose from instead of requiring the account name to be entered manually.
Automatic dbt docs sync to Datafold Catalog

- Fixed a bug where Snowflake timezone-aware fields were compared against timezone-naive instances
- Search: added <span class="code">Select all</span> / <span class="code">Deselect all</span> to data source filter
- Updated loading indication when loading data source schema
- Search: the user is redirected to the <span class="code">/search</span> page when no results are found in <span class="code">as-you-type</span> mode
- Updated usage of URL params for search
- Search: tree and sider are now responsive (expand if schema names don't fit into width)
- Updated scrolling UX
- Profiler: removed <span class="code">experimental_</span> guards from new profiling and sampling UIs
- Profiler: fixed an issue with DATE & DATETIME for Snowflake table profiles
- Lineage: fixed hanging PostgreSQL query due to query planner misoptimization
- Lineage: hotfix for Snowflake + dbt
- Lineage: multiple small bugfixes
- Lineage: support for Snowflake semistructured data
- Lineage: fixed a bug where some parts of the graph were not displayed
- Profiling: bugfix in settings
- Data Diff: fixed handling of <span class="code">time</span> datatype
- Data Diff: soft-fail on <span class="code">inf</span> and <span class="code">NaN</span> float values
- Made sure that CI data diffs are resilient to server-side interruptions
- Correctly display arrays and maps in profiler sample
- Several bugfixes in lineage UI
- Fixes in the color scheme
- Added support for incremental SQL log fetching to build column-level lineage
- Several fixes in the lineage query parser

Incremental Column-level Lineage
Instead of querying the entire SQL query history, Datafold now looks at only new queries and updates the lineage graph incrementally. This currently works for Snowflake and BigQuery.
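The incremental approach can be sketched with a watermark: remember the timestamp of the newest query already processed and only parse queries after it. Everything in this example is a hypothetical stand-in, including `update_lineage`, the shape of `query_log`, and the `extract_edges` parser callable.

```python
def update_lineage(edges, watermark, query_log, extract_edges):
    """Add lineage edges from queries newer than `watermark`.

    `query_log` is an iterable of (timestamp, sql) pairs and
    `extract_edges` a callable returning (source, target) column pairs
    for one SQL statement; both are hypothetical stand-ins.
    Returns the new watermark.
    """
    newest = watermark
    for timestamp, sql in query_log:
        if timestamp <= watermark:
            # Already processed in an earlier run; skip without parsing.
            continue
        edges.update(extract_edges(sql))
        newest = max(newest, timestamp)
    return newest
```

Each run then costs work proportional to the number of new queries rather than the full query history.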
Faster Column Profiler
Now supports browsing super-wide (100+ column) tables without any interface lag.