January 12, 2022
Data Diffs without primary keys
Now you can run data diffs without specifying primary keys to compare table schemas and column profiles. Specifying primary keys is required for value-level comparison.
- GitLab CI integrations now respect the file ignore lists (previously, it was supported only for GitHub)
- Improved filters autocomplete performance
- Alert deletion could sometimes be slow or time out
- An unnecessary expand icon in the data source tree filter is not shown anymore
- UI could break if you had more than 500 tags in the organization
December 29, 2021
Data Diff improvements
Sum and Average diff metrics
Data Diff now also compares sums and averages for numerical columns, which can be helpful for analyzing changes in distributions:
Improved handling of long values
When browsing value-level diffs, overflowing values can be explored and compared by hovering over them. The long values can now be copied to clipboard for further analysis.
Ignoring certain files in Data Diff CI
A new setting for CI integrations allows users to selectively ignore files modified in a PR and skip running Datafold for irrelevant changes. Files can be excluded, re-included, and re-excluded again, allowing complex patterns for cases like “only run data diffs if any dbt files have changed, except for the .txt and .md files in that folder”.
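The exclude / re-include / re-exclude ordering can be sketched as follows. This is an illustration only: the pattern syntax (glob patterns, with a leading `!` marking a re-inclusion, as in `.gitignore`) and the function names are assumptions, not Datafold's documented format. In this sketch, the last pattern that matches a file decides whether it is ignored.

```python
from fnmatch import fnmatch

def is_ignored(path, patterns):
    """Apply patterns in order; the last matching pattern wins."""
    ignored = False
    for pattern in patterns:
        if pattern.startswith("!"):
            # Assumed syntax: a leading "!" re-includes matching files.
            if fnmatch(path, pattern[1:]):
                ignored = False
        elif fnmatch(path, pattern):
            ignored = True
    return ignored

def should_run_diff(changed_files, patterns):
    """Run the diff only if at least one changed file is not ignored."""
    return any(not is_ignored(f, patterns) for f in changed_files)

# Hypothetical rules: ignore everything, re-include the dbt folder,
# then re-exclude .md and .txt files inside it.
PATTERNS = ["*", "!dbt/*", "dbt/*.md", "dbt/*.txt"]
```

With these rules, a PR that touches only `dbt/README.md` would be skipped, while a PR that also touches `dbt/models/orders.sql` would trigger a diff.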
Original SQL queries
You can now see the SQL query that was used to create/update a table or refresh a BI report in both the Datafold Catalog and Lineage views:
BI report filtering
BI reports in Lineage can now be filtered by popularity and freshness:
Mode dashboard previews
You can see a preview screenshot for any Mode report on the Datafold Lineage graph:
- Timepicker in Alerts schedule now has a correct “Now” button that converts current time to UTC using the time zone from the browser
- Now you can use Cmd/Ctrl + Click to open a data diff or an alert in a new tab
- You can now see “Last run datetime” in the list of alerts.
- SQL queries are now visible again in Profile and Lineage for tables
- Multiple lineage UI improvements
December 13, 2021
New Datafold Slack App and alert subscriptions
Adding Slack channel destinations is easier with the new Slack App. Users can subscribe to alerts and get mentioned in the designated channels allowing for more targeted alerting and collaborative incident resolution. Documentation is available here.
Single sign-on through Okta
Single sign-on through Okta is now available for Datafold Cloud.
Datafold <> Mode Integration now in beta
Mode reports are now discoverable through Datafold Catalog and appear in Lineage which enables tracing data flows on a field level all the way to Mode reports and dashboards. Let us know if you would like to enable it for your account.
- Fix for faux-off-chart-deps in Lineage
- Added a UTC notation to Last Run in Catalog results
- Row counts in Diff now take time travel specifiers into account
- Improved refreshes for the GitHub app to use the app authentication token instead of user to server token
- Added the database name to all Redshift and PostgreSQL tables. This allows for use of dbt integration for those databases, and lineage in case of Redshift if cross-database queries are used in the ETL process.
November 29, 2021
Diffing for advanced data types
Data Diff can now compare Snowflake's VARIANT and ARRAY types. Profiling information won't be generated for those columns, but they will show up in overall statistics, and in the Values tab. Previously VARIANT and ARRAY types were ignored during comparisons.
Improved diff sampling
When comparing tables (for example, Staging and Prod versions of your dbt model), Data Diff provides a sample of divergent values for every column that doesn’t fully match between tables. Previously Diff would select ~15 rows for every column that had differences. If there were just a few such columns, the overall sample size could be quite small. The algorithm now selects ~1,000 rows regardless of the number of columns that are different.
- Fixed an issue where the “$” character was not accepted in a password
- Improved integer formatting throughout the app
- Improved performance in the Catalog search input
- Fixed 5+ smaller UI issues
November 20, 2021
Mode reports in Lineage & Catalog
Mode is now available as an integration in Datafold in alpha testing mode. Once enabled, Datafold will index all reports in your Mode account to make them available in the Datafold Catalog search and Lineage.
You can now discover relevant Mode reports alongside datasets in the same search experience. It’s also possible to filter Mode reports based on popularity and freshness.
You can trace field-level data lineage to Mode reports in the Datafold Lineage view to see which tables and columns feed which report, making it easy to refactor and troubleshoot issues:
New Jobs UI
With the new Jobs UI you can check what tasks are currently running in your Datafold account and easily troubleshoot various integrations such as Diff in CI as well as audit the use of Datafold.
Fixed displaying of Alert schedules when an hourly interval is selected.
November 11, 2021
Automatic inference of primary keys for dbt models + CLI tool to check primary key settings for Data Diff
For Data Diff to work in CI, it needs to know the primary key for each table it analyzes. Datafold provides a few options for defining primary keys in the dbt model:
- Define it as meta.primary_key in dbt YAML
- Define it as a table or column-level tag in dbt YAML
- Automatically infer primary keys based on uniqueness tests
To help you ensure that Data Diff can look up or infer primary keys for all tables in your dbt project, we added the check-primary-keys command to the Datafold CLI.
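The three options above can be pictured as a lookup with fallbacks. The sketch below is not the check-primary-keys implementation: the model dictionaries are a simplified, hypothetical stand-in for parsed dbt YAML, the tag name `primary-key` is an assumption, and the order in which the options are tried is also assumed.

```python
def find_primary_key(model):
    """Return the primary key columns for a dbt model dict, or None."""
    # 1. Explicit meta.primary_key on the model.
    meta_pk = model.get("meta", {}).get("primary_key")
    if meta_pk:
        return [meta_pk] if isinstance(meta_pk, str) else list(meta_pk)

    columns = model.get("columns", [])

    # 2. Column-level tag (the tag name "primary-key" is an assumption).
    tagged = [c["name"] for c in columns if "primary-key" in c.get("tags", [])]
    if tagged:
        return tagged

    # 3. Fall back to a column that carries a uniqueness test.
    for column in columns:
        if "unique" in column.get("tests", []):
            return [column["name"]]
    return None

def models_missing_primary_key(models):
    """Names of models for which no primary key can be found or inferred."""
    return [m["name"] for m in models if find_primary_key(m) is None]

# Hypothetical project with one model per case.
MODELS = [
    {"name": "orders", "meta": {"primary_key": "order_id"}},
    {"name": "users", "columns": [{"name": "id", "tests": ["unique", "not_null"]}]},
    {"name": "events", "columns": [{"name": "payload"}]},
]
```

Here `models_missing_primary_key(MODELS)` reports only `events`, which is the kind of gap a pre-flight check is meant to surface before Data Diff runs in CI.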
Quickly navigate to columns using Go To search bar in Diff UI
Now you can quickly jump to any column in the Diff Values tab which can be helpful when diffing especially wide tables:
October 14, 2021
Run Data Diffs only with the Datafold label
There are situations where you don't want to run Data Diff in your CI unconditionally. Running it on every change is the recommended way to make sure that no unintended changes slip through. As with unit and integration tests in CI, disabling the checks is risky, because a change can then break something without you knowing it.
When you're integrating Data Diff, you sometimes want to try it on a select number of changes. This is why we added a new option to the CI integration:
With this box checked, opening a new pull request won't start a Data Diff right away. The actual diff starts once the Datafold label is set on the pull request in GitHub/GitLab.
Improvements for Postgres data sources
Postgres lets a logged-in user switch to a selected role and acquire only the privileges of that role. This is done using the SET ROLE command. SET ROLE effectively drops all the privileges assigned directly to the session user and to the other roles it is a member of, leaving only the privileges available to the named role. This is now implemented for both PostgreSQL and Aurora PostgreSQL as an extra optional parameter in the data source configuration.
For Aurora PostgreSQL data sources, we've also added an optional keep-alive setting that allows you to turn on keep-alives for very long-running queries. This parameter is specified in seconds. Leave the option empty to disable keep-alives.
Tooltips added to data source fields to avoid confusion
To provide more context for the options available in the data source configuration screen, we have added tooltips. We hope this makes configuration a little easier, without switching back and forth between the app and our documentation pages.
Optimization for GraphQL
Our new GraphQL API is also becoming more mature. We applied a performance optimization for loading database and schema info. Previously it was required to load the tables first, but those can now be queried separately.
We have also added a couple of bug fixes:
- Fixes bug where a CI configuration could not be created without the require_label set
- Fixes selected suggestion id flashing in search autocomplete
- Fixes page size navigation in the Data Diff's Values tab
- Fixes error that was thrown when empty sampling results arrived in the Table Profile sample tab
- Fixes the frontend flooded with 500 errors when alert estimates encountered an error
- Fixes sampling table not being re-rendered when new results come in after reload
October 2, 2021
Improved messaging on the GitHub integration
This update is based on customer requests for more meaningful feedback during the Data Diff process. We added more information to the GitHub statuses shown while Data Diff is running:
For example, we include the git hash of the job that it is waiting for. After the job starts, it will show a link to the actual job:
This can be either the job building the pull-request or the main branch. This helps to understand what’s going on when running the Data Diff, and what it is waiting for.
September 20, 2021
The datafold-sdk is used to synchronize information from a dbt run into Datafold. Datafold extracts the table and column information and uses it for Data Diff when running on a pull request.
It is common practice to clean up the tables after a run on a pull request has finished. But Datafold might still need these tables to run the Data Diff. Therefore, we provide the upload-and-wait command. Instead of starting the Data Diff asynchronously, it blocks until the Data Diff completes. This makes sure that you don’t drop the tables before the Data Diff has finished.
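The difference between starting a diff asynchronously and waiting for it can be sketched as a polling loop. This is a generic illustration of the blocking pattern, not how upload-and-wait is implemented: `get_status`, the status names, and the default intervals are all hypothetical.

```python
import time

TERMINAL_STATES = {"done", "failed"}  # hypothetical status names

def wait_for_diff(get_status, poll_interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Block until get_status() reports a terminal state, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Data Diff did not finish in time")
        sleep(poll_interval)
```

In a CI script following this pattern, the step that drops the pull request's tables would run only after `wait_for_diff` returns.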
Catalog support for dbt sources and seeds
Datafold works seamlessly with dbt. With the latest version of Datafold, we support synchronizing the metadata from dbt’s sources and seeds. Sources are tables that are external to dbt, often tables in the landing zone. When declaring a source, you can annotate it with additional information, which is also synchronized to Datafold.
New Smart Scheduler service to manage data source concurrency when scheduling table profiling tasks.
We’ve implemented a new scheduler that we call the smart scheduler. Most users know that certain tasks can impose some load on the data warehouse. The smart scheduler gives us more control over which tasks run at the same time, resulting in a more predictable load. We built it together with our Redshift users because Redshift doesn’t handle concurrency very well, and it provides a gentler way to run these tasks.
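One common way to keep warehouse load predictable is to cap how many tasks run at once against each data source. The sketch below shows that general idea with a semaphore per source. It is not Datafold's scheduler: the class name, the limit of 2, and the simulated profiling task are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class PerSourceLimiter:
    """Cap how many tasks run at once against each data source."""

    def __init__(self, max_concurrent):
        self._max = max_concurrent
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore(self, source):
        # Create one semaphore per data source on first use.
        with self._lock:
            if source not in self._semaphores:
                self._semaphores[source] = threading.Semaphore(self._max)
            return self._semaphores[source]

    def run(self, source, task):
        # Blocks while the source already has max_concurrent tasks running.
        with self._semaphore(source):
            return task()

def peak_concurrency(n_tasks=8, limit=2):
    """Submit n_tasks at once and report how many ever ran simultaneously."""
    limiter = PerSourceLimiter(limit)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def profiling_task():
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # simulated warehouse query
        with lock:
            state["active"] -= 1

    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        futures = [pool.submit(limiter.run, "redshift", profiling_task)
                   for _ in range(n_tasks)]
        for future in futures:
            future.result()
    return state["peak"]
```

Even though eight tasks are submitted together, `peak_concurrency()` never exceeds the configured limit, so the warehouse sees a bounded number of queries at a time.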
Descriptive errors on profiling errors
It can happen that a query against the data warehouse results in an error. Maybe the database is offline? Maybe the table is huge and the query takes a very long time? Or, as in the example below, a division by zero occurs at runtime. We now show more informative errors when the profiling job fails.
Lineage edges are now hoverable showing source and target nodes, which are highlighted on edge click.
Improved Lineage navigation: when switching central table origin, also switch table for Profiling and Sampling tabs.
September 8, 2021
Add GraphQL API for lineage
GraphQL is an increasingly popular method for retrieving information. It gives the developer more control over the desired entities and which specific fields they want to access. We now support a GraphQL API for our lineage information. Read more about it in this technical blog.
We’re continuously adding more information to the GraphQL API. For the latest state, please refer to the documentation.
Support dbt_utils for inferring Primary Keys
When running Data Diff, Datafold uses the primary key of the table to see what changed. One popular way of checking this constraint is the unique_combination_of_columns test from dbt_utils. Datafold now detects the use of these tests and infers the primary key from them. This allows you to easily get started with Data Diff. You can still set the primary key explicitly if desired.
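As a rough sketch of what such inference can look like, the function below scans a model's tests for a `dbt_utils.unique_combination_of_columns` entry and returns its `combination_of_columns` list as a composite key. The input is a hypothetical, already-parsed version of the model's `tests` section, and the function is illustrative rather than Datafold's actual logic.

```python
TEST_NAME = "dbt_utils.unique_combination_of_columns"

def infer_composite_key(model_tests):
    """Return the column list of the first matching test, or None."""
    for test in model_tests:
        # Tests with arguments appear as single-key dicts in parsed dbt YAML;
        # tests without arguments appear as plain strings and are skipped.
        if isinstance(test, dict) and TEST_NAME in test:
            return list(test[TEST_NAME]["combination_of_columns"])
    return None

# Hypothetical parsed tests for a monthly revenue model.
TESTS = [
    "some_custom_test",
    {TEST_NAME: {"combination_of_columns": ["month", "product"]}},
]
```

For this input, the inferred key is the pair of columns `month` and `product`.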
Revamped the signup flow with new UI to create better user experience, and simplified dbt configurations.
August 20, 2021
Data Diff time travel in BigQuery and Snowflake
Time travel is a useful feature of some modern data warehouses that allows querying a table as it was at a particular point in time. Using that feature in combination with Data Diff can be very helpful to detect data drift in a table by diffing it against its older version. When testing changes in prod vs. dev environments, time travel can also help align both environments on the state of source data.
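For reference, the two warehouses express time travel with different SQL clauses: Snowflake uses `AT(TIMESTAMP => ...)` and BigQuery uses `FOR SYSTEM_TIME AS OF`. The helper below builds such queries as strings. It is a sketch of the warehouse syntax only; the function name and table name are hypothetical, and how Datafold accepts time travel specifiers is not shown here.

```python
def time_travel_query(table, timestamp, dialect):
    """Build a SELECT that reads `table` as of `timestamp` (ISO 8601 string).

    Plain string formatting is used for readability; do not pass
    untrusted input to a helper like this.
    """
    if dialect == "snowflake":
        return f"SELECT * FROM {table} AT(TIMESTAMP => '{timestamp}'::timestamp_tz)"
    if dialect == "bigquery":
        return f"SELECT * FROM {table} FOR SYSTEM_TIME AS OF TIMESTAMP '{timestamp}'"
    raise ValueError(f"time travel sketch not defined for dialect: {dialect}")

# Hypothetical table and point in time.
AS_OF = "2021-08-19T00:00:00+00:00"
snowflake_sql = time_travel_query("analytics.orders", AS_OF, "snowflake")
bigquery_sql = time_travel_query("analytics.orders", AS_OF, "bigquery")
```

Diffing the current table against the result of such a query is one way to compare a table with its own older version.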
August 12, 2021
GitLab support for Data Diff
Now it’s possible to automate full impact analysis of every PR to ETL code in GitLab repositories. See how a change in the code will impact the data produced in the current and downstream tables.
More information on how to set it up can be found in the docs.
Added support for alerts on scalar values
While the true power of ML-aided alerts comes from monitoring metrics in time, sometimes it may be helpful to check a single value against a set threshold.
August 5, 2021
Catalog learns about your data from everywhere
Datafold will now automatically populate Catalog with column and table descriptions & tags from dbt, Snowflake, BigQuery, Redshift and other systems, creating a unified view.
Additional descriptions can be added using Datafold’s built-in rich text editor.
July 27, 2021
Primary keys for dbt models for Data Diff CI integration can now be specified on a table level
- Errors and warnings are now collapsed in GitHub/GitLab comments to avoid bloat
- Improved performance of the Catalog search filter
- Improved handling of dbtCloud retries: Datafold now retries 4 times after receiving 500 errors from the dbtCloud service for up to 4 seconds
- Data source log extractor for lineage can now be done on a cron schedule
- Alerts now show the modified at timestamp
- Improved crontab validation: removed once-an-hour restrictions on scheduling
- It is now possible to disable alert query notifications
- Catalog now shows the timestamp when the dataset was last modified
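The dbtCloud retry behavior listed above can be pictured with a small retry loop. This sketch reads "4 times ... for up to 4 seconds" as four retries whose waits add up to four seconds, spread evenly; that reading, the `ServerError` class, and the function name are assumptions rather than a description of Datafold's code.

```python
import time

class ServerError(Exception):
    """Stand-in for an HTTP 500 response from the remote service."""

def call_with_retries(request, max_retries=4, max_total_wait=4.0, sleep=time.sleep):
    """Call request(); on ServerError, retry up to max_retries times.

    The waits are spread evenly so that they add up to max_total_wait seconds.
    """
    delay = max_total_wait / max_retries
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ServerError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            sleep(delay)
```

A request that fails a few times and then succeeds returns normally, while a service that stays down raises after the initial call plus four retries.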
July 16, 2021
Since tags became a really popular way to document tables, columns, and alerts in Catalog, many of you have requested a better way to manage them including the ability to customize their color to enhance readability. Now all tags can be created, edited and deleted in the Settings menu.
- Improved profiler reliability
June 29, 2021
Interactive external dependencies
Lineage graphs can often get very complex and messy with all dependencies plotted at once. That’s why by default, Datafold shows a slice of the full lineage graph centered on a particular table (“dim_businesses” in the image below). That means that the graph will show tables and columns directly upstream or downstream of the chosen table. At the same time, downstream tables (“report_hourly_bysiness_pageviews”) may have other upstream dependencies unrelated to the table on which the lineage view is centered. To avoid bloat, those dependencies are shown as dashed lines. Clicking on them will center the lineage graph on the chosen table.
May 28, 2021
Per-column Data Diff Tolerances
Sometimes it may be helpful to compare columns with a threshold instead of strict equality. For instance, when a database column is a FLOAT computed as a division of aggregates (e.g. COUNT(*) / SUM(someFloatCol)), the results of the computation are not strictly deterministic, resulting in differences that are irrelevant from the business standpoint but would be flagged by diff if strict equality is used: 1.1200033 vs. 1.1200058. Diff tolerance allows you to specify an absolute or relative threshold below which differences in values would be considered equal.
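A minimal sketch of the two tolerance modes, using the values from the example above. The function name is hypothetical, and the definition of the relative mode (difference divided by the larger magnitude of the two values) is an assumption, since the exact formula is not given here.

```python
def within_tolerance(a, b, tolerance, mode="absolute"):
    """Return True if a and b should be treated as equal under the tolerance."""
    diff = abs(a - b)
    if mode == "absolute":
        return diff <= tolerance
    if mode == "relative":
        # Assumed definition: compare against the larger magnitude.
        scale = max(abs(a), abs(b))
        return diff <= tolerance * scale
    raise ValueError(f"unknown tolerance mode: {mode}")

# The two values differ by about 0.0000025.
OLD, NEW = 1.1200033, 1.1200058
strict_equal = OLD == NEW
equal_abs = within_tolerance(OLD, NEW, 1e-5)
equal_rel = within_tolerance(OLD, NEW, 1e-5, mode="relative")
still_different = within_tolerance(OLD, NEW, 1e-6)
```

With strict equality the pair is flagged as different. With an absolute or relative tolerance of 0.00001 it is treated as equal, while a tighter tolerance of 0.000001 still reports a difference.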
When entering tags, you can rely on autocomplete to avoid creating semantically similar tags:
- Fixed a bug that prevented admins from sending password reset emails
- "Discourage manual profiling" flag added to data source settings. If the flag is set, when the user tries to refresh a data profile, a warning popup will appear.
May 18, 2021
Fixed saving datasources and CI integrations with empty cron schedule.
May 13, 2021
On-prem deployments now require an install password at first install.
May 7, 2021
New Data Diff UI & Landing Page
Streamlined UI with more settings
- The application now posts update messages when waiting for dbt runs to finish.
- Added an API endpoint to get status of CI runs. It can be used to check state of a CI process.
- Use standard notation for crontab format
- Fixed a bug where the dbt meta schedule stopped working
May 1, 2021
New application root page
- The CI config ID is now visible in the CI settings screen
- Allow using the dbt CLI to post the manifests to Datafold, so that Datafold can run diffs in a similar way as in the dbtCloud integration
- Documentation is now available from the header in the app
- Fixes a bug where the dbt cloud account number was passed as a string
April 23, 2021
The dbt configuration now presents a list of accounts to choose from instead of requiring the account name to be entered manually.
April 20, 2021
Automatic dbt docs sync to Datafold Catalog
- Fixed a bug where Snowflake timezone-aware fields were compared against timezone-naive instances
- Search: added Select all/ Deselect all to data source filter
- Updated loading indication when loading data source schema
- Search: the user is redirected to the /search page when no results are found in as-you-type mode
- Updated usage of URL params for search
- Search: tree and sider are now responsive (expand if schema names don't fit into width)
- Updated scrolling UX
- Profiler: removed experimental_ guards from new profiling and sampling UIs
- Profiler: fixed an issue with DATE & DATETIME for Snowflake table profiles
April 15, 2021
- Lineage: fixed hanging PostgreSQL query due to query planner misoptimization
- Lineage: hotfix for Snowflake + dbt
- Lineage: multiple small bugfixes
April 12, 2021
- Lineage: support for Snowflake semistructured data
- Lineage: fixed a bug where some parts of the graph were not displayed
- Profiling: bugfix in settings
- Data Diff: fixed handling of time datatype
- Data Diff: soft-fail on inf and NaN float values
- Made sure that CI data diffs are resilient to server-side interruptions
April 9, 2021
- Correctly display arrays and maps in profiler sample
- Several bugfixes in lineage UI
- Fixes in the color scheme
- Added support for incremental SQL log fetching to build column-level lineage
- Several fixes in the lineage query parser
April 7, 2021
Incremental Column-level Lineage
Instead of querying the entire SQL query history, Datafold now looks at only new queries and updates the lineage graph incrementally. This currently works for Snowflake and BigQuery.
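The incremental approach can be sketched with a watermark: remember the timestamp of the newest query already processed and, on each update, add edges only for queries that finished after it. The log entry shape (already-parsed `sources`, `target`, and `finished_at` fields) and the function name are hypothetical, and real lineage extraction also involves parsing the SQL itself.

```python
def update_lineage(graph, query_log, last_seen):
    """Add edges from queries newer than last_seen; return the new watermark."""
    newest = last_seen
    for entry in query_log:
        if entry["finished_at"] <= last_seen:
            continue  # already processed in an earlier update
        for source in entry["sources"]:
            graph.setdefault(source, set()).add(entry["target"])
        newest = max(newest, entry["finished_at"])
    return newest

# Hypothetical log: integer timestamps stand in for real ones.
LOG = [
    {"finished_at": 1, "sources": ["raw.orders.id"], "target": "stg.orders.id"},
    {"finished_at": 2, "sources": ["stg.orders.id"], "target": "dim.orders.id"},
]
```

Calling `update_lineage` again with the returned watermark and the same log leaves the graph unchanged, which is what keeps repeated updates cheap.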
Faster Column Profiler
Now supports browsing super-wide (100+ col) tables without any interface lags.