Changelog

v1.22.0

October 2, 2021

Improved messaging on the GitHub integration

This update responds to customer requests for more meaningful feedback during the Data Diff process. We added more information to the GitHub statuses that are posted while the Data Diff runs.

For example, the status now includes the git hash of the job that the Data Diff is waiting for. After the job starts, the status shows a link to the actual job.

The job can be the one building either the pull request or the main branch. This makes it easier to understand what is happening while the Data Diff runs, and what it is waiting for.

v1.21.0

September 20, 2021

datafold-sdk upload-and-wait

The datafold-sdk synchronizes information into Datafold after a dbt run. Datafold extracts the table and column information and uses it for Data Diff when running on a pull request.

It is common practice to clean up the tables after a run on a pull request has finished, but Datafold might still need these tables to run the Data Diff. That is why we added the datafold-sdk upload-and-wait command. Instead of starting the Data Diff asynchronously, it blocks until the Data Diff completes. This ensures that you don’t drop the tables before the Data Diff has finished.
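
The difference between starting a diff asynchronously and blocking until it completes can be sketched as a simple polling loop. This is an illustration only, not the datafold-sdk implementation: the function name wait_for_diff, the get_status callback, and the status strings are all hypothetical.

```python
import time
from typing import Callable


def wait_for_diff(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Block until the diff reaches a terminal status, then return it.

    get_status is a hypothetical callback that returns the current
    status of the diff; "done" and "failed" are assumed terminal states.
    """
    terminal = {"done", "failed"}
    waited = 0.0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError("diff did not finish within the timeout")
        sleep(poll_interval)
        waited += poll_interval
```

A CI script built this way would only reach its cleanup step (dropping the pull request tables) after the call returns.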

Catalog support for dbt sources and seeds

Datafold works seamlessly with dbt. With the latest version of Datafold, we support synchronizing the metadata from dbt’s sources and seeds. Sources are tables that are external to dbt, often tables in the landing zone. When declaring a source, you can annotate it with additional information, which is also synchronized to Datafold.

Smart scheduler

New Smart Scheduler service to manage data source concurrency when scheduling table profiling tasks.

We’ve implemented a new scheduler that we call the smart scheduler. Most users know that certain tasks can impose some load on the data warehouse. The smart scheduler gives us more control over which tasks run at the same time, resulting in a more predictable load. We built it together with our Redshift users, because Redshift doesn’t handle concurrency very well, and it provides a way to run the tasks gently.
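
One common way to keep warehouse load predictable is to cap how many tasks run against a data source at once. The sketch below shows that idea with a semaphore; it is not Datafold's scheduler, and the names ConcurrencyLimiter and profile_table are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyLimiter:
    """Allow at most max_concurrent tasks to run at the same time."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0  # highest number of simultaneous tasks observed

    def run(self, task):
        with self._sem:  # blocks while the limit is reached
            with self._lock:
                self.running += 1
                self.peak = max(self.peak, self.running)
            try:
                return task()
            finally:
                with self._lock:
                    self.running -= 1


def profile_table(name: str) -> str:
    """Stand-in for a profiling query (hypothetical)."""
    time.sleep(0.01)
    return name


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_concurrent=2)
    tables = [f"table_{i}" for i in range(8)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        done = list(pool.map(lambda t: limiter.run(lambda: profile_table(t)), tables))
    print(len(done), "tables profiled, peak concurrency", limiter.peak)
```

Even though eight worker threads are available, the limiter never lets more than two profiling tasks touch the warehouse at once.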

Descriptive errors on profiling errors

It can happen that a query against the data warehouse results in an error. Maybe the database is offline? Maybe the table is huge and the query takes a very long time? Or, as in the example below, a divide by zero occurs at runtime. We now show more informative errors when the profiling job fails.

Lineage edges are now hoverable showing source and target nodes, which are highlighted on edge click.

Improved Lineage navigation: when switching central table origin, also switch table for Profiling and Sampling tabs.

v1.20.0

September 8, 2021

Add GraphQL API for lineage

GraphQL is an increasingly popular method for retrieving information. It gives the developer more control over the desired entities and which specific fields they want to access. We now support a GraphQL API for our lineage information. Read more about it in this technical blog.
We’re continuously adding more information to the GraphQL API. For the latest state, please refer to the documentation.

Support dbt_utils for inferring Primary Keys

To run a Data Diff, Datafold uses the primary key of the table to see what changed. One popular way of checking this constraint is the unique_combination_of_columns test from dbt_utils. Datafold now detects the use of these tests and infers the primary key from them. This allows you to get started with Data Diff easily. In addition, you can always set the primary key explicitly if desired.
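
As a rough illustration of the inference, the sketch below reads one model entry from a dbt schema file (already parsed into a Python dict) and returns the columns listed under combination_of_columns. The function name infer_primary_key is hypothetical, and this is not Datafold's actual detection logic.

```python
from typing import List, Optional

TEST_NAME = "dbt_utils.unique_combination_of_columns"


def infer_primary_key(model: dict) -> Optional[List[str]]:
    """Return the columns of the first unique_combination_of_columns
    test on the model, or None if the model has no such test."""
    for test in model.get("tests", []):
        # Tests with arguments are parsed as single-key dicts;
        # tests without arguments are plain strings and are skipped.
        if isinstance(test, dict) and TEST_NAME in test:
            args = test[TEST_NAME] or {}
            columns = args.get("combination_of_columns")
            if columns:
                return list(columns)
    return None


# Equivalent of a schema.yml model entry after YAML parsing.
model = {
    "name": "orders",
    "tests": [
        {TEST_NAME: {"combination_of_columns": ["order_id", "line_no"]}}
    ],
}
print(infer_primary_key(model))
```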

Revamped the signup flow with a new UI to create a better user experience, and simplified the dbt configuration.

v1.19.0

August 20, 2021

Data Diff time travel in BigQuery and Snowflake

Time travel is a useful feature of some modern data warehouses that allows querying a table as it was at a particular point in time. Using that feature in combination with Data Diff can be very helpful for detecting data drift in a table by diffing it against an older version of itself. When testing changes in prod vs. dev environments, time travel can also help align both environments on the state of the source data.
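
For reference, the two warehouses express time travel with different SQL clauses: Snowflake uses AT(TIMESTAMP => ...) and BigQuery uses FOR SYSTEM_TIME AS OF. The helper below only builds such queries to show the shape of each clause; time_travel_query is a hypothetical name, and this is not how Datafold configures time travel internally.

```python
def time_travel_query(table: str, timestamp: str, dialect: str) -> str:
    """Build a SELECT that reads `table` as of `timestamp`.

    Values are interpolated directly for illustration only; do not
    pass untrusted input to a helper like this.
    """
    if dialect == "snowflake":
        return (
            f"SELECT * FROM {table} "
            f"AT(TIMESTAMP => '{timestamp}'::timestamp_tz)"
        )
    if dialect == "bigquery":
        return (
            f"SELECT * FROM {table} "
            f"FOR SYSTEM_TIME AS OF TIMESTAMP '{timestamp}'"
        )
    raise ValueError(f"time travel is not sketched for dialect {dialect!r}")


print(time_travel_query("analytics.orders", "2021-08-01 00:00:00+00", "snowflake"))
print(time_travel_query("analytics.orders", "2021-08-01 00:00:00+00", "bigquery"))
```

Diffing the current table against the result of such a query is one way to spot drift between two points in time.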

v1.18.8

August 12, 2021

GitLab support for Data Diff

It is now possible to automate full impact analysis of every PR to ETL code in GitLab repositories. See how a change in the code will impact the data produced in the current and downstream tables.

More information on how to set it up can be found in the docs.

Added support for alerts on scalar values

While the true power of ML-aided alerts comes from monitoring metrics in time, sometimes it may be helpful to check a single value against a set threshold.

v1.18.0

August 5, 2021

Catalog learns about your data from everywhere

Datafold will now automatically populate Catalog with column and table descriptions & tags from dbt, Snowflake, BigQuery, Redshift and other systems, creating a unified view.

Additional descriptions can be added using Datafold’s built-in rich text editor.

v1.17.1

July 27, 2021

Primary keys for dbt models for Data Diff CI integration can now be specified on a table level

  • Errors and warnings are now collapsed in GitHub/GitLab comments to avoid bloat
  • Improved performance of the Catalog search filter
  • Improved handling of dbtCloud retries: Datafold now retries 4 times after receiving 500 errors from the dbtCloud service for up to 4 seconds
  • Data source log extractor for lineage can now be done on a cron schedule
  • Alerts now show the modified at timestamp
  • Improved crontab validation: removed once-an-hour restrictions on scheduling
  • It is now possible to disable alert query notifications
  • Catalog now shows the timestamp when the dataset was last modified

v1.16.0

July 16, 2021

Customizable Tags

Since tags became a really popular way to document tables, columns, and alerts in Catalog, many of you have requested a better way to manage them including the ability to customize their color to enhance readability. Now all tags can be created, edited and deleted in the Settings menu.

Improvements:

  • Improved profiler reliability

v1.15.0

June 29, 2021

Interactive external dependencies

Lineage graphs can often get very complex and messy with all dependencies plotted at once. That’s why by default, Datafold shows a slice of the full lineage graph centered on a particular table (“dim_businesses” in the image below). That means that the graph will show tables and columns directly upstream or downstream of the chosen table. At the same time, downstream tables (“report_hourly_bysiness_pageviews”) may have other upstream dependencies unrelated to the table on which the lineage view is centered. To avoid bloat, those dependencies are shown as dashed lines. Clicking on them will center the lineage graph on the chosen table.

v1.13.0

May 28, 2021

Per-column Data Diff Tolerances

Sometimes it may be helpful to compare columns with a threshold instead of strict equality. For instance, when a database column is a FLOAT computed as a division of aggregates (e.g. COUNT(*) / SUM(someFloatCol)), the results of the computation are not strictly deterministic. This produces differences that are irrelevant from the business standpoint but would be flagged by the diff if strict equality were used: 1.1200033 vs. 1.1200058. Diff tolerance allows you to specify an absolute or relative threshold below which two values are considered equal.
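
A minimal sketch of a tolerance check, using the two values from the example above. The function name values_equal and its parameters are hypothetical, and the exact comparison rules Datafold applies (for example, whether the boundary is inclusive) are an assumption here.

```python
from typing import Optional


def values_equal(
    a: float,
    b: float,
    abs_tol: Optional[float] = None,
    rel_tol: Optional[float] = None,
) -> bool:
    """Compare two numbers, treating small differences as equal.

    abs_tol: maximum allowed absolute difference.
    rel_tol: maximum allowed difference relative to the larger magnitude.
    With no tolerance set, this falls back to strict equality.
    """
    if a == b:
        return True
    diff = abs(a - b)
    if abs_tol is not None and diff <= abs_tol:
        return True
    if rel_tol is not None and diff <= rel_tol * max(abs(a), abs(b)):
        return True
    return False


old, new = 1.1200033, 1.1200058
print(values_equal(old, new))                # strict equality
print(values_equal(old, new, abs_tol=1e-5))  # absolute tolerance
print(values_equal(old, new, rel_tol=1e-5))  # relative tolerance
```

With strict equality the pair is flagged as different; with either tolerance of 1e-5 the pair is treated as equal, because the values differ by only about 2.5e-6.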

Tags autocomplete

When entering tags, you can rely on autocomplete to avoid creating semantically similar tags.

Improvements:

  • Fixed a bug that prevented admins from sending password reset emails
  • "Discourage manual profiling" flag added to data source settings. If the flag is set, when the user tries to refresh a data profile, a warning popup will appear.

v1.12.8

May 18, 2021

Fixed saving datasources and CI integrations with empty cron schedule.

v1.12.0

May 13, 2021

On-prem deployments now require an install password at first install.

v1.11.0

May 7, 2021

New Data Diff UI & Landing Page

Streamlined UI with more settings

Improvements:

  • The application now posts update messages when waiting for dbt runs to finish.
  • Added an API endpoint to get the status of CI runs. It can be used to check the state of a CI process.
  • Use standard notation for crontab format
  • Fixed a bug where the dbt meta schedule stopped working

v1.11.0

May 1, 2021

New application root page

  • The CI config ID is now visible in the CI settings screen
  • Allow using the dbt CLI to post the manifests to Datafold, so that Datafold can run diffs in a similar way as in the dbtCloud integration
  • Documentation is now available from the header in the app
  • Fixes a bug where the dbt cloud account number was passed as a string

v1.9.2

April 23, 2021

The dbt configuration now presents a list of accounts to choose from, instead of requiring the account name to be entered manually.

v1.9.0

April 20, 2021

Automatic dbt docs sync to Datafold Catalog

  • Fixed a bug where Snowflake timezone-aware fields were compared against timezone-naive instances
  • Search: added Select all/ Deselect all to data source filter
  • Updated loading indication when loading data source schema
  • Search: the user is redirected to the /search page when no results are found in as-you-type mode
  • Updated usage of URL params for search
  • Search: tree and sider are now responsive (expand if schema names don't fit into width)
  • Updated scrolling UX
  • Profiler: removed experimental_ guards from new profiling and sampling UIs
  • Profiler: fixed an issue with DATE & DATETIME for Snowflake table profiles

v1.8.9

April 15, 2021

  • Lineage: fixed hanging PostgreSQL query due to query planner misoptimization
  • Lineage: hotfix for Snowflake + dbt

v1.8.8

April, 2021

Lineage: multiple small bugfixes

v1.8.7

April 12, 2021

  • Lineage: support for Snowflake semistructured data
  • Lineage: fixed a bug where some parts of the graph were not displayed
  • Profiling: bugfix in settings
  • Data Diff: fixed handling of time datatype
  • Data Diff: soft-fail on inf and NaN float values
  • Made sure that CI data diffs are resilient to server-side interruptions

v1.8.6

April 9, 2021

  • Correctly display arrays and maps in profiler sample
  • Several bugfixes in lineage UI
  • Fixes in the color scheme
  • Added support for incremental SQL log fetching to build column-level lineage
  • Several fixes in the lineage query parser

v1.8.5

April 7, 2021

Incremental Column-level Lineage

Instead of querying the entire SQL query history, Datafold now looks only at new queries and updates the lineage graph incrementally. This currently works for Snowflake and BigQuery.
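
The incremental approach can be pictured as keeping a watermark of the last query already processed and parsing only newer entries. The sketch below assumes a query log made of dicts with finished_at and sql fields and a caller-supplied extract_edges parser; all of these names are hypothetical and do not describe Datafold's internals.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source, target)


@dataclass
class LineageState:
    last_seen: float = 0.0  # watermark: newest query already processed
    edges: Set[Edge] = field(default_factory=set)


def update_lineage(
    state: LineageState,
    query_log: Iterable[dict],
    extract_edges: Callable[[str], Set[Edge]],
) -> int:
    """Add edges from queries newer than the watermark.

    Returns the number of queries that were processed in this pass.
    """
    new_queries = [q for q in query_log if q["finished_at"] > state.last_seen]
    for query in new_queries:
        state.edges.update(extract_edges(query["sql"]))
    if new_queries:
        state.last_seen = max(q["finished_at"] for q in new_queries)
    return len(new_queries)
```

Calling update_lineage repeatedly with a growing log only parses the entries added since the previous call, so the cost of each pass depends on the number of new queries rather than on the full history.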

Faster Column Profiler

Now supports browsing super-wide (100+ col) tables without any interface lags.