October 2, 2021
This update is based on customer feedback to have more meaningful feedback in the Data Diff process. We updated more information to the GitHub statuses when running the Data Diff:
For example, we include the git hash of the job that it is waiting for. After the job starts, it will show a link to the actual job:
This can be either the job building the pull-request or the main branch. This helps to understand what’s going on when running the Data Diff, and what it is waiting for.
September 20, 2021
The datafold-sdk is used for synchronizing the information after a dbt run into Datafold. Datafold will extract the table and column information and it is used for Data Diff when running on a pull request.
It is a common practice to clean up the tables after a run on a pull request has ran. But Datafold might need these tables to run the Data Diff. Therefore we have the Datafold upload-and-wait command. Instead of starting the Data Diff asynchronously, it will block for the Data Diff to complete. This makes sure that you don’t drop all the tables before the Data Diff has finished.
Datafold works seamlessly with dbt. With the latest version of Datafold, we support synchronizing the metadata from dbt’s sources and seeds. Sources are tables that are external to dbt, often tables in the landing zone. When declaring a source, you can annotate it with additional information, which is also synchronized to Datafold.
New Smart Scheduler service to manage data source concurrency when scheduling table profiling tasks.
We’ve implemented a new scheduler that we call the smart scheduler. Most users know that certain tasks can impose some load on the data warehouse. This allows us to have more control on the tasks that are running, resulting in a more predictable load. We built this together with our Redshift users because Redshift doesn’t handle concurrency very well. This provides a way to run the tasks in a gentle way.
It can happen that a query against the data warehouse results in an error. Maybe the database is offline? Maybe the table is huge and it takes a very long time? Or in the example, below we’re having a divide by zero at runtime. We now have more informative errors when the profiling job fails.
Lineage edges are now hoverable showing source and target nodes, which are highlighted on edge click.
Improved Lineage navigation: when switching central table origin, also switch table for Profiling and Sampling tabs.
September 8, 2021
GraphQL is an increasingly popular method for retrieving information. It gives the developer more control over the desired entities and which specific fields they want to access. We now support a GraphQL API for our lineage information. Read more about it in this technical blog.
We’re continuously adding more information to the GraphQL API. For the latest state, please refer to the documentation.
For running Datafold, we use the primary key of the table to see what changed. One popular way of checking this constraint is using the unique_combination_of_columns function from dbt_utils. With Datafold we now detect the use of these tests, and infer the primary key from it. This allows you to easily get started with Data Diff. Next to this, you can always set the primary key explicitly if desired.
Revamped the signup flow with new UI to create better user experience, and simplified dbt configurations.
August 20, 2021
Time travel is a useful feature of some modern data warehouses that allows querying table at a particular point in time. Using that feature in combination with Data Diff can be very helpful to detect data drift in a table by diffing it against its older version. When testing changes in prod vs. dev environments, time travel can also help align both environments on the state of source data.
August 12, 2021
Now it’s possible to automate full impact analysis of every PR to ETL code in Gitlab repositories.See how a change in the code will impact the data produced in the current and downstream tables.
More information on how to set it up can be found in the docs.
While the true power of ML-aided alerts comes from monitoring metrics in time, sometimes it may be helpful to check a single value against a set threshold.
August 5, 2021
Datafold will now automatically populate Catalog with column and table descriptions & tags from dbt, Snowflake, BigQuery, Redshift and other systems, creating a unified view.
Additional descriptions can be added using Datafold’s built-in rich text editor.
July 27, 2021
July 16, 2021
Since tags became a really popular way to document tables, columns, and alerts in Catalog, many of you have requested a better way to manage them including the ability to customize their color to enhance readability. Now all tags can be created, edited and deleted in the Settings menu.
June 29, 2021
Lineage graphs can often get very complex and messy with all dependencies plotted at once. That’s why by default, Datafold shows a slice of the full lineage graph centered on a particular table (“dim_businesses” in the image below). That means that the graph will show tables and columns directly upstream or downstream of the chosen table. At the same time, downstream tables (“report_hourly_bysiness_pageviews”) may have other upstream dependencies unrelated to the table on which the lineage view is centered. To avoid bloat, those dependencies are shown as dashed lines. Clicking on them will center the lineage graph on the chosen table.
May 28, 2021
Sometimes it may be helpful to compare columns with a threshold instead of strict equality. For instance, when a database column is a FLOAT computed as a division of aggregates (e.g. COUNT(*) / SUM(someFloatCol)), the results of the computation are not strictly deterministic, resulting in differences that are irrelevant from the business standpoint but would be flagged by diff if strict equality is used: 1.1200033 vs. 1.1200058. Diff tolerance allows you to specify an absolute or relative threshold below which differences in values would be considered equal.
When entering tags, you can rely on autocomplete to avoid creating semantically similar tags:
May 18, 2021
Fixed saving datasources and CI integrations with empty cron schedule.
May 13, 2021
On-prem deployments now require an install password at first install used to check the state of the CI process.
May 7, 2021
Streamlined UI with more settings
May 1, 2021
April 23, 2021
The dbt configuration now presents a list of accounts instead of hardcoding the account name manually.
April 20, 2021
April 15, 2021
Lineage: multiple small bugfixes
April 12, 2021
April 9, 2021
April 7, 2021
Instead of querying the entire SQL query history, Datafold now looks at only new queries and updates the lineage graph incrementally. Currently works for Snowflake and Bigquery.
Now supports browsing super-wide (100+ col) tables without any interface lags.