What is Datafold Cloud?
So you want to know more about Datafold Cloud? We’ve got you covered ✅. In this blog post, we’ll be covering what Datafold Cloud is, general use cases for its core features, and requirements to use Datafold Cloud.
What is Datafold Cloud?
Datafold Cloud is the data quality testing solution for your team’s entire data journey, from migration to deployment. Datafold Cloud builds primarily on data-diff, an open source Python package that runs row-by-row comparisons between two tables within or across your data warehouses (think: git diff, but for tables). In addition, Datafold Cloud supports a range of other features that make data quality testing accessible for your entire team, such as:
- Value-level diffs
- A CI integration for automated testing during deployment
- Column-level lineage that goes beyond your dbt project
- Access to these features in a secure and compliant environment
- …and much more.
Want to know more about how open source data-diff and Datafold Cloud differ? Read up on their offerings here.
Core Datafold Cloud features
As stated earlier, Datafold Cloud’s core features include:
- Value-level diffs
- Continuous integration (CI) diffing and impact analysis
- Robust column-level lineage
- Security and compliance standards
Value-level diffs
One of the cornerstone features of Datafold Cloud is access to value-level diffs. Whereas open source data-diff provides a high-level summary diff (think: the number of schema, row, and column changes), Datafold Cloud lets you look at the specific row values that differ between two tables. This is incredibly useful if you need to understand why your tables differ and by how much, and it lets you filter on the specific primary keys that are affected by your proposed code changes.
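To make the idea of a value-level diff concrete, here is a minimal sketch in plain Python. It is not how Datafold Cloud or data-diff is implemented; the function name `value_level_diff`, the sample `orders` rows, and the column names are all made up for illustration, and the "tables" are small in-memory lists rather than warehouse tables.

```python
from typing import Any

Row = dict[str, Any]


def value_level_diff(prod: list[Row], dev: list[Row], key: str) -> dict[str, Any]:
    """Compare two in-memory tables row by row, matching rows on a primary key."""
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    # Primary keys that exist on only one side.
    only_in_prod = sorted(prod_by_key.keys() - dev_by_key.keys())
    only_in_dev = sorted(dev_by_key.keys() - prod_by_key.keys())

    # For keys present on both sides, record every column whose value differs
    # as a (prod_value, dev_value) pair.
    changed: dict[Any, dict[str, tuple[Any, Any]]] = {}
    for pk in sorted(prod_by_key.keys() & dev_by_key.keys()):
        p, d = prod_by_key[pk], dev_by_key[pk]
        diffs = {
            col: (p.get(col), d.get(col))
            for col in sorted(p.keys() | d.keys())
            if col != key and p.get(col) != d.get(col)
        }
        if diffs:
            changed[pk] = diffs

    return {"only_in_prod": only_in_prod, "only_in_dev": only_in_dev, "changed": changed}


# Hypothetical sample data: a production table and a dev build of the same model.
prod_orders = [
    {"order_id": 1, "amount": 100.0, "status": "paid"},
    {"order_id": 2, "amount": 250.0, "status": "paid"},
    {"order_id": 3, "amount": 75.0, "status": "refunded"},
]
dev_orders = [
    {"order_id": 1, "amount": 100.0, "status": "paid"},
    {"order_id": 2, "amount": 255.0, "status": "paid"},
    {"order_id": 4, "amount": 20.0, "status": "pending"},
]

result = value_level_diff(prod_orders, dev_orders, key="order_id")
print(result)
```

In this toy example, the summary would only tell you that one row changed and that one row exists on each side only. The value-level view tells you that order 2's `amount` moved from 250.0 to 255.0, which is the kind of detail you need to decide whether a change is expected.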
CI diffing and robust impact analysis
Datafold Cloud is the most effective solution for automating and scaling your data diffs, thanks to our native CI integration. With Datafold Cloud, data diffs run automatically when your PR is opened and on each subsequent commit, so you and your team have immediate insight into how these code changes will impact your downstream work. The Datafold CI comment on the PR lets you and your reviewer easily see:
- Your high-level diff summary (schema, column, primary key, and row count changes)
- Potentially impacted downstream dbt models, BI tool assets, and data app assets
- Links to your value-level diffs, so you can investigate at the row-level if needed
Not only does this guarantee that diffs (and data quality tests) run before deployment, but it also gives you and your team a comprehensive view of how your data will change before it enters your production environment.
Column-level lineage
A key differentiator of Datafold Cloud is its enhanced column-level lineage built using the query logs of your data warehouse. With Datafold Cloud’s lineage explorer, you not only have access to column-level lineage for your dbt models, but also any table in your data warehouse, BI tool assets (think: Looker Dashboards and Views), and data app assets (think: Hightouch syncs).
We see Datafold’s column-level lineage view typically used for a number of use cases:
- Quickly identifying the upstream tables and columns used to build a certain dbt model
- Understanding which important downstream assets (such as your CFO’s ARR dashboard) could be impacted by your code change
- Getting a comprehensive view of your data’s ecosystem, from the source tables to the dashboards that power your business
Datafold Cloud’s column-level lineage currently integrates with downstream data apps like Looker, Hightouch, Mode, and (coming soon) Tableau.
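The upstream and downstream questions above are graph traversals at heart. The sketch below shows the general idea with a breadth-first search over a tiny lineage graph. It is an illustration only, not Datafold's lineage engine: the `LINEAGE` edges, the column names, and the `downstream` and `upstream` helpers are all hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns listed as its value.
LINEAGE = {
    "raw.payments.amount": ["stg_payments.amount"],
    "stg_payments.amount": ["fct_revenue.arr"],
    "raw.customers.id": ["stg_customers.customer_id"],
    "stg_customers.customer_id": ["fct_revenue.customer_id"],
    "fct_revenue.arr": ["looker.arr_dashboard.total_arr", "hightouch.crm_sync.arr"],
}


def downstream(edges: dict[str, list[str]], start: str) -> list[str]:
    """Return every column reachable from `start`, in breadth-first order."""
    seen = {start}
    order: list[str] = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(edges: dict[str, list[str]], target: str) -> list[str]:
    """Return every column that feeds `target`, by traversing reversed edges."""
    reversed_edges: dict[str, list[str]] = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream(reversed_edges, target)


# Which assets could a change to stg_payments.amount affect?
impacted = downstream(LINEAGE, "stg_payments.amount")
# Which columns are used to build fct_revenue.arr?
sources = upstream(LINEAGE, "fct_revenue.arr")
print(impacted)
print(sources)
```

Here a change to the hypothetical `stg_payments.amount` column reaches the revenue model, a Looker dashboard field, and a Hightouch sync, which is the sort of impact list you would want to see before merging.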
Security and compliance
One of the primary reasons we see teams adopt Datafold Cloud is to run and scale data diffs in a secure and compliant environment. Datafold Cloud’s SaaS hosted version is HIPAA, GDPR, and SOC2 Type 2 compliant, allowing your team to run your data quality checks in a safe and reliable environment. In addition, for teams that need next-level security and control over their environment, Datafold Cloud offers a virtual private cloud deployment option.
To learn more about Datafold’s security and compliance standards, check out these resources:
Some of our favorite (and underrated!) Datafold Cloud features
Below are some of our other favorite Datafold Cloud features that you might not be aware of:
- Clone diff to data warehouse: With the click of a button, you can materialize your value-level diff results from the Datafold Cloud UI directly as a table in your data warehouse. This allows your team to quickly analyze the diff using SQL and gain a clear understanding of why the data differs. It also lets you keep a historical record of changes directly in your data warehouse if your team needs one.
- REST API access: For teams that need to run data diffs at scale—we’re talking hundreds or thousands of diffs—you can leverage Datafold Cloud’s REST API to run diffs in large batches. This feature is particularly useful for teams undergoing migrations or replicating data across warehouses, so you can compare and validate hundreds of tables across different warehouses in minutes, not days.
- Data app integrations: Datafold Cloud integrates closely with some of the most widely used data apps (Looker, Hightouch, Mode, and [coming soon] Tableau) to ensure you have complete knowledge and control over how your data is used. Not only can your team understand how your data works its way through your entire system, but Datafold CI’s impact analysis immediately detects downstream assets that will be changed with your code updates.
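For the batch use case, the overall shape of a script is usually "take a list of table pairs, submit a diff for each, collect the outcomes." The sketch below shows that pattern with a thread pool. It deliberately does not call Datafold's REST API: `fake_submit` is a placeholder standing in for whatever request your team would make, and the table names, the `run_batch` helper, and the returned status strings are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

TablePair = tuple[str, str]


def run_batch(
    pairs: list[TablePair],
    submit: Callable[[str, str], str],
    max_workers: int = 8,
) -> dict[TablePair, str]:
    """Submit one diff per (source, target) pair and collect a status for each."""
    results: dict[TablePair, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pair: pool.submit(submit, *pair) for pair in pairs}
        for pair, future in futures.items():
            try:
                results[pair] = future.result()
            except Exception as exc:
                # Record the failure and keep going, so one bad table
                # does not stop the rest of the batch.
                results[pair] = f"error: {exc}"
    return results


def fake_submit(source: str, target: str) -> str:
    """Placeholder for a real API call; fails for one table to show error handling."""
    if source.endswith("broken"):
        raise RuntimeError("table not found")
    return "submitted"


# Hypothetical migration: the same tables in a legacy and a new warehouse.
table_pairs = [
    ("legacy.orders", "new.orders"),
    ("legacy.customers", "new.customers"),
    ("legacy.broken", "new.broken"),
]

statuses = run_batch(table_pairs, fake_submit, max_workers=4)
print(statuses)
```

Passing the submit function in as an argument keeps the batching logic separate from the request itself, so you could swap the placeholder for a real client without touching `run_batch`.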
What do you need to get the most out of Datafold Cloud?
Because Datafold Cloud is first and foremost a data diffing tool (whether you’re diffing prod and dev dbt models within the same database or validating parity across data warehouses during a migration), there are a handful of things that you’ll need to be successful with it. We’ll break down the general requirements to use Datafold Cloud depending on your primary use case:
A note: We see some teams start off by using Datafold Cloud to validate data across databases during their migrations, and eventually move on to using it for dbt development and deployment testing once they’re ready. Teams that have already migrated to a modern data stack (MDS) typically jump directly into development and deployment testing for dbt with Datafold Cloud. As we like to say, Datafold and data diffing play a role throughout your data team’s entire journey.
Please read more here to determine if Datafold Cloud is a good fit for your team’s data quality needs.
To sum it up, Datafold Cloud enables:
- Summarized diff overviews and value-level differences, available in the Datafold Cloud UI and directly in your data warehouse
- Automated data quality testing through the native CI integration—no PR shall go untested moving forward 😉
- Column-level lineage for an unparalleled view into your data’s ecosystem
- All of this in a secure and compliant environment
Whether you’re just getting started on migrating to a new data warehouse or you're knee-deep in dbt work, Datafold Cloud can be a solution to ensure your team is working with confidence and speed. If you want to learn more about Datafold Cloud and how it can support your team in your data quality testing journey, please check out the following resources:
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.