Lineage Metadata Where You Need It - Datafold’s GraphQL API
If you need to track how your data is transformed across your pipeline, data lineage makes it easy to visualize. Many tools, including Amundsen and Data Hub, offer table-level lineage so that data analysts and scientists can see where derived tables come from and how data flows downstream. Often, though, you need something more granular to answer questions such as:
- Where does the data used in this BI report come from?
- What is the impact of changing a certain column?
- Which columns are used the most, and which are unused?
That’s when you need column-level lineage.
Column-level lineage lets you see where your data is coming from and going to in seconds. Zoom in to trace the flow and truly understand where the values in a column are coming from without digging through SQL code or spending hours hunting through PRs. This can be particularly helpful when you need to understand the downstream impact of a change in your pipeline or track where PII data is being used.
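To make the "downstream impact" question concrete, here is a minimal sketch of how it could be answered once column-level lineage is available as a list of edges. The column names are made up for illustration and loosely follow the beer example used later in this post; they are not taken from the actual pipeline.

```python
from collections import deque

# Hypothetical column-level edges: (upstream column, downstream column).
EDGES = [
    ("raw.beers.abv", "staging.beers.abv"),
    ("raw.beers.brewery_id", "staging.beers.brewery_id"),
    ("raw.breweries.id", "staging.breweries.brewery_id"),
    ("raw.breweries.state", "staging.breweries.state"),
    ("staging.beers.brewery_id", "marts.beers_per_state.beer_count"),
    ("staging.breweries.brewery_id", "marts.beers_per_state.beer_count"),
    ("staging.breweries.state", "marts.beers_per_state.state"),
]


def downstream_columns(edges, start):
    """Return every column reachable from `start` using breadth-first search."""
    children = {}
    for upstream, downstream in edges:
        children.setdefault(upstream, []).append(downstream)

    seen = set()
    queue = deque([start])
    while queue:
        column = queue.popleft()
        for child in children.get(column, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Which columns are affected if we change raw.breweries.state?
impacted = downstream_columns(EDGES, "raw.breweries.state")
print(sorted(impacted))
```

The same traversal, run against reversed edges, would answer the "where does this value come from?" question.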
While Datafold has an intuitive UI for column-level lineage, we also know that you might already be using another catalog or data discovery tool. The modern data stack typically consists of best-of-breed tooling, and each of these tools has its place within the stack but might benefit from lineage metadata. That’s why we have a GraphQL API to make it easy to bring Datafold's lineage wherever you like.
This blog showcases how to bring lineage data into another catalog using Datafold with a simple example pipeline. Of course, to make it extra fun, our incredible Customer Engineer, Fokko Driesprong (who also helped extensively with this blog!), made a beer-themed data pipeline that allows for analytics on breweries in the USA.
How To Bring Column-Level Lineage into Amundsen
Here is a basic visualization of how this flow will work.
You can check out our public GitHub repository that contains the pipeline. We use dbt to model the pipeline, which is written in SQL and uses BigQuery as a backend.
For the input tables, we have the beers and their breweries. By renaming, cleaning, and joining these tables, we can get some insight into where to find our favorite craft beers. This way, we’ll know which states to visit based on their breweries and styles of beer. By default, dbt provides table-level lineage.
Clearly, this is a simplified pipeline, but it can scale to any size. After all, right now we’re only tracking breweries and the numbers are relatively small, although we’re still excited to visit Colorado with 265 unique beers:
To ingest metadata, Amundsen uses what they call Databuilders. These are small data ingestion libraries that connect to an external source like BigQuery or Datafold to fetch data. We can easily pull the metadata and usage data from BigQuery into Amundsen:
For easy integration with Amundsen, you can use the Datafold Metadata GraphQL API:
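A request against a GraphQL endpoint could look roughly like the sketch below. It only relies on the common GraphQL-over-HTTP convention of POSTing a JSON body with `query` and `variables`. The endpoint URL, the authorization header format, and every field name in the query are placeholders we made up for illustration, not Datafold's documented schema, so check the API documentation for the real ones.

```python
import json
import urllib.request

# ASSUMPTION: placeholder endpoint, not a real Datafold URL.
GRAPHQL_URL = "https://datafold.example.com/api/graphql"

# ASSUMPTION: hypothetical query shape; the real schema may differ entirely.
LINEAGE_QUERY = """
query ColumnLineage($table: String!) {
  table(name: $table) {
    columns {
      name
      upstream { table column }
    }
  }
}
"""


def build_request(table, api_key, url=GRAPHQL_URL):
    """Build a POST request carrying the query and its variables as JSON."""
    body = json.dumps(
        {"query": LINEAGE_QUERY, "variables": {"table": table}}
    ).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # ASSUMPTION: the auth scheme is a guess for illustration only.
        "Authorization": f"Key {api_key}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def parse_edges(payload, table):
    """Flatten a response into (upstream column, downstream column) pairs."""
    edges = []
    for column in payload["data"]["table"]["columns"]:
        for up in column["upstream"]:
            edges.append(
                (f"{up['table']}.{up['column']}", f"{table}.{column['name']}")
            )
    return edges


def fetch_lineage(table, api_key):
    """Send the request and return the flattened lineage edges."""
    with urllib.request.urlopen(build_request(table, api_key)) as response:
        return parse_edges(json.load(response), table)
```

Keeping request building and response parsing separate from the network call makes the sketch easy to test against a canned response before pointing it at a real endpoint.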
This way, you can use the API to fetch the data from Datafold and amend the existing BigQuery data with the lineage information. Because of the graph-oriented nature of the column-level lineage data, it makes the most sense to expose the data using GraphQL.
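As a rough illustration of what "amending" could mean, the sketch below attaches upstream columns to column metadata records. Both the record shape and the sample values are assumptions made for this example; a real Databuilder job would work with Amundsen's own metadata models instead of plain dictionaries.

```python
# Hypothetical column metadata, shaped like what a BigQuery extractor might yield.
COLUMNS = [
    {"table": "marts.beers_per_state", "column": "state", "type": "STRING"},
    {"table": "marts.beers_per_state", "column": "beer_count", "type": "INT64"},
]

# Hypothetical lineage edges: (upstream column, downstream column).
EDGES = [
    ("staging.breweries.state", "marts.beers_per_state.state"),
    ("staging.beers.brewery_id", "marts.beers_per_state.beer_count"),
    ("staging.breweries.brewery_id", "marts.beers_per_state.beer_count"),
]


def attach_upstream(columns, edges):
    """Return copies of the column records with an added `upstream` list."""
    upstream_by_column = {}
    for upstream, downstream in edges:
        upstream_by_column.setdefault(downstream, []).append(upstream)

    enriched = []
    for col in columns:
        key = f"{col['table']}.{col['column']}"
        enriched.append({**col, "upstream": sorted(upstream_by_column.get(key, []))})
    return enriched


for record in attach_upstream(COLUMNS, EDGES):
    print(record["column"], "<-", record["upstream"])
```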
How To Bring Column-Level Lineage into Data Hub
The process for bringing your lineage metadata into Data Hub is fairly similar to that of Amundsen. Data Hub also supports metadata ingestion from a range of tools. It uses a push/pull mechanism: you create a YAML file that describes both the source (database, metastore, etc.) and the Data Hub sink.
Running the ingestion spawns a small Python process that pulls the data from the source and pushes it to Data Hub, either over HTTP or through Kafka.
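To show the source-plus-sink shape of such a file, here is a minimal sketch that builds a recipe as a Python dictionary and writes it out as JSON, which YAML parsers generally accept. The `bigquery` source type, the `datahub-rest` sink type, and the config keys shown are our assumptions about the recipe format and should be verified against the Data Hub ingestion docs for your version. The project ID and server address are placeholders.

```python
import json


def build_recipe(project_id, server="http://localhost:8080"):
    """Describe one source and one sink, mirroring the recipe layout."""
    return {
        # ASSUMPTION: source type and config key names may differ per version.
        "source": {"type": "bigquery", "config": {"project_id": project_id}},
        # ASSUMPTION: REST sink pointing at a local Data Hub instance.
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def write_recipe(recipe, path):
    """Write the recipe to disk as JSON (a subset of YAML 1.2)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(recipe, handle, indent=2)


recipe = build_recipe("my-gcp-project")  # placeholder project ID
print(json.dumps(recipe, indent=2))
```

A file written this way could then be handed to the ingestion CLI, for example `datahub ingest -c recipe.yml`. Swapping the sink for a Kafka-based one would follow the same pattern with a different `type` and `config`.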
What’s interesting to note about Data Hub is that it doesn’t support column-level lineage (which it calls field-level lineage) at the time of writing. That said, you can still find value in bringing your lineage metadata into Data Hub to explore job lineage and see how the jobs flow together. This high-level lineage can be useful when exploring your datasets, but keep in mind that you won’t be able to zoom in for more granularity.
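If the target tool only understands coarser lineage, one option is to collapse column-level edges into table-level edges before pushing them. The sketch below assumes columns are identified as `schema.table.column` strings, which is a convention we chose for this example, and the sample edges are made up.

```python
# Hypothetical column-level edges: (upstream column, downstream column).
COLUMN_EDGES = [
    ("staging.beers.brewery_id", "marts.beers_per_state.beer_count"),
    ("staging.breweries.brewery_id", "marts.beers_per_state.beer_count"),
    ("staging.breweries.state", "marts.beers_per_state.state"),
]


def table_level_edges(column_edges):
    """Collapse column edges into unique (upstream table, downstream table) pairs."""
    tables = set()
    for upstream, downstream in column_edges:
        # Drop the trailing column name to get the table identifier.
        up_table = upstream.rsplit(".", 1)[0]
        down_table = downstream.rsplit(".", 1)[0]
        if up_table != down_table:  # ignore edges within a single table
            tables.add((up_table, down_table))
    return sorted(tables)


print(table_level_edges(COLUMN_EDGES))
```

The collapse is lossy by design: two tables stay connected as long as at least one column flows between them, but which column that is can no longer be recovered.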
Once you have your column-level lineage in your data discovery tool of choice, you can explore your data dependencies and gain greater insight into your data. Also, when stakeholders ask if a number is right or where it’s coming from, you can answer in seconds, not hours. With greater confidence in the data, everyone wins.
We hope this was helpful for you when bringing your column-level lineage into Amundsen, Data Hub, or other tools. If you haven’t explored column-level lineage yet, you can see what it’s like in our interactive sandbox. Ready to get started? Be sure to contact us or book a demo to see it in action!