Choosing the right data lineage tools for your analytics work
In 2012, JP Morgan Chase faced an embarrassing data setback and it cost them $2B in stock value. Their Value-at-Risk, or VaR, financial model relied on incorrectly-calculated and problematic historical data. In the follow-up analysis, it was clear that effective data lineage would have reduced the incorrectness of the model. It was an eye-opening moment for the financial industry, emphasizing the importance of understanding each and every piece of data that feeds into an important data model. JPMC, one of the chief architects and biggest proponents of the VaR financial model, stopped using it altogether—as did every other financial institution.
This isn’t the only story of banks getting into financial trouble because of inadequate data quality tooling and processes. In 2016, Credit Suisse was fined $90M by the SEC for misrepresenting how a key performance metric was calculated when an executive changed how data was interpreted in a report. In 2019, Bank of England was fined £44M for financial issues rooted in data quality. In 2020, Citi Bank was fined $400M for poor data management practices. In 2021, Deutsche Bank was fined €10 million for failing to have effective data quality controls in place.
Transparency around data quality and metrics is important. And it’s expensive if you don’t have it.
At the heart of transparency is data lineage, which tells the story of the origin, evolution, and application of every data point across an organization.
Why data lineage is important for analytics projects
Trustworthy analytics depend on accurate data. Let’s say your BI dashboard shows an unexpected sales decline—or a sudden sales increase. Before jumping to conclusions, like replacing your entire sales force or ramping up production to unrealistic levels, you can look at the lineage to check your assumptions. Data lineage reveals whether a metric is real or if, for example, a dbt transformation simply misinterpreted a discount code.
By tracing the flow of data from source(s) to destination(s), data lineage tools identify potential errors and discrepancies (ex. Using an incorrect or outdated upstream data source). They also help ensure that the data you use to generate business metrics is accurate and reliable.
Understanding the lineage of your data can rescue your organization from gigantic fines and be the ultimate source of truth when a data-driven decision is unclear. With the right data lineage tool, you can:
- Document your data throughout its lifecycle, including source information and any ETL- or ELT-related transformations
- Troubleshoot dbt models and BI app assets more easily
- Prevent catastrophes caused by deleting columns with important uses downstream
- Share lineage with team members to provide transparency and reduce single points of failure
Most data lineage tools are incomplete
Data lineage is quickly becoming a “table stakes” feature for data platforms. But not all data lineage features and capabilities are created equal. The lineage you get from a data catalog (e.g. Atlan) differs greatly from the lineage you get from a comprehensive enterprise data platform (e.g. Informatica). Each platform employs different techniques to capture lineage (e.g. pattern-based, tagging). They also present lineage information in a variety of ways.
Having lineage is “great,” but it’s more than just a checkbox on a list of features. Data engineers need a fuller context of both the history of data and the downstream consumption of data. To get this information, they need table-, row-, and column-level lineage. The way this information is gathered is also important. If the lineage data is gathered based on metadata from a BI tool, that’s not comprehensive enough. Advanced lineage tools like Datafold Cloud look at the data warehouse itself to get a comprehensive view of any destination that’s consuming data.
Data professionals need depth and breadth
Complete lineage covers both breadth and depth. It means you get a comprehensive view at even the most granular levels.
Table-level lineage offers a view of data movement at the table level. It’s useful for troubleshooting high-level data issues, seeing intermediate transformations, and downstream derived tables. It’s not very granular, but it has its uses. We’ll get into column-level lineage in just a moment, but as you might have guessed, it lets you get very granular, showing how tables have changed over time and which columns are being used in downstream data assets (e.g. Power BI dashboard).
Getting a complete view of your lineage means getting both high-level and granular lineage data and having it presented in a way that’s most consumable for you and your organization.
Column-level lineage tracking
Data lineage systems frequently face challenges in tracking data movement through complex workflows. It’s difficult to determine how a changed calculation or a renamed column is going to affect data consumers. Of course, developers and data professionals can always do a data diff to see exactly what and how will change as they adjust their code—but that’s still a relatively closed system. Your IDE doesn’t have an error message that says, “This change will break the CIO’s super-secret custom pivot table.”
Column-level lineage provides that visibility. And it’s especially useful when the sheer volume of dbt models and transformations is overwhelming. (It’s okay—no judgment from us!)
Organizational data ecosystems are just so complex nowadays. There are thousands of data sources, a handful of storage destinations, and countless tools that rely on upstream data. Data professionals are pulling their hair out trying to keep track of it all. They may even have a data lineage tool, but it lacks the context and detail they need to understand downstream impacts.
Column-level lineage (with helpful visualizations and an intelligent understanding of data consumption beyond the data warehouse) is becoming increasingly essential. Instead of finding out that a data change broke something, triggering the rollback implementation plan and post-mortem documentation, data teams can predict the future. Column-level lineage will show exactly what will happen downstream, giving you the opportunity to be proactive and tell the CIO that the super-duper secret dashboard is about to change.
Important analytics features in a data lineage tool
In addition to table-level and column-level lineage, the most complete data lineage tools provide analytics professionals with:
- Clear and intuitive visualizations
- Mapping and analysis capabilities
- Integration with data transformation tools and data warehouses
- Support for column-level lineage
Let’s look at why each of these is so important.
When it comes to data visualization, you also want your data lineage tool to offer depth and breadth. You might also want things to look pretty and impressive to execs, but that’s probably secondary.
Look for a data lineage tool that offers easy-to-understand visualizations to highlight data dependencies and relationships between data elements. You want visualizations that help you (and others) to look for anomalies, patterns, errors, and opportunities. You also want the ability to “double-click” into more detail.
Detailed visualizations help you quickly identify potential issues and gain insights into data quality, making the data lineage tracking process more efficient and effective. The detailed views allow you to really get into the nitty-gritty of what could be wrong—or what could be opportunistic—in the data.
It’s not possible to perform data lineage without integrations, so it might sound obvious to say that integrations are an important consideration for a data lineage tool. But there’s more to these integrations than “does this work with all the tools in my data stack?”
For data lineage to sufficiently work, it must be able to interact and gather data from a variety of data sources and destinations: warehouses, lakes, SaaS apps, in-house first-party systems, BI platforms, and governance systems. So, yes, there must be a sufficient number of integrations to meet your data lineage requirements.
However, we need to return to the “breadth and depth” aspects here. The depth of integrations matters as well. While one lineage tool’s website may say it has a Facebook integration, you may have to investigate the depth of that integration. Does it interoperate across the entire Facebook ecosystem? Can it connect to—and understand the intricacies of—the social media platform itself, the ads platform, Pages, and all of the tables buried within? Depending on your use case, it may not need to. Or it just might be mission critical.
The deeper the integration, the more effective the lineage tracking, the better you can maintain data quality throughout your data environment.
Datafold Cloud supports native integrations with Hightouch, Mode, Looker, and (coming soon) Tableau, for unparalleled visibility into how your code changes will potentially impact the most important assets of your business.
dbt for lineage
dbt is an open source tool that helps data analysts and engineers prepare their data for different use cases. If you’re used to working with SQL, dbt can help you do your own data transformations. With its version control features you can test your analytics code before committing it.
dbt infers relationships between data sources and models to create directed acyclic graphs (DAGs), which show the flow of your data and the dependencies among different data transformations. DAGs also identify bottlenecks in data pipelines, troubleshoot data quality issues, and ensure compliance with data governance regulations.
In addition to basic data lineage information, DAGs also store metadata and documentation about data transformations. Data teams rely on this information to understand how data is processed and to track changes to data pipelines over time.
On the surface, it looks like you can use dbt for most, if not all, of your data lineage needs. However, the dbt DAG abstracts a lot of information. For example, you can get high-level information about database and model changes via dbt, but with a more lineage-focused tool like Datafold Cloud, you can see exactly what changes will occur, what data exists versus what should exist, and gain increased visibility into your data warehouse. Datafold Cloud can also show you the exact lineage of a PR about to be merged, so your team has direct insight as to how dependencies may shift with your code change.
If you are using dbt, we highly recommend complementing it with Datafold Cloud's column-level lineage. When used together, your team has a fully-encompassing view as to how your data flows throughout your organization.
Choosing the right data lineage solution
Selecting the optimal data lineage tool requires an understanding of both your comprehensive and granular requirements. Before selecting a data lineage tool, we recommend thoroughly documenting and clarifying your expectations and governance needs around data quality assurance, trend analysis, and predictive modeling. And don’t forget that your needs will evolve over time, so design your requirements around what you need today and what you’ll need in the foreseeable future.
Datafold Cloud’s advanced features and seamless integration with dbt make it a standout choice for data lineage at the tabular and columnar levels. dbt users particularly love its swift validation of dbt model changes during development, deployment, and migrations.
When looking for a data lineage tool, be sure to consider:
- Breadth and depth of lineage capabilities. Make sure “data lineage” isn’t just another line item on the pricing page.
- Granular lineage features, such as table-, column-, and row-level lineage with visibility into downstream consumption.
- Effective visualizations that present both high-level and detailed information for insights, troubleshooting, and opportunities.
- Integrations that work across your data ecosystem, and especially in areas where you need more depth in data lineage.
Datafold Cloud gives you the data lineage foundation you need for effective analytics. It empowers you to seamlessly extract, view, and delve into data lineage, giving you insights into vital areas like compliance, root cause analysis, fraud detection, and business intelligence.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.