Why Data Lineage Matters and Why It’s So Challenging

Changes in code introduce risk. While much tooling has been built to help software engineers test code changes, the equivalent tooling for data engineers is less mature. Data engineers need to see the impact of their code changes throughout their data pipelines on the data itself, because even a small unexpected change in a downstream table can lead to large negative outcomes (financial loss, bad decision making, etc.).

As the complexity of data stacks has increased, the problem of understanding the impact on data before pushing code changes to production has grown worse. Luckily, there are new tools to help, the most critical being data lineage. Data lineage is a graph showing how all the data in a database is connected. With this understanding, data engineers can make better design decisions. But building data lineage yourself is difficult, and vendors (such as Datafold) that provide data lineage vary in quality.

Why Data Lineage Matters

In modern data stacks, data flows back and forth among multiple systems all the time, making it difficult to know the impact of a small change across all the connected systems.

Imagine, for example, that you run an ecommerce store selling beer. You currently rate beers as either Malty or Hoppy based on the number of IBUs. You want to add a third category in the middle, Balanced, which would re-label about 20% of the beers you are selling. This seems like a straightforward change to make in your SQL code, but what might be the consequences? The table you are directly editing will likely reflect your expectations, but what if the marketing team was using those categories to send custom emails to customers who purchased them? Now everyone buying a Balanced beer would likely not get a follow-up email, and because the process is automated, the marketing team might not notice the issue for some time.
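To make the risk concrete, here is a minimal Python sketch of that re-categorization. The IBU thresholds and beer names are hypothetical (the scenario above does not specify exact cutoffs); the point is that beers falling into the new Balanced band silently change labels, and every downstream consumer of the old labels is affected without any error being raised.

```python
# Hypothetical IBU cutoffs -- assumptions for illustration only.

def categorize_v1(ibu: int) -> str:
    """Original two-way split: Malty below 40 IBU, Hoppy otherwise."""
    return "Malty" if ibu < 40 else "Hoppy"

def categorize_v2(ibu: int) -> str:
    """New three-way split: adds a Balanced band in the middle."""
    if ibu < 30:
        return "Malty"
    if ibu < 50:
        return "Balanced"
    return "Hoppy"

beers = {"Stout": 25, "Amber Ale": 35, "Pale Ale": 45, "IPA": 65}

# Beers whose label changes under the new scheme -- these are the rows
# that downstream consumers (marketing emails, recommenders) silently lose.
relabeled = {
    name: (categorize_v1(ibu), categorize_v2(ibu))
    for name, ibu in beers.items()
    if categorize_v1(ibu) != categorize_v2(ibu)
}
print(relabeled)
# {'Amber Ale': ('Malty', 'Balanced'), 'Pale Ale': ('Hoppy', 'Balanced')}
```

Nothing in this change fails loudly: the query runs, the table updates, and the breakage only shows up in systems that consume the old labels.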

Next, imagine a similar scenario, but instead of emails not getting sent, the data on what category of beer a customer purchased feeds into a data science algorithm that recommends other beers to purchase. In this case, the same small change would distort your customer recommendation engine. Small changes can have big impacts in connected data systems.

Tackling the Challenge With Data Lineage

Data lineage tools provide visibility into how data is connected upstream and downstream within a database.

This allows you to see how changes might impact other systems, so you can avoid creating problems like the ones described earlier. That leads to less wasted effort, faster time-to-market, and lower costs. It’s faster and cheaper to solve a problem when you catch it early in the development process; data lineage provides this critical missing link in the data engineering workflow.

There are benefits to data governance and regulatory compliance as well. When the European Union passed the General Data Protection Regulation (GDPR) back in 2016, it set off a chain of new requirements for systems that house personally identifiable information (PII). If a customer asks you to delete their data, you’re legally obligated to comply with that request. But what happens upstream and downstream of that change? Will it break anything? Or will some of the customer’s PII remain elsewhere in your systems, rendering you non-compliant with the law?

Data lineage addresses these questions by automating the process of detecting upstream and downstream effects. It provides a clear audit trail that can help determine where changes originated and how the data arrived at its current state.

Column-level data lineage is especially difficult because it requires that SQL queries be parsed to discern exactly what changed and how. This means tapping into database logs, but because there are so many database vendors—each with their own unique SQL dialect—parsing that information for various databases can be particularly challenging. As each of those SQL dialects evolves over time, data lineage tools must constantly be refined to accommodate those changes.
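To illustrate why parsing matters, here is a deliberately naive Python sketch of column-level lineage extraction for the simplest possible query shape. This is a toy, not how any production lineage tool works: a single regex handles only `SELECT col, col FROM table`, and it would fail on joins, subqueries, aliases, CTEs, or any dialect-specific syntax, which is exactly the gap that real SQL parsers exist to close.

```python
import re

def toy_column_lineage(sql: str) -> dict:
    """Map each output column of a trivial SELECT back to its source table.

    Only handles 'SELECT a, b FROM table' -- real lineage tools must fully
    parse every vendor's SQL dialect to do this reliably.
    """
    m = re.match(r"SELECT\s+(.+?)\s+FROM\s+(\w+)", sql, re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported query shape")
    cols, table = m.group(1), m.group(2)
    return {col.strip(): table for col in cols.split(",")}

lineage = toy_column_lineage("SELECT beer_id, ibu FROM beers")
print(lineage)  # {'beer_id': 'beers', 'ibu': 'beers'}
```

Even this trivial case shows the shape of the problem: lineage is a mapping from output columns back to their sources, and building that mapping accurately across dozens of evolving SQL dialects is what makes column-level lineage hard.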

An example column-level lineage view that shows all the tables and columns that are involved in computing the sales table used for reporting in an imaginary e-commerce business selling craft beers.

As interconnected systems extend their reach even further and the flow of information accelerates, column-level data lineage will become more important. Today, it’s an extraordinarily useful tool that saves time, effort and money. Soon, data lineage will be indispensable.

Excerpted from the original article first published on DevOps.com.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
