Folding Data #27 Disjointed Lineage

Gleb Mezhanskiy

CEO of Datafold

An Interesting Read: The tale of Disjointed Lineage and Grieving Data Quality

Ananth Packkildurai, who built the data platform at Slack, compares modern data applications to microservices and explains why two things matter: contracts between producers and consumers, and tracing the dependencies between them (lineage). The natural question is: who in the data stack should be responsible for solving that?

Ananth suggests that it should be data orchestrators. Airflow is foundational and hugely influential but... obsolete, and we know we need something better. dbt is effective, cool, and virally adopted, but it's SQL-only. The awesome dev experience gets lost somewhere in the Jinja jungles when you try to put one on top of the other.

It makes me wonder whether Dagster is the answer to two questions at once: "should tasks = data models?" and "is it possible to build data pipelines fast and keep them reliable at scale?"

Or maybe the task of observability and quality control needs to be performed by specialized, orchestrator-agnostic tools such as Monte Carlo, Datafold, and great_expectations?
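To make "contracts" and "lineage" a bit more concrete, here is a minimal Python sketch of both ideas, independent of any orchestrator or tool mentioned above. Everything in it is hypothetical and for illustration only: the `Contract` and `LineageGraph` classes, the `violations` helper, and the dataset names (`events`, `sessions`, `weekly_active_users`) are not taken from Ananth's article or from any real library.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Columns and types a producer promises to its consumers (hypothetical)."""
    dataset: str
    columns: dict[str, type]


@dataclass
class LineageGraph:
    """Maps each dataset to the datasets that read from it (hypothetical)."""
    downstream: dict[str, set[str]] = field(default_factory=dict)

    def add_edge(self, producer: str, consumer: str) -> None:
        self.downstream.setdefault(producer, set()).add(consumer)

    def impacted_by(self, dataset: str) -> set[str]:
        """Return every dataset transitively downstream of `dataset`."""
        seen: set[str] = set()
        stack = [dataset]
        while stack:
            for child in self.downstream.get(stack.pop(), set()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


def violations(contract: Contract, rows: list[dict]) -> list[str]:
    """List contract breaches (missing or mistyped columns) in a batch of rows."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected in contract.columns.items():
            if column not in row:
                problems.append(f"row {i}: missing {column}")
            elif not isinstance(row[column], expected):
                problems.append(f"row {i}: {column} is not {expected.__name__}")
    return problems


# Made-up example: the producer starts sending user_id as a string.
contract = Contract("events", {"user_id": int, "event": str})
lineage = LineageGraph()
lineage.add_edge("events", "sessions")
lineage.add_edge("sessions", "weekly_active_users")

problems = violations(contract, [{"user_id": "42", "event": "login"}])
impacted = lineage.impacted_by("events")
```

The contract catches the breach at the producer, and the lineage graph says which consumers would have been hit. Whether that logic belongs in the orchestrator or in a separate tool is exactly the open question.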

Data Product == Microservices, why is this hard?

Tool of the week: Supergrain – again

Supergrain originally launched about four months ago with a bold headless BI vision. But, as happens in the intergalactic data stack, their impressive launch was soon eclipsed by dbt – the current center of gravity in the transformation galaxy – announcing their metrics layer.

Supergrain went back into stealth and launched again as a "customer engagement platform built natively on your data warehouse". Sounds like Customer.io? Not so fast.

I spent a good chunk of my data engineering days trying to make Marketing happy by piping an ever-increasing amount of data into customer engagement tools. It wasn't fun because those platforms are essentially black holes for data: you need to build data models per their particular schemas, pump them in, and pray that it all works out since there's no real way to validate the completeness or correctness of the data on the other end. That's why Supergrain's "natively on your data warehouse" bit is essential: it's an extension to the prevailing ELT pattern. By keeping all business logic for customer segmentation and personalization with the rest of the analytical code and leveraging your data warehouse, you eliminate the unnecessary complexity of having multiple data platforms.
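As a rough illustration of why keeping segmentation logic in the warehouse helps, here is a small Python sketch. It is not Supergrain's implementation or API. An in-memory SQLite database stands in for the warehouse, and the tables, the `high_value_customers` view, and the 100 threshold are all made up for the example.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse (assumption).
warehouse = sqlite3.connect(":memory:")
warehouse.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);

    INSERT INTO customers VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO orders VALUES (1, 120.0), (1, 80.0), (2, 15.0);

    -- The segment is just another model living next to the analytical code.
    CREATE VIEW high_value_customers AS
    SELECT c.id, c.email, SUM(o.amount) AS lifetime_value
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.email
    HAVING SUM(o.amount) >= 100;
""")

# Read the segment the way an engagement tool would.
segment = warehouse.execute(
    "SELECT email, lifetime_value FROM high_value_customers"
).fetchall()

# Because the segment stays in the warehouse, it can be validated with an
# ordinary query instead of trusting a black box on the other end.
missing_emails = warehouse.execute(
    "SELECT COUNT(*) FROM high_value_customers WHERE email IS NULL"
).fetchone()[0]
```

The point of the sketch is the last query: when the segment is a model in the warehouse, completeness and correctness checks are plain SQL against data you already control.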

I am excited for Supergrain, and not only because George is my former boss and one of the most customer-centric leaders in data, or because I am an investor in the company. I find elegant solutions to old but massive problems very inspiring.