Folding Data #33 Data relationships before contracts

Data Relationships before Contracts


It's interesting how conversations around data observability and quality focus so much on monitoring and validating data within the data team's scope, almost ignoring the fact that most of the data an analytical team uses, such as analytical events and production table replicas, is produced and owned by software engineering teams.

Engineers are rightfully upset to be blamed for data quality incidents as they typically have little to no visibility into how the data they produce is used. Furthermore, they are right when they say "our Postgres tables are meant to power the app, not your dashboards". But those tables are full of really useful data, so we can't resist pulling them into our Snowflake caves.

Clearly, software teams need to be more aware of the downstream uses of the data they produce. At the scale and complexity of modern data, this cannot be solved by posting "hey! About to drop this column – any concerns?" to Slack.

So maybe, in the search for better data quality, data teams should find ways to form better contracts with the data producers?

Tools such as Avo provide frameworks for collaborative management of events.

Chad and the team at Convoy took this concept further with an internal tool called Chassis that not only formalizes event definitions but also maps each event to its semantic meaning for the business. Chad's vision for source data management goes beyond reliable events and could eventually help resolve the increasingly apparent disconnect between the complexity of the business and the rigidity of relational data models.

I think that we can do better without changing our stacks by making two tweaks to the relationships with the data producers before we formalize them in semantic data contracts:

1. Help software teams understand the impact of the data they produce. This is a perfect use case for column-level lineage: any time an engineer needs to change or drop a column in an event or production table, they can quickly check who and what uses this data source for analytics and ML, and how. If lineage is available via an API, such a check can be easily embedded in CI/CD for fool-proof validation.

2. Establish a collaborative process that requires sign-off from the appropriate owners on changes made to the source data. Provided #1 is solved, locating the relevant people should be easier, and many such changes may turn out to have no impact on analytics at all, which reduces the friction.
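
To make the two tweaks a bit more concrete, here is a minimal sketch in Python of what such a CI check might look like. Everything in it is an assumption for illustration: the lineage graph and ownership map are hard-coded dictionaries with made-up column and team names, standing in for whatever a real lineage tool's API would return, and the function names (`downstream_dependents`, `required_signoffs`, `ci_check`) are hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: each column maps to the assets that read
# from it directly. In practice this would be fetched from a lineage tool's
# API; all names below are invented for illustration.
LINEAGE = {
    "app.users.plan": ["analytics.dim_users.plan"],
    "analytics.dim_users.plan": [
        "dashboards.revenue_by_plan",
        "ml.churn_features.plan",
    ],
    "app.users.legacy_flag": [],
}

# Hypothetical ownership map for downstream assets.
OWNERS = {
    "analytics.dim_users.plan": "analytics-eng",
    "dashboards.revenue_by_plan": "finance-analytics",
    "ml.churn_features.plan": "ml-platform",
}


def downstream_dependents(lineage, column):
    """Return every asset reachable from `column`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(column, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(lineage.get(node, []))
    return seen


def required_signoffs(lineage, owners, changed_columns):
    """Map each owning team to the downstream assets affected by the change."""
    signoffs = {}
    for column in changed_columns:
        for asset in downstream_dependents(lineage, column):
            team = owners.get(asset, "unknown-owner")
            affected = signoffs.setdefault(team, [])
            if asset not in affected:
                affected.append(asset)
    return signoffs


def ci_check(changed_columns, lineage=LINEAGE, owners=OWNERS):
    """Return 0 if nothing downstream is affected, 1 otherwise (a CI exit code)."""
    signoffs = required_signoffs(lineage, owners, changed_columns)
    for team, assets in sorted(signoffs.items()):
        print(f"sign-off needed from {team}: {', '.join(assets)}")
    return 1 if signoffs else 0
```

In this sketch, dropping `app.users.legacy_flag` passes straight through because nothing reads it, while changing `app.users.plan` fails the check and lists the teams whose sign-off would be needed. A real setup would likely turn that list into review requests rather than a hard failure.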

A lightweight process with the right tooling can go a long way!

Tool of the week: Mozart Data

One of the pillars of the Modern Data Stack concept is the modularity of tools. Mozart raised a $15M Series A round while taking the opposite approach, and I am excited to see whether it evolves, if not into the Data OS, then at least into a go-to method for bootstrapping an effective data platform.

Orchestra vs. orchestrator

COALESCE 2022

You don't want to miss that!


Before You Go
