Folding Data #39 | Datafold

The Best Data Contract is the Pull Request

The law we data people should pay more attention to is the second law of thermodynamics: the entropy of an isolated system never decreases. The size and complexity of a data platform inevitably grow, while the intergalactic data stack keeps drifting apart like the Universe. Amidst these enormous forces, data contracts are our humble attempt to tame the chaos and make our lives as data people a bit better by reducing breakages in the systems we build and manage.

Among all the buzzwords invading the minds of the data community, data contracts are one of the least controversial, most straightforward, and most useful: no one would argue that the ability to discover software bugs at compile time, or while typing code in an IDE, is bad. However, while data engineering looks up to software engineering, it will never fully converge with it. And while we should implement formally defined interfaces for handing data over between teams and subsystems, things will still break at the weakest link, such as between the data transformation layer (e.g., dbt) and BI tools.

What if, instead of hiring those expensive data contract lawyers and re-architecting our data platforms to implement formal interfaces everywhere, we could operate a robust data platform based on handshakes?

Tool Highlight – PUDL

PUDL stands for Public Utility Data Liberation.

With imminent global warming and the looming energy crisis upon us, figuring out the most effective policy and investment decisions around energy generation and distribution is critical. Unfortunately, the macro-level data is too high-level for effective analysis, and granular data on consumption, generation, and associated costs is scattered and hard to find.

Catalyst Cooperative, which describes itself as a worker-owned cooperative creating open-source software and data to aid energy researchers, is the team behind PUDL – a public dataset and an open-source ELT codebase that generates “yearly, monthly, and even hourly data about fuel burned, electricity generated, operating expenses, power plant usage patterns and emissions.” Thanks to PUDL, researchers can work with well-structured and cleaned data to solve some of the world's most critical problems.

Interestingly, I came across this project as I was working on my own data liberation project to make negotiated rates between healthcare carriers and providers easily accessible for researchers and think tanks – my humble attempt to contribute to solving America’s burning healthcare problem. Please reach out if you are interested in this topic or know someone who would benefit from this dataset.

Data Should Be Liberated

RSVP for Data Quality Meetup

Data Quality Meetup is back from the summer break!

This time we’ll be hearing from five amazing speakers and more expert panelists on topics ranging from time series forecasting (I heard something better than Prophet 👀) to proactive data testing and, of course, data contracts.

Register here
