Folding Data #16

An Interesting Read: Adopting CI/CD with dbt Cloud

One of the most challenging aspects of building data products has always been change management: stakeholders push for faster iteration, data and source code volumes grow, and more non-data-team committers get involved, all of which demands constant heroism from data team leaders. dbt has been spearheading the adoption of proven operational practices from the software world in the data domain, such as version control, separation of dev from staging from prod, docs as code, and unit testing.

Finally, dbt Labs has tackled two essential workflows: Continuous Deployment (for fast incremental updates) and Continuous Delivery (for releasing multiple changes at once). Whether or not you use dbt or dbt Cloud, incorporating these processes into your team's workflow can take your data reliability SLAs to an entirely different level. And if you want to add a pinch of automated testing to your CI/CD flow, check out Data Diff. 😉
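For teams rolling their own CI rather than using dbt Cloud, here is a minimal sketch of the pattern with GitHub Actions and the dbt CLI. The workflow name, Python version, adapter, and profiles path are placeholders; adapt them to your warehouse and repo:

```yaml
# Hypothetical GitHub Actions workflow: build and test dbt models
# in an isolated CI schema on every pull request.
name: dbt-ci
on: pull_request
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - run: pip install dbt-core dbt-postgres  # swap in your adapter
      - run: dbt deps
      # `dbt build` runs and tests models in dependency order
      - run: dbt build --profiles-dir ./ci
```

The key idea is that every pull request gets built and tested against the warehouse before merge, so broken SQL never reaches prod.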

Enable fast & reliable data development with CI/CD

Tracking PII? Column-level Lineage Can Help!

As if data teams haven't had enough to worry about, now they are also expected to protect customer data and ensure compliance with privacy regulations such as CCPA and GDPR. While dedicated tooling in this space is still evolving, a good first step toward gaining control over sprawling Personally Identifiable Information (PII) is having column-level data lineage across your data platform. With detailed data lineage, you can easily trace how PII flows through your pipelines and make the compliance process far less painful.

Read more about tracing PII with lineage

Tool of the Week: Materialize

It seems we are past the hype cycle of streaming data applications: multiple frameworks, including Apache Beam and Flink, claimed to revolutionize how we do ETL but haven't made a dent in SQL-batch-dominated analytical data prep workflows (although they have gained a lot of popularity for real-time ML), in part because none offered first-class support for SQL. Materialize is interesting in that regard because it (a) has SQL as the primary interface, (b) is serious about SQL and even implements advanced JOIN functionality, and (c) has great interop with major data lake (Avro, JSON) and streaming (Kafka) formats. Bonus point: it's written in Rust 🤘
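To illustrate the SQL-first interface, here is a hedged sketch of Materialize's Kafka-source-plus-materialized-view pattern. The broker address, topic, and column names are made up, and the exact DDL varies by Materialize version, so treat this as illustrative rather than copy-paste-ready:

```sql
-- Ingest an Avro-encoded Kafka topic as a streaming source
CREATE SOURCE purchases
FROM KAFKA BROKER 'kafka:9092' TOPIC 'purchases'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

-- An incrementally maintained aggregation over the stream
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, SUM(amount) AS total_revenue
FROM purchases
GROUP BY region;

-- Query it like an ordinary table; results stay up to date
SELECT * FROM revenue_by_region;
```

The appeal is that the transformation logic reads like batch SQL, while Materialize keeps the view incrementally updated as new Kafka messages arrive.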

Check out Materialize on GitHub ✨

Registrations Now Open for Data Quality Meetup #6

Registrations are now open for our upcoming Data Quality Meetup. I can’t wait to hear the lightning talks from data leaders at Yelp, Patreon, Convoy, and Lightdash. This is our last Data Quality Meetup of the year, and you don’t want to miss it! Register now for free, and join in from around the world.

Register free for DQM #6 on October 28th 🗓️

Before You Go

As seen on Twitter.

Love this newsletter? Forward it to a friend! 🚀
Received this newsletter from a friend? Subscribe for yourself! ✨

Get Started

To get Datafold integrated seamlessly with your data stack, we need a quick onboarding call to get everything configured properly.