Folding Data #16
An Interesting Read: Adopting CI/CD with dbt Cloud
One of the most challenging aspects of building data products has always been change management: stakeholders push for faster iteration, data and source code volumes keep growing, and more non-data-team committers get involved, all of which demands constant heroism from data team leaders. dbt has been spearheading the adoption of proven operational practices from the software world in the data domain, such as version control, separation of dev, staging, and prod environments, docs-as-code, and unit testing.
Finally, dbt Labs has managed to tackle two essential workflows: Continuous Deployment (for fast incremental updates) and Continuous Delivery (for releasing multiple changes at once). Whether or not you use dbt or dbt Cloud, incorporating these processes into your team's workflow can lift your data reliability SLAs to an entirely different level. And if you want to add a pinch of automated testing to your CI/CD flow, check out Data Diff. 😉
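If you're not on dbt Cloud, the same CI idea is easy to wire up yourself. Here's a minimal sketch of a GitHub Actions workflow that builds and tests only the models modified in a pull request; the adapter (`dbt-snowflake`), the `./prod-artifacts` directory holding production manifests, and the `./ci` profiles directory are all assumptions you'd swap for your own setup:

```yaml
# Hypothetical CI workflow: build + test modified dbt models on every PR
name: dbt-ci
on:
  pull_request:
    branches: [main]
jobs:
  dbt-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.10"
      - run: pip install dbt-snowflake   # swap in the adapter for your warehouse
      - run: dbt deps
      # Only rebuild models changed in this PR (plus their downstream
      # dependents), deferring unchanged upstream models to production.
      - run: dbt build --select state:modified+ --defer --state ./prod-artifacts
        env:
          DBT_PROFILES_DIR: ./ci
```

The `state:modified+` selector and `--defer` flag are what keep CI runs fast: you only pay for what actually changed.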
Tracking PII? Column-level Lineage Can Help!
As if data teams haven't had enough to worry about, now they are also expected to protect customer data and ensure compliance with privacy regulations such as CCPA and GDPR. While dedicated tooling in this space is still evolving, a good first step toward gaining control over sprawling Personally Identifiable Information (PII) is column-level data lineage across your data platform. With detailed lineage, you can trace exactly how PII flows through your pipelines, which makes the compliance process far less painful.
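To make the idea concrete, here's a toy sketch of what "tracing PII through lineage" means once you have a column-level graph. The table and column names are made up for illustration; real lineage would come from your metadata or lineage tool:

```python
from collections import deque

# Hypothetical column-level lineage graph: each key is a (table, column)
# node, and its value lists the downstream (table, column) nodes it feeds.
LINEAGE = {
    ("raw.users", "email"): [("staging.users", "email")],
    ("staging.users", "email"): [
        ("marts.customers", "email"),
        ("marts.signups", "email_domain"),
    ],
    ("marts.customers", "email"): [],
    ("marts.signups", "email_domain"): [],
}

def downstream_columns(graph, source):
    """Return every column reachable downstream of `source` (BFS)."""
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Every column a known PII field flows into - your compliance scope.
pii_sinks = downstream_columns(LINEAGE, ("raw.users", "email"))
```

Once you have that set, answering "which tables must be scrubbed for a GDPR deletion request?" becomes a graph query instead of an archaeology project.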
Tool of the Week: Materialize
It seems we are past the peak of the hype cycle for streaming data applications: multiple frameworks, including Apache Beam and Apache Flink, claimed to revolutionize how we do ETL but haven't managed to make a dent in SQL-batch-dominated analytical data prep workflows (although they have gained a lot of popularity for real-time ML), in part because none offered first-class support for SQL. Materialize is interesting in that regard because it (a) has SQL as its primary interface, (b) is serious about SQL and even implements advanced JOIN functionality, and (c) has great interop with major data lake (Avro, JSON) and streaming (Kafka) interfaces. Bonus point: it's written in Rust 🤘
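The core idea behind Materialize is the incrementally maintained materialized view: instead of re-running a query on every read, the engine updates the stored result as each new event (or retraction) arrives. Here's a toy Python sketch of that idea for a simple `GROUP BY` count; it's an illustration of the concept, not how Materialize's actual engine (built on differential dataflow) works:

```python
from collections import defaultdict

class CountByKeyView:
    """Toy incrementally maintained view: SELECT key, count(*) ... GROUP BY key.

    Each (key, delta) event updates the materialized result in O(1)
    instead of rescanning the whole input history.
    """
    def __init__(self):
        self.counts = defaultdict(int)

    def apply(self, key, delta=1):
        # delta = +1 for an insert, -1 for a retraction/delete
        self.counts[key] += delta
        if self.counts[key] == 0:
            del self.counts[key]  # drop fully retracted keys

    def result(self):
        return dict(self.counts)

view = CountByKeyView()
for key, delta in [("click", 1), ("click", 1), ("view", 1), ("click", -1)]:
    view.apply(key, delta)
```

Reads are then just lookups against `view.result()`, which is why this model fits analytical workloads that batch frameworks were never great at serving in real time.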
Registrations Now Open for Data Quality Meetup #6
Registrations are now open for our upcoming Data Quality Meetup. I can’t wait to hear the lightning talks from data leaders at Yelp, Patreon, Convoy, and Lightdash. This is our last Data Quality Meetup of the year, and you don’t want to miss it! Register now for free, and join in from around the world.
Before You Go
As seen on Twitter.