Folding Data #13: Data Science Ethics

An Interesting Read: Data science students don’t know a lot about ethics

Data Science is, perhaps, the fastest-growing job market in tech. Luckily, getting ramped up on the required hard skills is easier than ever, not only through formal education but also through the abundance of boot camps and MOOCs. But, as the article below points out, Data Science is practiced by humans to (hopefully) help other humans, so the ethical aspect of data-driven decision making is essential, yet it is often overlooked during the training of specialists. Once you start thinking about this subject, you quickly realize it goes much deeper than the "fairness" of ML models and overcoming cognitive biases in data analysis: it also concerns privacy protection, reliability of data products, and much more. Fortunately, there are already dedicated courses on the subject. And if you still think this doesn't affect your daily life, try requesting an Uber with 1% battery remaining 😉

Data science students flunking ethics

Tool of the Week: Zingg

Zingg is an open-source, ML-based tool that reconciles and deduplicates records. It works with a wide range of entities, integrates natively with all the usual data warehouses and lakes, and reads from and writes to any Spark-supported store, so it could be right at home in your data stack.

Check out the de-dupe magic

What I’m Proud of...

In my days as a data engineer, I was always holding my breath to some degree. I knew that I could write good code and get my job done, but I never knew with 100% confidence that my PR wouldn't break things. In fact, the time I blew up Lyft's data warehouse with a four-line SQL "hotfix" a few years ago is the backbone of the "how Datafold started" story.

Lots of conversations on data quality revolve around anomaly detection, i.e., finding things that have already been broken. When Alex and I started Datafold to solve the data quality problem, we asked a different question: how can we help data teams not break data in the first place? That's how our first product, Data Diff, was born. To many around the time of our launch on Hacker News, it wasn't obvious how powerful testing PRs with Data Diff could be. But in the past year, we were lucky to partner with many great teams, including the cannabis marketplace Dutchie, who turned out to be strong believers in the proactive approach as well. To me as a founder, it means a lot to hear from someone like David Wallace, Staff Data Engineer @ Dutchie, that "Data Diff is the missing piece of the puzzle for data quality assurance"!

The blunt story about Dutchie and Datafold

Before You Go