Folding Data #40
dbt + Python = ?
dbt is one of the best things that has happened to the world of SQL. Although one can go quite far with SQL alone, Python becomes a must once you start building ML models or doing advanced data cleansing.
When dbt Labs announced Python support at COALESCE 2022, I was initially confused about how that would work architecturally. In Airflow, every task is essentially a Python process that can execute an arbitrary operation, from running SQL to training an ML model with Pandas. That’s convenient but exposes a lot of complexity related to managing execution environments. So before long, you are running a Kubernetes cluster just to run your DAGs 🥶
dbt elegantly avoided such scope creep by declaring that you can run Python tasks with dbt, just like SQL – only on external infrastructure.
That sounds good because most cloud data platforms provide ways to run Python jobs with dbt.
Our conclusion: you can run Python with dbt on pretty much any modern data cloud, including Snowflake, Databricks, and GCP, and even locally with fal (which could come in really handy for development!). Snowflake seems to have the most straightforward API for running Python – Snowpark – since, unlike Databricks and GCP, it is serverless and doesn’t require you to spin up a Spark cluster. Conceptually, it’s very similar to submitting a SQL query.
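To make that concrete: a dbt Python model is just a function that receives the dbt context and returns a DataFrame. Below is a minimal sketch of that interface; the `FakeDbt` stand-in, model names, and columns are illustrative only, so the snippet runs locally with pandas instead of a real Snowpark session:

```python
import pandas as pd

# The shape of a dbt Python model: a .py file under models/ exposing
# a model(dbt, session) function that returns a DataFrame.
# On Snowflake, dbt.ref() would return a Snowpark DataFrame; here we
# substitute pandas so the sketch runs anywhere.
def model(dbt, session):
    dbt.config(materialized="table")
    orders = dbt.ref("stg_orders")  # upstream dbt model as a DataFrame
    # An arbitrary Python transformation -- the part plain SQL can't do easily
    orders["revenue_z"] = (
        (orders["revenue"] - orders["revenue"].mean()) / orders["revenue"].std()
    )
    return orders

# --- Local stand-in for dbt's runtime, for illustration only ---
class FakeDbt:
    def config(self, **kwargs):
        pass  # dbt would record the materialization config here

    def ref(self, name):
        # dbt would resolve the upstream model; we fake a tiny table
        return pd.DataFrame({"order_id": [1, 2, 3],
                             "revenue": [10.0, 20.0, 30.0]})

result = model(FakeDbt(), session=None)
print(result["revenue_z"].round(2).tolist())  # [-1.0, 0.0, 1.0]
```

In a real project, dbt ships this function to the warehouse for execution, which is why the "external infrastructure" framing above works so neatly.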
So… does Python support for dbt mean you no longer need Airflow? I hope so; it served us well, but it’s time for it to retire and give way to new-generation orchestrators like Dagster. (By the way, David wrote a great post series on Dagster vs. Airflow.)
But on a serious note, the real questions are: where are the limits of dbt as the data orchestrator at the heart of the modern data stack? And at what point, if ever, should a data team consider bringing in a full-featured modern orchestrator such as Dagster?
The answer deserves its own post, but my intuitive take is that it's perfectly fine for most data teams with analytics engineering and occasional ML use cases to start with dbt – a tool built for analytics engineers – whose newly added Python support goes quite far. Dagster, as it is today, is made for software and data engineers and has a steep learning curve, even for technical users. One can imagine introducing Dagster as a natural progression for a data team as its use cases expand, reliability requirements tighten, and technical competence grows with the addition of data engineers.
Interesting Read: ChatGPT, and are we out of a job yet?
For as long as I’ve been writing SQL for a living, countless startups have been promising to help data users get the answers they need with AI, bypassing the chronic bottleneck: the data team. In their demos, the apps translated seemingly free-form questions about orders and revenue by city into SQL and returned promising-looking charts. None of them seem to be around anymore, and the consensus on why has been that the AI tech “just wasn’t there yet”. Post-ChatGPT, it may seem that the tech has finally arrived. Or has it?
For instance, Benn believes that while the long-awaited disruption of data workflows by AI is imminent, the prevailing challenge is that the way analytical data is organized is incompatible with AI: humans have been modeling the data for humans. (Not that it’s been particularly friendly to humans, but at least human analysts can make sense of the data "model" with enough time and grit.)
Benn suggests that the key to AI disrupting the work of data people is in designing our data models in an AI-friendly way:
“Today, analytics engineers have to translate complicated technical schemas into semantically useful ones that have to make sense to the people who use them. That’s an architectural job, it’s a creative job, and it’s hard. Schema models designed for LLMs, by contrast, would probably come with a spec. Because the model wouldn’t need to be semantically expressive—LLMs find patterns, not meaning—that spec would likely also be relatively simple and consistent. Our transformation layers wouldn’t be inventive kitchens, figuring how to design bespoke dishes for individual customers. They’d instead become factories, stamping out the same standardized and machine-readable perforations that everyone else does.”
I agree with Benn’s thesis, but I also think it’s worth distinguishing the use cases for AI in data workflows:
Suppose the hypothesis is that AI replaces the analyst/analytics engineer. In that case, we need an AI-based process that is far more robust than what we see now, and we need a different model for our data in the first place for the reasons cited by Benn.
However, just like computers ended up augmenting rather than replacing many skilled white-collar jobs, I believe that the current state of AI can be incredibly impactful in empowering data practitioners, even with the data not specifically organized for consumption by AI.
These tools don’t replace the analyst; they make the analyst more productive by automating mundane tasks such as writing typical SQL or Pandas code. Unlike in the replacing-the-analyst scenario, the skilled data practitioner using AI code generation doesn’t necessarily need the AI to produce perfectly correct code. All the AI needs to do to be helpful is save time on boilerplate by, for example, providing snippets that a competent analytics engineer can tweak – a glorified StackOverflow workflow of sorts.
Tool of the week: Patterns
The holy grail of data analytics is to inform and completely automate organizational decision-making. Reverse-ETL – moving data from the warehouse to business applications like Salesforce – has exploded. One step further is building reactive data-driven workflows. Reactive means that, unlike a classic data DAG that runs on a schedule and transforms or moves data around, the workflow is activated by some event, e.g., a new support ticket being opened. Zapier made this automation easy for business teams, and n8n built an excellent extensible open-source alternative. Now, Patterns aims to take this further by making the workflows data-first (e.g., Python support, native plugins for OpenAI, etc.). On that note, “Replacing a SQL analyst with 26 recursive GPT prompts” is a fun read and a cool demo of what reactive data applications can do.
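To illustrate the scheduled-vs-reactive distinction, here is a toy event-driven handler in plain Python. The event name and functions are made up for illustration and don’t correspond to the actual APIs of Patterns, Zapier, or n8n:

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    ticket_id: int
    body: str

# Registry of event handlers: event type -> list of callbacks
handlers = {}

def on(event_type):
    """Decorator that registers a function as a handler for an event type."""
    def register(fn):
        handlers.setdefault(event_type, []).append(fn)
        return fn
    return register

@on("ticket.opened")
def triage(ticket: Ticket) -> str:
    # In a real reactive workflow, this step might call an LLM,
    # enrich the ticket with warehouse data, or notify a Slack channel.
    return "urgent" if "down" in ticket.body.lower() else "normal"

def emit(event_type, payload):
    """Fire an event: run every registered handler, collect results."""
    return [fn(payload) for fn in handlers.get(event_type, [])]

# Nothing runs on a schedule -- the workflow fires when the event arrives.
print(emit("ticket.opened", Ticket(1, "Site is DOWN")))  # ['urgent']
```

The key design difference from a cron-style DAG is inversion of control: the pipeline doesn’t poll for new data; the event pushes the data into the pipeline.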
Job of the week: Senior Analytics Engineer @ GlossGenius
GlossGenius is building an ecosystem enabling entrepreneurs to succeed. We empower small business owners to focus on being creators, not admins, by offering a range of business management tools including booking and scheduling, marketing, analytics, payment processing, and much more.
As a Senior Analytics Engineer, you will be responsible for building and fine-tuning a scalable data warehouse that drives impactful business outcomes as GlossGenius continues to grow. You will partner cross-functionally with analytics, data engineering, and data-savvy business stakeholders to design reliable and consistent datasets, identify opportunities for improvement across business verticals, and scale data-driven decision-making with self-serve data assets.
You will report to the Head of Data and are welcome to either work out of our office in New York City or work remotely from anywhere in the continental US.
To learn more and apply, check out the job posting here.
Virtual Hands-on Lab: test-driven data development with Data Diff + dbt
dbt is awesome, and dbt + data diff is even better! 😊 Join our lab to get set up with the tools that help you develop dbt models faster and with higher confidence!
What you can expect:
🚀 Upgrade your personal dbt dev environment with the open-source data-diff toolkit
👀 Learn how to automate dbt code testing with Datafold Cloud for your team
🙌 We will walk through the instructions together and provide support over Slack if anyone gets stuck