The Modern Data Stack: Open-source Edition
In my earlier post, I proposed a data stack for a typical analytical use case along with the key criteria to choose tech for each step in the data pipeline, such as minimal operational overhead, scalability, and pricing. And while I firmly believe that open source is not by itself a compelling enough reason to choose a technology for the data stack, it can be the most feasible solution in some situations, for example, in a highly regulated industry or due to local privacy and data security legislation that may prohibit the use of foreign SaaS vendors.
Similar to my original blog, this post follows the steps in the data value chain. Organizations need data to make better decisions, either human (analysis) or machine (algorithms and machine learning). Before data can be used for that, it needs to go through a sophisticated multi-step process, which typically follows these steps:
- Specification: defining what to track (OSS in that area hasn’t evolved yet)
- Instrumentation: registering events in your code
- Collection: ingesting & processing events
- Integration: adding data from various sources
- Data Warehousing: storing, processing, and serving data
- Transformation: preparing data for end users
- Quality Assurance: bad data = bad decisions
- Data discovery: finding the right data asset for the problem
- Analysis: creating the narrative
Most open-source products listed below are in fact open-core, i.e. primarily maintained and developed by teams that make money off consulting about, hosting, and offering “Enterprise” features for those technologies. In this post, we are not taking into account the features that are only available through the SaaS/Enterprise versions, thereby comparing only openly available solutions.
For each open-source project, we use the following criteria:
Feature completeness – how closely the product matches best-in-class SaaS offering.
- # of stargazers
- # of contributors
- Velocity expressed in weekly commits – if it's very low, you risk adopting a stale project and end up fixing bugs and building integrations yourself.
Maturity - beyond features, how easy and reliable it will be to use in production. This is evaluated as:
- Promising product – the product demonstrates basic functionality, but using it in production likely will require some development and deployment work.
- Turnkey product – the product offers straightforward deployment and most common integrations are readily available.
- Mature product – the product is production-ready out of the box, is used by multiple large organizations, and is considered one of the go-to solutions in its category.
Segment (acquired for $3.2B by Twilio) has built great client-side instrumentation libraries for tracking users and events in JS, mobile, and server-side. Interestingly, Segment's current product story started with the release of the analytics.js library on Hacker News in 2012. The cool thing about the SDKs is that other vendors, including open-source products such as Jitsu (formerly known as EventNative) and Rudderstack (more on them below), offer drop-in Segment backend replacement.
Collection + Integration
Best-in-class SaaS: Segment, Fivetran
Collecting event data from client and server-side applications + third-party systems. In our previous post, we recommended Segment for event collection and Fivetran for other integrations. Those tools offer a great experience but can be a no-go for certain use cases because of their SaaS-only offering and steep pricing.
Here are five contenders for open-source alternatives to Segment and Fivetran:
Rudderstack appears to be the most well-rounded, mature, and actively developed platform that is an "open-source Segment alternative" class. However, you should carefully evaluate its architecture and limitations, such as the lack of event ordering and potential event loss during downscaling, that are solved only in the Enterprise version. Jitsu and Airbyte are young contenders focusing on event collection and 3rd party integrations respectively.
Here's how RudderStack compares to Segment.
SQL ETL + BI
Best-in-class SaaS: Snowflake, BigQuery
We previously went in-depth on the criteria for choosing a data warehouse as well as on common misconceived advantages of open-source data warehousing technologies, and have highlighted Snowflake & BigQuery as optimal solutions for typical analytical needs.
Assuming the typical pattern for analytics:
- Current or anticipated large (terabyte-petabyte) and growing data volume
- Many (10-100+) data consumers (people and systems)
- Load is distributed in bursts (e.g. more people query data in the morning but few at night)
- Query latency is important but not critical; 5-30s is satisfactory
- The primary interface for transforming and querying data is SQL
To meet the assumptions above, teams frequently use a serverless or a data lake architecture (decoupling storage from computing).
The leading open-source project that meets these criteria is Trino, which you may have known as PrestoSQL or just Presto. Side story: Trino/PrestoSQL is a fork of PrestoDB, originally developed at Facebook as a replacement to Hive. The reason for the fork and rename was, apparently, Facebook enacting its trademark and ownership rights on the software in a community-unfriendly way. You can read more on that from the core team behind PrestoDB/PrestoSQL.
Trino has the following advantages:
- Feature-rich SQL interface
- Works well for a wide range of use cases from ETL to serving BI tools
- Close to matching best-in-class SaaS offerings (Snowflake) in terms of usability and performance
- Provides a great trade-off between latency and scalability
Trino has been so successful as an open-source Snowflake alternative largely due to a substantial improvement in usability and abstracting the user (and the DevOps) from the underlying complexity: it just works. Forget hundreds of tuning parameters that were critical to make older systems like Hive work.
ML & Specialized Jobs
Spark is a true workhorse of modern data computing with a polyglot interface (SQL, Python, Java & Scala) and unmatched interoperability with other systems. It is also extremely versatile and handles a wide range of workloads from classic batch ETL to streaming to ML and graph analytics.
Do I need Trino if Spark can handle everything?
While Spark is a Swiss army knife for ETL and ML, it is not optimized for interactive query performance and usually has significantly higher latency for average BI queries as shown by benchmarks.
A popular pattern is to have both Trino & Spark used for different types of workflows while sharing data and metadata:
Wait, but what about the storage layer?
Most companies that embark on implementing a modern data lake choose to use blob storage from one of the leading cloud providers such as S3. If that is not an option, a bare-metal and open-source solution is the time-proven HDFS. However, given the popularity of S3, many modern systems developed interfaces specifically for it. This is where project like MinIO that offer a bare-metal drop-in replacement for S3 can come in handy.
Ultra-fast serving layer
There may be use cases that require sub-second query performance which cannot be achieved with a data lake architecture shown above (since data is sent over the network for every query). For such use cases, there are warehousing technologies that bundle storage and compute on the same nodes and, as a result, are optimized for performance (at the expense of scalability). ClickHouse, Pinot, and Druid are all strong open-source contenders for the sub-second query layer. Here’s an excellent article that compares the three products.
The role of the transformation framework is to help you develop and reliably execute code that transforms the data in the data lake for further analysis, ML, or other needs. Given the variety of available frameworks and tools, it is important to consider three major use cases:
Premise: your transformations are expressed predominantly in SQL and the data is then consumed via BI tools.
dbt was our top choice even when compared to SaaS tools for the following reasons:
- Comprehensive yet lightweight: captures most use cases and is easy to start with
- Open source yet smooth: on top of all benefits of OSS, it’s also easy to run
- Opinionated yet reasonable: relies on common conventions and best practices
- Great documentation and an active, devoted community
Full-featured data orchestration
Premise: in addition to (or instead of) SQL, your workflows include jobs defined in Python, PySpark, and other frameworks.
In this case, it may be beneficial to use a full-featured data orchestrator that:
- Manages scheduling and dependencies of disparate tasks
- Handles failures and retries gracefully
- Provides visibility into the data pipeline status, logs, and performance
The most widely used and incredibly influential data orchestrator is Apache Airflow.
Airflow is a safe choice if you are looking for a mature, time-proven technology with dozens of available integrations.
However, Airflow has a few chronic pains:
- Dependencies are not data-aware (e.g. ensuring the schema compatibility between upstream and downstream tasks is hard to impossible)
- Clunky development experience
- No built-in QA capabilities
- Limited visibility and debugging
In the last few years, multiple teams have attempted to rethink and improve orchestration, and at the moment, recommend taking a closer look at the following leading open-source orchestration frameworks:
Prefect follows the Airflow paradigm with a lot of sugar on top:
And better data-ops visibility:
Another framework worth mentioning is Dagster. Unlike Prefect that can be seen as the "modern Airflow", Dagster takes a bolder approach and proposes a few paradigm shifts:
- Reusable and composable tasks (called solids) – increasingly important as the complexity of data pipelines grows
- Data-aware tasks – a big step forward in ensuring the reliability of pipelines; now, dependencies between tasks can be verified both during development and in production – how cool is that!
- Integrated testing, tracing, logging
And, it offers a great data developer experience (expected nothing less from the co-author of GraphQL!). You can read more from the Dagster author Nick Schrock here.
The three technologies that are worth a close look when it comes to streaming transformations are Apache Spark, mentioned earlier, Apache Flink, and Apache Beam. Spark is a popular choice thanks to its mature, multi-language Structured Streaming framework.
Flink is a streaming-first engine with a Java/Scala API that offers a few advantages over Spark:
- Native support of stateful streaming transformations (e.g. to calculate metrics over user sessions)
- Low latency (millisecond scale vs. second with Spark)
While there are multiple deeply technical (e.g. memory management) and operational aspects that need to be taken into account when comparing Flink to Spark, from the strategic perspective, Spark is likely a better choice for straightforward ingestion/transformation jobs and would be easy to pick up by teams already using Spark for ML or batch transformations. Flink, on the other hand, enables more advanced workflows in production developed by professional data engineers.
Apache Beam, unlike Spark and Flink, is a framework (think like dbt but for streaming) that allows developers to write batch or streaming transformations and execute them on a “runner” of their choice, such as Spark or Flink. It’s certainly a powerful concept, and Beam’s API is designed to be expressive and robust. However, as most of the momentum behind Beam seems to originate at Google (Beam is the official interface to Google Dataflow), we have yet to see it being adopted by the larger open-source community. For that reason, it gets a ranking of a turnkey but not yet mature tech.
Beam's framework encourages functional-style pipeline definition:
With all the awesome tools for instrumenting, collecting, integrating, and processing data at your disposal, it has indeed become quite easy to create new datasets and analytical content. That, however, led to another problem - data discovery:
- Where do I find the data / prior analytical work on a topic?
- Can I trust that data? Who else uses it? When was it last updated? Are there any known quality issues?
Open-source tooling in the space is still young, but there are two projects: Amundsen by Lyft and Datahub by LinkedIn that show a lot of promise in the area. Out of the box, they provide a basic data catalog with search and offer a framework for augmenting the catalog with additional metadata (e.g. links to source code, ownership, etc.).
The open-source nature of Amundsen and DataHub allows their users to tailor them according to their stack, workflows, and team cultures, while their plugin architecture allows them to easily add metadata from various sources. For example, column-level lineage information can be synced to Amundsen from Datafold (which is not open source) via API and visualized in the Amundsen UI.
Data testing, i.e. validating business-specific assumptions (e.g. whether a value is within the expected range) when making changes to the data-processing code or when data is computed in production, becomes an essential part of the data team’s workflow. Modern data transformation frameworks – such as dbt and Dagster for batch, and Beam and Flink for streaming – have integrated testing features that allow developers to write and execute tests with minimal effort. However, sometimes there is a need for auxiliary tests to be run, for which GreatExpectations is a neat solution.
The basic flow is:
- Connect your database
- Create assertions in a Jupyter Notebook
- Integrate the created assertions in your ETL orchestration platform (e.g. Airflow)
In our previous post, I laid out the main use cases for BI:
- Self-serve analytics – no-code data analysis
- Dashboards – combinations of charts for passive review
- Ad-hoc analysis – advanced analysis by technical users
- End-to-end product analytics
While Metabase supports direct SQL > chart > dashboard flow, among the open-source BI tools, it has the most advanced self-serve UX that allows fewer technical users (or anyone, really) to explore data with simple aggregations:
The UX falls in between Looker and Mode - closer to the former, but without a semantic data model.
Another open-source alternative to Looker, Lightdash promises to connect to your dbt project and allow everyone to answer their own data questions. Lightdash is definitely not the most mature open-source BI product relative to Metabase or Superset, but the LookML-like modeling layer and tight integration with dbt make me believe they are onto something big! Read more about why I'm intrigued by Lightdash in my recent blog.
For a classic SQL > chart > dashboard flow, there are two strong contenders:
Simple & Effective
If your main interface to data is SQL and you are looking for a dashboarding tool that just works, look no further. Redash is easy to set up and has a minimalistic UX. It doesn't shine a no-code query constructor like Metabase or fancy Sankey charts of Superset, but it's simple and robust.
Something to keep in mind: after the acquisition by Databricks, the development of the open-source repo has slowed down by ~50%, as you can observe in the velocity below as well:
Started by the creator of Apache Airflow, Maxime Beauchemin at Airbnb, Superset is perhaps the most advanced open-source BI tool in terms of visualization capabilities and has a massive momentum behind it.
Keep in mind: there is a learning curve even for technical users.
SaaS state of the art for SQL ad-hoc analysis: Mode
Querybook is a promising notebook-like open-source IDE for data analysis originally developed at Pinterest. In line with the modern trend, it also features lightweight data discovery functionality.
For data science that typically heavily rely on R, Python or Julia, Jupyter is still a safe choice.
End-to-end product analytics
While SQL + charts tools give a lot of flexibility in terms of the types of analysis you can perform, for certain tasks such as optimizing UX and conversion in your app or website, end-to-end analytics tools offer pre-baked analysis and powerful visualizations such as funnels, cohort charts, and customer segmentation. These tools typically handle the entire cycle of event collection, storage, transformation, and visualization - hence "end-to-end".
PostHog is an open-source alternative to Mixpanel, Amplitude, and Heap.
Plausible is an open-source alternative to Google Analytics.
When I set out to write this post, I didn’t have a clear answer to whether a 100% OSS data stack can be as effective as if open source wasn’t a constraint. But after quite a bit of research and tinkering, I can say “YES” with a caveat that everyone should still choose open source for the right reasons and be mindful of the implications on support, hosting, maintenance and development.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.