October 4, 2024

Modern Data Stack

The Modern Data Stack: Open-source edition

If you're looking to build the ideal modern data stack for analytics using only open-source tools, we cover the top alternatives to your favorite paid solutions.

No items found.

Gleb Mezhanskiy

The Modern Data Stack: Open-source edition

Note: While some technologies listed in this guide are not open-source but source-available per their licenses, we believe it's important to list and evaluate them alongside since the license, in most cases, wouldn't have a material effect on how a typical data team would leverage the technology.

‍

Putting together an effective, scalable, cost-efficient data stack is challenging. The industry has made enormous strides over the past 10 years, and many technologies and products exist.

So why is it hard to put together a cohesive data stack?

Data stacks are fragmented: they require stitching multiple tools together.
Many technologies and vendors overlap, and it’s hard to understand what’s great for what.
The ecosystem is changing fast: new promising entrants pop up, and consolidation and economic hardship kill products. You need to place your bets wisely.

As data practitioners who also grapple with the same challenges, we aim to provide an overview of the top technology choices for each part of a modern data stack. While we do our best to base our judgment on certain objective criteria looking into the development and release velocity of projects and diving deep into documentation, it is inevitably a subjective opinion like any technology recommendation is. Opinions are our own based on decades of combined experience as data leaders at multiple companies including Autodesk & Lyft.

Why open source?

I firmly believe that open source alone is not a compelling enough reason to choose a technology for the data stack. However, there are a few good reasons to evaluate open source data stack elements:

Some parts of the data stack, such as orchestrators or in-memory data processing engines, greatly benefit from being open source as their users are likely to embed them in their internal applications and/or extend them.
For some organizations, open source may be the only possible solution, such as in a highly regulated industry or due to local privacy and data security legislation that may prohibit the use of foreign SaaS vendors.
Unlike closed-source SaaS, which often hides its full functionality behind demos and layers of overpromising marketing that can be hard to cut through, open-source SaaS is fully transparent in its capabilities and, therefore, is easier to research and compare.

Elements of the modern data stack

Since even a basic data stack has many moving parts and technologies to choose from, this post organizes the analysis according to the steps in the data value chain.

Organizations need data to make better decisions, either human (analysis) or machine (algorithms and machine learning). Before data can be used for that, it needs to go through a sophisticated multi-step process, which typically follows these steps:

Collection & Integration: getting data in the data warehouse/lake for further processing some text
1. Event Processing: Collecting events and getting them in the warehouse
2. Data Integration: Adding data from various sources into the warehouse and vice-versa (Reverse ETL)
Data Lake: Storing, processing, and serving data in a lake architecture, along with its Storage, File, Metadata, and Compute layers.
Data Warehousing: Similar to a data lake but an all-in-one solution with all the layers more tightly integrated.
Streaming & Realtime: Emerging category for low-latency or operational use cases
Data Orchestration: Managing how data pipelines are defined, structured, and orchestrated.
Data Catalogs: Finding the right data asset for the problem
BI & Data Apps: The consumption layer of the data stack with subcategories, including Notebooks, Self-serve, Dashboards, Code-first Data Apps, and Product/Web Analytics.

Most of the open-source products listed below are open-core, i.e., primarily maintained and developed by teams that profit from consulting about, hosting, and offering enterprise features for those technologies. In this post, we do not consider the features only available through the SaaS/Enterprise versions, thereby comparing only openly available solutions.

‍

Overview of open-source modern data stack products

‍

Evaluation criteria

For each open-source project, we use the following criteria:

Feature completeness – how closely the product matches best-in-class SaaS offering.

Traction

# of stargazers
# of contributors
Velocity that is expressed in commits and releases per month. If it's very low, you risk adopting a stale project, requiring you to fix bugs and build integrations on your own.

‍Maturity - beyond features, how easy and reliable it will be to use in production. This is a qualitative and subjective assessment.

Promising —the technology demonstrates basic functionality, but using it in production will likely require some development and deployment work.
Mature – the technology is production-ready out of the box, is used by multiple large organizations, and is considered one of the go-to solutions in its category.
Legacy – the technology is well-adopted, but its popularity and momentum have already peaked, and new entrants are actively disrupting it.

Collection and integration

Best-in-class SaaS: Segment (event processing), Fivetran (data integration), Hightouch (reverse ETL)

Collecting event data from client and server-side applications + third-party systems.

Event collection and ingestion

Rudderstack appears to be the most well-rounded, mature, and actively developed platform in an "open-source Segment alternative" class. It’s the open-source leader in integration at the event level. Snowplow is another popular solution pursuing the ETL (instead of ELT) pattern and enables in-flight transformations. While it had prominent open-source momentum a few years ago, these days, it seems to be focused on the cloud offering with almost no development on the open-source project. Jitsu, on the other hand, is a younger contender that is more focused on event ingestion into the warehouse.

Data integration + reverse-ETL

Airbyte is currently leading open-source data integration solution followed by Meltano, a project that originated at GitLab..

Airbyte and Meltano used to share the same core—the Singer protocol for extracting and loading data. Some time ago, Airbyte pivoted to its own connector framework while Meltano continued building on Singer. Another distinction is that Airbyte’s database replication is based on Debezium, arguably more mature and scalable than Meltano’s CDC based on Singer “taps”. Airbyte and Meltano focus more on data integration – bringing data into the warehouse from multiple sources – but offer some reverse-ETL (pushing data from the warehouse into business apps) functionality.

Here are the five contenders for open-source alternatives to Segment and Fivetran:

Feature	RudderStack	Jitsu	Snowplow	Airbyte	Meltano	dlt
Closest SaaS	Segment	Segment	Segment	Fivetran	Fivetran/Hightouch	Fivetran
API as a source	No	No	No	Yes	Yes (via Singer)	Yes
Event collection	Yes	Yes	Yes	No	No	No
Database replication	No	No	No	Yes (via Debezium)	Yes (via Singer)	No
API as destination (reverse ETL)	No	No	No	No	Yes (via Singer)	Yes
Database as destination	Yes	Yes	Yes	Yes	Yes (via Singer)	Yes
Maturity	Mature	New	Mature	Mature	Mature	Mature
Repo	rudderlabs/rudder-server	jitsucom/jitsu	snowplow/snowplow	airbytehq/airbyte	meltano/meltano	dlt-hub/dlt
Stargazers	4.1k	4.1k	6.8k	15.7k	1.8k	2.4k
# of contributors (180d)	24	7	2	178	12	42
# of weekly commits (180d)	11	8	0	247	9	18
Primary language	Go	TypeScript	Scala	Python	Python	Python
License	Elastic License 2.0	MIT	Apache 2.0	MIT	MIT	Apache 2.0

Data quality is your moat. Being proactive is your defense.

Learn best practices to safeguard data quality and unlock your data team’s velocity.

Data lake & warehousing

The heart of any data stack! Let’s get clear with the definitions first:

Data lake vs. data warehouse

	Data Lake	Data Warehouse
Architecture	Assembled from individual components, storage and compute are decoupled	Storage and compute are more coupled
Query Engine	Supports multiple query engines (e.g. Spark, DuckDB operating on a unified data lake)	Typically has one query engine
Scalability	Highly scalable, compute and storage can be scaled separately	Depends on the architecture, usually less scalable than a data lake
Structure	Supports both unstructured and structured (via metastore) data	More structured
Performance	Depends on the query engines and architecture	Typically more performant due to tighter coupling of storage and compute
Management overhead	Higher due to multiple components that need to work together	Depends on the specific technology; usually lower due to fewer moving parts

Typically, a data lake approach is chosen when scalability and flexibility are more important than performance and low management overhead.

Data lake components

Data lake architecture is organized into layers each responsible for its part:

Storage Layer – operates the physical storage
File Layer – determines in what format the data is stored in the storage layer
Metadata Layer – defines structure (e.g. what files correspond to what tables and simplifies data ops operations and governance)
Compute Layer – performs data processing and querying

Storage layer

Best-in-class SaaS: AWS S3, GCP GCS, Azure Data Lake Storage

Object storage is the prevalent way to store data due to its simplicity and infinite scalability, where you can add more data without worrying about available space. Modern object storage solutions by major cloud providers replaced open source HDFS (Hadoop Distributed File System), which provided a file system spread across clusters of machines and required a lot of overhead to manage while being far less elastic.

The storage layer is one that many data platforms that lean heavily on open source decide to buy from a vendor such as AWS. However, if you want to go 100% open source, MinIO is the leading technology for self-managed blob storage that is compatible with S3 API (and hence highly interoperable within the data ecosystem). It can be easily deployed and managed through Kubernetes.

File layer

The File Layer determines how the data is stored on the disk. Choosing the right storage format can have a massive impact on the performance of data lake operations. Luckily, given the flexibility of the data lake architecture, you can mix and match different storage formats. Apache Parquet is the leading storage format for the file layer due to its balanced performance, flexibility, and maturity. Avro is a more specialized option for specific applications.

Feature	Parquet	Avro
Data orientation	Columnar	Row
Compression	High	Medium
Optimized for	Read-heavy analytical workloads, classic ELT architectures	Row-oriented workloads, data exchange, streaming
Schema evolution	Partial	Full
Interoperability	Best	Good
Repo	apache/parquet-format	apache/avro
Stargazers	1.8k	2.9k
# of contributors (180d)	11	37
# of weekly commits (180d)	0	12
Maturity	Mature	Mature
License	Apache 2.0	Apache 2.0

Metadata layer

Since in the data lake architecture, unlike a data warehouse, we operate with files directly, the metadata layer (also called “table format” in modern solutions such as Apache Iceberg) is a key component that enables efficient data workloads at scale while abstracting the end users away from the complexity of dealing with files. It serves two primary purposes:

Linking files to tables allows the user to query a table without knowing in what file(s) the data is stored.
Enabling data ops such as concurrent writes, ACID transactions, versioning, and file compaction that simplify data management and ensure high robustness.

Hive Metastore, originally built within the Hadoop/Hive ecosystem, has been the prevalent metadata layer for over a decade, primarily focused on the first use case. Over the last decade, as the data complexity, usage, and volume increased, “just” storing data as tables became no longer feasible. Today, Hive Metastore, while still widely adopted, is considered legacy as it’s been superseded by newer data lake platforms such as Apache Iceberg, which not only provide a metastore but also much more powerful table management capabilities that enable sophisticated data workflows.

Apache Hudi and Delta Lake should be mentioned as more specialized data lake platforms originally optimized for Apache Spark as an engine.

As of now, Apache Iceberg seems to be winning as the standard data lake platform providing table and metadata management for a broad variety of use cases.

Feature	Iceberg	Hive Metastore	Hudi	Delta Lake
Originated at	Netflix	Facebook	Uber	Databricks
File formats	Parquet, Avro, etc.	Parquet, Avro, etc.	Parquet, Avro, etc.	Parquet
Schema evolution	Yes	Limited	Limited	Yes
Transactions	ACID	No	ACID	ACID
Concurrency	High	Limited	High	High
Table Versioning	Yes	No	Yes	Yes
Interoperability	High (Spark, Trino, Flink, BigQuery)	High (Hive, Trino, Spark)	Limited (Optimized for Apache Spark)	Limited (Optimized for Apache Spark)
Repo	apache/iceberg	apache/hive	apache/hudi	delta-io/delta
Stargazers	6.2k	5.5k	5.4k	7.5k
# of contributors (180d)	95	77	84	81
# of weekly commits (180d)	21	15	19	20
Maturity	Mature	Legacy	Mature	Mature
License	Apache 2.0	Apache 2.0	Apache 2.0	Apache 2.0

Compute layer

The Compute Layer is the most user-facing part of the data lake architecture since it’s what data engineers, analysts, and other data users leverage to run queries and workloads.

There are multiple ways compute engines can be compared:

Interface: SQL, Python, Scala, etc.
Architecture: in-process vs. distributed
Primary processing: in-memory vs. streaming
Capabilities: querying, transformations, ML

We believe that today, the architecture – the in-process vs. distributed – is the most important distinction since other aspects, e.g. supported languages or capabilities can be added over time.

In-process compute engines

Over the past few years, in-process compute engines emerged as a novel way of running compute on data lakes. Unlike distributed compute engines like Hive, Trino, and Spark which run their own clusters of nodes that are queried over the network, in-process engines are invoked from and run within data science notebooks or orchestrator jobs. I.e., they function similarly to libraries that are called by the code that runs the flow. To illustrate, with DuckDB you can do the following from a Python application:


import duckdb
duckdb.connect().execute("SELECT * FROM your_table")

Whereas working with a distributed system like Spark requires first spinning it up as a standalone cluster and then connecting to it over the network.

The power of the in-process pattern is simplicity and efficiency: distributed engines like Spark and Trino require a significant dedicated effort to operate, whereas in-process engines can run inside an Airflow task or a Dagster job and require no DevOps or additional infrastructure.

The are three main limitations of in-process engines:

Size scalability: data queried should generally fit in memory limited by the available memory to the process/machine; although disc spilling is increasingly supported as a fallback.
Compute scalability: parallelization is limited to the number of cores available to the machine and cannot be parallelized to multiple machines.
Limited support for data apps: while in-process engines can be used for various workloads, including ingestion, transformation, and machine learning, most BI tools and business applications require a distributed engine with an online server endpoint. One cannot connect to DuckDB from, say, Looker since DuckDB runs within a process and does not expose a server endpoint, unlike PostgreSQL.

As a rule of thumb, the in-process engines are optimal and efficient for small-medium (up to hundreds of gigabytes) data operations.

The following three technologies are leading in-process data engines:

DuckDB – a column-oriented SQL in-process engine

Pandas – a popular Python library for processing data frames

Polars – widely considered “faster Pandas”, Polars is a new-generation data frame engine optimized for performance.

It’s worth noting that DuckDB and Polars are rapidly growing their interoperability with other technologies with the adoption of highly efficient Apache Arrow (ADBC) as a data transfer protocol and Iceberg as a table format.

Feature	DuckDB	Pandas	Polars
Repo	duckdb/duckdb	pandas-dev/pandas	pola-rs/polars
Querying Interface	SQL	Python	Python, Rust, SQL
File IO	JSON, CSV, Parquet, Excel, etc.	JSON, CSV, Parquet, Excel, etc.	JSON, CSV, Parquet, Excel, etc.
Database integrations	Native PostgreSQL, MySQL; Any via ADBC	Any via ADBC or SQLAlchemy	Any via ADBC or SQLAlchemy
Parallel (multi-core) execution	Yes	No (through extensions)	Yes
Metastore Support	Iceberg	None	Iceberg (emerging)
Written in	C++	Python	Rust
Stargazers	2.4k	43.4k	29.6k
# of contributors (180d)	42	201	149
# of weekly commits (180d)	18	33	63
Maturity	Mature	Mature	Mature
License	MIT	BSD 3-Clause	MIT

Distributed compute engines

Best-in-class SaaS: Snowflake, Databricks SQL, BigQuery

We previously discussed criteria for choosing a data warehouse and common misconceived advantages of open-source data processing technologies. We have highlighted Snowflake, BigQuery and Databricks SQL as optimal solutions for typical analytical needs.

Assuming the typical pattern for analytics:

Current or anticipated large (terabyte-petabyte) and growing data volume
Many (10-100+) data consumers (people and systems)
Load is distributed in bursts (e.g. more people query data in the morning but few at night)
Query latency is important but not critical; 5-30s is satisfactory
The primary interface for transforming and querying data is SQL

To meet the assumptions above, teams frequently use a data lake architecture (decoupling storage from computing).

The leading open-source project that meets these criteria is Trino. Trino has the following advantages:

Feature-rich SQL interface
Works well for a wide range of use cases from ETL to serving BI tools
Close to matching best-in-class SaaS offerings in terms of usability and performance
Provides a great trade-off between latency and scalability

Trino has been so successful as an open-source Snowflake alternative largely due to a substantial improvement in usability and abstracting the user (and the DevOps) from the underlying complexity: it just works. Forget hundreds of critical tuning parameters to make older systems like Hive work.

ML & specialized jobs

Spark is a true workhorse of modern data computing with a polyglot interface (SQL, Python, Java & Scala) and unmatched interoperability with other systems. It is also extremely versatile and handles a wide range of workloads, from classic batch ETL to streaming to ML and graph analytics.

Do I need Trino if Spark can handle everything?

‍While Spark is a Swiss army knife for ETL and ML, it is not optimized for interactive query performance. Usually, it has significantly higher latency for average BI queries, as shown by benchmarks.

A popular pattern is to have both Trino & Spark used for different types of workflows while sharing data and metadata:

Feature	Trino	Spark
Repo	trinodb/trino	apache/spark
Architecture	Distributed	Distributed
Querying Interface	SQL	Scala, Java, Python, R, SQL
Fault Tolerance	None - partial failures require full restart	Fault-tolerant, can restart from checkpoints
Latency	Lower	Higher
Streaming	No	Yes
Metastore Support	Hive Metastore, Iceberg, Delta Lake (emerging)	Delta Lake (optimal), Hive Metastore, Iceberg
Machine learning	No	Yes
Written in	Java	Scala
Stargazers	10.3k	39.4k
# of contributors (180d)	122	180
# of weekly commits (180d)	100	73
Maturity	Mature	Mature
License	Apache 2.0	Apache 2.0

‍

Notable mentions:

PrestoDB

PrestoDB shares a common ancestry with Trino (previously called PrestoSQL). Side story: Trino is a fork of PrestoDB, which was initially developed at Facebook as a replacement for Hive. Facebook enacted its trademark and ownership rights on the software, prompting the fork and rename. You can read more on that from the core team behind PrestoDB/PrestoSQL. We recommend Trino over PrestoDB due its larger community (PrestoDB is primarily maintained by Facebook/Meta and therefore prioritizes the internal needs).

Hive – Hive was once a major breakthrough in large-scale data compute offering a SQL interface to Hadoop MapReduce. However, by now it’s mostly obsolete and has no advantages over modern systems like Trino and Spark while being much slower and harder to maintain and use.

Data warehouses

Two good reasons for choosing data warehouse (coupled storage and compute architecture) are:

Performance: use cases that require sub-second query latencies which are hard to achieve with a data lake architecture shown above (since data is sent over the network for every query). For such use cases, there are warehousing technologies that bundle storage and compute on the same nodes and, as a result, are optimized for performance (at the expense of scalability), such as ClickHouse and Hydra.
Simplicity - Operating a data warehouse is usually significantly simpler than a data lake due to fewer components and moving parts. It may be an optimal choice when:
1. The volume of data and queries is not high enough to justify running a data lake.
2. Data lake already exists but certain workloads require lower latency (e.g., BI). As far as simplicity goes, projects like Hydra that are based on PostgreSQL are hard to match in terms of ease of deployment, with scalability as a potential tradeoff.

Both ClickHouse and Hydra take advantage of the tighter coupled storage and compute standard for the data warehousing paradigm:

Column-oriented data storage optimized for analytical workloads
Vectorized data processing allowing the use of SIMD on modern CPUs
Efficient data compression

However, they have several important distinctions:

ClickHouse is one of the most mature and prolific open-source data warehousing technologies originating at Yandex (aka the Google of Russia). It powers Yandex Metrica (a competitor to Google Analytics) and processes over 20 billion events daily in a multi-petabyte deployment.

Hydra is a newer player that builds upon PostgreSQL. As an extension to PostgreSQL, Hydra has excellent interoperability with other tools while adding powerful analytical features and optimizations to PostgreSQL. Hydra’s primary limitation is its scalability – unlike ClickHouse, which can scale to hundreds of nodes and handle petabytes of data, Hydra currently can only scale vertically on a single node.

Feature	ClickHouse	Hydra
Repo	ClickHouse/ClickHouse	hydradatabase/hydra
Architecture	Distributed	Single node (based on PostgreSQL)
Data volume	Up to petabytes	Up to hundreds of gigabytes
Scaling by	Adding nodes	Increasing instance size and storage (limited to 1x instance)
Querying Interface	SQL	SQL
External data sources	PostgreSQL, MySQL, Snowflake, BigQuery, Kafka, S3, GCS, Redshift	S3, PostgreSQL, MySQL, Google Sheets
Streaming Inserts	Yes	Emerging
Written in	C++	C
Stargazers	36.9k	2.8k
# of contributors (180d)	253	4
# of weekly commits (180d)	570	0
Maturity	Mature	Promising
License	Apache 2.0	Apache 2.0

‍

Notable mentions:

Pinot and Druid are distributed OLAP data engines that are more similar to ClickHouse than to Hydra. Originally developed at LinkedIn and Metamarkets respectively, those systems are largely superseded by ClickHouse in terms of lower maintenance costs, faster performance, and a wider range of features (for example, Pinot and Druid have very limited support for JOIN operations that are key parts of modern data workflows).

Data orchestration

The role of the data orchestrator is to help you develop and reliably execute code that extracts, loads, and transforms the data in the data lake or warehouse as well as perform tasks such as training machine learning models. Given the variety of available frameworks and tools, it is important to consider three major use cases:

SQL-centric workflows

Premise: your transformations are expressed predominantly in SQL.

dbt was our top choice even when compared to SaaS tools for the following reasons:

Comprehensive yet lightweight: captures most use cases and is easy to start with
Open source yet smooth: on top of all the benefits of OSS, it’s also easy to run
Opinionated yet reasonable: relies on common conventions and best practices
Great documentation and an active, devoted community

SQLMesh is a new player that positions itself as an alternative to dbt and claims to provide a better developer experience with the same paradigm.

Key differentiators:

Code Understanding – unlike dbt, which relies on Jinja templating, SQLMesh leverages full SQL parsing and code manipulation leveraging SQLGlot. This, in turn, enables it to solve several problems:some text
1. Column-level lineage – SQLMesh tracks dependencies for all models that it manages on the columnar level.
2. Impact analysis – SQLMesh analyzes changes, determines which are breaking vs. not, and simplifies dataops such as backfills and deploys.
More robust handling of incremental/partitioned and slowly-changing dimension tables. Incremental (or partitioned) models are needed when data is too large to be processed in one query run. In dbt, such models are handled in a non-idempotent way, meaning that each run results in a different output that is not deterministic because the state is not managed. SQLMesh preserves the state of incremental models making expressing them easier and eliminating potential data leakage and data quality issues. It also provides richer templating support for various types of incremental models. Read in their own words how the dbt workflow feels compared to SQLMesh.

As of today, dbt is a de-facto standard for SQL transformations with a massive and loyal community and viral adoption. Nevertheless, SQLMesh is worth paying attention to given its strong technological core and focus on developer experience and data quality.

Feature	dbt Core	SQLMesh
Repo	dbt-labs/dbt-core	TobikoData/sqlmesh
Interface	CLI	CLI + UI
Column-level Lineage	No	Yes
Macros	Jinja	Jinja, Python
Unit tests	Coming in v1.8	Yes
Embedded Data Diff	No	Yes
Contracts	Schema only	Schema + Data
Multiple Environments	Yes, via materialization	Virtual Data Environments
Package Marketplace	Yes	No
Multi-repository support	No	Yes
Written in	Python	Python
Stargazers	9.7k	1.7k
# of Slack community members	34k	2.5k
# of contributors (180d)	37	38
# of weekly commits (180d)	9	24
Maturity	Mature	Promising
License	Apache 2.0	Apache 2.0

Full-featured data orchestration

Premise: in addition to (or instead of) SQL, your workflows include jobs defined in Python, PySpark, and/or other frameworks. While dbt and SQLMesh provide basic support for Python models, their focus is on SQL and a full-featured orchestrator may be warranted for running multi-language workloads.

In this case, it may be beneficial to use a full-featured data orchestrator that:

Manages scheduling and dependencies of disparate tasks
Handles failures and retries gracefully
Provides visibility into the data pipeline status, logs, and performance

The most widely used and incredibly influential data orchestrator is Apache Airflow.

We consider Airflow a legacy technology that should no longer be the default option as a general-purpose orchestrator. While it’s incredibly widely adopted and has many extensions, more modern orchestrators such as Dagster are better suited for present day data complexity and development lifecycle, and have reached maturity to be run at an Enterprise level.

In the last few years, multiple teams have attempted to rethink and improve orchestration, and at the moment, recommend taking a closer look at the following leading open-source orchestration frameworks.

Prefect follows the Airflow paradigm and aims to provide a better developer experience:

And better data-ops visibility:

The most promising full-featured orchestrator that we are excited about today is Dagster. Unlike Prefect, which can be seen as the "modern Airflow", Dagster takes a bolder approach and proposes a few paradigm shifts.

The biggest one is Software-defined Assets: unlike Airflow where the primary concept is a unit of work – Task – with multiple Tasks chained into DAGs, Dagster defines the output – an asset – as its primary concern: "An asset is an object in persistent storage, such as a table, file, or persisted machine learning model. A Software-defined Asset is a Dagster object that couples an asset to the function and upstream assets used to produce its contents.”

While the difference may seem subtle at first glance, it turns the paradigm upside down by shifting the focus from the work being done to the actual data product produced (what the business cares about).

Dagster brings and implements a number of other major enhancements relative to Airflow, including:

Data-aware tasks – a big step forward in ensuring the reliability of pipelines; now, dependencies between tasks can be verified both during development and in production.
Declarative, SLA-driven Scheduling – in Airflow’s paradigm, DAGs of tasks are triggered on a particular schedule. It is up to the data engineer to figure out the scheduling and dependencies making it increasingly difficult to maintain Service Level Agreements at scale. Dagster takes a radically different approach with declarative SLA-driven scheduling: you define when the data asset needs to be produced or updated, and the upstream scheduling is managed for you.
Integrated testing, tracing, logging
First-class environment support for easier development, deployment and maintenance in production.

Mage is a newer player in the data orchestration space. Its focus seems to be on the speed of development and it may be a great choice for smaller data teams that prioritize speed of development over robustness. For example, Mage comes with a built-in data integration framework based on Singer that enables data syncing and removes the need for standalone data integration software.

Feature	Airflow	Dagster	Prefect	Mage
Repo	apache/airflow	dagster-io/dagster	PrefectHQ/prefect	mage-ai/mage-ai
Data contracts	No	Yes	Yes	Yes
Multiple Environments	No	Yes	Yes	Yes
SLA-driven Scheduling	No	Yes	No	No
dbt integration	No/limited	Yes	No/limited	Yes
Written in	Python	Python	Python	Python
Stargazers	36.5k	11.3k	15.9k	7.8k
# of Slack community members	24.4k	10.6k	14.4k	5.8k
# of contributors (180d)	335	111	74	45
# of weekly commits (180d)	79	99	59	11
Maturity	Legacy	Mature	Mature	Promising
License	Apache 2.0	Apache 2.0	Apache 2.0	Apache 2.0

‍

Streaming & real time

Streaming and real-time data processing space is a vast space with a lot of nuance. In this guide, we offer a high-level overview of concepts and mention a few foundational and breakthrough technologies.

Streaming paradigm is key in a few use cases:

Collecting data (e.g., events) and ingesting it into a data warehouse/lake
Processing data in-flight for enrichment before loading into the long-term storage
Real-time analytics and machine learning, e.g. fraud detection, operational analysis, etc. – where latency matters.

The cornerstone of data streaming is Apache Kafka. Kafka implements a publisher-subscriber (pub/sub) event hub allowing it to collect, store, and deliver vast amounts of data with low latency. Many technologies integrate with Kafka as a source or a sink.

When it comes to streaming transformations and compute, three technologies are worth a close look: Apache Spark (mentioned earlier), Apache Flink, and Materialize.

Spark is a popular choice for implementing streaming applications thanks to its mature, multi-language Structured Streaming framework and versatility allowing data teams to leverage Spark both for their real-time and batch workloads.

Flink is a streaming-first engine with a Java/Scala API that offers a few advantages over Spark:

Native support of stateful streaming transformations (e.g. to calculate metrics over user sessions)
Low latency (millisecond scale vs. second with Spark)

While multiple deeply technical (e.g. memory management) and operational aspects need when comparing Flink to Spark, from the strategic perspective, Spark is likely a better choice for straightforward ingestion/transformation jobs and would be easy to pick up by teams already using Spark for ML or batch transformations. On the other hand, Flink enables more advanced production workflows developed by professional data engineers.

Materialize is a newer player in the real-time world offering a new paradigm: “a cloud-native data warehouse purpose-built for operational workloads where an analytical data warehouse would be too slow, and a stream processor would be too complicated.” Materialize’s framework is centered around real-time views allowing the developers to implement applications in a familiar table/view pattern but leverage real-time updates. It is therefore best suited for real-time analytics and applications that require up-to-the-second insights, such as monitoring dashboards, personalization features, and financial alerts.

Feature	Spark	Flink	Materialize
Optimized for	Versatility across ETL/ELT, large-scale processing, and machine learning	Low-latency high-throughput stream processing	Low-latency updates and queries over streaming data with emphasis on consistency and accuracy
Support for batch paradigm	Yes	Yes	No
Latency	Seconds	Subsecond	Subsecond
Scalability	High	High	High
Interface	Scala, Java, Python, SQL	Scala, Java	SQL
Written in	Scala	Java	Rust
Repo	apache/spark	apache/flink	materializeInc/materialize
Stargazers	39.4k	23.9k	5.7k
# of Slack community members	N/A	4.3k	2.3k
# of contributors (180d)	180	115	49
# of weekly commits (180d)	73	30	208
Maturity	Mature	Mature	Mature
License	Apache 2.0	Apache 2.0	Business Source

‍

Notable mentions:

Apache Beam, unlike Spark and Flink, is a framework (think like dbt but for streaming) that allows developers to write batch or streaming transformations and execute them on a “runner” of their choice, such as Spark or Flink. It’s certainly a powerful concept, and Beam’s API is designed to be expressive and robust. However, as most of the momentum behind Beam seems to originate at Google (Beam is the official interface to Google Dataflow), we have yet to see it being adopted by the larger open-source community. For that reason, it gets a ranking of a turnkey but not yet mature tech.

‍

Data cataloging

With all the awesome tools for instrumenting, collecting, integrating, and processing data at your disposal, it has indeed become quite easy to create new datasets and analytical content.

That, however, led to another problem - data discovery:

Where do I find the data / prior analytical work on a topic?
Can I trust that data? Who else uses it? When was it last updated? Are there any known quality issues?

Open-source tooling in the space is still young, but there are two projects: Amundsen by Lyft and Datahub by LinkedIn that show a lot of promise in the area. Out of the box, they provide a basic data catalog with search and offer a framework for augmenting the catalog with additional metadata (e.g. links to source code, ownership, etc.).

The open-source nature of Amundsen and DataHub allows their users to tailor them according to their stack, workflows, and team cultures, while their plugin architecture allows them to easily add metadata from various sources. For example, column-level lineage information can be synced to Amundsen from Datafold (which is not open source) via API and visualized in the Amundsen UI.

‍

Feature	Amundsen	DataHub
Originally developed at	Lyft	LinkedIn
Optimized for	Data discovery and search	Data discovery and metadata management
Lineage	Table-level, bring-your-own lineage	Column-level lineage for a subset of integrations
UI	Emphasizing simplicity of discovery	More feature-rich and complex
Written in	Python	Java, Python
Repo	amundsen-io/amundsen	datahub-project/datahub
Stargazers	4.4k	9.7k
# of Slack community members	3.3k	11k
# of contributors (180d)	5	123
# of weekly commits (180d)	0	36
Maturity	Mature	Mature
License	Apache 2.0	Apache 2.0

Analysis

In our previous post, I laid out the main use cases for BI:

Self-serve analytics – no-code data analysis
Dashboards – combinations of charts for passive review
Ad-hoc analysis – advanced analysis by technical users
Code-first data app frameworks for building highly customizable interactive data applications embedded into other products.
End-to-end product analytics

Self-serve analytics & dashboards

SaaS state of the art for SQL ad-hoc analysis: Looker, Tableau

While Metabase supports direct SQL > chart > dashboard flow, among the open-source BI tools, it has the most advanced self-serve UX that allows less technical users (or anyone, really) to explore data with simple aggregations:

‍

‍

The UX falls in between Looker and Mode - closer to the former, but without a semantic data model.

Another open-source alternative to Looker, Lightdash promises to connect to your dbt project and allow everyone to answer their own data questions. Lightdash is definitely not the most mature open-source BI product relative to Metabase or Superset, but the LookML-like modeling layer and tight integration with dbt make me believe they are onto something big! Read more about why I'm intrigued by Lightdash in my recent blog.

Started by the creator of Apache Airflow, Maxime Beauchemin at Airbnb, Superset is perhaps the most advanced open-source BI tool in terms of visualization capabilities and has a massive momentum behind it.

Keep in mind: there is a learning curve even for technical users.

‍

‍

Feature	Metabase	Lightdash	Superset
Repo	metabase/metabase	lightdash/lightdash	apache/superset
Closest SaaS	Tableau	Looker	Tableau
Optimized for	Self-serve data exploration and dashboards	Self-serve data exploration and dashboards	Dashboards with advanced data visualization
Focus	Intuitive UI over extensibility	Deep integration with dbt as a semantic layer	Extensibility with plugins, custom visualizations
Written in	Clojure	Typescript	Typescript, Python
Stargazers	38.3k	3.8k	62.1k
# of Slack community members	Website-based forum	1.6k	15.5k
# of contributors (180d)	73	34	132
# of weekly commits (180d)	96	57	36
Maturity	Mature	Promising	Mature
License	Apache 2.0	MIT	Apache 2.0

Ad-hoc analysis & notebooks

SaaS state of the art for SQL ad-hoc analysis: Hex

‍Querybook is a promising notebook-like open-source IDE for data analysis originally developed at Pinterest. In line with the modern trend, it also features lightweight data discovery functionality.

For data science that typically heavily relies on R, Python, or Julia, Jupyter is still a safe choice.

Evidence.dev is an interesting project that pushes the simplicity of creating dashboards and visualizations to the edge: making the workflow centered on SQL and Markdown

Feature	Jupyter	Querybook	Evidence
Primary interface	Python	SQL	SQL + Markdown
Optimized for	Data science, ad-hoc analytics, machine learning	Ad-hoc SQL analysis and data discovery	Getting analytics reports through SQL and markdown
Written in	Python	Typescript, Python	JavaScript
Repo	jupyter/notebook	pinterest/querybook	evidence-dev/evidence
Stargazers	11.6k	1.9k	4.2k
# of Slack community members	No Slack community, only public forum	600	1.3k
# of contributors (180d)	8	8	23
# of weekly commits (180d)	1	1	78
Maturity	Mature	Mature	Promising
License	BSD 3-Clause	Apache 2.0	MIT

‍Additional tools:

Polynote is a Scala-first notebook experience with support for Python and SQL which could be a great option for teams building on Spark.

End-to-end product analytics

‍While SQL + charts tools give a lot of flexibility in terms of the types of analysis you can perform, for certain tasks such as optimizing UX and conversion in your app or website, end-to-end analytics tools offer pre-baked analysis and powerful visualizations such as funnels, cohort charts, and customer segmentation. These tools typically handle the entire cycle of event collection, storage, transformation, and visualization - hence "end-to-end".

‍PostHog is an open-source alternative to Mixpanel, Amplitude, and Heap.

Plausible is an open-source alternative to Google Analytics.

Feature	PostHog	Plausible
Optimized for	Enabling product analytics in one platform	Web analytics alternative to Google Analytics
Written in	Python, Typescript	Elixir, Javascript
Repo	PostHog/posthog	plausible/analytics
Stargazers	21.2k	19.9k
# of Slack community members	Website-based forum	Website-based forum
# of contributors (180d)	75	13
# of weekly commits (180d)	108	16
Maturity	Mature	Mature
License	MIT	AGPLv3

‍

Code-first data apps

Plotly Dash

Developed by Plotly, Dash is designed for creating analytical web applications. It's Python-based and heavily integrates with Plotly for visualization, which makes it ideal for building highly interactive, data-driven applications. Dash is particularly popular among data scientists who are familiar with Python and want to create complex, customizable web apps without extensive web development experience.

Streamlit

Streamlit is another Python-based tool that allows for the rapid creation of web apps specifically for data science and machine learning projects. Its main selling point is its simplicity and efficiency, allowing developers to turn data scripts into shareable web apps using straightforward Python code. Streamlit was recently acquired by Snowflake and is getting increasingly integrated into the Snowflake ecosystem.

Observable

Observable is a new entrant in the data app world offering a framework for creating interactive visualizations and narratives using Markdown, JavaScript, SQL, etc. “You write simple Markdown pages — with interactive charts and inputs in reactive JavaScript, and with data snapshots generated by loaders in any programming language (SQL, Python, R, and more) — and Observable Framework compiles it into a static site with instant page loads for a great user experience. “

Feature	Plotly Dash	Streamlit	Observable Framework
Primary interface	Python	Python	Javascript
Optimized for	Building interactive analytical web apps	Fast prototyping of data apps	Building data narratives with Markdown + code
Written in	Python	Typescript, Python	Javascript/Typescript
Repo	plotly/dash	streamlit/streamlit	observablehq/framework
Stargazers	21.3k	34.9k	2.4k
# of Slack community members	Website-based forum	Website-based forum	2.3k
# of contributors (180d)	18	38	20
# of weekly commits (180d)	11	17	10
Maturity	Mature	Mature	Promising
License	MIT	Apache 2.0	ISC

Additional tools:

Cube

Unlike Dash and Streamlit, Cube is not directly a tool for building applications but a headless BI platform that acts as an analytics API layer between your databases and applications. It is designed to provide a consistent way to access data for analytics and visualizations across different client-side frameworks (including Dash and Streamlit), making it a powerful backend for data applications.

Putting it all together

When I set out to write this post, I didn’t have a clear answer as to whether a 100% OSS data stack could be as effective if open source wasn’t a constraint. But after quite a bit of research and tinkering, I can say “YES” with a caveat that everyone should still choose open source for the right reasons and be mindful of the implications on support, hosting, maintenance, and development.