Modern Analytics Stack

Gleb Mezhanskiy

June 22, 2020

Is there a perfect stack for analytics?

I often come across questions such as “What is the best tool for Product Analytics / data integration / data warehousing?” There are a myriad of vendors in the data space each selling the similar and vague value propositions such as “360 customer view”. There are also thousands of “X vs. Y” articles and yet very few resources that offer a holistic framework for establishing an effective data infrastructure.

In this article, I attempt to deconstruct the analytical infrastructure into steps, following the “value chain” of data: from where it originates to the point where it becomes part of a narrative used to make a decision. For each step, I outline questions that are important to consider when deciding on the solution as well as recommend tools based on personal experience and research.

That framework can be handy if you are establishing the data stack at a young company, or bringing analytics to an established enterprise, or need to re-think parts of existing data infrastructure.

About the author

In the past five years, I’ve designed and built data platforms for three very different companies: Autodesk, Lyft & Phantom Auto, a mature public corporation, an exponentially growing marketplace and a seed-stage startup at the time respectively. In the process I evaluated hands-on dozens of vendors and OSS in the data space. I’ve also made numerous mistakes and observed many bad technical and economic decisions that helped me build a framework for thinking about data infrastructure.

Data value chain

Organizations need data to make better decisions, either human (analysis) or machine (algorithms and machine learning). Before data can be used for that, it needs to go through a sophisticated multi-step process, which can be seen as the following steps:

  1. Specification: defining what to track
  2. Instrumentation: registering events in your code
  3. Collection: ingesting & processing events
  4. Integration: adding data from various sources
  5. Data Warehousing: storing, processing, and serving data
  6. Transformation: preparing data for end users
  7. Quality Assurance: bad data = bad decisions
  8. Data discovery: finding the right data asset for the problem
  9. Analysis: creating the narrative

0. Observability: monitoring data assets, pipelines & infrastructure

Specification: defining what to track

To measure the business, we need to define the atoms of analytical data: events and their properties.

Example event message:

{
    "event": "event_checkout_completed",
    "properties": {
        "user_id": 123986,
        "user_device_type": "mobile",
        "user_os": "iOS",
        "occurred_at": "2020-05-02 09:06:01.843"
    }
}

Events are typically defined in a tracking plan that is shared between Engineering, Product and Analytics:

Recommended tools You can start with something as simple as a spreadsheet, for example, Segment offers a great free Tracking Plan template.

A great addition to the tracking plan is automatic validation of events against the definitions. If you use Segment for instrumentation (see below), you can use their Protocols feature for validating the events.

Alternatively, the following tools provide collaborative tracking plans with automatic event validation (and integrate with Segment):

Finally, the decision about what to track is by far more important than the chosen tools for the tracking plan. The following reads are helpful for defining metrics:

Instrumentation: registering events in code

Once you’ve defined what to track, the next step adding the code that registers events and sends them to some ingest endpoint, for example:

analytics.track('event_checkout_completed', { user_id: 123986,
user_device_type: “mobile”, user_os: “iOS”, });

Tip: It’s typical to send events to different tools (Product Analytics, Marketing Automation etc.). Instead of instrumenting your app with each tool’s SDK, it’s generally better to use a single SDK that supports all required platforms and then forward the collected data to all relevant tools. Maintaining multiple SDKs is costly, leads to errors and inconsistencies.

Recommended tool: Segment

Segment’s Analytics SDK is a “track once, send to many tools” solution that supports all major web & mobile platforms. Since the SDK itself is open source, you can also use it for sending to your own endpoint if you don’t want to use Segment for processing.

Alternatively, if you are using Avo or Iterative.ly for tracking plans, you can leverage their SDK for instrumentation.

Collection: ingesting & processing events

Once the events have been registered on the user devices and, optionally, your app backend, you would probably want to:

  • Be able to ingest the events into a data warehouse that stores all data.
  • Be able to relay certain events to third-party tools (e.g. for marketing automation) or to your own services.

Requirements

  1. [Reasonably] reliable: losing a significant portion of events may result in poor downstream analytics, but delivering events without losses is hard, so likely you need to define a tolerance.
  2. [Near] real-time: ability to process events within a few seconds
  3. Economical: price needs to be cheap to start & reasonably grow with the data volume
  4. Scalable: support at least 100x your current data volume
  5. Interoperable: easy to ingest data into a data warehouse of choice and to other tools
  6. 0-DevOps: focus your resources on your product

Recommended tool: Segment

Segment excels at all six requirements above. Furthermore, Segment’s biggest value proposition is the ability to seamlessly send data to ~100 destinations including data warehouses and marketing tools.

Segment’s major caveat is price:

10,000 visitors start at $120/mo, then add $10 for every 1K visitors, so at 100K visitors you’ll likely be paying $1K+ / mo. Note that the pricing is based on unique visitors, not users.

What to do once Segment becomes expensive?

Alternatives:

  1. GCP Pub/Sub
  2. AWS Managed Kafka Streaming (MSK)

Both options are lower-level solutions in that they provide a “pipe” but lack ready-to-go integrations. Both offer excellent scalability and are low-maintenance, and can be over 100x cheaper than Segment in terms of data volume costs. However, they will require lots of plumbing such as implementing custom connectors for producers and consumers of data which can be very costly.

Popular alternatives that we do not recommend:

  • Running OSS like Kafka yourself – while it’s powerful and open-source, Kafka is dev-ops heavy even when running through vendors, at which point it also becomes expensive.
  • AWS Kinesis – inferior to Pubsub & Kafka across most dimensions.

Integration: adding data from other sources

While analytical events typically serve as primary source of data, often you would need to augment them on a regular basis with data from other systems, such as:

  • Production/OLTP databases that power your app backend
  • CRM
  • Support systems
  • HR systems
  • External data sources

Integrating disparate sources is challenging due to differences in data formats and APIs. Today, the data integration market is mature and offers plenty of robust solutions, so it almost always doesn’t make sense to write your own integrations.

Requirements

  1. The solution should have all connectors you need (duh)
  2. Total # of connectors: to cover for what you may need in the future
  3. Bring-your-own connector: allow writing connectors within their framework if needed
  4. Pricing: should be transparent & scale well with your business

**Recommended tools:

  1. **Segment, if you are using it for collection, you can also leverage dozens of integrations

Stitch

  • Allows to create your own connectors
  • Transparent pricing

Alternative options

  1. Fivetran – similar to Stitch but less flexible on custom connectors & pricing is aggressive.
  2. Singer – open-source (supported by Stitch) but documentation is lacking

Further reading: Segment vs. Fivetran vs. Stitch comparison.

Data Warehousing: store, transform and serve data

The role of a data warehouse (DWH) is to accommodate all your analytical data and to provide easy and reasonably fast access to people and systems that need the data.

Requirements:

Scalability:

  • Storage
  • Compute
  • Ability to scale storage & compute independently
  • Price elasticity
  • Interoperability
  • Querying & transformation features
  • Speed
  • 0-DevOps

Recommended tools:

Optionally supplemented with:

  • Spark for sophisticated jobs or ML
  • Druid for the ultra-fast serving layer

Check out our article on Choosing a Data Warehouse for Analytics for deeper dive in this topic.

Transformation: prepare data for end-users

Now that you are able to collect events into your data warehouse and enrich them with data from other sources, the next step is transformation: cleaning, merging, aggregating raw data into tables that can be used by analysts.

This step is also often called: “Business Logic Layer”

Do we even need a framework?

In the early days of a company, one may think: why not run and visualize SQL queries in a dashboarding tool? Or can’t we just some SQL scripts on cron? That quickly becomes unmanageable for the team, and one finds themselves focusing on firefighting more than on adding new data into the system.

The role of a modern data transformation framework:

Standardizes code patterns

  • Lower barrier for new users to start contributing
  • People understand each other’s code better and make fewer mistakes.
  • Provides tools for testing and monitoring data quality
  • Facilitates clear code organization and documentation
  • Abstracts boilerplate (CREATE, INSERT etc.)
  • Lowers infrastructure (database) costs (thanks to pre-computed tables)
  • Reduces work duplication and helps avoid “multiple” sources of truth problem

SQL-centric frameworks

SQL is simple and popular, which enables a large number of people in the organization to contribute and comprehend the pipeline code. SQL’s drawbacks include poor testability.

Recommended tool: dbt

  • Comprehensive yet lightweight: captures most use cases and is easy to start with
  • Open source yet smooth: on top of all benefits of OSS, it’s also easy to run
  • Opinionated yet reasonable: relies on common conventions and best practices
  • Great documentation & and an active devoted community

Source: dbt

You can run dbt in dbtCloud or host yourself.

Generalized frameworks

Those typically leverage general-purpose programming languages such as Python, Java, and Scala for defining jobs. They offer greater flexibility and robustness than SQL but have a much steeper learning curve.

Apache Beam – is a new generation framework for batch and stream processing. One of its distinctive features is decoupled code from compute: a job defined in Beam can be executed on a variety of engines including Spark, Flink, and Google Dataflow. Beam has Java, Python, Go APIs.

For a typical case, my recommendation is to use dbt and consider Apache Beam for streaming or advanced jobs.

No-code frameworks

They enable defining transformations in the UI using Excel-like functions and pivoting.

Examples:

My take: those can be helpful in large organizations to empower non-technical users with the ability to process data but at the moment no-code is too limiting to support the majority of use cases on par with SQL / Generalized frameworks.

Testing: bad data = bad decisions

With so many different sources and systems that the data passes through, quality issues are almost inevitable, but just like in software, they can be effectively managed with appropriate tools and processes.

When should you start investing in data quality and tests? The answer is, usually – yesterday. It’s best to write tests before publishing the data. Fortunately, today, there are great tools and frameworks to help.

A simple and effective approach to data testing are assertions (rule-based checks) that validate your assumptions about the input or output data in the ETL process, such as:

  • The primary key is unique
  • A column is not NULL
  • A column value is within a given value range

When to use assertions:

  • Development: after making changes to the code, before merging to Production.
  • Production: after every pipeline execution on new data (e.g. every day)

Recommended tool: the best way to implement assertions is to use a transformation framework that includes integrated testing framework, such as dbt: when you create a new table with dbt, it provides basic tests out of the box which you can supplement with custom test cases.

Alternative:

  • great_expectations – rule-based checks for SQL
  • deequ – rule-based checks for Spark

Data Observability: data quality & performance monitoring

Teams working with modern analytics stacks have to deal with a lot of complexity: (1) the stack is made of sophisticated distributed systems (2) the more powerful and scalable the stack is, the more data assets will be created, processed and stored in it. (3) The higher the capabilities of the data stack are, the more reliant the business becomes on it, imposing ever stricter SLAs (Service Level Agreement) on quality, timeliness and performance of data products.

Managing the complex data infrastructure and processes requires a special category of tooling – data observability platform.

Data observability platforms provide visibility into the state, quality and performance of data assets (tables, reports, dashboards etc.) as well as processes that create and transform those assets (ETL/ELT/BI pipelines and queries).

Since data preparation flows typically span multiple tools (as demonstrated on the diagram above), an effective data observability platform should trace data along its journey and provide a global detailed lineage graph.

In software development, observability is delivered by Application Performance Monitoring platforms such as Datadog, Prometheus & New Relic that combine multiple tools including metric & logs collection and analysis, distributed tracing and anomaly detection. In a similar manner, data observability platforms provide tools specific to data-centered workflows:

Data Profiling

Data profiler analyzes datasets and generates descriptive statistics such as distributions, completeness (% of nulls), uniqueness etc. As it is essential to understand the underlying data before using it, a good data profiler can save a lot of time for both data developers and consumers.

Automatic Metric Monitoring

Today organizations track hundreds/thousands of metrics of different kinds: business KPIs, application & infrastructure performance, data quality indicators. Plotting these metrics on dashboards is helpful but does not ensure that the right person gets notified of an important deviation from the norm in a timely manner.

Automatic metric monitoring allows teams to delegate tracking of important metrics to algorithms by defining those metrics in SQL or other language, adjusting the sensitivity of anomaly detection to the desired level and receiving alerts about anomalies through a variety of channels such as Slack, Pagerduty, OpsGenie etc.

Data Diff

Data diff is a tool for comparing datasets within a database (or across databases) to determine the difference between them. It adds elements of version control to large-scale data warehouses: like a git diff command but for actual data.

While assertions (rule-based checks) discussed above help with validating assumptions about the data, data diff tells exactly how the datasets are different.

When to use data diff

  • Regression testing: before deploying change to ETL code, make sure that the data changes in an expected way.
  • Code reviews: it’s hard to tell from looking only at the source code how the actual data is going to change. Data Diff solves that problem.
  • Migrations: when porting pipelines from one database to another
  • Data integration: when copying data from one database to another.

Diff overview:

Diff values comparison:

Data Lineage

It is not uncommon for companies to have thousands of tables in their data warehouse with hundreds of reports and dashboards drawing data from them. Without an easy way to visualize and trace dependencies in the data ecosystem, teams are bound to making costly mistakes and to wasting time manually reading source code to answer basic questions.

Data lineage is a global graph representation of the data flows that can be provided at a table, column or task levels of detail and that helps answer questions such as:

  1. What BI reports & models use a given column/dataset?
  2. What columns are not used vs. used the most?
  3. What will be the impact of changing/deleting a given column?

Having easy access to such information allows teams to dramatically speed up incident response, prevent breaking changes and optimize the infrastructure.

Data catalog: find the right data asset for the problem

Data grows quickly not only in volume but also in variety. A typical post-Series-C startup often has 100+ raw data tables (events and other sources), 500+ derived tables, and 1000+ data reports/dashboards/queries. At that point, it becomes hard even for professional data analysts to keep track of data. Work and code duplication, quality issues start eating up productivity.

Mature transformation frameworks such as dbt support automatically generated documentation which should suffice until a certain point:

Once your data assets grow beyond that, a great option for data discovery is Amundsen, as it supports most major databases and BI tools. It does require certain effort to set up.

Analysis

Now that you’ve collected, integrated, and transformed data, it’s time to pick the right tool for users to consume it.

We’ll consider tools that help people make decisions, also called Business Intelligence.

To pick the right tool, it’s important to first review different BI workflows and user personas and then identify tools that serve each the best.

Three analytical workflows

Dashboards

Dashboards are collections of widgets displaying quantitative data. The purpose of dashboards is to show a snapshot and trend of the key performance indicators and align the team around these metrics. It is common to structure regular review meetings around dashboards to assess business performance and discuss deviations from the plan.

Self-serve analytics

“Self-serve” means enabling all users across the organization to answer deeper questions then what dashboards tell:

  1. Without writing SQL
  2. Without needing to know the detailed structure of underlying data

Example: Looker Explore view. One can analyze data and create charts by selecting fields and applying pivots and filters.

The self-serve concept is incredibly important, as it empowers all functions in the organization to make data-driven decisions without relying on Data Science / Analytics which are always a bottleneck.

Ad-hoc analysis

Answering a long tail of custom questions that cannot be answered using dashboards or self-serve analysis:

  • Investigating anomalies
  • Feature-specific analysis
  • Building new models and analytical frameworks
  • One-off investigations

Many analytical narratives are born from ad-hoc analysis and then become embodied in dashboards or self-serve explorations.

We took what we think are best-in-class tools and plotted their coverage of each of the three workflows:

As you can see from the chart, Mode + Looker provides the combination that maximizes value across all three use cases.

Mode

  • Fastest SQL to a [reasonably] pretty chart, optimized for analyst workflow.
  • SQL centric and most flexible
  • Allows for custom visualizations with JS frameworks
  • Supports Python + R notebooks
  • Poorly scales on both human and infrastructure
  • Facilitates explosion of data content and duplication of work at scale

Looker

  • Best BI tool for democratizing data in organization of any size
  • Well-designed architecture that enables data teams to create great dashboarding and self-serve experiences for business users
  • Allows managing data access to the most granular level
  • Allows creating external-facing and white-labeled analytical apps (e.g. for your customers)
  • Facilitates great engineering practices such as single source of truth, code reuse, version control & documentation.
  • Requires upfront & continuous investment in data model layer (LookML)