I often come across questions such as “What is the best tool for Product Analytics / data integration / data warehousing?” There are a myriad of vendors in the data space each selling the similar and vague value propositions such as “360 customer view”. There are also thousands of “X vs. Y” articles and yet very few resources that offer a holistic framework for establishing an effective data infrastructure.
In this article, I attempt to deconstruct the analytical infrastructure into steps, following the “value chain” of data: from where it originates to the point where it becomes part of a narrative used to make a decision. For each step, I outline questions that are important to consider when deciding on the solution as well as recommend tools based on personal experience and research.
That framework can be handy if you are establishing the data stack at a young company, or bringing analytics to an established enterprise, or need to re-think parts of existing data infrastructure.
In the past five years, I’ve designed and built data platforms for three very different companies: Autodesk, Lyft & Phantom Auto, a mature public corporation, an exponentially growing marketplace and a seed-stage startup at the time respectively. In the process I evaluated hands-on dozens of vendors and OSS in the data space. I’ve also made numerous mistakes and observed many bad technical and economic decisions that helped me build a framework for thinking about data infrastructure.
Organizations need data to make better decisions, either human (analysis) or machine (algorithms and machine learning). Before data can be used for that, it needs to go through a sophisticated multi-step process, which can be seen as the following steps:
0. Observability: monitoring data assets, pipelines & infrastructure
To measure the business, we need to define the atoms of analytical data: events and their properties.
Example event message:
Events are typically defined in a tracking plan that is shared between Engineering, Product and Analytics:
Recommended tools You can start with something as simple as a spreadsheet, for example, Segment offers a great free Tracking Plan template.
A great addition to the tracking plan is automatic validation of events against the definitions. If you use Segment for instrumentation (see below), you can use their Protocols feature for validating the events.
Alternatively, the following tools provide collaborative tracking plans with automatic event validation (and integrate with Segment):
Finally, the decision about what to track is by far more important than the chosen tools for the tracking plan. The following reads are helpful for defining metrics:
Once you’ve defined what to track, the next step adding the code that registers events and sends them to some ingest endpoint, for example:
Tip: It’s typical to send events to different tools (Product Analytics, Marketing Automation etc.). Instead of instrumenting your app with each tool’s SDK, it’s generally better to use a single SDK that supports all required platforms and then forward the collected data to all relevant tools. Maintaining multiple SDKs is costly, leads to errors and inconsistencies.
Recommended tool: Segment
Segment’s Analytics SDK is a “track once, send to many tools” solution that supports all major web & mobile platforms. Since the SDK itself is open source, you can also use it for sending to your own endpoint if you don’t want to use Segment for processing.
Alternatively, if you are using Avo or Iterative.ly for tracking plans, you can leverage their SDK for instrumentation.
Once the events have been registered on the user devices and, optionally, your app backend, you would probably want to:
Recommended tool: Segment
Segment excels at all six requirements above. Furthermore, Segment’s biggest value proposition is the ability to seamlessly send data to ~100 destinations including data warehouses and marketing tools.
Segment’s major caveat is price:
10,000 visitors start at $120/mo, then add $10 for every 1K visitors, so at 100K visitors you’ll likely be paying $1K+ / mo. Note that the pricing is based on unique visitors, not users.
What to do once Segment becomes expensive?
Both options are lower-level solutions in that they provide a “pipe” but lack ready-to-go integrations. Both offer excellent scalability and are low-maintenance, and can be over 100x cheaper than Segment in terms of data volume costs. However, they will require lots of plumbing such as implementing custom connectors for producers and consumers of data which can be very costly.
Popular alternatives that we do not recommend:
While analytical events typically serve as primary source of data, often you would need to augment them on a regular basis with data from other systems, such as:
Integrating disparate sources is challenging due to differences in data formats and APIs. Today, the data integration market is mature and offers plenty of robust solutions, so it almost always doesn’t make sense to write your own integrations.
Further reading: Segment vs. Fivetran vs. Stitch comparison.
The role of a data warehouse (DWH) is to accommodate all your analytical data and to provide easy and reasonably fast access to people and systems that need the data.
Optionally supplemented with:
Check out our article on Choosing a Data Warehouse for Analytics for deeper dive in this topic.
Now that you are able to collect events into your data warehouse and enrich them with data from other sources, the next step is transformation: cleaning, merging, aggregating raw data into tables that can be used by analysts.
This step is also often called: “Business Logic Layer”
Do we even need a framework?
In the early days of a company, one may think: why not run and visualize SQL queries in a dashboarding tool? Or can’t we just some SQL scripts on cron? That quickly becomes unmanageable for the team, and one finds themselves focusing on firefighting more than on adding new data into the system.
The role of a modern data transformation framework:
Standardizes code patterns
SQL is simple and popular, which enables a large number of people in the organization to contribute and comprehend the pipeline code. SQL’s drawbacks include poor testability.
Recommended tool: dbt
You can run dbt in dbtCloud or host yourself.
Those typically leverage general-purpose programming languages such as Python, Java, and Scala for defining jobs. They offer greater flexibility and robustness than SQL but have a much steeper learning curve.
Apache Beam – is a new generation framework for batch and stream processing. One of its distinctive features is decoupled code from compute: a job defined in Beam can be executed on a variety of engines including Spark, Flink, and Google Dataflow. Beam has Java, Python, Go APIs.
For a typical case, my recommendation is to use dbt and consider Apache Beam for streaming or advanced jobs.
They enable defining transformations in the UI using Excel-like functions and pivoting.
My take: those can be helpful in large organizations to empower non-technical users with the ability to process data but at the moment no-code is too limiting to support the majority of use cases on par with SQL / Generalized frameworks.
With so many different sources and systems that the data passes through, quality issues are almost inevitable, but just like in software, they can be effectively managed with appropriate tools and processes.
When should you start investing in data quality and tests? The answer is, usually – yesterday. It’s best to write tests before publishing the data. Fortunately, today, there are great tools and frameworks to help.
A simple and effective approach to data testing are assertions (rule-based checks) that validate your assumptions about the input or output data in the ETL process, such as:
When to use assertions:
Recommended tool: the best way to implement assertions is to use a transformation framework that includes integrated testing framework, such as dbt: when you create a new table with dbt, it provides basic tests out of the box which you can supplement with custom test cases.
Teams working with modern analytics stacks have to deal with a lot of complexity: (1) the stack is made of sophisticated distributed systems (2) the more powerful and scalable the stack is, the more data assets will be created, processed and stored in it. (3) The higher the capabilities of the data stack are, the more reliant the business becomes on it, imposing ever stricter SLAs (Service Level Agreement) on quality, timeliness and performance of data products.
Managing the complex data infrastructure and processes requires a special category of tooling – data observability platform.
Data observability platforms provide visibility into the state, quality and performance of data assets (tables, reports, dashboards etc.) as well as processes that create and transform those assets (ETL/ELT/BI pipelines and queries).
Since data preparation flows typically span multiple tools (as demonstrated on the diagram above), an effective data observability platform should trace data along its journey and provide a global detailed lineage graph.
In software development, observability is delivered by Application Performance Monitoring platforms such as Datadog, Prometheus & New Relic that combine multiple tools including metric & logs collection and analysis, distributed tracing and anomaly detection. In a similar manner, data observability platforms provide tools specific to data-centered workflows:
Data profiler analyzes datasets and generates descriptive statistics such as distributions, completeness (% of nulls), uniqueness etc. As it is essential to understand the underlying data before using it, a good data profiler can save a lot of time for both data developers and consumers.
Today organizations track hundreds/thousands of metrics of different kinds: business KPIs, application & infrastructure performance, data quality indicators. Plotting these metrics on dashboards is helpful but does not ensure that the right person gets notified of an important deviation from the norm in a timely manner.
Automatic metric monitoring allows teams to delegate tracking of important metrics to algorithms by defining those metrics in SQL or other language, adjusting the sensitivity of anomaly detection to the desired level and receiving alerts about anomalies through a variety of channels such as Slack, Pagerduty, OpsGenie etc.
Data diff is a tool for comparing datasets within a database (or across databases) to determine the difference between them. It adds elements of version control to large-scale data warehouses: like a git diff command but for actual data.
While assertions (rule-based checks) discussed above help with validating assumptions about the data, data diff tells exactly how the datasets are different.
When to use data diff
Diff values comparison:
It is not uncommon for companies to have thousands of tables in their data warehouse with hundreds of reports and dashboards drawing data from them. Without an easy way to visualize and trace dependencies in the data ecosystem, teams are bound to making costly mistakes and to wasting time manually reading source code to answer basic questions.
Data lineage is a global graph representation of the data flows that can be provided at a table, column or task levels of detail and that helps answer questions such as:
Having easy access to such information allows teams to dramatically speed up incident response, prevent breaking changes and optimize the infrastructure.
Data grows quickly not only in volume but also in variety. A typical post-Series-C startup often has 100+ raw data tables (events and other sources), 500+ derived tables, and 1000+ data reports/dashboards/queries. At that point, it becomes hard even for professional data analysts to keep track of data. Work and code duplication, quality issues start eating up productivity.
Mature transformation frameworks such as dbt support automatically generated documentation which should suffice until a certain point:
Once your data assets grow beyond that, a great option for data discovery is Amundsen, as it supports most major databases and BI tools. It does require certain effort to set up.
Now that you’ve collected, integrated, and transformed data, it’s time to pick the right tool for users to consume it.
We’ll consider tools that help people make decisions, also called Business Intelligence.
To pick the right tool, it’s important to first review different BI workflows and user personas and then identify tools that serve each the best.
Three analytical workflows
Dashboards are collections of widgets displaying quantitative data. The purpose of dashboards is to show a snapshot and trend of the key performance indicators and align the team around these metrics. It is common to structure regular review meetings around dashboards to assess business performance and discuss deviations from the plan.
“Self-serve” means enabling all users across the organization to answer deeper questions then what dashboards tell:
Example: Looker Explore view. One can analyze data and create charts by selecting fields and applying pivots and filters.
The self-serve concept is incredibly important, as it empowers all functions in the organization to make data-driven decisions without relying on Data Science / Analytics which are always a bottleneck.
Answering a long tail of custom questions that cannot be answered using dashboards or self-serve analysis:
Many analytical narratives are born from ad-hoc analysis and then become embodied in dashboards or self-serve explorations.
We took what we think are best-in-class tools and plotted their coverage of each of the three workflows:
As you can see from the chart, Mode + Looker provides the combination that maximizes value across all three use cases.