Understanding your Data: The People Behind it
In the context of people, process, and technology, people are the most complicated part. Anyone can learn to code. Not just anyone can navigate the very real complexities of people's sentiments and emotions. In this first section of my three-part series, we will highlight just some of the many personas that you will interact with throughout the data development process. This is certainly not meant to be comprehensive. In the same way that knowing about different kinds of documentation can help documentation meet people where they are, knowing the different personas who interact with our data can help us meet them wherever they are in their data journeys.
The most important consideration when discussing the people side of data is empathy. While it may be frustrating when folks break data upstream or use it downstream in ways it was not intended, it is (almost) never done with malicious intent. In other words, people are just looking to do good work, finish whatever task is at hand, and move on to their next project. Having empathy for the position that our colleagues are in allows us to start working from a place of collaboration rather than an adversarial one. We are all fallible and we should build systems that allow for that.
The Data Creator
Data is the numerical representation of your organization’s processes. Data is only valuable when it improves your decisions. Data that is collected but not utilized is useless. Data that is created but then broken or created in such a way that doesn’t accurately represent your organization’s data processes is less valuable than it could be. It’s important to think of the data in context at all stages of its lifecycle, from creation to consumption. And, it’s important to think of all the people who interact with that data at all stages of its lifecycle.
Software engineers create data
There are primarily two circumstances in which your software engineers create data: logging and application data. The ways in which they create data are focused on addressing a specific need, and analytical use cases are rarely a primary motivator. For the remainder of this guide, we will pretend that your company is running an application that helps companies manage their customer relationships.
Software engineers who add new tables to the database or update the column names are focused on building the best application they can. When they update the status column from active to canceled, they are not trying to mess up historical counts of active customers. They are trying to store data in the most efficient way possible, reduce the chance that a miswritten query produces inaccurate data in the UI, or remove anything else that could inhibit the very best customer experience.
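To make the status example concrete, here is a minimal sketch in Python. The customers, dates, and function names are all invented for illustration; this is not how any particular application stores its data. It compares what survives when a status is overwritten in place with what an append-only history would let us answer.

```python
from datetime import date

# Hypothetical append-only status history for the example CRM's customers.
# Each row: (customer_id, status, effective_date). All values are invented.
STATUS_HISTORY = [
    (1, "active", date(2023, 1, 5)),
    (2, "active", date(2023, 1, 20)),
    (2, "canceled", date(2023, 3, 10)),
    (3, "active", date(2023, 2, 1)),
]


def current_status_table(history):
    """What the application keeps when it overwrites status in place."""
    latest = {}
    for customer_id, status, _ in sorted(history, key=lambda row: row[2]):
        latest[customer_id] = status
    return latest


def active_customers_as_of(history, as_of):
    """Count customers whose most recent status on or before `as_of` is active."""
    latest = {}
    for customer_id, status, effective in sorted(history, key=lambda row: row[2]):
        if effective <= as_of:
            latest[customer_id] = status
    return sum(1 for status in latest.values() if status == "active")


overwritten = current_status_table(STATUS_HISTORY)
active_now = sum(1 for s in overwritten.values() if s == "active")
active_in_february = active_customers_as_of(STATUS_HISTORY, date(2023, 2, 28))
```

With only the overwritten table, the data team sees two active customers and has no way to recover that there were three at the end of February. The engineer's update was perfectly reasonable for the application; the historical count was simply never one of its requirements.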
Moreover, while logging is often used to help understand a user’s experience with the application, its primary purpose is to enable developers to understand problems in the application directly. Although it can be used for analytics, and often is (especially at the early stages of an organization’s analytics journey), it is rarely captured with analytics as the primary use case.
In both of these circumstances, software engineers are creating data to solve their problems. They are the producers and the consumers. While data teams benefit from the data, the benefit to the data team is a byproduct of the logging and application problems they are focused on solving.
Put directly, while data teams benefit from the data created by software engineers, it wasn’t built for us.
At a certain point in a company’s analytical journey, companies will work to implement event-based telemetry. In these circumstances, the data team or product managers ask software engineers to instrument those events. Keeping this data flowing accurately and up to date is a pain point for many organizations, and it has driven the creation of many tools, including Avo and Iteratively. Event tracking software exists to help align the producers and consumers of data. In this case, the data is produced by the application as implemented by software engineers and consumed for analytics purposes by data and product teams. But this technical solution aims to address what is, in many ways, a disconnect between producers and consumers.
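One way to picture that alignment is a shared tracking plan that both sides agree on. The sketch below is a hypothetical, hand-rolled check in Python. It does not use or imitate the API of Avo, Iteratively, or any other tool, and the event and property names are assumptions made up for our example CRM.

```python
# Hypothetical tracking plan: event name -> required property names and types.
TRACKING_PLAN = {
    "login_failed": {"user_id": str, "reason": str},
    "deal_stage_changed": {"deal_id": str, "from_stage": str, "to_stage": str},
}


def validate_event(name, properties):
    """Return a list of problems; an empty list means the event matches the plan."""
    if name not in TRACKING_PLAN:
        return [f"unknown event: {name}"]
    problems = []
    for prop, expected_type in TRACKING_PLAN[name].items():
        if prop not in properties:
            problems.append(f"missing property: {prop}")
        elif not isinstance(properties[prop], expected_type):
            problems.append(f"wrong type for {prop}")
    return problems


# An event instrumented as agreed, and one where a property was renamed.
ok = validate_event("login_failed", {"user_id": "u1", "reason": "bad_password"})
broken = validate_event("login_failed", {"userId": "u1", "reason": "bad_password"})
```

A rename from `user_id` to `userId` is harmless to the application and invisible to the engineer who made it, yet it silently breaks every downstream query. A check like this only surfaces the disconnect; resolving it still takes a conversation between the people on each side.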
Ops teams create data
Another key group that generates core data sets your company will use for analytical purposes is the myriad of Ops teams in your organization. The specific details will depend on your company’s org chart: Marketing Ops, Sales Ops, Rev Ops, Supply Chain Ops, Cloud Ops, Fin Ops, Engineering Ops, or any other variation that exists in your company. These teams are often the technical implementers of processes. They build, configure, and enable the technical systems that allow your company to function. Again, data is the numerical representation of your company’s processes. That is why Ops teams are key producers of data sets.
Like software engineers, these teams produce datasets to address their own needs. They need to be able to ensure that sales people get paid accurate commissions, contracts are renewed on time, and cloud costs are kept under control. Their responsibilities are different from yours.
Again, while analytics is often seen as a byproduct of what can be done with this data, it is not the primary use case for its production. A salesperson might not log every stage every customer is in because they see it as extra work. That same omission makes it difficult for the data team to accurately estimate the average days in stage for sales deals.
In the same way that data practitioners must develop empathy for software engineers who let us use their data for purposes different from its intended use cases, we must develop empathy for Ops teams who let us use their data for purposes different from its intended use cases. We can and should collaborate with these teams on the most effective way to make this data as multi-purpose as possible, but we cannot expect Ops teams to focus on solving data problems or to put analytical use cases first.
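A minimal sketch of the days-in-stage problem, in Python with an invented stage log (the deal IDs, stage names, and dates are assumptions for illustration only):

```python
from datetime import date

# Hypothetical stage-change log for two deals: (deal_id, stage, entered_on).
# Deal "B" never had its "demo" stage logged.
STAGE_LOG = [
    ("A", "lead", date(2023, 1, 1)),
    ("A", "demo", date(2023, 1, 11)),
    ("A", "closed", date(2023, 1, 21)),
    ("B", "lead", date(2023, 1, 1)),
    ("B", "closed", date(2023, 1, 31)),
]


def average_days_in_stage(log):
    """Average days between entering a stage and entering the next logged stage."""
    by_deal = {}
    for deal_id, stage, entered_on in log:
        by_deal.setdefault(deal_id, []).append((entered_on, stage))
    durations = {}
    for entries in by_deal.values():
        entries.sort()  # order each deal's stages by the date they were entered
        for (start, stage), (end, _) in zip(entries, entries[1:]):
            durations.setdefault(stage, []).append((end - start).days)
    return {stage: sum(days) / len(days) for stage, days in durations.items()}


averages = average_days_in_stage(STAGE_LOG)
```

Because deal B's demo stage was never logged, all 30 of its days are attributed to the lead stage. The lead average comes out at 20 days instead of deal A's 10, and the demo average is based on a single deal. Nothing in the calculation is wrong; the input just reflects how the data was produced.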
Customers, team members, and most activities create data
From interactions with our hypothetical CRM to failed login attempts caught by our instrumented telemetry, customers are also our data producers. Their focus is on getting their problem solved with the application in as timely a manner as possible.
Similarly, team members and other folks who interact with your company will also be creators of data through their interactions with various systems.
What is key to understand is that the data is a byproduct of all of these activities, not the main attraction. Rarely, at the production stage, are folks focused on analytics. My theory is that this is because data is produced individually but analytics is conducted over an aggregate. Our goal as data practitioners is to understand the methods that produce any and all data sets, the nuances of those data sets, and how we can massage the data to help us address our analytical questions.
The Data Consumer
As we’ve established, there can be many use cases for data: each producer creates data for a purpose, and that purpose is rarely the analytical problem a data team is focused on solving. Data consumers are similarly varied, and each looks to data to solve different problems. In what follows, we’ll be looking at how different personas consume data to solve each of the five jobs to be done by modern data teams:
- data activation – making operational data available to the teams that need it
- metrics management – the business needs shared definitions and a baseline of key metrics
- proactive insight discovery – team members outside of the data team are limited in the questions they can ask by their limited knowledge of what data exists and what questions can be asked
- driving experimentation – driving measurable impact to the business through A/B experimentation moving key business metrics in the right direction
- interfacing with the data – empowering team members across the business with the information and conclusions they need to be unblocked
Executives use data for analytical purposes
Executives usually care about data in aggregate. Data used at the executive level is presented in metric form. Revenue, active users, or other business metrics are an aggregation of many individual interactions or transactions. Metrics allow the company to have a shared understanding of reality (aligning on what is on track and what is not) and to prioritize the limited resources of time and attention appropriately.
Of course, metrics are just the beginning of any conversation using data. It’s also important to be able to slice and dice any analysis and do a root cause analysis to understand the levers that drive changes to metrics.
Executives, when they’re looking at a metric like Monthly Recurring Revenue, don’t necessarily care about whether that information is from Stripe, PayPal, or someone standing over everyone's shoulder counting on their fingers. As a consumer, they want to be able to trust the aggregated metrics without worrying about the underlying details, including where, how, or why the data was produced or the many steps of the data lifecycle that got it to its aggregated form.
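As a rough illustration of a metric as an aggregation, here is a sketch in Python. The subscription records and field names are invented, and real MRR definitions carry many more rules (discounts, annual plans, currencies) that are deliberately left out.

```python
# Hypothetical active subscriptions pulled from two billing sources.
# Amounts are monthly prices in cents; all values are invented.
SUBSCRIPTIONS = [
    {"source": "stripe", "customer_id": "c1", "monthly_cents": 5000},
    {"source": "stripe", "customer_id": "c2", "monthly_cents": 12000},
    {"source": "paypal", "customer_id": "c3", "monthly_cents": 5000},
]


def monthly_recurring_revenue(subscriptions):
    """Sum the monthly price of every active subscription, regardless of source."""
    return sum(sub["monthly_cents"] for sub in subscriptions)


def mrr_by_source(subscriptions):
    """The same metric sliced by source, for when someone asks what moved it."""
    totals = {}
    for sub in subscriptions:
        totals[sub["source"]] = totals.get(sub["source"], 0) + sub["monthly_cents"]
    return totals


mrr = monthly_recurring_revenue(SUBSCRIPTIONS)
by_source = mrr_by_source(SUBSCRIPTIONS)
```

The executive sees the single number. The slice by source is there for the follow-up question about why that number changed.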
Ops teams, Product teams, and the rest of the company
In the same way that data teams are the consumers of information created by others, Ops teams can often ask other parts of the organization for information. Data activation is the perfect way for any part of the company, especially Ops teams, to be presented with data that they don’t have. It is much more effective for RevOps or SalesOps teams to get the data they need in their systems of record.
Many of these teams also drive experiments, where data team members play key roles in helping analyze results. Communicating experiment results can be complex, so this sort of work is often best done in collaboration with data teams.
Whether it's infrastructure ops teams tagging cloud resources correctly so that finance teams can understand cloud costs, marketing ops teams applying UTM parameters to all campaigns to help drive efficient work, or engineering ops addressing a bottleneck, someone needs to understand the role the data played when it was created and how best to apply it to the problem at hand. That responsibility falls to data practitioners, who can act as brokers between data sets, data use cases, and a myriad of functions within the organization.
The Data Practitioner (Probably You)
If working in, around, and with data is the bulk of your job, you are likely a data practitioner, independent of your job title. You might be deemed an “engineer” or an “analyst”; your job title might include “data”, “business intelligence”, or “analytics”. Or, it might not. But, if you’ve landed here, you’re probably feeling some acute pain in your data lifecycle and transformation process, and you’re looking to understand how to improve it.
In many ways, your responsibilities in this data lifecycle include shepherding the data back and forth between different consumers and driving impact and improvements to the business with that data. All the data in the world doesn’t matter if you don’t do something with it.
Put differently, your mandate is not small. You must:
- understand the purpose for which the data was created (probably not analytics),
- ensure that the data means what you think it means by collaborating with its producers,
- ask for additional data to be created where appropriate (e.g. event tracking),
- consolidate and transform data to reflect the closest approximation of reality possible,
- make this post-transformation data available in a reliable fashion in a myriad of tools and analyses to a variety of stakeholders with a wide set of requirements,
- (Other things, lots of other things)
And do this all reliably, with the lowest latency possible, every day for the rest of forever. While we talk about data plumbing in a micro context within the Modern Data Stack, in some ways the data practitioner is the organization’s plumbing in the macro context: unappreciated when everything works, despite a clear utility-style dependency.
Cross-functional data is the result of the data team creating something that never existed before. Salesforce is usually accessible to the SalesOps team, and Marketo is usually accessible to the MarketingOps team, but the combination of data sets that helps understand how user retention is a predictor of churn is a net-new creation that is the result of combining functional data sets into something more cross-functional. This is only possible if the data practitioner, the chef creating something new, has the requisite knowledge of the data.
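A small sketch of what "combining functional data sets" can look like, in Python. The records and field names are invented and do not reflect the real Salesforce or Marketo schemas. The join key, `account_id`, is itself an assumption.

```python
# Hypothetical extracts; field names are invented and do not reflect the real
# Salesforce or Marketo schemas.
SALES_ACCOUNTS = [
    {"account_id": "a1", "plan": "pro", "churned": False},
    {"account_id": "a2", "plan": "basic", "churned": True},
]
MARKETING_TOUCHES = [
    {"account_id": "a1", "campaign": "webinar"},
    {"account_id": "a1", "campaign": "newsletter"},
    {"account_id": "a3", "campaign": "webinar"},
]


def combine(accounts, touches):
    """Left-join marketing touch counts onto sales accounts by account_id."""
    touch_counts = {}
    for touch in touches:
        key = touch["account_id"]
        touch_counts[key] = touch_counts.get(key, 0) + 1
    return [
        {**account, "marketing_touches": touch_counts.get(account["account_id"], 0)}
        for account in accounts
    ]


combined = combine(SALES_ACCOUNTS, MARKETING_TOUCHES)
```

The code is the easy part. Knowing that `account_id` means the same thing in both systems, and deciding what to do about `a3`, which has marketing touches but no sales account, is the knowledge of the data that the combination depends on.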
This knowledge will never come from engineering alone. It is the result of getting to know the producers, as outlined above, and the contexts in which that data is produced. It is just as important to get that information as it is to build relationships and partnerships with the folks behind the data and the data production processes.
All of this is not one person’s responsibility (unless you’re a data team of one, which is the exception, not the rule). The specialization of team members within data organizations allows us to balance the breadth and depth required to do this role well.
Data team managers may be the only people with the title or direct reports, but they are not the only leaders on teams. Every time we engage a person in another part of the organization to understand their problems before jumping to a solution, we are demonstrating leadership within our organizations that can help build our data team’s effectiveness throughout the company.
Data team managers are enablers who must help make all of these things possible for their team through introductions, unblocking, example, and grit. A manager must be able to bounce between the tactical details of implementation, the operational requirements of project management, and the strategic planning required to move the big picture projects forward. Data team managers must balance the variety of forms of expertise demanded of them in order to be effective in their roles. Managers are practitioners in that they are “data people”, but to drive true impact as managers, they need to prioritize enabling others over practicing data themselves, both within their teams and within the wider company.
In the next blog post, we’ll focus on understanding the process behind our data!