Understanding your Data: The Process that Makes it Work
This is part II of our series on Understanding your Data. If you haven't already, check out part I, Understanding your Data: The People Behind it.
Understanding the human component of what we’re trying to do brings us closer to the producers, consumers, and practitioners of data. Understanding the process component of what we’re trying to do helps us navigate what problems need to be solved. The stronger our understanding of what we’re trying to do for the business, the more empowered we are to drive impact as directly as possible.
Core pieces of a great analytics engineering process:
- Communicate in the Pull Request
- Automation as a principle
- Enforce reproducibility and repeatability
- Simplicity is a feature, not a bug
- Proactive > Reactive
- Prioritize Enablement
- Focus on Impact
Communicate in the Pull Request
If we recognize that data is only as accurate as the process that created it, we implicitly recognize that communication needs to be associated with the creation process. Pinging the #marketing Slack channel that the leads table is about to change is not enough context and it might not reach the right people who can understand the impact of that change.
Communicate as close to the work as possible: put the details in pull requests. This lets people review changes, and the impact they will have, in one place.
For example, a strong way to introduce a change might be, “The column change will change how we categorize new users into regions of the United States. We will follow the new Sales norms of having 6 regions instead of 4. We will not re-categorize existing leads. This will only affect leads created on or after July 1.” While explaining changes is important, it is better to focus on communicating the impact of the technical changes.
Tools that automate this kind of impact report, such as Data Diff, make communication about impact in the PR even more robust.
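To make the idea of an impact report concrete, here is a minimal sketch of comparing two versions of a table by primary key. It is not how Data Diff or any other tool is implemented. The function names, the `lead_id` key, and the sample rows are all assumptions made up for illustration.

```python
from typing import Any

Row = dict[str, Any]


def diff_tables(before: list[Row], after: list[Row], key: str) -> dict[str, list]:
    """Compare two versions of a table, matching rows on the `key` column."""
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    # Keys that exist only in the new version, only in the old one, or in both.
    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())
    changed = sorted(
        k
        for k in before_by_key.keys() & after_by_key.keys()
        if before_by_key[k] != after_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


def summarize(diff: dict[str, list]) -> str:
    """Render a one-line summary that could be pasted into a PR description."""
    return ", ".join(f"{len(keys)} {kind}" for kind, keys in diff.items())


# Hypothetical rows: the production table and the table built from the PR branch.
prod_leads = [
    {"lead_id": 1, "region": "West"},
    {"lead_id": 2, "region": "South"},
]
dev_leads = [
    {"lead_id": 1, "region": "West"},
    {"lead_id": 2, "region": "Southeast"},
    {"lead_id": 3, "region": "Mountain"},
]

lead_diff = diff_tables(prod_leads, dev_leads, key="lead_id")
print(summarize(lead_diff))  # 1 added, 0 removed, 1 changed
```

A summary like this, posted next to the explanation of the change, gives reviewers the impact and the intent in the same place.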
Data teams can and should be oriented toward building and optimizing their workflows, in addition to their products and outputs. Rarely is a one-off ask truly one-off, so we should invest in tooling to make these tasks less tedious and error-prone.
The largest improvement to workflows in the data space has been dbt. dbt made building and maintaining pipelines significantly easier. It has brought version control and code-based workflows into the analytics engineering domain and has increasingly become the norm for data transformation. dbt has unlocked many other software engineering best practices, including distinct staging and production environments, code reviews with test environments, and testing as part of the development process.
Of course, this is only the beginning. At most companies, some workflows are still manual and tedious, such as PR reviews, validation replications, PII tracking, and migration auditing. All of these should have tooling built around them so they are as automated and error-free as possible.
Automation as a principle
Speaking of automation, we all know that we should Automate the Boring Stuff, but automation is about a lot more than getting rid of the boring stuff. The reality is that when things aren’t automated, they do not get done every time. So if something must happen, if it is mission critical, then we must automate it. Automation gives us confidence that the things that must happen are happening.
Automation can be done in many ways, but it is often seen as part of the CI/CD process. Testing every change as part of the code review process, so that you stop breaking your data, is an example of the power of automation.
Checklists aren’t automation. Checklists are a powerful tool, but they are prone to human error or omission. Checklists can complement automated workflows, but we cannot settle for checklists over automation.
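As a rough sketch of turning checklist items into something that runs on every change, the script below encodes two checks as functions and reports which ones failed. The check names, the `leads` rows, and the columns are hypothetical. A real setup would more likely rely on the tests built into a tool like dbt than on hand-rolled functions.

```python
from typing import Callable

Row = dict[str, object]


def is_unique(rows: list[Row], column: str) -> bool:
    """True when no value in `column` appears more than once."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def is_not_null(rows: list[Row], column: str) -> bool:
    """True when every row has a non-null value in `column`."""
    return all(row.get(column) is not None for row in rows)


def run_checks(checks: dict[str, Callable[[], bool]]) -> list[str]:
    """Run every check, every time, and return the names of the failures."""
    return [name for name, check in checks.items() if not check()]


# Hypothetical sample of a leads table after a proposed change.
leads: list[Row] = [
    {"lead_id": 1, "region": "West"},
    {"lead_id": 2, "region": None},
    {"lead_id": 3, "region": "Northeast"},
]

failures = run_checks({
    "lead_id_is_unique": lambda: is_unique(leads, "lead_id"),
    "region_is_not_null": lambda: is_not_null(leads, "region"),
})

# A CI job would typically pass this to sys.exit() so a failure blocks the merge.
exit_code = 1 if failures else 0
print(failures, exit_code)
```

Unlike a checklist, nobody has to remember to run this: if it is wired into the pipeline, a skipped step is no longer possible.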
Enforce reproducibility and repeatability
Part of doing analysis well is doing it in a way that gives people confidence in it. First, you build confidence in your methods. Second, you build your peers’ confidence in your methods. Finally, you build your stakeholders’ confidence in your methods.
Reproducibility and repeatability are core methods to help build confidence. Reproducibility means that every time you run an analysis on the same static data set, it produces the same numbers. This catches problems like those that notebooks, which are often used in exploratory analysis, are notorious for. Repeatability means that we can update that static data set and run the analysis again with the same methods, so we can draw the appropriate conclusions. For example, a retention analysis run in April will be even more useful when it is rerun in May to see what has changed in customer behavior.
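The distinction can be sketched in a few lines. This is an illustration under assumptions, not a prescribed method: `retention_rate` stands in for a real analysis, the customer rows are invented, and hashing the result is just one simple way to confirm two runs agree.

```python
import hashlib
import json


def retention_rate(customers: list[dict]) -> float:
    """Share of customers still active; a stand-in for a real analysis."""
    active = sum(1 for customer in customers if customer["active"])
    return round(active / len(customers), 4)


def fingerprint(result: object) -> str:
    """Hash a JSON-serializable result so two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical static snapshot taken in April.
april = [
    {"customer_id": 1, "active": True},
    {"customer_id": 2, "active": True},
    {"customer_id": 3, "active": False},
    {"customer_id": 4, "active": True},
]

# Reproducibility: same static data, same method, same numbers.
first = fingerprint(retention_rate(april))
second = fingerprint(retention_rate(april))
assert first == second

# Repeatability: the same function runs unchanged on the updated May snapshot.
may = april + [{"customer_id": 5, "active": False}]
print(retention_rate(april), retention_rate(may))  # 0.75 0.6
```

Keeping the analysis in a function under version control, rather than in ad hoc notebook cells, is what makes both properties cheap to verify.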
There are many methods for implementing reproducibility and repeatability into your workflows. Git, SQL, and CLI-first workflows are key ingredients in your R&R Soup. If this is a completely new idea, try getting started with a Query Library.
Simplicity is a feature, not a bug
Imagine this: it is your first day at a new job, joining a data team of twelve. The biggest hurdle you’ll have to overcome? Not learning the names of all the members of your team. Not onboarding remotely. Not learning your organization’s code practices. The hard part is navigating complexity: complexity in your code base, complexity with your run times, complexity with your stakeholders, complexity in the politics of an organization.
There are five domains in which Filippova outlines complexity:
- Idempotency and data history
- Data activation
One way we proactively combat complexity is by leaning into simplicity. By choosing boring solutions, we leverage simplicity as a tool to actively counter complexity. Examples of simplicity in our work can include standardizing the SQL patterns we use with a SQL Style Guide and writing technical documentation.
Proactive > Reactive
Rather than waiting for a problem to arise and trying to fix it under pressure, it’s easier to take action when you can identify the areas that need the most TLC. This allows you to make improvements without the pressure cooker situation of things being broken.
When we are reactive in our workflows (i.e., when we wait until something goes wrong), we lose sight of what matters most: our customers’ needs and expectations. So consider how you can make proactive improvements part of your culture.
It can be hard to be proactive when we feel like our teams are stuck in the service trap, but there are baby steps that all team members can take to help create a more proactive culture. Start by leveraging Slack to help create a more data-centric culture. For data literacy, create a #data-reads channel where everyone can share interesting articles, newsletters, or other data-related resources worth learning from. For data adoption, create a #data-insights channel to share the things the data team is discovering with the whole company. Two small Slack channels, one giant leap toward a more proactive data culture in your company.
Prioritize Enablement
Data expertise everywhere is a true differentiator for companies. One of the best things a company can do is not only distribute analyses and insights but also empower as many people as possible to perform their own analyses. Enablement happens in all parts of the organization.
Data practitioners must enable functional practitioners. Data platform teams must enable distributed practitioners. Data leaders must enable platform teams.
Enablement, coupled with literacy, is a prime example of how we empower organizations to go from having data to leveraging data. While there are many models to consider, my favorite is the Data Business Partnership as a way of transcending most organizational norms around how data teams work with functional teams.
Focus on Impact
For data to be impactful, people must have access to it. It is not enough for data to exist or be available if it’s not actually being used.
A common trap that teams fall into is providing data or analyses to stakeholders and leaving them to explore on their own. Data teams must present recommendations or otherwise share the implications of the data in order to effectively move the needle.
Put differently, one way to make sure that we drive impact with data (instead of just distributing it) is to give all stakeholders access to relevant information about what's happening with their data and why. This includes executives who need context around how it impacts them directly, as well as middle managers who may not have full visibility into every aspect of their department's workflow but still want insight into how things are going in general.
You should also share metrics about performance against business goals at every stage along the way, so everyone understands what needs improvement and where new insights from different departments or functions reveal opportunities. This can be done in many different structures, but weekly or monthly business reviews are a common approach that helps companies build a muscle around using data.