Improving the Modern Data Stack - Recap of Monday Morning Data Chat
In this episode of Monday Morning Data Chat, entitled “Improving the Modern Data Stack,” podcast hosts Matt Housley and Joe Reis interview Gleb Mezhanskiy.
Matt and Joe host the Monday Morning Data Chat podcast, where they have candid chats about data. They cover many topics like managing data teams, data management in the cloud, and data engineering.
Gleb Mezhanskiy is the CEO and founder of Datafold, a leading data quality platform with diffs, column-level lineage, alerts, and more.
Below is a summary of episode highlights.
Q: What are the common challenges in data?
Applying data to business logic is hard at every stage, so companies of all sizes run into this problem. The scale varies, but ultimately the problem of change management and tracking the impact of data changes applies everywhere.
For smaller companies, the struggle tends to be acquiring enough budget for proper data tools and developing data engineering expertise. This is especially true for startups; they often move so quickly that they’re developing expertise on the fly, with little to no formal change management. As a result, data discrepancies can creep in early and unnoticed.
Larger companies don’t usually have the same struggles with budget and expertise; rather, politics is their main struggle. Data tells a story, and it’s not always a favorable one.
Lastly, data quality is always hard. When you’re working with widespread heterogeneous data sets, tracking all of the business logic changes and how they affect the data is a monumental task. Without good tooling, the task is nearly impossible.
Q: How can we use tech to reconcile people getting different answers to the same data question?
Technology can assist, but there’s no replacement for collaboration. For each question, the key is to reach an agreement on what you’re actually trying to answer and what level of accuracy is required.
For example, let’s say a company has different teams that use different indicators for an “active user” metric. How do you reconcile this? When the source of the data may not be perfect, it’s important to run an experiment without fear. If there’s room for a margin of error in the answer, then pose the question to both systems and use an algorithm to reach a consensus.
Another important aspect is setting expectations. The worst conflicts arise when people expect accurate data everywhere, or when they use naturally flexible data to answer questions that require accurate data.
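As a rough illustration of the "ask both systems" idea, here is a minimal Python sketch. The function name `reconcile`, the 5% default tolerance, and the choice to average the two answers are all assumptions made for this example; the episode does not prescribe a specific algorithm.

```python
def reconcile(value_a, value_b, tolerance=0.05):
    """Return a consensus value if two answers agree within a relative
    tolerance; return None if they disagree too much to reconcile.

    Averaging is an illustrative choice, not a recommendation from the episode.
    """
    baseline = max(abs(value_a), abs(value_b))
    if baseline == 0:
        # Both systems report zero, so they trivially agree.
        return 0.0
    relative_gap = abs(value_a - value_b) / baseline
    if relative_gap <= tolerance:
        return (value_a + value_b) / 2
    # The gap exceeds the agreed margin of error: escalate to the teams.
    return None
```

If two teams report 1,000 and 980 active users, the gap is inside the default tolerance and the sketch returns their average. If they report 1,000 and 800, it returns `None`, which is the signal to go back to the teams and agree on a definition.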
Q: What are some ways to improve the modern data stack?
Data complexity has become much more visible in the past few years, and dbt made that especially clear. Because teams now develop faster, they hit data inconsistencies sooner, which means we need tooling to catch the inconsistencies and fix them faster. Software engineering and DevOps adopted the practice of catching issues early, long before data engineering started doing it. Data engineering is now gradually adopting this practice, which benefits the early stages of the data development cycle.
The post “Interfaces and Breaking Stuff” by Tristan Handy, CEO of dbt Labs, goes into some of this with the idea of data contracts. There is a lot of potential around data contracts (data mesh) and general workflow improvements.
The show hosts noted that some of the safety rails, like schema checks, have been around for a long time, but data engineers and developers need to actually use a schema to get any benefit from it. Monoliths, in particular, reaped great benefits from schema checks. With the advent of microservices, however, schema checks may have fallen out of favor.
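To make the idea of a schema check concrete, here is a minimal sketch in plain Python. The column names in `EXPECTED_SCHEMA` and the helper `schema_violations` are hypothetical and exist only for this example; real pipelines would more likely rely on the checks built into their database or a validation library.

```python
# Hypothetical expected schema: column name -> required Python type.
EXPECTED_SCHEMA = {"user_id": int, "email": str, "signup_ts": str}


def schema_violations(row, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems found in one row (a dict)."""
    problems = []
    for column, expected_type in schema.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            actual = type(row[column]).__name__
            problems.append(
                f"{column}: expected {expected_type.__name__}, got {actual}"
            )
    # Columns the schema does not know about are also reported.
    for column in row:
        if column not in schema:
            problems.append(f"unexpected column: {column}")
    return problems
```

A row that matches the schema yields an empty list. A row with a string `user_id` and no `signup_ts` yields two problems, which a pipeline could use to reject the record before it reaches downstream tables.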
Q: What do you think about data contracts?
In general, applying software engineering practices to data is a good idea. However, software tends to be more deterministic, which makes testing more straightforward. You can perform unit tests. If something breaks, then the user knows it right away.
With data, it’s much more difficult to determine if the data is accurate because the data used to determine accuracy can evolve over time. If an incorrect evolution isn’t caught, then it becomes pervasive. However, data contracts and better interfaces can help.
That said, data contracts currently hit a ceiling when they need to span multiple tools. Establishing a common interface across all data vendors is an insurmountable task. Beyond that, even if you developed a contract, who owns the contract? Open source is an option, but it would need to be a carefully curated endeavor so that we don’t wind up with a badly constructed open-source data contract that becomes pervasive.
In the past, model versioning wasn’t practical because it took up a lot of storage. Today, versioning in a data warehouse is actually quite cheap, so there may be a middle ground where you version models and stay aggressive about cleanup.
Still, it makes more sense to work on making the data stack more robust because most of the bugs that cause a fire are not related to changes in codebases. Often, the cause is not that the schema changed, but rather that the data has changed.
Ultimately, this all comes down to change management. The key takeaways are:
- Focus on improving the workflow.
- You need to know what you ship and evolve quickly. This means understanding individual metric changes and how they impact your downstream consumers.
These steps need to be firmly in place before talking about model contracts in practice.
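One way to picture “knowing what you ship” is a row-level comparison of a table before and after a change. The sketch below is a simplified illustration in plain Python, not Datafold’s implementation; the function `diff_rows` and the idea of passing tables as lists of dicts are assumptions made for the example.

```python
def diff_rows(before, after, key):
    """Compare two versions of a table (lists of dicts) by primary key.

    Returns the keys that were added, removed, or whose row content changed.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())
    changed = sorted(
        k
        for k in before_by_key.keys() & after_by_key.keys()
        if before_by_key[k] != after_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}
```

Running this against the output of a model before and after a code change shows which rows a downstream consumer would see differently, which is the kind of information that needs to be available before a change ships.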
Q: Is there room for chaos engineering in data engineering?
There’s room for catching edge cases through chaos engineering and then double-checking the results. Dry runs are helpful for proactively identifying issues, without necessarily undertaking chaos engineering. However, the main problem with dry runs is that they don’t scale. As the data set grows in complexity, so does the number of dry runs that you need to perform. To be effective at this, you need a tool like Spectacles for Looker, which brings the dry run concept to CI/CD.
Q: With companies that have really heavy data needs, do they already have a better solution for integration between application teams and analytics teams?
The short answer is this: It’s a hard problem, and this is just as much an organizational problem as it is a tooling problem.
For example, operational metrics are highly specific to application teams because they inform them about the correctness of the application. Then, the analytics teams build the business metrics used for making high-level decisions. The disconnect arises between what the application does and what business metric it impacts. There is a lack of understanding of what your data and your metrics affect downstream. This goes back to knowing what you ship from a data engineering perspective.
You also need to bring software engineers into the fold with data. They need to see the effects of their changes. Techniques like data lineage make this possible. A blend of processes and tools is necessary. Organizationally, make the case for blending software engineers and data engineers into one team, removing those communication barriers. This is especially important for fast-moving data companies.
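As a small illustration of how lineage lets an engineer see the effects of a change, the sketch below walks a lineage graph to list everything downstream of one table. The graph shape (a dict mapping each table to its direct consumers) and the function `downstream` are assumptions for this example; lineage tools expose this information in their own ways.

```python
from collections import deque


def downstream(lineage, start):
    """Return every asset reachable from `start`, in breadth-first order.

    `lineage` maps an asset name to the list of assets that read from it.
    """
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```

Given a hypothetical graph where an application events table feeds a staging model, which in turn feeds an active-users metric and a sessions table used by a dashboard, calling `downstream` on the events table lists all four dependents. That is the view a software engineer needs before changing the application’s event format.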
At the end of the day, equipping teams with the right data and tooling is what enables this to work. In data engineering, there’s still a misconception that we can hold all the data complexity in our heads. Software engineers have already conceded that they can’t keep all the code complexity in their heads; they need tooling to make sense of it. Data engineering needs to do the same.
Q: What are your thoughts on where the data industry will go in the next few years?
There are a few things that are already happening.
First, putting together a data stack is far less complex and far more affordable, so a data team will start to become a mainstay for more companies, regardless of size or industry.
Second, while data stacks will become simpler for the end user, the data will become more complex. Right now, everyone does big data, but it’s not always big and complex data. We can expect the complexity to increase. This will shift the discussion from simply “big data” to “big metadata”. Good data teams will be commended for the complexity of the data that they manage, not the size.