Proactive Data Quality on the Data Engineering Podcast

In July 2021, Datafold co-founder and CEO Gleb Mezhanskiy joined Tobias Macey on the Data Engineering Podcast to share his vision for a proactive approach to solving the data quality problem that bugs most data teams today. He also told the personal story of his work as a data engineer that led him to start Datafold. A couple of months later, it seemed fitting to share a recap of the conversation, along with the full transcript for those who prefer to read along as they listen.


Over the course of the conversation, Gleb shared his journey in the world of data, from starting at Autodesk before Airflow, Looker, or even Snowflake had evolved to what they are today. He went on to share the fateful story that would eventually be the foundation for creating Datafold. Back in his Lyft days, Gleb pushed a simple, 4-line SQL code hotfix, following all the rules and procedures for making changes to the SQL codebase. Despite doing everything “right”, he managed to slip in a "small" error, resulting in a major data incident.


That incident would eventually lead Gleb to wonder why there weren’t tools in place that could automatically detect and prevent data quality issues in the first place. So many great tools and technologies have evolved that facilitate data production and analytics at a massive scale, but the tools to manage that complexity haven’t necessarily kept pace with that growth. 


The podcast dives into questions about the culture of data quality, how the industry is evolving and changing, and what challenges we should anticipate in the future. Of course, the two also discussed Datafold in more detail, particularly the Data Diff and column-level lineage features that are in use at major data-driven organizations. Together, these tools provide visibility and observability for the whole data pipeline.


Interestingly, when asked about an unexpected way that Datafold is being used in practice that maybe wasn’t anticipated while developing the platform, Gleb described how Data Diff and column-level lineage are helping during data migration projects. Not only can teams track the data flow and ensure proper transfers and deprecations, but they can also use Diff to show that the new data matches what’s expected based on the legacy systems. This builds stakeholder confidence, plus makes for a much smoother transition process.


Be sure to listen to the Data Engineering Podcast, or feel free to read the full transcript below. There’s even an interesting business idea if you’re looking to launch your own data technology!


Datafold on the Data Engineering Podcast: Strategies For Proactive Data Quality Management

Full transcript below:


TM: Your host is Tobias Macey, and today I'm interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them. So, Gleb, can you start by introducing yourself?


GM: Thanks Tobias for having me. My name is Gleb. I'm currently CEO and co-founder of Datafold. We're building a data observability platform that helps data teams build data products faster and with higher confidence. Before building Datafold, I was a data practitioner doing data engineering, data science, and analytics. So a lot of what we're building is informed by my personal experiences and pain points.


TM: And do you remember how you first got involved in the area of data management? 


GM: Yeah, it was back in 2014, when I joined Autodesk's newly formed consumer group, which had a portfolio of over 20 B2C creativity tools, as a one-man data platform. I was then tasked with centralizing all analytics around this portfolio of 20 apps.


And the great part about that was that, as a one-man data platform, I got to choose all the tools I wanted to put together for my stack. But it's also worth mentioning that, although 2014 was not so long ago, we lived in a pretty much completely different world data tools-wise. Airflow wasn't released yet. I think Looker and Snowflake had just raised their Series B, and Spark was bleeding edge, just past its first release.


And so tooling-wise, and I think approaches that were used back then were quite different from what is mainstream today. 


TM: Yeah, it's definitely always crazy to look back at some of the timelines, because the overall data tooling space has been moving so fast that if you look at what's out there today, it's just impossible to even remember what came out when and how long it has been available, because as you said, you know, 2014 is only seven years ago, but it's a complete lifetime and an entire paradigm shift away from where we are right now with the overall data landscape.


GM: Yeah, absolutely. And a huge shift in the problem space as well. So I think what's top of mind for data teams today is a very different set of problems than what we were facing back then. 


TM: Yeah, I think at that point it was still just a matter of, I need to get this data from here to over there and I need to make sure that it doesn't error out halfway through. And now we're sort of moving up the pyramid of, you know, the hierarchy of needs to, you know, data observability is actually one of the concerns for data teams now that wasn't even on the table seven years ago. 


GM: Exactly. 


TM: And so in terms of what you're building at Datafold, can you give a bit of a background and overview about what it is that you're creating and some of the story behind what motivated you to launch this?

The story behind Datafold


GM: Yeah, absolutely. I think to tell the story of Datafold, I should also give a little bit of background on my path in data engineering. After building the data platform at Autodesk, I moved to Lyft, where, at the time I joined, we had a 15-person data team that over the course of the next three years grew to an over-300-person org.


And so with almost, you know, 20X expansion of the team, exponential growth of the business, and of course the data volume and complexity, that all created tremendous pressure on infrastructure and tooling. And so I initially started as a data analyst, building data products, such as BI reports and forecasting machine learning models.


And I very quickly realized that the available tooling was really not suited to tackle the problems that were rapidly emerging due to the growth of the team and the data. So I switched my focus from building data products to building tools that enable data developers and data scientists to build those products, because the complexity of the data, the reliability of the data, and the speed of development in the data team were quickly becoming bottlenecks for the business growth.


And so one of the key, I guess, pivotal moments for me to start focusing on tooling was when, as a data engineer, I was on call, so basically responsible for taking care of all incidents, and I had to ship a very small incremental change to one of the core jobs that were building analytical datasets.


And I made just a very tiny change, about four lines of SQL. I did some testing, I got a code review from my teammates, and I shipped it. I merged it and rebuilt the entire DAG. And the next day we discovered that there was a huge data incident going on. Basically, all analytics was stopped because it was apparent that a huge portion of the data was missing.


And what was crazy is that it took us about six hours to realize that the data incident was related to the change that I made the previous night. Even for me, the person who made the change, it wasn't at all apparent that the data incident was related to it. Right? And the scariest part is that I followed the process that existed, and I used the tools that existed, but even still, I was able to make a mistake that led to a really bad outcome for the business.


It took us the full next day to clean it up, rerun all the affected processes, and get all the data pipelines back on track. And so the realization that one person, you know, making a small change can bring down the entire platform at a large company, with huge business impact, was one of the pivotal moments for me to start focusing on building tools, first internally at Lyft, and then eventually starting Datafold to help solve these problems for everyone.


So back at Lyft, just to give you a sense of what we were building: we built a framework on top of Airflow that enhanced the developer experience and helped build more testable pipelines. We also built real-time anomaly detection on top of Flink. And we built an early version of a data catalog that was a predecessor to Amundsen, which is now open source.


And so all of these projects really impacted how the entire data org was building data products. And then the realization was that something Lyft needs at its scale will probably also be needed by the rest of the data community, which likely suffers from the same issues but won't have the resources to build so many different tools in-house.


So that's kind of my personal experience that led to the creation of Datafold. And I guess the macro reason, or the bigger why, of why I decided to start a company building data observability and data quality tooling is that, obviously, data is eating the world. And I think we're just at the beginning of seeing how data products disrupt all industries in the world.


And we kind of started talking at the beginning about how the data environment is different right now from, let's say, seven or ten years ago. Over the last five to seven years, we really solved a lot of the fundamental problems of how we store data and how we collect it. We now have really fast, almost limitlessly scalable databases, right?


We have great BI tools, visualization capabilities, we have ML infra. But the problem that emerged over the last few years is: now that companies have limitless capacity to accumulate and produce data, how do we deal with this complexity? How do we tackle the problems of data quality when we're dealing with, you know, tens of thousands of tables and millions of columns at an average-sized company?


And so I think that by solving those problems for the people, for the data teams that are using and developing data day to day, we can make a really huge impact on the world in general. So that's the bigger why behind it.


TM: Yeah, it's definitely a huge problem that a lot of teams are dealing with, because there's this explosion in tooling and it's moving so fast that it's hard to keep up. And so you're just trying to build systems and keep them moving and deal with all the different data sources. And now that data integration is a lot easier with tools like Fivetran or the Stitch ecosystem, where anybody can say, oh, I'm going to connect this data source into my data warehouse, data teams are just being completely swamped.


And so even just keeping track of what data exists has become an entire tooling problem. And, you know, entire companies are being launched just on that one problem.


So the fact that managing the quality of any one of those data sources can have such an outsized impact, as you mentioned, with just four changed lines of SQL destroying the entire productivity of the company for a day, is definitely a huge financial burden, particularly for companies that aren't set up to handle it. And so in terms of those data quality problems, I'm curious what you see as being the biggest factors that actually contribute to incidents of quality problems, or pipeline failures, or some of these outsized impacts that can happen from a small change.

What are the biggest factors that contribute to data quality problems?

GM: Yeah, absolutely. I think data quality is right now becoming as big of a problem space as software quality. So it's enormous. And I don't think there is any single, you know, framework or tool or solution that really solves it, even for a not-very-large company. And so I think with such big problems, it's always helpful to try to break them down into a few dimensions. Then it becomes more manageable.


And one way to look at data quality problems is to look at what the sources of those problems are. One is obviously operational issues, right? Say our data-producing jobs are delayed, there are infrastructure failures, there are errors, there is queuing in the system. So data is not available; it's not computed. I think this problem space is better understood right now and probably easier to manage, given the maturity of infrastructure.


Another failure scenario is when the data that we rely on changes. Examples of that could be vendors that we use to ingest the data not complying with expectations.


Other teams making changes to their data sources and causing downstream impact. Or there can also be a change in the business, so fundamental changes in the world that also get reflected in the data.


And the third big category of data quality problems arises from us, data engineers and data developers, making changes to our data products. So making changes to the code that processes data, be that in SQL, Python, Scala, or other frameworks. Changes to the business logic that lives in business intelligence tools; right now, many of those tools also contain a lot of logic in terms of how data is computed and presented. Changes to ML model definitions.


So right now, data-driven companies, companies that really rely on data for making decisions by humans and machines, typically deal with codebases for processing data that are comparable in size to their actual software product. So it's a tremendous amount of complexity, probably tens or hundreds of thousands of lines of code.


And it's also very rapidly evolving, because to be really data-driven, we not only need to build out all the infrastructure and all the models and the star schema, we also have to rapidly iterate on them to keep up with the business demands and the new challenges that growth goals pose.


Within that framework, right, of operational issues, changes to the data, and changes to the data processing code, I think the last area is probably the least studied and the least understood. And I think it's somewhat natural, right? Because the first standpoint from which you deal with data quality issues is to think, well, I want to at least know whenever they happen.


But I think to really tackle the problem, we have to pay closer attention to how we work with the data: what is our development process, what is our change management process? It's by solving and solidifying that that we can achieve better data quality.


TM: To your point with the parallels to the software quality space, there are a lot of tools like linters and unit tests and, you know, static code analysis for both potential bugs and security implications.


And because data platforms are not a static system, there is no single point-in-time snapshot that accurately represents the entire system, as there can be with code, although with code it gets complex as well. How are we able to map some of those same concepts from software quality management into the data space, and deal with the dynamism of real-world data cleanliness issues and how they impact the actual systems that process the data, to create these preventative maintenance systems?

What can the data world learn from software development practices?

GM: So I think that one of the bigger trends that we see right now in the data world is the application of what are now considered standard software development practices to the data workflow. Some of the ways in which we can solidify our data development process are to, one, bring in version control, right?


So version control of everything, starting with your ETL code, that ingests data, that processes data. Also version control for things like even BI dashboards. Because if we think about this, in the current world where companies have entire meetings structured around dashboards to make decisions about investments or about killing or doubling down on a particular feature, the stakes of making the wrong decision based on data are really high.


And so all data products, no matter whether they are executive-facing or, you know, going into production, should have version control, because that enables, one, very clear reproducibility of whatever the state of the code is. It also enables a very clean and visible change management process, because we can cleanly delineate between the previous version and the new version.


And it also allows for more seamless collaboration between teams, because right now data products are not built by a single person or even a single team. We probably have dozens of people collaborating on each one, especially at larger companies.


I think the second important aspect of the development process that we're seeing come from software engineering into the data world is having good visibility into changes. So whenever we are making a change to, let's say, a pipeline that transforms data, we have to really understand what this change entails, both for us as a team, and also for our downstream stakeholders and consumers.


And so in the software world, we are typically doing this through regression testing, right? We are running unit tests, we're running regression tests. In the world of microservices, we are potentially exposing a tiny bit of traffic to the new service and observing what happens. And in the data space, we now also have similar frameworks: for example, assertions, such as, you know, validating that a given column in a data set is unique or not null, are very helpful for validating business assumptions about the data, and they can run both during the development process and in production.
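
To make that concrete, here is a minimal sketch of such assertions in Python with pandas (the table and column names are hypothetical, and this illustrates the pattern rather than any particular framework's implementation):

```python
import pandas as pd

def assert_unique(df: pd.DataFrame, column: str) -> None:
    """Fail loudly if a column that should be a key contains duplicates."""
    dupes = df[column][df[column].duplicated()]
    if not dupes.empty:
        raise AssertionError(
            f"Column '{column}' expected unique; duplicated values: {sorted(dupes.unique())}"
        )

def assert_not_null(df: pd.DataFrame, column: str) -> None:
    """Fail loudly if a required column contains NULLs."""
    nulls = int(df[column].isna().sum())
    if nulls:
        raise AssertionError(f"Column '{column}' contains {nulls} NULL value(s)")

# The same checks can run during development (against a staging build)
# and in production (against the freshly materialized table).
orders = pd.DataFrame({"order_id": [1, 2, 2], "user_id": [10, 11, None]})
assert_not_null(orders, "order_id")
assert_unique(orders, "order_id")  # raises: order_id 2 appears twice
```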


But I think one of the still-missing aspects that we're trying to close in the development process, and in particular in having visibility into changes, is understanding the full impact analysis of the change that I'm making. For example, that can come down to simple questions, such as: what is the number of rows that a particular data set will produce? Will I have any drifts in the features? Am I going to break any dashboards because I removed or renamed a column?


And so having this visibility is really paramount for a reliable change management process. 


And then the third component that I think is also rapidly making its way into the data world is continuous integration and continuous deployment. Similarly to how it helps software teams be more agile, make smaller, incremental changes, and then ship them faster in a really reliable way, in the data world we see almost a renaissance of CI, where data teams are investing in automated testing procedures. So for example, whenever someone checks in code that transforms the data, or even controls the layout of a dashboard, there is an automatic process that runs tests, maybe builds a staging data set, and then maybe even automatically merges this code and deploys it to the ETL orchestrator.


So that really helps make sure that whatever the change management process is, it's not only available to people, it's also automatically enforced, right? No change bypasses the testing that is required.
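
As a sketch of what that enforcement can look like in practice (this example is ours, not from the conversation; the commands and paths are hypothetical), a CI job might build the changed models into a staging schema, run the data tests, and refuse to proceed on any failure:

```python
import subprocess
import sys

def run(cmd: list[str]) -> None:
    """Run one CI step and abort the whole job if it fails."""
    print("+", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit(1)

if __name__ == "__main__":
    # 1. Build the changed models into an isolated staging schema.
    run(["dbt", "build", "--target", "ci"])
    # 2. Run data assertions against the staging build.
    run(["python", "-m", "pytest", "tests/data_quality"])
    # 3. Only if both steps pass does the merge/deploy step proceed.
    print("All checks passed; change is safe to merge.")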


TM: As far as the tooling and the platforms and the sort of impact they can have on data quality: what are some of the ways that they can contribute to the occurrence of data quality issues, in terms of the systems that you're building, the way that your data platform is architected, and some of the design considerations that teams should be thinking about as they're planning out their data platform, or as they're starting to introduce new systems or new processes?


GM: So as a big believer in great workflows, I think that the best way tools can support reliable data and help data teams ensure high data quality is to really facilitate those strong workflows.


And to give you an example, we talked about version control, and we talked about testing and CI. We see that certain tools that we now consider part of the modern data stack, for example dbt for SQL transformations, or tools like Dagster for general-purpose data pipelines and tasks, come with those features and frameworks already built in. So they already facilitate version control. They have built-in testing frameworks that make it really easy for data developers to write tests and run them as part of the pipeline. And they have documentation frameworks that both keep the documentation close to the code, which is always great, and serve that documentation in a nice UI that can be consumed not just by data developers, but by data users.


And very importantly, they have separate production and staging and development environments. That also is a very important concept for making sure that the change management process is reliable.


TM: As far as the potential consequences, we have addressed some of that: you know, if you have a wrong column or the data is old, it can potentially lead to costly decisions that end up being based on incorrect assumptions about the data that's available.


And so how can organizations start to shift to being more proactive in their data quality management, and start to instill the understanding at the business level that it's worth the investment and the time and energy it takes the engineering team to create these systems for proactive management? And also, how do you instill the level of care and diligence that's necessary across engineering teams, not just within the data organization, so that, you know, data quality is everybody's problem, and anybody can have an impact on it?

How can organizations become more proactive around data quality management?

GM: I think probably the first step that's important in every organization is, one, recognizing that there is a problem and getting buy-in to solve it. Unfortunately, we still see that some teams, you know, live with data quality issues as a status quo, right? So we have to recognize that there's a problem and that we are able to improve it.


I think the second important aspect is understanding what are the root causes of the issues. So probably trying to classify them and see what are the areas that are most risky and most impactful. 


And again, I'd like to emphasize proactive data quality management through improving the development process, over post-factum monitoring, the kind of black-box solution whose promise is "tell me when my data is wrong".


It's quite hard to rely on that alone to improve data quality, because by the time you've identified that there is an issue in production, the damage is already done, right? The stakeholders have probably already looked at dashboards showing wrong information, and machine learning models have ingested the wrong data and skewed their results.


Another problem is that, by the time the bad data is in production, it can be really hard to identify the root cause, because with multi-stage data pipelines, corrupted data propagates really fast and becomes ubiquitous.


And the other aspect, which is more organizational, is that with data quality issues that are already in production, to fix them you have to fight organizational momentum, right? You have to advocate for people to stop whatever they're doing and go back and fix them, as opposed to working on new things, which is always an uphill battle. That's why I strongly advocate for data teams and companies to really look into preventative ways to address data quality, because then all of those issues are taken care of up front.


And so in terms of how to think about improving the process, I think an important aspect is to understand the current inefficiencies of the process. Is the bottleneck in the ability to ship data? Do teams need better frameworks for shipping data products faster? Sometimes a team would need to, let's say, switch to a more agile framework like dbt, which comes with a lot of the data quality toolkit features already.


But assuming that the basic infrastructure and tooling is already in place, I would start with planning out the change management process. So what are the steps that are required in order to make a change to our data products, be that a SQL job or a BI dashboard? And then introducing the visibility tools.


So how can we make sure that tests are executed and that we have a full understanding of the changes we're making, and then make sure that these processes are enforced?


TM: As far as what you're building at Datafold, I'm wondering if you can talk through some of the design and features that you are building in, and some of the architectural aspects of the system that enable this proactive data quality management: finding and fixing, you know, data quality issues and data bugs before they actually go out into a production context.

How Datafold helps with proactive data quality

GM: Yeah, absolutely. So we call Datafold a data observability platform. And by observability, we mean that we help data teams discover and understand their data: how it works, what the distribution of the data is, where it comes from, where it goes, and also verify and test it. And while there are multiple features that I won't go into in detail right now, the really key pieces of the platform that help enable a reliable change management process are Data Diff and the column-level lineage engine.


So Data Diff is a tool that analyzes changes in the data and provides a visual report across multiple dimensions and with various degrees of granularity. You can think of it as Git diff for data or, you know, a Microsoft Word diff, but for data sets.


So whenever you want to compare two data sets, it gives you a view into how they are different, both in terms of individual rows and on a statistical level, in terms of the distributions.


And so how does Data Diff fit into those workflows that we discussed? So for one, it helps you automate regression testing because you can compare the before and after state of your data product.


For example, you can compare the production version of your dataset with the development version of the dataset built with the new code that you're about to merge. And that helps you answer questions such as: what is going to happen to the data? Are there any unintended changes to the number of rows or the percentage of rows? Are we going to cause feature drift by changing distributions on particular dimensions? Are we going to cause the BI tools to fail because we renamed or misplaced columns?


So Data Diff helps answer those questions without writing any SQL or without doing any manual checks. And the way it fits into the workflow is essentially automating what most teams do right now, but manually.


So we spoke to some really senior data engineers at public companies and learned that sometimes they spend up to a week testing a single change to a really important SQL job, if that job, for example, powers financial reporting, because the stakes of introducing a regression are super high. And the majority of the time in that week goes into writing arbitrary, ad hoc SQL queries that are essentially comparing things and validating things to make sure that there are no regressions. So Data Diff essentially takes that manual part of the work away.
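
To illustrate the kind of before/after comparison being automated here, a toy sketch in pandas might look like the following (this is not Datafold's implementation; it assumes both versions share a schema and a primary key):

```python
import pandas as pd

def diff_summary(prod: pd.DataFrame, dev: pd.DataFrame, key: str) -> dict:
    """Summarize row-level differences between two versions of a dataset."""
    merged = prod.merge(dev, on=key, how="outer",
                        suffixes=("_prod", "_dev"), indicator=True)
    both = merged[merged["_merge"] == "both"]
    # A row counts as changed if any non-key value differs between versions.
    changed = pd.Series(False, index=both.index)
    for col in (c for c in prod.columns if c != key):
        changed |= both[f"{col}_prod"].ne(both[f"{col}_dev"])
    return {
        "rows_removed": int((merged["_merge"] == "left_only").sum()),
        "rows_added": int((merged["_merge"] == "right_only").sum()),
        "rows_changed": int(changed.sum()),
    }

prod = pd.DataFrame({"id": [1, 2, 3], "total": [10.0, 20.0, 30.0]})
dev = pd.DataFrame({"id": [2, 3, 4], "total": [20.0, 33.0, 40.0]})
print(diff_summary(prod, dev, key="id"))
# {'rows_removed': 1, 'rows_added': 1, 'rows_changed': 1}
```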


And then, aside from testing regressions between production and development, Data Diff can also be helpful in identifying drifts in the data in production, because we can compare the state of the data today versus yesterday, or, let's say, after a change versus before, to identify any anomalies.


So are there any unexpected consequences? So that's more of an autonomous anomaly detection piece. But back to the development workflow, like I said, the second component is column-level lineage. 


So what is lineage? It's essentially an interactive map of the dependencies in your data ecosystem that shows you, for a given column, where the data goes and where it comes from.


So if we look at a particular dashboard, we can immediately answer, for a given metric, how it's computed and which columns are feeding data into it. And we can see that, for example, a particular column is a combination of two upstream columns, you know, through some operator or a CASE WHEN statement, so we can trace those dependencies up and down.


And while there are multiple uses for column-level lineage, the one that's relevant to a reliable change management process is doing that impact analysis, right? So whenever we are changing, let's say, a SQL job, and we have the Data Diff that shows us the impact on the particular table, the next thing we can do with column-level lineage is understand the potential downstream consequences, the ones we haven't accounted for, of making a given change.


For example, if we change the definition of a given metric, for example conversion, with column-level lineage, we can immediately identify all the downstream jobs, all the dashboards, all machine learning models that are using this metric. So we can, one, potentially do an impact analysis there. Or we can also proactively reach out to stakeholders, to owners of those data products and data users and tell them about that anticipated change.


So together, these two tools facilitate the full understanding of the impact you're making when you're introducing changes to the data processing code. And through that, we can dramatically reduce the chance of making errors and also save a lot of time for data developers that otherwise would go into manual testing.


TM: Another interesting element of the sort of data quality question is that, particularly at organizations that have their own in-house software teams, a lot of the data is going to be coming from operational database systems that are owned and managed by a team that is distinct from the data team, and that has its own priorities, its own release cadences, and its own ideas about what database design should be and how to evolve it.


And then there are also things like customer event tracking, where you have a tracking pixel or a set of JavaScript on a website that is going to have some event schema coming in. And so then you have to deal with pulling those events in, converting them into a database table, and dealing with the downstream transformations there.


And, you know, not even factoring in the third-party SaaS platform data that you need to pull in, just within the scope of data sources that are entirely within the control of your organization but not necessarily owned by the data team, how do you sort of popularize or build an organizational contract between the different stakeholders and data owners about how to manage change propagation through the different systems?


You know, maybe starting in software systems or event tracking to, you know, how that impacts the business dashboard that your CEO is looking at tomorrow. 


GM: Yeah, absolutely. It's a huge problem. And it's typically a big pain point for every company that we spoke with that is really data driven and building lots of data products.


I think the first step is, again, to acknowledge that the change management process for data sources, be that events or operational data stores that are copied to your warehouse, should also be reliable, and to equip the teams that own those sources with full visibility into the impact of the changes that they are making.


And then in the world of event tracking, we are seeing the emergence of tools that are specifically focused on reliable definition and change management of those event schemas, so-called instrumentation trackers or schema validators. The idea of those tools is that you have a central repository for defining events.


So: what is an event, and what are the properties that are sent alongside the event? And then, whenever engineers implement those events, there is automatic validation against the spec to ensure, both during development and in production, that whatever the instrumentation generates, whatever data comes out of those sources as part of the tracking, conforms to the original spec; that all the changes are version controlled; and that all the data developers who use those events, the data consumers, and the engineers who instrument those events are all on the same page.
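
A minimal sketch of that spec-validation idea, using Python's jsonschema package with a hypothetical "order_completed" event (the spec and property names are made up for illustration):

```python
from jsonschema import ValidationError, validate

# Central, version-controlled spec for one event type.
ORDER_COMPLETED_SPEC = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount_usd": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "amount_usd"],
    "additionalProperties": False,
}

def check_event(event: dict) -> bool:
    """Validate one tracked event against the spec before it is accepted."""
    try:
        validate(instance=event, schema=ORDER_COMPLETED_SPEC)
        return True
    except ValidationError as err:
        print(f"Event rejected: {err.message}")
        return False

check_event({"order_id": "o-42", "amount_usd": 19.99})  # True
check_event({"order_id": "o-43", "amount": 19.99})      # False: property not in spec
```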


Speaking about the interoperability of the tools and how we can piece together the ideal stack for maintaining data quality: the missing piece in most of those tools is visibility into the downstream impacts of the events, because they mostly see the world only up until those events land in the warehouse.


And this is where a tool like Datafold can come in, because we have visibility all the way from the raw event sources to the ultimate data consumers. So by plugging these tools together, you can also ensure a reliable change management process for those raw sources.


As far as the operational stores that are oftentimes copied into the warehouse using change data capture: this is a somewhat more complex problem, because copying the data from an operational source is a fairly low-level infrastructure process, and there is a big amount of variability in how companies implement it. Some use vendors, some use open-source CDC methods, some use batch copies. And so whatever the team is using, I think the key part is, again, to make sure that before any change is made to the original source or to the source schema, there is an impact analysis performed that clearly shows what the impact of the change is going to be, because sometimes you can remove a column and no one cares, and sometimes you change one definition slightly and there's a huge data incident.


So understanding the difference between these two scenarios is key. Again, I think column-level lineage is the fundamental instrument and source of information for that. But how exactly it plugs into the change management process for operational data stores highly depends on how the company implements it. 


TM:  To that point, too, of column-level lineage. A lot of systems will look at that from the data warehouse perspective, but it's definitely an interesting question to think about how can we propagate some of that information and extend the visibility of these data tooling systems into the operational stores and the applications so that it becomes part of the application development lifecycle to be able to view and analyze the downstream impacts and not just have that be a responsibility of the data engineers and data analysts.


GM: Absolutely! I think the cool thing is that with the emergence of the ELT pattern, we shifted from doing a lot of in-flight transformations on raw data before it lands in the warehouse to doing one-to-one copies of whatever is in your operational source. So I think the prevalent pattern right now is to copy your entire schema from the transactional store, such as Postgres or MySQL, into your warehouse as is. And if that is the case, then having lineage in your warehouse that shows you the downstream usage of those copies can effectively be translated back to the ultimate raw sources in your operational store, which makes the entire visibility pipeline much easier.


But if you have more complex scenarios, then basically there is also an option to extend your lineage graph to those sources, but that increases complexity massively.


TM: For organizations that aren't necessarily using a cloud data warehouse and are more in sort of the data lake paradigm, where they have data in S3 in Parquet format and they're dealing with partitioned data sets there, and, you know, they might be using Trino or Presto on top of it, or they're using Delta Lake or Hudi, or, you know, the plethora of tools that are arising in that space.


What additional challenges or complexities does that pose to, you know, systems like what you're building with Datafold, to be able to add the level of insight and introspection that's necessary, which is relatively straightforward in a vertically integrated data warehouse stack but not necessarily as cohesive in these data lake environments?

Column-level lineage in a data lake vs data warehouse

GM: I think to answer that, it may be worthwhile to take a look, you know, under the hood of how column-level lineage is constructed. So fundamentally, to have a reliable, bottom-up, column-level lineage map of your data ecosystem, we first have to obtain the code: basically the DDL and DML code, so the code that defines the schema of your datasets and the recipes for how those data sets are created.


And in the world of SQL, that means the SQL queries that create or modify data sets and the SQL queries that consume datasets. By then doing static analysis of that code, so decomposing it into an AST representation and then piecing it back into the global graph of dependencies, we can understand how data is produced and how it's consumed. No matter what happens in that SQL, no matter how complex your queries are, whether you're using correlated subqueries or CASE WHEN statements or renames, a proper column lineage engine should piece it back together. Which we do at Datafold.
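
Datafold's engine is its own; as a toy illustration of the static-analysis idea, though, the open-source sqlglot library can parse a statement into an AST and surface the tables and columns it touches (the schema below is made up, and the exact rendered output may vary by sqlglot version):

```python
import sqlglot
from sqlglot import exp

sql = """
CREATE TABLE analytics.daily_rides AS
SELECT r.ride_date, COUNT(*) AS rides, SUM(r.fare_usd) AS revenue
FROM raw.rides AS r
JOIN raw.cities AS c ON r.city_id = c.id
WHERE c.active
GROUP BY r.ride_date
"""

tree = sqlglot.parse_one(sql)

# Every table reference in the statement, including the target being created.
tables = sorted({t.sql() for t in tree.find_all(exp.Table)})
# Every column reference, qualified by its table alias where one exists.
columns = sorted({c.sql() for c in tree.find_all(exp.Column)})

print(tables)   # e.g. ['analytics.daily_rides', 'raw.cities AS c', 'raw.rides AS r']
print(columns)  # e.g. ['c.active', 'c.id', 'r.city_id', 'r.fare_usd', 'r.ride_date']
```

A production engine does far more, resolving renames, subqueries, and CASE expressions into column-to-column edges, but the raw material is the same parsed representation.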


Now, if you're using a data lake approach and still relying on a SQL-based engine, such as Presto or Spark SQL or Hive, there fundamentally isn't more complexity than building a lineage graph for a basically self-contained warehouse such as, you know, Redshift or BigQuery or Snowflake.


It's just a matter of making sure that you collect those SQL logs. However, when it comes to other scenarios for how data is built, for example using PySpark or Scala Spark or a framework such as Apache Beam, where the language in which data is transformed is not SQL, that massively increases the complexity, because those languages have massively more powerful syntax than SQL.


And so in these scenarios, we have to connect to the underlying, fundamental representations of the jobs: taking a look at how those engines compile whatever their domain-specific language for defining those transformations is into primitive operations, and then using that to augment the graph.


But in any case that probably increases the complexity for building lineage, but as long as we stay in the SQL world, piecing back the entire lineage graph is fairly straightforward. 


TM: How much attention are you paying to efforts such as OpenLineage to try to create more of an open standard for how to think about, represent, and integrate with these lineage graphs, particularly for non-SQL systems that have their own sort of custom transformation logic? And how much potential positive impact do you see as more systems start to adopt and flesh out that standard, or anything else that might be arising in the space?

Column-level lineage and open lineage

GM: Yeah, in general, I'm a strong believer in interoperability between data tools. And I think that's one of the core principles of the modern data stack: tools are increasingly specialized, but at the same time more interoperable and more modular, which allows companies to piece together the stack by choosing the tool that is best in each particular vertical.


And so I think standards like OpenLineage are really important in defining how particular types of metadata are shared between the tools. And the way I think a tool like Datafold can be integrated into a larger data ecosystem using OpenLineage is by providing the fundamental lineage information.


So, basically, the dependency graph that is then shared, using the OpenLineage standard, with other tools. Right now, we already have integrations with data catalogs such as Amundsen and DataHub. So anyone who is using them can ingest column-level lineage information from Datafold using a GraphQL API and then load it into the data catalog.


I think with OpenLineage, that will be even easier once it's adopted more widely in the ecosystem, because once you have a common standard, you can then reuse it across multiple tools, and, like you said, you can also use the standard to piece together different sources of lineage. Right?


So for example, you might use Datafold to obtain all the lineage information from your SQL warehouses, and you may then plug in the lineage graph from systems like Spark and Beam, again using OpenLineage, to construct the global graph of dependencies.
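
A hypothetical sketch of that composition step: given column-level edges exported by two different producers, a shared standard makes merging them into one global graph straightforward (here with networkx and made-up column names):

```python
import networkx as nx

# Edges from two lineage producers: one covering the SQL warehouse,
# one covering Spark jobs upstream of it.
warehouse_edges = [
    ("raw.rides.fare_usd", "analytics.daily_rides.revenue"),
    ("analytics.daily_rides.revenue", "dashboards.exec_kpis.revenue"),
]
spark_edges = [
    ("events.ride_completed.fare", "raw.rides.fare_usd"),
]

# Compose the two graphs into one global dependency graph.
global_graph = nx.compose(nx.DiGraph(warehouse_edges), nx.DiGraph(spark_edges))

# Full upstream trace for a dashboard metric, across both systems.
print(sorted(nx.ancestors(global_graph, "dashboards.exec_kpis.revenue")))
# ['analytics.daily_rides.revenue', 'events.ride_completed.fare', 'raw.rides.fare_usd']
```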


TM: Going back to the organizational aspects of data quality management, in your experience, who has typically been responsible for identifying and addressing data quality issues. And do you think that the current state of affairs is sufficient or beneficial, or do you think that there needs to be a shift in how data quality is sort of owned and operated at the organization level? 

Who is responsible for data quality?

GM: So I think naturally the responsibility for maintaining high data quality falls on the teams that own the data, and typically that's analytics engineering or data engineering teams, which have the largest surface area with the data products and therefore become responsible for end-to-end data reliability. And it's then common for them to pass parts of this responsibility on to software engineering teams.


So for example, the ultimate stakeholder or user of data, let's say a finance team or an analytics team, would expect the data engineering team to provide them with high-quality data. And the data engineering team would then build, or collaborate with the other teams that are responsible for parts of the process of creating data sets, to make sure that the data is reliable across the entire pipeline.


I think what is currently missing is clear contracts between the teams on who is responsible for what, and on the ways teams can collaborate to ensure data quality. Because, like I said, especially with the raw data sources, such as operational data, which is typically owned by completely different teams, and sometimes thousands of teams if we're talking about a large company with a microservice architecture, there need to be clear contracts about who's responsible for what and how the entire process of maintaining data quality and managing change is conducted. So I think one of the changes we'll see in the future is the emergence of top-level key results, or KPIs, at the organizational level that will measure data reliability and data quality at the organizational scale.


And then the various teams that participate in the creation of data products will be responsible for their parts, their contribution to those KPIs, and will be held accountable in a more formal setting. Whereas right now, it's a more ad hoc process where teams are more reactive to certain quality issues, and there isn't a very clear understanding of how exactly to measure or set those goals.


TM: In terms of the experience that you've had building Datafold, working with your end users, and talking to people in the industry, what are some of the initial ideas or assumptions that you had, about how data quality is managed, the sources of issues, you know, the organizational aspects of it, that have been challenged or changed as you worked through this overall problem space and built the tooling and technologies to help support teams who are trying to improve the visibility and quality of their data?

What are some surprises from the data quality management space?

GM: Yeah. So I think one of the interesting revelations that we had after going to market with our solution: given our experience working at large companies with large data teams, my initial assumption was that what we're building, tooling for reliable change management, testing automation, and observability, would be most useful and most sought after by really large companies with complex data ecosystems and large data teams.


And what we realized is that, while the overall impact of bad data quality is probably indeed larger at those companies, these issues are felt by increasingly younger companies. So we've had customers as small as a one-person data team at a post-seed-stage startup that was already starting to feel the data quality issues.


So overall, I think that the challenges of maintaining data reliability and quality have shifted from large companies, you know, upstream, earlier in the company lifecycle. That was one of the realizations.


I think the second one was that even maybe three or five years ago, data teams, or individual data engineers, used to have much more flexibility in choosing their tools. And, you know, even back in my days of doing data engineering, there was a lot of freedom to, you know, go and try this tool or that tool, and kind of iterate fast on making choices there. And there was bottom-up adoption of data tools. Whereas these days, because companies have become increasingly protective of their data, given the sensitivity and the complexity of their ecosystems, the decisions about what the data stack is and what the approach and tooling for each step in the stack should be are increasingly centralized and made higher up in the organization.


So I think those are the two primary takeaways that we had, you know, going to market with Datafold.


TM: In terms of ways that you've seen Datafold deployed, what are some of the most interesting or unexpected or innovative ways that you've seen it used? 

Most interesting or innovative ways Datafold has been deployed

GM: Yeah. So initially we built Datafold to automate data quality testing and increase data observability. But one of the popular use cases that we've seen for our tooling, such as column-level lineage and Diff, is to accelerate migrations toward the more modern data stack, or just migrations across tools in general. For example, if you are migrating your ETL from, let's say, a legacy warehouse to a new warehouse, no matter what they are.


One of the most time-consuming parts of that process is validating the before and after state of the data, because ultimately your stakeholders don't want to deal with discrepancies, right? They want to be sure that what they are seeing, served from your new warehouse or your new ETL framework, is the same data that they used to see from your legacy system.


Or, if they're not seeing the same data, they want you to be able to fully explain those discrepancies. When we were doing a migration at Lyft, that was probably 80% of the time spent on the overall migration effort. And so what was interesting was to see Data Diff being adopted for those use cases, basically accelerating the migration through faster validation of the datasets transferred to the new warehouse or the new ETL framework.
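
One common shape for that validation (a sketch, not Datafold's method) is to compare cheap per-day aggregates of the legacy and new tables and only drill into the days that disagree; here, inlined pandas frames stand in for the two warehouses:

```python
import pandas as pd

def fingerprint(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Per-day counts and sums of numeric columns: cheap to compute anywhere."""
    numeric = list(df.select_dtypes("number").columns)
    return df.groupby(date_col)[numeric].agg(["count", "sum"])

# In practice, each frame would be read from its own warehouse.
legacy = pd.DataFrame({"day": ["2021-07-01"] * 3, "fare": [10.0, 20.0, 30.0]})
modern = pd.DataFrame({"day": ["2021-07-01"] * 3, "fare": [10.0, 20.0, 33.0]})

mismatch = fingerprint(legacy, "day").compare(fingerprint(modern, "day"))
# Non-empty output pinpoints exactly which days and columns need explaining.
print(mismatch if not mismatch.empty else "Tables agree on counts and sums")
```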


TM: And in your experience of building and growing the company, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?


GM: So, you know, being a data tool in 2021, with, like I said, the increasing focus on data protection and security, we had to pay a lot of attention to making sure that our solution is secure. And for a very large number of customers, even larger than we expected, that meant being able to deploy our solution on-premises, which, for a younger company with fewer engineers, brings lots of challenges, right?


Because we have to not only maintain one scalable SaaS solution, we also have to be able to quickly deploy the entire distributed application into customer environments, and do it securely, quickly, and in a way that allows us to maintain it with minimal overhead. So that was, I think, one of the hardest technical problems that we had to tackle.


TM: In terms of people who are looking at Datafold and thinking about how they're going to manage their data quality and try to be more proactive instead of reactive: what are the cases where Datafold is the wrong choice, and they might be better served with other frameworks, in-house tooling, or just organizational patterns?

When is Datafold the wrong choice?

GM: Yeah, I think that Datafold is built with the modern data stack philosophy. And it's also optimized to integrate seamlessly with modern warehouses, such as Redshift, BigQuery, and Snowflake, and modern data lake systems like Presto and Spark.


It is probably going to be an uphill battle to use a system like Datafold with a more legacy data stack based on, let's say, older systems like Hadoop and Hive, or even more proprietary data frameworks. And if your organization is in the process of either establishing your data stack from scratch, so you're still setting up the data warehouse and BI tools, the more fundamental blocks of the stack, or you're in the process of migrating from legacy systems to the modern data stack, it's probably too early for you to adopt Datafold, because in the hierarchy of needs, Datafold will not be able to solve your immediate challenges.


And I think the second group of use cases is companies that are not necessarily data-driven, where the importance they give to analytics is not as high. There, Datafold probably also won't, you know, be able to bring lots of value, because our value proposition is to help ensure data reliability and data quality. So if that's not the topmost priority, then we won't naturally be able to generate a lot of impact.


And finally, I think there has to be a mandate for change and improvement in your organization. If the status quo is that data is broken and everyone is fine with living in this painful world of broken data, without any plans or KPIs or OKRs to improve it, then, again, data quality solutions, whether Datafold or other tools, probably won't be able to help much.


So it's very important to have the right incentives and motivation within the organization to actually address those problems.


TM: And as you continue to build out Datafold and work in the space of data quality management, and try to stay up to date with all of the rapid shifts in the data ecosystem, what are some of the things that you have planned for the near to medium term?

Plans for Datafold in the near to medium term

GM: So for the near to medium term, we are going to focus on making Datafold even more interoperable with other parts of the modern data stack: integrating with the popular BI tools and increasing the integrations with popular ETL frameworks, such as Dagster and others, basically to be able to provide a more holistic picture of data quality, both for the change management process and for sort of in-production, autonomous data monitoring.


And if I were to zoom out and fast-forward to the future, to the more long-term plans for Datafold, what I would really want to happen is for us to be able to automate 80% of what the current data prep or analytics engineering workflow is today.


Because if you look at it, most of it is not a creative process. It's not writing code. It's actually dealing with simple but really painful questions: understanding your data, understanding the edge cases, understanding or fixing data quality issues. It's reading the code to understand dependencies.


And so by providing better observability, we can not only solve data quality, we can also accelerate the entire workflow of building data products. And ultimately, I think that we can go as far as not only helping teams ensure the quality of their data sets, but even helping them create high-quality data sets in the first place, because as a data observability tool, we are uniquely positioned to collect and process very valuable metadata.


That metadata basically gives us an understanding of how data is linked, how it's produced, how it's consumed, and what the semantic meaning of every single data point is, which puts us in a very strong position to build lots of useful tools to really accelerate the workflow.


TM: Are there any other aspects of the work that you're doing at Datafold or the overall space of data quality management and strategies for being proactive in preventing data quality issues that we didn't discuss yet that you'd like to cover before we close out the show? 

Other cool things to know about Datafold

GM: I'd like to say that, you know, as well-versed in the space as we are, we realize that data quality is a very young topic and a young space overall, both in terms of tools and even in terms of understanding what the approaches and solutions to these problems are. And so I think one of the key ways we as data practitioners can contribute to solving that and helping each other is through sharing knowledge.


And we at Datafold, and I personally, have been hosting the Data Quality Meetup, which is a quarterly online gathering for data practitioners to discuss the best practices, tools, and solutions for data quality management. And so we invite everyone to contribute with lightning talks: tell us about the ways in which you have tackled data quality problems in your organization, or the cool tools or frameworks that you've built or extended to help solve these problems. And also to just come and learn, and disseminate the knowledge within your organization.


TM: And if you don't already have it, it would probably be interesting to add data quality war stories, where you have a sequence of lightning talks about all the things that went wrong and the ways that you failed, because it's always fun hearing about some of the non-obvious ways that things can go wrong.


GM: Yes, absolutely. 


TM: Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

What is the biggest gap in data management tooling or technology?

GM: Part of me wants to, you know, talk about more data quality tooling and testing, but this is, I think, less interesting, because it's on our roadmap. We're going to build it, it's going to be great, and it's going to be very helpful.


So instead, here is something I don't think we're going to build, but that I think probably needs to be built. It doesn't make sense that building fundamental data sets like star schemas takes so much time and effort, basically just to piece together raw data into slightly more usable representations of business entities. I think this process is ripe for more automation, which should come from a really deep understanding of how the data works, maybe from semantic or graph technologies that would help connect, you know, the dozens and hundreds of disparate data sources, events, OLTP sources, and third-party vendors into a more cohesive view of the data.


And we've sort of scratched at this area with customer data platforms, right? They kind of give you a unified view of the customer. But the pitfall those tools fell into, I think, was focusing too much on marketing and using this data for marketing automation. Whereas I think similar approaches to unifying the data views can be used across your entire data stack to build star schemas, to build machine learning feature sets, and ultimately to make building data products easier.


So to whoever could make sense of my fairly high-level desire or proposal, if you think that'd be exciting to build, reach out to me, I'd love to brainstorm and discuss it. 


TM: Yeah. That's definitely an interesting proposition, and one that I can wholeheartedly agree with: there's a lot of time and effort that goes into data modeling that could potentially be automated, particularly with the progress that we've made with semantic graph technologies and being able to do entity extraction and entity resolution.


So it's definitely an interesting thing to think about. And definitely, if anybody's working on that, reach out to me too, I'd love to talk about it. Awesome. So thank you again for taking the time today to join me and share the work that you've been doing at Datafold and your insights and experience on how to be more proactive about data quality management.


It's definitely a very interesting and relevant and necessary space. So I appreciate all of the time and effort you're putting into it, and I hope you enjoy the rest of your day. 


GM: Thank you so much Tobias for inviting me to the show.

Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.
