Over the past 10 years, we've seen great advances in technologies and tools for analytics and machine learning: today's analytics stack offers fast and scalable data warehouses, dirt-cheap data storage, capable ETL orchestrators, and powerful BI tools.
These innovations have made it extremely easy to collect, store, process, and visualize data as well as deploy ML models. Still, at many organizations, the data ecosystem has grown so complex and large in both volume and variety of datasets that it has become extremely difficult to ensure the quality and reliability of data products and infrastructure. This complexity, combined with the increased reliance on data applications and accessibility in our current remote-first world, has made 2020 a remarkable year for starting a company in the Data Tools space.
Data quality as a term has almost reached the buzzword status of #bigdata. And although there are still more questions than answers, I am excited to share Datafold’s current vision based on our experience building data platforms and tools at Autodesk, Lyft, and Phantom Auto, as well as from working with dozens of great data teams this year including Thumbtack and Patreon.
Data performance monitoring is the new necessity
It's impossible to imagine a modern software team building apps without relying on Application Performance Monitoring systems to provide visibility into metrics, logs, and error traces.
Similar observability challenges exist in data:
- Has the dataset been updated? If not, why?
- Are there any anomalies in the dataset?
- How is a given column/metric derived?
- What are the highest-priority issues in the data ecosystem at the moment?
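The first two questions can be made concrete with a small sketch. This is an illustration only, not Datafold's implementation: the function names `is_stale` and `is_anomalous`, the 24-hour freshness window, and the z-score threshold are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean, stdev


def is_stale(last_updated, max_age, now=None):
    """Return True if the dataset was last updated longer ago than max_age.

    Hypothetical helper: `last_updated` is assumed to be a timezone-aware
    datetime taken from the warehouse's table metadata.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_updated > max_age


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it is more than z_threshold sample standard
    deviations away from the mean of `history` (a simple z-score check).

    `history` is assumed to hold past values of one metric, such as
    daily row counts.
    """
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # any deviation from a constant series is flagged
    return abs(latest - mu) / sigma > z_threshold


# Example: a table last loaded two days ago, checked against a 24-hour window
loaded_at = datetime(2020, 1, 1, tzinfo=timezone.utc)
checked_at = datetime(2020, 1, 3, tzinfo=timezone.utc)
stale = is_stale(loaded_at, timedelta(hours=24), now=checked_at)

# Example: daily row counts, followed by a sudden spike
row_counts = [100, 102, 98, 101, 99]
spike = is_anomalous(row_counts, 500)
```

A fixed z-score threshold is the simplest possible choice; a real monitor would likely need to account for seasonality and trends in the metric.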
Today, Datafold users can observe profiles and maps of their data (column-level lineage graph) to track how data flows and changes across the pipelines and transformations. We envision a system that not only shows the state of data but also performs root cause analysis of anomalies and incidents in the datasets.
Ensuring data quality starts with improving existing workflows
No matter how sophisticated and detailed the monitoring is, it has no value if it's not embedded in analysts' workflows. We started with the workflow that data teams spend most of their time in and that frequently causes problems for the business: change management. One of the chronic problems of change management in data pipelines has been the lack of testing. Manual QA, which often entails spending hours (or in some cases days) writing SQL queries to validate a change, is too inefficient and expensive, and tooling for automated testing is too rudimentary.
We built our first tool, Data Diff, to help data developers quickly verify the changes being introduced to the data right in their GitHub/GitLab workflow, allowing them to catch data incidents before they reach production, when they are easiest to fix. The time savings and the gained confidence compound: data teams increase their velocity, and other users of analytics, previously locked out of the data development workflow by the fear of breaking things, get an opportunity to contribute to the code base, which has a multiplicative effect on the organization as a whole.
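To illustrate the general idea of diffing data between two versions of a table, here is a minimal sketch. It is not how Data Diff itself works: the function `diff_rows`, the in-memory list-of-dicts representation, and the sample `prod`/`dev` rows are all assumptions made for the example, and a real tool would run the comparison inside the warehouse rather than in Python.

```python
def diff_rows(before, after, key):
    """Compare two versions of a table, each given as a list of dict rows.

    Hypothetical helper: `key` names a column assumed to be a unique
    primary key in both versions.
    Returns the keys that were added, the keys that were removed, and,
    for keys present in both, the columns whose values differ.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())

    changed = {}
    for k in before_by_key.keys() & after_by_key.keys():
        old, new = before_by_key[k], after_by_key[k]
        # Compare over the union of columns so schema changes also show up
        cols = sorted(c for c in old.keys() | new.keys() if old.get(c) != new.get(c))
        if cols:
            changed[k] = cols

    return {"added": added, "removed": removed, "changed": changed}


# Example: the production table vs. the table built from a proposed change
prod = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
dev = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},
    {"id": 4, "amount": 40},
]
report = diff_rows(prod, dev, key="id")
```

Surfacing a summary like `report` in a pull request is what lets a reviewer see the effect of a code change on the data before it is merged.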
Interoperability is key
Modern data stacks are becoming more modular, and tools increasingly specialized. By closely integrating and exchanging data with the other systems that form the core data stack (data warehouses, ETL orchestrators, BI tools, etc.), we give our users full visibility into their data ecosystem. We have a number of exciting integrations coming out in the next few months.
We are in this together
Since the very beginning, we have been actively engaged with the data community by hosting Data Quality meetups and supporting open-source projects and independent thought leaders. We will continue to rely on the wisdom and power of the community and to look for opportunities to contribute back.
The importance of data quality and observability has been recognized at the highest level by most large tech companies. Airbnb and Uber recently wrote publicly about their massive investments in data quality tools. At the same time, there are thousands more teams that leverage data to grow their business using workflows fundamentally similar to those of the tech giants. Those teams suffer from the same data quality issues but cannot invest millions in building in-house data quality tooling.
We strive to equip every team that uses data to make human or machine decisions with data observability tools that help it move faster and with higher confidence.
We are humbled by the support of NEA alongside other great investors on our journey, and are looking forward to building the future of data engineering together!