The State of Data Quality in 2021
We ran a survey of data producers (data/analytics engineers) and consumers (product managers, data scientists, analysts, and other roles) to explore trends in tools, pain points, and goals. The survey had 231 respondents, mostly in SaaS, Finance, Consumer Internet, AI, Cloud, and Retail, and mostly from mid-to-large sized companies.
This report outlines the results and what we learned.
Data quality is an important issue for most data teams. However, most teams lack the data observability tools and processes they need to succeed.
The four key learnings from this survey
• Data quality & reliability is vital for data teams
• Too much manual work is the #1 reason for low productivity of data teams
• SQL is still the #1 interface to data and doesn’t seem to be going anywhere
• BI remains unchanged... for now
We only considered responses from people directly working with data (as a producer or a consumer). 68% of respondents belong to a data team, while others interact with data in another role.
Data quality & reliability is vital for data teams
Data quality and reliability are top KPIs for data teams, followed by improving data accessibility, collaboration, and documentation. Data quality is often subject to issues outside of the team’s control or even awareness, yet it is a core metric that determines the team’s success.
What’s top of mind for data teams?
We asked respondents what goals and KPIs they are working toward.
This is a stark shift from the general sentiment of even 2-3 years ago, when inadequate infrastructure, slow query speeds, and the challenges of data integration (collecting all data into a warehouse) seemed to occupy the minds of data professionals.
Data quality issues happen frequently
More than 80% of respondents said that they regularly run into data quality issues.
Most data quality issues originate outside of the team’s scope
What’s interesting is that, according to the surveyed data teams, 75% of data quality issues fall within the responsibility of other teams and 3rd-party vendors. Furthermore, 20% of respondents don’t have any visibility into where the issues originate! That reinforces the idea that data quality cannot be owned by any single team and needs to be addressed on a company level (in the same way security is) and requires close collaboration between teams.
51% of respondents indicated that they don’t have adequate processes and tools to address data quality issues.
Data users mostly rely on manual data quality checks to validate the data
There are three important conclusions to draw from how data teams go about validating their data:
- Almost no one (< 10%) takes data quality for granted
- Most teams still rely on manual data checks or asking others before using the data for their work.
- Automated tests and data catalogs are currently used by ~30% and 20% of teams respectively as a source of truth for data quality.
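To make the contrast with manual checks concrete, here is a minimal sketch of what automated data quality checks can look like. The table and column names are hypothetical, and an in-memory SQLite database stands in for a real warehouse; in practice such checks would be wired into a scheduler or CI rather than run ad hoc.

```python
import sqlite3

# Hypothetical example data; a production setup would connect to the
# warehouse instead of an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, created_at TEXT);
    INSERT INTO orders VALUES (1, 19.99, '2021-06-01'), (2, 5.00, '2021-06-02');
""")

def check_not_null(conn, table, column):
    """Fail if any row has a NULL in the given column."""
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    assert nulls == 0, f"{table}.{column} has {nulls} NULL values"

def check_min_rows(conn, table, minimum):
    """Fail if the table has fewer rows than expected (e.g. a broken load)."""
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    assert rows >= minimum, f"{table} has only {rows} rows (expected >= {minimum})"

check_not_null(conn, "orders", "amount")
check_min_rows(conn, "orders", 1)
print("all checks passed")
```

Even a handful of checks like these, run on a schedule, replaces the "ask a teammate before using the data" ritual with something repeatable.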
The majority of teams have yet to adopt data quality tools
Too much manual work is the #1 reason for low productivity of data teams
It is followed by inefficient collaboration (“too many meetings” and “organizational issues”) and poor data quality.
Considering that data teams identify data quality as their top KPI while lacking the tools and processes to manage it, it is not surprising that they are haunted by manual work: routine tasks such as testing changes to ETL code or tracing data dependencies can take days without proper automation.
Data Stack Review
Data quality aside, it’s always interesting to explore the trends in modern data tool adoption. Which leads us to our next two findings!
SQL is still the #1 interface to data and doesn’t seem to be going anywhere
When it comes to Querying and ETL languages, SQL and Python are by far the most popular, followed by R and Scala.
Star and Snowflake schema models are the most popular data models for building data warehouses
With the rapid adoption of infinitely scalable cloud data warehouses such as BigQuery and Snowflake, which offer great UX and relatively cheap storage and compute, we were curious how that affects teams’ choice of data modeling patterns.
Interestingly enough, the once-dominant star/snowflake (Kimball-style) modeling, while still #1, is the approach of choice for only ~35% of teams. At the same time, alternative approaches such as Data Vault and the single event table, aka Activity Schema, are rapidly gaining popularity, largely due to their simplicity and agility.
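The appeal of the single-event-table pattern is easy to see in miniature. The sketch below uses an in-memory SQLite database and illustrative table and column names: every customer activity lands in one long, narrow table, and many questions that a star schema would answer with fact/dimension joins reduce to filtering on the `activity` column.

```python
import sqlite3

# Illustrative "activity schema": one event table instead of a star of
# facts and dimensions. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity_stream (
        ts       TEXT,   -- when the activity happened
        customer TEXT,   -- who did it
        activity TEXT,   -- what happened ('signed_up', 'placed_order', ...)
        revenue  REAL    -- optional numeric payload
    );
    INSERT INTO activity_stream VALUES
        ('2021-06-01', 'a', 'signed_up',    NULL),
        ('2021-06-02', 'a', 'placed_order', 25.0),
        ('2021-06-03', 'b', 'signed_up',    NULL),
        ('2021-06-04', 'a', 'placed_order', 40.0);
""")

# One table answers many questions by filtering on `activity`:
revenue_per_customer = conn.execute("""
    SELECT customer, SUM(revenue)
    FROM activity_stream
    WHERE activity = 'placed_order'
    GROUP BY customer
""").fetchall()
print(revenue_per_customer)  # [('a', 65.0)]
```

The trade-off is that wide, denormalized analyses require more self-joins against the same table, which is part of why star schemas still lead.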
BI remains unchanged... for now
No element of the data stack causes more internal strife than BI tools. So let’s see what the most popular tools are as of 2021. We’ll start straight with the winner which is, unsurprisingly, Tableau. 😏
The biggest surprise, however, is the second-most-popular BI tool, which is… Google Spreadsheets! Wait, weren’t BI tools supposed to replace reporting done in Excel?! Kind of, but apparently, data users still love spreadsheets, and Google Spreadsheets provides flexibility for modeling while offering great collaborative features and integrations with modern warehouses.
Segment & Snowplow dominate the analytical instrumentation but new players are catching up
Very few people these days use in-house infrastructure for event instrumentation and collection. The market is still dominated by Segment and Snowplow, with a few newer players such as Rudderstack (an open-source Segment alternative) and Freshpaint (which also offers no-code event capture) getting noticeable traction.
Internal tools outpace vendor tools for data integration
Interestingly, unlike event collection, data integration is still largely done in house.
After in-house tools, Fivetran and Stitch were the most popular.
Data warehousing and processing tools are an increasingly critical part of the modern analytics stack, and teams have a wide variety to choose from: more than 70% of respondents say their company uses PostgreSQL, Redshift, or Snowflake.
Big Data is no longer hype or even a buzzword - it's sheer reality. And yet, the most popular data warehouse appears to be good old PostgreSQL! Note that this question allowed for multiple answers, and for most teams Postgres is likely either a legacy warehouse while they are migrating to a column-oriented one or a last-mile “serving layer” that they use to power lightweight dashboard queries.
What’s also interesting is that the “big three” data warehousing technologies Redshift, Snowflake, and BigQuery are coming close in terms of their popularity, followed by Spark. We feel sorry for the 17% of folks who still have to use Hive 😔
So with most people using some sort of a modern cloud warehouse, how do they feel about its performance?
Respondents are evenly split on whether a slow data warehouse negatively affects their productivity.
Again, that is a considerable improvement over the sentiment of 3-4 years ago. Modern data warehousing tech navigates a few tricky trade-offs, and over-optimizing for speed means sacrificing scalability, among other things.
In-house tools are still most common for orchestrating data transformations
The queen of ETL, Airflow, is still popular but is increasingly challenged by both proprietary tools (Glue) and rapidly growing dbt.
Will dbt replace Airflow as #1? Unlikely, since dbt is very SQL-centric, and although SQL (as this survey reinforces) is still by far the most popular language for data transformations, most teams will also want to run non-SQL jobs for ML, data integration, etc. So most likely Airflow will eventually be replaced by a combination of a new-generation general-purpose data orchestrator (e.g. Dagster) and a SQL framework (dbt).
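The "SQL framework plus general-purpose orchestrator" argument comes down to pipelines mixing SQL and non-SQL steps. The toy runner below illustrates the shape of such a pipeline; it is a hand-rolled sketch with made-up table names, not Airflow, Dagster, or dbt code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def extract():
    # Non-SQL step: in practice this might pull data from an API.
    conn.executescript("""
        CREATE TABLE raw_events (user TEXT, value REAL);
        INSERT INTO raw_events VALUES ('a', 1.0), ('a', 2.0), ('b', 5.0);
    """)

def transform():
    # SQL step: the kind of model a framework like dbt would own.
    conn.execute("""
        CREATE TABLE user_totals AS
        SELECT user, SUM(value) AS total FROM raw_events GROUP BY user
    """)

def score():
    # Another non-SQL step: stand-in "ML" scoring in plain Python.
    rows = conn.execute("SELECT user, total FROM user_totals").fetchall()
    return {user: round(total * 0.1, 2) for user, total in rows}

# Run the tasks in dependency order, as an orchestrator would.
extract()
transform()
scores = score()
print(scores)
```

A SQL-only framework covers the middle step nicely, but the first and last steps are why a general-purpose orchestrator stays in the picture.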
Satisfaction with data stack: 3.3/5
While the general sentiment toward the data stack is neutral-to-positive, the lack of exuberance suggests there is a lot of room for improvement.
So why don't people just change data stacks?
When asked about the barriers to improving the data stack, respondents most often named high switching costs and organizational pushback. That is especially true for data warehousing and BI technologies that are hardest to migrate from, so teams should choose wisely.
Let us know if these results surprise you. If you're curious about the best way to build your dream data stack so that you can boost your satisfaction, productivity, and reduce manual work, check out our dream stack post. Finally, if you're ready to change your status quo, request a demo below and learn how Datafold can improve your organization's data quality.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.