Data observability tools vs data quality tools
Nobody likes getting “stuff is breaking in production” alerts. First of all, you have to figure out what’s broken and why. Second, you have to identify the potential fix and implement it. Third, you have to restrain yourself from rage-quitting when you find out the issue was completely preventable.
Data teams waste a lot of time putting out fires that were preceded by three mini-fires they had no visibility into. They had no tests, no standard for their data quality tests, and no data diffing. No wonder they didn’t catch the problem before it hit prod.
Without continuous integration or a standard set of practices, there’s no guarantee that two engineers will test data the same way. Did they check that primary keys are the same between dev and prod? Did they check for statistical outliers? Were they expected to do so in the first place? Probably not—at least not from one pull request to another.
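To make those two questions concrete, here is a minimal sketch of what a shared, repeatable check could look like. It assumes rows are available as plain Python dictionaries and that the primary key column is called `id`. The function names, the column name, and the z-score threshold are illustrative assumptions, not part of any particular tool.

```python
from statistics import mean, stdev


def key_mismatches(dev_rows, prod_rows, key="id"):
    """Return primary keys that exist in only one environment."""
    dev_keys = {row[key] for row in dev_rows}
    prod_keys = {row[key] for row in prod_rows}
    return {
        "only_in_dev": sorted(dev_keys - prod_keys),
        "only_in_prod": sorted(prod_keys - dev_keys),
    }


def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]
```

If every pull request runs the same two functions, the answer to “did they check?” stops depending on who wrote the change.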
When data quality is low, more problems rear their heads in production. Yet it’s not always a “garbage in, garbage out” problem, which is why it’s critical to understand the difference between data quality and data observability. What are they, what’s the difference, and is it possible to end the 2AM wake-up calls?
Understanding data quality vs. data observability at a high level
Without solid definitions of data quality and data observability, the line between them can appear fuzzy. It isn’t. The two approaches overlap in features and methods, but the intentions behind them are quite different.
Data quality starts in pre-production and focuses on the data itself. Some may argue that quality starts at the data source, but you usually have no control over whether Facebook changes the name of a column or LinkedIn changes a calculation in their CPM column. With the right tools, you can maintain high-quality data even when your data providers don’t.
Pre-production environments are where you manage the eight dimensions of data quality with techniques like replication testing, automated regression testing, data lineage visualizations, and data diffing. Unless you want to keep finding issues the hard way, you need to test your data before something bad or incorrect gets deployed to production.
Data quality is really about testing and validation of the eight dimensions. When you change your data, you need to assess whether those changes accurately reflect the requirements for the change and the expected outcomes. For example, are the results in pre-production what you’re expecting in production? If so, you’re good to deploy—right?
Well, not always, which is why we need to talk about data observability.
Data observability is focused on the overarching health and status of your data and pipelines, often in your production environments. Think: monitoring and alerting of anomalies. You wouldn’t typically use observability methods to manage the eight dimensions of data quality. Rather, you’re more concerned with knowing whether something is awry as soon as it can be detected.
Finding the “garbage out” with data observability tools
Data observability tools often alert you to problematic activity. They might notify you about something that needs an immediate fix. Or they might alert you to look at your production environment or pipelines.
Observability tools don’t typically tell you the specifics of the problem unless you’ve specified what to look for. It’s usually just a “Hey, your ARR has jumped 200% today.” Out of the box, they can point out blank fields, duplicate data, and other more obvious issues. But they often won’t tell you why your revenue suddenly spiked overnight.
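A toy version of that kind of monitor shows why the alert is so thin on detail. This is a sketch under stated assumptions, not how any specific observability product works: the function name, the trailing-average baseline, and the 50% threshold are all hypothetical.

```python
def percent_change_alert(history, today, max_change_pct=50.0):
    """Return an alert message if today's value strays too far from the trailing average."""
    if not history:
        return None  # no baseline to compare against
    baseline = sum(history) / len(history)
    if baseline == 0:
        return None  # percent change is undefined against a zero baseline
    change_pct = (today - baseline) / baseline * 100
    if abs(change_pct) > max_change_pct:
        return f"ARR changed {change_pct:+.0f}% versus trailing average"
    return None


# Three quiet days followed by a spike.
print(percent_change_alert([100, 100, 100], 300))
```

The monitor only sees a number and a baseline. Nothing in it knows whether the jump came from a great sales day, a changed calculation, or a typo upstream.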
By nature, data observability tools require considerable domain context. If you see an annual recurring revenue (ARR) value spiked for some reason, there’s probably only a handful of people that can look at that number and say it’s correct, normal, or expected. Was there a really great sale day? Did a dbt code merge change the calculation of ARR? Did someone input something wrong into Salesforce?
The people most familiar with the underlying data are also likely to have written the change. The rest of the team won’t have the transparency or confidence to solve that issue. Analytics engineers try to understand the data as much as possible, but when it comes to domain-specific knowledge (e.g. how the finance team interprets detailed numbers), you really have to rely on the subject matter expert to get to the root cause of a data quality issue found by an observability tool.
Where there’s smoke, there’s (probably) a data fire
Data observability tools can look for smoke in production. Data quality tools can prevent the flammable stuff from getting to prod in the first place.
You don’t want to run a data ecosystem without data quality and observability tools. You need both, but it usually comes down to two questions:
- How much tooling do we need?
- How much tooling can we afford?
The answers are different for every team and every organization. Depending on where you are with your data stack’s technical debt and the level of staff expertise, you might need much more than you can afford. In those situations, you end up doing much more firefighting, burning out, and wishing you could “accidentally” double your salary in the HR database without anyone noticing.
Data observability finds the issues too late
You can’t be proactive if prod’s already on fire. Sure, you can mitigate issues with some proactive protections, but let’s be honest: that’s just damage control. By the time you get an alert, bad (or anomalous) data has already reared its head in production. Now you have to decide whether to roll back or write a patch. Meanwhile, your stakeholders are breathing down your neck.
Data observability is a type of reactive protection. It’s great that it can discover when things go sideways, but it won’t do much for your data quality.
Which leads to the question: when do you want to find your problems? Would you rather wait until after they’re in prod, when a stakeholder can find them? Do you enjoy the “drop everything and fix this” style of working?
Or, do you want to fix your data quality issues before they reach production? (Hint: we think this is the way.)
Data observability tools can’t catch all your data quality issues
Data quality issues can happen whether you’ve got the world’s greatest observability tools or not. You really need to employ data quality tools (i.e. tests and context awareness) everywhere there could be a problem:
- At the data source
- Before and after transformations
- In pre-prod and prod
- In your codebase
Putting data observability tools in the same places won’t guarantee you’ll catch the issues. However, data quality tools will tell you:
- “This table has 48% fewer rows than normal”
- “These data tests didn’t pass”
- “The primary and foreign keys are no longer correct”
- “You should notify the engineering team”
Data observability tools won’t tell you any of that until the problem has already happened.
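The messages in that list map to fairly simple checks. Here is a minimal sketch of three of them, assuming rows are plain dictionaries. The function names and the `id` and `customer_id` columns are hypothetical and chosen only for illustration.

```python
from collections import Counter


def row_count_drop_pct(current_rows, typical_rows):
    """Percentage by which the current row count falls below the typical count."""
    if typical_rows <= 0:
        raise ValueError("typical_rows must be positive")
    return (typical_rows - current_rows) * 100 / typical_rows


def duplicate_primary_keys(rows, pk="id"):
    """Primary key values that appear more than once."""
    counts = Counter(row[pk] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def orphaned_foreign_keys(child_rows, parent_rows, fk="customer_id", pk="id"):
    """Foreign key values in the child table with no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows if row[fk] not in parent_keys})
```

Run against a pre-production build, a non-empty result from any of these is a reason to stop before deploying rather than an alert to chase afterwards.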
Making matters more complicated, most observability tools have lost their focus by bolting less-than-useful features and products onto their platforms. The result is alert fatigue, driven by false positives and a low signal-to-noise ratio.
Data observability tools cause alert fatigue
Too much observability and not enough quality management are the main ingredients for alert fatigue. One of the biggest complaints we hear about observability tools is that they send too many alerts, which turns into a game of whack-a-mole as you try to determine whether there’s really an issue in production.
Signal-to-noise ratio may be the most difficult part of observability. This is why data quality management is so crucial. High-quality data solves the “garbage in” problem, and effective, well-tuned observability helps you manage (and ultimately fix) the “garbage out” problem.
Poorly configured observability creates a trust problem of its own. No one trusts the data when they receive too many false alerts.
Data quality or observability—why not both?
Some data products out there have gotten distracted. Nowadays, your observability tool comes with a data catalog and lineage capabilities, but very few products help you prevent data quality issues from entering your production pipelines to begin with.
Datafold Cloud provides full transparency into how your data is changing and gives you more context before code is merged in. Whereas an observability tool will catch a value going from 0 to 1M, Datafold will give you visibility into the granular details of your data through data diffing. We’ll tell you exactly which rows, columns, and tables are affected by a change, whether you’re comparing pre-prod to prod or performing a migration to another data platform.
You don’t need to write tests for Datafold, either. When you open up a dbt or transformation code change, Datafold will automatically run a data diff on the dev and production versions of your data. If there’s a difference, Datafold will identify the exact rows that are different, and your team can decide if these changes are expected (and accepted) or unexpected (and likely problematic).
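To illustrate the general idea of a row-level data diff, here is a simplified sketch. It is not Datafold’s implementation. It assumes both versions of a table fit in memory as lists of dictionaries and share a primary key column named `id`, and the function name is hypothetical.

```python
def diff_rows(prod_rows, dev_rows, key="id"):
    """Compare two versions of a table and report row-level differences."""
    prod = {row[key]: row for row in prod_rows}
    dev = {row[key]: row for row in dev_rows}

    added = sorted(dev.keys() - prod.keys())    # rows only in the dev version
    removed = sorted(prod.keys() - dev.keys())  # rows only in the prod version

    changed = {}
    for k in sorted(prod.keys() & dev.keys()):
        # Columns whose values differ between the two versions of this row.
        cols = sorted(
            c for c in prod[k].keys() | dev[k].keys()
            if prod[k].get(c) != dev[k].get(c)
        )
        if cols:
            changed[k] = cols

    return {"added": added, "removed": removed, "changed": changed}
```

The output names the exact keys and columns that differ, which is what lets a reviewer decide whether a change is expected or a problem.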
Most data observability tools will find large data discrepancies once they’ve entered your CFO’s dashboard. Datafold will catch that before you’ve merged any code into your master/main branch. There could be a million reasons why a large data discrepancy occurs in your ARR—and Datafold will help show you exactly how that data will change before it actually does.
This is not to say that data quality and data observability tools can’t work together. General alerting and monitoring at the source level, for example, can be incredibly helpful for catching data quality issues that originate there, such as a backend engineering bug or bad source data.
Do you want to be reactive or proactive?
Somewhere in the world right now, some poor soul is getting alerted about a data problem that could have been avoided in pre-production. Maybe that person was you just recently. And maybe you also discovered that the problem was avoidable, or maybe it was yet another false alarm. Do you really want to live that out again, losing precious time reactively hunting for root causes?
Or would you rather spend that time on newer data work?
We think life would be so much easier if software developers and data engineers could see exactly where problems originate in their code and their data. A data quality tool like Datafold can help prevent lost time. Instead of being a firefighter, you can be an architect and a builder, just like our friends at Trainual.
Datafold is designed to help you proactively avoid problems before they pop up in production. We show you how each code change is going to affect your data in a purposefully unopinionated set of results. We give you row-level change awareness, not just summaries and trends. We enable your team to find data quality issues before they actually happen.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.