ETL pipeline testing: Validate, validate, validate
Data pipelines are borderline magical. They connect to dozens of data sources, understand complex table structures, seamlessly manage schema changes, and bring that data into your own centralized data store in real time. Some pipelines even do change data capture (CDC), capturing hard and soft deletes, row updates, and other changes, often without much computational burden.
What a time to be alive!
Pipelines are so magical, in fact, it’s tempting to just trust them. Scroll through a list of popular connectors and you might think, “Hey, this one does CDC. Great. One less thing I have to think about.” Or maybe you built the pipelines yourself and you’ve got a little bit of the IKEA effect going on.
But how do you really know a pipeline is legitimately trustworthy and doing everything you expect?
Pipelines may be great, but you need perfection
When it comes to data quality, you often need to aim for perfection. Getting a 94% is great for a college exam, but it’s not a good score for most data quality dimensions. Data completeness, accuracy, and consistency need to be as close to 100% as possible—especially if you work in an industry where the data can’t be wrong (we see you Finance). To get there, you need perfect pipelines.
And your ETL/ELT pipelines are doing a lot! They have to extract data from multiple data sources, transform that raw data into usable formats, and run data quality tests along the way. For many organizations, ELT pipelines are the foundation of their data work—whether the majority of business users know it or not.
For modern data teams, we typically see ETL/ELT pipelines take the form of:
- An extract process (the E) where raw data is extracted from a production backend database and cloud data sources (think: Facebook Ads, Salesforce, etc.)
- A loading process (the L) where that data is loaded into a target database
- A transformation process (the T), where raw data is modeled and tested for analytics and downstream use cases
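The three steps above can be sketched in a few lines of code. This is an illustrative toy, not a production pipeline: it uses SQLite in place of both the production backend and the analytics database, and the table and function names are invented for the example.

```python
import sqlite3

# Toy EL(T) pipeline. SQLite stands in for both the production source
# and the analytics target; in reality these would be, say, Postgres
# and Snowflake, with far more robust extraction logic.

def extract(source: sqlite3.Connection) -> list:
    """E: pull raw rows from the source database."""
    return source.execute("SELECT id, email, amount FROM orders").fetchall()

def load(target: sqlite3.Connection, rows: list) -> None:
    """L: land the raw rows in the target database."""
    target.execute("CREATE TABLE IF NOT EXISTS raw_orders (id INTEGER, email TEXT, amount REAL)")
    target.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

def transform(target: sqlite3.Connection) -> None:
    """T: model raw data into an analytics-ready table."""
    target.execute("""
        CREATE TABLE orders_by_customer AS
        SELECT email, SUM(amount) AS total_spend
        FROM raw_orders GROUP BY email
    """)

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, email TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "a@x.com", 10.0), (2, "b@x.com", 5.0), (3, "a@x.com", 2.5)])

target = sqlite3.connect(":memory:")
load(target, extract(source))
transform(target)
print(target.execute("SELECT * FROM orders_by_customer ORDER BY email").fetchall())
```

Each step is a seam where things can silently go wrong, which is exactly why the validation discussed below matters.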
There are plenty of commercial EL pipeline providers, like Fivetran and Airbyte, that do great work extracting and loading your data into your analytics database. They strive for and promise 100% completeness, accuracy, and consistency. But it’s up to you (the customer) to verify you’re actually getting 100%.
For transformation pipelines, we see many teams now adopting more modern, code-based tooling (like dbt), often paired with an orchestrator of some type (like dbt Cloud, Airflow, or Dagster).
Depending on their financial and people resources, technical ability, and business requirements, some data teams might prefer a SaaS-based stack where they acquire commercial tooling to help with their ELT pipeline needs. Other (typically larger, more resource-available organizations) will opt for more DIY data engineering pipelines where they might extend an existing open source solution to fit their organizational needs.
Whether you’re using a commercial solution or a DIY one, you need testing and validation to prove that you’re getting the full 100% from your EL pipelines.
ETL testing is a two-part process
Getting to 100% with your ETL pipeline requires two primary actions: (1) validating your EL pipelines, and (2) data quality testing to confirm your transformations are working as expected.
Part one: Validating your ETL data replication with data diffs
Let’s say your engineering team is running a Postgres shop. They get a request from a new analytics team that needs a copy of the Postgres data in their new Snowflake instance. So, you write the scripts to do the data copy (high five!) and run some spot checks comparing tables between Postgres and Snowflake. Everything looks good, so you set up a new ETL job to copy data into Snowflake on a regular basis. Can you walk away at that point? Are you done?
Well, you could, but that’s not the path to perfection. You need to do a thorough validation of the copied data. Is it 99.3% the same or is it 100% the same? And how do you do this as your data scales and changes over time? There’s only one way to tell: data diffs.
Running a data diff on your Postgres and Snowflake datasets will show you whether you’re completely, accurately, and consistently replicating data. A data diff will tell you exactly how two tables across any database differ, allowing you to quickly identify if your replication (EL) pipelines have gone awry somewhere.
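To make the idea concrete, here is a naive data diff: compare two tables keyed by primary key and report rows that are missing, extra, or changed. Real diffing tools (including Datafold’s) use checksums and smarter search strategies to avoid scanning every row across databases; this sketch, which uses SQLite and assumes the first column is the primary key, just illustrates what a diff reports.

```python
import sqlite3

def diff_tables(conn, source_table, target_table):
    # Naive diff: load both tables keyed by the first column (the primary
    # key in this example) and compare. Fine for a sketch; real tools
    # checksum segments of the tables instead of pulling every row.
    src = {row[0]: row[1:] for row in conn.execute(f"SELECT * FROM {source_table}")}
    tgt = {row[0]: row[1:] for row in conn.execute(f"SELECT * FROM {target_table}")}
    missing = sorted(set(src) - set(tgt))   # in source, never replicated
    extra = sorted(set(tgt) - set(src))     # in target, e.g. missed hard deletes
    changed = sorted(k for k in set(src) & set(tgt) if src[k] != tgt[k])
    return {"missing": missing, "extra": extra, "changed": changed}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, email TEXT)")
conn.execute("CREATE TABLE tgt (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", [(1, "a@x.com"), (2, "b@x.com"), (3, "c@x.com")])
conn.executemany("INSERT INTO tgt VALUES (?, ?)", [(1, "a@x.com"), (2, "B@x.com"), (4, "d@x.com")])
print(diff_tables(conn, "src", "tgt"))
# → {'missing': [3], 'extra': [4], 'changed': [2]}
```

Even this toy version surfaces the three failure modes a replication pipeline can hit: rows that never arrived, rows that should have been deleted, and rows whose values drifted.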
Our Datafold Cloud service makes it easy to plug right in and get quantifiable results within minutes. Datafold Cloud supports a REST API and scheduling, so you can check the quality status of your replication pipelines on a regular basis.
Validate your business-critical data (unless you like pain)
The key thing about data replication is that it often involves your most critical data. You might be using it for business continuity, analytics, data science models, or regulatory compliance. Whatever the case may be, it’s important that you validate, validate, validate.
The default position of most companies is to trust the system without validating every replication. It’s not that validation is particularly complex or cumbersome; it’s just not done. Spot checks are thought to be good enough—and they are, until something goes wrong. Once you discover an issue in production, you have to pinpoint the issue, resolve it, and then determine whether you’ve got gaps in your data quality or data observability. Quite a hassle.
If you don’t know how your replication pipelines are performing—if they’re even working properly—how can your business trust the data? How can they rely on it if a US-based server goes down and they need to rely on the replicated data in a Germany-based database? What happens to machine learning models that ingest incorrectly replicated data into their learning process?
Save yourself the pain and misfortune by integrating a tool like Datafold Cloud into your EL pipelines to ensure they’re working as expected.
Part two: Validating your ETL/ELT transformations with continuous integration
With your diffs complete and your data replicated at 100% accuracy, you need to validate your transformations. You must be sure you’re transforming raw data into accurate, high quality data. If it isn’t accurate and high quality, it’s not trustworthy or usable.
The best way to do this is by implementing data quality tests in your continuous integration process. Not only does this guarantee that your transformation code is tested with every PR, it also establishes a consistent standard for how your team tests its data. Gone are the days of expecting your data engineers to test for every edge case, and in are the days of automated and governable data testing.
For example, you can use Datafold Cloud to validate that your transformation pipelines are outputting the data you expect (and catch the unexpected data changes that may result from a code change). This is especially useful if you have a continuous integration workflow in place to validate every code change.
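What might such CI data quality tests look like? Here is a minimal sketch: small assertion-style checks (not-null, uniqueness) that run against a model’s output and fail the build if they don’t hold. In practice these would typically be dbt tests or a data diff wired into CI; the table and column names here are invented for illustration, and SQLite stands in for your warehouse.

```python
import sqlite3

# Minimal data quality checks of the kind a CI job might run against a
# transformed table. A failing assertion fails the CI build.

def check_not_null(conn, table, column):
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]
    assert nulls == 0, f"{table}.{column} has {nulls} NULL values"

def check_unique(conn, table, column):
    dupes = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {column} FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    assert dupes == 0, f"{table}.{column} has {dupes} duplicated values"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customers (customer_id INTEGER, email TEXT)")
conn.executemany("INSERT INTO dim_customers VALUES (?, ?)",
                 [(1, "a@x.com"), (2, "b@x.com")])

check_not_null(conn, "dim_customers", "customer_id")
check_unique(conn, "dim_customers", "customer_id")
print("all data quality checks passed")
```

The point is not these particular checks, but that they run automatically on every PR rather than relying on someone remembering to spot-check.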
ETL vs. ELT pipelines: Does validation differ?
Data pipeline providers sometimes make a big deal about “where the T is.” ETL pipelines perform transformations before data is loaded into a destination, consuming compute resources along the way. ELT pipelines perform transformations after data is loaded into a destination, consuming compute resources on an as-needed basis by leveraging the elastic capacity of cloud infrastructure.
With advancements in cloud data warehouses, EL tooling (like Fivetran and Airbyte), and transformation technologies like dbt, we’ve seen broad adoption of and movement toward an ELT mindset, which is one of the reasons we’ve used ETL/ELT so interchangeably throughout this post.
However, there’s not a big difference in how you validate the pipelines. Whether you’re building more traditional ETL pipelines or more modern ELT pipelines that leverage cloud infrastructure, the philosophy is the same: validate your replication and validate your transformations.
Trust, but verify
Just because today’s data pipelines are borderline magical doesn’t mean you should trust them. You can’t rely on their perceived efficiency, fancy marketing, or even their SLAs. Only you can prevent data quality fires.
No matter how sophisticated the pipeline may be, you need to verify it’s doing everything you need for the eight dimensions of data quality. Achieving data excellence requires a vigilant and proactive mindset, which involves:
- Regular Audits and Validation: Conducting validations of your data pipelines ensures that they are performing as expected. This goes beyond trusting the system; it's about consistently proving its reliability.
- Continuous Integration and Quality Testing: Implementing continuous integration with robust data quality testing is key. This ensures that every piece of data transformed and transported by your pipelines maintains the highest quality standards.
- Adaptability to Changes: Data sources, structures, and requirements are constantly evolving. Your pipelines must not only handle these changes efficiently but also be adaptable to future transformations.
- Collaboration and Transparency: Encourage a culture where data engineers, analysts, and business users collaborate. Transparency in how data is managed and validated builds trust and enables more informed decision-making. Consider implementing data contracts between your engineering and data teams to prevent broken replication pipelines.
- Leveraging Tools for Accuracy and Efficiency: Tools like Datafold Cloud offer critical support in achieving these goals. They provide the necessary insights and automations to maintain and improve the data pipeline's performance.
Harnessing the full potential of your data pipelines is not just about building or selecting the right tools. It's about creating a comprehensive strategy that emphasizes continual validation, quality assurance, and collaboration.
If you want to get started with testing your data pipelines—from replication to business use—check out the free trial of Datafold Cloud today!
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.