Data quality guide

Data quality is your moat, this is your guide


Fortify your data, fortify your business: Why high-quality data is your ultimate defense.

Enhancing data quality: Lessons from FINN, Petal, and Rocket Money

Published

May 28, 2024

Read how the data teams at FINN, Petal, and Rocket Money implemented "shift-left" and proactive data quality testing practices to safeguard their data, streamline their teamwork, and keep their stakeholders happy.

FINN: Incorporating data quality checks into data pipelines & ETL processes

FINN, a global car subscription platform, exemplifies how the strategic use of data quality metrics can help a company scale through explosive growth without sacrificing quality. Not every startup could have handled a 40x increase in ARR over three years with its operational integrity intact. FINN’s data team saw their dbt developer headcount increase from 3 to 30 in less than two years. With more than 1,800 dbt models in play, FINN faced enormous pressure to handle the resulting operational complexity and governance issues.

There are a few reasons why they succeeded, and all of them relate to expertly implementing a culture of proactive data quality:

They improved their workflows by shifting data testing to the left.
FINN implemented value-level data diffing during the deployment stage of their dbt project. Each new PR would trigger automated data diffs that provided developers with clear visuals and summary statistics on row-level comparisons between their dev and prod work. Testing earlier allowed them to confidently merge any code changes without fear of breaking dependencies or hitting critical production systems.
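To make the idea of a value-level diff concrete, here is a minimal sketch in plain Python. It is not FINN's or Datafold's implementation: the `diff_tables` function, the in-memory rows, and the column names are all hypothetical, and a real diff would run against warehouse tables rather than lists of dictionaries.

```python
from typing import Any

Row = dict[str, Any]


def diff_tables(prod: list[Row], dev: list[Row], key: str) -> dict[str, Any]:
    """Compare two versions of a table row by row, matched on a primary key."""
    prod_by_key = {row[key]: row for row in prod}
    dev_by_key = {row[key]: row for row in dev}

    # Keys that exist on only one side are added or removed rows.
    added = sorted(dev_by_key.keys() - prod_by_key.keys())
    removed = sorted(prod_by_key.keys() - dev_by_key.keys())

    # For keys present on both sides, compare every column value.
    changed_rows = []
    changes_per_column: dict[str, int] = {}
    for k in sorted(prod_by_key.keys() & dev_by_key.keys()):
        old, new = prod_by_key[k], dev_by_key[k]
        differing = [
            column
            for column in sorted(old.keys() | new.keys())
            if old.get(column) != new.get(column)
        ]
        if differing:
            changed_rows.append(k)
        for column in differing:
            changes_per_column[column] = changes_per_column.get(column, 0) + 1

    return {
        "added": added,
        "removed": removed,
        "changed": changed_rows,
        "changes_per_column": changes_per_column,
    }


# Hypothetical production and development versions of the same model.
prod_rows = [
    {"id": 1, "plan": "basic", "monthly_price": 399},
    {"id": 2, "plan": "plus", "monthly_price": 499},
    {"id": 3, "plan": "plus", "monthly_price": 499},
]
dev_rows = [
    {"id": 1, "plan": "basic", "monthly_price": 399},
    {"id": 2, "plan": "plus", "monthly_price": 549},
    {"id": 4, "plan": "basic", "monthly_price": 399},
]

summary = diff_tables(prod_rows, dev_rows, key="id")
print(summary)
```

The summary separates added, removed, and changed rows and counts changes per column, which is the kind of information a reviewer needs to judge whether a code change altered the data in the expected way.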

They enforced a company-wide checklist of data quality standards.
FINN made data lineage a rule-enforced structure. They implemented checks to prevent improper references, such as a KPI model directly accessing staging layers, so that architectural guidelines were enforced automatically at the pull request stage.
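A lineage rule like this can be sketched as a small check over the model dependency graph. The example below is an illustration only: the `stg_`, `int_`, and `kpi_` prefixes, the `ALLOWED_UPSTREAM` mapping, and the model names are assumptions, not FINN's actual conventions.

```python
# Hypothetical layering rule: which layers a model in each layer may reference.
ALLOWED_UPSTREAM = {
    "stg": set(),            # staging models read from sources, not other models
    "int": {"stg", "int"},   # intermediate models may build on staging
    "kpi": {"int", "kpi"},   # KPI models must not reach into staging directly
}


def layer_of(model_name: str) -> str:
    """Derive the layer from the model name prefix, e.g. 'stg_orders' -> 'stg'."""
    return model_name.split("_", 1)[0]


def find_violations(dependencies: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Return (model, parent) pairs that break the layering rule."""
    violations = []
    for model, parents in sorted(dependencies.items()):
        allowed = ALLOWED_UPSTREAM.get(layer_of(model))
        if allowed is None:
            continue  # unknown layer: not covered by this sketch
        for parent in parents:
            if layer_of(parent) not in allowed:
                violations.append((model, parent))
    return violations


# Hypothetical dependency graph: model name -> models it references.
dependencies = {
    "stg_orders": [],
    "int_orders_enriched": ["stg_orders"],
    "kpi_revenue": ["int_orders_enriched", "stg_orders"],
}

violations = find_violations(dependencies)
for model, parent in violations:
    print(f"{model} must not reference {parent} directly")
```

In a real dbt project the dependency graph would come from the project's compiled metadata rather than a hand-written dictionary.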
They also used pre-commit hooks in their dbt project to scrutinize models for compliance with company policies (e.g. presence of model owners, uniqueness, non-null tests for primary keys, and consistent naming conventions). No PRs could merge to main without passing these tests.
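The policies listed above (owner, primary key tests, naming) could be checked by a script along these lines. This is a sketch under stated assumptions: the dictionary shape is a simplified stand-in for a parsed model properties file, and the `meta.owner` and `meta.primary_key` fields and the naming pattern are hypothetical choices made for illustration.

```python
import re

# Hypothetical naming convention: layer prefix followed by snake_case.
NAME_PATTERN = re.compile(r"^(stg|int|kpi)_[a-z0-9_]+$")
REQUIRED_KEY_TESTS = {"unique", "not_null"}


def check_model(model: dict) -> list[str]:
    """Return a list of policy problems for one model definition."""
    problems = []
    name = model.get("name", "")
    if not NAME_PATTERN.match(name):
        problems.append(f"{name}: name does not follow the naming convention")

    meta = model.get("meta", {})
    if not meta.get("owner"):
        problems.append(f"{name}: missing owner")

    # The primary key column must be declared and carry both required tests.
    primary_key = meta.get("primary_key")
    columns = {column["name"]: column for column in model.get("columns", [])}
    if primary_key is None or primary_key not in columns:
        problems.append(f"{name}: primary key column is not declared")
    else:
        tests = set(columns[primary_key].get("tests", []))
        missing = sorted(REQUIRED_KEY_TESTS - tests)
        if missing:
            problems.append(
                f"{name}: primary key is missing tests: {', '.join(missing)}"
            )
    return problems


# Hypothetical model definitions, one compliant and one not.
good_model = {
    "name": "kpi_revenue",
    "meta": {"owner": "finance-data", "primary_key": "revenue_id"},
    "columns": [{"name": "revenue_id", "tests": ["unique", "not_null"]}],
}
bad_model = {
    "name": "kpi_churn",
    "meta": {"primary_key": "customer_id"},
    "columns": [{"name": "customer_id", "tests": ["unique"]}],
}

print(check_model(good_model))
print(check_model(bad_model))
```

A hook or CI step would run a check like this over every model touched by a PR and block the merge if any problems are reported.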

They automated everything.
FINN enforced these rules through their CI GitHub Actions, which would automatically run with each PR. Updating their company rules was as easy as adding new checks to Python scripts and YAML files, and helped facilitate the immediate and efficient rollout of new policies for all developers. There were no manual tests to slow down development, and all developers worked within the same data quality standards.
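One way to keep rule updates as cheap as described is a small registry in which each rule is a function, so adding a policy means adding one function. The sketch below is an assumption about how such a script might be organized, not FINN's code; the `check` decorator, the rule names, and the model fields are hypothetical.

```python
from typing import Callable

Check = Callable[[dict], list[str]]
CHECKS: list[Check] = []


def check(func: Check) -> Check:
    """Register a rule so the runner picks it up automatically."""
    CHECKS.append(func)
    return func


@check
def has_owner(model: dict) -> list[str]:
    if model.get("meta", {}).get("owner"):
        return []
    return [f"{model['name']}: missing owner"]


@check
def has_description(model: dict) -> list[str]:
    if model.get("description"):
        return []
    return [f"{model['name']}: missing description"]


def run_checks(models: list[dict]) -> int:
    """Run every registered rule and return a process exit code."""
    failures = [
        message
        for model in models
        for rule in CHECKS
        for message in rule(model)
    ]
    for message in failures:
        print(f"FAIL {message}")
    return 1 if failures else 0


# Hypothetical models; the second one breaks both rules.
models = [
    {"name": "kpi_revenue", "meta": {"owner": "finance-data"},
     "description": "Monthly recognized revenue"},
    {"name": "stg_orders"},
]
exit_code = run_checks(models)
print(f"exit code: {exit_code}")
```

In a CI job, the script would pass the result to `sys.exit`, since a non-zero exit code fails the step and blocks the PR.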

Read the full story here.

Petal: Eliminating data quality incidents and improving business confidence

Data scientists are keenly aware that their models live and die by the quality of the data fed into them. But they’re often embedded in complex data pipeline architectures where they simultaneously rely on using production data from other sources to train their models, which then produce new data that loops back to influence production.

As a fintech company on a mission to bring financial opportunity to underserved consumers, Petal grappled with the challenge of ensuring that the data science models powering their core business decisions and strategies were only trained on the highest quality data. After refactoring their data science systems to improve internal data quality and performance, they were able to successfully eliminate data quality incidents entirely.

Here’s how Petal's data team adopted proactive data quality principles to manage the delicate balance between workflow transformation and compliance with their industry standards for data integrity:

They made small but key changes to their workflow to shift data testing to the left.
Petal recognized the importance of catching data quality issues early in the development process to prevent them from propagating downstream. By implementing an automated data testing system with Datafold, they significantly reduced the time spent on QA tasks, cutting down the validation process from around 30 minutes per PR to a fraction of that time. This allowed them to identify and rectify potential issues at their source, saving hours of developer time in manual validation efforts alone.

They replaced ad hoc manual testing with standardized checklists across all PRs.
Previously, Petal’s team would run manual SQL queries per PR for validation, taking up to an hour each. Each PR is now subject to standardized and comprehensive data validation checks with Datafold's data diffing during their CI process, reducing the risk of overlooking critical issues and saving them 15 hours per month on manual testing. These standardized checks ensure consistency and thoroughness in their data quality assurance processes.
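As an illustration of what replacing ad hoc queries with a standardized check can look like, the sketch below runs the same three validations against any pair of tables. It uses an in-memory SQLite database for the sake of a runnable example; the `validate` function, table names, and columns are hypothetical and do not represent Petal's setup or Datafold's checks.

```python
import sqlite3


def validate(conn: sqlite3.Connection, prod_table: str, dev_table: str, key: str) -> dict[str, bool]:
    """Run the same fixed set of checks for every change.

    Table and column names are interpolated directly, so they must come
    from trusted configuration, never from user input.
    """
    def scalar(sql: str) -> int:
        return conn.execute(sql).fetchone()[0]

    return {
        "row_count_matches": scalar(f"SELECT COUNT(*) FROM {prod_table}")
        == scalar(f"SELECT COUNT(*) FROM {dev_table}"),
        # COUNT(key) skips NULLs, so this isolates duplicate key values.
        "key_is_unique": scalar(
            f"SELECT COUNT({key}) - COUNT(DISTINCT {key}) FROM {dev_table}"
        ) == 0,
        "key_has_no_nulls": scalar(
            f"SELECT COUNT(*) FROM {dev_table} WHERE {key} IS NULL"
        ) == 0,
    }


# Hypothetical tables: the dev version accidentally duplicates account 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prod_accounts (account_id INTEGER, balance REAL)")
conn.execute("CREATE TABLE dev_accounts (account_id INTEGER, balance REAL)")
conn.executemany(
    "INSERT INTO prod_accounts VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0)],
)
conn.executemany(
    "INSERT INTO dev_accounts VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (2, 25.0)],
)

results = validate(conn, "prod_accounts", "dev_accounts", key="account_id")
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'} {name}")
```

Note that the row counts match here even though the data is wrong, which is why aggregate checks are usually paired with key-level and value-level comparisons.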

They automated all their data tests, leaving no possibility for human error.
After moving away from ad hoc manual SQL queries, the next step was to automate the entire process through their CI pipeline. Because automated data validation saves time and frees up developer resources for more important work, it was a game changer for other pressing data initiatives. Petal was able to use this new data validation workflow to refactor their dbt project for improved end-user performance. Refactoring a large and complex dbt project is often an intimidating exercise, but automated checks made it manageable by uncovering data discrepancies and errors at each stage, which boosted developer productivity. Not only did this improve their overall data quality, it also gave them confidence in the data they were providing to downstream data science models.

Read the full story here.

Rocket Money: Data quality beyond data teams

Something we’ve returned to again and again is that data quality is not the sole domain of data engineering teams. A proactive data culture is one that makes data quality metrics transparent to other parts of the organization, and makes it possible for them to contribute to better data quality as well. In fact, it often leads to markedly improved internal collaboration between technical and non-technical teams.

Rocket Money’s experience makes this point vividly clear. As a financial management service with more than 5 million users, they were no strangers to how tricky financial data can be to transform, analyze, and reconcile. After initiating a comprehensive overhaul of their data transformation system to improve data quality standards, they were able to pass their first SOX audit with zero deficiencies. The interesting thing is that this was the result of enabling proactive data quality beyond the data team:

They improved workflows by shifting data testing left and bringing key stakeholders along.
Rocket Money's existing Quote-to-Cash (Q2C) system was extremely complicated as it tracked a range of financial activities, such as invoice generation, revenue recognition, and managing accounts receivable. Each one had its own rules and datasets, which needed to be carefully reconciled in dbt.

Rocket Money streamlined the process with the intention of better involving a key stakeholder: the non-technical accounting team. They made sure that the new workflow made it easier for the accountants to manage, audit, and report the financial information without compromising on precision.
As in the other two case studies, Rocket Money implemented value-level data diffing during the deployment stage with Datafold. Whenever the data team created or modified dbt models for their Q2C process, the code changes were run through data diffing in CI. The accounting team could then analyze the user-friendly data diff results to verify that any changes were acceptable and expected.

They enforced accounting rules within a company-wide checklist of data quality standards.
In their new workflow, they integrated value-level data diffs within their existing ticket tracking tool. Rocket Money’s development process ensured that CI pipelines ran comprehensive tests on each build each day, and acted as a company-wide gatekeeper that only approved changes after the accountants validated the resulting data diffs. This centralized where financial rules were being created and then enforced – and was critical for maintaining the integrity of precise financial records in preparation for their first SOX audit.
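The gatekeeping logic described here can be sketched as a simple rule: a change is mergeable only when CI tests pass and someone from accounting has approved the data diff. The class, field names, and reviewer identifiers below are hypothetical and only illustrate the rule, not Rocket Money's tooling.

```python
from dataclasses import dataclass, field

# Hypothetical set of people allowed to sign off on financial data diffs.
ACCOUNTING_REVIEWERS = {"accountant_a", "accountant_b"}


@dataclass
class ChangeRequest:
    title: str
    tests_passed: bool
    diff_approved_by: set[str] = field(default_factory=set)


def can_merge(change: ChangeRequest) -> tuple[bool, str]:
    """Allow a merge only after CI passes and accounting approves the diff."""
    if not change.tests_passed:
        return False, "CI tests failed"
    if not (change.diff_approved_by & ACCOUNTING_REVIEWERS):
        return False, "waiting for accounting to approve the data diff"
    return True, "approved"


pending = ChangeRequest("Update revenue recognition model", tests_passed=True)
approved = ChangeRequest(
    "Update revenue recognition model",
    tests_passed=True,
    diff_approved_by={"accountant_a"},
)

print(can_merge(pending))
print(can_merge(approved))
```

Keeping the approval rule in one place means the same condition applies to every change, regardless of who opened it.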

They helped teams become more efficient through cross-functional automation.
Data engineers work closely with their CI pipelines each day, but that’s rarely true for any other function in the company. In fact, we would guess that if you quizzed the finance team at most companies about what CI is and how it works, you would be met with amused looks.

Not so at Rocket Money. Because they simplified the Q2C system and made it accessible to non-technical staff, their data engineers, analytics engineers, and accountants could all work together and base their decisions on the same data quality tests and metrics. The automated CI tests acted as a safeguard against unreliable manual tests and inconsistent approaches across functions.

Read the full story here.
