The challenge: Scaling with speed and governance
FINN faced significant challenges during its high-growth phase, chief among them scaling its data operations team to keep pace with rapid expansion. In just three years, FINN saw its ARR increase by a factor of 40. The company grew to 400 people, nearly all of whom interacted with data in some fashion, and its dbt project powered most of that data.
With an ambitious timeframe of less than two years, FINN sought to expand its data team from just 3 to over 30 developers, a tenfold increase in headcount. However, with more than 1,800 data models in play and over 30 dbt developers contributing to the project, the complexity of governance and the risk of data degradation loomed large.
This wasn't just about adding more talent to the team; it was about controlling the chaos that could arise from managing a vast network of data models and a growing team of contributors. The primary challenge was ensuring that the scaling process did not degrade data quality or slow down development.
FINN’s existing tools were not cut out for such dynamic scaling, and the potential for errors and inconsistencies threatened the integrity of its operations. Data quality became a critical issue: standards varied and technical debt accumulated in the dbt repo as multiple embedded data teams contributed without unified practices.
Many in FINN’s position face the problem of a ‘frontier’: increasing development speed comes at the cost of data quality, and vice versa. For FINN, this tension sharpened in 2023: adding governance slowed development, while relaxing governance requirements hurt data quality.
FINN required a robust, scalable system that could support the workflow complexities brought on by rapid growth and break through this natural frontier. The solution would have to be agile enough to match FINN’s pace, robust enough to manage a sprawling data infrastructure, and intuitive enough to be embraced by a rapidly expanding company.
The solution: Automation through CI and Datafold
FINN tackled the issue of maintaining high data quality without sacrificing speed by implementing an automated Continuous Integration (CI) setup using Datafold’s data diff alongside dbt and SQLFluff in GitHub Actions.
This setup wasn't just about enforcing code quality; it was about empowering developers with tools for consistent coding styles, data quality testing requirements through data diff, lineage adherence, and policy compliance, all automated within the CI pipeline.
FINN turned to a combination of solutions to break through the limits imposed by the speed-quality frontier: Datafold Cloud’s data diffing in CI for automated data testing and data lineage enforcement, GitHub Actions for policy automation, and code linting for standardization.
- Data validation at scale: The team used Datafold Cloud’s automated data diffing to give developers feedback on the impact of their changes on production data and downstream models. Datafold provided next-level visibility into changes, reassuring developers with clear row-level comparisons between their dev and prod environments, reducing the fear of breaking dependencies and allowing confident merges.
- Data lineage enforcement: Data lineage became less of a navigational nightmare and more a rule-enforced structure. FINN implemented checks to prevent improper references, like a KPI model directly accessing staging layers—automatically enforcing architectural guidelines during the pull request stage.
- Policy automation through GitHub Actions: FINN used Python scripts to scrutinize models for compliance with company policies (e.g. presence of model owners, uniqueness, non-null tests for primary keys, and consistent naming conventions). These automated checks facilitated the immediate and efficient rollout of new policies.
- Code quality with linting: SQLFluff ensured uniform code aesthetics across 1800 models. Through CI, it enforced a standard format—improving readability and maintenance. No code could merge without meeting these set standards.
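Checks like the lineage and policy rules above can be implemented as small scripts run against dbt’s `manifest.json` artifact during CI. The sketch below is a minimal illustration, not FINN’s actual implementation; the `kpi_`/`stg_` naming prefixes and the `meta.owner` field are assumed conventions:

```python
import json
import sys

def check_manifest(manifest: dict) -> list[str]:
    """Flag lineage and policy violations in a parsed dbt manifest.

    Assumed (hypothetical) conventions: KPI models are prefixed
    'kpi_', staging models 'stg_', and every model declares an
    owner under its 'meta' config.
    """
    violations = []
    for node in manifest.get("nodes", {}).values():
        if node.get("resource_type") != "model":
            continue
        name = node["name"]
        # Policy check: every model must declare an owner.
        if not node.get("meta", {}).get("owner"):
            violations.append(f"{name}: missing meta.owner")
        # Lineage check: KPI models must not reference staging models directly.
        if name.startswith("kpi_"):
            for parent_id in node.get("depends_on", {}).get("nodes", []):
                parent = parent_id.split(".")[-1]
                if parent.startswith("stg_"):
                    violations.append(f"{name}: references staging model {parent}")
    return violations

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as f:  # e.g. target/manifest.json after `dbt compile`
        problems = check_manifest(json.load(f))
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```

A non-zero exit code fails the pull request’s CI run, which is what turns a written policy into an automatically enforced one.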
How does this work in practice?
FINN used data diff to monitor critical reports that may be affected by changes in upstream data. Whenever a model is modified, Datafold Cloud assesses the impact by calculating the percentage of changed values between the prod and dev versions of the dbt model. A high percentage of change indicates substantial data alterations and warrants further investigation. A low percentage of change (that is, a match rate close to 100%) suggests minimal impact, typically signaling that the updates can be deployed safely without adverse effects on the reports.
Reviewing the key metrics in Datafold Cloud’s CI printout helped the team understand how changes affected data models and ensure that important reports remained accurate for decision-making.
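Conceptually, the percentage-of-changed-values metric works like the simplified sketch below. This is an illustration only, not Datafold’s actual algorithm (real data diffing runs inside the warehouse rather than over in-memory rows), and `percent_values_matching` is a hypothetical helper:

```python
def percent_values_matching(prod_rows, dev_rows, key="id"):
    """Percentage of cell values that match between two versions of a table.

    Rows are dicts; `key` names the primary-key column used to pair
    prod and dev rows. Added or removed rows count as mismatches.
    """
    prod = {row[key]: row for row in prod_rows}
    dev = {row[key]: row for row in dev_rows}
    total = matching = 0
    for pk in prod.keys() | dev.keys():
        p, d = prod.get(pk), dev.get(pk)
        for col in set(p or {}) | set(d or {}):
            total += 1
            if p is not None and d is not None and p.get(col) == d.get(col):
                matching += 1
    return 100.0 * matching / total if total else 100.0

prod = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
dev = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]
print(percent_values_matching(prod, dev))  # → 75.0
```

Here one of four cell values changed, so 75% of values match; a reviewer would then decide whether that 25% change is expected or a regression.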
The result: A supercharged team of analytics engineers
Integrating Datafold's tools into FINN’s data infrastructure had a transformative impact on their business operations. FINN successfully grew its development team tenfold in three years. They were not only able to scale extremely quickly but also established a robust foundation for data quality, fostering trust and reliability in their data-driven decision-making processes.
Automated data governance: Automated checks and balances ensured that even as the number of dbt developers increased from 3 to over 30 in a short span of time, the rapid deployment rates remained uncompromised. This helped manage the complexity arising from more than 1800 dbt models. Within this new framework, developers could iterate and contribute to FINN’s data models with confidence, knowing that quality checks were in place to catch issues early.
Streamlined data quality monitoring: Datafold’s data diff allows FINN to proactively identify and resolve data issues. This ensures that data analysts are working with high-quality data, minimizing the risk of erroneous business insights.
Increased trust: Datafold's integration with FINN's data operations has improved the reliability and quality of data. This reliability reduces interruptions in data availability and analytics workflows. This enhanced trust was not just internal among the data analysts but also extended to the business stakeholders who rely on accurate insights for strategic decisions.
Empowerment of data teams: Automated checks and balances allowed data teams to maintain rapid deployment rates, essential for a fast-growing company.
Operational efficiency: FINN's data quality monitoring became more proactive and less reactive with Datafold. Data is validated more quickly, freeing up valuable resources and allowing data teams to focus on deriving business insights rather than being preoccupied with infrastructure management. Streamlined CI processes eliminated the need for exhaustive manual reviews and the fear of introducing errors, reducing the bottleneck effect in the development cycle.
Higher growth and talent integration: These all served to propel and support FINN’s exponential growth over the years. The integration of Datafold created a strong foundation on which to onboard new hires at an unprecedented pace without a corresponding increase in data management complexity or risk.
Building better one check at a time
This level of operational scalability is essential for any business aiming to maintain its growth momentum without suffering from the common pitfalls of rapid expansion, such as bottlenecks in training and quality assurance.
But it can seem daunting to integrate new tools and frameworks. This is especially true for teams that are just starting to implement CI. As FINN’s Jorrit advises in his 2023 Coalesce talk, starting with "one check at a time" can smooth the path forward. Instead of a wholesale overhaul that could cause significant disruption, introducing a single check allows both the data infrastructure and team members to adapt incrementally.
Through this lens, each added check becomes an opportunity to strengthen the overall data quality framework and develop organizational competence. It's a journey taken step by step, with each new integration acting as a building block towards a more resilient data ecosystem.
The result? A data infrastructure that grows organically with the team's capabilities, ensuring a practical path forward that is sustainable, manageable, and ultimately, successful.
If you want to learn more about FINN’s journey with automated testing and Datafold, check out their full Coalesce 2023 presentation here.