Data quality is your moat; this is your guide.

Fortify your data, fortify your business: Why high-quality data is your ultimate defense.

During data transformation development & deployment

Published May 28, 2024

During data transformation development and deployment, ensuring high data quality is crucial because it sets the foundation for all downstream processes. Data practitioners commonly run into the following core problem areas when developing and deploying their data transformation work, typically with a tool like dbt or custom SQL models.

  1. When bad data reaches production
    Bad data that reaches production can cascade into issues throughout the data pipeline. These failures typically stem from errors in data transformation logic, inconsistencies in source data, or inadequate testing procedures.
  2. Slow development & deployment
    One factor that contributes to slow development and deployment is a manual, ad hoc data validation process. Data teams often rely on manual inspection, whether through data unit tests or custom SQL queries, which is time-consuming, error-prone, and not standardized across teams. Because you should validate your data every time you change your code, re-running these ad hoc queries consumes valuable time and introduces development delays (see the SQL sketch after this list).
  3. Lack of standardized data quality practices
    Without standardized coding conventions, documentation practices, and version control procedures around data quality testing, it becomes difficult to maintain consistency and ensure the reliability of data transformation workflows. A lack of governance also leads to inconsistencies in data models and increased technical debt over time.
  4. Scaling dbt projects to users, models, and tests
    Whether you’re the founding data engineer of a startup or the 100th addition to a large organization’s analytics unit, you’ll encounter the same challenges around scaling projects for the next phase of growth. Maintaining data quality at scale requires far more intentionality and adherence to best practices. As dbt projects grow in complexity to accommodate more users, models, and tests, you’ll run into performance bottlenecks, resource constraints, and difficulties managing dependencies between different components of the project.
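To make the manual validation in item 2 concrete, here is a minimal sketch of the kind of ad hoc SQL check a practitioner might re-run by hand after every code change. The table and column names (analytics.orders, order_id, order_total) are hypothetical placeholders, not from any specific project.

```sql
-- A hypothetical ad hoc validation query, re-run manually after each change.
-- Table and column names (analytics.orders, order_id, order_total) are
-- placeholders for illustration only.
SELECT
    COUNT(*)                      AS row_count,
    COUNT(DISTINCT order_id)      AS distinct_order_ids,  -- should equal row_count
    COUNT(*) - COUNT(order_total) AS null_order_totals,   -- NULLs in a required column
    MIN(order_total)              AS min_order_total,     -- negative values would be suspicious
    MAX(order_total)              AS max_order_total
FROM analytics.orders;
```

Every result of a query like this has to be eyeballed by a person, which is exactly why the approach is error-prone and hard to standardize across a team.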
     

How these challenges are typically solved

There are four ways that practitioners have typically approached data quality testing during dbt project development and deployment, ranging from least mature to most mature.

| Approach | Purpose | How it measures up |
| --- | --- | --- |
| Ad hoc SQL tests | Naive row count and summary statistic checks | Rudimentary: Manual inspection is prone to oversight and lacks comprehensive coverage, especially as data transformation projects scale. |
| dbt tests | Manual tests added to dbt models, either run manually during development or automated during CI checks | Better, but limited: Provides explicit testing to catch expected problems within dbt jobs. However, it lacks the ability to address nuanced quality concerns and contextual anomalies specific to business needs. |
| CI checks | Ensuring dbt projects compile and pass tests when new PRs are opened | Better: Automates testing to maintain consistent quality standards. Integrating CI checks helps catch errors early in the development process, ensuring modifications meet quality guardrails before deployment. |
| CI checks with Datafold | Utilizing Datafold to standardize and automate data quality testing for this part of the analytics workflow | Best practice: Datafold integrates seamlessly into CI pipelines, providing value-level comparisons between datasets. This eliminates the need for extensive manual testing and enhances confidence in deployments by quickly identifying unexpected problems. |
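To make the “dbt tests” row concrete, here is a minimal sketch of a singular dbt test: a SQL file saved in the project’s tests/ directory that fails if the query returns any rows. The model and column names (stg_orders, order_id, order_total) are hypothetical; dbt also provides generic tests such as not_null and unique that are declared in YAML.

```sql
-- tests/assert_no_negative_order_totals.sql
-- A singular dbt test: `dbt test` runs this query and fails the test
-- if it returns any rows. Model and column names are hypothetical.
SELECT
    order_id,
    order_total
FROM {{ ref('stg_orders') }}
WHERE order_total < 0
```

Wiring `dbt test` (or `dbt build`) into a CI check on every pull request is what moves a team from the first row of this table toward the last two.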