How to Setup dbt Development Environments

What’s an environment?

In software development, an environment is a system in which a software component is deployed and executed.

It’s the place where your thing runs.

Applied to the data warehouse, an environment might refer to grouping of databases, schemas, tables, or views.

Why have multiple environments?

Imagine a world with a single environment, in which users and developers directly interact with a single system.

An analytics flavored example of a single environment would be engineers and end users interacting with the same data models. Our analytics engineer happily works on the <span class="code">impt_metrics_model</span> that fuels our CFO’s Important Metrics Dashboard.

All is well in the land, until our analytics engineer accidentally introduces a breaking change to the <span class="code">impt_metrics_model</span>.

Any changes introduced to the environment would immediately impact users, for better or worse. This diagram demonstrates the impact of the “or worse” scenario. Breaking changes introduced to the <span class="code">impt_metrics_model</span> set fire to the CFO’s dashboard. Lacking visibility into important metrics, the CFO is unable to make date driven decisions, and could steer the company in the wrong direction. Unreliable data erodes business trust in analytics, and can further damage business outcomes.

Separate production and development environments

Separating environments allow for a greater degree of change management. Changes are originally made in the development environment, and are only promoted to production once we can be confident that we are not introducing breaking changes.

Within the development environment, our analytics engineer is free to make changes without impacting the end user in the production environment. Engineers are free to iterate until their proposed changes are ready for production.

Development environments in dbt

Assuming I’ve convinced you of the value of having development and production in separate environments, what does that look like in a dbt project? There are several ways to accomplish this, with no single consensus in the dbt community. By configuring your target in your profile, you can have your local dbt usage point to your defined target environment, typically a dev environment.

Variables to consider

Administrative burden

Depending on the size of the team, provisioning and maintaining development environments can quickly become a burden for administrators. 

  • Utilizing custom schemas and custom databases can ease the burden
  • Self-serve options for refreshing development data (ex, a macro for cloning prod data in dev)

Number of developers

If there are only one or two developers on the team, it’s very plausible that they could share a single development environment, and not step on each other’s toes. If the team scales to many developers, discrete spaces will be necessary.

Cost

Maintaining several full copies of production data can become expensive, both in terms of compute and storage. For example, full refreshes of a large incremental model across dozens of development environments could be a huge resource drain.

Developer time should be valued; developer experience and productivity are important considerations. Scenarios where developers have to go through a cumbersome process or where data replication takes forever are not ideal.

Options to mediate cost include:  

  • limiting data processed in dev by filtering data where target = <span class="code">dev</span> 
  • Utilizing cloning techniques
  • Simply accept deltas between prod and dev

How often do developers truly require a 1:1 copy of prod? Arguably very infrequently, particularly for testing of data integrity (vs testing of actual data values) 

One database to rule them all

dbt recommends “using different schemas within one data warehouse to separate your environments…If you have multiple dbt users writing code, it often makes sense for each user to have their own development environment. A pattern we’ve found useful is to set your dev target schema to be <span class="code">dbt_username</span>."

In action, this approach might look like this:

One database per environment

Another approach would be to have one database per environment, with a schema per developer in the dev database.

One database per developer

In this approach, each developer has a personal copy of the production database. Schemas are identical across all databases, and data can be cloned or models run as needed.

Summary

This list is not exhaustive, nor does it reflect complexities of teams with multiple dbt projects, multiple production environments, or additional environment tiers like staging, testing, or integration.

At Datafold, we’re fans of the “one database per environment” approach, but recognize it’s not a one size fits all solution. From an organization perspective, we like keeping our production database tidy and uncluttered. For collaboration and review, maintaining read permissions on the dev database for all developers makes it easy to view the impact of proposed changes. 

Tired of urgent Slack messages? Blindsided by broken dashboards and silent bugs? Spending countless hours on PR reviews? Datafold can help. Get Started today or Book a Demo with our data experts instead.

Stop data quality incidents, before they happen.
Get Started

Not ready yet? Book a Demo Instead

Schedule a personal call and see how Datafold can help you

Get Started

To get Datafold to integrate seamlessly with your data stack we need to have a quick onboarding call to get everything configured properly