Setting up dbt Development Environments
What’s an environment?
In software development, an environment is a system in which a software component is deployed and executed.
It’s the place where your thing runs.
Applied to the data warehouse, an environment might refer to grouping of databases, schemas, tables, or views.
Why have multiple environments?
Imagine a world with a single environment, in which users and developers directly interact with a single system.
In an analytics context, this would mean engineers and end users accessing the same data models. For instance, our analytics engineer happily works on the <span class="code">impt_metrics_model</span> that fuels our CFO’s Important Metrics Dashboard.
All is well until our analytics engineer accidentally introduces a breaking change to the <span class="code">impt_metrics_model</span>.
Any changes introduced to the environment would immediately impact users, for better or worse. This diagram demonstrates the impact of the “or worse” scenario. Breaking changes introduced to the <span class="code">impt_metrics_model</span> set fire to the CFO’s dashboard. Lacking visibility into important metrics, the CFO is unable to make date driven decisions, and could steer the company in the wrong direction. Unreliable data erodes business trust in analytics and can further damage business outcomes.
Separate production and development environments
Separate environments allow for a greater degree of change management. Changes are originally made in the development environment and are only promoted to production once we can be confident that we are not introducing breaking changes.
Within the development environment, our analytics engineer is free to make changes without impacting the end user in the production environment. Engineers are free to iterate until their proposed changes are ready for production.
Development environments in dbt
Assuming I’ve convinced you of the value of having development and production in separate environments, what does that look like in a dbt project? There are several ways to accomplish this, with no single consensus in the dbt community. One common approach is configuring your target environment in your dbt profile, typically pointing to a dev environment.
Variables to consider
Depending on the size of the team, provisioning and maintaining development environments can quickly become a burden for administrators.
- Utilizing custom schemas and custom databases can ease the burden
- Self-serve options for refreshing development data (ex, a macro for cloning prod data in dev)
Number of developers
If there are only one or two developers on the team, it’s very plausible that they could share a single development environment and not step on each other’s toes. If the team scales to many developers, discrete spaces will be necessary.
Maintaining several full copies of production data can become expensive, both in terms of compute and storage. For example, full refreshes of a large incremental model across dozens of development environments could be a huge resource drain.
Developer time should be valued; developer experience and productivity are important considerations. Scenarios where developers have to go through a cumbersome process or where data replication takes forever are not ideal.
Options to mediate cost include:
- limiting data processed in dev by filtering data where target = <span class="code">dev</span>
- Utilizing cloning techniques
- Simply accept deltas between prod and dev
How often do developers truly require a 1:1 copy of prod? Arguably very infrequently, particularly for testing of data integrity (vs testing of actual data values)
One database to rule them all
dbt recommends “using different schemas within one data warehouse to separate your environments…If you have multiple dbt users writing code, it often makes sense for each user to have their own development environment. A pattern we’ve found useful is to set your dev target schema to be <span class="code">dbt_username</span>."
In action, this approach might look like this:
One database per environment
Another approach would be to have one database per environment, with a schema per developer in the dev database.
One database per developer
In this approach, each developer has a personal copy of the production database. Schemas are identical across all databases, and data can be cloned or models run as needed.
This list is not exhaustive, nor does it reflect complexities of teams with multiple dbt projects, multiple production environments, or additional environment tiers like staging, testing, or integration.
At Datafold, we’re fans of the “one database per environment” approach, but recognize it’s not a one-size-fits-all solution. From an organization perspective, we like keeping our production database tidy and uncluttered. For collaboration and review, maintaining read permissions on the dev database for all developers makes it easy to view the impact of proposed changes.
Datafold is the fastest way to validate dbt model changes during development, deployment & migrations. Datafold allows data engineers to audit their work in minutes without writing tests or custom queries. Integrated into CI, Datafold enables data teams to deploy with full confidence, ship faster, and leave tedious QA and firefighting behind.