GitHub Best Practices for Analytics Engineering
GitHub offers many features, such as protected branches, pull requests, and code reviews. It is best practice to take advantage of all of these in your analytics workflow and use them to maintain the integrity of your data models.
#1: Use branches
Branches keep code changes separate, whether between team members, teams, or even individual tasks. They ensure that you can make changes without affecting the entire code base, especially what’s in production.
To create a branch, you simply run <span class="code">git branch <branch_name></span>, then switch to it with <span class="code">git checkout <branch_name></span> — or do both at once with <span class="code">git checkout -b <branch_name></span>.
If you code directly on the main branch of your dbt project, it can be difficult to know, when some part of your data breaks, which change caused it. By creating your own branch and merging in new models and updates, you get a timestamped record of when each update was made and what it was, so it can be investigated or rolled back easily.
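The whole branch-and-merge flow can be sketched in a few commands. This runs in a throwaway repository; the branch name, file, and commit messages are purely illustrative:

```shell
# A minimal sketch of the branch-and-merge flow in a throwaway repo.
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
main="$(git symbolic-ref --short HEAD)"   # whatever the default branch is called

# A committed baseline on the default branch
echo "select 1 as revenue" > revenue.sql
git add revenue.sql
git commit -q -m "Add revenue model"

# Make the change on its own branch, not on the default branch
git checkout -q -b fix_revenue_join
echo "select 2 as revenue" > revenue.sql
git commit -q -am "Replace inner join with left join in revenue model"

# The default branch is untouched until the work is reviewed and merged
git checkout -q "$main"
git merge -q fix_revenue_join

git log --oneline   # the history records when the change landed and what it was
```

Because each change lands as its own commit on its own branch, `git log` gives you exactly the investigate-or-roll-back trail described above.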
#2: Establish branch naming conventions
I recommend setting a naming convention for your branches in your internal documentation. This will hold team members accountable to a standard and keep your GitHub environment consistent. There are many ways you can organize your branches, but I personally like to organize by Jira ticket. You can name your branch after your ticket number or, my personal favorite:
<span class="code"><team_member_name>_<ticket_number></span>
This makes it easy to know who is working on which branch. It easily tells you whether or not you own that branch and if you should switch to it. It also prevents you from changing code for two different models on the same branch. This will allow for easier testing when it comes to pushing your code and having it reviewed.
#3: Write descriptive commit messages
When working on your own branch, it’s also important to include meaningful commit messages. If you aren’t familiar with commits: staging (with <span class="code">git add</span>) selects your changes, and a commit permanently records those staged changes in your branch’s history, ready to be pushed to your repository. It is easy to skip out on including a detailed message, just wanting to get the code saved. However, these messages will help you keep track of changes within your own branch. They will come in handy if you ever need to revert to older code and undo a change you made.
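Here is a quick sketch of how staging, committing, and reverting fit together, run in a throwaway repository with illustrative file and model names:

```shell
# A sketch of staging vs. committing vs. reverting, in a throwaway repo.
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "select * from orders" > orders.sql
git add orders.sql                            # stage the change
git commit -q -m "Add orders staging model"   # record the staged change in history

echo "select * from orders where status != 'cancelled'" > orders.sql
git commit -q -am "Filter cancelled orders out of orders staging model"

git log --oneline     # descriptive messages make each change easy to find...
git revert -n HEAD    # ...and to undo: this stages the reverse of the last commit
```

Notice that reverting is only painless because each commit message tells you exactly which change you are undoing.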
Bad commit message:
<span class="code">git commit -m "Join change"</span>
Good commit message:
<span class="code">git commit -m "Replace inner join with left join when joining orders and subscriptions tables"</span>
The first message is vague and leaves you guessing what the code looked like before and what it was supposed to be doing after the change. The second message gives the type of join that was in the code before and the type of join that replaced it. It also includes the specific tables that were being joined.
Now, when you complete a Jira ticket, you can push that code to GitHub to be reviewed by a team member. That team member will know exactly the problem you were trying to solve and that all the changes made were specific to the task at hand. When the code is merged to the main branch, you can test it knowing exactly what the changes were and how they may affect production. In the end, this reduces risk when merging code and saves time in trying to decipher multiple changes within one pull request.
#4: Utilize pull requests
Enforcing pull requests within your organization goes hand in hand with separating your tasks by branch. The two best practices work together to minimize the chance of bad code being pushed to production. When pull requests are enforced, nobody can merge their code changes directly into the main code base. The changes must first be reviewed by a team member before they can be added to production.
How to create a pull request
To create a pull request, you first want to add, commit, and push your changes from your local branch to your GitHub repository. Once you do this, navigate to “Pull requests” on the repository and click “New pull request”.
You then want to select the branch that you wish to merge with main.
I am going to choose the branch “census” here and then the green “Create pull request” button. Here you can leave a comment, or even a Loom video, describing the changes that you made and the problem you were solving. It may even be helpful to link the exact Jira ticket here. Your job here is to make the reviewer’s job as easy as possible. Anticipate the questions they may ask about your code and answer them here.
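The add-commit-push sequence that precedes the pull request can be sketched end to end. Here a local bare repository stands in for the GitHub remote so the flow is runnable anywhere; the branch and file names are illustrative:

```shell
# A sketch of the push that comes before opening the pull request.
# A local bare repository stands in for the GitHub remote here.
remote="$(mktemp -d)"
git init -q --bare "$remote"
repo="$(mktemp -d)" && cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
git remote add origin "$remote"

echo "select 1" > census_model.sql
git add census_model.sql
git commit -q -m "Add census model"
git push -q -u origin HEAD         # publish the default branch

git checkout -q -b census          # the feature branch from the example above
echo "select 2" > census_model.sql
git commit -q -am "Update census model"

# First push of a new branch: -u sets the upstream, so later pushes are just `git push`
git push -q -u origin census
git branch -r   # origin/census now exists on the remote, ready for a pull request
```

Once the branch exists on the remote, the rest happens in the GitHub UI as described above.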
#5: Enforce code reviews
In order to enforce pull requests and code reviews, you must require them in your repository settings.
Navigate to settings → branches → branch protection rules → add rule → enter “main” as the branch name pattern → check “require pull request reviews before merging”.
You can also choose the number of reviewers you want to require. If you have a small team, I recommend choosing just one. More than one can become a bottleneck when you need to work fast. However, if you’re a larger team of say 10 or more, you may want to require at least two reviews.
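If you manage many repositories, the same rule can also be applied programmatically through GitHub’s REST API branch protection endpoint (<span class="code">PUT /repos/{owner}/{repo}/branches/{branch}/protection</span>). A sketch of the request body: the review count of 1 is the choice discussed above, and the other top-level fields are required by the endpoint even when unused:

```json
{
  "required_status_checks": null,
  "enforce_admins": false,
  "required_pull_request_reviews": {
    "required_approving_review_count": 1
  },
  "restrictions": null
}
```

For a team of 10 or more, bump <span class="code">required_approving_review_count</span> to 2.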
Requiring pull requests will force your team to follow best practices. It will get them in the habit of checking their code twice before pushing it, explaining it to others and making it readable. Not only will it benefit the data quality and business itself, but it will help analytics engineers and data analysts grow in their skills. Forcing code reviews allows team members to share their knowledge with one another, becoming better developers.
#6: Use an orchestrator that integrates with GitHub
By now you probably know my love for dbt. dbt allows analytics engineers to write modular, tested, documented data models. There are a lot of baked-in features that make it so powerful and better than any other data transformation tool on the market. One of these is that their cloud platform integrates with GitHub.
With GitHub, dbt Cloud is able to trigger builds from pull requests, which allows for testing dbt code in the pull request before it is merged. This adds a second layer of protection alongside the code review from a team member. You quite literally can’t merge your code until your pipeline successfully builds.
Using an orchestration tool like dbt Cloud forces you to practice what you preach. It enforces many of the best practices I mentioned above, making it a great solution for teams just getting off the ground.
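If your team runs dbt Core rather than dbt Cloud, you can approximate the same pull-request gate with your CI system. A minimal sketch as a GitHub Actions workflow — the adapter, Python version, and secret name here are placeholder assumptions, not part of any standard setup:

```yaml
# .github/workflows/dbt-ci.yml -- a sketch, not a drop-in config
name: dbt CI
on:
  pull_request:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-core dbt-snowflake   # adapter is an assumption
      - run: dbt build --fail-fast                # a failed model or test blocks the merge
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}   # placeholder secret
```

Paired with the branch protection rule above (with required status checks enabled), a red build keeps the pull request unmergeable.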
#7: Implement automated testing
While all of these best practices decrease the likelihood of pushing bad code to production, they can’t prevent it entirely. There will always be syntax errors and incorrect logic that slip through the cracks and end up breaking something. Luckily, tools are getting even more sophisticated and helping prevent even this from happening.
With automated testing tools like Datafold, you can see how your code changes affect your data before merging your code to production. After you make a pull request, the tool scans your GitHub repository, runs both your previous code and the new code against your warehouse, and outputs the differences in an impact report. You can validate that these changes were expected before doing anything drastic, acting as an additional quality check.
Datafold allows you to view the results before merging a pull request. Even if your code looks good, Datafold may find differences in the resultant data that you weren’t expecting, especially in downstream tables or tools like Mode or Hightouch. It prevents the problem of pushing code before it is fully tested, ensuring production isn’t affected. This also makes the job of validation easier. Using it, you can see all of the dependencies between your data models, making sure you don’t update one without updating the others.
Put it all to work
Now, let’s walk through how this all works in the context of updating one of your data models. You need to change one of the joins in your revenue model. First, you create a new branch with your name and Jira ticket number.
<span class="code">git checkout -b madison_4355</span>
You make your changes to only the files corresponding to the revenue model. No other code changes are added to this branch.
When you’re finished, you stage your changes and write a commit message detailing exactly what you changed. Then you push your branch.
<span class="code">git commit -am "Change inner join to left join in profit_summed intermediary model"</span>
<span class="code">git push -u origin madison_4355</span>
After pushing your change, you create a pull request to merge your <span class="code">madison_4355</span> branch to the main branch. Be sure to link the exact Jira ticket in the pull request description.
During this stage, you can also check the automated testing run by Datafold. Did the tool detect any rows that will now be missing from the resultant dataset? Is this the change you wanted to occur? Be sure to review all of the validation that Datafold gives you before merging your code changes.
If you’re confident that everything is as expected, request someone from your team to review your pull request. Have them leave comments on any code they don’t understand so you can make the appropriate changes.
Then, when they approve the pull request, you can merge to the main branch and dbt Cloud will automatically build and test the new data model.
When analytics teams begin to utilize best practices such as GitHub branches and pull requests, big shifts in data quality begin to occur. These two simple practices prevent poor code from being pushed to production, breaking data pipelines, and affecting how the business operates. The productivity and reliability of software engineers skyrocketed when they began using version control tools like GitHub. Now it is a must for all engineering teams.
We are already seeing analytics tools like dbt Cloud and Datafold reflect software engineering best practices. Eventually, all modern data stack tools will follow in their footsteps and require features like version control. Without version control, it’s hard to write quality code and produce reliable data. Using GitHub will help analytics teams clear those hurdles faster than ever before.
If you want to take your data quality skills to the next level, or if your stakeholders have simply refused to QA your data for you, consider trying Datafold.