Why to Conduct Data Quality Post-mortems: Lessons learned from GitLab, Google, and Facebook

Zoe Hawkins

September 22, 2021

Organizations lose an average of $15 million per year due to poor data quality, according to Gartner Research. Poor data quality— inaccuracy, inconsistency, incompleteness, and unreliability— can have serious negative consequences for a data-driven company. Given the cost of these consequences, organizations have a high incentive to spot, fix, and understand them so that they don’t reoccur. That’s the goal of post-mortems.

The term post-mortem originally comes from the medical field, where medical examiners perform an autopsy on a person to identify the underlying cause of death. Software engineers scooped up the term next, and performed post-mortems to analyze the causes of software failure and mitigate future crashes.

For Ray Dalio, Co-Chief Investment Officer of Bridgewater Associates, Work Principle 3 — “creat[ing] a culture in which it is okay to make mistakes” — fits perfectly with the idea of post-mortems. The whole idea of post-mortems is to learn from your mistakes without blame to avoid repeating them. Major tech companies like Google, Hootsuite, and Atlassian enforce blameless post-mortems for almost every incident that has a business impact. The Google Site Reliability Engineering (SRE) team discovered that conducting blameless post-mortems “makes systems more reliable, and helps service owners learn from the event.”

A software post-mortem review uncovers the summary of the incident — a detailed timeline of events (down to Slack messages and emails being sent around), the root cause, backlog checks, business impact estimates, detection and response, recovery, lessons learned, and corrective action moving forward — in order to properly document details of the incident.

Now, data teams have an opportunity to adopt the philosophy and practice themselves, studying data quality incidents to avoid repeating the same mistakes. Below, we share three key reasons why data teams should carry out data quality post-mortems, including avoiding future errors, working better together, and mitigating fallout with stakeholders.

1. Post-mortems help identify an incident’s root cause and improve the process

History repeats itself, but that doesn’t mean you’re doomed to make the same data mistakes over and over. The primary benefit of a post-mortem is to identify the specific cause of a data incident. While studying that single problem, you also gain an opportunity to improve data quality for the entire organization.

Let’s look at an example software engineering incident; Gitlab’s database outage that occurred in 2017. Gitlab's online service had a service outage as a result of “an accidental removal of production data from the primary database server,” which impacted over 5000 projects, 5000 comments, and 700 new user accounts. Gitlab was able to articulate what caused the incident using the five whys technique.

They broke the incident into two parts— why Gitlab was down and why it took so long to restore the outage— to explain the root cause. Their findings helped them discover broken recovery procedures in their system, like database backup with pg_dump, Azure disk snapshots, and LVM snapshots. Unfortunately, there were some loopholes in these procedures, so they had to work on implementing better recovery procedures by splitting the work with a list of issues and a backup monitoring dashboard.

When using this process in a data context, your Data Team can determine if it was an unexpected data loss, broken operational processes, or even human error that was responsible for an incident by carrying out a post-mortem analysis. It also helps your team keep an eye on potential weak spots — data sources, data pipelines, and ETL processes— that can be improved for enhanced data quality management in the organization beyond just finding the root cause of any one incident.

2. Post-mortems allow data teams to collaborate, bond, and learn from one another

Data teams are greater than the sum of their parts, but data teams often work in different areas of responsibility that prevent them from collaborating. Data quality post-mortems serve as an opportunity for your team members to put their heads together and emerge a smarter, tighter team.

This article on re:Work shows how Google maintains a collaborative approach during the post-mortem process. Google maintains a practice of real-time collaboration on Google Docs with an open commenting system during a post-mortem. It enables their teams to collect ideas and solutions rapidly thereby increasing the effectiveness of their solutions.

Your data team can combine or exchange their varying experience, knowledge, and expertise during a post mortem—beyond just identifying one-off mistakes. For example, the team can determine organization-wide standards on who gets notified if data is automatically tested for accuracy, check all the code that processes data checks in a version control system, streamline an automated review process, and other team-based flows. Collaboration builds trust and resiliency in your data team and provides opportunities to think creatively and spark new ideas.

3. Post-mortems keep your stakeholders, end-users, and partners in the loop about what occurred

Post-mortems give you a clear narrative of what went wrong and how to fix it. This clear, easy-to-understand explanation earns trust and authority in the eyes of stakeholders, customers, and other partners.

For example, say your data team confronts a data quality incident. Unfortunately, your stakeholders or your company’s leadership probably won’t have visibility into the process — all they see is the fallout of the mistake. It can be a huge concern, especially if they aren’t sure whether it’ll happen again. As a result, you may lose customers or support from other business parties.

This is why Facebook’s CRO took to the platform to explain what happened with Facebook’s video metric issues. He clearly explained the issue in calculations, how the problem was identified, what was done to remedy the situation, and how it impacted advertisers. Through transparency of the post, Facebook was working hard to take responsibility, explain the situation, and pave a course to avoid similar issues in the future.

A post-mortem is excellent, transparent research for each of those steps followed by Facebook (that everyone should follow after making a mistake), helping to earn back stakeholder trust after a turbulent incident.

Prioritize proactive data quality management to avoid repetitive post-mortems

You know that phrase “an ounce of prevention is worth a pound of cure”— that applies to data teams as well. While post mortems are great after a data quality incident, committing to an effective data quality management and monitoring system would save a lot of time and resources spent on post-mortem processes and meetings. A good start is shifting away from cumbersome manual processes — after all, manual processes are the biggest frustration for data engineers and are often avoided in favor of a “deploy and pray” approach. 

Implementing simple, automated data QA tools over manual processes ensures everyone working with the data (from the most experienced to the least) can proactively avoid breaking data quality and get visibility into what causes breaks in the data pipeline. On top of reducing manual work and improving data quality, this frees up your data team so they can focus on creative and strategic projects.