Bitbucket partial outage across all services
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On Sept 25, 2022, between 01:46 PM and 09:19 PM UTC, some Atlassian customers using Bitbucket Cloud were unable to access their repositories. The event was triggered by an outage at our storage vendor's data center, which was caused by a firmware upgrade that failed to apply correctly on a subset of the vendor's storage clusters. Our on-call SRE team detected the incident within 14 minutes and escalated it to the storage vendor, who began restoring the failed nodes to bring their storage services back online. The total time to resolution was 7 hours and 33 minutes.

IMPACT

The impact on Bitbucket Cloud lasted from 01:46 PM to 09:19 PM UTC. The incident disrupted service for some of our users: affected customers were unable to access repositories via the CLI or browse the Bitbucket Cloud website.

ROOT CAUSE

The issue was caused by a firmware update that our storage vendor applied to its storage nodes; a few nodes failed to update and went down over a span of several hours. As a result, affected Bitbucket Cloud customers could not access their repositories and received HTTP 504 errors.

The root cause of the incident was the firmware update process, which did not properly update and restart all storage nodes.

REMEDIAL ACTION PLAN & NEXT STEPS

We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this specific failure with the firmware upgrade process wasn't detected prior to deployment.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Implementing a detection and alerting mechanism that validates that each storage volume is accessible and available.
  • Working with our external vendor on preventative measures as well as improved detection on their end.
  • Implementing an alerting mechanism for errors generated by failed firmware updates.
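
As an illustration only, a volume accessibility check like the one in the first action item could be sketched as follows. This is a minimal sketch under stated assumptions, not the mechanism Atlassian or its vendor uses: it assumes each storage volume is exposed as a local mount path, and the function names and the example paths in the usage note are hypothetical.

```python
import os
import tempfile
from typing import Iterable, List


def volume_is_accessible(mount_path: str) -> bool:
    """Return True if a small probe file can be written to and read back from mount_path.

    Assumption: the volume is reachable as a directory on the local filesystem.
    """
    payload = b"probe"
    try:
        # Create the probe inside the volume so the check exercises real I/O.
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=mount_path, prefix=".probe-") as probe:
            probe.write(payload)
            probe.flush()
            os.fsync(probe.fileno())
            probe.seek(0)
            return probe.read() == payload
    except OSError:
        # A missing mount, a permission error, or an I/O failure all count as unavailable.
        return False


def find_unavailable_volumes(mount_paths: Iterable[str]) -> List[str]:
    """Return the mount paths that failed the accessibility probe, in input order."""
    return [path for path in mount_paths if not volume_is_accessible(path)]
```

A scheduler could call find_unavailable_volumes(["/mnt/vol-a", "/mnt/vol-b"]) (hypothetical paths) at a fixed interval and raise an alert whenever the returned list is non-empty. Note that a probe against a hung network mount can block rather than fail, so a production check would likely also need a timeout around each probe.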

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Oct 13, 2022 - 00:52 UTC

Resolved
A fix has been implemented by our storage provider, and this incident has been resolved.
Posted Sep 25, 2022 - 22:23 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 25, 2022 - 22:00 UTC
Update
The customer issue is mitigated. The Bitbucket team will continue to monitor the health status and will update the component status soon.
Posted Sep 25, 2022 - 21:20 UTC
Update
The Bitbucket team is engaged with our storage provider to identify the root cause of the problem.
Posted Sep 25, 2022 - 19:47 UTC
Update
We are continuing to work on a fix for this issue.
Posted Sep 25, 2022 - 19:25 UTC
Identified
We have identified the root cause of the outage at the storage layer. The team is currently investigating a fix and will provide more details once the issue is resolved.
Posted Sep 25, 2022 - 17:53 UTC
Investigating
We are investigating reports of intermittent errors for all Atlassian Bitbucket Cloud customers. We will provide more details once we identify the root cause.
Posted Sep 25, 2022 - 15:49 UTC
This incident affected: Website, API, Git via SSH, Git via HTTPS, Source downloads, and Pipelines.