On Feb 05, 2021, between 00:10 and 00:21 UTC, Bitbucket Cloud customers were unable to access their repositories over SSH for approximately 11 minutes. About 7 hours later, on Feb 05, 2021, between 07:10 and 09:10 UTC, customers experienced degraded performance from Bitbucket's website and APIs. Both events were triggered by a production deployment that left load balancer pools with too little capacity to handle full production traffic. These load balancer pools process traffic to bitbucket.org and api.bitbucket.org, so this incident impacted customers in all regions. The SSH incident was detected within two minutes by our automated systems and mitigated by restoring the load balancer pools responsible for routing SSH traffic, which made our SSH service fully operational. However, the load balancer pools for website and API traffic were not restored until several hours later, when automated alerts detected degraded performance of these services. The total time to resolution was about 11 minutes for the SSH incident and about two hours for the incident causing degraded website and API performance.
These issues were caused by an aborted production deployment that left load balancer pools partially drained, with diminished capacity to serve requests. Initially, the only noticeable impact was on SSH services, where elevated error rates caused commands such as git pull and git push over SSH to fail. After restoring service for SSH operations in approximately 11 minutes, the engineering team repeated the production deployment and validated that all services were healthy. Because website and API traffic was low at the time, these services remained healthy even with partial capacity in the load balancer pools, and the team did not notice that these pools had not been restored to full production capacity. Only when customer traffic to these services increased organically several hours later did performance degrade enough to trigger automated alerts.
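This incident resulted from three contributing causes: an unexpected error that caused the production deployment to fail, deployment tooling that left load balancer pools at reduced capacity after the deployment was aborted, and a gap in our alerting on load balancer pool capacity.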
We are taking immediate steps to address these causes and improve the performance and availability of Bitbucket Cloud.
Much of Bitbucket's monitoring and alerting is designed to measure customer impact such as error rates and response times. Where possible, we also implement measures to detect infrastructural issues that have no immediate impact but may cause eventual impact if left unaddressed. This incident surfaced a gap in our alerting belonging to the second category. Since resolving the issue, our team has established an automated alert in case our load balancer pools reach a level below their expected capacity.
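To illustrate the kind of capacity check described above, here is a minimal sketch in Python. It is not Bitbucket's actual alerting; the PoolStatus type, its fields, and the pools_below_capacity function are hypothetical names chosen for illustration, and a real alert would read member counts from the load balancer rather than from hand-built objects.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PoolStatus:
    """Hypothetical snapshot of one load balancer pool."""
    name: str
    expected_members: int  # members the pool should have at full capacity
    active_members: int    # members currently serving traffic


def pools_below_capacity(pools: Iterable[PoolStatus],
                         min_ratio: float = 1.0) -> List[str]:
    """Return the names of pools whose active members fall below
    min_ratio of their expected members.

    This checks capacity directly, so it can fire even while error
    rates and response times still look healthy under low traffic.
    """
    return [
        p.name
        for p in pools
        if p.active_members < p.expected_members * min_ratio
    ]


# Example: the SSH pool is fully restored, the website pool is not.
snapshot = [
    PoolStatus(name="ssh", expected_members=10, active_members=10),
    PoolStatus(name="website", expected_members=10, active_members=6),
]
degraded = pools_below_capacity(snapshot)
```

A check of this shape depends only on pool membership, not on traffic volume, which is what distinguishes it from alerts based on customer impact.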
The team is also making changes to our tooling for orchestrating production deployments to prevent an aborted or failed production deployment from leaving load balancer pools at reduced capacity in the future. This will ensure the system is more resilient to unexpected or transient issues that may disrupt a deployment.
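One common way to get this property, shown here only as a sketch and not as a description of our deployment tooling, is to make the restore step unconditional so that it runs whether the deployment succeeds, fails, or is aborted. The example models a pool as a Python set of member names; the drained helper and the member names are hypothetical.

```python
from contextlib import contextmanager
from typing import Iterable, Iterator, List, Set


@contextmanager
def drained(pool: Set[str], members: Iterable[str]) -> Iterator[List[str]]:
    """Temporarily remove members from a pool and always put them back.

    The pool is a plain set standing in for a real load balancer pool
    (an assumption made for illustration).
    """
    removed = [m for m in members if m in pool]
    for m in removed:
        pool.remove(m)
    try:
        yield removed
    finally:
        # Runs on success, on failure, and on abort, so the pool
        # cannot be left partially drained by this code path.
        for m in removed:
            pool.add(m)


# Example: a deployment step fails while two members are drained.
pool = {"node-a", "node-b", "node-c"}
try:
    with drained(pool, ["node-a", "node-b"]):
        raise RuntimeError("simulated deployment failure")
except RuntimeError:
    pass
# pool is back to all three members here
```

A safeguard like this covers failures inside the process; a deployment process that is killed outright would still need an external check on pool capacity.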
Finally, we are implementing another set of changes to our deployment tooling to account for the unexpected error that caused the deployment to fail preceding this incident.
We know that outages are impactful to your productivity. We apologize to customers whose services were impacted during this incident.
Atlassian Customer Support