Bitbucket Cloud service degradation
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On March 11, 2024, between 20:29 UTC and 21:41 UTC, Atlassian customers using Bitbucket Cloud experienced degraded performance on its website and APIs. The impact was caused by an issue with Bitbucket’s database that saturated connection pools, increased response times, and caused a growing number of requests to time out completely.

IMPACT

Impacted customers experienced increased latency when accessing the bitbucket.org website and APIs for the duration of the incident. Git requests over HTTPS and SSH were also affected.

ROOT CAUSE

The incident was caused by a bug in the version of the database software in use. With Bitbucket’s query patterns, if certain processes do not run frequently enough, problems can eventually arise that result in poor query planner performance.

Because of this bug, our process configuration, which was previously tuned to our specific workload, is no longer effective. While we determine the appropriate tuning, we have implemented a system that triggers the process as soon as any issues are detected. We are confident this will prevent a repeat incident while we determine an appropriate threshold and cadence.
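As a rough illustration of a detect-and-trigger approach, the sketch below flags tables whose dead-row count looks unhealthy and builds the maintenance statements to run against them. It is not Atlassian's implementation. The report does not name the database or the detection signal; this sketch assumes PostgreSQL (suggested by the references to vacuuming), uses the dead-tuple ratio as a stand-in signal, and the table names, thresholds, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Per-table counters, e.g. as read from PostgreSQL's pg_stat_user_tables."""
    name: str         # relname
    live_tuples: int  # n_live_tup
    dead_tuples: int  # n_dead_tup


def tables_needing_vacuum(stats, max_dead_ratio=0.05, min_dead_tuples=10_000):
    """Return names of tables whose dead-row share exceeds the assumed limits.

    Both limits are illustrative assumptions, not values from the report.
    """
    flagged = []
    for table in stats:
        total = table.live_tuples + table.dead_tuples
        if total == 0:
            continue  # empty table, nothing to reclaim
        ratio = table.dead_tuples / total
        if table.dead_tuples >= min_dead_tuples and ratio >= max_dead_ratio:
            flagged.append(table.name)
    return flagged


def vacuum_statements(table_names):
    """Build one VACUUM (ANALYZE) statement per table, quoting each identifier."""
    statements = []
    for name in table_names:
        quoted = '"' + name.replace('"', '""') + '"'
        statements.append(f"VACUUM (ANALYZE) {quoted};")
    return statements


if __name__ == "__main__":
    # Hypothetical sample data; a real monitor would query the database.
    sample = [
        TableStats("repos", 1_000_000, 200_000),
        TableStats("small", 100, 50),
        TableStats("clean", 1_000_000, 10),
    ]
    for statement in vacuum_statements(tables_needing_vacuum(sample)):
        print(statement)
```

A monitor built this way would run on a short interval and execute the generated statements through its database client. Requiring both a minimum dead-row count and a minimum ratio keeps tiny tables from triggering maintenance needlessly.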

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We are prioritizing the following improvement actions to reduce recovery time, limit impact, and avoid repeating these types of incidents in the future:

  • Vacuuming immediately when the defect is detected.
  • Appropriately tuning autovacuum settings to meet the requirements of our workload.
  • Upgrading our database version as soon as the fix becomes available.
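To show why autovacuum tuning depends on workload, the sketch below computes how many dead rows must accumulate before autovacuum starts on a table. It assumes the database is PostgreSQL, which the report does not state. The formula and the default values (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2) are PostgreSQL's documented ones; the function name and the row counts are hypothetical.

```python
def autovacuum_trigger_point(reltuples, threshold=50, scale_factor=0.2):
    """Dead rows needed before autovacuum vacuums a table.

    Mirrors PostgreSQL's rule:
        autovacuum_vacuum_threshold
        + autovacuum_vacuum_scale_factor * (estimated rows in the table)
    """
    return threshold + scale_factor * reltuples


if __name__ == "__main__":
    rows = 100_000_000  # hypothetical large table
    # With the defaults, roughly 20 million dead rows must build up first.
    print(autovacuum_trigger_point(rows))
    # A smaller scale factor makes autovacuum start far earlier on big tables.
    print(autovacuum_trigger_point(rows, scale_factor=0.01))
```

Because the trigger point scales with table size, defaults that suit small tables can leave very large tables unvacuumed for long stretches, which is one reason the settings are often adjusted per table.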

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Mar 20, 2024 - 18:55 UTC

Resolved
This incident has been resolved.
Posted Mar 11, 2024 - 21:56 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 11, 2024 - 21:43 UTC
Update
We have identified degraded performance and an increased error rate. The team is currently working to mitigate the issue.
Posted Mar 11, 2024 - 21:28 UTC
Investigating
We're investigating an issue with high resource utilisation on one of our databases. Customers may experience a degradation in performance while we work on a resolution.
Posted Mar 11, 2024 - 21:12 UTC
This incident affected: Website, Git via SSH, and Pipelines.