Bitbucket has degraded performance

Incident Report for Atlassian Bitbucket

Postmortem

Summary

On May 8, 2025, at 3:26 PM UTC, Bitbucket Cloud experienced website and API latency due to an overloaded primary database. The event was caused by a backfill job running from an internal Atlassian service, which triggered an excessive volume of expensive queries and put pressure on database resources.

As a result, the primary database automatically failed over, and Bitbucket services recovered in 15 minutes. Our real-time monitoring detected the incident immediately, and the high-intensity backfill job was stopped. However, following the failover, a backlog of retries from downstream services continued to impact overall database performance. Customers may have seen intermittent errors or website latency during this time.

In the period following the failover, the engineering team implemented several strategies to further shed database load, successfully alleviating pressure on resources and improving performance. On May 9 at 11:19 AM UTC, Bitbucket Cloud systems were fully operational.

Impact

The impact on Bitbucket Cloud occurred between May 8, 2025, at 3:26 PM UTC and May 9, 2025, at 11:19 AM UTC. The incident resulted in increased latency and intermittent failures across Bitbucket Cloud services, including the website, API, and Bitbucket Pipelines.

Root cause

The issue was caused by an internal high-scale backfill job that triggered excessive load on certain API endpoints, which in turn overloaded the database with resource-intensive queries and operations. Failed requests were then retried by dependent services, adding further load and increasing the total recovery time.
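To make the retry effect concrete, here is a hypothetical back-of-the-envelope illustration (the figures below are assumptions chosen for illustration, not measurements from this incident): when most calls fail and each failure is retried immediately a few times, the query volume reaching an already-degraded database multiplies.

    # Hypothetical figures for illustration only (not incident data).
    original_rps = 1_000        # assumed steady-state request rate
    retries_per_failure = 3     # assumed immediate-retry policy in downstream services
    failure_rate = 0.9          # assume most calls fail while the database is overloaded

    # Each failing request fans out into additional retry queries.
    effective_rps = original_rps * (1 + failure_rate * retries_per_failure)
    print(f"~{effective_rps:.0f} requests/s reach the database instead of {original_rps}")
    # -> ~3700 requests/s, nearly quadrupling the load on a database that is already struggling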

Remedial action plan and next steps

We know that outages impact your productivity. While we have several testing and preventative processes in place, this issue was not identified during testing because it stemmed from a specific high-scale backfill job run by an internal component, which triggered highly resource-intensive database queries.

To prevent this type of incident from recurring, we are prioritizing the following improvement actions:

  • Improve database request routing so that more reads go to read replicas instead of the write-primary database.
  • Adjust rate limits for internal API endpoints with resource-intensive database operations.
  • Optimize database queries so that they can run more efficiently.
  • Tune retry policies in downstream services so that failed requests back off instead of compounding the load (see the sketch after this list).
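As an illustration of the last point, the sketch below shows one common shape for such a retry policy: capped exponential backoff with full jitter and a bounded attempt budget. It is a minimal, hypothetical example; the function names, exception type, and parameter values are illustrative assumptions, not Bitbucket's actual implementation.

    import random
    import time


    class TransientError(Exception):
        """Stand-in for whatever exception type marks a retryable failure."""


    # Illustrative defaults only; real values would be tuned per service.
    BASE_DELAY_S = 0.5    # delay cap before the first retry
    MAX_DELAY_S = 30.0    # ceiling so waits stay bounded
    MAX_ATTEMPTS = 4      # total attempts, including the initial call


    def call_with_backoff(request_fn):
        """Call request_fn, retrying transient failures with capped, jittered backoff."""
        for attempt in range(MAX_ATTEMPTS):
            try:
                return request_fn()
            except TransientError:
                if attempt == MAX_ATTEMPTS - 1:
                    raise  # retry budget exhausted; surface the error
                # Full jitter: sleep a random amount up to the exponential cap,
                # which spreads retries out instead of synchronizing them.
                cap = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
                time.sleep(random.uniform(0.0, cap))

The key property is that retries spread out and eventually give up, so a recovering database sees a tapering trickle of requests rather than a synchronized retry storm.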

We apologize to customers whose services were interrupted by this incident, and we are taking immediate steps to improve the platform’s reliability.

Thanks,

Atlassian Customer Support

Posted May 19, 2025 - 16:43 UTC

Resolved

This incident has been resolved.
Posted May 09, 2025 - 11:17 UTC

Update

We are continuing to monitor for any further issues.
Posted May 09, 2025 - 07:20 UTC

Update

We are continuing to monitor for any further issues.
Posted May 09, 2025 - 04:33 UTC

Update

We are continuing to monitor for any further issues.
Posted May 09, 2025 - 01:21 UTC

Update

We are continuing to monitor for any further issues.
Posted May 08, 2025 - 23:25 UTC

Update

We are continuing to monitor for any further issues.
Posted May 08, 2025 - 21:11 UTC

Update

We are continuing to monitor for any further issues.
Posted May 08, 2025 - 18:48 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 08, 2025 - 15:57 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted May 08, 2025 - 15:45 UTC

Investigating

We are currently investigating this issue.
Posted May 08, 2025 - 15:41 UTC
This incident affected: Website, API, and Pipelines.