On February 2, 2023, between 9:00 AM and 10:30 AM UTC, Atlassian customers using Bitbucket Cloud experienced degraded performance and website timeouts. The event was triggered by a feature flag configuration that created an unexpected load on core infrastructure. Customers across all regions were affected. The incident was detected within 10 minutes by automatic monitoring and mitigated by disabling the feature flag and scaling up the number of website servers, which put Atlassian systems into a known good state. The total time to resolution was about one hour and 30 minutes.
The impact on Bitbucket Cloud lasted from 9:00 AM to 10:30 AM UTC on February 2, 2023. The incident disrupted service for customers in all regions, who experienced slow website response times and website requests timing out. This affected users' ability to use Bitbucket Cloud through the web user interface, including functions such as viewing and merging pull requests, viewing commits and files in Git repositories, and viewing the dashboard.
The issue occurred when we enabled a feature flag whose evaluation cost was higher than anticipated. The feature flag was only enabled for an internal team, but the method used to evaluate that condition increased the load on each server. That increased load slowed request times, and because we were operating at our peak listed server capacity, we ultimately ended up queuing requests, which is when customers started seeing timeouts and errors instead of just slow requests. Manually scaling up the servers allowed us to spread that load out and serve requests without queuing until we found the cause and disabled the inefficient evaluation of the feature flag.
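The mechanism described above can be illustrated with a rough capacity calculation. This is only a sketch: the server counts, worker counts, request rate, and timings are hypothetical values chosen for illustration, not figures from the incident, and the function names are invented for this example. It shows how a small per-request overhead can push a fleet that is already near capacity into queuing, and how adding servers removes the backlog growth.

```python
def capacity_rps(servers: int, workers_per_server: int, service_time_ms: int) -> float:
    """Maximum sustainable requests per second for a fixed pool of workers."""
    return servers * workers_per_server * 1000 / service_time_ms


def queue_growth_rps(arrival_rps: float, capacity: float) -> float:
    """Rate at which the request backlog grows; zero when capacity covers arrivals."""
    return max(0.0, arrival_rps - capacity)


# All numbers below are hypothetical and exist only to illustrate the mechanism.
ARRIVAL_RPS = 900.0     # incoming requests per second at peak
BASE_SERVICE_MS = 100   # per-request time before the flag was enabled
FLAG_OVERHEAD_MS = 25   # assumed extra time spent evaluating the flag

# Before the flag: capacity exceeds arrivals, so nothing queues.
normal = capacity_rps(10, 10, BASE_SERVICE_MS)

# Flag enabled: every request is slower, so the same fleet serves fewer per second.
degraded = capacity_rps(10, 10, BASE_SERVICE_MS + FLAG_OVERHEAD_MS)

# Scaled up: more servers restore headroom even with the overhead still present.
scaled = capacity_rps(15, 10, BASE_SERVICE_MS + FLAG_OVERHEAD_MS)

for label, cap in [("normal", normal), ("degraded", degraded), ("scaled", scaled)]:
    print(f"{label}: capacity={cap:.0f} rps, backlog growth={queue_growth_rps(ARRIVAL_RPS, cap):.0f} rps")
```

In this toy model the overhead is paid by every request, regardless of whether the flag is enabled for the requesting user, which mirrors how a flag targeted at an internal team could still affect all traffic.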
We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the change increased the website's CPU load in a way that was not picked up by our automated continuous deployment suites and manual test scripts.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Atlassian Customer Support