Website is slow

Incident Report for Atlassian Bitbucket

Postmortem

SUMMARY

On February 2, 2023, between 9:00 AM and 10:30 AM UTC, Atlassian customers using Bitbucket Cloud experienced degraded website performance and timeouts. The event was triggered by a feature flag configuration that created an unexpected load on core infrastructure. Customers across all regions were affected. The incident was detected within 10 minutes by automatic monitoring and mitigated by disabling the feature flag and scaling up the number of website servers, which returned Atlassian systems to a known good state. The total time to resolution was about one hour and 30 minutes.

IMPACT

The impact occurred between 9:00 AM and 10:30 AM UTC on February 2, 2023, on Bitbucket Cloud. The incident disrupted service for customers in all regions, who experienced slow website response times and website requests timing out. This affected users' ability to use Bitbucket Cloud through the web user interface, including functions such as viewing and merging pull requests, viewing commits and files in Git repositories, and viewing the dashboard.

ROOT CAUSE

The issue occurred when a feature flag with a higher-than-anticipated evaluation cost was enabled. The feature flag was only enabled for an internal team, but the method used to evaluate that condition increased the load on each server. That increased load slowed request times. Because we were already operating at our peak listed server capacity, requests ultimately began to queue, which is when customers started seeing timeouts and errors instead of just slow requests. Manually scaling up the servers allowed us to spread that load out and serve requests without queuing until we found the cause and disabled the inefficient evaluation of the feature flag.
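To illustrate why a small per-request cost can tip a fleet running at capacity into queuing, here is a minimal sketch of the arithmetic. All numbers, and the `utilization` helper itself, are hypothetical and are not taken from the incident; the sketch only assumes each server handles one request at a time.

```python
def utilization(requests_per_sec: float, service_time_sec: float, servers: int) -> float:
    """Fraction of fleet capacity in use; values above 1.0 mean requests must queue."""
    # Each server finishes 1 / service_time_sec requests per second,
    # so demand divided by fleet capacity simplifies to this product.
    return requests_per_sec * service_time_sec / servers


# Hypothetical figures chosen only to show the shape of the problem.
baseline = utilization(requests_per_sec=900, service_time_sec=0.10, servers=100)
with_flag = utilization(requests_per_sec=900, service_time_sec=0.13, servers=100)
scaled_up = utilization(requests_per_sec=900, service_time_sec=0.13, servers=150)

print(f"baseline:  {baseline:.2f}")   # below 1.0: requests are served as they arrive
print(f"with flag: {with_flag:.2f}")  # above 1.0: requests queue and eventually time out
print(f"scaled up: {scaled_up:.2f}")  # below 1.0 again once servers are added
```

In this sketch, a 30 ms increase in per-request time is enough to exceed capacity when the server count cannot grow, and adding servers brings utilization back under 1.0 without changing the per-request cost.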

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the change increased the website's CPU load in a way that was not picked up by our automated continuous deployment suites and manual test scripts.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

We have changed how we evaluate feature flag configuration so that high-volume feature flag evaluation doesn't increase the load on the website service.
We have doubled the maximum available capacity to allow website auto-scaling to absorb additional load caused by any performance-impacting changes.
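The report does not describe how feature flag evaluation was changed. As one common approach, offered as an assumption rather than a description of the actual fix, an expensive flag condition can be cached for a short time so that it runs at most once per key per interval instead of on every request. The names `CachedFlag` and `expensive_team_check` below are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFlag:
    """Caches the result of a costly flag check per key for ttl_sec seconds."""

    def __init__(self, evaluate: Callable[[str], bool], ttl_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._evaluate = evaluate
        self._ttl = ttl_sec
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bool]] = {}

    def is_enabled(self, key: str) -> bool:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # fresh cached answer, no expensive work
        value = self._evaluate(key)  # the costly check runs only on a miss or expiry
        self._cache[key] = (now, value)
        return value


# Hypothetical stand-in for a costly "is this user on the internal team?" check.
call_count = {"n": 0}


def expensive_team_check(user_id: str) -> bool:
    call_count["n"] += 1
    return user_id in {"internal-dev-1"}


flag = CachedFlag(expensive_team_check, ttl_sec=30.0)
results = [flag.is_enabled("internal-dev-1") for _ in range(1000)]
print(call_count["n"])  # the expensive check ran once for 1000 lookups
```

This sketch is not thread-safe and lets a flag change take up to `ttl_sec` to become visible, which is the usual trade-off for keeping evaluation cost off the request path.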

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Feb 07, 2023 - 21:27 UTC

Resolved

The issue has been resolved and the service is operating normally.

Posted Feb 02, 2023 - 12:53 UTC

Monitoring

We have identified the root cause of the slowness and have mitigated the problem. We are now monitoring closely.

Posted Feb 02, 2023 - 10:43 UTC

Investigating

We are investigating reports of intermittent errors for our Atlassian Bitbucket Cloud customers. We will provide more details once we identify the root cause.

Posted Feb 02, 2023 - 09:38 UTC

This incident affected: Website.