Degraded Bitbucket web and API performance

Incident Report for Atlassian Bitbucket

Postmortem

SUMMARY

On JUL 13, 2023, between 13:48 and 16:05 UTC, Atlassian customers using Bitbucket Cloud experienced degraded performance for git operations. The event was triggered by a change deployed to production that contained a bug. The bug bypassed a critical cache for git operations, which impacted all Bitbucket Cloud customers. The incident was detected within 10 minutes by automated monitoring and mitigated by rolling back the code changes and redeploying some services, which returned Atlassian systems to a known good state. The total time to resolution was about 2 hours and 17 minutes.

IMPACT

Bitbucket Cloud products were impacted between JUL 13, 2023, 13:48 UTC and JUL 13, 2023, 16:05 UTC. The incident disrupted service for all customers, who experienced slow response times or failures when interacting with repository data. Because network saturation caused connections to queue, customers also experienced increased latency and error rates across Bitbucket Cloud services.

ROOT CAUSE

The issue was caused by a feature flag change containing a bug that bypassed a critical cache. As a result, more requests accessed the disks directly, which increased latency and eventually degraded the performance of some operations.
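For readers who want a concrete picture of this failure mode, here is a minimal sketch. It is an illustration only, not Bitbucket code: the `Disk` and `CachedReader` classes, the `bypass_cache` flag, and the example key are all hypothetical. It shows how a single flag that skips the cache turns every request into a disk read.

```python
class Disk:
    """Hypothetical backing store that counts how often it is read."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


class CachedReader:
    """Read-through cache in front of the disk (illustrative only)."""

    def __init__(self, disk, bypass_cache=False):
        self.disk = disk
        self.bypass_cache = bypass_cache  # hypothetical feature flag
        self.cache = {}

    def get(self, key):
        if self.bypass_cache:
            # Flag set: every request goes straight to the disk.
            return self.disk.read(key)
        if key not in self.cache:
            # Cache miss: read the disk once and remember the result.
            self.cache[key] = self.disk.read(key)
        return self.cache[key]


def disk_reads(bypass_cache, requests=1000):
    """Return the number of disk reads caused by repeated identical requests."""
    disk = Disk({"refs/heads/main": "abc123"})
    reader = CachedReader(disk, bypass_cache=bypass_cache)
    for _ in range(requests):
        reader.get("refs/heads/main")
    return disk.reads
```

In this toy model, 1000 identical requests cause one disk read with the cache in place and 1000 disk reads with the flag set. Under light load the difference can go unnoticed, which is consistent with an impact that only surfaces once traffic is high enough.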

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the impact was subtle and took approximately 48 hours to surface after the initial deployment. This slow ramp-up was not picked up by our automated continuous test scripts because significant load is required to reach the tipping point where systems start to degrade.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Alerting on cache hit/miss rates
Improving the monitoring of network saturation for connections to our disks
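
As a rough sketch of what alerting on cache hit/miss rates could look like, the example below tracks the hit rate over a sliding window of recent lookups and flags when it falls below a threshold. The `HitRateAlert` class, the window size, and the 90% threshold are assumptions chosen for illustration, not a description of the alerting actually being deployed.

```python
from collections import deque


class HitRateAlert:
    """Tracks cache hits and misses over the most recent `window` lookups."""

    def __init__(self, window=1000, min_hit_rate=0.90, min_samples=100):
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)  # True = hit, False = miss
        self.min_hit_rate = min_hit_rate     # assumed threshold
        self.min_samples = min_samples       # avoid alerting on too little data

    def record(self, hit):
        """Record the outcome of one cache lookup."""
        self.samples.append(bool(hit))

    def hit_rate(self):
        """Return the hit rate over the window, or None if no data yet."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)

    def should_alert(self):
        """True when enough samples exist and the hit rate is below threshold."""
        if len(self.samples) < self.min_samples:
            return False
        return self.hit_rate() < self.min_hit_rate
```

A caller would invoke `record(True)` or `record(False)` on each cache lookup and poll `should_alert()` periodically. A rate-based signal like this does not depend on load reaching a tipping point, so it could surface a bypassed cache before latency degrades.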

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jul 21, 2023 - 15:56 UTC

Resolved

This incident has been resolved.

Posted Jul 13, 2023 - 16:46 UTC

Monitoring

We've identified the cause of the issue and removed the change that was responsible.

Posted Jul 13, 2023 - 16:20 UTC

Investigating

We are currently investigating this issue.

Posted Jul 13, 2023 - 14:33 UTC

This incident affected: Website, API, Git via SSH, Git via HTTPS, and Pipelines.