On Tuesday, January 9th, 2018, we experienced an incident with Bitbucket Cloud that resulted in service degradation for our users, including two hours of repository unavailability. The incident was triggered by a disk failure in our storage layer, followed by a resource-intensive, automatic repair process during the peak traffic window. No data was lost, but the repair operation caused long response times from our storage system, which led to client connections being queued and ultimately dropped once the connection queue limit was exceeded. Customers experienced intermittent request failures during the height of the incident, which lasted for several hours on January 9th.
Disk failure is a regular occurrence in any large storage system, and a single failed disk is normally a non-event with no visible impact. In this case, it was not.
The incident began at 10:55 UTC on January 9th, when Bitbucket Cloud started seeing increased response times across all services. No alerts were triggered at that point because the service was still operating within predefined limits and without backlogs. 12:00 UTC is one of our highest-traffic periods of the day, and the increase in traffic put more stress on the storage system. We began receiving alerts for dropped requests, and users began to experience slower response times as backlogs grew.
During the initial triage, common candidates such as recent deployments, database latency, and unexpected usage patterns were eliminated, leaving filesystem latency as the true culprit. The investigation into our filesystem nodes revealed that a disk array rebuild process had started and was impacting service response times, leading to backlogging and load shedding.
During this incident, the bitbucket.org website and our Git and Mercurial services were largely unavailable for a period of 180 minutes on January 9th. A large percentage of client connections were queued and/or dropped during additional 90-minute periods of service degradation before and after that interruption window. The storage system repair operation consumed a large amount of I/O resources and could not be paused, as it was a system-critical operation. The net effect was that the high response times from the affected part of our storage system created a pile-up of connections for all clients, leading to overall service degradation.
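To make the pile-up mechanism concrete, here is a minimal sketch (not our production code, and with made-up numbers such as QUEUE_LIMIT and the per-tick rates) of how a bounded connection queue behaves when the backend slows down: once in-flight requests stall, the backlog fills and new connections are dropped.

```python
# Minimal sketch of a bounded connection backlog; all limits and rates are
# illustrative, not Bitbucket's actual configuration.
from collections import deque

QUEUE_LIMIT = 128  # hypothetical backlog limit


class ConnectionQueue:
    def __init__(self, limit=QUEUE_LIMIT):
        self.limit = limit
        self.pending = deque()

    def accept(self, conn_id):
        """Queue a new client connection, or drop it if the backlog is full."""
        if len(self.pending) >= self.limit:
            return False  # connection dropped -> client sees an error
        self.pending.append(conn_id)
        return True

    def drain(self, backend_capacity):
        """Hand off as many queued connections as the backend can currently serve."""
        served = 0
        while self.pending and served < backend_capacity:
            self.pending.popleft()
            served += 1
        return served


# When storage latency rises, the backend serves fewer connections per tick,
# the queue stays full, and an increasing share of accept() calls fail.
queue = ConnectionQueue()
dropped = 0
for tick in range(100):
    for conn in range(10):                  # steady arrival rate
        if not queue.accept((tick, conn)):
            dropped += 1
    queue.drain(backend_capacity=2)         # degraded backend throughput
print(f"dropped {dropped} connections")
```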
To alleviate the impact, we moved data off the affected storage node and onto well-performing alternative nodes. As we started the migration, we prioritized the volumes that were used most heavily by multiple services. To improve migration throughput and reduce the number of stalled client connections piling up in the connection queue, our team temporarily disabled access to repositories that resided on the storage segment under repair. To restore service as quickly as possible to customers whose repositories were blocked, we moved those repositories to different storage segments that were not in self-repair mode. Furthermore, we shed additional load off our systems by disabling non-critical services and operations.
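The kind of check involved in temporarily disabling access is sketched below. This is a hypothetical illustration, not Bitbucket's actual code: the names DEGRADED_SEGMENTS, segment_for, and serve_repository are assumptions made for the example. The idea is to fail fast with a clear "temporarily unavailable" response rather than let requests stall on slow storage and clog the connection queue.

```python
# Hedged sketch of a load-shedding check; names and the segment lookup are
# placeholders, not Bitbucket's real implementation.
DEGRADED_SEGMENTS = {"segment-07"}  # segments currently under repair/migration


def segment_for(repo_slug: str) -> str:
    """Placeholder lookup: map a repository to its storage segment."""
    return "segment-07" if hash(repo_slug) % 4 == 0 else "segment-01"


def serve_repository(repo_slug: str):
    segment = segment_for(repo_slug)
    if segment in DEGRADED_SEGMENTS:
        # Fail fast instead of letting the request stall on degraded storage.
        return 503, f"{repo_slug} is temporarily unavailable during storage maintenance"
    return 200, f"serving {repo_slug} from {segment}"


print(serve_repository("team/project"))
```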
The background repair operation took 5 days to complete due to the large size of the volumes and the concurrent load of client requests during this time. Ultimately, service to all customers was brought back to normal performance by moving hundreds of thousands of repositories to other storage nodes.
In parallel, we opened a support issue with our storage vendor and began making system and network configuration changes to further reduce the impact of the repair operation on our customers. Our team continued to work around the clock for several days after the initial outage to mitigate the effects of the storage system repair process and to ensure that, during peak traffic windows, the majority of customers could continue working in Bitbucket Cloud with steadily improving performance.
In line with our company values and a spirit of transparency, we kept our users up to date with regular posts to our Statuspage and by responding to support requests that were opened via our support service desk.
What is being done to prevent this in the future?
Was any data lost during the incident?
No. The actions performed and the repair job itself were non-destructive in nature.
To the people affected by the outage, we apologize. We know that you rely on Bitbucket Cloud to keep your teams working and your businesses running, and this incident was disruptive. Our systems and processes are designed to balance customer traffic with behind-the-scenes, automatic recovery from faults. However, in this case, we encountered a new performance impact from what would normally be a common storage system fault. We will continue to incorporate what we've learned from this event into the design and implementation of our systems and processes in the future.