Increased latency in user and content search across products
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On Thursday, June 24, 2021, between 8:00 AM and 11:28 AM UTC, customers of the Jira family of products, Confluence and Bitbucket were unable to search for users or select entries in user pickers. Confluence customers were also unable to search for content, or experienced very slow loading of search results.

The event was triggered by a change rolled out for Bitbucket, which introduced pre-fetching of users to provide recommendations for approvers in new pull requests. The change produced a high volume of queries that were not optimized for our search infrastructure and overwhelmed it. This impacted customers of those products whose requests are routed to our US East region because of their geographic proximity to it. The incident was detected within 42 minutes by our automated monitoring system and mitigated by identifying and rolling back the change that triggered the event, which returned Atlassian systems to a known good state. The total time to resolution was about 3 hours and 28 minutes.

IMPACT

The incident lasted from 8:00 AM to 11:28 AM UTC on Thursday, June 24, 2021, and affected the Jira family of products, Confluence and Bitbucket. It disrupted service only for customers whose requests are routed to our US East region because of their geographic proximity to it; these customers could not search for users or content. The product-specific impact was as follows:

  • Jira family of products: users were not able to search for users and select entries via user pickers.
  • Bitbucket: users were not able to search for users and select entries via user pickers.
  • Confluence: users were not able to search for users and select entries via user pickers. In addition, customers were not able to search for content or experienced very slow loading times for search results to appear.

ROOT CAUSE

The issue was triggered by a change being progressively rolled out in Bitbucket to add user recommendations for approvers in new pull requests. As a result of the change, the products mentioned above could not reach the search infrastructure for the purpose of user and group lookups.

User lookup powers user search and, by extension, user pickers across the Jira family of products, Confluence and Bitbucket. Because the search infrastructure was unreachable, those requests timed out and eventually failed; customers could not see results for their user searches or items to select in user pickers.

Group lookup is a dependency for content search in Confluence, so those requests also timed out and eventually failed; customers could not see results for their content searches, or experienced very slow loading of search results.

The root cause of the incident was that the search infrastructure in our US East region was overwhelmed by a high volume of non-optimized queries introduced by the rollout of changes to Bitbucket. As those changes progressively propagated to customers whose user and content search queries were routed to US East because of their geographic proximity to it, the resource consumption generated by the resulting queries eventually reached a point where the infrastructure could no longer process requests and started failing.

An attempt to reroute search infrastructure traffic to the next closest US region did not improve the situation. Once the source of the queries was fully identified and the changes were rolled back, the infrastructure quickly recovered and our systems reached a known good state, resolving the incident.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because the change was related to a very specific type of query reaching our search infrastructure at scale. Its impact was not picked up by our automated continuous deployment suites and manual test scripts.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Improving the resilience of our search infrastructure by better isolating traffic from specific products.
  • Improving our internal documentation and processes to include the necessary steps to ensure that new queries from users are optimized at scale for our search infrastructure.
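
As an illustration only, the first action above (isolating traffic from specific products) can be sketched as a per-product concurrency budget, often called a bulkhead. This is not Atlassian's implementation: the class name, the product keys and the limits are all hypothetical, and a real search tier would enforce such limits at a proxy or gateway rather than in-process.

```python
import threading
from contextlib import contextmanager


class ProductBulkhead:
    """Caps in-flight search queries per calling product (hypothetical design)."""

    def __init__(self, limits):
        # limits: product name -> maximum number of concurrent queries (assumed values)
        self._slots = {
            name: threading.BoundedSemaphore(n) for name, n in limits.items()
        }

    @contextmanager
    def slot(self, product):
        sem = self._slots[product]
        # Non-blocking acquire: reject at once instead of queueing, so a surge
        # from one product cannot consume capacity reserved for the others.
        if not sem.acquire(blocking=False):
            raise RuntimeError(f"search budget exhausted for {product}")
        try:
            yield
        finally:
            sem.release()


def demo():
    bulkhead = ProductBulkhead({"bitbucket": 2, "confluence": 2})
    outcomes = []
    with bulkhead.slot("bitbucket"), bulkhead.slot("bitbucket"):
        # Bitbucket has used its whole budget, so a third query is rejected.
        try:
            with bulkhead.slot("bitbucket"):
                outcomes.append("bitbucket accepted")
        except RuntimeError:
            outcomes.append("bitbucket rejected")
        # Confluence's budget is untouched and its query still goes through.
        with bulkhead.slot("confluence"):
            outcomes.append("confluence accepted")
    return outcomes
```

With a budget like this, a surge of queries from one product is rejected early while user and content searches from the other products keep their share of capacity.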

We deploy our changes progressively (by cloud region) to avoid broad impact. However, in this case, our detection did not work as expected. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as:

  • Introducing dedicated monitoring of the volume and impact of non-optimized queries in order to reduce detection time.
  • Improving our internal incident response documentation to promptly identify and isolate the source of non-optimized queries in order to reduce resolution time.
  • Improving our monitoring of resource usage so that alerts take long-term usage trends into account and trigger at the appropriate severity level, allowing us to anticipate customer impact ahead of time.
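
As a rough illustration of the trend-based alerting described in the last item, the sketch below fits a straight line to recent resource usage samples and estimates how long remains before a threshold is crossed. The function name, the sample format and the 90% threshold are assumptions made for this example; they do not describe Atlassian's monitoring stack.

```python
def hours_until_threshold(samples, threshold):
    """Project when usage will cross `threshold` from a least-squares trend.

    samples: list of (hour, usage_percent) pairs, oldest first (assumed format).
    Returns hours remaining after the last sample, or None if usage is not rising.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Ordinary least-squares slope and intercept of usage over time.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or falling usage never reaches the threshold
    intercept = mean_u - slope * mean_t
    crossing_hour = (threshold - intercept) / slope
    # Clamp at zero when the threshold has already been passed.
    return max(0.0, crossing_hour - samples[-1][0])


# Example: usage rising 5 points per hour from 50%, alert threshold at 90%.
usage = [(0, 50.0), (1, 55.0), (2, 60.0), (3, 65.0)]
remaining = hours_until_threshold(usage, 90.0)
```

An alert could then be raised at a lower severity when the projected time remaining falls below a chosen window, before customers see any impact.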

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve our platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jul 01, 2021 - 15:28 UTC

Resolved
Between 09:50 UTC and 11:25 UTC, we experienced degraded search performance and related timeouts for Confluence, Jira Software, and Atlassian Bitbucket. The issue has been resolved and the service is operating normally.
Posted Jun 24, 2021 - 11:25 UTC
Identified
We continue to work on resolving the user and content search degradation for Confluence, Jira Software, and Atlassian Bitbucket. We have identified the root cause and expect recovery shortly.
Posted Jun 24, 2021 - 11:04 UTC
Investigating
We are investigating cases of degraded performance for user and content search functionality for Confluence, Jira Software, and Atlassian Bitbucket Cloud customers. We will provide more details within the next hour.
Posted Jun 24, 2021 - 09:50 UTC