Degraded Pipelines performance
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On December 16, 2022, between 10:07 and 15:40 UTC, Atlassian's internal artifact management infrastructure experienced an outage. Some Atlassian customers using Bitbucket Pipelines, using Marketplace apps and integrations, or managing attachments across Atlassian cloud products were impacted. Bitbucket users experienced failures of Bitbucket Pipelines builds and Pipelines runners, users of Marketplace apps and integrations experienced missing webhooks, and actions on attachments (especially uploads) were impacted across all our products. The event was triggered by Atlassian's internal Artifact Repository Manager becoming unavailable due to a combination of abnormally high load and misconfigured rate limiting and circuit breaking. Customers in all regions were impacted. The incident was immediately detected by our monitoring systems and mitigated by changing policies and configuration, which allowed the Artifact Repository Manager to recover. The total time to resolution was about five hours and 33 minutes.

IMPACT

We detected the impact of this incident on December 16, 2022, at 10:07 UTC. Recovery started at 11:00 UTC, most functionality was restored by 12:12 UTC, and full recovery was achieved at 15:40 UTC.

Below is the breakdown of the impact for each product.

Marketplace apps and integrations:

  • Webhooks were intermittently impacted for approximately two hours (10:07 UTC - 12:07 UTC).
  • Between 13:45 UTC and 15:40 UTC, we observed a significantly increased failure rate and a drop in traffic for app discoverability (apps disappearing from product UIs), app execution (app executions failing), and developer capabilities (app deployment and management).

Bitbucket:

  • Bitbucket Pipelines rely heavily on our internal artifact management infrastructure to fetch the necessary container images during builds and were impacted for the duration of the incident.
  • We detected approximately 60,000 build step failures in the affected period.

All Atlassian cloud products:

  • Our internal attachment management infrastructure encountered scaling issues, which intermittently affected the ability to upload and manage attachments for approximately two hours.
  • Users of Confluence, Jira Service Management, and Jira Software were unable to view or upload attachments.

ROOT CAUSE

The issue was caused by an outage of Atlassian's internal Artifact Repository Manager due to a combination of abnormally high load and misconfigured rate limiting and circuit breaking. As a result, the products listed above could not access the Docker images and other artifacts they needed to scale up, which caused partial degradation of some services and complete unavailability of others for customers. Restarting the internal Artifact Repository Manager and changing its policies and configuration caused downtime to the service but led to a successful recovery.
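
For readers unfamiliar with circuit breaking, the following is a minimal, generic sketch of the pattern in Python. It is illustrative only and does not represent Atlassian's implementation or configuration; the class name, thresholds, and timeout values are all hypothetical. The idea is that after a number of consecutive failures, a client stops sending requests to an overloaded backend for a cooldown period, which gives the backend room to recover.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the backend."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed value, for illustration
        self.reset_timeout = reset_timeout          # cooldown in seconds, assumed value
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow one trial call
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("backend marked unavailable; failing fast")
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to the backend, for example `breaker.call(fetch_image)` where `fetch_image` is a hypothetical zero-argument function, and treat `CircuitOpenError` as a signal to back off or use a cached copy rather than retry immediately.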

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid the impact of this kind of outage in the future.

We are prioritizing the following improvement actions to minimize the likelihood of incidents of this type recurring:

  • Build redundancy into the artifact management system to increase resiliency.
  • Make changes to our internal infrastructure and affected systems to improve resiliency.
  • Improve our internal documentation for incident resolution.
  • Tune alerting to improve the detection and time to resolution (TTR) of similar incidents in the future.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jan 20, 2023 - 04:12 UTC

Resolved
The issue has been resolved and the service is operating normally.
Posted Dec 16, 2022 - 10:52 UTC
Monitoring
We have identified the root cause and have mitigated the problem.
Posted Dec 16, 2022 - 10:52 UTC
Investigating
We are investigating issues with pulling images with Bitbucket Pipelines. We will provide more details within the next hour.
Posted Dec 16, 2022 - 10:28 UTC
This incident affected: Pipelines.