Pipelines failing to start due to docker-public.packages.atlassian.com outage

Incident Report for Atlassian Bitbucket

Postmortem

SUMMARY

On June 22, 2022, from 01:08 AM UTC to 03:42 AM UTC, some customers using Bitbucket Pipelines, Confluence Cloud, Forge, and the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management) were impacted. Bitbucket Pipelines saw an increase in build failures, while Jira, Confluence, and Forge experienced performance and functionality degradation. The event was triggered by our internal Artifact Repository Manager becoming unavailable during a scheduled multi-availability zone disaster recovery test. Customers across all regions were affected. The incident was detected within two minutes by monitoring and mitigated by restarting the Artifact Repository service, which recovered the affected products. The total time to resolution was about three hours.

IMPACT

The overall impact lasted from June 22, 2022, 01:08 AM UTC to June 22, 2022, 05:58 AM UTC and affected Bitbucket Pipelines, Confluence Cloud, Forge, and the Jira Cloud family of products (Jira Software, Jira Service Management, Jira Work Management). The outage of the internal Artifact Repository Manager caused scalability problems in these products and prevented us from building or deploying new versions of our services. As a result, performance and functionality were degraded for most of these products.

ROOT CAUSE

The issue was caused by an outage of the internal Artifact Repository Manager during the planned multi-availability zone disaster recovery test. As a result, the products listed above could not access the Docker images and other artifacts they need to scale up, which caused partial degradation or complete unavailability of services for some customers. Restarting the internal Artifact Repository Manager caused downtime for that service but led to a successful recovery.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages may impact your productivity. After the immediate impact of this outage was resolved, the incident response team completed a technical analysis of the root cause and contributing factors. The team has conducted a post-incident review to determine how we can avoid the impact of this kind of outage in the future.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

We raised a critical issue with the vendor who provides our Artifact Management software to improve the application's resilience to availability zone failures.
We are working on improving our disaster recovery plan to be able to mitigate such incidents faster.
We are reviewing our test strategies to be able to catch similar issues in the early stages.

To minimize the impact of such incidents on our customers, we will implement additional preventative measures such as:

Development of a redundant caching mechanism for our platform system to improve the scalability and reliability of our products.
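As an illustration of the idea only, and not a description of the mechanism we are building, a redundant cache can keep the last successfully fetched copy of each artifact and serve it when the upstream repository is unavailable. The following is a minimal Python sketch; the names FallbackArtifactCache, ArtifactUnavailable, and fetch_upstream are hypothetical.

```python
from typing import Callable, Dict


class ArtifactUnavailable(Exception):
    """Raised when an artifact is neither fetchable upstream nor cached."""


class FallbackArtifactCache:
    """Hypothetical read-through cache that serves the last good copy on failure."""

    def __init__(self, fetch_upstream: Callable[[str], bytes]) -> None:
        # fetch_upstream is an assumed callable that downloads an artifact by name
        # and raises an exception when the upstream repository is unreachable.
        self._fetch_upstream = fetch_upstream
        self._store: Dict[str, bytes] = {}

    def get(self, name: str) -> bytes:
        try:
            data = self._fetch_upstream(name)
        except Exception as exc:
            # Upstream is down: fall back to the last known good copy, if any.
            if name in self._store:
                return self._store[name]
            raise ArtifactUnavailable(name) from exc
        # Upstream is healthy: refresh the cached copy and return fresh data.
        self._store[name] = data
        return data
```

In this sketch, services that had already pulled an artifact could still scale up during an upstream outage, while requests for artifacts that were never cached would still fail.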

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jul 07, 2022 - 06:35 UTC

Resolved

This incident has been resolved.

Posted Jun 22, 2022 - 04:26 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 22, 2022 - 04:11 UTC

Update

We’ve restored service for cloud-hosted pipelines; however, self-hosted runners are still impacted.

Posted Jun 22, 2022 - 03:30 UTC

Investigating

An outage of docker-public.packages.atlassian.com is causing pipelines to be stuck in a pending state. These pipelines need to be stopped manually; newly triggered pipelines should complete successfully.

We've mitigated the impact for cloud pipelines by failing over to an alternative Docker registry; however, pipelines on self-hosted runners will still fail.
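For readers who operate self-hosted runners, the failover described above can be approximated by probing a list of registries in order and using the first one that responds. This is a minimal Python sketch under stated assumptions, not our actual failover logic: pick_registry, image_ref, and the is_healthy probe are hypothetical, and registry-mirror.example.com is a placeholder hostname.

```python
from typing import Callable, Iterable, Optional


def pick_registry(
    registries: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first registry that passes the health probe, or None."""
    for host in registries:
        try:
            if is_healthy(host):
                return host
        except Exception:
            # Treat probe errors (timeouts, DNS failures) as "unhealthy".
            continue
    return None


def image_ref(registry: str, repository: str, tag: str) -> str:
    """Build a fully qualified image reference for the chosen registry."""
    return f"{registry}/{repository}:{tag}"
```

The probe is passed in as a parameter so the selection logic can be exercised without network access; in practice it would be whatever health check suits your environment.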

Posted Jun 22, 2022 - 02:26 UTC

This incident affected: Pipelines.