Pipelines stuck in pending state
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On February 22, 2024, between 07:22 UTC and 13:30 UTC, Atlassian customers using Bitbucket Cloud experienced degraded performance on the website and APIs. The cause was that the vacuum process did not run frequently enough on our high-traffic database tables, which impaired the database’s ability to handle requests. As a result, connection pools became saturated, response times increased, and a growing number of requests timed out completely.

After the database recovered at 13:30 UTC, Bitbucket Pipelines experienced build scheduling delays as it processed the backlog of jobs. Additional resources were added to Bitbucket Pipelines and the backlog was cleared in full by 17:30 UTC.

IMPACT

Impacted customers experienced significant delays when running Bitbucket Pipelines and increased latency when accessing the bitbucket.org website and APIs for the duration of the incident. Git requests over HTTPS and SSH were unaffected.

ROOT CAUSE

The incident was caused by an issue with the routine autovacuuming of our active database tables, which impaired the database’s ability to serve requests. The resulting slowdowns affected a variety of Bitbucket services and led to a large backlog of unscheduled pipelines.

REMEDIAL ACTION PLAN & NEXT STEPS

We know that outages impact your productivity. We are prioritizing the following improvement actions to reduce recovery time, limit impact, and avoid repeating these types of incidents in the future:

  • Reconfigure the vacuuming threshold for database tables with high write activity.
  • Adjust alert thresholds to catch this behavior earlier and reduce potential impact.
  • Tune autoscaling and load-shedding behavior for Pipelines services and increase build runner capacity.
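
To illustrate the first two actions, here is a minimal sketch of how a vacuum threshold scales with table size. This report does not name the database engine; the sketch assumes PostgreSQL, where autovacuum vacuums a table once its dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * (estimated row count), with defaults of 50 and 0.2. The table name, row counts, lowered scale factor, and alert fraction are hypothetical values chosen for illustration, not our production configuration.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    """Per-table counters (in PostgreSQL, available from pg_class and pg_stat_user_tables)."""
    name: str
    live_tuples: int  # estimated row count (reltuples)
    dead_tuples: int  # dead rows awaiting vacuum (n_dead_tup)


def vacuum_trigger_point(live_tuples, base_threshold=50, scale_factor=0.2):
    """Dead-tuple count at which autovacuum starts vacuuming a table.

    Assumes PostgreSQL's rule: base_threshold + scale_factor * live_tuples.
    The defaults mirror PostgreSQL's stock settings.
    """
    return base_threshold + scale_factor * live_tuples


def needs_attention(stats, scale_factor, alert_fraction=0.8):
    """Return True when dead tuples reach alert_fraction of the trigger point.

    alert_fraction is a hypothetical alerting knob, not a database setting.
    """
    trigger = vacuum_trigger_point(stats.live_tuples, scale_factor=scale_factor)
    return stats.dead_tuples >= alert_fraction * trigger


if __name__ == "__main__":
    # Hypothetical high-write table with 500 million rows.
    table = TableStats(name="hypothetical_busy_table",
                       live_tuples=500_000_000,
                       dead_tuples=4_500_000)
    for factor in (0.2, 0.01):
        trigger = vacuum_trigger_point(table.live_tuples, scale_factor=factor)
        print(f"scale_factor={factor}: vacuum after ~{trigger:,.0f} dead tuples, "
              f"alert now: {needs_attention(table, factor)}")
```

Under these assumptions, a percentage-based threshold lets a very large table accumulate a very large number of dead rows before a vacuum starts; lowering the scale factor for that table makes vacuums run more often on smaller batches. In PostgreSQL, such a per-table override can be set as a storage parameter with ALTER TABLE ... SET (autovacuum_vacuum_scale_factor = ...).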

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Mar 11, 2024 - 17:49 UTC

Resolved
Services have recovered and are operational.
Posted Feb 22, 2024 - 17:45 UTC
Monitoring
Services are in the process of recovering while we continue to monitor.
Posted Feb 22, 2024 - 17:24 UTC
Identified
The issue has been identified. We are working towards resolution.
Posted Feb 22, 2024 - 16:39 UTC
Investigating
We are still receiving some reports of Pipelines queueing or delays, which require further investigation.
Posted Feb 22, 2024 - 15:55 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Feb 22, 2024 - 15:20 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Feb 22, 2024 - 15:16 UTC
Investigating
We are currently investigating an issue preventing Bitbucket Pipelines from starting for some customers.
Posted Feb 22, 2024 - 14:45 UTC
This incident affected: Webhooks and Pipelines.