Git operations via HTTPS/SSH are failing
Incident Report for Atlassian Bitbucket
Postmortem

SUMMARY

On July 06, 2021, between 21:38 and 22:41 UTC, customers using Bitbucket Cloud products were unable to perform Git operations over HTTPS or SSH. The event was triggered by side effects of enabling a feature flag that toggles new behavior in a caching layer used for authentication. As a result, all Bitbucket Cloud customers were impacted. The incident was detected within 2 minutes by automated internal monitoring systems and mitigated by disabling the feature flag, which returned Atlassian systems to a known good state. The total time to resolution was about 1 hour and 3 minutes.

IMPACT

The impact to Bitbucket Cloud products lasted from 21:38 UTC to 22:41 UTC on July 06, 2021. The incident caused service disruption to all customers attempting to perform Git operations over HTTPS and SSH. Customers were unable to execute commands such as git push, git pull, and git clone, and instead received HTTP 401/permission denied errors.

ROOT CAUSE

The issue was caused by a change to a feature flag targeting another part of the system (the caching layer for authentication), which unexpectedly impacted Git services. As a result, users of Bitbucket Cloud and of products that integrate with Bitbucket Cloud could not run git clone, git pull, git push, and similar commands. The root cause of the incident was that the change created unexpected load on a caching layer that is shared between these two parts of the system. The lack of visibility into this recent and seemingly unrelated feature flag change kept the engineers troubleshooting the issue from identifying the source of the problem right away.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages are impactful to your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because a change initiated by one team had an adverse effect on services owned and maintained by a different team. The team that made the change monitored their own systems and confirmed that they were operating normally after the flag was enabled. The impact was instead reflected in the monitoring systems of the other team.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Requiring approval for feature flag changes - Updates that roll out new changes to the production environment will require a minimum of two people to approve the rollout (rollbacks will not be held to the same standard).
  • Leveraging feature flag integration to widely broadcast updates to flags - Internal comms will be sent and cataloged each time there is an update to a feature flag.
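
As an illustration only, the two actions above could be sketched as a small gate in front of flag changes. This is a minimal sketch under stated assumptions, not a description of Atlassian's tooling: the names FlagChange, can_apply, apply_change, and the flag name "auth-cache-v2" are hypothetical, and excluding the author from the approver count is an assumption the report does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set

# Assumption: "a minimum of two people" is modeled as two distinct approvers.
MIN_APPROVERS = 2


@dataclass
class FlagChange:
    """A requested feature flag change (hypothetical model)."""
    flag: str
    author: str
    enable: bool  # True = rollout, False = rollback
    approvers: Set[str] = field(default_factory=set)


def can_apply(change: FlagChange) -> bool:
    """Rollouts need enough approvers; rollbacks are always allowed."""
    if not change.enable:
        return True
    # Assumption: the author's own approval does not count.
    independent = change.approvers - {change.author}
    return len(independent) >= MIN_APPROVERS


def apply_change(
    change: FlagChange,
    catalog: List[str],
    notify: Callable[[str], None],
) -> str:
    """Apply the change, record it in the catalog, and broadcast it."""
    if not can_apply(change):
        raise PermissionError(
            f"rollout of {change.flag!r} needs {MIN_APPROVERS} approvers"
        )
    action = "enabled" if change.enable else "disabled"
    message = f"flag {change.flag!r} {action} by {change.author}"
    catalog.append(message)  # cataloged for later troubleshooting
    notify(message)          # broadcast to other teams
    return message


if __name__ == "__main__":
    log: List[str] = []
    change = FlagChange("auth-cache-v2", "alice", True, {"bob", "carol"})
    apply_change(change, log, print)
```

Exempting rollbacks keeps mitigation fast: in this incident the fix was to disable the flag, which a gate like this would not delay.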

We apologize to those customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jul 15, 2021 - 17:05 UTC

Resolved
This incident has been resolved.
Posted Jul 06, 2021 - 22:52 UTC
Investigating
Git operations over HTTPS and SSH are failing. Pipelines are also failing because they cannot clone repositories.

We are investigating this issue and taking steps to mitigate and recover. We will post an update within an hour.
Posted Jul 06, 2021 - 22:24 UTC
This incident affected: SSH, Git via HTTPS, and Pipelines.