At 07:57 UTC on 25 October 2018, the primary Bitbucket database server failed because its disk was full. An oversight in configuration management meant that part of the archive command could not run. Without a functional archive command, transaction logs accumulated on disk instead of being rotated out, and eventually filled it. Because this server had only recently been promoted to “primary”, the broken archive command was not immediately noticed, and alerts on that system had not yet been updated to reflect its new role.
Service was restored at 08:17 UTC, after engineers installed the missing components for the archive command and obsolete logs were deleted, which freed enough space for temporary files. Because the database processes shut down immediately when the disk filled, no user data was lost.
To ensure that this doesn’t happen again, we’ve updated our configuration management system to maintain archive-command component binaries on all database systems. We are also in the process of updating our internal database failover processes to include more rigorous validation steps, and automating more failover-related changes to monitoring and auxiliary configurations.
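As an illustration of the kind of validation step described above, here is a minimal Python sketch of a check that could run when a server is promoted. It is not our actual tooling: the function names, the 80% threshold, and the binary name `example-archiver` are all hypothetical, and the check assumes the archive components are executables discoverable on `PATH`.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_binaries(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


def disk_used_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def validate_new_primary(
    required_binaries: Iterable[str],
    log_dir: str,
    max_used_fraction: float = 0.8,  # hypothetical threshold
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # The archive command cannot run if any of its components is missing.
    for name in missing_binaries(required_binaries, which):
        problems.append(f"archive component not found on PATH: {name}")
    # Catch transaction logs piling up before the disk is completely full.
    used = disk_used_fraction(log_dir)
    if used > max_used_fraction:
        problems.append(
            f"disk holding {log_dir} is {used:.0%} full "
            f"(limit {max_used_fraction:.0%})"
        )
    return problems


if __name__ == "__main__":
    # "example-archiver" and "." are placeholders for illustration only.
    for problem in validate_new_primary(["example-archiver"], "."):
        print("FAIL:", problem)
```

Running a check like this as part of failover, instead of relying on alerts tied to a server's previous role, would surface both a missing archive component and a filling disk at promotion time.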
As always, please feel free to contact our Support team at support.atlassian.com if you need assistance.