Postmortem on the outage of June 1st, 2024.
When the database came back up after the initial upgrade, Engineering noticed an unusual spike in activity and locking on the database: queries were getting stuck behind a large volume of traffic. This had not happened on prior upgrades, including recent ones, so it was surprising.
During the investigation, Engineering discovered that a change rolled out earlier that day interacted poorly with a freshly restarted database because of the surge of traffic from customers re-connecting all at once (sometimes referred to as a "Thundering Herd" problem). Engineers rolled back to the release prior to that change; database traffic began returning to expected levels and connections through StrongDM started working again, but service was still slow.
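For context on that pattern: when a large fleet of clients all retries at the same moment, a freshly restarted database can be overwhelmed before it finishes warming up. The sketch below illustrates one common, general mitigation -- client-side reconnection with exponential backoff and jitter. It is illustrative only; the connect function, delays, and cap are assumptions for the example and do not describe StrongDM's actual client code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// connect stands in for whatever dial/handshake a real client performs.
// It always fails here so the retry loop below has something to retry.
func connect(ctx context.Context) error {
	return errors.New("database still warming up")
}

// connectWithBackoff retries connect, spreading attempts out over time so
// that many clients do not all hit a freshly restarted database at once.
func connectWithBackoff(ctx context.Context) error {
	const (
		baseDelay = 500 * time.Millisecond // assumed starting delay
		maxDelay  = 30 * time.Second       // assumed upper bound on delay
	)
	delay := baseDelay
	for attempt := 1; ; attempt++ {
		if err := connect(ctx); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, delay) so clients desynchronize.
		sleep := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed; retrying in %v\n", attempt, sleep)
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := connectWithBackoff(ctx); err != nil {
		fmt.Println("gave up:", err)
	}
}
```

With jittered backoff, reconnection attempts are spread over a window rather than arriving as a single wave, which gives a recovering database time to work through its backlog.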
Engineering then examined other new code paths and found that another recently shipped feature needed optimization. That was mitigated quickly, after which the database fully recovered and service was responsive again.
StrongDM apologizes for the impact this outage had on our customers, and we appreciate your patience. We have learned from this maintenance and have already incorporated those lessons into our practices.