Postmortem on the outage of June 1st, 2024.
When the database came back up after the initial upgrade, Engineering noticed an unusual spike in activity and locking on the database: queries were getting stuck behind a large volume of traffic. This had not happened on prior upgrades, including recent ones, so it was surprising.
During the investigation, Engineering discovered that a change rolled out earlier that day interacted poorly with a freshly restarted database because of the surge of traffic from customers re-connecting all at once (sometimes referred to as a "Thundering Herd" problem). Engineers rolled back to the release prior to that change; database traffic began returning to expected levels and connections through StrongDM started working again, but service was still slow.
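For context on that pattern: when a large fleet of clients all retries at the same moment, a freshly restarted database can be overwhelmed before it finishes warming up. The sketch below illustrates one common, general mitigation -- client-side reconnection with exponential backoff and jitter. It is illustrative only; the connect function, delays, and cap are assumptions for the example and do not describe StrongDM's actual client code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// connect stands in for whatever dial/handshake a real client performs.
// It always fails here so the retry loop below has something to retry.
func connect(ctx context.Context) error {
	return errors.New("database still warming up")
}

// connectWithBackoff retries connect, spreading attempts out over time so
// that many clients do not all hit a freshly restarted database at once.
func connectWithBackoff(ctx context.Context) error {
	const (
		baseDelay = 500 * time.Millisecond // assumed starting delay
		maxDelay  = 30 * time.Second       // assumed upper bound on delay
	)
	delay := baseDelay
	for attempt := 1; ; attempt++ {
		if err := connect(ctx); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, delay) so clients desynchronize.
		sleep := time.Duration(rand.Int63n(int64(delay)))
		fmt.Printf("attempt %d failed; retrying in %v\n", attempt, sleep)
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := connectWithBackoff(ctx); err != nil {
		fmt.Println("gave up:", err)
	}
}
```

With jittered backoff, reconnection attempts are spread over a window rather than arriving as a single wave, which gives a recovering database time to work through its backlog.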
Engineering then examined other new code paths and found that another recently shipped feature needed optimization. That was mitigated quickly, after which the database fully recovered and service was responsive again.
StrongDM apologizes for the impact this outage had on our customers, and we appreciate your patience. We have learned from this maintenance and have already incorporated those lessons into our practices.