SDM Outage (US)
Incident Report for StrongDM
Postmortem

On December 3rd, SDM released a server build that added a new index to the table tracking latency between nodes. While the build succeeded on the EU and UK Control Planes, it failed on the US Control Plane.

Retrying the build on the US Control Plane left the index in an invalid state, which went undetected by our migration tools.
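As an illustration of the kind of post-migration check that could surface an index left in an invalid state, here is a minimal sketch. It is not StrongDM's tooling: the report does not name the database or the migration tool, and `IndexState`, `check_indexes`, and the index name `node_latency_idx` are hypothetical names chosen for this example. The sketch assumes the database catalog can report, for each index, whether it exists and whether it is valid (PostgreSQL, for instance, exposes this as `pg_index.indisvalid`).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class IndexState:
    """One row of a (hypothetical) catalog listing: index name plus validity flag."""
    name: str
    valid: bool


def check_indexes(expected: Iterable[str], catalog: Iterable[IndexState]) -> List[str]:
    """Return a list of problems for indexes that are missing or invalid.

    An empty list means every expected index exists and is marked valid.
    """
    by_name = {ix.name: ix for ix in catalog}
    problems = []
    for name in expected:
        state = by_name.get(name)
        if state is None:
            problems.append(f"{name}: missing")
        elif not state.valid:
            # The index exists, so a simple "does it exist?" check would pass,
            # but the database will not use it.
            problems.append(f"{name}: present but invalid")
    return problems


if __name__ == "__main__":
    # Hypothetical catalog snapshot after a failed and retried migration.
    catalog = [IndexState("node_latency_idx", valid=False)]
    for problem in check_indexes(["node_latency_idx"], catalog):
        print(problem)
```

Checking validity separately from existence is the point of the sketch: a migration step that only asks whether the index exists would treat a half-built index as a success. A release that depends on the index could refuse to proceed while this list is non-empty.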

The following day, December 4th, SDM released a server build that relied on the new index. Without the index, node-to-node latency was no longer stored, and approximately three minutes later, the routing system stopped attempting multi-hop routes. This issue was limited to the US Control Plane, as the index was successfully created in the EU and UK Control Planes.

StrongDM has already changed its internal processes to prevent this kind of issue from recurring. Thank you for your patience and understanding.

Posted Dec 20, 2024 - 17:22 UTC

Resolved
This incident was resolved at 13:49 UTC. An RCA will be posted here within a week.
Posted Dec 04, 2024 - 15:08 UTC
Update
This issue does not appear to have impacted customers using our EU and UK Control Planes. This outage impacted the US Control Plane only.
Posted Dec 04, 2024 - 13:56 UTC
Monitoring
A fix has been implemented and we are monitoring the results. This should be resolved. If you are still seeing issues, please reach out to support: https://help.strongdm.com/hc/en-us/requests/new
Posted Dec 04, 2024 - 13:49 UTC
Identified
The outage was caused by a change to our production database. Engineering has identified the cause and is working to remediate the issue. Updates to follow.
Posted Dec 04, 2024 - 13:03 UTC
This incident affected: Admin UI and API.