A recent update revealed an underlying issue with our replica database. Internal alerts flagged the problem first, and StrongDM declared an incident after receiving customer support tickets.
To resolve the issue and prevent recurrence, our Infrastructure team made improvements to query monitoring and adjusted how we manage replica lag within RDS.
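For readers unfamiliar with replica lag handling, the sketch below shows one common pattern: sending reads to the primary whenever the replica is disabled, its lag is unknown, or its lag exceeds a threshold. It is illustrative only and does not describe StrongDM's actual change. The names `ReplicaStatus` and `choose_read_target` and the 5-second threshold are hypothetical. In an RDS setup, the lag value would typically come from the `ReplicaLag` CloudWatch metric, which is reported in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReplicaStatus:
    """Hypothetical snapshot of a read replica's health."""
    enabled: bool                  # False once operators disable the replica
    lag_seconds: Optional[float]   # None when no lag measurement is available


def choose_read_target(status: ReplicaStatus, max_lag_seconds: float = 5.0) -> str:
    """Return "replica" only when the replica is enabled and fresh enough.

    The 5-second default is an assumed threshold, not a recommended value.
    """
    if not status.enabled:
        return "primary"
    if status.lag_seconds is None:
        # Treat a missing measurement as unsafe rather than assuming zero lag.
        return "primary"
    if status.lag_seconds > max_lag_seconds:
        return "primary"
    return "replica"
```

Falling back to the primary on a missing measurement is a deliberate choice in this sketch: it trades some read capacity for the guarantee that stale data is never served silently.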
Incident Timeline:
Sep 5, 20:49 UTC - Change deployed
Sep 6, 13:50 UTC - First problem report from internal alerts
Sep 6, 15:15 UTC - Tickets from two customers, incident declared, replica disabled
Sep 6, 15:59 UTC - Incident resolved
Posted Sep 12, 2024 - 20:42 UTC
Resolved
We consider the incident resolved, as we have seen no additional errors. We will perform an internal post-mortem/RCA and an incident after-action review next week.
Posted Sep 06, 2024 - 15:59 UTC
Update
The US Control Plane was experiencing intermittent issues with authentication and with listing available resources, affecting all users. Affected users had to authenticate multiple times before being allowed into the AdminUI or the SDM Client. We have remediated the source of the issue and are continuing to monitor for any additional errors.
Posted Sep 06, 2024 - 15:39 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 06, 2024 - 15:38 UTC
Monitoring
The issue has been identified and a fix has been implemented. Normal operations should resume. We will continue to monitor and provide further updates here.
Posted Sep 06, 2024 - 15:31 UTC
Update
We are continuing to investigate this issue.
Posted Sep 06, 2024 - 15:27 UTC
Investigating
We are currently investigating this issue and will update here with more information.