At 10:38 PM UTC on 2025-03-04, we received the first notification from our alerting system that our messaging infrastructure was being overwhelmed by the volume of requests.
At 11:06 PM UTC on 2025-03-04, the system recovered.
Engineers identified the issue and created a ticket to address it.
At 6:56 AM UTC on 2025-03-05, we received another notification from our alerting system for the same problem.
At 7:28 AM UTC on 2025-03-05, the system recovered again.
At 2:27 PM UTC on 2025-03-05, we began working on mitigations and further investigation as the US team started its day.
At 7:23 PM UTC on 2025-03-05, mitigations were released to production, and our metrics showed significant performance improvements during the next spike overnight.
Root cause:
During a period of heavy use, one of our message-passing systems became overloaded.
The specific messages that caused the issue were broadcasts announcing the creation of connections to local clients.
These messages were not being produced or consumed efficiently.
Once the messaging system was overloaded, it stopped accepting new messages for a brief period.
As consumers caught up, performance would improve.
All managed secret actions (validation, rotation, retrieval) depend on this messaging system. While the system was overloaded, these actions would either time out or be dropped.
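To illustrate the failure mode in general terms, here is a minimal sketch; the bus, topic, and function names below are hypothetical stand-ins for illustration only, not our internal code. A request that depends on an overloaded bus simply never completes within its deadline:

```python
import asyncio

# Hypothetical stand-in for an internal message bus. When the bus is
# overloaded it stops accepting messages, so publish() never completes
# within the caller's deadline.
class OverloadedBus:
    async def publish(self, topic: str, payload: dict) -> None:
        await asyncio.sleep(3600)  # simulates a bus that is not accepting messages

async def retrieve_secret(bus: OverloadedBus, secret_id: str) -> str:
    # Managed secret actions are requests over the bus, so they inherit its health.
    await asyncio.wait_for(bus.publish("secrets.retrieve", {"id": secret_id}), timeout=2.0)
    return "<secret value>"

async def main() -> None:
    try:
        await retrieve_secret(OverloadedBus(), "db-password")
    except asyncio.TimeoutError:
        # This is the user-visible failure: the action times out or is dropped.
        print("secret retrieval timed out while the bus was overloaded")

asyncio.run(main())
```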
Mitigation:
We dramatically improved the efficiency of both the production and consumption of these messages. We experienced a similar high-load spike after the mitigations were released and observed that they were highly effective.
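As a rough illustration of this class of fix (the batching approach, names, and intervals below are assumptions for the sketch, not our production design), buffering connection-created events and broadcasting them in batches reduces both the number of messages producers emit and the number consumers must drain:

```python
import queue
import threading
import time

# Hypothetical in-memory queue standing in for the message bus.
bus: "queue.Queue[list[dict]]" = queue.Queue()
pending: list[dict] = []
lock = threading.Lock()

def on_connection_created(conn_id: str) -> None:
    # Producer side: buffer the event instead of broadcasting immediately.
    with lock:
        pending.append({"event": "connection_created", "id": conn_id})

def flush_pending(interval: float = 0.1) -> None:
    # Publish at most one message per interval, carrying the whole batch.
    while True:
        time.sleep(interval)
        with lock:
            if pending:
                bus.put(pending.copy())
                pending.clear()

def consume() -> None:
    # Consumer side: each message now covers many connection events.
    while True:
        batch = bus.get()
        print(f"handled {len(batch)} connection events in one message")

threading.Thread(target=flush_pending, daemon=True).start()
threading.Thread(target=consume, daemon=True).start()

# A burst of new local-client connections becomes a handful of messages
# instead of one broadcast per connection.
for i in range(1000):
    on_connection_created(f"conn-{i}")
time.sleep(0.5)  # give the background threads time to flush and consume
```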
Posted Mar 07, 2025 - 21:57 UTC
Resolved
On March 4th at approximately 2:30 PM PST, we experienced an unusual load event that caused degraded service for roughly 30 minutes. A follow-up event occurred later that day at 11 PM PST, again lasting roughly 30 minutes. During these periods, the service was slower than usual to respond to requests and some session connections failed. We have implemented mitigations and are currently monitoring.