Summary:
During a routine system update on October 2, a server release was pushed to production that changed how tokens were validated, which led to failed connections to resources. Our internal canary deployment caught the issue and alerted the team. They started to rollback the change when we started to learn about customer impact. There were two compounding issues that caused this outage.
Cause:
It took some time for the issue to manifest in our Canary environment. By the time the system alerted, the build was deemed OK and pushed to production. Which was the ultimate root cause of the failure.
Impact:
Remediation & Prevention: