A control plane change was rolled out that altered the default policy used to authorize connections when the policy editor is not enabled for an organization. This policy was synced down to customer infrastructure for local authorization but, critically, the new entities referenced by that policy were not always synchronized along with it.
This was caused by an incomplete implementation of a database synchronization function on the control plane, which, in some cases, informed clients about the policy but not about the new entities.
The policy began to deny authorizations on customer infrastructure because the missing entity references prevented it from being applied. The issue affected customers who had policy disabled (either by SKU or by toggle in the Admin UI), because the globalpolicy is intended to stand in for hand-written policy when policy is not enabled.
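The failure mode described above can be illustrated with a minimal sketch. It assumes a node that fails closed whenever a policy references an entity it does not hold locally. The `Policy` class, the `authorize` function, and the entity IDs are hypothetical and are not taken from the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    name: str
    # IDs of entities the policy refers to (hypothetical identifiers).
    referenced_entities: set[str] = field(default_factory=set)


def authorize(policy: Policy, local_entities: set[str]) -> bool:
    """Allow only if every entity the policy references exists locally."""
    missing = policy.referenced_entities - local_entities
    if missing:
        # The policy cannot be applied, so the node fails closed and denies.
        return False
    return True


global_policy = Policy("globalpolicy", {"entity-old", "entity-new"})

# Incomplete sync: the policy arrived, but "entity-new" did not.
partial_sync = {"entity-old"}
# Complete sync: the policy and every referenced entity are present.
full_sync = {"entity-old", "entity-new"}

print(authorize(global_policy, partial_sync))  # False
print(authorize(global_policy, full_sync))     # True
```

In this sketch the policy itself is identical in both cases; only the completeness of the locally synchronized entity set decides whether connections are authorized.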
Resolution:
To fix the issue, the control plane was rolled back to a version that did not have the new globalpolicy plugin. However, the bad policy could already have synchronized to customer nodes. To make sure it was purged, we later merged a change that bumped the authsync version, which forced all customers' nodes to fully resynchronize their policies and entities with the control plane.
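As a rough illustration of why a version bump purges stale state, the sketch below assumes a node that compares its stored sync version with the one advertised by the control plane and discards its local cache on any mismatch. The `Node` class, its fields, and the cache keys are hypothetical; the real authsync mechanism may work differently.

```python
class Node:
    """Hypothetical node that caches policies and entities locally."""

    def __init__(self, authsync_version: int):
        self.authsync_version = authsync_version
        self.cache: dict[str, str] = {}
        self.full_resyncs = 0

    def sync(self, server_version: int, server_state: dict[str, str]) -> None:
        if server_version != self.authsync_version:
            # Version mismatch: discard everything cached and start over.
            self.cache = dict(server_state)
            self.authsync_version = server_version
            self.full_resyncs += 1
        else:
            # Same version: incremental update, so stale entries survive.
            self.cache.update(server_state)


good_state = {"default-policy": "good"}

# A node that received the bad policy before the rollback.
node = Node(authsync_version=1)
node.cache = {"globalpolicy": "bad"}

# Rollback alone: the version is unchanged, so the bad policy stays cached.
node.sync(server_version=1, server_state=good_state)
stale_after_rollback = "globalpolicy" in node.cache

# Version bump: the node drops its cache and resynchronizes fully.
node.sync(server_version=2, server_state=good_state)
stale_after_bump = "globalpolicy" in node.cache

print(stale_after_rollback, stale_after_bump)  # True False
```

The design point is that a rollback only changes what the control plane sends next, while a version bump also invalidates what the nodes already hold.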
Downtime:
Total downtime was just over an hour. Restarting nodes helped some customers, because a restart forces a resynchronization of the full set of policies and entities.
Prevention:
Going forward, StrongDM automatically forces a full resynchronization whenever code that is involved in synchronizing policies and entities to the nodes is modified.
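One possible way to automate this, shown purely as a sketch and not as StrongDM's actual mechanism, is to derive the sync version from a fingerprint of the sync-related source files, so that any modification changes the version without a manual bump. The function names and file paths below are hypothetical.

```python
import hashlib


def sync_fingerprint(sources: dict[str, bytes]) -> str:
    """Hash the paths and contents of sync-related files in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(sources):
        digest.update(path.encode())
        digest.update(b"\0")
        digest.update(sources[path])
        digest.update(b"\0")
    return digest.hexdigest()


def needs_full_resync(node_fingerprint: str, sources: dict[str, bytes]) -> bool:
    """A node resynchronizes fully when its fingerprint is out of date."""
    return node_fingerprint != sync_fingerprint(sources)


# Hypothetical file names standing in for the synchronization code.
before = {"sync/policy.go": b"v1", "sync/entities.go": b"v1"}
after = {"sync/policy.go": b"v1", "sync/entities.go": b"v2"}

node_fingerprint = sync_fingerprint(before)

print(needs_full_resync(node_fingerprint, before))  # False
print(needs_full_resync(node_fingerprint, after))   # True
```

Tying the version to the code itself removes the step in which someone has to remember to bump it.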