John Swanson
authored
* Do not cache all token check failures We want to cache token checks to avoid an issue where we repeatedly ask the store "hey, is this token valid?? is this token valid?? is this token valid??" for the same token. However, transient errors can also occur. For example, maybe a network issue causes the HTTP request to fail entirely. In this case, if we cache the result, the user needs to restart metabase (or wait 5 minutes until the cache is cleared) before they can attempt to validate their token again. This PR moves the cache logic deeper into the stack. We want to cache "successful" responses from the store API - cases where the store has told us categorically that the token is or is not valid. We don't want or need to cache other things that might happen. Maybe your token isn't the right length - we can recalculate that, it's ok. Maybe you get a 503 error from the Store - we should let you retry. Maybe your network is having issues and you can't contact the Store at all - again, we should let you retry. The one potential issue I see here is that if the store goes down, we'll massively increase the number of requests we send to the store, potentially making it harder to recover. If this is a concern, I can add a circuit breaker: if we repeatedly get errors back from the store, back off and stop making requests for a while. * Add a circuit breaker to store API requests In the pathological case where the store goes down for >5 minutes, the cache will expire and all instances everywhere will start repeatedly making requests for token validation at once. This might make recovering from an outage more difficult. This adds a circuit breaker around the API request. If the call repeatedly throws (5XX errors, socket timeouts, etc.) then we'll pause for 1 minute, during which time all calls to token validation will immediately fail without making any request to the API. After one minute, we'll allow one request through to the API. If it succeeds, we'll go back to normal operation. Otherwise, we'll wait another minute.
Code owners
Assign users and groups as approvers for specific file changes. Learn more.