deps.edn · cffae168715266cb26e58fe9ce221499e315a4ff · Engineering Digital Service / Metabase

Do not cache all token check failures (#48147) · 0ef2052f

John Swanson authored 5 months ago

* Do not cache all token check failures

We want to cache token checks to avoid an issue where we repeatedly ask
the store "hey, is this token valid?? is this token valid?? is this
token valid??" for the same token.

However, transient errors can also occur. For example, maybe a network
issue causes the HTTP request to fail entirely. In this case, if we
cache the result, the user needs to restart Metabase (or wait 5 minutes
until the cache is cleared) before they can attempt to validate their
token again.

This PR moves the cache logic deeper into the stack. We want to cache
"successful" responses from the store API - cases where the store has
told us categorically that the token is or is not valid. We don't want
or need to cache other things that might happen. Maybe your token isn't
the right length - we can recalculate that, it's ok. Maybe you get a 503
error from the Store - we should let you retry. Maybe your network is
having issues and you can't contact the Store at all - again, we should
let you retry.
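
As a rough illustration of the caching rule described above, here is a minimal Python sketch, not the actual Metabase code. The names `TokenChecker`, `TransientStoreError`, and the `fetch` callable are hypothetical; the only detail taken from the message is the 5-minute expiry. The point is that only a categorical valid/invalid answer is stored, while anything that raises leaves the cache untouched so the next call retries.

```python
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute expiry mentioned in the message


class TransientStoreError(Exception):
    """Hypothetical error type: a 5XX response, a timeout, or a network failure."""


class TokenChecker:
    def __init__(self, fetch: Callable[[str], bool],
                 clock: Callable[[], float] = time.monotonic):
        # `fetch` is an assumed stand-in for the store API call: it returns
        # True/False for a categorical answer and raises on anything else.
        self._fetch = fetch
        self._clock = clock
        self._cache: Dict[str, Tuple[float, bool]] = {}

    def is_valid(self, token: str) -> bool:
        entry = self._cache.get(token)
        if entry is not None and self._clock() - entry[0] < CACHE_TTL_SECONDS:
            return entry[1]  # fresh categorical answer, valid or invalid
        # If fetch raises, the exception propagates before the cache is
        # written, so a transient failure is never remembered.
        verdict = self._fetch(token)
        self._cache[token] = (self._clock(), verdict)
        return verdict
```

Note that an "invalid" verdict is cached just like a "valid" one; only failures to get a verdict at all are left uncached.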

The one potential issue I see here is that if the store goes down, we'll
massively increase the number of requests we send to the store,
potentially making it harder to recover. If this is a concern, I can add
a circuit breaker: if we repeatedly get errors back from the store, back
off and stop making requests for a while.

* Add a circuit breaker to store API requests

In the pathological case where the store goes down for >5 minutes, the
cache will expire and all instances everywhere will start repeatedly
making requests for token validation at once. This might make recovering
from an outage more difficult.

This adds a circuit breaker around the API request. If the call
repeatedly throws (5XX errors, socket timeouts, etc.) then we'll pause
for 1 minute, during which time all calls to token validation will
immediately fail without making any request to the API.

After one minute, we'll allow one request through to the API. If it
succeeds, we'll go back to normal operation. Otherwise, we'll wait
another minute.
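
The breaker behaviour described above could be sketched roughly as follows. This is an illustrative Python sketch rather than the code in this commit: `CircuitBreaker`, `CircuitOpenError`, and the `failure_threshold` default are assumptions (the message says only "repeatedly"), and the sketch is single-threaded, so it does not enforce "exactly one" trial request under concurrency.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the API."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, pause_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold  # assumed value, not from the source
        self._pause = pause_seconds          # the 1-minute pause from the message
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        # While the pause is running, fail immediately without calling fn.
        if self._opened_at is not None and self._clock() - self._opened_at < self._pause:
            raise CircuitOpenError("store API calls are paused")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Open on repeated failures, or re-open if this was the trial
            # request made after a pause.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        # Success: return to normal operation.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap the store request, for example `breaker.call(lambda: fetch(token))`, and treat `CircuitOpenError` as another uncached, retryable failure.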
