    Do not cache all token check failures (#48147) · 0ef2052f
    John Swanson authored
    * Do not cache all token check failures
    
    We want to cache token checks to avoid an issue where we repeatedly ask
    the store "hey, is this token valid?? is this token valid?? is this
    token valid??" for the same token.
    
    However, transient errors can also occur. For example, maybe a network
    issue causes the HTTP request to fail entirely. In this case, if we
    cache the result, the user needs to restart Metabase (or wait 5 minutes
    until the cache is cleared) before they can attempt to validate their
    token again.
    
    This PR moves the cache logic deeper into the stack. We want to cache
    "successful" responses from the store API - cases where the store has
    told us categorically that the token is or is not valid. We don't want
    or need to cache other things that might happen. Maybe your token isn't
    the right length - we can recalculate that, it's ok. Maybe you get a 503
    error from the Store - we should let you retry. Maybe your network is
    having issues and you can't contact the Store at all - again, we should
    let you retry.
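
A minimal sketch of the caching rule described above, written in Python for illustration only (it is not the code from this PR). The names `TokenCheckCache`, `TransientStoreError`, and the `fetch` callable are hypothetical; the 5-minute lifetime comes from the description above. The sketch assumes `fetch` returns a boolean when the store gives a definite answer and raises on anything else (5XX, timeout, network failure).

```python
import time
from typing import Callable, Dict, Tuple

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute cache lifetime mentioned above


class TransientStoreError(Exception):
    """Hypothetical error for 5XX responses, timeouts, and network failures."""


class TokenCheckCache:
    """Caches only definite answers (valid / not valid) from the store."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # hypothetical: asks the store about one token
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bool]] = {}

    def is_valid(self, token: str) -> bool:
        now = self._clock()
        entry = self._entries.get(token)
        if entry is not None and now - entry[0] < CACHE_TTL_SECONDS:
            return entry[1]  # fresh definite answer, no request needed
        # If fetch raises, nothing is stored, so the next call retries.
        result = self._fetch(token)
        self._entries[token] = (now, result)
        return result
```

Because the cache write happens only after `fetch` returns, a failed request leaves no entry behind, and both "valid" and "not valid" answers are cached alike.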
    
    The one potential issue I see here is that if the store goes down, we'll
    massively increase the number of requests we send to the store,
    potentially making it harder to recover. If this is a concern, I can add
    a circuit breaker: if we repeatedly get errors back from the store, back
    off and stop making requests for a while.
    
    * Add a circuit breaker to store API requests
    
    In the pathological case where the store goes down for more than 5 minutes, the
    cache will expire and all instances everywhere will start repeatedly
    making requests for token validation at once. This might make recovering
    from an outage more difficult.
    
    This adds a circuit breaker around the API request. If the call
    repeatedly throws (5XX errors, socket timeouts, etc.) then we'll pause
    for 1 minute, during which time all calls to token validation will
    immediately fail without making any request to the API.
    
    After one minute, we'll allow one request through to the API. If it
    succeeds, we'll go back to normal operation. Otherwise, we'll wait
    another minute.
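
The breaker behaviour described above can be sketched roughly as follows, in Python for illustration only (not the implementation in this PR). `CircuitBreaker`, `CircuitOpenError`, and the `failure_threshold` of 5 are assumptions: the description says only that the call "repeatedly throws". The 60-second pause and the single trial request come from the text. The sketch is single-threaded; under concurrency, more than one trial request could slip through without extra locking.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the API."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,      # assumption: not stated in the text
        cooldown_seconds: float = 60.0,  # the 1-minute pause from the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        now = self._clock()
        if self._opened_at is not None and now - self._opened_at < self._cooldown:
            # Still pausing: fail immediately, make no request.
            raise CircuitOpenError("store API calls are paused")
        # Either closed, or the pause has elapsed and this is the trial request.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # start (or restart) the pause
            raise
        # Success: back to normal operation.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping only the HTTP call, for example `breaker.call(lambda: fetch(token))` with a hypothetical `fetch`, keeps local checks such as token length outside the breaker.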