dpsutton authored
* Relieve db pressure on api/health check

https://github.com/metabase/metabase/issues/26266

Servers under heavy load can be slow to respond to the api/health
check. This can lead to k8s killing healthy instances happily humming
along serving requests.

One idea floated was to use QoSFilters
https://www.eclipse.org/jetty/javadoc/jetty-9/org/eclipse/jetty/servlets/QoSFilter.html
to prioritize those requests in front of others. But I suspect this
might not be our bottleneck.

Our health endpoint was updated to check whether it could acquire a
connection when we were dealing with connection pool issues. Previously
we reported the instance as healthy once it had finished the init
process, which meant it would still report healthy if 60/15 app-db
connections were used and no actual queries could complete.

The remedy was adding
`(sql-jdbc.conn/can-connect-with-spec? {:datasource (mdb.connection/data-source)})`
to the endpoint. But now to get information about the health of the
system we have to wait in the queue to get a datasource.

The hope is that this change, which monitors for recent db
checkins (query success) and checkouts (query begun), can be a proxy for
db activity without having to wait for a connection and hit the db ourselves.
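
A minimal sketch of the idea, not the code in this change: every name below (`last-activity`, `record-activity!`, `recent-activity?`, `healthy?`) is hypothetical, and the window length is left as a parameter because the value used here is not stated above. The pool hooks would record a timestamp, and the health check would only fall back to a real connectivity check when nothing has touched the pool recently.

```clojure
;; Hypothetical sketch; none of these names come from the Metabase codebase.

;; Epoch millis of the most recent pool checkin/checkout, or nil if none seen.
(defonce last-activity (atom nil))

(defn record-activity!
  "Would be called from the pool's checkin/checkout hooks."
  ([] (record-activity! (System/currentTimeMillis)))
  ([now-ms] (reset! last-activity now-ms)))

(defn recent-activity?
  "True if activity was recorded within `window-ms` of `now-ms`."
  [now-ms window-ms]
  (boolean
   (when-let [t @last-activity]
     (<= (- now-ms t) window-ms))))

(defn healthy?
  "Skips the db probe when the pool has been active recently.
  `can-connect?` is a zero-arg fn standing in for the real connectivity check."
  [can-connect? now-ms window-ms]
  (or (recent-activity? now-ms window-ms)
      (boolean (can-connect?))))
```

Because `or` short-circuits, `can-connect?` is never invoked while the pool is busy, which is the point: the health check stops competing for a connection exactly when connections are scarce.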

Some simple and crude benchmarking:
- use `siege` to hit `api/database/<app-db>/sync_schema`
- in a separate tab, use `siege` to hit `api/health`

Three trials each with unconditional db access and with conditional db
access (look for recent activity set by the new `ConnectionCustomizer`).

One siege run syncs the app-db's schema with 80 clients each
sending 60 requests. The other has 1 client sending 60 requests to `api/health`.

| Run           | Elapsed time | Max tx | Tx rate    |
|---------------|--------------|--------|------------|
| before change | 7.16s        | 0.79s  | 8.38 tx/s  |
| before change | 23.91s       | 1.44s  | 2.51 tx/s  |
| before change | 13.00s       | 0.50s  | 4.62 tx/s  |
| after change  | 4.46s        | 0.27s  | 13.45 tx/s |
| after change  | 5.81s        | 0.61s  | 10.33 tx/s |
| after change  | 4.54s        | 0.44s  | 13.22 tx/s |
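
For a rough sense of the difference, the elapsed times in the table can be averaged per condition. This is only arithmetic on the numbers above; `mean`, `elapsed-secs`, `mean-elapsed` and `speedup` are names introduced here for illustration. With three noisy trials per condition, the ratio should be read as indicative rather than as a measurement.

```clojure
;; Illustrative arithmetic on the elapsed times reported in the table.

(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

;; Elapsed seconds for the api/health siege run, per condition.
(def elapsed-secs
  {:before [7.16 23.91 13.00]
   :after  [4.46 5.81 4.54]})

;; Map of condition -> mean elapsed seconds.
(def mean-elapsed
  (into {} (map (fn [[k xs]] [k (mean xs)])) elapsed-secs))

;; How many times faster the health run finished after the change, on average.
(def speedup
  (/ (:before mean-elapsed) (:after mean-elapsed)))
```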

Full(er) results below:

```
Unconditional db access
=======================

siege -c80 -r 40 "http://localhost:3000/api/database/2/sync_schema POST" -H "Cookie: $SESSION"

siege -c 1 -r 60 "http://localhost:3000/api/health"

Elapsed time:		        7.16 secs
Response time:		        0.12 secs
Transaction rate:	        8.38 trans/sec
Longest transaction:	        0.79
Shortest transaction:	        0.01

Elapsed time:		       23.91 secs
Response time:		        0.40 secs
Transaction rate:	        2.51 trans/sec
Longest transaction:	        1.44
Shortest transaction:	        0.02

Elapsed time:		       13.00 secs
Response time:		        0.22 secs
Transaction rate:	        4.62 trans/sec
Longest transaction:	        0.50
Shortest transaction:	        0.06

Conditional db access
=====================

Elapsed time:		        4.46 secs
Response time:		        0.07 secs
Transaction rate:	       13.45 trans/sec
Longest transaction:	        0.27
Shortest transaction:	        0.01

Elapsed time:		        5.81 secs
Response time:		        0.10 secs
Transaction rate:	       10.33 trans/sec
Longest transaction:	        0.61
Shortest transaction:	        0.00

Elapsed time:		        4.54 secs
Response time:		        0.08 secs
Transaction rate:	       13.22 trans/sec
Longest transaction:	        0.44
Shortest transaction:	        0.01
```

* Remove reflection in `.put` call (not the reflection strategy)

Also remove the call to `classloader/the-classloader`, as it did nothing.

* Comment and settle on a single method

* tests

* select from db twice

Had a failure in CI. Give it time to do its thing with another db call.

* block to wait for timestamp update?

* unflake the tests

Tasks and events from outside the thread can hit the db. The
ConnectionCustomizer is also run from c3p0-controlled threads, so we
can't easily isolate everything to our thread.

Was running

```clojure
(comment
  (dotimes [n 5]
    (dotimes [_ 100]
      (recent-activity-test)
      (CheckinTracker-test))
    (println (* (inc n) 100))))
```

to run the tests 500 times and would keep getting flakes at a rate of
roughly 1/100 to 1/500. Just frustration for the future.

* typehint

* Switch it up a bit

Tests were flaking in h2 and I don't know why. I'm switching to just
updating recent activity on most methods.
0975f5d9