Skip to content
Snippets Groups Projects
Unverified Commit 0975f5d9 authored by dpsutton's avatar dpsutton Committed by GitHub
Browse files

Relieve db pressure on api/health check (#27192)

* Relieve db pressure on api/health check

https://github.com/metabase/metabase/issues/26266

Servers under heavy load can be slow to respond to the api/health
check. This can lead to k8s killing healthy instances happily humming
along serving requests.

One idea floated was to use QoSFilters
https://www.eclipse.org/jetty/javadoc/jetty-9/org/eclipse/jetty/servlets/QoSFilter.html
to prioritize those requests in front of others. But I suspect this
might not be our bottleneck.

Our health endpoint was updated to see if it could acquire an endpoint
when we were dealing with connection pool issues. We were reporting the
instance was healthy once it has finished the init process, but would
report healthy if 60/15 app-db connections were used and no actual
queries could complete.

The remedy was adding
`(sql-jdbc.conn/can-connect-with-spec? {:datasource (mdb.connection/data-source)})`
to the endpoint. But now to get information about the health of the
system we have to wait in the queue to get a datasource.

The hope is that this change which monitors for recent db
checkins (query success) and checkouts (query begun) can be a proxy for
db activity without having to wait for a connection and hit the db ourselves.

Some simple and crude benchmarking:
- use `siege` to hit `api/database/<app-db>/sync_schema`
- in a separate tab, use `siege` to hit `api/health`

Three trials with unconditional db access and conditional db
access (look for recent activity set by the new `ConnectionCustomizer`).

One siege client is synching the app-db's schema with 80 clients each
sending 60 requests. the other has 1 client sending 60 requests to api/health.

Run             |  Elapsed Time | max tx  | tx rate
 before change  |    7.16s      |  0.79s  |  8.38 tx/s
 before change  |   23.91s      |  1.44s  |  2.51 tx/s
 before change  |   13.00s      |  0.50s  |  4.62 tx/s
----------------------------------------------------
 after change   |    4.46s      |  0.27s  |  13.45 tx/s
 after change   |    5.81s      |  0.61s  |  10.33 tx/s
 after change   |    4.54s      |  0.44s  |  13.22 tx/s

Full(er) results below:

```
Unconditional db access
=======================

siege -c80 -r 40 "http://localhost:3000/api/database/2/sync_schema POST" -H "Cookie: $SESSION"

siege -c 1 -r 60 "http://localhost:3000/api/health"

Elapsed time:		        7.16 secs
Response time:		        0.12 secs
Transaction rate:	        8.38 trans/sec
Longest transaction:	        0.79
Shortest transaction:	        0.01

Elapsed time:		       23.91 secs
Response time:		        0.40 secs
Transaction rate:	        2.51 trans/sec
Longest transaction:	        1.44
Shortest transaction:	        0.02

Elapsed time:		       13.00 secs
Response time:		        0.22 secs
Transaction rate:	        4.62 trans/sec
Longest transaction:	        0.50
Shortest transaction:	        0.06

Conditional db access
==============================================================

Elapsed time:		        4.46 secs
Response time:		        0.07 secs
Transaction rate:	       13.45 trans/sec
Longest transaction:	        0.27
Shortest transaction:	        0.01

Elapsed time:		        5.81 secs
Response time:		        0.10 secs
Transaction rate:	       10.33 trans/sec
Longest transaction:	        0.61
Shortest transaction:	        0.00

Elapsed time:		        4.54 secs
Response time:		        0.08 secs
Transaction rate:	       13.22 trans/sec
Longest transaction:	        0.44
Shortest transaction:	        0.01
```

* Remove reflection in `.put` call (not the reflections trategy)

also remove the call to `classloader/the-classloader` as it did nothing

* Comment and settle on a single method

* tests

* select from db twice

had a failure in CI. give it time to do its thing with another db call

* block to wait for timestamp update?

* unflake the tests

tasks and events from outside the thread can hit the db. the
ConnectionCustomizer is also run from c3p0 controlled threads so we
can't easily isolate everything to our thread

Was running

```clojure
(comment
  (dotimes [n 5]
    (dotimes [_ 100]
      (recent-activity-test)
      (CheckinTracker-test))
    (println (* (inc n) 100)))
    )
```

to run the tests 500 times and would keep getting flakes at a rate
~1/100 to 1/500. Just frustration for the future.

* typehint

* Switch it up a bit

Tests were flaking in h2 and I don't know why. I'm switching to just
updating recent activity on most methods.
parent 8e385d49
No related merge requests found
(ns metabase.db.connection-pool-setup
"Code for creating the connection pool for the application DB and setting it as the default Toucan connection."
(:require [metabase.config :as config]
(:require [java-time :as t]
[metabase.config :as config]
[metabase.connection-pool :as connection-pool]
[schema.core :as s])
(:import com.mchange.v2.c3p0.PoolBackedDataSource))
(:import [com.mchange.v2.c3p0 ConnectionCustomizer PoolBackedDataSource]))
(def ^:private latest-activity (atom nil))
(def ^:private ^java.time.Duration recent-window-duration (t/seconds 15))
(defn- recent-activity?*
[activity duration]
(when activity
(t/after? activity (t/minus (t/offset-date-time) duration))))
(defn recent-activity?
"Returns true if there has been recent activity. Define recent activity as an application db connection checked in,
checked out, or acquired within [[recent-window-duration]]. Check-in means a query succeeded and the db connection
is no longer needed."
[]
(recent-activity?* @latest-activity recent-window-duration))
(defrecord DbActivityTracker []
ConnectionCustomizer
(onAcquire [_ _connection _identity-token]
(reset! latest-activity (t/offset-date-time)))
(onCheckIn [_ _connection _identity-token]
(reset! latest-activity (t/offset-date-time)))
(onCheckOut [_ _connection _identity-token]
(reset! latest-activity (t/offset-date-time)))
(onDestroy [_ _connection _identity-token]))
(defn- register-customizer!
"c3p0 allows for hooking into lifecycles with its interface
ConnectionCustomizer. https://www.mchange.com/projects/c3p0/apidocs/com/mchange/v2/c3p0/ConnectionCustomizer.html. But
Clojure defined code is in memory in a dynamic class loader not available to c3p0's use of Class/forName. Luckily it
looks up the instances in a cache which I pre-seed with out impl here. Issue for better access here:
https://github.com/swaldman/c3p0/issues/166"
[^Class klass]
(let [field (doto (.getDeclaredField com.mchange.v2.c3p0.C3P0Registry "classNamesToConnectionCustomizers")
(.setAccessible true))]
(.put ^java.util.HashMap (.get field com.mchange.v2.c3p0.C3P0Registry)
(.getName klass) (.newInstance klass))))
(register-customizer! DbActivityTracker)
(def ^:private application-db-connection-pool-props
"Options for c3p0 connection pool for the application DB. These are set in code instead of a properties file because
......@@ -11,7 +53,8 @@
https://www.mchange.com/projects/c3p0/#configuring_connection_testing for an overview of the options used
below (jump to the 'Simple advice on Connection testing' section.)"
(merge
{"idleConnectionTestPeriod" 60}
{"idleConnectionTestPeriod" 60
"connectionCustomizerClassName" (.getName DbActivityTracker)}
;; only merge in `max-pool-size` if it's actually set, this way it doesn't override any things that may have been
;; set in `c3p0.properties`
(when-let [max-pool-size (config/config-int :mb-application-db-max-connection-pool-size)]
......
......@@ -8,6 +8,7 @@
[metabase.api.routes :as api]
[metabase.core.initialization-status :as init-status]
[metabase.db.connection :as mdb.connection]
[metabase.db.connection-pool-setup :as mdb.connection-pool-setup]
[metabase.driver.sql-jdbc.connection :as sql-jdbc.conn]
[metabase.plugins.classloader :as classloader]
[metabase.public-settings :as public-settings]
......@@ -49,7 +50,8 @@
;; ^/api/health -> Health Check Endpoint
(GET "/api/health" []
(if (init-status/complete?)
(try (if (sql-jdbc.conn/can-connect-with-spec? {:datasource (mdb.connection/data-source)})
(try (if (or (mdb.connection-pool-setup/recent-activity?)
(sql-jdbc.conn/can-connect-with-spec? {:datasource (mdb.connection/data-source)}))
{:status 200, :body {:status "ok"}}
{:status 503 :body {:status "Unable to get app-db connection"}})
(catch Exception e
......
(ns metabase.db.connection-pool-setup-test
(:require [clojure.java.jdbc :as jdbc]
[clojure.test :refer :all]
[java-time :as t]
[metabase.connection-pool :as connection-pool]
[metabase.db.connection-pool-setup :as mdb.connection-pool-setup]
[metabase.db.data-source :as mdb.data-source]
[metabase.test :as mt])
(:import com.mchange.v2.c3p0.PoolBackedDataSource))
[metabase.models :refer [Database]]
[metabase.test :as mt]
[metabase.test.fixtures :as fixtures]
[metabase.util :as u]
[toucan.db :as db])
(:import [com.mchange.v2.c3p0 C3P0Registry ConnectionCustomizer PoolBackedDataSource]
metabase.db.connection_pool_setup.DbActivityTracker))
(use-fixtures :once (fixtures/initialize :db))
(deftest connection-pool-spec-test
(testing "Should be able to create a connection pool"
......@@ -28,3 +36,60 @@
(testing "from a connection URL"
(test* :h2 (mdb.data-source/raw-connection-string->DataSource
(format "jdbc:h2:mem:%s;DB_CLOSE_DELAY=10" (mt/random-name))))))))
(defn- simulate-db-activity
"Hit the application db a few times. In practice, monitoring based on db checkins could be not super deterministic and
it is an optimization. This simulates that and reduces the potential for flakes"
[]
(dotimes [_ 5]
(db/count Database)))
(deftest DbActivityTracker-test
(testing "connection customizer is registered"
(let [customizer (C3P0Registry/getConnectionCustomizer (.getName DbActivityTracker))]
(is (some? customizer) "ConnectionCustomizer is not registered with c3p0")
(is (instance? ConnectionCustomizer customizer)
"checkin tracker must satisfy the c3p0 ConnectionCustomizer interface")
(is (instance? DbActivityTracker customizer)
"ConnectionCustomizer is not an instance of our DbActivityTracker")))
(testing "db activity resets counter"
(try
(let [updated? (promise)]
(add-watch (var-get #'mdb.connection-pool-setup/latest-activity)
::DbActivityTracker-test
(fn [_ _ _ _nv]
(deliver updated? ::completed)))
(reset! (var-get #'mdb.connection-pool-setup/latest-activity) nil)
(simulate-db-activity)
(u/deref-with-timeout updated? 200)
(let [recent-checkin (deref (var-get #'mdb.connection-pool-setup/latest-activity))]
(is (some? recent-checkin)
"Database activity did not reset latest-checkin")
(is (instance? java.time.temporal.Temporal recent-checkin)
"recent-checkin should be a temporal type (OffsetDateTime)")))
(finally (remove-watch (var-get #'mdb.connection-pool-setup/latest-activity)
::DbActivityTracker-test)))))
(deftest recent-activity-test
;; these tests are difficult to make non-flaky. Other threads can hit the db of course, and the lifecycle of the
;; connection pool is worked from other threads. This means we can't isolate the `latest-checkin` atom. Many will
;; take the value of the checkin timestamp and pass it to `recent-activity?*` to act on the value at the time it
;; cares about rather than trying to suppress writes to the `latest-checkin`. If you change this, run the test about 500 times to make sure there aren't flakes.
(testing "If latest-checkin is null"
(reset! (var-get #'mdb.connection-pool-setup/latest-activity) nil)
(is (not (#'mdb.connection-pool-setup/recent-activity?* nil (t/millis 10))))
(testing "db activity makes `recent-activity?` true"
(simulate-db-activity)
(is (mdb.connection-pool-setup/recent-activity?))))
(testing "If latest-checkin is stale"
(let [duration (var-get #'mdb.connection-pool-setup/recent-window-duration)
twice-duration (t/minus (t/offset-date-time) duration duration)]
(is (not (#'mdb.connection-pool-setup/recent-activity?* twice-duration duration)))))
(testing "Goes stale"
(simulate-db-activity)
;; can't easily control background syncs or activity so just suppress registering
(is (mdb.connection-pool-setup/recent-activity?))
(testing "When duration elapses should report no recent-activity"
(let [latest-activity (deref (var-get #'mdb.connection-pool-setup/latest-activity))]
(Thread/sleep 30)
(is (not (#'mdb.connection-pool-setup/recent-activity?* latest-activity (t/millis 10)))
"recent-window-duration has elapsed but still recent")))))
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment