Test data warehouse connections on checkout (#11018)

c3p0 has an option to test connections when fetched from the connection pool intended to fix that exact problem, so this was a simple matter of enabling it. It is not enabled by default because it adds overhead to connection checkout, especially for JDBC drivers that don't support the full JDBC 4 API, which includes a Connection.isValid() method. However, all of our JDBC drivers are JDBC 4 compliant, meaning c3p0 can check connection validity fairly efficiently. With the data warehouse on the same machine as the Metabase server, the added overhead was in the order of ~100µs, or an order of magnitude or two below the threshold of human perception. With the data warehouse in AWS us-east-1 and the Metabase server in San Francisco, the overhead was around ~70ms, which is basically typical network latency for such a request. Based on those results, it appears that the additional cost of this test is roughly equal to the network latency of the request. IRL the Metabase server and data warehouse are likely to be located in closer geographical proximity to one another than my trans-contintental tests, if not in the same AWS region (or equivalent). I would expect the added overhead for most query executions for most setups to be in the 1-10ms range and the p99 to be under 100ms extra latency. Accepting that overhead seems like a no-brainer when considering that an entire class of bugs will be fixed by enabling the option.

Test data warehouse connections on checkout (#11018)
cbabdf4e · Cam Saul · GitHub · c43faa64 · cbabdf4e · cbabdf4e
Unverified Commit cbabdf4e authored 5 years ago by Cam Saul Committed by GitHub 5 years ago
--- a/modules/drivers/sparksql/project.clj
+++ b/modules/drivers/sparksql/project.clj
@@ -20,7 +20,7 @@
                 org.tukaani/xz]]
   ;; TODO - this is deprecated, seems like we'll want to upgrade to `org.apache.hive/hive-jdbc` in the future. Don't
   ;; thing it works with Spark SQL atm however
-   [org.spark-project.hive/hive-jdbc "1.2.1.spark2"
+   [org.apache.hive/hive-jdbc "1.2.1"
    :exclusions
    [#_com.google.guava/guava
     commons-logging

--- a/modules/drivers/sparksql/src/metabase/driver/hive_like.clj
+++ b/modules/drivers/sparksql/src/metabase/driver/hive_like.clj
@@ -5,7 +5,9 @@
            [metabase
             [driver :as driver]
             [util :as u]]
-            [metabase.driver.sql-jdbc.sync :as sql-jdbc.sync]
+            [metabase.driver.sql-jdbc
+             [connection :as sql-jdbc.conn]
+             [sync :as sql-jdbc.sync]]
            [metabase.driver.sql.query-processor :as sql.qp]
            [metabase.driver.sql.util.unprepare :as unprepare]
            [metabase.models.table :refer [Table]]
@@ -18,6 +20,14 @@

 (driver/register! :hive-like, :parent :sql-jdbc, :abstract? true)

+(defmethod sql-jdbc.conn/data-warehouse-connection-pool-properties :hive-like
+  [driver]
+  ;; The Hive JDBC driver doesn't support `Connection.isValid()`, so we need to supply a test query for c3p0 to use to
+  ;; validate connections upon checkout.
+  (merge
+   ((get-method sql-jdbc.conn/data-warehouse-connection-pool-properties :sql-jdbc) driver)
+   {"preferredTestQuery" "SELECT 1"}))
+
 (defmethod sql-jdbc.sync/database-type->base-type :hive-like
  [_ database-type]
  (condp re-matches (name database-type)

--- a/src/metabase/driver/sql_jdbc/connection.clj
+++ b/src/metabase/driver/sql_jdbc/connection.clj
@@ -9,7 +9,7 @@
             [util :as u]]
            [metabase.models.database :refer [Database]]
            [metabase.util
-             [i18n :refer [tru]]
+             [i18n :refer [trs]]
             [ssh :as ssh]]
            [toucan.db :as db]))

@@ -28,32 +28,70 @@
 ;;; |                                           Creating Connection Pools                                            |
 ;;; +----------------------------------------------------------------------------------------------------------------+

-(def ^:private data-warehouse-connection-pool-properties
+(defmulti data-warehouse-connection-pool-properties
  "c3p0 connection pool properties for connected data warehouse DBs. See
-  https://www.mchange.com/projects/c3p0/#configuration_properties for descriptions of properties."
-  {"maxIdleTime"                  (* 3 60 60)
+  https://www.mchange.com/projects/c3p0/#configuration_properties for descriptions of properties.
+
+  The c3p0 dox linked above do a good job of explaining the purpose of these properties and why you might set them.
+  Generally, I have tried to choose configuration options for the data warehouse connection pools that minimize memory
+  usage and maximize reliability, even when it comes with some added performance overhead. These pools are used for
+  powering Cards and the sync process, which are less sensitive to overhead than something like the application DB.
+
+  Drivers that need to override the default properties below can provide custom implementations of this method."
+  {:arglists '([driver])}
+  driver/dispatch-on-initialized-driver
+  :hierarchy #'driver/hierarchy)
+
+(defmethod data-warehouse-connection-pool-properties :default
+  [_]
+  {;; only fetch one new connection at a time, rather than batching fetches (default = 3 at a time). This is done in
+   ;; interest of minimizing memory consumption
+   "acquireIncrement"             1
+   ;; [From dox] Seconds a Connection can remain pooled but unused before being discarded.
+   "maxIdleTime"                  (* 3 60 60) ; 3 hours
   "minPoolSize"                  1
   "initialPoolSize"              1
   "maxPoolSize"                  15
-   ;; prevent broken connections closed by dbs by testing them every 3 mins
-   "idleConnectionTestPeriod"     (* 3 60)
-   ;; prevent overly large pools by condensing them when connections are idle for 15m+
+   ;; [From dox] If true, an operation will be performed at every connection checkout to verify that the connection is
+   ;; valid. [...] ;; Testing Connections in checkout is the simplest and most reliable form of Connection testing,
+   ;; but for better performance, consider verifying connections periodically using `idleConnectionTestPeriod`. [...]
+   ;; If clients usually make complex queries and/or perform multiple operations, adding the extra cost of one fast
+   ;; test per checkout will not much affect performance.
+   ;;
+   ;; As noted in the C3P0 dox, this does add some overhead, but since all of our drivers are JDBC 4 drivers, they can
+   ;; call `Connection.isValid()`, which is reasonably efficient. In my profiling enabling this adds ~100µs for
+   ;; Postgres databases on the same machince and ~70ms for remote databases on AWS east testing against a local
+   ;; server on the West Coast.
+   ;;
+   ;; This suggests the additional cost of this test is more or less based entirely to the network latency of the
+   ;; request. IRL the Metabase server and data warehouse are likely to be located in closer geographical proximity to
+   ;; one another than my trans-contintental tests. Thus in the majority of cases the overhead should be next to
+   ;; nothing, and in the worst case close to imperceptible.
+   "testConnectionOnCheckout"     true
+   ;; [From dox] Number of seconds that Connections in excess of minPoolSize should be permitted to remain idle in the
+   ;; pool before being culled. Intended for applications that wish to aggressively minimize the number of open
+   ;; Connections, shrinking the pool back towards minPoolSize if, following a spike, the load level diminishes and
+   ;; Connections acquired are no longer needed. If maxIdleTime is set, maxIdleTimeExcessConnections should be smaller
+   ;; if the parameter is to have any effect.
+   ;;
+   ;; Kill idle connections above the minPoolSize after 15 minutes.
   "maxIdleTimeExcessConnections" (* 15 60)})

 (defn- create-pool!
-  "Create a new C3P0 `ComboPooledDataSource` for connecting to the given DATABASE."
-  [{:keys [id engine details], :as database}]
+  "Create a new C3P0 `ComboPooledDataSource` for connecting to the given `database`."
+  [{:keys [id details], driver :engine, :as database}]
  {:pre [(map? database)]}
-  (log/debug (u/format-color 'cyan "Creating new connection pool for database %d ..." id))
+  (log/debug (u/format-color 'cyan (trs "Creating new connection pool for {0} database {1} ...") driver id))
  (let [details-with-tunnel (ssh/include-ssh-tunnel details) ;; If the tunnel is disabled this returned unchanged
-        spec                (connection-details->spec engine details-with-tunnel)]
-    (assoc (connection-pool/connection-pool-spec spec data-warehouse-connection-pool-properties)
-      :ssh-tunnel (:tunnel-connection details-with-tunnel))))
+        spec                (connection-details->spec driver details-with-tunnel)
+        properties          (data-warehouse-connection-pool-properties driver)]
+    (assoc (connection-pool/connection-pool-spec spec properties)
+           :ssh-tunnel (:tunnel-connection details-with-tunnel))))

-(defn- destroy-pool! [database-id pool-spec]
-  (log/debug (u/format-color 'red (tru "Closing old connection pool for database {0} ..." database-id)))
+(defn- destroy-pool! [database-id {:keys [ssh-tunnel], :as pool-spec}]
+  (log/debug (u/format-color 'red (trs "Closing old connection pool for database {0} ..." database-id)))
  (connection-pool/destroy-connection-pool! pool-spec)
-  (when-let [ssh-tunnel (:ssh-tunnel pool-spec)]
+  (when ssh-tunnel
    (.disconnect ^com.jcraft.jsch.Session ssh-tunnel)))

 (defonce ^:private ^{:doc "A map of our currently open connection pools, keyed by Database `:id`."}
@@ -65,6 +103,7 @@
  `database-id->connection-pool` directly because it properly closes down old pools in a thread-safe way, ensuring no
  more than one pool is ever open for a single database."
  [database-id pool-spec-or-nil]
+  {:pre [(integer? database-id)]}
  (let [[old-id->pool] (swap-vals! database-id->connection-pool assoc database-id pool-spec-or-nil)]
    ;; if we replaced a different pool with the new pool that is different from the old one, destroy the old pool
    (when-let [old-pool-spec (get old-id->pool database-id)]
@@ -74,7 +113,7 @@

 (defn notify-database-updated
  "Default implementation of `driver/notify-database-updated` for JDBC SQL drivers. We are being informed that a
-  DATABASE has been updated, so lets shut down the connection pool (if it exists) under the assumption that the
+  `database` has been updated, so lets shut down the connection pool (if it exists) under the assumption that the
  connection details have changed."
  [_ database]
  (set-pool! (u/get-id database) nil))
@@ -82,26 +121,25 @@
 (def ^:private create-pool-lock (Object.))

 (defn db->pooled-connection-spec
-  "Return a JDBC connection spec that includes a cp30 `ComboPooledDataSource`.
-   Theses connection pools are cached so we don't create multiple ones to the same DB."
+  "Return a JDBC connection spec that includes a cp30 `ComboPooledDataSource`. These connection pools are cached so we
+  don't create multiple ones for the same DB."
  [database-or-id]
-  (or
-   ;; we have an existing pool for this database, so use it
-   (get @database-id->connection-pool (u/get-id database-or-id))
-   ;; Even tho `set-pool!` will properly shut down old pools if two threads call this method at the same time, we
-   ;; don't want to end up with a bunch of simultaneous threads creating pools only to have them destroyed the very
-   ;; next instant. This will cause their queries to fail. Thus we should do the usual locking here and make sure only
-   ;; one thread will be creating a pool at a given instant.
-   (locking create-pool-lock
-     (or
-      ;; check if another thread created the pool while we were waiting to acquire the lock
-      (get @database-id->connection-pool (u/get-id database-or-id))
-      ;; create a new pool and add it to our cache, then return it
-      (let [db (if (map? database-or-id)
-                 database-or-id
-                 (db/select-one [Database :id :engine :details] :id database-or-id))]
-        (u/prog1 (create-pool! db)
-          (set-pool! (u/get-id database-or-id) <>)))))))
+  (let [database-id (u/get-id database-or-id)]
+    (or
+     ;; we have an existing pool for this database, so use it
+     (get @database-id->connection-pool database-id)
+     ;; Even tho `set-pool!` will properly shut down old pools if two threads call this method at the same time, we
+     ;; don't want to end up with a bunch of simultaneous threads creating pools only to have them destroyed the very
+     ;; next instant. This will cause their queries to fail. Thus we should do the usual locking here and make sure only
+     ;; one thread will be creating a pool at a given instant.
+     (locking create-pool-lock
+       (or
+        ;; check if another thread created the pool while we were waiting to acquire the lock
+        (get @database-id->connection-pool database-id)
+        ;; create a new pool and add it to our cache, then return it
+        (let [db (db/select-one [Database :id :engine :details] :id database-id)]
+          (u/prog1 (create-pool! db)
+            (set-pool! database-id <>))))))))


 ;;; +----------------------------------------------------------------------------------------------------------------+