Enable debugUnreturnedConnectionStackTraces in...

Enable debugUnreturnedConnectionStackTraces in data-warehouse-connection-pool-properties (#49456) (#49807) * Add blurb from C3P0 docs to comment for unreturnedConnectionTimeout * Enable debugUnreturnedConnectionStackTraces in data-warehouse-connection-pool-properties Closes #47981 * Add defsetting for jdbc-data-warehouse-debug-unreturned-connection-stack-traces And only enable debugUnreturnedConnectionStackTraces in the C3P0 config if the setting is enabled. There is some memory cost to enabling the setting, and since maxPoolSize is user-configurable, it seems safer to disable debugUnreturnedConnectionStackTraces by default. * Add metabase.logger/level-enabled? * Log a warning if C3P0 logs are not configured correctly to log stack traces Log a warning if the user enables the jdbc-data-warehouse-debug-unreturned-connection-stack-traces but does not raise the log level for the com.mchange package to INFO or higher. * Add docs for MB_JDBC_DATA_WAREHOUSE_DEBUG_UNRETURNED_CONNECTION_STACK_TRACES * Improve markdown docs for MB_JDBC_DATA_WAREHOUSE_DEBUG_UNRETURNED_CONNECTION_STACK_TRACES --------- Co-authored-by: appleby <86076+appleby@users.noreply.github.com> Co-authored-by: Jeff Bruemmer <jeff.bruemmer@gmail.com>

Enable debugUnreturnedConnectionStackTraces in...
9a43165a · github-automation-metabase · GitHub · 5dee9733 · 9a43165a · 9a43165a
Unverified Commit 9a43165a authored 5 months ago by github-automation-metabase Committed by GitHub 5 months ago
--- a/docs/configuring-metabase/environment-variables.md
+++ b/docs/configuring-metabase/environment-variables.md
@@ -1883,6 +1883,20 @@ This variable affects connections that are severed and undetected by Metabase (t

 Unless set otherwise, the default production value for `metabase.query-processor.query-timeout-ms` is used which is 1,200,000 ms (i.e. 1,200 seconds or 20 minutes).

+### `MB_JDBC_DATA_WAREHOUSE_DEBUG_UNRETURNED_CONNECTION_STACK_TRACES`
+
+Type: boolean<br>
+Default: `false`<br>
+Since: v51.3
+
+If `true`, log a stack trace for any connections killed due to exceeding the timeout specified in [MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS](#mb_jdbc_data_warehouse_unreturned_connection_timeout_seconds).
+
+Note: In addtion to enabling this variable, you need to update the com.mchange log level to INFO or higher via a custom log4j configuration in order to see the stack traces in the logs.
+
+See the [Metabase log configuration](./log-configuration.md) documentation for how to configure log levels.
+
+See [MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS](#mb_jdbc_data_warehouse_unreturned_connection_timeout_seconds) for setting the timeout after which connections will be killed.
+
 ### `MB_JETTY_ASYNC_RESPONSE_TIMEOUT`

 Type: integer<br>

--- a/src/metabase/cmd/resources/other-env-vars.md
+++ b/src/metabase/cmd/resources/other-env-vars.md
@@ -221,6 +221,18 @@ This variable affects connections that are severed and undetected by Metabase (t

 Unless set otherwise, the default production value for `metabase.query-processor.query-timeout-ms` is used which is 1,200,000 ms (i.e. 1,200 seconds or 20 minutes).

+### `MB_JDBC_DATA_WAREHOUSE_DEBUG_UNRETURNED_CONNECTION_STACK_TRACES`
+
+Type: boolean<br>
+Default: `false`<br>
+Since: v51.3
+
+If `true`, log a stack trace for any connections killed due to exceeding the timeout specified in [MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS](#mb_jdbc_data_warehouse_unreturned_connection_timeout_seconds).
+
+In order to see the stack traces in the logs, you'll also need to update the com.mchange log level to "INFO" or higher via a custom log4j configuration. For configuring log levels, see [Metabase log configuration](./log-configuration.md).
+
+To set a timeout for how long Metabase should wait before it kills unreturned connections, see [MB_JDBC_DATA_WAREHOUSE_UNRETURNED_CONNECTION_TIMEOUT_SECONDS](#mb_jdbc_data_warehouse_unreturned_connection_timeout_seconds).
+
 ### `MB_JETTY_ASYNC_RESPONSE_TIMEOUT`

 Type: integer<br>

--- a/src/metabase/driver/sql_jdbc/connection.clj
+++ b/src/metabase/driver/sql_jdbc/connection.clj
@@ -9,6 +9,7 @@
   [metabase.driver.util :as driver.u]
   [metabase.lib.metadata :as lib.metadata]
   [metabase.lib.metadata.jvm :as lib.metadata.jvm]
+   [metabase.logger :as logger]
   [metabase.models.interface :as mi]
   [metabase.models.setting :as setting]
   [metabase.query-processor.pipeline :as qp.pipeline]
@@ -22,7 +23,8 @@
   [toucan2.core :as t2])
  (:import
   (com.mchange.v2.c3p0 DataSources)
-   (javax.sql DataSource)))
+   (javax.sql DataSource)
+   (org.apache.logging.log4j Level)))

 (set! *warn-on-reflection* true)

@@ -103,6 +105,19 @@ For setting the maximum, see [MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE](#mb_ap
                    (long (/ qp.pipeline/*query-timeout-ms* 1000))))
  :setter     :none)

+(setting/defsetting jdbc-data-warehouse-debug-unreturned-connection-stack-traces
+  "Tell c3p0 to log a stack trace for any connections killed due to exceeding the timeout specified in
+  [[jdbc-data-warehouse-unreturned-connection-timeout-seconds]].
+
+  Note: You also need to update the com.mchange log level to INFO or higher in the log4j configs in order to see the
+  stack traces in the logs."
+  :visibility :internal
+  :type       :boolean
+  :default    false
+  :export?    false
+  :setter     :none
+  :doc        false) ; This setting is documented in other-env-vars.md.
+
 (defmethod data-warehouse-connection-pool-properties :default
  [driver database]
  {;; only fetch one new connection at a time, rather than batching fetches (default = 3 at a time). This is done in
@@ -137,10 +152,36 @@ For setting the maximum, see [MB_APPLICATION_DB_MAX_CONNECTION_POOL_SIZE](#mb_ap
   ;;
   ;; Kill idle connections above the minPoolSize after 5 minutes.
   "maxIdleTimeExcessConnections" (* 5 60)
-   ;; kill connections after this amount of time if they haven't been returned -- this should be the same as the query
-   ;; timeout. This theoretically shouldn't happen since the QP should kill things after a certain timeout but it's
-   ;; better to be safe than sorry -- it seems like in practice some connections disappear into the ether
+   ;; [From dox] Seconds. If set, if an application checks out but then fails to check-in [i.e. close()] a Connection
+   ;; within the specified period of time, the pool will unceremoniously destroy() the Connection. This permits
+   ;; applications with occasional Connection leaks to survive, rather than eventually exhausting the Connection
+   ;; pool. And that's a shame. Zero means no timeout, applications are expected to close() their own
+   ;; Connections. Obviously, if a non-zero value is set, it should be to a value longer than any Connection should
+   ;; reasonably be checked-out. Otherwise, the pool will occasionally kill Connections in active use, which is bad.
+   ;;
+   ;; This should be the same as the query timeout. This theoretically shouldn't happen since the QP should kill
+   ;; things after a certain timeout but it's better to be safe than sorry -- it seems like in practice some
+   ;; connections disappear into the ether
   "unreturnedConnectionTimeout"  (jdbc-data-warehouse-unreturned-connection-timeout-seconds)
+   ;; [From dox] If true, and if unreturnedConnectionTimeout is set to a positive value, then the pool will capture
+   ;; the stack trace (via an Exception) of all Connection checkouts, and the stack traces will be printed when
+   ;; unreturned checked-out Connections timeout. This is intended to debug applications with Connection leaks, that
+   ;; is applications that occasionally fail to return Connections, leading to pool growth, and eventually
+   ;; exhaustion (when the pool hits maxPoolSize with all Connections checked-out and lost). This parameter should
+   ;; only be set while debugging, as capturing the stack trace will slow down every Connection check-out.
+   ;;
+   ;; As noted in the C3P0 docs, this does add some overhead to create the Exception at Connection checkout.
+   ;; criterium/quick-bench indicates this is ~600ns of overhead per Exception created on my laptop, which is small
+   ;; compared to the overhead added by testConnectionCheckout, above. The memory usage will depend on the size of the
+   ;; stack trace, but clj-memory-meter reports ~800 bytes for a fresh Exception created at the REPL (which presumably
+   ;; has a smaller-than-average stack).
+   "debugUnreturnedConnectionStackTraces" (u/prog1 (jdbc-data-warehouse-debug-unreturned-connection-stack-traces)
+                                            (when (and <> (not (logger/level-enabled? 'com.mchange Level/INFO)))
+                                              (log/warn "jdbc-data-warehouse-debug-unreturned-connection-stack-traces"
+                                                        "is enabled, but INFO logging is not enabled for the"
+                                                        "com.mchange namespace. You must raise the log level for"
+                                                        "com.mchange to INFO via a custom log4j config in order to"
+                                                        "see stacktraces in the logs.")))
   ;; Set the data source name so that the c3p0 JMX bean has a useful identifier, which incorporates the DB ID, driver,
   ;; and name from the details
   "dataSourceName"               (format "db-%d-%s-%s"

--- a/src/metabase/logger.clj
+++ b/src/metabase/logger.clj
@@ -82,6 +82,14 @@
    (name (ns-name a-namespace))
    (name a-namespace)))

+(defn level-enabled?
+  "Is logging at `level` enabled for `a-namespace`?"
+  (^Boolean [level]
+   (level-enabled? *ns* level))
+  (^Boolean [a-namespace level]
+   (let [^Logger logger (log.impl/get-logger log/*logger-factory* a-namespace)]
+     (.isEnabled logger level))))
+
 (defn effective-ns-logger
  "Get the logger that will be used for the namespace named by `a-namespace`."
  ^LoggerConfig [a-namespace]

--- a/test/metabase/driver/sql_jdbc/connection_test.clj
+++ b/test/metabase/driver/sql_jdbc/connection_test.clj
@@ -240,6 +240,42 @@
      (is (= 20
             (sql-jdbc.conn/jdbc-data-warehouse-unreturned-connection-timeout-seconds))))))

+(deftest ^:parallel include-debug-unreturned-connection-stack-traces-test
+  (testing "We should be setting debugUnreturnedConnectionStackTraces (#47981)"
+    (is (=? {"debugUnreturnedConnectionStackTraces" boolean?}
+            (sql-jdbc.conn/data-warehouse-connection-pool-properties :h2 (mt/db))))))
+
+(deftest debug-unreturned-connection-stack-traces-test
+  (testing "We should be able to set jdbc-data-warehouse-debug-unreturned-connection-stack-traces via env var (#47981)"
+    (doseq [setting [true false]]
+      (mt/with-temp-env-var-value! [mb-jdbc-data-warehouse-debug-unreturned-connection-stack-traces (str setting)]
+        (is (= setting
+               (sql-jdbc.conn/jdbc-data-warehouse-debug-unreturned-connection-stack-traces))
+            (str "setting=" setting))))))
+
+(deftest debug-unreturned-connection-stack-traces-misconfigured-c3p0-log-warning-test
+  (testing "We should log a warning if debug stack traces are enabled but c3p0 INFO logs are not (#47981)\n"
+    ;; kondo thinks the c3p0-log-level binding is unused
+    #_{:clj-kondo/ignore [:unused-binding]}
+    (letfn [(warning-found? [warnings]
+              (boolean (some #(str/includes?
+                               (:message %)
+                               "You must raise the log level for com.mchange to INFO")
+                             warnings)))
+            (warnings-logged? [c3p0-log-level setting warning-expected?]
+              (mt/with-temp-env-var-value! [mb-jdbc-data-warehouse-debug-unreturned-connection-stack-traces setting]
+                (mt/with-log-level [com.mchange c3p0-log-level]
+                  (mt/with-log-messages-for-level [warnings :warn]
+                    (and (= setting
+                            (get (sql-jdbc.conn/data-warehouse-connection-pool-properties :h2 (mt/db))
+                                 "debugUnreturnedConnectionStackTraces"))
+                         (= warning-expected? (warning-found? (warnings))))))))]
+      (are [c3p0-log-level setting warning-expected?] (warnings-logged? c3p0-log-level setting warning-expected?)
+        :error true  true
+        :error false false
+        :info  true  false
+        :info  false false))))
+
 (defn- init-h2-tcp-server [port]
  (let [args   ["-tcp" "-tcpPort", (str port), "-tcpAllowOthers" "-tcpDaemon"]
        server (Server/createTcpServer (into-array args))]

--- a/test/metabase/logger_test.clj
+++ b/test/metabase/logger_test.clj
@@ -9,6 +9,7 @@
   [metabase.logger :as logger]
   [metabase.test :as mt])
  (:import
+   (org.apache.logging.log4j Level)
   (org.apache.logging.log4j.core Logger)))

 (set! *warn-on-reflection* true)
@@ -94,3 +95,37 @@
        (finally
          (when (.exists f)
            (io/delete-file f)))))))
+
+(deftest level-enabled?-test
+  (are [set-level check-level expected-value] (= expected-value
+                                                 (mt/with-log-level [metabase.logger-test set-level]
+                                                   (logger/level-enabled? 'metabase.logger-test check-level)))
+    :error Level/ERROR true
+    :error Level/WARN  false
+    :error Level/INFO  false
+    :error Level/DEBUG false
+    :error Level/TRACE false
+
+    :warn Level/ERROR true
+    :warn Level/WARN  true
+    :warn Level/INFO  false
+    :warn Level/DEBUG false
+    :warn Level/TRACE false
+
+    :info Level/ERROR true
+    :info Level/WARN  true
+    :info Level/INFO  true
+    :info Level/DEBUG false
+    :info Level/TRACE false
+
+    :debug Level/ERROR true
+    :debug Level/WARN  true
+    :debug Level/INFO  true
+    :debug Level/DEBUG true
+    :debug Level/TRACE false
+
+    :trace Level/ERROR true
+    :trace Level/WARN  true
+    :trace Level/INFO  true
+    :trace Level/DEBUG true
+    :trace Level/TRACE true))