Ensure we are paginating resultsets (#18477)
* Ensure we are paginating resultsets

Made big tables in both pg and mysql.

pg:

```sql
create table large_table (
  id serial primary key,
  large_text text
);

insert into large_table (large_text)
select repeat('Z', 4000)
from generate_series(1, 500000)
```

In mysql, use the REPL:

```clojure
(jdbc/execute! (sql-jdbc.conn/db->pooled-connection-spec 5)
               ["CREATE TABLE large_table (id int NOT NULL PRIMARY KEY AUTO_INCREMENT, foo text);"])

(do (jdbc/insert-multi! (sql-jdbc.conn/db->pooled-connection-spec 5)
                        :large_table
                        (repeat 50000 {:foo (apply str (repeat 5000 "Z"))}))
    :done)

(jdbc/execute! (sql-jdbc.conn/db->pooled-connection-spec 5)
               ["ALTER TABLE large_table add column properties json default null"])

(jdbc/execute! (sql-jdbc.conn/db->pooled-connection-spec 5)
               ["update large_table set properties = '{\"data\":{\"cols\":null,\"native_form\":{\"query\":\"SELECT `large_table`.`id` AS `id`, `large_table`.`foo` AS `foo` FROM `large_table` LIMIT 1\",\"params\":null},\"results_timezone\":\"UTC\",\"results_metadata\":{\"checksum\":\"0MnSKb8145UERWn18F5Uiw==\",\"columns\":[{\"semantic_type\":\"type/PK\",\"coercion_strategy\":null,\"name\":\"id\",\"field_ref\":[\"field\",200,null],\"effective_type\":\"type/Integer\",\"id\":200,\"display_name\":\"ID\",\"fingerprint\":null,\"base_type\":\"type/Integer\"},{\"semantic_type\":null,\"coercion_strategy\":null,\"name\":\"foo\",\"field_ref\":[\"field\",201,null],\"effective_type\":\"type/Text\",\"id\":201,\"display_name\":\"Foo\",\"fingerprint\":{\"global\":{\"distinct-count\":1,\"nil%\":0.0},\"type\":{\"type/Text\":{\"percent-json\":0.0,\"percent-url\":0.0,\"percent-email\":0.0,\"percent-state\":0.0,\"average-length\":500.0}}},\"base_type\":\"type/Text\"}]},\"insights\":null,\"count\":1}}'"])
```

and then from the terminal client, repeat this until we have 800,000 rows:

```sql
insert into large_table (foo, properties)
select foo, properties from large_table;
```

Then we can exercise it from code with the following:

```clojure
(-> (qp/process-query {:database 5 ; use appropriate db and tables here
                       :query    {:source-table 42
                                  ;; :limit 1000000
                                  }
                       :type     :query}
                      ;; don't retain any rows, purely just counting,
                      ;; so the resultset is what retains too many rows
                      {:rff (fn [metadata]
                              (let [c (volatile! 0)]
                                (fn count-rff
                                  ([] {:data metadata})
                                  ([result] (assoc-in result [:data :count] @c))
                                  ([result _row] (vswap! c inc) result))))})
    :data
    :count)
```

PG was far easier to blow up; MySQL took quite a bit of data.

Then we just set a fetch size on the result set so that we (hopefully) only have that many rows in memory in the resultset at once. The streaming will write to the download stream as it goes.

PG has one other complication: the fetch size can only be honored if autoCommit is false. The reasoning seems to be that each statement runs in its own transaction and commits; to commit, the driver has to close its resultsets, and so it has to realize the entire resultset, otherwise you would only ever get the initial page, if any.
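To make that mechanism concrete, here is a minimal sketch (not the actual code in this commit) of streaming a large resultset over plain JDBC interop with a fetch size set and autoCommit turned off; the connection URL, credentials, and fetch size of 500 are illustrative assumptions:

```clojure
(ns example.fetch-size-sketch
  (:import [java.sql Connection DriverManager ResultSet]))

(defn count-large-table
  "Stream over large_table while keeping only ~500 rows buffered by the
  JDBC driver at a time. Connection URL and credentials are placeholders."
  []
  (with-open [^Connection conn (DriverManager/getConnection
                                "jdbc:postgresql://localhost:5432/test" "user" "pass")]
    ;; Postgres only honors fetchSize when autoCommit is false; otherwise
    ;; the driver realizes the entire resultset on execute.
    (.setAutoCommit conn false)
    (with-open [stmt (.prepareStatement conn "SELECT id, large_text FROM large_table"
                                        ResultSet/TYPE_FORWARD_ONLY
                                        ResultSet/CONCUR_READ_ONLY)]
      ;; ask the driver to pull rows from the server 500 at a time
      (.setFetchSize stmt 500)
      (with-open [rs (.executeQuery stmt)]
        ;; consume the rows without retaining them, like the count-rff above
        (loop [n 0]
          (if (.next rs)
            (recur (inc n))
            n))))))
```

With autoCommit left on, the Postgres driver pulls the whole resultset into memory on execute, which is exactly the blow-up described above.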
* Set default fetch size to 500

Timings for long queries on gcloud pg (times in ms):

limit 10,000:

| fetch size | t1   | t2   | t3   |
|-----------:|------|------|------|
|        100 | 6030 | 8804 | 5986 |
|        500 | 1537 | 1535 | 1494 |
|       1000 | 1714 | 1802 | 1611 |
|       3000 | 1644 | 1595 | 2044 |

limit 30,000:

| fetch size | t1    | t2    | t3    |
|-----------:|-------|-------|-------|
|        100 | 17341 | 15991 | 16061 |
|        500 | 4112  | 4182  | 4851  |
|       1000 | 5075  | 4546  | 4284  |
|       3000 | 5405  | 5055  | 4745  |

* Only set fetch size if not default (0)

Setting `:additional-options "defaultRowFetchSize=3000"` on a database sets a default fetch size, and we can easily honor that. This allows overriding the fetch size per database without much work on our part (see the sketch below, after these notes).

* Remove redshift custom fetch size code

This removes the automatic insertion of `defaultRowFetchSize=5000` on Redshift dbs. Now we always set this to 500 in the sql-jdbc statement and prepared statement code, and we allow custom values to take precedence over our default of 500.

One additional benefit of removing this is that the option used to be appended even if a user added `?defaultRowFetchSize=300` themselves, so this should actually give more control to our users.

Profiling quickly on selecting 79,000 rows from Redshift, there was essentially no difference between a fetch size of 500 (the new default) and 5000 (the old Redshift default); both took around 12442 ms.

* Remove unused require of settings in redshift tests

* Appease the linter

* Remove unnecessary redshift connection details tests
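As a rough illustration of the "only set fetch size if not default (0)" rule, here is a hedged sketch; it is not the exact code added to src/metabase/driver/sql_jdbc/execute.clj, and the names `default-fetch-size` and `set-default-fetch-size!` are made up for the example:

```clojure
;; Hypothetical helper; not the exact code added to
;; src/metabase/driver/sql_jdbc/execute.clj in this commit.
(def ^:private default-fetch-size
  "Fetch size to use when the driver reports the JDBC default of 0."
  500)

(defn set-default-fetch-size!
  "Set the fetch size on `stmt` only if it is still the JDBC default (0),
  so a connection-level defaultRowFetchSize supplied by the user wins."
  [^java.sql.Statement stmt]
  (when (zero? (.getFetchSize stmt))
    (.setFetchSize stmt default-fetch-size)))
```

The idea, per the notes above, is that a connection-level `defaultRowFetchSize` shows up as a non-zero fetch size on statements created from that connection, so checking for 0 lets a user-supplied `:additional-options` value win over our default.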
Showing 4 changed files with 26 additions and 68 deletions
- modules/drivers/redshift/src/metabase/driver/redshift.clj: 1 addition, 2 deletions
- modules/drivers/redshift/test/metabase/driver/redshift_test.clj: 1 addition, 60 deletions
- src/metabase/driver/sql_jdbc/execute.clj: 24 additions, 0 deletions
- src/metabase/public_settings.clj: 0 additions, 6 deletions