Ensure we are paginating resultsets (#18477)
* Ensure we are paginating resultsets

Made big tables in both pg and mysql.

pg:

```sql
create table large_table (
  id serial primary key,
  large_text text
);

insert into large_table (large_text)
select repeat('Z', 4000)
from generate_series(1, 500000)
```

In mysql, use the REPL:

```clojure
(jdbc/execute! (sql-jdbc.conn/db->pooled-connection-spec 5)
               ["CREATE TABLE large_table (id int NOT NULL PRIMARY KEY AUTO_INCREMENT, foo text);"])

(do (jdbc/insert-multi! (sql-jdbc.conn/db->pooled-connection-spec 5)
                        :large_table
                        (repeat 50000 {:foo (apply str (repeat 5000 "Z"))}))
    :done)

(jdbc/execute! (sql-jdbc.conn/db->pooled-connection-spec 5)
               ["ALTER TABLE large_table add column properties json default null"])

(jdbc/execute! (sql-jdbc.conn/db->pooled-connection-spec 5)
               ["update large_table set properties = '{\"data\":{\"cols\":null,\"native_form\":{\"query\":\"SELECT `large_table`.`id` AS `id`, `large_table`.`foo` AS `foo` FROM `large_table` LIMIT 1\",\"params\":null},\"results_timezone\":\"UTC\",\"results_metadata\":{\"checksum\":\"0MnSKb8145UERWn18F5Uiw==\",\"columns\":[{\"semantic_type\":\"type/PK\",\"coercion_strategy\":null,\"name\":\"id\",\"field_ref\":[\"field\",200,null],\"effective_type\":\"type/Integer\",\"id\":200,\"display_name\":\"ID\",\"fingerprint\":null,\"base_type\":\"type/Integer\"},{\"semantic_type\":null,\"coercion_strategy\":null,\"name\":\"foo\",\"field_ref\":[\"field\",201,null],\"effective_type\":\"type/Text\",\"id\":201,\"display_name\":\"Foo\",\"fingerprint\":{\"global\":{\"distinct-count\":1,\"nil%\":0.0},\"type\":{\"type/Text\":{\"percent-json\":0.0,\"percent-url\":0.0,\"percent-email\":0.0,\"percent-state\":0.0,\"average-length\":500.0}}},\"base_type\":\"type/Text\"}]},\"insights\":null,\"count\":1}}'"])
```

and then from the terminal client, repeat this until we have 800,000 rows:

```sql
insert into large_table (foo, properties)
select foo, properties from large_table;
```

Then we can exercise it from code with the following:

```clojure
(-> (qp/process-query {:database 5 ; use appropriate db and tables here
                       :query    {:source-table 42
                                  ;; :limit 1000000
                                  }
                       :type     :query}
                      ;; don't retain any rows, purely just counting,
                      ;; so the resultset is what retains too many rows
                      {:rff (fn [metadata]
                              (let [c (volatile! 0)]
                                (fn count-rff
                                  ([] {:data metadata})
                                  ([result] (assoc-in result [:data :count] @c))
                                  ([result _row] (vswap! c inc) result))))})
    :data
    :count)
```

PG was far easier to blow up; MySQL took quite a bit of data.

Then we just set a fetch size on the result set so that we (hopefully) only have that many rows in memory in the resultset at once. The streaming will write to the download stream as it goes.

PG has one other complication: the fetch size can only be honored if autoCommit is false. The reasoning seems to be that each statement runs in its own transaction and commits; to commit, the driver has to close its resultsets, and so it has to realize the entire resultset, otherwise you would only ever get the initial page, if any.
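To make that mechanism concrete, here is a minimal sketch (not the actual code in this commit) of streaming a large resultset over plain JDBC interop with a fetch size set and autoCommit turned off; the connection URL, credentials, and fetch size of 500 are illustrative assumptions:

```clojure
(ns example.fetch-size-sketch
  (:import [java.sql Connection DriverManager ResultSet]))

(defn count-large-table
  "Stream over large_table while keeping only ~500 rows buffered by the
  JDBC driver at a time. Connection URL and credentials are placeholders."
  []
  (with-open [^Connection conn (DriverManager/getConnection
                                "jdbc:postgresql://localhost:5432/test" "user" "pass")]
    ;; Postgres only honors fetchSize when autoCommit is false; otherwise
    ;; the driver realizes the entire resultset on execute.
    (.setAutoCommit conn false)
    (with-open [stmt (.prepareStatement conn "SELECT id, large_text FROM large_table"
                                        ResultSet/TYPE_FORWARD_ONLY
                                        ResultSet/CONCUR_READ_ONLY)]
      ;; ask the driver to pull rows from the server 500 at a time
      (.setFetchSize stmt 500)
      (with-open [rs (.executeQuery stmt)]
        ;; consume the rows without retaining them, like the count-rff above
        (loop [n 0]
          (if (.next rs)
            (recur (inc n))
            n))))))
```

With autoCommit left on, the Postgres driver pulls the whole resultset into memory on execute, which is exactly the blow-up described above.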
* Set default fetch size to 500

Timings for long queries on gcloud pg (times in ms):

limit 10,000:

| fetch size | t1   | t2   | t3   |
|-----------:|------|------|------|
|        100 | 6030 | 8804 | 5986 |
|        500 | 1537 | 1535 | 1494 |
|       1000 | 1714 | 1802 | 1611 |
|       3000 | 1644 | 1595 | 2044 |

limit 30,000:

| fetch size | t1    | t2    | t3    |
|-----------:|-------|-------|-------|
|        100 | 17341 | 15991 | 16061 |
|        500 | 4112  | 4182  | 4851  |
|       1000 | 5075  | 4546  | 4284  |
|       3000 | 5405  | 5055  | 4745  |

* Only set fetch size if not default (0)

Setting `:additional-options "defaultRowFetchSize=3000"` on a database sets a default fetch size, and we can easily honor that. This allows overriding the fetch size per database without much work on our part (see the sketch below, after these notes).

* Remove redshift custom fetch size code

This removes the automatic insertion of `defaultRowFetchSize=5000` on Redshift dbs. Now we always set this to 500 in the sql-jdbc statement and prepared statement code, and we allow custom values to take precedence over our default of 500.

One additional benefit of removing this is that the option used to be appended even if a user added `?defaultRowFetchSize=300` themselves, so this should actually give more control to our users.

Profiling quickly on selecting 79,000 rows from Redshift, there was essentially no difference between a fetch size of 500 (the new default) and 5000 (the old Redshift default); both took around 12442 ms.

* Remove unused require of settings in redshift tests

* Appease the linter

* Remove unnecessary redshift connection details tests
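As a rough illustration of the "only set fetch size if not default (0)" rule, here is a hedged sketch; it is not the exact code added to src/metabase/driver/sql_jdbc/execute.clj, and the names `default-fetch-size` and `set-default-fetch-size!` are made up for the example:

```clojure
;; Hypothetical helper; not the exact code added to
;; src/metabase/driver/sql_jdbc/execute.clj in this commit.
(def ^:private default-fetch-size
  "Fetch size to use when the driver reports the JDBC default of 0."
  500)

(defn set-default-fetch-size!
  "Set the fetch size on `stmt` only if it is still the JDBC default (0),
  so a connection-level defaultRowFetchSize supplied by the user wins."
  [^java.sql.Statement stmt]
  (when (zero? (.getFetchSize stmt))
    (.setFetchSize stmt default-fetch-size)))
```

The idea, per the notes above, is that a connection-level `defaultRowFetchSize` shows up as a non-zero fetch size on statements created from that connection, so checking for 0 lets a user-supplied `:additional-options` value win over our default.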
Showing 4 changed files with 26 additions and 68 deletions
- modules/drivers/redshift/src/metabase/driver/redshift.clj: 1 addition, 2 deletions
- modules/drivers/redshift/test/metabase/driver/redshift_test.clj: 1 addition, 60 deletions
- src/metabase/driver/sql_jdbc/execute.clj: 24 additions, 0 deletions
- src/metabase/public_settings.clj: 0 additions, 6 deletions