Skip to content
Snippets Groups Projects
Unverified Commit 9d112756 authored by Cam Saul's avatar Cam Saul Committed by GitHub
Browse files

Use window functions for cumulative sum and cumulative count (#40752)

* Use window functions for SQL drivers for cumulative count and sum

* H2/Snowflake/Redshift implementation

* Test fixes :wrench:

* Use SELECT expression positions as the default window function implementation

* Some test fixes.

* Fix bugs in add-alias-info

* MongoDB fixes

* Another pass

* BigQuery should use positions?

* Disable tests for the failing drivers, we can fix in follow-on PRs.

* Extra info

* Add GH issue numbers

* Update more GH issue numbers

* Fix SQL Server!!!! wooo

* I THOUGHT WE WERE DISABLING SPARK SQL!!!!!!!!!!1

* Cleanup to get things ready for review.

* Snowflake + Vertica shouldn't use output column numbers for ORDER BY in OVER

* Code cleanup

* Test fix? :wrench:
parent 15bc3349
No related branches found
No related tags found
No related merge requests found
Showing
with 273 additions and 93 deletions
......@@ -13,6 +13,61 @@ title: Driver interface changelog
efficient way possible. This is currently only required for drivers that support the `:uploads` feature, and has
a default implementation for JDBC-based drivers.
- New feature `:window-functions` has been added. Drivers that implement this method are expected to implement the
cumulative sum (`:cum-sum`) and cumulative count (`:cum-count`) aggregation clauses in their native query language.
For non-SQL drivers (drivers not based on our `:sql` or `:sql-jdbc` drivers), this feature flag is set to `false` by
default; the old (broken) post-processing implementations of cumulative aggregations will continue to be used. (See
issues [#13634](https://github.com/metabase/metabase/issues/13634) and
[#15118](https://github.com/metabase/metabase/issues/15118) for more information on why the old implementation is
broken.)
Non-SQL drivers should be updated to implement cumulative aggregations natively if possible.
SQL drivers that support ordering by output column numbers in `OVER` `ORDER BY` expressions, e.g.
```sql
SELECT
sum(sum(my_column) OVER (ORDER BY 1 ROWS UNBOUNDED PRECEDING) AS cumulative_sum
date_trunc(my_timestamp, 'month') AS my_timestamp_month
FROM
my_table
GROUP BY
date_trunc(my_timestamp, 'month')
ORDER BY
date_trunc(my_timestamp, 'month')
```
will work without any changes; this is the new default behavior for drivers based on the `:sql` or `:sql-jdbc`
drivers.
For databases that do not support ordering by output column numbers (e.g. MySQL/MariaDB), you can mark your driver
as not supporting the `:sql/window-functions.order-by-output-column-numbers` feature flag, e.g.
```clj
(defmethod driver/database-supports? [:my-driver :sql/window-functions.order-by-output-column-numbers]
[_driver _feature _database]
false)
```
In this case, the `:sql` driver will instead generate something like
```sql
SELECT
sum(sum(my_column) OVER (ORDER BY date_trunc(my_timestamp, 'month') ROWS UNBOUNDED PRECEDING) AS cumulative_sum
date_trunc(my_timestamp, 'month') AS my_timestamp_month
FROM
my_table
GROUP BY
date_trunc(my_timestamp, 'month')
ORDER BY
date_trunc(my_timestamp, 'month')
```
If neither of these strategies work for you, you might need to do something more complicated -- see
[#40982](https://github.com/metabase/metabase/pull/40982) for an example of complex query transformations to get
fussy BigQuery working. (More on this soon.) If all else fails, you can always specify that your driver does not
support `:window-functions`, and it will fall back to using the old broken implementation.
## Metabase 0.49.1
- Another driver feature has been added: `describe-fields`. If a driver opts-in to supporting this feature, The
......
......@@ -341,7 +341,7 @@ describe("scenarios > question > custom column", () => {
.should("have.length.of.at.least", 8);
});
it.skip("should create custom column after aggregation with 'cum-sum/count' (metabase#13634)", () => {
it("should create custom column after aggregation with 'cum-sum/count' (metabase#13634)", () => {
cy.createQuestion(
{
name: "13634",
......@@ -365,7 +365,7 @@ describe("scenarios > question > custom column", () => {
cy.log("Reported failing in v0.34.3, v0.35.4, v0.36.8.2, v0.37.0.2");
// eslint-disable-next-line no-unscoped-text-selectors -- deprecated usage
cy.findByText("Foo Bar");
cy.findAllByText("57911");
cy.findAllByText("57,911");
});
it("should not be dropped if filter is changed after aggregation (metaabase#14193)", () => {
......
......@@ -32,10 +32,11 @@
;;; | metabase.driver method impls |
;;; +----------------------------------------------------------------------------------------------------------------+
(doseq [[feature supported?] {:foreign-keys true
:datetime-diff true
:nested-fields false #_true ; huh? Not sure why this was `true`. Disabled for now.
:test/jvm-timezone-setting false}]
(doseq [[feature supported?] {:datetime-diff true
:foreign-keys true
:nested-fields false
:sql/window-functions.order-by-output-column-numbers false
:test/jvm-timezone-setting false}]
(defmethod driver/database-supports? [:athena feature] [_driver _feature _db] supported?))
;;; +----------------------------------------------------------------------------------------------------------------+
......
......@@ -462,10 +462,13 @@
:datetime-diff true
:now true
:convert-timezone true
;; BigQuery uses timezone operators and arguments on calls like extract() and timezone_trunc() rather than literally
;; using SET TIMEZONE, but we need to flag it as supporting set-timezone anyway so that reporting timezones are
;; returned and used, and tests expect the converted values.
:set-timezone true}]
;; BigQuery uses timezone operators and arguments on calls like extract() and
;; timezone_trunc() rather than literally using SET TIMEZONE, but we need to flag it as
;; supporting set-timezone anyway so that reporting timezones are returned and used, and
;; tests expect the converted values.
:set-timezone true
;; temporarily disabled -- will be fixed by #40982
:window-functions false}]
(defmethod driver/database-supports? [:bigquery-cloud-sdk feature] [_driver _feature _db] supported?))
;; BigQuery is always in UTC
......
......@@ -106,8 +106,14 @@
(qp.compile/compile
(mt/mbql-query attempts
{:aggregation [[:count]]
:filter [:time-interval $datetime :last :month]}))))
:filter [:time-interval $datetime :last :month]}))))))))))
(deftest ^:parallel no-initial-projection-test-2
(mt/test-driver :mongo
(testing "Don't need to create initial projections anymore (#4216)"
(testing "Don't create an initial projection for datetime-fields that use `:default` bucketing (#14838)"
(mt/with-clock #t "2021-02-15T17:33:00-08:00[US/Pacific]"
(mt/dataset attempted-murders
(testing "should still work even with bucketing bucketing"
(let [tz (qp.timezone/results-timezone-id :mongo mt/db)
query (mt/with-metadata-provider (mt/id)
......@@ -147,8 +153,7 @@
{"$project" {"_id" false
"datetime" "$_id.datetime"
"datetime_2" "$_id.datetime_2"
"count" true}}
{"$sort" {"datetime" 1}}]
"count" true}}]
:collection "attempts"
:mbql? true}
query))
......
......@@ -43,15 +43,16 @@
(driver/register! :presto-jdbc, :parent #{:sql-jdbc
::sql-jdbc.legacy/use-legacy-classes-for-read-and-set})
(doseq [[feature supported?] {:set-timezone true
:basic-aggregations true
:standard-deviation-aggregations true
:expressions true
:native-parameters true
:expression-aggregations true
:binning true
:foreign-keys true
:now true}]
(doseq [[feature supported?] {:basic-aggregations true
:binning true
:expression-aggregations true
:expressions true
:foreign-keys true
:native-parameters true
:now true
:set-timezone true
:sql/window-functions.order-by-output-column-numbers false
:standard-deviation-aggregations true}]
(defmethod driver/database-supports? [:presto-jdbc feature] [_driver _feature _db] supported?))
;;; Presto API helpers
......
......@@ -33,11 +33,12 @@
(driver/register! :redshift, :parent #{:postgres ::sql-jdbc.legacy/use-legacy-classes-for-read-and-set})
(doseq [[feature supported?] {:test/jvm-timezone-setting false
:nested-field-columns false
:describe-fields true
:describe-fks true
:connection-impersonation true}]
(doseq [[feature supported?] {:connection-impersonation true
:describe-fields true
:describe-fks true
:nested-field-columns false
:sql/window-functions.order-by-output-column-numbers false
:test/jvm-timezone-setting false}]
(defmethod driver/database-supports? [:redshift feature] [_driver _feat _db] supported?))
;;; +----------------------------------------------------------------------------------------------------------------+
......
......@@ -49,11 +49,12 @@
(driver/register! :snowflake, :parent #{:sql-jdbc ::sql-jdbc.legacy/use-legacy-classes-for-read-and-set})
(doseq [[feature supported?] {:datetime-diff true
:now true
:convert-timezone true
:connection-impersonation true
:connection-impersonation-requires-role true}]
(doseq [[feature supported?] {:connection-impersonation true
:connection-impersonation-requires-role true
:convert-timezone true
:datetime-diff true
:now true
:sql/window-functions.order-by-output-column-numbers false}]
(defmethod driver/database-supports? [:snowflake feature] [_driver _feature _db] supported?))
(defmethod driver/humanize-connection-error-message :snowflake
......
......@@ -209,8 +209,10 @@
:native-parameters true
:nested-queries true
:standard-deviation-aggregations true
:test/jvm-timezone-setting false}]
(defmethod driver/database-supports? [:sparkql feature] [_driver _feature _db] supported?))
:test/jvm-timezone-setting false
;; disabled for now, see issue #40991 to fix this.
:window-functions false}]
(defmethod driver/database-supports? [:sparksql feature] [_driver _feature _db] supported?))
;; only define an implementation for `:foreign-keys` if none exists already. In test extensions we define an alternate
;; implementation, and we don't want to stomp over that if it was loaded already
......
......@@ -32,13 +32,14 @@
(driver/register! :sqlserver, :parent :sql-jdbc)
(doseq [[feature supported?] {:regex false
:case-sensitivity-string-filter-options false
:now true
:datetime-diff true
:convert-timezone true
:test/jvm-timezone-setting false
:index-info true}]
(doseq [[feature supported?] {:case-sensitivity-string-filter-options false
:convert-timezone true
:datetime-diff true
:index-info true
:now true
:regex false
:sql/window-functions.order-by-output-column-numbers false
:test/jvm-timezone-setting false}]
(defmethod driver/database-supports? [:sqlserver feature] [_driver _feature _db] supported?))
(defmethod driver/database-supports? [:sqlserver :percentile-aggregations]
......@@ -448,7 +449,9 @@
;; For the GROUP BY, we replace the unoptimized fields with the optimized ones, e.g.
;;
;; GROUP BY year(field), month(field)
(apply sql.helpers/group-by new-hsql (mapv (partial sql.qp/->honeysql driver) optimized)))))
(apply sql.helpers/group-by new-hsql (mapv (partial sql.qp/->honeysql driver) optimized))
;; remove duplicate group by clauses (from the optimize breakout clauses stuff)
(update new-hsql :group-by distinct))))
(defn- optimize-order-by-subclauses
"Optimize `:order-by` `subclauses` using [[optimized-temporal-buckets]], if possible."
......@@ -468,7 +471,9 @@
;; year(), month(), and day() can make use of indexes while DateFromParts() cannot.
(let [query (update query :order-by optimize-order-by-subclauses)
parent-method (get-method sql.qp/apply-top-level-clause [:sql-jdbc :order-by])]
(parent-method driver :order-by honeysql-form query)))
(-> (parent-method driver :order-by honeysql-form query)
;; order bys have to be distinct in SQL Server!!!!!!!1
(update :order-by distinct))))
;; SQLServer doesn't support `TRUE`/`FALSE`; it uses `1`/`0`, respectively; convert these booleans to numbers.
(defmethod sql.qp/->honeysql [:sqlserver Boolean]
......
......@@ -27,11 +27,12 @@
::sql-jdbc.legacy/use-legacy-classes-for-read-and-set
::sql.qp.empty-string-is-null/empty-string-is-null})
(doseq [[feature supported?] {:percentile-aggregations false
:datetime-diff true
:now true
:convert-timezone true
:test/jvm-timezone-setting false}]
(doseq [[feature supported?] {:convert-timezone true
:datetime-diff true
:now true
:percentile-aggregations false
:sql/window-functions.order-by-output-column-numbers false
:test/jvm-timezone-setting false}]
(defmethod driver/database-supports? [:vertica feature] [_driver _feature _db] supported?))
(defmethod driver/db-start-of-week :vertica
......
......@@ -80,7 +80,6 @@
(fn [& args]
(apply f* (unique-identifier) args))))
(defn increment-app-db-unique-indentifier!
"Increment the [[unique-identifier]] for the Metabase application DB. This effectively flushes all caches using it as
a key (including things using [[mdb/memoize-for-application-db]]) such as the Settings cache. Should only be used
......
......@@ -589,7 +589,10 @@
:describe-fields
;; Does the driver support fingerprint the fields. Default is true
:fingerprint})
:fingerprint
;; Does this driver support window functions like cumulative count and cumulative sum? (default: false)
:window-functions})
(defmulti database-supports?
"Does this driver and specific instance of a database support a certain `feature`?
......@@ -609,8 +612,11 @@
(database-supports? :mongo :set-timezone mongo-db) ; -> true"
{:arglists '([driver feature database]), :added "0.41.0"}
(fn [driver feature _database]
(when-not (features feature)
(throw (Exception. (tru "Invalid driver feature: {0}" feature))))
;; only make sure unqualified keywords are explicitly defined in [[features]].
(when (simple-keyword? feature)
(when-not (features feature)
(throw (ex-info (tru "Invalid driver feature: {0}" feature)
{:feature feature}))))
[(dispatch-on-initialized-driver driver) feature])
:hierarchy #'hierarchy)
......
......@@ -66,16 +66,16 @@
;;; | metabase.driver impls |
;;; +----------------------------------------------------------------------------------------------------------------+
(doseq [[feature supported?] {:full-join false
:regex true
:percentile-aggregations false
:actions true
(doseq [[feature supported?] {:actions true
:actions/custom true
:datetime-diff true
:full-join false
:index-info true
:now true
:percentile-aggregations false
:regex true
:test/jvm-timezone-setting false
:uploads true
:index-info true}]
:uploads true}]
(defmethod driver/database-supports? [:h2 feature]
[_driver _feature _database]
supported?))
......
......@@ -53,20 +53,21 @@
(defmethod driver/display-name :mysql [_] "MySQL")
(doseq [[feature supported?] {:persist-models true
:convert-timezone true
:datetime-diff true
:now true
:regex false
:percentile-aggregations false
:full-join false
:uploads true
:schemas false
;; MySQL LIKE clauses are case-sensitive or not based on whether the collation of the server and the columns
;; themselves. Since this isn't something we can really change in the query itself don't present the option to the
;; users in the UI
:case-sensitivity-string-filter-options false
:index-info true}]
(doseq [[feature supported?] {;; MySQL LIKE clauses are case-sensitive or not based on whether the collation of the
;; server and the columns themselves. Since this isn't something we can really change in
;; the query itself don't present the option to the users in the UI
:case-sensitivity-string-filter-options false
:convert-timezone true
:datetime-diff true
:full-join false
:index-info true
:now true
:percentile-aggregations false
:persist-models true
:regex false
:schemas false
:sql/window-functions.order-by-output-column-numbers false
:uploads true}]
(defmethod driver/database-supports? [:mysql feature] [_driver _feature _db] supported?))
;; This is a bit of a lie since the JSON type was introduced for MySQL since 5.7.8.
......
......@@ -61,13 +61,14 @@
(defmethod driver/display-name :postgres [_] "PostgreSQL")
;; Features that are supported by Postgres and all of its child drivers like Redshift
(doseq [[feature supported?] {:convert-timezone true
:datetime-diff true
:now true
:persist-models true
:table-privileges true
:schemas true
:connection-impersonation true}]
(doseq [[feature supported?] {:connection-impersonation true
:convert-timezone true
:datetime-diff true
:now true
:persist-models true
:schemas true
:sql/window-functions.order-by-output-column-numbers false
:table-privileges true}]
(defmethod driver/database-supports? [:postgres feature] [_driver _feature _db] supported?))
(defmethod driver/database-supports? [:postgres :nested-field-columns]
......
......@@ -18,16 +18,17 @@
(driver/register! :sql, :abstract? true)
(doseq [feature [:standard-deviation-aggregations
:foreign-keys
:expressions
(doseq [feature [:advanced-math-expressions
:binning
:expression-aggregations
:expressions
:foreign-keys
:native-parameters
:nested-queries
:binning
:advanced-math-expressions
:percentile-aggregations
:regex]]
:regex
:standard-deviation-aggregations
:window-functions]]
(defmethod driver/database-supports? [:sql feature] [_driver _feature _db] true))
(doseq [join-feature [:left-join
......
......@@ -420,7 +420,7 @@
of `honeysql-form`. Most drivers can use the default implementations for all of these methods, but some may need to
override one or more (e.g. SQL Server needs to override this method for the `:limit` clause, since T-SQL uses `TOP`
instead of `LIMIT`)."
{:added "0.32.0", :arglists '([driver top-level-clause honeysql-form query]), :style/indent 2}
{:added "0.32.0", :arglists '([driver top-level-clause honeysql-form query]), :style/indent [:form]}
(fn [driver top-level-clause _ _]
[(driver/dispatch-on-initialized-driver driver) top-level-clause])
:hierarchy #'driver/hierarchy)
......@@ -687,6 +687,101 @@
(->honeysql driver mbql-expr)
(->honeysql driver power)])
;;; NOCOMMIT -- we need to document the new `:window-functions` feature.
(defmethod driver/database-supports? [:sql :sql/window-functions.order-by-output-column-numbers]
[_driver _feature _database]
true)
(defn- format-rows-unbounded-preceding [_clause _args]
["ROWS UNBOUNDED PRECEDING"])
(sql/register-clause!
::rows-unbounded-preceding
#'format-rows-unbounded-preceding
nil)
(defn- over-rows-with-order-by-copying-expressions
"e.g.
OVER (ORDER BY date_trunc('month', my_column))"
[driver expr]
(let [{:keys [group-by]} (apply-top-level-clause
driver
:breakout
{}
*inner-query*)]
[:over
[expr
{:order-by (mapv (fn [expr]
[expr :asc])
group-by)
::rows-unbounded-preceding []}]]))
;;; For databases that do something intelligent with ORDER BY 1, 2: we can take advantage of that functionality
;;; implement the window function versions of cumulative sum and cumulative sum more simply.
(defn- over-rows-with-order-by-output-column-numbers
"e.g.
OVER (ORDER BY 1 ROWS UNBOUNDED PRECEDING)"
[_driver expr]
[:over
[expr
{:order-by (mapv (fn [i]
[[:inline (inc i)] :asc])
(range (count (:breakout *inner-query*))))
::rows-unbounded-preceding []}]])
(defn- cumulative-aggregation-window-function [driver expr]
(let [over-rows-with-order-by (if (driver/database-supports? driver
:sql/window-functions.order-by-output-column-numbers
(lib.metadata/database (qp.store/metadata-provider)))
over-rows-with-order-by-output-column-numbers
over-rows-with-order-by-copying-expressions)]
(over-rows-with-order-by driver expr)))
;;; cum-count()
;;;
;;; should compile to SQL like
;;;
;;; sum(count()) OVER (ORDER BY ...)
;;;
;;; where the ORDER BY matches what's in the query (i.e., the breakouts), or
;;;
;;; sum(count()) OVER (ORDER BY 1 ROWS UNBOUNDED PRECEDING)
;;;
;;; if the database supports ordering by SELECT expression position
(defmethod ->honeysql [:sql :cum-count]
[driver [_cum-count expr-or-nil]]
;; a cumulative count with no breakouts doesn't really mean anything, just compile it as a normal count.
(if (empty? (:breakout *inner-query*))
(->honeysql driver [:count expr-or-nil])
(cumulative-aggregation-window-function
driver
[:sum (if expr-or-nil
[:count (->honeysql driver expr-or-nil)]
[:count :*])])))
;;; cum-sum(total)
;;;
;;; should compile to SQL like
;;;
;;; sum(sum(total)) OVER (ORDER BY ...)
;;;
;;; where the ORDER BY matches what's in the query (i.e., the breakouts), or
;;;
;;; sum(sum(total)) OVER (ORDER BY 1 ROWS UNBOUNDED PRECEDING)
;;;
;;; if the database supports ordering by SELECT expression position
(defmethod ->honeysql [:sql :cum-sum]
[driver [_cum-sum expr]]
;; a cumulative sum with no breakouts doesn't really mean anything, just compile it as a normal sum.
(if (empty? (:breakout *inner-query*))
(->honeysql driver [:sum expr])
(cumulative-aggregation-window-function
driver
[:sum [:sum (->honeysql driver expr)]])))
(defn- interval? [expr]
(mbql.u/is-clause? :interval expr))
......
......@@ -135,6 +135,7 @@
driver/dispatch-on-initialized-driver
:hierarchy #'driver/hierarchy)
;;; TODO -- we should just make this a FEATURE!!!!!1
(defmulti ^Statement statement-supported?
"Indicates whether the given driver supports creating a java.sql.Statement, via the Connection. By default, this is
true for all :sql-jdbc drivers. If the underlying driver does not support Statement creation, override this as
......
......@@ -399,11 +399,13 @@
(mu/defn add-order-by-clause :- mbql.s/MBQLQuery
"Add a new `:order-by` clause to an MBQL `inner-query`. If the new order-by clause references a Field that is
already being used in another order-by clause, this function does nothing."
[inner-query :- mbql.s/MBQLQuery
[_ [_ id-or-name :as _field], :as order-by-clause] :- mbql.s/OrderBy]
(let [existing-fields (set (for [[_ [_ id-or-name]] (:order-by inner-query)]
id-or-name))]
(if (existing-fields id-or-name)
[inner-query :- mbql.s/MBQLQuery
[_dir orderable, :as order-by-clause] :- mbql.s/OrderBy]
(let [existing-orderables (into #{}
(map (fn [[_dir orderable]]
orderable))
(:order-by inner-query))]
(if (existing-orderables orderable)
;; Field already referenced, nothing to do
inner-query
;; otherwise add new clause at the end
......@@ -577,6 +579,7 @@
(assert (not= candidate original)
(str "unique-alias-fn must return a different string than its input. Input: "
(pr-str candidate)))
(swap! id+original->unique assoc [id name-key] candidate)
(recur id candidate))))))))))
(mu/defn uniquify-names :- [:and
......@@ -632,7 +635,6 @@
Most often, `aggregation->name-fn` will be something like `annotate/aggregation-name`, but for purposes of keeping
the `metabase.legacy-mbql` module seperate from the `metabase.query-processor` code we'll let you pass that in yourself."
{:style/indent 1}
[aggregation->name-fn :- fn?
aggregations :- [:sequential mbql.s/Aggregation]]
(lib.util.match/replace aggregations
......@@ -648,7 +650,6 @@
(mu/defn pre-alias-and-uniquify-aggregations :- UniquelyNamedAggregations
"Wrap every aggregation clause in a `:named` clause with a unique name. Combines `pre-alias-aggregations` with
`uniquify-named-aggregations`."
{:style/indent 1}
[aggregation->name-fn :- fn?
aggregations :- [:sequential mbql.s/Aggregation]]
(-> (pre-alias-aggregations aggregation->name-fn aggregations)
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment