shared · 0c190ec62f2bf93b6c4a5967aa5f9bbcf123c5b9 · Engineering Digital Service / Metabase

Datasets preserve metadata (#19158)
dpsutton authored 3 years ago
* Preserve metadata and surface metadata of datasets

Need to handle two cases:
1. querying the dataset itself
2. a question that is a nested question of the dataset

1. Querying the dataset itself
This is a bit of a shift of how metadata works. Previously it was just
thrown away and saved on each query run. This kind of needs to be this
way because when you edit a question, we do not ensure that the metadata
stays in sync! There's some checksum operation that ensures that the
metadata hasn't been tampered with, but it doesn't ensure that it
actually matches the query any longer.

So imagine you add a new column to a query. The metadata is not changed,
but its checksum matches the original query's metadata and the backend
happily saves this. Then on a subsequent run of the query (or if you hit
visualize before saving) the metadata is tossed and updated.

So to handle this carelessness, we have to allow the metadata that can
be edited to persist across just running the dataset query. So when
hitting the card api, stick the original metadata in the middleware (and
update the normalize ns not to mangle field_ref -> field-ref among
others). Once we have this smuggled in, when computing the metadata in
annotate, we need a way to index the columns. The old and bad way was
the following:

```clojure
;; old
(let [field-id->metadata (u/key-by :id source-metadata)] ...)
;; new and better
(let [ref->metadata (u/key-by (comp u/field-ref->key :field_ref) source-metadata)] )
```

This change is important because ids are only for fields that map to
actual database columns. computed columns, case, manipulations, and all
native fields will lack this. But we can make field references.

Then for each field in the newly computed metadata, allow the non-type
information to persist. We do not want to override type information as
this can break a query, but things like description, display name,
semantic type can survive.

This metadata is then saved in the db as always so we can continue with
the bit of careless metadata saving that we do.

2. a question that is a nested question of the dataset
This was a simpler change to grab the source-metadata and ensure that it
is blended into the result metadata in the same way.

Things i haven't looked at yet: column renaming, if we need to allow
conversions to carry through or if those necessarily must be opaque (ie,
once it has been cast forget that it was originally a different type so
we don't try to cast the already cast value), and i'm sure some other
things. But it has been quite a pain to figure all of this stuff
out. Especially the divide between native and mbql since native requires
the first row of values back before it can detect some types.

* Add in base-type specially

Best to use field_refs to combine metadata from datasets. This means
that we add this ref before the base-type is known. So we have to update
this base-type later once they are known from sampling the results

* Allow column information through

I'm not sure how this base-type is set for
annotate-native-cols. Presumably we don't have and we get it from the
results but this is not true. I guess we do some analysis on count
types. I'm not sure why they failed though.

* Correctly infer this stuff

This was annoying. I like :field_ref over :name for indexing, as it has
a guaranteed unique name. But datasets will have unique names due to a
restriction*. The problem was that annotating the native results before
we had type information gave us refs like `[:field "foo" {:base-type
:type/*}]`, but then this ruined the merge strategy at the end and
prevented a proper ref being merged on top. Quite annoying. This stuff
is very whack-a-mole in that you fix one bit and another breaks
somewhere else**.

* cannot have identical names for a subselect:
    select id from (select 1 as id, 2 as id)

** in fact, another test broke on this commit

* Revert "Correctly infer this stuff"

This reverts commit 1ffe44e90076b024efd231f84ea8062a281e69ab.

* Annotate but de-annotate in a way

To combine metadata from the db, really, really want to make sure they
actually match up. Cannot use name as this could collide when there are
two IDs in the same query. Combining metadata on that gets nasty real
quick.

For mbql and native, its best to use field_refs. Field_refs offer the
best of both worlds: if id, we are golden and its by id. If by name,
they have been uniquified already. So this will run into issues if you
reorder a query or add a new column in with the same name but i think
that's the theoretical best we can do.

BUT, we have to do a little cleanup for this stuff. When native adds the
field_ref, it needs to include some type information. But this isn't
known until after the query runs for native since its just an opaque
query until we run it. So annotating will add a `[:field name
{:base_type :type/*}]` and then our merging doesn't clobber that
later. So its best to add the field_refs, match up with any db metadata,
and then remove the field_refs.

* Test that metadata flows through

* Test mbql datasets and questions based on datasets

* Test mbql/native queries and nested queries

* Recognize that native query bubbles into nested

When using a nested query based on a native query, the metadata from the
underlying dataset is used. Previously we would clobber this with the
metadata from the expected cols of the wrapping mbql query. This would
process the display name with `humanization/name->human-readable-name`
whereas for native it goes through `u/qualified-name`.

I originally piped the native's name through the humanization but that
leads to lots of test failures, and perhaps correct failures. For
instance, a csv test asserts the column title is "COUNT(*)" but the
change would emit "Count(*)", a humanization of count(*) isn't
necessarily an improvement nor even correct.

It is possible that we could change this in the future but I'd want it
to be a deliberate change. It should be mechanical, just adjusting
`annotate-native-cols` in annotate.clj to return a humanized display
name and then fixing tests.

* Allow computed display name on top of source metadata name

If we have a join, we want the "pretty" name to land on top of the
underlying table's name. "alias → B Column" vs "B Column".

* Put dataset metadata in info, not middleware

* Move metadata back under dataset key in info

We want to ensure that dataset information is propagated, but card
information should be computed fresh each time. Including the card
information each time leads to errors as it erroneously thinks the
existing card info should shadow the dataset information. This is
actually a tricky case: figuring out when to care about information at
arbitrary points in the query processor.

* Update metadata to :info not :middleware in tests

* Make var private and comment about info metadata
5f8bc305
History
5f8bc305 3 years ago
History
Code owners
Assign users and groups as approvers for specific file changes. Learn more.
Name	Last commit	Last update
..
src/metabase
test/metabase