dpsutton
authored
* Preserve metadata and surface metadata of datasets Need to handle two cases: 1. querying the dataset itself 2. a question that is a nested question of the dataset 1. Querying the dataset itself This is a bit of a shift of how metadata works. Previously it was just thrown away and saved on each query run. This kind of needs to be this way because when you edit a question, we do not ensure that the metadata stays in sync! There's some checksum operation that ensures that the metadata hasn't been tampered with, but it doesn't ensure that it actually matches the query any longer. So imagine you add a new column to a query. The metadata is not changed, but its checksum matches the original query's metadata and the backend happily saves this. Then on a subsequent run of the query (or if you hit visualize before saving) the metadata is tossed and updated. So to handle this carelessness, we have to allow the metadata that can be edited to persist across just running the dataset query. So when hitting the card api, stick the original metadata in the middleware (and update the normalize ns not to mangle field_ref -> field-ref among others). Once we have this smuggled in, when computing the metadata in annotate, we need a way to index the columns. The old and bad way was the following: ```clojure ;; old (let [field-id->metadata (u/key-by :id source-metadata)] ...) ;; new and better (let [ref->metadata (u/key-by (comp u/field-ref->key :field_ref) source-metadata)] ) ``` This change is important because ids are only for fields that map to actual database columns. computed columns, case, manipulations, and all native fields will lack this. But we can make field references. Then for each field in the newly computed metadata, allow the non-type information to persist. We do not want to override type information as this can break a query, but things like description, display name, semantic type can survive. This metadata is then saved in the db as always so we can continue with the bit of careless metadata saving that we do. 2. a question that is a nested question of the dataset This was a simpler change to grab the source-metadata and ensure that it is blended into the result metadata in the same way. Things i haven't looked at yet: column renaming, if we need to allow conversions to carry through or if those necessarily must be opaque (ie, once it has been cast forget that it was originally a different type so we don't try to cast the already cast value), and i'm sure some other things. But it has been quite a pain to figure all of this stuff out. Especially the divide between native and mbql since native requires the first row of values back before it can detect some types. * Add in base-type specially Best to use field_refs to combine metadata from datasets. This means that we add this ref before the base-type is known. So we have to update this base-type later once they are known from sampling the results * Allow column information through I'm not sure how this base-type is set for annotate-native-cols. Presumably we don't have and we get it from the results but this is not true. I guess we do some analysis on count types. I'm not sure why they failed though. * Correctly infer this stuff This was annoying. I like :field_ref over :name for indexing, as it has a guaranteed unique name. But datasets will have unique names due to a restriction*. The problem was that annotating the native results before we had type information gave us refs like `[:field "foo" {:base-type :type/*}]`, but then this ruined the merge strategy at the end and prevented a proper ref being merged on top. Quite annoying. This stuff is very whack-a-mole in that you fix one bit and another breaks somewhere else**. * cannot have identical names for a subselect: select id from (select 1 as id, 2 as id) ** in fact, another test broke on this commit * Revert "Correctly infer this stuff" This reverts commit 1ffe44e90076b024efd231f84ea8062a281e69ab. * Annotate but de-annotate in a way To combine metadata from the db, really, really want to make sure they actually match up. Cannot use name as this could collide when there are two IDs in the same query. Combining metadata on that gets nasty real quick. For mbql and native, its best to use field_refs. Field_refs offer the best of both worlds: if id, we are golden and its by id. If by name, they have been uniquified already. So this will run into issues if you reorder a query or add a new column in with the same name but i think that's the theoretical best we can do. BUT, we have to do a little cleanup for this stuff. When native adds the field_ref, it needs to include some type information. But this isn't known until after the query runs for native since its just an opaque query until we run it. So annotating will add a `[:field name {:base_type :type/*}]` and then our merging doesn't clobber that later. So its best to add the field_refs, match up with any db metadata, and then remove the field_refs. * Test that metadata flows through * Test mbql datasets and questions based on datasets * Test mbql/native queries and nested queries * Recognize that native query bubbles into nested When using a nested query based on a native query, the metadata from the underlying dataset is used. Previously we would clobber this with the metadata from the expected cols of the wrapping mbql query. This would process the display name with `humanization/name->human-readable-name` whereas for native it goes through `u/qualified-name`. I originally piped the native's name through the humanization but that leads to lots of test failures, and perhaps correct failures. For instance, a csv test asserts the column title is "COUNT(*)" but the change would emit "Count(*)", a humanization of count(*) isn't necessarily an improvement nor even correct. It is possible that we could change this in the future but I'd want it to be a deliberate change. It should be mechanical, just adjusting `annotate-native-cols` in annotate.clj to return a humanized display name and then fixing tests. * Allow computed display name on top of source metadata name If we have a join, we want the "pretty" name to land on top of the underlying table's name. "alias → B Column" vs "B Column". * Put dataset metadata in info, not middleware * Move metadata back under dataset key in info We want to ensure that dataset information is propagated, but card information should be computed fresh each time. Including the card information each time leads to errors as it erroneously thinks the existing card info should shadow the dataset information. This is actually a tricky case: figuring out when to care about information at arbitrary points in the query processor. * Update metadata to :info not :middleware in tests * Make var private and comment about info metadata
Code owners
Assign users and groups as approvers for specific file changes. Learn more.
Name | Last commit | Last update |
---|---|---|
.. | ||
src/metabase | ||
test/metabase |