Commit 722a80a9, authored 4 years ago by dpsutton, committed by GitHub 4 years ago
Initial refingerprinting (#13687)

* If scheduled analyze task is short, refingerprint tables

- make the sync steps return their values rather than log and nil
- dynamic var in fingerprint.clj to swap out query clauses for fields
- update the fingerprint runner to not consume whole list when
    refingerprinting
- algorithm for fields to refingerprint: shuffle the tables and do up to
    1000 fields
- we refingerprint after analyzing in the task if two conditions hold:
   1. the analysis lasted under 5 minutes
      Don't want to hog our CPU or connections
   2. no fields were fingerprinted.
      The first analysis will fingerprint everything and analyze
      fields based on that. It seems that subsequently we almost never
      fingerprint unless it's a new field.
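
A rough sketch of the two rules above, in Python rather than the project's
Clojure, purely for illustration. The names `should_refingerprint` and
`pick_fields`, and the table-to-fields mapping, are hypothetical; only the
5 minute threshold and the 1000 field cap come from this message. Whether
the real code stops partway through a table is an assumption.

```python
import random
from datetime import timedelta

MAX_ANALYSIS_DURATION = timedelta(minutes=5)  # threshold from the commit message
MAX_REFINGERPRINT_FIELDS = 1000               # cap from the commit message


def should_refingerprint(analysis_duration, fields_fingerprinted):
    """Return True only when both conditions listed above hold."""
    return analysis_duration < MAX_ANALYSIS_DURATION and fields_fingerprinted == 0


def pick_fields(tables, rng=random, limit=MAX_REFINGERPRINT_FIELDS):
    """Shuffle the tables, then take fields table by table until the cap is hit.

    `tables` is a hypothetical mapping of table name -> list of field names.
    """
    names = list(tables)
    rng.shuffle(names)
    picked = []
    for name in names:
        for field in tables[name]:
            if len(picked) >= limit:
                # Assumption: stop mid-table once the cap is reached.
                return picked
            picked.append((name, field))
    return picked
```

Shuffling per run means every table eventually gets a turn without having to
record which tables were refreshed last time.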

TODO for the future to make it better:
- manual overrides to prevent this refingerprinting (might be
necessary before this goes live. I lean towards actually doing this
and making it opt IN so that people can enable it. We verify that it's
working frequently enough to be helpful but not causing problems. Then
in 38 or 39 we flip it and make it opt OUT)
- better strategies for what to refingerprint. Right now just picks
tables at random. We don't have a place to write down frequency of use
of tables nor if the fingerprints are changing substantially (for some
notion of substantial). Also, only date and number fingerprints are
used by the app at the moment. Could just bias to these fields for the
moment.
- Our analysis doesn't override if there's already a special_type (or
other field things). We don't capture whether special_type and other
aspects of a field are manually computed (and therefore a candidate to
use ongoing fingerprint results, e.g. state fields based on percentage
of state values, etc). If this becomes the case, and our analysis
matures enough to improve insights while knowing it's not clobbering a
human override/input, we could just make the initial fingerprint
smarter.

As it stands this step is after the normal fingerprinting so that we
don't accidentally do too much work and because we can't really use
the information in the analysis/classify steps yet.

Docstrings for linter

* Add refingerprint column to Database

It's nullable for now so that people can opt in, and we can migrate to
opt out in the future with the following strategy: if null, set to
True, then set the default to True. This lets us respect people who
have turned it off while enabling it in a future release, once we are
sure the performance ramifications are not too severe.
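
The intended opt-in to opt-out flip can be sketched as a mapping over the
three possible column values. This is an illustrative Python sketch, not the
actual migration. The function name `migrate_to_opt_out` is hypothetical, and
it assumes that turning the setting off stores an explicit false rather than
null.

```python
from typing import Optional


def migrate_to_opt_out(refingerprint: Optional[bool]) -> bool:
    """Map the nullable opt-in value to the future opt-out value.

    None  -> True  (never chosen, so it picks up the new default)
    True  -> True  (explicitly opted in)
    False -> False (explicitly turned off, so it is respected)
    """
    return True if refingerprint is None else refingerprint
```

Keeping the column nullable is what makes "never chosen" distinguishable from
"turned off" when the default changes later.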

* Add tests for refingerprinting

* Test for refingerprinting being bounded

* Update UI verbiage for refingerprinting
parent ca41ee40