impala/testdata/workloads/functional-query/queries/QueryTest
Fucun Chu 65c6a81ed9 IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions
These functions can be used to get cardinality estimates of data
using the Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theta_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch
is counting distinct values in a stream, and then merging multiple
sketches together for a total distinct count.
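
The count-then-merge idea can be illustrated with a toy Python model.
This is only a sketch under stated assumptions, not the Apache
DataSketches implementation and not Impala's code: the class name
ToyThetaSketch, the helper _hash01, the SHA-256 based hash and the
default k are all hypothetical choices made for illustration. It keeps
the k smallest hash values seen so far, treats the next-smallest hash
as theta, and estimates the distinct count as retained / theta.

```python
import hashlib


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) (illustrative hash)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyThetaSketch:
    """Toy KMV-style sketch; hypothetical, for illustration only."""

    def __init__(self, k=512):
        self.k = k            # maximum number of retained hashes
        self.theta = 1.0      # sampling threshold; 1.0 means exact mode
        self.hashes = set()   # retained hashes, all strictly below theta

    def update(self, value):
        h = _hash01(value)
        if h < self.theta:
            self.hashes.add(h)  # duplicates hash identically, so no double count
            self._trim()

    def _trim(self):
        # Once more than k hashes are retained, lower theta to the
        # (k+1)-th smallest hash and keep only the k hashes below it.
        if len(self.hashes) > self.k:
            ordered = sorted(self.hashes)
            self.theta = ordered[self.k]
            self.hashes = set(ordered[:self.k])

    def estimate(self):
        # Exact while theta == 1.0; otherwise scale the retained count.
        return len(self.hashes) / self.theta

    def union(self, other):
        # Merge two sketches: use the smaller theta, keep hashes below it.
        merged = ToyThetaSketch(min(self.k, other.k))
        merged.theta = min(self.theta, other.theta)
        merged.hashes = {
            h for h in self.hashes | other.hashes if h < merged.theta
        }
        merged._trim()
        return merged


# Two overlapping "streams" sketched separately, then merged.
a, b = ToyThetaSketch(), ToyThetaSketch()
for v in range(3000):
    a.update(v)
for v in range(2000, 5000):
    b.update(v)
print(round(a.union(b).estimate()))  # approximates the 5000 distinct values
```

While fewer than k distinct values have been seen, theta stays at 1.0
and the toy model returns the exact count, which is the regime a
small-dataset test exercises.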

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch25_parquet.lineitem to compare performance
   with ds_hll_*. ds_theta_* is faster than ds_hll_* on the original
   data; the difference is around 1%-10%. ds_hll_estimate() is faster
   than ds_theta_estimate() on an existing sketch. HLL and Theta give
   close estimates except for strings; see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Reviewed-on: http://gerrit.cloudera.org:8080/17008
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-17 17:09:48 +00:00