Based on Google's HyperLogLog++ paper. Uses a bias-correcting
interpolation as a sub-algorithm for HLL estimates within a specific
range.
Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Improve the cardinality estimate for Flajolet and Martin's algorithm
used in distinctpc and distinctpcsa. The estimate for small cardinalities
is improved by applying a correction hinted at in the original paper.
We use the correction constant 1.75 proposed by Scheuermann et al.,
DialM-POMC '07 [Near-Optimal Compression of Probabilistic Counting
Sketches for Networking Applications].
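A minimal sketch of the corrected estimator, assuming the usual PCSA setup
with m bitmaps and R the sum over all bitmaps of the index of the lowest
unset bit: E = (m / phi) * (2^(R/m) - 2^(-1.75 * R/m)). The function and
constant names are hypothetical, and phi = 0.77351 is the constant from
Flajolet and Martin's analysis.

```cpp
#include <cmath>

// Constant from Flajolet and Martin's analysis of probabilistic counting.
constexpr double kPhi = 0.77351;
// Correction exponent proposed by Scheuermann et al.
constexpr double kKappa = 1.75;

// Uncorrected PCSA estimate: (m / phi) * 2^(R/m). It overestimates badly for
// small cardinalities; with R == 0 it still reports m / phi.
double PcsaEstimateUncorrected(int num_bitmaps, int rank_sum) {
  double avg = static_cast<double>(rank_sum) / num_bitmaps;
  return num_bitmaps / kPhi * std::pow(2.0, avg);
}

// Corrected estimate: the extra term 2^(-kappa * R/m) cancels the
// overestimate when R/m is small and vanishes as R/m grows.
double PcsaEstimate(int num_bitmaps, int rank_sum) {
  double avg = static_cast<double>(rank_sum) / num_bitmaps;
  return num_bitmaps / kPhi *
         (std::pow(2.0, avg) - std::pow(2.0, -kKappa * avg));
}
```

With empty bitmaps (R = 0) the corrected form returns 0, while the
uncorrected form returns m / phi.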
Change-Id: I90410328a1a01a72601e7e95ae719fb8caf1587f
Reviewed-on: http://gerrit.cloudera.org:8080/395
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
This is preparation for fixing IMPALA-97. These changes are mostly
non-functional and bring the code closer to the style standards.
The biggest functional changes should be:
1) IMPALA-1623 was caused by misuse of a constructor. That code no
longer compiled after the refactor, so the bug was fixed.
2) TimestampValue.Hash() seems to have been hashing the time twice
instead of the time and date.
3) Timings using TimestampValue.time() would not be accurate when
crossing midnight (time and date are separate fields).
4) Timings using local time should use UTC to avoid daylight saving
time problems.
5) Use system monotonic clock in util/time.h.
Some timings may still be affected by items 3 and 4 above, but fixing
those isn't the purpose of this change.
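Items 3 to 5 come down to measuring intervals with a clock that never jumps.
The sketch below uses std::chrono::steady_clock, which is monotonic and so is
unaffected by midnight rollover, time zone or daylight saving changes. The
helper names MonotonicNanos and ElapsedNanos are hypothetical and are not the
util/time.h API.

```cpp
#include <chrono>
#include <cstdint>

// Nanoseconds from a monotonic clock. The absolute value is meaningless;
// only differences between two readings should be used.
inline int64_t MonotonicNanos() {
  using namespace std::chrono;
  return duration_cast<nanoseconds>(steady_clock::now().time_since_epoch())
      .count();
}

// Runs fn and returns how long it took, in nanoseconds. Because both
// readings come from the same monotonic clock, the result is never negative.
template <typename Fn>
int64_t ElapsedNanos(Fn&& fn) {
  int64_t start = MonotonicNanos();
  fn();
  return MonotonicNanos() - start;
}
```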
Change-Id: I26056c876c4361e898acc3656aa98abf6f153a6b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5779
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
This patch also fixes IMPALA-1164: NDV() now returns a BIGINT (and not STRING).
Change-Id: Ia2a3272204938579d61091ee4f7f2d1cbf38ed55
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4338
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
This is implemented in the BE using HLL (but we could change this in the
future).
These estimates usually work better than the other algorithm we have, even
though we've not implemented all the improvements from the Google paper.
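For reference, a basic HLL estimator without the later improvements can be
sketched as follows. This is an illustrative standalone version, not the BE
code: the precision of 10 bits, the integer mixing function, and the class and
method names are all assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int kPrecision = 10;                 // assumed: 2^10 registers
constexpr int kNumRegisters = 1 << kPrecision;

class HllSketch {
 public:
  HllSketch() : registers_(kNumRegisters, uint8_t{0}) {}

  void Update(uint64_t value) {
    uint64_t h = Mix(value);
    uint32_t idx = static_cast<uint32_t>(h >> (64 - kPrecision));
    uint64_t rest = h << kPrecision;  // remaining bits, left-aligned
    // Rank = 1-based position of the first set bit in the remaining bits.
    uint8_t rank = 1;
    while (rank <= 64 - kPrecision && (rest & (1ULL << 63)) == 0) {
      rest <<= 1;
      ++rank;
    }
    registers_[idx] = std::max(registers_[idx], rank);
  }

  // Merging two sketches is an element-wise max of their registers.
  void Merge(const HllSketch& other) {
    for (int i = 0; i < kNumRegisters; ++i) {
      registers_[i] = std::max(registers_[i], other.registers_[i]);
    }
  }

  double Estimate() const {
    const double m = kNumRegisters;
    double sum = 0.0;
    int zeros = 0;
    for (uint8_t r : registers_) {
      sum += std::ldexp(1.0, -r);  // 2^(-r)
      if (r == 0) ++zeros;
    }
    const double alpha = 0.7213 / (1.0 + 1.079 / m);  // valid for m >= 128
    double raw = alpha * m * m / sum;
    // Small-range correction: fall back to linear counting.
    if (raw <= 2.5 * m && zeros > 0) return m * std::log(m / zeros);
    return raw;
  }

 private:
  // splitmix64 finalizer, used here only as a stand-in hash function.
  static uint64_t Mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  std::vector<uint8_t> registers_;
};
```

With 1024 registers the expected relative standard error is about
1.04 / sqrt(1024), roughly 3%.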
Change-Id: Ied715ddd0e1a7cbe7f5f90469f1ed3d4b9c537c7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/956
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a work
in progress (there's still a lot of debug stuff I added that needs to be cleaned up) but
it does pass the tests.
Aggregation-node is now very simple and only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.
This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and whether it needs a finalize step (only the root does). This is more
general than our builtins require; they are too simple to need this structure.
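To make the update, merge and finalize phases concrete, here is a simplified
sketch of an average aggregate split into those steps. The function names and
signatures are hypothetical and much simpler than the real UDA interface;
which plan node runs which phase is decided by the FE annotation described
above, and the driver function only simulates that split.

```cpp
#include <cstdint>
#include <vector>

// Intermediate state passed between aggregation phases.
struct AvgState {
  double sum = 0.0;
  int64_t count = 0;
};

void AvgInit(AvgState* state) { *state = AvgState(); }

// Update phase: consumes one raw input value.
void AvgUpdate(double input, AvgState* state) {
  state->sum += input;
  ++state->count;
}

// Merge phase: combines two intermediate states.
void AvgMerge(const AvgState& src, AvgState* dst) {
  dst->sum += src.sum;
  dst->count += src.count;
}

// Finalize phase: turns the intermediate state into the output value.
// Returns 0 for empty input; a SQL engine would return NULL instead.
double AvgFinalize(const AvgState& state) {
  return state.count == 0 ? 0.0 : state.sum / state.count;
}

// Simulates a two-level plan: each partition is aggregated with Update, the
// partial states are combined with Merge, and Finalize runs exactly once.
double RunTwoPhaseAvg(const std::vector<std::vector<double>>& partitions) {
  AvgState merged;
  AvgInit(&merged);
  for (const std::vector<double>& partition : partitions) {
    AvgState local;
    AvgInit(&local);
    for (double v : partition) AvgUpdate(v, &local);
    AvgMerge(local, &merged);
  }
  return AvgFinalize(merged);
}
```

Builtins such as count or sum can use the same logic for update and merge
and need no finalize, which is why they never required this structure.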
There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is different since,
until now, the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).
For aggregates that result in strings, we used to store some additional stuff past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc
The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is for the intermediate type not to be StringValue and
to contain the buffer length itself.
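One way such an intermediate type could look is sketched below: the value
carries its own capacity next to the pointer and length, so the buffer can be
reused or grown without storing anything past the end of the tuple. The
StringBuffer type, its fields and the doubling growth policy are assumptions
for illustration only.

```cpp
#include <algorithm>
#include <cstdlib>
#include <cstring>

// Hypothetical intermediate type: a string value that knows how large its
// backing buffer is, independent of the length currently in use.
struct StringBuffer {
  char* ptr = nullptr;
  int len = 0;       // bytes in use
  int capacity = 0;  // bytes allocated
};

// Copies src into buf. The existing allocation is reused when it is large
// enough (the min/max case); otherwise capacity at least doubles (the
// group_concat case).
void Assign(StringBuffer* buf, const char* src, int src_len) {
  if (src_len > buf->capacity) {
    int new_capacity = std::max(src_len, buf->capacity * 2);
    char* new_ptr = static_cast<char*>(std::realloc(buf->ptr, new_capacity));
    if (new_ptr == nullptr) std::abort();  // out of memory; sketch only
    buf->ptr = new_ptr;
    buf->capacity = new_capacity;
  }
  if (src_len > 0) std::memcpy(buf->ptr, src, src_len);
  buf->len = src_len;
}

void Free(StringBuffer* buf) {
  std::free(buf->ptr);
  *buf = StringBuffer();
}
```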
This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.
Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>