Commit Graph

7 Commits

Author SHA1 Message Date
Shant Hovsepian
6d87fe090c Improve Hll estimate for small cardinalities.
Based on Google's HyperLogLog++ paper. Uses a bias correcting
interpolation as a sub algorithm for Hll estimates within a specific
range.

Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2015-07-16 19:38:17 +00:00
Shant Hovsepian
69079411bf Improve distinctpc/sa for small cardinalities.
Improving the cardinality estimate for Flajolet and Martin's algorithm
used in distinctpc and distinctpcsa. The estimate for small cardinalities
is improved by providing a correction hinted to in the original paper.

We use the correction constant 1.75 proposed by Scheuermann et al
DialM-POMC '07 [Near-Optimal Compression of Probabilistic Counting
Sketches for Networking Applications]

Change-Id: I90410328a1a01a72601e7e95ae719fb8caf1587f
Reviewed-on: http://gerrit.cloudera.org:8080/395
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-05-24 06:26:47 +00:00
casey
fd09294b74 TimestampValue refactor and cleanup (part 2) (IMPALA-1623)
This is preparation for fixing IMPALA-97. These changes are mostly
non-functional to bring the code closer to styling standards.

The biggest functional changes should be:

1) IMPALA-1623 was caused by a misuse of a constructor and that code
   didn't compile after the refactor so the bug was fixed.
2) TimestampValue.Hash() seems to have been hashing the time twice
   instead of the time and date.
3) Timings using TimestampValue.time() would not be accurate when
   crossing midnight (time and date are separate fields).
4) Timings using local time should use UTC to avoid daylight savings
   problems.
5) Use system monotonic clock in util/time.h.

Some timings may still be affected by #3 & 4 above but fixing those
isn't the purpose of this change.

Change-Id: I26056c876c4361e898acc3656aa98abf6f153a6b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5779
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
2015-02-03 01:49:55 -08:00
Nong Li
a1b2de9c95 Update distinctpc/pcsa to return bigint.
Change-Id: Iac3414aa0151f52ba9ec028da152b09fc09af264
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4637
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-10-06 15:12:12 -07:00
Alex Behm
0fb380961c IMPALA-1187: Add appx_count_distinct query option to rewrite COUNT(DISTINCT) to NDV().
This patch also fixes IMPALA-1164: NDV() now returns a BIGINT (and not STRING).

Change-Id: Ia2a3272204938579d61091ee4f7f2d1cbf38ed55
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4338
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-09-20 16:11:34 -07:00
Nong Li
7f08146b88 Add ndv (distinct estimate) as a builtin aggregate function.
This is implemented in the BE using HLL (but we could change this in the
future).

These estimates usually work better than the other algorithm we have and
we've not implemented all the improvements from the google paper.

Change-Id: Ied715ddd0e1a7cbe7f5f90469f1ed3d4b9c537c7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/956
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:03 -08:00
Nong Li
15db34e356 AggregationNode refactoring
This patch redoes how the aggregation node is implemented. The functionality is
now split between aggregation-node, agg-expr and aggregate-functions. This is a working
progress (there's still a lot of debug stuff I added that needs to be cleaned up) but
it does pass the tests.

Aggregation-node is now very simple and now only deals with the grouping part.
Aggregate-expr serves as the glue between the agg node and the aggregate functions.
The aggregation functions are implemented with the UDA interface. I've reimplemented
our existing aggregate functions with this setup. For true UDAs, the binaries would be
loaded in aggregate-expr.

This also includes some preliminary changes in the FE. We now need to annotate each
AggNode as executing the update vs. merge phase (root aggs execute update, others
execute merge) and if it needs a finalize step (only the root does). This is more
general than our builtins which are too simple to need this structure.

There is a big TODO here to allow the intermediate types between agg nodes to change.
For example, in distinct estimate, the input type is the column type and the output type
is a bigint. We'd like the intermediate type to be CHAR(256). This is different since
currently, the intermediate type and output type have always been the same. We've hacked
around this by having both the intermediate and output type be TYPE_STRING. I've left
this for another patch (changing the BE to support this is trivial).
For aggregates that result in strings, we used to store some additional stuff past the
end of the tuple. The layout was:
<tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc

The rationale for this is that we want to reuse the buffer for min/max and grow the buffer
more quickly for group_concat. This breaks down the abstraction between agg-expr and
agg-node and is not something UDAs can use in general. Rather than try to hack around
this, I think the proper solution is to the intermediate type not be StringValue and
to contain the buffer length itself.

This patch also resurrects the distinct estimate code. The distinct estimate functions
exercise all of the code paths.

Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346
Reviewed-on: http://gerrit.ent.cloudera.com:8080/564
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:13 -08:00