impala

mirror of https://github.com/apache/impala.git synced 2026-01-05 12:01:11 -05:00

Author	SHA1	Message	Date
Shant Hovsepian	6d87fe090c	Improve Hll estimate for small cardinalities. Based on Google's HyperLogLog++ paper. Uses a bias correcting interpolation as a sub algorithm for Hll estimates within a specific range. Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9 Reviewed-on: http://gerrit.cloudera.org:8080/415 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2015-07-16 19:38:17 +00:00
Shant Hovsepian	69079411bf	Improve distinctpc/sa for small cardinalities. Improving the cardinality estimate for Flajolet and Martin's algorithm used in distinctpc and distinctpcsa. The estimate for small cardinalities is improved by providing a correction hinted at in the original paper. We use the correction constant 1.75 proposed by Scheuermann et al., DialM-POMC '07 [Near-Optimal Compression of Probabilistic Counting Sketches for Networking Applications]. Change-Id: I90410328a1a01a72601e7e95ae719fb8caf1587f Reviewed-on: http://gerrit.cloudera.org:8080/395 Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com> Tested-by: Internal Jenkins	2015-05-24 06:26:47 +00:00
casey	fd09294b74	TimestampValue refactor and cleanup (part 2) (IMPALA-1623) This is preparation for fixing IMPALA-97. These changes are mostly non-functional to bring the code closer to styling standards. The biggest functional changes should be: 1) IMPALA-1623 was caused by a misuse of a constructor and that code didn't compile after the refactor so the bug was fixed. 2) TimestampValue.Hash() seems to have been hashing the time twice instead of the time and date. 3) Timings using TimestampValue.time() would not be accurate when crossing midnight (time and date are separate fields). 4) Timings using local time should use UTC to avoid daylight savings problems. 5) Use system monotonic clock in util/time.h. Some timings may still be affected by #3 & 4 above but fixing those isn't the purpose of this change. Change-Id: I26056c876c4361e898acc3656aa98abf6f153a6b Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5779 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: jenkins	2015-02-03 01:49:55 -08:00
Nong Li	a1b2de9c95	Update distinctpc/pcsa to return bigint. Change-Id: Iac3414aa0151f52ba9ec028da152b09fc09af264 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4637 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-10-06 15:12:12 -07:00
Alex Behm	0fb380961c	IMPALA-1187: Add appx_count_distinct query option to rewrite COUNT(DISTINCT) to NDV(). This patch also fixes IMPALA-1164: NDV() now returns a BIGINT (and not STRING). Change-Id: Ia2a3272204938579d61091ee4f7f2d1cbf38ed55 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4338 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-09-20 16:11:34 -07:00
Nong Li	7f08146b88	Add ndv (distinct estimate) as a builtin aggregate function. This is implemented in the BE using HLL (but we could change this in the future). These estimates usually work better than the other algorithm we have and we've not implemented all the improvements from the google paper. Change-Id: Ied715ddd0e1a7cbe7f5f90469f1ed3d4b9c537c7 Reviewed-on: http://gerrit.ent.cloudera.com:8080/956 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:03 -08:00
Nong Li	15db34e356	AggregationNode refactoring. This patch redoes how the aggregation node is implemented. The functionality is now split between aggregation-node, agg-expr and aggregate-functions. This is a work in progress (there's still a lot of debug stuff I added that needs to be cleaned up) but it does pass the tests. Aggregation-node is now very simple and now only deals with the grouping part. Aggregate-expr serves as the glue between the agg node and the aggregate functions. The aggregation functions are implemented with the UDA interface. I've reimplemented our existing aggregate functions with this setup. For true UDAs, the binaries would be loaded in aggregate-expr. This also includes some preliminary changes in the FE. We now need to annotate each AggNode as executing the update vs. merge phase (root aggs execute update, others execute merge) and if it needs a finalize step (only the root does). This is more general than our builtins which are too simple to need this structure. There is a big TODO here to allow the intermediate types between agg nodes to change. For example, in distinct estimate, the input type is the column type and the output type is a bigint. We'd like the intermediate type to be CHAR(256). This is different since currently, the intermediate type and output type have always been the same. We've hacked around this by having both the intermediate and output type be TYPE_STRING. I've left this for another patch (changing the BE to support this is trivial). For aggregates that result in strings, we used to store some additional stuff past the end of the tuple. The layout was: <tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc. The rationale for this is that we want to reuse the buffer for min/max and grow the buffer more quickly for group_concat. This breaks down the abstraction between agg-expr and agg-node and is not something UDAs can use in general. Rather than try to hack around this, I think the proper solution is for the intermediate type not to be StringValue and to contain the buffer length itself. This patch also resurrects the distinct estimate code. The distinct estimate functions exercise all of the code paths. Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346 Reviewed-on: http://gerrit.ent.cloudera.com:8080/564 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:13 -08:00

7 Commits
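
Commit 69079411bf describes a small-cardinality correction for the Flajolet-Martin (PCSA) estimate using the constant 1.75. The following is a minimal Python sketch of that idea, not Impala's C++ code: the class name `PcsaSketch`, the bitmap count, the bitmap width, and the BLAKE2 hash are all assumptions chosen for illustration, and the corrected formula `m / phi * (2^avg - 2^(-1.75 * avg))` is one reading of the Scheuermann et al. correction named in the commit message.

```python
import hashlib

NUM_BITMAPS = 64   # assumption: number of bitmaps (m); not taken from Impala
BITMAP_BITS = 32   # assumption: width of each bitmap in bits
PHI = 0.77351      # Flajolet-Martin bias constant
KAPPA = 1.75       # correction constant quoted in the commit message


def _hash64(value):
    """Hypothetical 64-bit hash; any well-mixed hash would do."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def _trailing_zeros(x, limit):
    """Index of the lowest set bit of x, capped at limit."""
    if x == 0:
        return limit
    return min((x & -x).bit_length() - 1, limit)


class PcsaSketch:
    """Probabilistic counting with stochastic averaging (illustrative only)."""

    def __init__(self):
        self.bitmaps = [0] * NUM_BITMAPS

    def update(self, value):
        h = _hash64(value)
        index = h % NUM_BITMAPS          # choose one bitmap
        rest = h // NUM_BITMAPS          # remaining hash bits
        bit = _trailing_zeros(rest, BITMAP_BITS - 1)
        self.bitmaps[index] |= 1 << bit

    def merge(self, other):
        # Sketches of two inputs combine by OR-ing the bitmaps pairwise.
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            # Position of the lowest zero bit in this bitmap.
            r = 0
            while r < BITMAP_BITS and (bitmap >> r) & 1:
                r += 1
            total += r
        avg = total / NUM_BITMAPS
        # The uncorrected estimate is m / PHI * 2^avg, which is nonzero even
        # for an empty sketch. The subtracted term cancels that bias.
        return NUM_BITMAPS / PHI * (2 ** avg - 2 ** (-KAPPA * avg))
```

With this correction an empty sketch estimates exactly zero, whereas the uncorrected formula would report roughly `m / PHI`. The `merge` method mirrors the update-then-merge structure of the aggregate functions described in commit 15db34e356.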