impala

mirror of https://github.com/apache/impala.git synced 2026-01-01 09:00:42 -05:00

Author	SHA1	Message	Date
Matthew Jacobs	b83aa4984b	Add compute histograms aggregate function Adds an aggregate function to compute equi-depth histograms. The UDA creates a sample of the column values using weighted reservoir sampling and computes the histogram from the sorted sample. TODO: * Extract highly frequent values into separate buckets (i.e. 'compressed histogram'). * Expose separate finalize fn to produce samples and histogram data for stats Change-Id: I314ce5fb8c73b935c4d61ea5bbd6816c59b3b41e Reviewed-on: http://gerrit.ent.cloudera.com:8080/3552 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit c5c475712f88244e15160befaf4e99d6e165a148) Reviewed-on: http://gerrit.ent.cloudera.com:8080/3608	2014-07-25 00:21:10 -07:00
Dimitris Tsirogiannis	5a6f53db16	Add partition pruning tests The following changes are included in this commit: 1. Modified the alltypesagg table to include an additional partition key that has nulls. 2. Added a number of tests in hdfs.test that exercise the partition pruning logic (see IMPALA-887). 3. Modified all the tests that are affected by the change in alltypesagg. Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236	2014-06-24 02:14:27 -07:00
Nong Li	8f4dc0f2f0	IMPALA-974: Switch from FloatLiteral to DecimalLiteral. Float/Doubles are lossy so using those as the default literal type is problematic. Change-Id: I5a619dd931d576e2e6cd7774139e9bafb9452db9 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2758 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-05-31 22:19:06 -07:00
Victor Bittorf	6f31dc7f8a	Adding STDDEV builtin. Change-Id: I79e5aee1e9e879aa2d09078ab45bc149675e1d4a Reviewed-on: http://gerrit.ent.cloudera.com:8080/2341 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins (cherry picked from commit a42c375d933c0b7ffe7c9b6702777679492d7ad6) Reviewed-on: http://gerrit.ent.cloudera.com:8080/2464	2014-05-06 13:06:26 -07:00
Nong Li	457055f8f4	IMPALA-892: Fix subexpr for IR generated from compound predicate. Change-Id: I638533827e97f3486eb75a571b18f9e8d1cd4aed Reviewed-on: http://gerrit.ent.cloudera.com:8080/1973 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-03-18 16:49:34 -07:00
Nong Li	5022aa08fb	IMPALA-869: Fix result initialization for MIN(). Change-Id: I50eceb04c0eb1c9eedb9c963cb75d2fc5aeb4825 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1847 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-03-11 17:26:31 -07:00
Nong Li	f6de8d9e30	IMPALA-765: Fix subexpr elimination codegen optimization. The previous implementation did not properly handle replacing the is_null return argument from expr calls. Change-Id: I96cd0dfca8876b4f914b0cbc4eb459ea3dcdf230 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1795 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins	2014-03-10 15:20:53 -07:00
Alex Behm	a615ebc549	IMPALA-822,IMP-1271: Binding predicates on an aggregation now properly trigger slot materialization. The bug was that the number of materialized agg-tuple slots did not correspond to the number of materialized agg functions, due to binding predicates against an AggNode causing slot materialization after SelectStmt.materializeRequiredSlots(). This patch fixes the issue by taking binding predicates (bound to a slot in an agg tuple) into consideration in SelectStmt.materializeRequiredSlots(). A new sanity check that I added in AggregationNode.toThrift() surfaced another issue with slot materialization, which is also fixed in this patch. The ordering exprs must be marked before the agg exprs in SelectStmt.materializeRequiredSlots() because the ordering exprs may contain agg exprs that are only referenced inside the ORDER BY clause. Change-Id: I1bdc0466f583907bed625ce6608938e59faee83f Reviewed-on: http://gerrit.ent.cloudera.com:8080/1639 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1818 Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2014-03-08 00:25:26 -08:00
Alex Behm	cb8150e8ee	IMPALA-817: Check equality of function name in Function.equals(). Change-Id: Ib9b4ee3a21f90fdb0d7ebccd89462dc67040bd1e Reviewed-on: http://gerrit.ent.cloudera.com:8080/1594 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1611 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Marcel Kornacker <marcel@cloudera.com>	2014-02-19 17:13:51 -08:00
Nong Li	0d2919fe7f	Refactor scalar and aggregate function analysis and execution. This patch cleans up analysis and execution of scalar and aggregate functions so that there is no difference between how builtins and user functions are handled. The only difference is that the catalog is populated with the builtins all the time. The BE always gets a TFunction object and just executes it (builtins will have an empty hdfs file location). This removes the opcode registry and all of the functionality is subsumed by the catalog, most of which was already duplicated there anyway. This also introduces the concept of a system database; databases that the user cannot modify and is populated automatically on startup. Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577	2014-02-18 18:40:08 -08:00
Nong Li	15db34e356	AggregationNode refactoring This patch redoes how the aggregation node is implemented. The functionality is now split between aggregation-node, agg-expr and aggregate-functions. This is a work in progress (there's still a lot of debug stuff I added that needs to be cleaned up) but it does pass the tests. Aggregation-node is now very simple and now only deals with the grouping part. Aggregate-expr serves as the glue between the agg node and the aggregate functions. The aggregation functions are implemented with the UDA interface. I've reimplemented our existing aggregate functions with this setup. For true UDAs, the binaries would be loaded in aggregate-expr. This also includes some preliminary changes in the FE. We now need to annotate each AggNode as executing the update vs. merge phase (root aggs execute update, others execute merge) and if it needs a finalize step (only the root does). This is more general than our builtins which are too simple to need this structure. There is a big TODO here to allow the intermediate types between agg nodes to change. For example, in distinct estimate, the input type is the column type and the output type is a bigint. We'd like the intermediate type to be CHAR(256). This is different since currently, the intermediate type and output type have always been the same. We've hacked around this by having both the intermediate and output type be TYPE_STRING. I've left this for another patch (changing the BE to support this is trivial). For aggregates that result in strings, we used to store some additional stuff past the end of the tuple. The layout was: <tuple> <length of 1st string buffer>,<length of 2nd string buffer>, etc The rationale for this is that we want to reuse the buffer for min/max and grow the buffer more quickly for group_concat. This breaks down the abstraction between agg-expr and agg-node and is not something UDAs can use in general. Rather than try to hack around this, I think the proper solution is for the intermediate type not to be StringValue and to contain the buffer length itself. This patch also resurrects the distinct estimate code. The distinct estimate functions exercise all of the code paths. Change-Id: Ic152a2cd03bc1713967673681e1e6204dcd80346 Reviewed-on: http://gerrit.ent.cloudera.com:8080/564 Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Nong Li <nong@cloudera.com>	2014-01-08 10:53:13 -08:00
Aaron Davidson	00275ce3a9	(IMPALA-422) Add string concatenation function Implements a group_concat() function which concatenates all the values in a group together. The format is group_concat(str_col, [separator]). The default separator is ', '. NULLs are ignored. Change-Id: If152df6f528401117dba81d66ef691bfb548cc7d Reviewed-on: http://gerrit.ent.cloudera.com:8080/117 Reviewed-by: Aaron Davidson <aaron.davidson@cloudera.com> Tested-by: Aaron Davidson <aaron.davidson@cloudera.com>	2014-01-08 10:52:21 -08:00
Alex Behm	c9040aee22	IMPALA-111: COUNT(DISTINCT col) returns wrong results -- does not ignore NULLs.	2014-01-08 10:50:09 -08:00
Alex Behm	1b2e8280d4	Fix NULL issues.	2014-01-08 10:49:32 -08:00
Henry Robinson	8d87972695	Improve parser coverage This patch adds support for the following SQL constructs - Unary + operator - The ALL keyword, in SELECT ALL and SELECT aggregate_func(ALL ) - REAL and INTEGER as type synonyms for DOUBLE and INT respectively - The AS keyword after a table spec. e.g. SELECT FROM tbl AS t0	2014-01-08 10:48:54 -08:00
Alexander Behm	39e443407b	IMPALA-136: GROUP BY float/double.	2014-01-08 10:48:43 -08:00
ishaan	09d6d931f4	Change the way data is loaded	2014-01-08 10:48:09 -08:00
Lenni Kuff	30dbf59ef2	Final changes to enable Python test infrastructure and tests With this change the Python tests will now be called as part of buildall and the corresponding Java tests have been disabled. The new tests can also be invoked calling ./tests/run-tests.sh directly. This includes a fix from Nong that caused wrong results for limit on non-io manager formats.	2014-01-08 10:46:57 -08:00
Lenni Kuff	ef48f65e76	Add test framework for running Impala query tests via Python This is the first set of changes required to start getting our functional test infrastructure moved from JUnit to Python. After investigating a number of options, I decided to go with a python test executor named py.test (http://pytest.org/). It is very flexible, open source (MIT licensed), and will enable us to do some cool things like parallel test execution. As part of this change, we now use our "test vectors" for query test execution. This will be very nice because it means if you load the "core" dataset you know you will be able to run the "core" query tests (specified by --exploration_strategy when running the tests). You will see that now each combination of table format + query exec options is treated like an individual test case. This will make it much easier to debug exactly where something failed. These new tests can be run using the script at tests/run-tests.sh	2014-01-08 10:46:50 -08:00
Nong Li	b22b565a92	Fix codegen for min/max of bool col.	2014-01-08 10:46:43 -08:00
Marcel Kornacker	ea050a43ad	Switching over backend runtime structures to new planner. Added container-util.h	2014-01-08 10:46:20 -08:00
Henry Robinson	3519701529	Support backtick quoting for identifiers	2014-01-08 10:46:00 -08:00
Alan Choi	88101bc90e	This patch implements the probabilistic counting algorithm as an aggregate "distinctpc" and "distinctpcsa". We've gathered statistics on an internal dataset (all columns) which is part of our regression data. It's roughly 400mb, ~100 columns, int/bigint/string type. On Hive, it took roughly 64sec. On this Impala implementation, it took 35sec. By adding inline to hash-util.h (which we don't), we can achieve 24~26sec. Change-Id: Ibcba3c9512b49e8b9eb0c2fec59dfd27f14f84c3	2014-01-08 10:44:27 -08:00
Lenni Kuff	04edc8f534	Update benchmark tests to run against generic workload, data loading with scale factor, +more This change updates the run-benchmark script to enable it to target one or more workloads. Now benchmarks can be run like: ./run-benchmark --workloads=hive-benchmark,tpch We look up the workload in the workloads directory, then read the associated query .test files and start executing them. To ensure the queries are not duplicated between benchmark and query tests, I moved all existing queries (under fe/src/test/resources/*) to the workloads directory. You do NOT need to look through all the .test files, I've just moved them. The one new file is the 'hive-benchmark.test' which contains the hive benchmark queries. Also added support for generating schema for different scale factors as well as executing against these scale factors. For example, let's say we have a dataset with a scale factor called "SF3". We would first generate the schema using: ./generate_schema_statements --workload=<workload> --scale_factor="SF3" This will create tables with names that are unique from the other scale factors. Run the generated .sql file to load the data. Alternatively, the data can be loaded by running a new python script: ./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor] For example: load-data.sh -w tpch -e core -s SF3 Then run against this: ./run-benchmark --workloads=<workload> --scale_factor=SF3 This changeset also includes a few other minor tweaks to some of the test scripts. Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6	2014-01-08 10:44:22 -08:00

24 Commits
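
The update/merge/finalize phases described in commit 15db34e356, combined with the group_concat() semantics from commit 00275ce3a9 (default separator ', ', NULLs ignored), can be illustrated with a minimal Python sketch. This is not Impala's C++ UDA interface: the names GroupConcatState, init, update, merge, finalize and the group_concat driver are hypothetical, chosen only for illustration. Returning None for a group with no non-NULL values is also an assumption; the commit message does not state that case.

```python
from typing import Iterable, List, Optional

DEFAULT_SEPARATOR = ", "  # default separator stated in the group_concat commit


class GroupConcatState:
    """Hypothetical intermediate state carried between aggregation phases."""

    def __init__(self) -> None:
        self.parts: List[str] = []


def init() -> GroupConcatState:
    # Create an empty intermediate state for one group.
    return GroupConcatState()


def update(state: GroupConcatState, value: Optional[str]) -> None:
    # Update phase: consume one input row. NULLs (None here) are ignored.
    if value is not None:
        state.parts.append(value)


def merge(dst: GroupConcatState, src: GroupConcatState) -> None:
    # Merge phase: fold one partial state into another.
    dst.parts.extend(src.parts)


def finalize(state: GroupConcatState,
             separator: str = DEFAULT_SEPARATOR) -> Optional[str]:
    # Finalize phase: turn the intermediate state into the output value.
    # Assumption: a group with no non-NULL input yields None.
    if not state.parts:
        return None
    return separator.join(state.parts)


def group_concat(partitions: Iterable[Iterable[Optional[str]]],
                 separator: str = DEFAULT_SEPARATOR) -> Optional[str]:
    # Hypothetical driver: each partition is aggregated on its own with
    # update(), the partial states are combined with merge(), and
    # finalize() runs exactly once at the end.
    partials = []
    for rows in partitions:
        state = init()
        for value in rows:
            update(state, value)
        partials.append(state)

    total = init()
    for partial in partials:
        merge(total, partial)
    return finalize(total, separator)
```

In this sketch the intermediate state (a list of strings) differs from the output type (a single string or None), which mirrors the commit's point that intermediate and output types may need to differ between aggregation nodes.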