Instead of materializing empty rows when computing count(*), we use
the value stored in the Parquet RowGroup.num_rows field. The Parquet
scanner tuple is modified to have one slot into which we write the
num_rows statistic. The aggregate function is changed from count to a
special sum function that is initialized to 0. We also add a rewrite
rule so that count(<literal>) is rewritten to count(*), to make sure
that this optimization is applied in all cases.
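A minimal sketch of the idea (types and names are illustrative, not
the actual scanner code): each row group contributes the num_rows
value recorded in the Parquet footer, and the special sum aggregate
adds those up, so no rows are materialized.
  import java.util.List;
  class CountStarSketch {
    // Sum the footer-recorded row counts; one value per row group.
    static long countStar(List<Long> rowGroupNumRows) {
      long total = 0;
      for (long numRows : rowGroupNumRows) total += numRows;
      return total;
    }
  }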
Testing:
- Added functional and planner tests
Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Reviewed-on: http://gerrit.cloudera.org:8080/6812
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs
The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join unless there is strong evidence to the contrary,
as follows.
For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.
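A minimal sketch of this heuristic, with illustrative names and an
assumed cutoff for "significantly smaller" (the patch's actual
constant may differ):
  class FkPkSketch {
    // Joint NDV of the rhs join columns vs. the rhs row count: if the
    // joint NDV is much smaller, the rhs is likely not a PK.
    static boolean assumeFkPk(long[] rhsJoinColumnNdvs, long rhsRowCount) {
      double jointNdv = 1.0;
      for (long ndv : rhsJoinColumnNdvs) jointNdv *= ndv;
      double cutoff = 0.9;  // assumed threshold for illustration
      return jointNdv >= cutoff * rhsRowCount;
    }
  }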
Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.
Performance testing:
1. Full TPC-DS run on 10TB:
- Q10 improved by >100x
- Q72 improved by >25x
- Q17,Q26,Q29 improved by 2x
- Q64 regressed by 10x
- Total runtime: Improved by 2x
- Geomean: Minor improvement
The regression of Q64 is understood and we will try to address it
in follow-on changes. The previous plan was better by accident and
not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
- No perf differences
Testing:
- The existing planner tests cover the changes.
- Core/hdfs run passed.
Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Reviewed-on: http://gerrit.cloudera.org:8080/7257
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This is similar to the single-node execution optimisation, but applies
to slightly larger queries that should run in a distributed manner but
won't benefit from codegen.
This adds a new query option disable_codegen_rows_threshold that
defaults to 50,000. If fewer than this number of rows are processed
by a plan node per impalad, the cost of codegen almost certainly
outweighs the benefit.
Using rows processed as a threshold is justified by a simple
model that assumes the cost of codegen and execution per row for
the same operation are proportional. E.g. if x is the complexity
of the operation, n is the number of rows processed, C is a
constant factor giving the cost of codegen, and Ec/Ei are constant
factors giving the per-row cost of codegen'd and interpreted
execution, then the cost of the codegen'd operator is C * x + Ec * x * n
and the cost of the interpreted operator is Ei * x * n. Rearranging
shows that interpretation is cheaper if n < C / (Ei - Ec), i.e. that
(at least with the simplified model) it makes sense to choose
interpretation or codegen based on a constant row-count threshold.
The model also implies that it is somewhat safer to choose codegen,
because the additional cost of codegen is O(1) while the additional
cost of interpretation is O(n).
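Under this model the decision reduces to comparing the per-node row
count against a constant; a minimal sketch with parameter names
following the prose above:
  class CodegenCostSketch {
    // Codegen'd cost: C*x + Ec*x*n. Interpreted cost: Ei*x*n.
    // The complexity x cancels out of the comparison.
    static boolean interpretationCheaper(long n, double c, double ec,
        double ei) {
      return n < c / (ei - ec);
    }
  }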
I ran some experiments with TPC-H Q1, varying the input table size, to
determine the cutover point at which codegen became beneficial.
The cutover was around 150k rows per node for both text and parquet.
At 50k rows per node disabling codegen was very beneficial - around
0.12s versus 0.24s. To be somewhat conservative I set the default
threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the
cutover tends to be higher because there are plan nodes that process
many fewer than the max rows.
Fix a couple of minor issues in the frontend - the numNodes_
calculation could return 0 for Kudu, and the single-node optimization
didn't correctly handle a scan node with conjuncts, a limit, and
missing stats (it considered the estimate still valid).
Testing:
Updated e2e tests that set disable_codegen to also set
disable_codegen_rows_threshold to 0, so that those tests still run
both with and without codegen.
Added an e2e test to make sure that the optimisation is applied in
the backend.
Added planner tests for various cases where codegen should and shouldn't
be disabled.
Perf:
Added a targeted perf test for a join+agg over a small input, which
benefits from this change.
Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e
Reviewed-on: http://gerrit.cloudera.org:8080/7153
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This change introduces a new rule to merge disjunctive equality
predicates into an IN predicate. Because rules are applied bottom-up,
the rule first merges the leaf OR predicates into an IN predicate and
subsequently merges enclosing OR predicates into the existing IN
predicate. It will also merge two compatible IN predicates into a
single IN predicate. For example, c = 1 OR c = 2 OR c IN (3, 4) is
rewritten to c IN (1, 2, 3, 4).
The patch also addresses review comments by normalizing binary
predicates, with test cases for the same: binary predicates of the
form <constant> <op> <non-constant> are normalized to
<non-constant> <op> <constant>.
Change-Id: If02396b752c5497de9a92828c24c8062027dc2e2
Reviewed-on: http://gerrit.cloudera.org:8080/7110
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes the frontend tests to make them run on Java 8, by
replacing HashMap with LinkedHashMap and HashSet with LinkedHashSet
where needed.
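The motivation, spelled out with a minimal sketch (illustrative, not
from the patch): HashMap/HashSet iteration order is unspecified and
changed between Java 7 and Java 8, so tests that compare ordered
output break; the Linked variants iterate in insertion order on any
JVM.
  import java.util.*;
  class OrderDemo {
    public static void main(String[] args) {
      Map<String, Integer> m = new LinkedHashMap<>();
      m.put("scan", 1);
      m.put("join", 2);
      System.out.println(m.keySet());  // always [scan, join]
    }
  }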
To test this I ran the frontend tests using both Oracle Java 7 and
Oracle Java 8 and made sure they passed. I also verified that the tests
pass with OpenJDK 7.
Change-Id: Iad8e1dccec3a51293a109c420bd2b88b9d1e0625
Reviewed-on: http://gerrit.cloudera.org:8080/7073
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Scale down the buffer size in hash joins and hash aggregations if
estimates indicate that the build side of the join is small.
This greatly reduces minimum memory requirements for joins in some
common cases, e.g. small dimension tables.
Currently this is not plumbed through to the backend and only takes
effect in planner tests.
Testing:
Added targeted planner tests for small/mid/large/unknown memory
requirements for aggregations and joins.
Change-Id: I57b5b4c528325d478c8a9b834a6bc5dedab54b5b
Reviewed-on: http://gerrit.cloudera.org:8080/6963
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
For queries where plan generation is terminated early due to LIMIT 0
or similar, some tuples may not have a mem layout because no PlanNode
has been generated to materialize them. The fix is to make
recomputeMemLayout() a no-op if the tuple does not have an existing
mem layout.
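A minimal sketch of the shape of the fix (names are illustrative, not
the exact TupleDescriptor API):
  class TupleSketch {
    private boolean hasMemLayout = false;
    void computeMemLayout() { hasMemLayout = true; /* assign offsets */ }
    // The fix: recomputing is a no-op if no layout was ever computed,
    // e.g. when plan generation stopped early due to LIMIT 0.
    void recomputeMemLayout() {
      if (!hasMemLayout) return;
      hasMemLayout = false;
      computeMemLayout();
    }
  }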
Testing:
- added regression test
Change-Id: I08548c6bfa7dbf4655e55636605bebb89d2a2239
Reviewed-on: http://gerrit.cloudera.org:8080/7264
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
This change modifies the logic of NOT IN predicate so that
the planner can calculate the correct node cardinality. Prior
to this change, both IN and NOT IN predicates shared the
same selectivity, which resulted in the same cardinality
during planning.
The selectivity is calculated by the following heuristic:
selectivity = 1 - (number of predicate children / number of distinct values)
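A minimal sketch of the heuristic (illustrative, not the actual Expr
code):
  class SelectivitySketch {
    // E.g. NOT IN (1, 2, 3) on a column with NDV 10 gives 0.7.
    static double notInSelectivity(int numInListValues, long ndv) {
      if (ndv <= 0) return -1;  // unknown; caller falls back to a default
      return 1.0 - (double) numInListValues / ndv;
    }
  }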
Change-Id: I69e6217257b5618cb63e13b32ba3347fa0483b63
Reviewed-on: http://gerrit.cloudera.org:8080/7168
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Adds a new query option DEFAULT_JOIN_DISTRIBUTION_MODE to
control which join distribution mode is chosen when the join
inputs have an unknown cardinality (e.g., missing stats) or when
the expected costs of the different strategies are equal.
Values for DEFAULT_JOIN_DISTRIBUTION_MODE: [BROADCAST, SHUFFLE]
Default: BROADCAST
Note that this change effectively undoes IMPALA-5120.
Testing:
- Added new planner tests
- Core/hdfs run passed
Change-Id: Ibd34442f422129d53bef5493fc9cbe7375a0765c
Reviewed-on: http://gerrit.cloudera.org:8080/7059
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change loads the missing tables in TPC-DS. In addition,
it also fixes up the loading of the partitioned table store_sales
so all partitions will be loaded. The existing TPC-DS queries are
also updated to use the parameters for qualification runs as noted
in the TPC-DS specification. Some hard-coded partition filters were
also removed. They were there due to the lack of dynamic partitioning
in the past. Some missing TPC-DS queries are also added to this change,
including query28 which discovered the infamous IMPALA-5251.
Having all tables in TPC-DS available paves the way for us to include
all supported TPCDS queries in our functional testing. Due to the change
in the data, planner tests and the E2E tests have different results than
before. The results of the E2E tests were compared against runs done
with Netezza and Vertica. The divergences were all due to the
truncation behavior of decimal types in DECIMAL_V1.
Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Reviewed-on: http://gerrit.cloudera.org:8080/6877
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
The main idea of this patch is to use table stats to
extrapolate the row counts for new/modified partitions.
Existing behavior:
- Partitions that lack the row count stat are ignored
when estimating the cardinality of HDFS scans. Such
partitions effectively have an estimated row count
of zero.
- We always use the row count stats for partitions that
have one. The row count may be inaccurate if data in
such partitions has changed significantly.
Summary of changes:
- Enhance COMPUTE STATS to also store the total number
of file bytes in the table.
- Use the table-level row count and file bytes stats
to estimate the number of rows in a scan (see the sketch below).
- A new impalad startup flag is added to enable/disable
the extrapolation behavior. The feature is disabled by
default. Note that even with the feature disabled,
COMPUTE STATS stores the file bytes so you can enable
the feature without having to run COMPUTE STATS again.
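A minimal sketch of the extrapolation (names are illustrative, not
the actual cardinality-estimation code):
  class ExtrapolationSketch {
    // Scale the row count recorded by COMPUTE STATS by the growth in
    // file bytes since the stats were computed.
    static long extrapolatedNumRows(long statsNumRows,
        long statsNumFileBytes, long currentFileBytes) {
      if (statsNumRows < 0 || statsNumFileBytes <= 0) return -1;  // no stats
      double rowsPerByte = (double) statsNumRows / statsNumFileBytes;
      return Math.round(rowsPerByte * currentFileBytes);
    }
  }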
Testing:
- Added new FE unit test
- Added new EE test
Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/6840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
A previous change, IMPALA-3742, added an exchange node and
sort node to plans for inserts into Kudu tables to partition
and sort the input to match the target table.
This patch enables INSERT hints for Kudu tables - 'noshuffle'
which removes the exchange node from the plan and
'noclustered' which removes the sort node.
Insert hints have no effect for inserts that are small enough
to result in single-node execution.
Testing:
- Updated FE planner and analysis tests.
- Ran Kudu EE tests.
Change-Id: Idbd1ef977446ffee157ce3ce0b476e1f08a75d05
Reviewed-on: http://gerrit.cloudera.org:8080/6980
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Syntax:
<tableref> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]
The first number specifies the percent of table bytes to sample.
The second number specifies the random seed to use.
The sampling is coarse-grained. Impala keeps randomly adding
files to the sample until at least the desired percentage of
file bytes have been reached.
Examples:
SELECT * FROM t TABLESAMPLE SYSTEM(10)
SELECT * FROM t TABLESAMPLE SYSTEM(50) REPEATABLE(1234)
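A minimal sketch of the file-level sampling described above (an
assumed helper shape, not the actual scan-node code):
  import java.util.*;
  class SampleSketch {
    // Shuffle the files with the REPEATABLE seed, then keep taking
    // files until the sampled bytes reach the requested percentage.
    static List<String> sampleFiles(LinkedHashMap<String, Long> fileSizes,
        double percent, long seed) {
      List<String> files = new ArrayList<>(fileSizes.keySet());
      Collections.shuffle(files, new Random(seed));
      long totalBytes = 0;
      for (long size : fileSizes.values()) totalBytes += size;
      long targetBytes = (long) Math.ceil(totalBytes * percent / 100.0);
      List<String> sample = new ArrayList<>();
      long sampledBytes = 0;
      for (String file : files) {
        if (sampledBytes >= targetBytes) break;
        sample.add(file);
        sampledBytes += fileSizes.get(file);
      }
      return sample;
    }
  }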
Testing:
- Added parser, analyser, planner, and end-to-end tests
- Private core/hdfs run passed
Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Reviewed-on: http://gerrit.cloudera.org:8080/6868
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The sortby() hint is superseded by the SORT BY SQL clause, which was
introduced in IMPALA-4166. This change removes the hint.
Change-Id: I83e1cd6fa7039035973676322deefbce00d3f594
Reviewed-on: http://gerrit.cloudera.org:8080/6885
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
IMPALA-4166 introduced a bug by duplicating code that adds sort
expressions. Upon re-analysis, this code would hit an
IndexOutOfBoundsException.
Change-Id: Ibebba29509ae7eaa691fe305500cda6bd41a179a
Reviewed-on: http://gerrit.cloudera.org:8080/6921
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Non-deterministic exprs which evaluate as constant should not be
used during HDFS partition pruning. We consider Exprs which have no
SlotRefs as bound by default, and thus we end up trying to apply
them indiscriminately. Constant propagation makes this situation
easier to run into, and the behavior is rather unexpected.
The fix for now is to explicitly disallow non-deterministic Exprs
in partition pruning.
Change-Id: I91054c6bf017401242259a1eff5e859085285546
Reviewed-on: http://gerrit.cloudera.org:8080/6575
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change builds on the support for reading and writing
TIMESTAMP columns to Kudu tables (see [1]), adding support
for pushing TIMESTAMP predicates to Kudu for scans.
Binary predicates and IN list predicates are supported.
Testing: Added some planner and EE tests to validate the
behavior.
1: https://gerrit.cloudera.org/#/c/6526/
Change-Id: I08b6c8354a408e7beb94c1a135c23722977246ea
Reviewed-on: http://gerrit.cloudera.org:8080/6789
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds support for adding SORT BY (...) clauses to CREATE
TABLE and ALTER TABLE statements. Examples are:
CREATE TABLE t (i INT, j INT, k INT) PARTITIONED BY (l INT) SORT BY (i, j);
CREATE TABLE t SORT BY (int_col,id) LIKE u;
CREATE TABLE t LIKE PARQUET '/foo' SORT BY (id,zip);
ALTER TABLE t SORT BY (int_col,id);
ALTER TABLE t SORT BY ();
Sort columns can only be specified for Hdfs tables and effectiveness may
vary based on storage type; for example TEXT tables will not see
improved compression. The SORT BY clause must not contain clustering
columns. The columns in the SORT BY clause are stored in the
'sort.columns' table property and will result in an additional SORT node
being added to the plan before the final table sink. Specifying sort
columns also enables clustering during inserts, so the SORT node will
contain all partitioning columns first, followed by the sort columns. We
do this because sort columns add a SORT node to the plan and adding the
clustering columns to the SORT node is cheap.
Sort columns supersede the sortby() hint, which we will remove in a
subsequent change (IMPALA-5144). Until then, it is possible to specify
sort columns using both ways at the same time and the column lists
will be concatenated.
Change-Id: I08834f38a941786ab45a4381c2732d929a934f75
Reviewed-on: http://gerrit.cloudera.org:8080/6495
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, we defaulted to broadcast join when stats were
missing, but this can lead to disastrous plans when the
right hand side is actually large.
It's always difficult to make good plans when stats are missing,
but defaulting to partitioned joins should reduce the risk of
disastrous plans.
Testing:
- Added a planner test that joins a table with no stats.
Change-Id: Ie168ecfcd5e7c5d3c60d16926c151f8f134c81e0
Reviewed-on: http://gerrit.cloudera.org:8080/6803
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
This generates min/max predicates for InPredicates that
have only constant values in the IN list. It is only
used for statistics filtering on Parquet files.
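The underlying observation, as a minimal sketch (illustrative; the
planner actually builds min/max predicate exprs for the scanner):
  import java.util.*;
  class InListBounds {
    // slot IN (5, 1, 9) implies slot >= 1 AND slot <= 9, so a row
    // group whose stats range lies entirely outside [1, 9] can be
    // skipped.
    static int[] bounds(List<Integer> inListConstants) {
      return new int[] { Collections.min(inListConstants),
                         Collections.max(inListConstants) };
    }
  }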
Change-Id: I4a88963a7206f40a867e49eceeaf03fdd4f71997
Reviewed-on: http://gerrit.cloudera.org:8080/6810
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Bulk DMLs (INSERT, UPSERT, UPDATE, and DELETE) for Kudu
are currently painful because we just send rows randomly,
which creates a lot of work for Kudu since it partitions
and sorts data before writing, causing writes to be slow
and leading to timeouts.
We can alleviate this by sending the rows to Kudu already
partitioned and sorted. This patch partitions and sorts
rows according to Kudu's partitioning scheme for INSERTs
and UPSERTs. A followup patch will handle UPDATE and DELETE.
It accomplishes this by inserting an exchange node and a sort
node into the plan before the operation. Both the exchange and
the sort are given a KuduPartitionExpr which takes a row and
calls into the Kudu client to return its partition number.
It also disallows INSERT hints for Kudu tables, since the
hints that we support (SHUFFLE, CLUSTER, SORTBY) no longer
make sense.
Testing:
- Updated planner tests.
- Ran the Kudu functional tests.
- Ran performance tests demonstrating that we can now handle much
larger inserts without having timeouts.
Change-Id: I84ce0032a1b10958fdf31faef225372c5c38fdc4
Reviewed-on: http://gerrit.cloudera.org:8080/6559
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Implements constant propagation within conjuncts and applies the
optimization to scan conjuncts and collection conjuncts within Hdfs
scan nodes. The optimization is applied during planning. At scan
nodes in particular, we want to optimize to enable partition pruning;
for example, given a = 10 AND b = a, propagation yields b = 10, which
can prune partitions on b. In certain cases, we might end up with a
FALSE conditional, which now will convert to an EmptySet node.
Testing: Expanded the test cases for the planner to achieve constant
propagation. Added Kudu, datasource, Hdfs and HBase tests to validate
we can create EmptySetNodes.
Change-Id: I79750a8edb945effee2a519fa3b8192b77042cb4
Reviewed-on: http://gerrit.cloudera.org:8080/6389
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Previously, exprs used in sorts were evaluated lazily. This can
potentially be bad for performance if the exprs are expensive to
evaluate, and it can lead to crashes if the exprs are
non-deterministic, as this violates assumptions of our sorting
algorithm.
This patch addresses these issues by materializing ordering exprs.
It does so when the expr is non-deterministic (including when it
contains a UDF, since we currently cannot know whether a UDF is
deterministic), or when its cost exceeds a threshold (or the
cost is unknown).
Testing:
- Added e2e tests in test_sort.py.
- Updated planner tests.
Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Reviewed-on: http://gerrit.cloudera.org:8080/6322
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Since commit d2d3f4c (on asf-master), TAggregateExpr contains
the logical input types of the Aggregate Expr. The reason they
are included is that merging aggregate expressions will have
input types of the intermediate values, which aren't necessarily
the same as the input types. For instance, NDV() uses a binary
blob as its intermediate value and it's passed to its merge
aggregate expressions as a StringVal but the input type of NDV()
in the query could be DecimalVal. In this case, we consider
DecimalVal as the logical input type while StringVal is the
intermediate type. The logical input types are accessed by the
BE via GetConstFnAttr() during interpretation and constant
propagation during codegen.
To handle distinct aggregate expressions (e.g. select count(distinct)),
the FE uses 2-phase aggregation by introducing an extra phase of
split/merge aggregation in which the distinct aggregate expressions'
inputs are converted and added to the group-by expressions in the first
phase, while the non-distinct aggregate expressions go through the
normal split/merge treatment.
The bug is that the existing code incorrectly propagates the intermediate
types of the non-grouping aggregate expressions as the logical input types
to the merging aggregate expressions in the second phase of aggregation.
The input aggregate expressions for the non-distinct aggregate expressions
in the second phase aggregation are already merging aggregate expressions
(from phase one) in which case we should not treat its input types as
logical input types.
This change fixes the problem above by checking if the input aggregate
expression passed to FunctionCallExpr.createMergeAggCall() is already
a merging aggregate expression. If so, it will use the logical input
types recorded in its 'mergeAggInputFn_' as references for its logical
input types instead of the aggregate expression input types themselves.
Change-Id: I158303b20d1afdff23c67f3338b9c4af2ad80691
Reviewed-on: http://gerrit.cloudera.org:8080/6724
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Constant folding null values in CastExprs causes CTAS statements
to fail. This is a regression from the behavior observed before
constant folding was introduced. With this change, null is no longer
constant-folded in CastExprs.
Change-Id: Ia7aa1ab7f53a9dcc7560ded321a9d1e1ee2d18e3
Reviewed-on: http://gerrit.cloudera.org:8080/6663
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Compute the minimum buffer requirement for spilling nodes and
per-host estimates for the entire plan tree.
This builds on top of the existing resource estimation code, which
computes the sets of plan nodes that can execute concurrently. This is
cleaned up so that the process of producing resource requirements is
clearer. It also removes the unused VCore estimates.
Fixes various bugs and other issues:
* computeCosts() was not called for unpartitioned fragments, so
the per-operator memory estimate was not visible.
* Nested loop join was not treated as a blocking join.
* The TODO comment about union was misleading.
* Fix the computation for mt_dop > 1 by distinguishing per-instance and
per-host estimates.
* Always generate an estimate instead of unpredictably returning
-1/"unavailable" in many circumstances - there was little rhyme or
reason to when this happened.
* Remove the special "trivial plan" estimates. With the rest of the
cleanup we generate estimates <= 10MB for those trivial plans through
the normal code path.
I left one bug (IMPALA-4862) unfixed because it is subtle, will affect
estimates for many plans and will be easier to review once we have the
test infra in place.
Testing:
Added basic planner tests for resource requirements in both the MT and
non-MT cases.
Re-enabled the explain_level tests, which appear to be the only
coverage for many of these estimates. Removed the complex and
brittle test cases and replaced with a couple of much simpler
end-to-end tests.
Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
Reviewed-on: http://gerrit.cloudera.org:8080/5847
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This detects IS NULL / IS NOT NULL and creates a Kudu
predicate to push this to Kudu.
For testing, there are planner tests to verify that the
predicate is pushed to Kudu. There are also end-to-end
tests for correctness.
Change-Id: I9c96fec8d41f77222879c0ffdd6940b168e47e65
Reviewed-on: http://gerrit.cloudera.org:8080/5958
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
The union node acts as a pass-through operator and forwards row
batches from its children without materializing them. This is done
when the child's tuple layout is identical to the union node's tuple
layout and no functions need to be applied to the child row batches.
Removed operand reordering in the FE because it's simpler and safer to
handle all passthrough children before non-passthrough children in the
BE. The recent improvements to memory management allowed us to remove
this requirement.
Testing:
- Added new planner and end to end tests that cover the new
functionality.
- Updated existing tests to reflect the new behavior.
Perf:
Ran a benchmark on a local 10 GB tpcds dataset. I used an unpartitioned
version of the store_sales table. There was over a 2x performance
improvement for the following query:
SELECT
COUNT(ss_sold_time_sk),
COUNT(ss_item_sk),
COUNT(ss_customer_sk),
COUNT(ss_cdemo_sk),
COUNT(ss_hdemo_sk),
COUNT(ss_addr_sk),
COUNT(ss_store_sk),
COUNT(ss_promo_sk),
COUNT(ss_ticket_number),
COUNT(ss_quantity),
COUNT(ss_wholesale_cost),
COUNT(ss_list_price),
COUNT(ss_sales_price),
COUNT(ss_ext_discount_amt),
COUNT(ss_ext_sales_price),
COUNT(ss_ext_wholesale_cost),
COUNT(ss_ext_list_price),
COUNT(ss_ext_tax),
COUNT(ss_coupon_amt),
COUNT(ss_net_paid),
COUNT(ss_net_paid_inc_tax),
COUNT(ss_net_profit),
COUNT(ss_sold_date_sk)
FROM (
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
) t
Before:
Total Time: 43s164ms
Summary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
------------------------------------------------------------------------------------------------------------------------------
13:AGGREGATE 1 224.721us 224.721us 1 1 28.00 KB -1.00 B FINALIZE
12:EXCHANGE 1 24.578us 24.578us 3 1 0 -1.00 B UNPARTITIONED
11:AGGREGATE 3 2s402ms 3s060ms 3 1 119.00 KB 10.00 MB
00:UNION 3 35s380ms 37s846ms 288.01M 288.01M 3.08 MB 0
|--02:SCAN HDFS 3 184.197ms 219.931ms 28.80M 28.80M 535.03 MB 1.88 GB store_sales_unpartitioned
|--03:SCAN HDFS 3 131.956ms 153.401ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
|--04:SCAN HDFS 3 178.456ms 247.721ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
|--05:SCAN HDFS 3 189.398ms 242.251ms 28.80M 28.80M 535.01 MB 1.88 GB store_sales_unpartitioned
|--06:SCAN HDFS 3 122.786ms 156.528ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
|--07:SCAN HDFS 3 147.467ms 183.391ms 28.80M 28.80M 535.13 MB 1.88 GB store_sales_unpartitioned
|--08:SCAN HDFS 3 147.502ms 186.273ms 28.80M 28.80M 535.01 MB 1.88 GB store_sales_unpartitioned
|--09:SCAN HDFS 3 130.086ms 154.682ms 28.80M 28.80M 535.04 MB 1.88 GB store_sales_unpartitioned
|--10:SCAN HDFS 3 122.701ms 161.056ms 28.80M 28.80M 534.89 MB 1.88 GB store_sales_unpartitioned
01:SCAN HDFS 3 287.863ms 330.436ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
After:
Total Time: 19s139ms
Summary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
------------------------------------------------------------------------------------------------------------------------------
13:AGGREGATE 1 166.241us 166.241us 1 1 28.00 KB -1.00 B FINALIZE
12:EXCHANGE 1 71.695us 71.695us 3 1 0 -1.00 B UNPARTITIONED
11:AGGREGATE 3 2s971ms 3s809ms 3 1 3.08 MB 10.00 MB
00:UNION 3 207.956ms 222.846ms 288.01M 288.01M 0 0
|--02:SCAN HDFS 3 1s533ms 1s535ms 28.80M 28.80M 532.28 MB 1.88 GB store_sales_unpartitioned
|--03:SCAN HDFS 3 1s554ms 1s669ms 28.80M 28.80M 525.73 MB 1.88 GB store_sales_unpartitioned
|--04:SCAN HDFS 3 1s568ms 1s716ms 28.80M 28.80M 525.03 MB 1.88 GB store_sales_unpartitioned
|--05:SCAN HDFS 3 1s503ms 1s617ms 28.80M 28.80M 527.43 MB 1.88 GB store_sales_unpartitioned
|--06:SCAN HDFS 3 1s560ms 1s634ms 28.80M 28.80M 528.52 MB 1.88 GB store_sales_unpartitioned
|--07:SCAN HDFS 3 1s489ms 1s643ms 28.80M 28.80M 534.81 MB 1.88 GB store_sales_unpartitioned
|--08:SCAN HDFS 3 1s534ms 1s581ms 28.80M 28.80M 528.10 MB 1.88 GB store_sales_unpartitioned
|--09:SCAN HDFS 3 1s558ms 1s674ms 28.80M 28.80M 526.77 MB 1.88 GB store_sales_unpartitioned
|--10:SCAN HDFS 3 1s504ms 1s692ms 28.80M 28.80M 527.83 MB 1.88 GB store_sales_unpartitioned
01:SCAN HDFS 3 1s682ms 1s911ms 28.80M 28.80M 526.14 MB 1.88 GB store_sales_unpartitioned
Change-Id: Ia8f6d5062724ba5b78174c3227a7a796d10d8416
Reviewed-on: http://gerrit.cloudera.org:8080/5816
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This changes the HdfsParquetTableWriter to populate the
parquet::RowGroup::sorting_columns list with all columns mentioned in a
'sortby()' hint within INSERT statements. The columns are added to the
list in the order in which they appear inside the hint.
The change also adds backports.tempfile to the python requirements to
provide 'tempfile.TemporaryDirectory' on python 2.7.
The change also changes the default ordering for columns mentioned in
'sortby()' hints from descending to ascending.
To test this change, we write a table with a 'sortby()' hint and
verify that the sorting_columns get populated correctly.
Change-Id: Ib42aab585e9e627796e9510e783652d49d74b56c
Reviewed-on: http://gerrit.cloudera.org:8080/6219
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Here is a basic summary of the changes:
Frontend looks for conjuncts that operate on a single slot and passes
a map from slot id to the conjunct index through thrift to the backend.
The conjunct indices are the indices into the normal PlanNode conjuncts list.
The conjuncts need to satisfy certain conditions:
1. They are bound on a single slot
2. They are deterministic (no random functions)
3. They evaluate to FALSE on a NULL input. This is because the dictionary
does not include NULLs, so any condition that evaluates to TRUE on NULL
cannot be evaluated by looking only at the dictionary.
The backend converts the indices into ExprContexts. These are cloned in
the scanner threads.
The dictionary read codepath has been moved out of ReadDataPage into
its own function, InitDictionary, which has also been turned into its
own step in row group initialization. ReadDataPage will not see any
dictionary pages unless the parquet file is invalid.
For dictionary filtering, we initialize dictionaries only as needed to evaluate
the conjuncts. The Parquet scanner evaluates the dictionary filter conjuncts on the
dictionary to see if any dictionary entry passes. If no entry passes, the row
group is eliminated. If the row group passes the dictionary filtering, then we
initialize all remaining dictionaries.
Dictionary filtering is controlled by a new query option,
parquet_dictionary_filtering, which is on by default.
Since column chunks can have a mixture of encodings, dictionary
filtering uses three tests to determine whether a column chunk is
purely dictionary encoded (see the sketch after the list):
1. If the encoding_stats is in the parquet file, then use it to determine if
there are only dictionary encoded pages (i.e. there are no data pages with
an encoding other than PLAIN_DICTIONARY).
-OR-
2. If the encoding stats are not present, then look at the encodings. The column
is purely dictionary encoded if:
a) PLAIN_DICTIONARY is present
AND
b) Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed
-OR-
3. If this file was written by an older version of Impala, then we know that
dictionary failover happens when the dictionary reaches 40,000 values.
Dictionary filtering can proceed as long as the dictionary is smaller than
that.
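A minimal sketch of check #2 (enum and helper names are illustrative,
following parquet-format rather than the actual scanner code):
  import java.util.*;
  class DictCheck {
    enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE, BIT_PACKED }
    // Passes if PLAIN_DICTIONARY is present and nothing outside
    // {PLAIN_DICTIONARY, RLE, BIT_PACKED} is listed.
    static boolean isPurelyDictEncoded(Set<Encoding> encodings) {
      return encodings.contains(Encoding.PLAIN_DICTIONARY)
          && EnumSet.of(Encoding.PLAIN_DICTIONARY, Encoding.RLE,
                        Encoding.BIT_PACKED).containsAll(encodings);
    }
  }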
parquet-mr writes the encoding list correctly in the current version in our
environment (1.5.0). This means that check #2 works on some existing files
(potentially most existing parquet-mr files).
parquet-mr writes the encoding stats starting in 1.9.0. This is the version
where check #1 will start working.
Impala's parquet writer now implements both, so either check above will work.
Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Reviewed-on: http://gerrit.cloudera.org:8080/5904
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the forms <slot> <op> <constant> or
<constant> <op> <slot>, where <op> is LT, LE, GE, GT, and EQ.
Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Reviewed-on: http://gerrit.cloudera.org:8080/6032
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
The bug: The DisjointSet maintains a set of unique item sets
using a HashSet<Set<T>>. The problem is that we modified
the Set<T> elements after inserting them into the HashSet.
This caused the removal of elements from the HashSet to fail.
Removal is required for maintaining a consistent DisjointSet.
The removal could even fail for the same Set<T> instance because
the hashCode() changed from when the Set<T> was originally
inserted to when the removal was attempted due to mutation
of the Set<T>.
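The failure mode reproduces outside Impala; a minimal self-contained
demonstration:
  import java.util.*;
  class MutatedSetBug {
    public static void main(String[] args) {
      Set<Set<Integer>> sets = new HashSet<>();
      Set<Integer> item = new HashSet<>(Arrays.asList(1));
      sets.add(item);
      item.add(2);  // hashCode() changes after insertion
      System.out.println(sets.remove(item));  // false: lookup misses
      System.out.println(sets.size());        // 1: stale element remains
    }
  }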
An inconsistent DisjointSet can lead to incorrect equivalence
classes, which can lead to missing, redundant and even
non-executable predicates. Incorrect results and crashes are
possible.
For most queries, an inconsistent DisjointSet does not alter
the equivalence classes, and even fewer queries have incorrect
plans.
In fact, many of our existing planner tests trigger this bug,
but only 3 of them lead to an incorrect value transfer graph,
and all 3 had correct plans.
The fix: Use an IdentityHashMap to store the set of item sets.
It does not rely on the hashCode() and equals() of the stored
elements, so the same object can be added and later removed,
even when mutated in the meantime.
Testing:
- Added a Preconditions check in DisjointSet that asserts
correct removal of an item set. Many of our existing tests
hit the check before this fix.
- Added a new unit test for DisjointSet which triggers
the bug.
- Augmented DisjointSet.checkConsistency() to check for
inconsistency in the set of item sets.
- Added validation of the value-transfer graph in
single-node planner tests.
- A private core/hdfs run succeeded.
Change-Id: I609c8795c09becd78815605ea8e82e2f99e82212
Reviewed-on: http://gerrit.cloudera.org:8080/5980
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The bug: Our detection of partition compatibility for
grouping aggregations and analytic functions did not take into
account the effect of outer joins within the same fragment.
As a result, we used to incorrectly omit a required hash exchange.
For example, a hash exchange + merge phase is required if the
grouping expressions of an aggregation reference tuples
that are made nullable within the same fragment. The exchange is
needed to bring together NULLs produced by outer-join non-matches.
The fix: Check that the grouping/partition exprs do not reference
tuples that are made nullable within the same fragment.
Testing: Planner tests pass locally.
Change-Id: I121222179378e56836422a69451d840a012c9e54
Reviewed-on: http://gerrit.cloudera.org:8080/5774
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
The bug was that expr rewrite rules such as ExtractCommonConjunctRule
analyzed their own output, which doesn't work for syntactic elements
that allow column aliases, such as the HAVING clause.
The fix was to remove the analysis step (the re-analysis happens anyway
in AnalysisCtx).
Change-Id: Ife74c61f549f620c42f74928f6474e8a5a7b7f00
Reviewed-on: http://gerrit.cloudera.org:8080/5662
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
This change introduces the sortby() query plan hint for insert
statements. When specified, sortby(a, b) will add an additional sort
step to the plan to order data by columns a, b before inserting it into
the target table.
Change-Id: I37a3ffab99aaa5d5a4fd1ac674b3e8b394a3c4c0
Reviewed-on: http://gerrit.cloudera.org:8080/5051
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
The KuduScanNode attempts to push IN list predicates to the
Kudu scan, but NULL literals cannot be pushed. The code in
KuduScanNode needed to check whether any Literal in the
InPredicate is a NullLiteral, in which case the entire IN
list should not be pushed to Kudu.
The same handling is already in place for binary predicate
pushdown.
Change-Id: Iaf2c10a326373ad80aef51a85cec64071daefa7b
Reviewed-on: http://gerrit.cloudera.org:8080/5505
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Implements the following conservative but correct policy for assigning
predicates from the On-clause of an inner join:
If the predicate references an outer-joined tuple, then evaluate it at
the inner join that the On-clause belongs to.
Cleans up Analyzer.canEvalPredicate().
Change-Id: Idf45323ed9102ffb45c9d94a130ea3692286f215
Reviewed-on: http://gerrit.cloudera.org:8080/4982
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The main issue was that the eval cost was not set for
timestamp literals, so a preconditions check was hit
when trying to order a list of conjuncts by cost.
Another subtle issue made the bug only reproducible by
a specific query against a Kudu table in our tests,
although the bug is not Kudu specific: The eval cost
of Exprs was not recomputed in analyze(), even after
resetting an Expr, e.g., during a substitution. As a
result, the bug was only reproducible for a list
of conjuncts that contained an inferred predicate
with a timestamp literal.
This patch does not contain a fix for that issue due
to its complexity/risk. It is tracked in IMPALA-4620.
Testing: Ran planner tests locally. Ran query_test.py
locally. A private core/hdfs run passed.
Change-Id: Ife30420bafbd1c64a5e3385e5755909110b4b354
Reviewed-on: http://gerrit.cloudera.org:8080/5404
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This commit also removes the now unused `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords that were going to be newly released in Impala 2.6.
Additionally, a few remaining uses of the `DISTRIBUTE BY` syntax have
been switched to `PARTITION BY`.
Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Impala cannot correctly evaluate or assign some non-deterministic
predicates. This patch improves the error message shown when
trying to evaluate such unsupported predicates for the purpose
of partition pruning.
Change-Id: I94765f62bde94f4faa7fc5c26d928099ca1496d1
Reviewed-on: http://gerrit.cloudera.org:8080/5386
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Background: We generally allow the assignment of predicates below the
nullable side of a left/right outer join, explained as follows using an
example:
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id
WHERE t2.int_col < 10
The scan of 't2' picks up 't2.int_col < 10' via
Analyzer.getBoundPredicates() and recognizes that the predicate must
also be evaluated by a join later, so the predicate is not marked as
assigned. The join then picks up the unassigned predicate via
Analyzer.getUnassignedConjuncts().
The bug was that our logic for detecting whether a bound predicate must
also be evaluated at a join node was flawed because it only considered
whether the tuples of the source or destination predicate were outer
joined (plus other conditions).
The underlying assumption is that either the source or the destination
predicate is bound by a tuple produced by a TableRef, but in the buggy
query the source predicate is bound by an aggregation tuple, so we
incorrectly marked the bound predicate as assigned in
Analyzer.getBoundPredicates().
The fix is to conservatively not mark bound predicates as assigned if
the slots referenced by the predicate have equivalent slots that
belong to an outer-joined tuple. As a result, a plan node may pick up
the same predicate multiple times, once via
Analyzer.getBoundPredicates() and another time via
Analyzer.getUnassignedConjuncts(). Those are deduped now.
The following example explains the duplicate predicate assignment:
SELECT * FROM (SELECT * FROM t t1) a LEFT OUTER JOIN t b ON a.id = b.id
WHERE a.id < 10
1. The predicate 'a.id < 10' gets migrated into the inline view.
'a.id < 10' is marked as assigned but is still registered as
a single-tid conjunct in the Analyzer for potential propagation.
2. The scan node of 't1' calls Analyzer.getBoundPredicates() and
generates 't1.id < 10' based on the source predicate 'a.id < 10'.
3. The scan node of 't1' picks up the migrated conjunct 't1.id < 10'
via Analyzer.getUnassignedConjuncts().
Change-Id: I774d13a13ad1e8fe82512df98dc29983bdd232eb
Reviewed-on: http://gerrit.cloudera.org:8080/4960
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The bug was a simple oversight: in KuduScanNode.init()
we forgot to call Analyzer.getBoundPredicates().
Change-Id: I19a38d6ea8cc0d2b0ddc3808d1f9ffef5ce306a8
Reviewed-on: http://gerrit.cloudera.org:8080/5365
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Fixes the KuduScanNode to convert InPredicates to
KuduPredicates and push them to the Kudu scan if possible.
An InPredicate can be pushed to the scan if the expression is of
the exact form:
<SlotRef> IN (<LiteralExpr>, <LiteralExpr>, ...)
That means the InPredicate has the following properties:
1) It has a list of literal values (i.e. not a subquery);
All values are LiteralExprs (not SlotRefs).
2) Not negative, i.e. only 'IN' supported, not 'NOT IN'
3) The SlotRef is not wrapped in any casts
4) The types of all values match the type of the SlotRef
exactly.
A planner test was added exercising all supported types as
well as exprs where the values would not be supported.
TODO: perf testing
TODO: consider a limit on the number of list values before
keeping the predicate on the Impala scan node
(determine from testing)
Change-Id: I8988d4819d20d467b48e286917e347ca00f60cf0
Reviewed-on: http://gerrit.cloudera.org:8080/5316
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Impala used to incorrectly assign On-clause equality predicates from an
outer join if those predicates referenced multiple tables, but only one
side of the outer join.
The fix is to add an additional check in Analyzer.getEqJoinConjuncts()
to prevent that incorrect assignment.
Change-Id: I719e0eeacccad070b1f9509d80aaf761b572add0
Reviewed-on: http://gerrit.cloudera.org:8080/4986
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The bug: We used to reset() the qualifier of union operands
to their original value obtained during parsing. This leads to
problems when union operands are unnested and we need to rewrite
Subqueries. In particular, the first union operand of a nested union
was reset() to a null qualifier, but that operand could be somewhere
in the middle of the list of unnested operands in the parent. At that
point, we've lost information about the qualifier of the unnested
operand.
The fix: The simplest solution is to not reset() the qualifier.
The other alternative would be to reset() the qualifier, but also
undo any unnesting. That seems unnecessary and wasteful.
Change-Id: I157bb0f08c4a94fd779487d7c23edd64a537a1f6
Reviewed-on: http://gerrit.cloudera.org:8080/4963
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
With this commit we add support for auditing all Kudu-specific
operations and we enable column lineage for INSERT and UPSERT
statements on Kudu tables. No lineage output is generated for DELETE and
UPDATE statements.
Change-Id: Idc4ca1cd63bcfa4370c240a5c4a4126ed6704f4d
Reviewed-on: http://gerrit.cloudera.org:8080/5151
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Adds a new ExprRewriteRule for replacing constant expressions
with their literal equivalent via BE evaluation. Applies the
new rule together with the existing ones on the parse tree,
after analysis.
Limitations
- Constant folding is applied on the unresolved expressions.
As a result, it only works for expressions that are constant
within a single query block, as opposed to expressions that
may become constant after fully substituting inline-view exprs.
- Exprs are not normalized, so some opportunities for constant
folding are missed for certain expr-tree shapes.
This patch includes the following interesting changes:
- Introduces a timestamp literal that can only be produced
by constant folding (not expressible directly via SQL).
- To make sure that rewrites have no user-visible effect,
the original result types and column labels of the top-level
statement are restored after the rewrites are performed.
- Does not fold exprs if their evaluation resulted in a
warning or error, or if the resulting value is not
representable by corresponding FE LiteralExpr.
- Fixes an existing issue with converting strings between
the FE/BE. Strings produced in the BE that have characters
with a value > 127 are not correctly deserialized into a
Java String via thrift. We detect this case during constant
folding and abandon folding of such exprs.
- Fixes several issues with detecting/reporting errors in
NativeEvalConstExprs().
- Cleans up ExprContext::GetValue() into
ExprContext::GetConstantValue() which clarifies its only use
of evaluating exprs from the FE.
Testing:
- Modifies expr-test.cc to run all tests through the constant
folding path.
- Adds basic planner and rewrite rule tests.
- Exhaustive test run passed
Change-Id: If672b703db1ba0bfc26e5b9130161798b40a69e9
Reviewed-on: http://gerrit.cloudera.org:8080/5109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Our NumericLiteral is backed by a BigDecimal which cannot
represent the special float values NaN, infinity or negative zero.
As a result, when evaluating constant expressions from the FE we
hit an exception when trying to create a NumericLiteral from
a NaN or infinity value. Before, negative zero would silently
get converted to zero which is dangerous.
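The limitation is in java.math.BigDecimal itself; a minimal
demonstration:
  import java.math.BigDecimal;
  class NanFolding {
    public static void main(String[] args) {
      // Both constructors throw NumberFormatException: BigDecimal
      // cannot represent NaN or infinities, so folding must bail out.
      try { new BigDecimal(Double.NaN); }
      catch (NumberFormatException e) { System.out.println("NaN rejected"); }
      try { new BigDecimal(Double.POSITIVE_INFINITY); }
      catch (NumberFormatException e) { System.out.println("inf rejected"); }
      // Negative zero is silently normalized to zero.
      System.out.println(new BigDecimal(-0.0));  // prints 0
    }
  }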
The fix is to treat the expr evaluation as a failure and not
replace the constant Expr with a LiteralExpr.
Change-Id: I8243b2ee9fa9c470d078b385583f2f48b606a230
Reviewed-on: http://gerrit.cloudera.org:8080/5050
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins