Instead of materializing empty rows when computing count(*), we use
the value stored in the Parquet RowGroup.num_rows field. The Parquet
scanner tuple is modified to have one slot into which we write the
num_rows statistic. The aggregate function is changed from count to a
special sum function that is initialized to 0. We also add a rewrite
rule so that count(<literal>) is rewritten to count(*), to make sure
that this optimization is applied in all cases.
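A minimal sketch of the idea (types and names are illustrative, not
the actual scanner code): each row group contributes the num_rows
value recorded in the Parquet footer, and the special sum aggregate
adds those up, so no rows are materialized.
  import java.util.List;
  class CountStarSketch {
    // Sum the footer-recorded row counts; one value per row group.
    static long countStar(List<Long> rowGroupNumRows) {
      long total = 0;
      for (long numRows : rowGroupNumRows) total += numRows;
      return total;
    }
  }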
Testing:
- Added functional and planner tests
Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Reviewed-on: http://gerrit.cloudera.org:8080/6812
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs
The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join unless there is strong evidence to the contrary,
as follows.
For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication, and if the joint NDV is significantly smaller than
the right-hand side row count, then we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of conjuncts
could represent a FK/PK relationship.
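A minimal sketch of this heuristic, with illustrative names and an
assumed cutoff for "significantly smaller" (the patch's actual
constant may differ):
  class FkPkSketch {
    // Joint NDV of the rhs join columns vs. the rhs row count: if the
    // joint NDV is much smaller, the rhs is likely not a PK.
    static boolean assumeFkPk(long[] rhsJoinColumnNdvs, long rhsRowCount) {
      double jointNdv = 1.0;
      for (long ndv : rhsJoinColumnNdvs) jointNdv *= ndv;
      double cutoff = 0.9;  // assumed threshold for illustration
      return jointNdv >= cutoff * rhsRowCount;
    }
  }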
Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.
Performance testing:
1. Full TPC-DS run on 10TB:
- Q10 improved by >100x
- Q72 improved by >25x
- Q17,Q26,Q29 improved by 2x
- Q64 regressed by 10x
- Total runtime: Improved by 2x
- Geomean: Minor improvement
The regression of Q64 is understood and we will try to address it
in follow-on changes. The previous plan was better by accident and
not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
- No perf differences
Testing:
- The existing planner tests cover the changes.
- Core/hdfs run passed.
Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Reviewed-on: http://gerrit.cloudera.org:8080/7257
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This is similar to the single-node execution optimisation, but applies
to slightly larger queries that should run in a distributed manner but
won't benefit from codegen.
This adds a new query option disable_codegen_rows_threshold that
defaults to 50,000. If fewer than this number of rows are processed
by a plan node per impalad, the cost of codegen almost certainly
outweighs the benefit.
Using rows processed as a threshold is justified by a simple
model that assumes the cost of codegen and execution per row for
the same operation are proportional. E.g. if x is the complexity
of the operation, n is the number of rows processed, C is a
constant factor giving the cost of codegen, and Ec/Ei are constant
factors giving the per-row cost of codegen'd and interpreted
execution, then the cost of the codegen'd operator is C * x + Ec * x * n
and the cost of the interpreted operator is Ei * x * n. Rearranging
shows that interpretation is cheaper if n < C / (Ei - Ec), i.e. that
(at least with the simplified model) it makes sense to choose
interpretation or codegen based on a constant row-count threshold.
The model also implies that it is somewhat safer to choose codegen,
because the additional cost of codegen is O(1) while the additional
cost of interpretation is O(n).
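Under this model the decision reduces to comparing the per-node row
count against a constant; a minimal sketch with parameter names
following the prose above:
  class CodegenCostSketch {
    // Codegen'd cost: C*x + Ec*x*n. Interpreted cost: Ei*x*n.
    // The complexity x cancels out of the comparison.
    static boolean interpretationCheaper(long n, double c, double ec,
        double ei) {
      return n < c / (ei - ec);
    }
  }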
I ran some experiments with TPC-H Q1, varying the input table size, to
determine the cutover point at which codegen became beneficial.
The cutover was around 150k rows per node for both text and parquet.
At 50k rows per node disabling codegen was very beneficial - around
0.12s versus 0.24s. To be somewhat conservative I set the default
threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the
cutover tends to be higher because there are plan nodes that process
many fewer than the max rows.
Fix a couple of minor issues in the frontend - the numNodes_
calculation could return 0 for Kudu, and the single-node optimization
didn't correctly handle a scan node with conjuncts, a limit, and
missing stats (it considered the estimate still valid).
Testing:
Updated e2e tests that set disable_codegen to also set
disable_codegen_rows_threshold to 0, so that those tests still run
both with and without codegen.
Added an e2e test to make sure that the optimisation is applied in
the backend.
Added planner tests for various cases where codegen should and shouldn't
be disabled.
Perf:
Added a targeted perf test for a join+agg over a small input, which
benefits from this change.
Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e
Reviewed-on: http://gerrit.cloudera.org:8080/7153
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This change introduces a new rule to merge disjunctive equality
predicates into an IN predicate. Because rules are applied bottom-up,
the rule first merges the leaf OR predicates into an IN predicate and
subsequently merges enclosing OR predicates into the existing IN
predicate. It will also merge two compatible IN predicates into a
single IN predicate. For example, c = 1 OR c = 2 OR c IN (3, 4) is
rewritten to c IN (1, 2, 3, 4).
The patch also addresses review comments by normalizing binary
predicates, with test cases for the same: binary predicates of the
form <constant> <op> <non-constant> are normalized to
<non-constant> <op> <constant>.
Change-Id: If02396b752c5497de9a92828c24c8062027dc2e2
Reviewed-on: http://gerrit.cloudera.org:8080/7110
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes the frontend tests to make them run on Java 8, by
replacing HashMap with LinkedHashMap and HashSet with LinkedHashSet
where needed.
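The motivation, spelled out with a minimal sketch (illustrative, not
from the patch): HashMap/HashSet iteration order is unspecified and
changed between Java 7 and Java 8, so tests that compare ordered
output break; the Linked variants iterate in insertion order on any
JVM.
  import java.util.*;
  class OrderDemo {
    public static void main(String[] args) {
      Map<String, Integer> m = new LinkedHashMap<>();
      m.put("scan", 1);
      m.put("join", 2);
      System.out.println(m.keySet());  // always [scan, join]
    }
  }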
To test this I ran the frontend tests using both Oracle Java 7 and
Oracle Java 8 and made sure they passed. I also verified that the tests
pass with OpenJDK 7.
Change-Id: Iad8e1dccec3a51293a109c420bd2b88b9d1e0625
Reviewed-on: http://gerrit.cloudera.org:8080/7073
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Scale down the buffer size in hash joins and hash aggregations if
estimates indicate that the build side of the join is small.
This greatly reduces minimum memory requirements for joins in some
common cases, e.g. small dimension tables.
Currently this is not plumbed through to the backend and only takes
effect in planner tests.
Testing:
Added targeted planner tests for small/mid/large/unknown memory
requirements for aggregations and joins.
Change-Id: I57b5b4c528325d478c8a9b834a6bc5dedab54b5b
Reviewed-on: http://gerrit.cloudera.org:8080/6963
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
For queries where plan generation is terminated early due to LIMIT 0
or similar, some tuples may not have a mem layout because no PlanNode
has been generated to materialize them. The fix is to make
recomputeMemLayout() a no-op if the tuple does not have an existing
mem layout.
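A minimal sketch of the shape of the fix (names are illustrative, not
the exact TupleDescriptor API):
  class TupleSketch {
    private boolean hasMemLayout = false;
    void computeMemLayout() { hasMemLayout = true; /* assign offsets */ }
    // The fix: recomputing is a no-op if no layout was ever computed,
    // e.g. when plan generation stopped early due to LIMIT 0.
    void recomputeMemLayout() {
      if (!hasMemLayout) return;
      hasMemLayout = false;
      computeMemLayout();
    }
  }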
Testing:
- added regression test
Change-Id: I08548c6bfa7dbf4655e55636605bebb89d2a2239
Reviewed-on: http://gerrit.cloudera.org:8080/7264
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
This change modifies the logic of NOT IN predicate so that
the planner can calculate the correct node cardinality. Prior
to this change, both IN and NOT IN predicates shared the
same selectivity, which resulted in the same cardinality
during planning.
The selectivity is calculated by the following heuristic:
selectivity = 1 - (number of predicate children / number of distinct values)
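A minimal sketch of the heuristic (illustrative, not the actual Expr
code):
  class SelectivitySketch {
    // E.g. NOT IN (1, 2, 3) on a column with NDV 10 gives 0.7.
    static double notInSelectivity(int numInListValues, long ndv) {
      if (ndv <= 0) return -1;  // unknown; caller falls back to a default
      return 1.0 - (double) numInListValues / ndv;
    }
  }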
Change-Id: I69e6217257b5618cb63e13b32ba3347fa0483b63
Reviewed-on: http://gerrit.cloudera.org:8080/7168
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Adds a new query option DEFAULT_JOIN_DISTRIBUTION_MODE to
control which join distribution mode is chosen when the join
inputs have an unknown cardinality (e.g., missing stats) or when
the expected costs of the different strategies are equal.
Values for DEFAULT_JOIN_DISTRIBUTION_MODE: [BROADCAST, SHUFFLE]
Default: BROADCAST
Note that this change effectively undoes IMPALA-5120.
Testing:
- Added new planner tests
- Core/hdfs run passed
Change-Id: Ibd34442f422129d53bef5493fc9cbe7375a0765c
Reviewed-on: http://gerrit.cloudera.org:8080/7059
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change loads the missing tables in TPC-DS. In addition,
it also fixes up the loading of the partitioned table store_sales
so all partitions will be loaded. The existing TPC-DS queries are
also updated to use the parameters for qualification runs as noted
in the TPC-DS specification. Some hard-coded partition filters were
also removed. They were there due to the lack of dynamic partitioning
in the past. Some missing TPC-DS queries are also added to this change,
including query28 which discovered the infamous IMPALA-5251.
Having all tables in TPC-DS available paves the way for us to include
all supported TPCDS queries in our functional testing. Due to the change
in the data, planner tests and the E2E tests have different results than
before. The results of the E2E tests were compared against runs done
with Netezza and Vertica. The divergences were all due to the
truncation behavior of decimal types in DECIMAL_V1.
Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Reviewed-on: http://gerrit.cloudera.org:8080/6877
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
The main idea of this patch is to use table stats to
extrapolate the row counts for new/modified partitions.
Existing behavior:
- Partitions that lack the row count stat are ignored
when estimating the cardinality of HDFS scans. Such
partitions effectively have an estimated row count
of zero.
- We always use the row count stats for partitions that
have one. The row count may be inaccurate if data in
such partitions has changed significantly.
Summary of changes:
- Enhance COMPUTE STATS to also store the total number
of file bytes in the table.
- Use the table-level row count and file bytes stats
to estimate the number of rows in a scan (see the sketch below).
- A new impalad startup flag is added to enable/disable
the extrapolation behavior. The feature is disabled by
default. Note that even with the feature disabled,
COMPUTE STATS stores the file bytes so you can enable
the feature without having to run COMPUTE STATS again.
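A minimal sketch of the extrapolation (names are illustrative, not
the actual cardinality-estimation code):
  class ExtrapolationSketch {
    // Scale the row count recorded by COMPUTE STATS by the growth in
    // file bytes since the stats were computed.
    static long extrapolatedNumRows(long statsNumRows,
        long statsNumFileBytes, long currentFileBytes) {
      if (statsNumRows < 0 || statsNumFileBytes <= 0) return -1;  // no stats
      double rowsPerByte = (double) statsNumRows / statsNumFileBytes;
      return Math.round(rowsPerByte * currentFileBytes);
    }
  }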
Testing:
- Added new FE unit test
- Added new EE test
Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/6840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
A previous change, IMPALA-3742, added an exchange node and
sort node to plans for inserts into Kudu tables to partition
and sort the input to match the target table.
This patch enables INSERT hints for Kudu tables - 'noshuffle'
which removes the exchange node from the plan and
'noclustered' which removes the sort node.
Insert hints have no effect for inserts that are small enough
to result in single-node execution.
Testing:
- Updated FE planner and analysis tests.
- Ran Kudu EE tests.
Change-Id: Idbd1ef977446ffee157ce3ce0b476e1f08a75d05
Reviewed-on: http://gerrit.cloudera.org:8080/6980
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Syntax:
<tableref> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]
The first number specifies the percent of table bytes to sample.
The second number specifies the random seed to use.
The sampling is coarse-grained. Impala keeps randomly adding
files to the sample until at least the desired percentage of
file bytes have been reached.
Examples:
SELECT * FROM t TABLESAMPLE SYSTEM(10)
SELECT * FROM t TABLESAMPLE SYSTEM(50) REPEATABLE(1234)
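A minimal sketch of the file-level sampling described above (an
assumed helper shape, not the actual scan-node code):
  import java.util.*;
  class SampleSketch {
    // Shuffle the files with the REPEATABLE seed, then keep taking
    // files until the sampled bytes reach the requested percentage.
    static List<String> sampleFiles(LinkedHashMap<String, Long> fileSizes,
        double percent, long seed) {
      List<String> files = new ArrayList<>(fileSizes.keySet());
      Collections.shuffle(files, new Random(seed));
      long totalBytes = 0;
      for (long size : fileSizes.values()) totalBytes += size;
      long targetBytes = (long) Math.ceil(totalBytes * percent / 100.0);
      List<String> sample = new ArrayList<>();
      long sampledBytes = 0;
      for (String file : files) {
        if (sampledBytes >= targetBytes) break;
        sample.add(file);
        sampledBytes += fileSizes.get(file);
      }
      return sample;
    }
  }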
Testing:
- Added parser, analyser, planner, and end-to-end tests
- Private core/hdfs run passed
Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Reviewed-on: http://gerrit.cloudera.org:8080/6868
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The sortby() hint is superseded by the SORT BY SQL clause, which was
introduced in IMPALA-4166. This change removes the hint.
Change-Id: I83e1cd6fa7039035973676322deefbce00d3f594
Reviewed-on: http://gerrit.cloudera.org:8080/6885
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
IMPALA-4166 introduced a bug by duplicating code that adds sort
expressions. Upon re-analysis, this code would hit an
IndexOutOfBoundsException.
Change-Id: Ibebba29509ae7eaa691fe305500cda6bd41a179a
Reviewed-on: http://gerrit.cloudera.org:8080/6921
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Non-deterministic exprs which evaluate as constant should not be
used during HDFS partition pruning. We consider Exprs which have no
SlotRefs as bound by default, and thus we end up trying to apply
them indiscriminately. Constant propagation makes this situation
easier to run into, and the behavior is rather unexpected.
The fix for now is to explicitly disallow non-deterministic Exprs
in partition pruning.
Change-Id: I91054c6bf017401242259a1eff5e859085285546
Reviewed-on: http://gerrit.cloudera.org:8080/6575
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change builds on the support for reading and writing
TIMESTAMP columns to Kudu tables (see [1]), adding support
for pushing TIMESTAMP predicates to Kudu for scans.
Binary predicates and IN list predicates are supported.
Testing: Added some planner and EE tests to validate the
behavior.
1: https://gerrit.cloudera.org/#/c/6526/
Change-Id: I08b6c8354a408e7beb94c1a135c23722977246ea
Reviewed-on: http://gerrit.cloudera.org:8080/6789
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds support for adding SORT BY (...) clauses to CREATE
TABLE and ALTER TABLE statements. Examples are:
CREATE TABLE t (i INT, j INT, k INT) PARTITIONED BY (l INT) SORT BY (i, j);
CREATE TABLE t SORT BY (int_col,id) LIKE u;
CREATE TABLE t LIKE PARQUET '/foo' SORT BY (id,zip);
ALTER TABLE t SORT BY (int_col,id);
ALTER TABLE t SORT BY ();
Sort columns can only be specified for Hdfs tables and effectiveness may
vary based on storage type; for example TEXT tables will not see
improved compression. The SORT BY clause must not contain clustering
columns. The columns in the SORT BY clause are stored in the
'sort.columns' table property and will result in an additional SORT node
being added to the plan before the final table sink. Specifying sort
columns also enables clustering during inserts, so the SORT node will
contain all partitioning columns first, followed by the sort columns. We
do this because sort columns add a SORT node to the plan and adding the
clustering columns to the SORT node is cheap.
Sort columns supersede the sortby() hint, which we will remove in a
subsequent change (IMPALA-5144). Until then, it is possible to specify
sort columns using both ways at the same time and the column lists
will be concatenated.
Change-Id: I08834f38a941786ab45a4381c2732d929a934f75
Reviewed-on: http://gerrit.cloudera.org:8080/6495
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, we defaulted to broadcast join when stats were
missing, but this can lead to disastrous plans when the
right hand side is actually large.
It's always difficult to make good plans when stats are missing,
but defaulting to partitioned joins should reduce the risk of
disastrous plans.
Testing:
- Added a planner test that joins a table with no stats.
Change-Id: Ie168ecfcd5e7c5d3c60d16926c151f8f134c81e0
Reviewed-on: http://gerrit.cloudera.org:8080/6803
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
This generates min/max predicates for InPredicates that
have only constant values in the IN list. It is only
used for statistics filtering on Parquet files.
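The underlying observation, as a minimal sketch (illustrative; the
planner actually builds min/max predicate exprs for the scanner):
  import java.util.*;
  class InListBounds {
    // slot IN (5, 1, 9) implies slot >= 1 AND slot <= 9, so a row
    // group whose stats range lies entirely outside [1, 9] can be
    // skipped.
    static int[] bounds(List<Integer> inListConstants) {
      return new int[] { Collections.min(inListConstants),
                         Collections.max(inListConstants) };
    }
  }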
Change-Id: I4a88963a7206f40a867e49eceeaf03fdd4f71997
Reviewed-on: http://gerrit.cloudera.org:8080/6810
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Bulk DMLs (INSERT, UPSERT, UPDATE, and DELETE) for Kudu
are currently painful because we just send rows randomly,
which creates a lot of work for Kudu since it partitions
and sorts data before writing, causing writes to be slow
and leading to timeouts.
We can alleviate this by sending the rows to Kudu already
partitioned and sorted. This patch partitions and sorts
rows according to Kudu's partitioning scheme for INSERTs
and UPSERTs. A followup patch will handle UPDATE and DELETE.
It accomplishes this by inserting an exchange node and a sort
node into the plan before the operation. Both the exchange and
the sort are given a KuduPartitionExpr which takes a row and
calls into the Kudu client to return its partition number.
It also disallows INSERT hints for Kudu tables, since the
hints that we support (SHUFFLE, CLUSTER, SORTBY) no longer
make sense.
Testing:
- Updated planner tests.
- Ran the Kudu functional tests.
- Ran performance tests demonstrating that we can now handle much
larger inserts without having timeouts.
Change-Id: I84ce0032a1b10958fdf31faef225372c5c38fdc4
Reviewed-on: http://gerrit.cloudera.org:8080/6559
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Implements constant propagation within conjuncts and applies the
optimization to scan conjuncts and collection conjuncts within Hdfs
scan nodes. The optimization is applied during planning. At scan
nodes in particular, we want to optimize to enable partition pruning;
for example, given a = 10 AND b = a, propagation yields b = 10, which
can prune partitions on b. In certain cases, we might end up with a
FALSE conditional, which now will convert to an EmptySet node.
Testing: Expanded the test cases for the planner to achieve constant
propagation. Added Kudu, datasource, Hdfs and HBase tests to validate
we can create EmptySetNodes.
Change-Id: I79750a8edb945effee2a519fa3b8192b77042cb4
Reviewed-on: http://gerrit.cloudera.org:8080/6389
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Previously, exprs used in sorts were evaluated lazily. This can
potentially be bad for performance if the exprs are expensive to
evaluate, and it can lead to crashes if the exprs are
non-deterministic, as this violates assumptions of our sorting
algorithm.
This patch addresses these issues by materializing ordering exprs.
It does so when the expr is non-deterministic (including when it
contains a UDF, since we currently cannot know whether a UDF is
deterministic), or when its cost exceeds a threshold (or the
cost is unknown).
Testing:
- Added e2e tests in test_sort.py.
- Updated planner tests.
Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Reviewed-on: http://gerrit.cloudera.org:8080/6322
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Since commit d2d3f4c (on asf-master), TAggregateExpr contains
the logical input types of the Aggregate Expr. The reason they
are included is that merging aggregate expressions will have
input types of the intermediate values, which aren't necessarily
the same as the input types. For instance, NDV() uses a binary
blob as its intermediate value and it's passed to its merge
aggregate expressions as a StringVal but the input type of NDV()
in the query could be DecimalVal. In this case, we consider
DecimalVal as the logical input type while StringVal is the
intermediate type. The logical input types are accessed by the
BE via GetConstFnAttr() during interpretation and constant
propagation during codegen.
To handle distinct aggregate expressions (e.g. select count(distinct)),
the FE uses 2-phase aggregation by introducing an extra phase of
split/merge aggregation in which the distinct aggregate expressions'
inputs are converted and added to the group-by expressions in the first
phase, while the non-distinct aggregate expressions go through the
normal split/merge treatment.
The bug is that the existing code incorrectly propagates the intermediate
types of the non-grouping aggregate expressions as the logical input types
to the merging aggregate expressions in the second phase of aggregation.
The input aggregate expressions for the non-distinct aggregate expressions
in the second phase aggregation are already merging aggregate expressions
(from phase one) in which case we should not treat its input types as
logical input types.
This change fixes the problem above by checking if the input aggregate
expression passed to FunctionCallExpr.createMergeAggCall() is already
a merging aggregate expression. If so, it will use the logical input
types recorded in its 'mergeAggInputFn_' as references for its logical
input types instead of the aggregate expression input types themselves.
Change-Id: I158303b20d1afdff23c67f3338b9c4af2ad80691
Reviewed-on: http://gerrit.cloudera.org:8080/6724
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Constant folding null values in CastExprs causes CTAS statements
to fail. This is a regression from the behavior observed before
constant folding was introduced. With this change, null is no longer
constant-folded in CastExprs.
Change-Id: Ia7aa1ab7f53a9dcc7560ded321a9d1e1ee2d18e3
Reviewed-on: http://gerrit.cloudera.org:8080/6663
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Compute the minimum buffer requirement for spilling nodes and
per-host estimates for the entire plan tree.
This builds on top of the existing resource estimation code, which
computes the sets of plan nodes that can execute concurrently. This is
cleaned up so that the process of producing resource requirements is
clearer. It also removes the unused VCore estimates.
Fixes various bugs and other issues:
* computeCosts() was not called for unpartitioned fragments, so
the per-operator memory estimate was not visible.
* Nested loop join was not treated as a blocking join.
* The TODO comment about union was misleading.
* Fix the computation for mt_dop > 1 by distinguishing per-instance and
per-host estimates.
* Always generate an estimate instead of unpredictably returning
-1/"unavailable" in many circumstances - there was little rhyme or
reason to when this happened.
* Remove the special "trivial plan" estimates. With the rest of the
cleanup we generate estimates <= 10MB for those trivial plans through
the normal code path.
I left one bug (IMPALA-4862) unfixed because it is subtle, will affect
estimates for many plans and will be easier to review once we have the
test infra in place.
Testing:
Added basic planner tests for resource requirements in both the MT and
non-MT cases.
Re-enabled the explain_level tests, which appear to be the only
coverage for many of these estimates. Removed the complex and
brittle test cases and replaced with a couple of much simpler
end-to-end tests.
Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
Reviewed-on: http://gerrit.cloudera.org:8080/5847
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This detects IS NULL / IS NOT NULL and creates a Kudu
predicate to push this to Kudu.
For testing, there are planner tests to verify that the
predicate is pushed to Kudu. There are also end-to-end
tests for correctness.
Change-Id: I9c96fec8d41f77222879c0ffdd6940b168e47e65
Reviewed-on: http://gerrit.cloudera.org:8080/5958
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
The union node acts as a pass-through operator and forwards row
batches from its children without materializing them. This is done
when the child's tuple layout is identical to the union node's tuple
layout and no functions need to be applied to the child row batches.
Removed operand reordering in the FE because it's simpler and safer to
handle all passthrough children before non-passthrough children in the
BE. The recent improvements to memory management allowed us to remove
this requirement.
Testing:
- Added new planner and end to end tests that cover the new
functionality.
- Updated existing tests to reflect the new behavior.
Perf:
Ran a benchmark on a local 10 GB tpcds dataset. I used an unpartitioned
version of the store_sales table. There was over a 2x performance
improvement for the following query:
SELECT
COUNT(ss_sold_time_sk),
COUNT(ss_item_sk),
COUNT(ss_customer_sk),
COUNT(ss_cdemo_sk),
COUNT(ss_hdemo_sk),
COUNT(ss_addr_sk),
COUNT(ss_store_sk),
COUNT(ss_promo_sk),
COUNT(ss_ticket_number),
COUNT(ss_quantity),
COUNT(ss_wholesale_cost),
COUNT(ss_list_price),
COUNT(ss_sales_price),
COUNT(ss_ext_discount_amt),
COUNT(ss_ext_sales_price),
COUNT(ss_ext_wholesale_cost),
COUNT(ss_ext_list_price),
COUNT(ss_ext_tax),
COUNT(ss_coupon_amt),
COUNT(ss_net_paid),
COUNT(ss_net_paid_inc_tax),
COUNT(ss_net_profit),
COUNT(ss_sold_date_sk)
FROM (
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
union all
select * from tpcds_10_parquet.store_sales_unpartitioned
) t
Before:
Total Time: 43s164ms
Summary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
------------------------------------------------------------------------------------------------------------------------------
13:AGGREGATE 1 224.721us 224.721us 1 1 28.00 KB -1.00 B FINALIZE
12:EXCHANGE 1 24.578us 24.578us 3 1 0 -1.00 B UNPARTITIONED
11:AGGREGATE 3 2s402ms 3s060ms 3 1 119.00 KB 10.00 MB
00:UNION 3 35s380ms 37s846ms 288.01M 288.01M 3.08 MB 0
|--02:SCAN HDFS 3 184.197ms 219.931ms 28.80M 28.80M 535.03 MB 1.88 GB store_sales_unpartitioned
|--03:SCAN HDFS 3 131.956ms 153.401ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
|--04:SCAN HDFS 3 178.456ms 247.721ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
|--05:SCAN HDFS 3 189.398ms 242.251ms 28.80M 28.80M 535.01 MB 1.88 GB store_sales_unpartitioned
|--06:SCAN HDFS 3 122.786ms 156.528ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
|--07:SCAN HDFS 3 147.467ms 183.391ms 28.80M 28.80M 535.13 MB 1.88 GB store_sales_unpartitioned
|--08:SCAN HDFS 3 147.502ms 186.273ms 28.80M 28.80M 535.01 MB 1.88 GB store_sales_unpartitioned
|--09:SCAN HDFS 3 130.086ms 154.682ms 28.80M 28.80M 535.04 MB 1.88 GB store_sales_unpartitioned
|--10:SCAN HDFS 3 122.701ms 161.056ms 28.80M 28.80M 534.89 MB 1.88 GB store_sales_unpartitioned
01:SCAN HDFS 3 287.863ms 330.436ms 28.80M 28.80M 534.98 MB 1.88 GB store_sales_unpartitioned
After:
Total Time: 19s139ms
Summary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
------------------------------------------------------------------------------------------------------------------------------
13:AGGREGATE 1 166.241us 166.241us 1 1 28.00 KB -1.00 B FINALIZE
12:EXCHANGE 1 71.695us 71.695us 3 1 0 -1.00 B UNPARTITIONED
11:AGGREGATE 3 2s971ms 3s809ms 3 1 3.08 MB 10.00 MB
00:UNION 3 207.956ms 222.846ms 288.01M 288.01M 0 0
|--02:SCAN HDFS 3 1s533ms 1s535ms 28.80M 28.80M 532.28 MB 1.88 GB store_sales_unpartitioned
|--03:SCAN HDFS 3 1s554ms 1s669ms 28.80M 28.80M 525.73 MB 1.88 GB store_sales_unpartitioned
|--04:SCAN HDFS 3 1s568ms 1s716ms 28.80M 28.80M 525.03 MB 1.88 GB store_sales_unpartitioned
|--05:SCAN HDFS 3 1s503ms 1s617ms 28.80M 28.80M 527.43 MB 1.88 GB store_sales_unpartitioned
|--06:SCAN HDFS 3 1s560ms 1s634ms 28.80M 28.80M 528.52 MB 1.88 GB store_sales_unpartitioned
|--07:SCAN HDFS 3 1s489ms 1s643ms 28.80M 28.80M 534.81 MB 1.88 GB store_sales_unpartitioned
|--08:SCAN HDFS 3 1s534ms 1s581ms 28.80M 28.80M 528.10 MB 1.88 GB store_sales_unpartitioned
|--09:SCAN HDFS 3 1s558ms 1s674ms 28.80M 28.80M 526.77 MB 1.88 GB store_sales_unpartitioned
|--10:SCAN HDFS 3 1s504ms 1s692ms 28.80M 28.80M 527.83 MB 1.88 GB store_sales_unpartitioned
01:SCAN HDFS 3 1s682ms 1s911ms 28.80M 28.80M 526.14 MB 1.88 GB store_sales_unpartitioned
Change-Id: Ia8f6d5062724ba5b78174c3227a7a796d10d8416
Reviewed-on: http://gerrit.cloudera.org:8080/5816
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This changes the HdfsParquetTableWriter to populate the
parquet::RowGroup::sorting_columns list with all columns mentioned in a
'sortby()' hint within INSERT statements. The columns are added to the
list in the order in which they appear inside the hint.
The change also adds backports.tempfile to the python requirements to
provide 'tempfile.TemporaryDirectory' on python 2.7.
The change also changes the default ordering for columns mentioned in
'sortby()' hints from descending to ascending.
To test this change, we write a table with a 'sortby()' hint and
verify that the sorting_columns get populated correctly.
Change-Id: Ib42aab585e9e627796e9510e783652d49d74b56c
Reviewed-on: http://gerrit.cloudera.org:8080/6219
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Here is a basic summary of the changes:
Frontend looks for conjuncts that operate on a single slot and passes
a map from slot id to the conjunct index through thrift to the backend.
The conjunct indices are the indices into the normal PlanNode conjuncts list.
The conjuncts need to satisfy certain conditions:
1. They are bound on a single slot
2. They are deterministic (no random functions)
3. They evaluate to FALSE on a NULL input. This is because the dictionary
does not include NULLs, so any condition that evaluates to TRUE on NULL
cannot be evaluated by looking only at the dictionary.
The backend converts the indices into ExprContexts. These are cloned in
the scanner threads.
The dictionary read codepath has been moved out of ReadDataPage into
its own function, InitDictionary, which has also been turned into its
own step in row group initialization. ReadDataPage will not see any
dictionary pages unless the parquet file is invalid.
For dictionary filtering, we initialize dictionaries only as needed to evaluate
the conjuncts. The Parquet scanner evaluates the dictionary filter conjuncts on the
dictionary to see if any dictionary entry passes. If no entry passes, the row
group is eliminated. If the row group passes the dictionary filtering, then we
initialize all remaining dictionaries.
Dictionary filtering is controlled by a new query option,
parquet_dictionary_filtering, which is on by default.
Since column chunks can have a mixture of encodings, dictionary
filtering uses three tests to determine whether a column chunk is
purely dictionary encoded (see the sketch after the list):
1. If the encoding_stats is in the parquet file, then use it to determine if
there are only dictionary encoded pages (i.e. there are no data pages with
an encoding other than PLAIN_DICTIONARY).
-OR-
2. If the encoding stats are not present, then look at the encodings. The column
is purely dictionary encoded if:
a) PLAIN_DICTIONARY is present
AND
b) Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed
-OR-
3. If this file was written by an older version of Impala, then we know that
dictionary failover happens when the dictionary reaches 40,000 values.
Dictionary filtering can proceed as long as the dictionary is smaller than
that.
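A minimal sketch of check #2 (enum and helper names are illustrative,
following parquet-format rather than the actual scanner code):
  import java.util.*;
  class DictCheck {
    enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE, BIT_PACKED }
    // Passes if PLAIN_DICTIONARY is present and nothing outside
    // {PLAIN_DICTIONARY, RLE, BIT_PACKED} is listed.
    static boolean isPurelyDictEncoded(Set<Encoding> encodings) {
      return encodings.contains(Encoding.PLAIN_DICTIONARY)
          && EnumSet.of(Encoding.PLAIN_DICTIONARY, Encoding.RLE,
                        Encoding.BIT_PACKED).containsAll(encodings);
    }
  }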
parquet-mr writes the encoding list correctly in the current version in our
environment (1.5.0). This means that check #2 works on some existing files
(potentially most existing parquet-mr files).
parquet-mr writes the encoding stats starting in 1.9.0. This is the version
where check #1 will start working.
Impala's parquet writer now implements both, so either check above will work.
Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Reviewed-on: http://gerrit.cloudera.org:8080/5904
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the forms <slot> <op> <constant> or
<constant> <op> <slot>, where <op> is LT, LE, GE, GT, and EQ.
Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Reviewed-on: http://gerrit.cloudera.org:8080/6032
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
The bug: The DisjointSet maintains a set of unique item sets
using a HashSet<Set<T>>. The problem is that we modified
the Set<T> elements after inserting them into the HashSet.
This caused the removal of elements from the HashSet to fail.
Removal is required for maintaining a consistent DisjointSet.
The removal could even fail for the same Set<T> instance because
the hashCode() changed from when the Set<T> was originally
inserted to when the removal was attempted due to mutation
of the Set<T>.
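The failure mode reproduces outside Impala; a minimal self-contained
demonstration:
  import java.util.*;
  class MutatedSetBug {
    public static void main(String[] args) {
      Set<Set<Integer>> sets = new HashSet<>();
      Set<Integer> item = new HashSet<>(Arrays.asList(1));
      sets.add(item);
      item.add(2);  // hashCode() changes after insertion
      System.out.println(sets.remove(item));  // false: lookup misses
      System.out.println(sets.size());        // 1: stale element remains
    }
  }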
An inconsistent DisjointSet can lead to incorrect equivalence
classes, which can lead to missing, redundant and even
non-executable predicates. Incorrect results and crashes are
possible.
For most queries, an inconsistent DisjointSet does not alter
the equivalence classes, and even fewer queries have incorrect
plans.
In fact, many of our existing planner tests trigger this bug,
but only 3 of them lead to an incorrect value transfer graph,
and all 3 had correct plans.
The fix: Use an IdentityHashMap to store the set of item sets.
It does not rely on the hashCode() and equals() of the stored
elements, so the same object can be added and later removed,
even when mutated in the meantime.
Testing:
- Added a Preconditions check in DisjointSet that asserts
correct removal of an item set. Many of our existing tests
hit the check before this fix.
- Added a new unit test for DisjointSet which triggers
the bug.
- Augmented DisjointSet.checkConsistency() to check for
inconsistency in the set of item sets.
- Added validation of the value-transfer graph in
single-node planner tests.
- A private core/hdfs run succeeded.
Change-Id: I609c8795c09becd78815605ea8e82e2f99e82212
Reviewed-on: http://gerrit.cloudera.org:8080/5980
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The bug: Our detection of partition compatibility for
grouping aggregations and analytic functions did not take into
account the effect of outer joins within the same fragment.
As a result, we used to incorrectly omit a required hash exchange.
For example, a hash exchange + merge phase is required if the
grouping expressions of an aggregation reference tuples
that are made nullable within the same fragment. The exchange is
needed to bring together NULLs produced by outer-join non-matches.
The fix: Check that the grouping/partition exprs do not reference
tuples that are made nullable within the same fragment.
Testing: Planner tests pass locally.
Change-Id: I121222179378e56836422a69451d840a012c9e54
Reviewed-on: http://gerrit.cloudera.org:8080/5774
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
The bug was that expr rewrite rules such as ExtractCommonConjunctRule
analyzed their own output, which doesn't work for syntactic elements
that allow column aliases, such as the HAVING clause.
The fix was to remove the analysis step (the re-analysis happens anyway
in AnalysisCtx).
Change-Id: Ife74c61f549f620c42f74928f6474e8a5a7b7f00
Reviewed-on: http://gerrit.cloudera.org:8080/5662
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
This change introduces the sortby() query plan hint for insert
statements. When specified, sortby(a, b) will add an additional sort
step to the plan to order data by columns a, b before inserting it into
the target table.
Change-Id: I37a3ffab99aaa5d5a4fd1ac674b3e8b394a3c4c0
Reviewed-on: http://gerrit.cloudera.org:8080/5051
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
The KuduScanNode attempts to push IN list predicates to the
Kudu scan, but NULL literals cannot be pushed. The code in
KuduScanNode needed to check whether any Literal in the
InPredicate is a NullLiteral, in which case the entire IN
list should not be pushed to Kudu.
The same handling is already in place for binary predicate
pushdown.
Change-Id: Iaf2c10a326373ad80aef51a85cec64071daefa7b
Reviewed-on: http://gerrit.cloudera.org:8080/5505
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Implements the following conservative but correct policy for assigning
predicates from the On-clause of an inner join:
If the predicate references an outer-joined tuple, then evaluate it at
the inner join that the On-clause belongs to.
Cleans up Analyzer.canEvalPredicate().
Change-Id: Idf45323ed9102ffb45c9d94a130ea3692286f215
Reviewed-on: http://gerrit.cloudera.org:8080/4982
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The main issue was that the eval cost was not set for
timestamp literals, so a preconditions check was hit
when trying to order a list of conjuncts by cost.
Another subtle issue made the bug only reproducible by
a specific query against a Kudu table in our tests,
although the bug is not Kudu specific: The eval cost
of Exprs was not recomputed in analyze(), even after
resetting an Expr, e.g., during a substitution. As a
result, the bug was only reproducible for a list
of conjuncts that contained an inferred predicate
with a timestamp literal.
This patch does not contain a fix for that issue due
to its complexity/risk. It is tracked in IMPALA-4620.
Testing: Ran planner tests locally. Ran query_test.py
locally. A private core/hdfs run passed.
Change-Id: Ife30420bafbd1c64a5e3385e5755909110b4b354
Reviewed-on: http://gerrit.cloudera.org:8080/5404
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This commit also removes the now unused `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords that were going to be newly released in Impala 2.6.
Additionally, a few remaining uses of the `DISTRIBUTE BY` syntax have
been switched to `PARTITION BY`.
Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Impala cannot correctly evaluate or assign some non-deterministic
predicates. This patch improves the error message shown when
trying to evaluate such unsupported predicates for the purpose
of partition pruning.
Change-Id: I94765f62bde94f4faa7fc5c26d928099ca1496d1
Reviewed-on: http://gerrit.cloudera.org:8080/5386
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Background: We generally allow the assignment of predicates below the
nullable side of a left/right outer join, explained as follows using an
example:
SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id
WHERE t2.int_col < 10
The scan of 't2' picks up 't2.int_col < 10' via
Analyzer.getBoundPredicates() and recognizes that the predicate must
also be evaluated by a join later, so the predicate is not marked as
assigned. The join then picks up the unassigned predicate via
Analyzer.getUnassignedConjuncts().
The bug was that our logic for detecting whether a bound predicate must
also be evaluated at a join node was flawed because it only considered
whether the tuples of the source or destination predicate were outer
joined (plus other conditions).
The underlying assumption is that either the source or the destination
predicate is bound by a tuple produced by a TableRef, but in the buggy
query the source predicate is bound by an aggregation tuple, so we
incorrectly marked the bound predicate as assigned in
Analyzer.getBoundPredicates().
The fix is to conservatively not mark bound predicates as assigned if
the slots referenced by the predicate have equivalent slots that
belong to an outer-joined tuple. As a result, a plan node may pick up
the same predicate multiple times, once via
Analyzer.getBoundPredicates() and another time via
Analyzer.getUnassignedConjuncts(). Those are deduped now.
The following example explains the duplicate predicate assignment:
SELECT * FROM (SELECT * FROM t t1) a LEFT OUTER JOIN t b ON a.id = b.id
WHERE a.id < 10
1. The predicate 'a.id < 10' gets migrated into the inline view.
'a.id < 10' is marked as assigned but is still registered as
a single-tid conjunct in the Analyzer for potential propagation.
2. The scan node of 't1' calls Analyzer.getBoundPredicates() and
generates 't1.id < 10' based on the source predicate 'a.id < 10'.
3. The scan node of 't1' picks up the migrated conjunct 't1.id < 10'
via Analyzer.getUnassignedConjuncts().
Change-Id: I774d13a13ad1e8fe82512df98dc29983bdd232eb
Reviewed-on: http://gerrit.cloudera.org:8080/4960
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The bug was a simple oversight: in KuduScanNode.init()
we forgot to call Analyzer.getBoundPredicates().
Change-Id: I19a38d6ea8cc0d2b0ddc3808d1f9ffef5ce306a8
Reviewed-on: http://gerrit.cloudera.org:8080/5365
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Fixes the KuduScanNode to convert InPredicates to
KuduPredicates and push them to the Kudu scan if possible.
An InPredicate can be pushed to the scan if the expression is of
the exact form:
<SlotRef> IN (<LiteralExpr>, <LiteralExpr>, ...)
That means the InPredicate has the following properties:
1) It has a list of literal values (i.e. not a subquery);
All values are LiteralExprs (not SlotRefs).
2) Not negative, i.e. only 'IN' supported, not 'NOT IN'
3) The SlotRef is not wrapped in any casts
4) The types of all values match the type of the SlotRef
exactly.
A planner test was added exercising all supported types as
well as exprs where the values would not be supported.
TODO: perf testing
TODO: consider a limit on the number of list values before
keeping the predicate on the Impala scan node
(determine from testing)
Change-Id: I8988d4819d20d467b48e286917e347ca00f60cf0
Reviewed-on: http://gerrit.cloudera.org:8080/5316
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Impala used to incorrectly assign On-clause equality predicates from an
outer join if those predicates referenced multiple tables, but only one
side of the outer join.
The fix is to add an additional check in Analyzer.getEqJoinConjuncts()
to prevent that incorrect assignment.
Change-Id: I719e0eeacccad070b1f9509d80aaf761b572add0
Reviewed-on: http://gerrit.cloudera.org:8080/4986
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The bug: We used to reset() the qualifier of union operands
to their original value obtained during parsing. This leads to
problems when union operands are unnested and we need to rewrite
Subqueries. In particular, the first union operand of a nested union
was reset() to a null qualifier, but that operand could be somewhere
in the middle of the list of unnested operands in the parent. At that
point, we've lost information about the qualifier of the unnested
operand.
The fix: The simplest solution is to not reset() the qualifier.
The other alternative would be to reset() the qualifier, but also
undo any unnesting. That seems unnecessary and wasteful.
Change-Id: I157bb0f08c4a94fd779487d7c23edd64a537a1f6
Reviewed-on: http://gerrit.cloudera.org:8080/4963
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
With this commit we add support for auditing all Kudu-specific
operations and we enable column lineage for INSERT and UPSERT
statements on Kudu tables. No lineage output is generated for DELETE and
UPDATE statements.
Change-Id: Idc4ca1cd63bcfa4370c240a5c4a4126ed6704f4d
Reviewed-on: http://gerrit.cloudera.org:8080/5151
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Adds a new ExprRewriteRule for replacing constant expressions
with their literal equivalent via BE evaluation. Applies the
new rule together with the existing ones on the parse tree,
after analysis.
Limitations
- Constant folding is applied on the unresolved expressions.
As a result, it only works for expressions that are constant
within a single query block, as opposed to expressions that
may become constant after fully substituting inline-view exprs.
- Exprs are not normalized, so some opportunities for constant
folding are missed for certain expr-tree shapes.
This patch includes the following interesting changes:
- Introduces a timestamp literal that can only be produced
by constant folding (not expressible directly via SQL).
- To make sure that rewrites have no user-visible effect,
the original result types and column labels of the top-level
statement are restored after the rewrites are performed.
- Does not fold exprs if their evaluation resulted in a
warning or error, or if the resulting value is not
representable by corresponding FE LiteralExpr.
- Fixes an existing issue with converting strings between
the FE/BE. Strings produced in the BE that have characters
with a value > 127 are not correctly deserialized into a
Java String via thrift. We detect this case during constant
folding and abandon folding of such exprs.
- Fixes several issues with detecting/reporting errors in
NativeEvalConstExprs().
- Cleans up ExprContext::GetValue() into
ExprContext::GetConstantValue() which clarifies its only use
of evaluating exprs from the FE.
Testing:
- Modifies expr-test.cc to run all tests through the constant
folding path.
- Adds basic planner and rewrite rule tests.
- Exhaustive test run passed
Change-Id: If672b703db1ba0bfc26e5b9130161798b40a69e9
Reviewed-on: http://gerrit.cloudera.org:8080/5109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Our NumericLiteral is backed by a BigDecimal which cannot
represent the special float values NaN, infinity or negative zero.
As a result, when evaluating constant expressions from the FE we
hit an exception when trying to create a NumericLiteral from
a NaN or infinity value. Before, negative zero would silently
get converted to zero which is dangerous.
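The limitation is in java.math.BigDecimal itself; a minimal
demonstration:
  import java.math.BigDecimal;
  class NanFolding {
    public static void main(String[] args) {
      // Both constructors throw NumberFormatException: BigDecimal
      // cannot represent NaN or infinities, so folding must bail out.
      try { new BigDecimal(Double.NaN); }
      catch (NumberFormatException e) { System.out.println("NaN rejected"); }
      try { new BigDecimal(Double.POSITIVE_INFINITY); }
      catch (NumberFormatException e) { System.out.println("inf rejected"); }
      // Negative zero is silently normalized to zero.
      System.out.println(new BigDecimal(-0.0));  // prints 0
    }
  }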
The fix is to treat the expr evaluation as a failure and not
replace the constant Expr with a LiteralExpr.
Change-Id: I8243b2ee9fa9c470d078b385583f2f48b606a230
Reviewed-on: http://gerrit.cloudera.org:8080/5050
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins