Commit Graph

926 Commits

Author SHA1 Message Date
Alex Behm
54a46e9459 IMPALA-3065/IMPALA-3062: Restrict !empty() predicates to scan nodes.
The bug:
Evaluating !empty() predicates at non-scan nodes interacts
poorly with our BE projection of collection slots. For example,
rows could incorrectly be filtered if a !empty() predicate is
assigned to a plan node that comes after the unnest of the
collection that also performs the projection.

The fix:
This patch reworks the generation of !empty() predicates
introduced in IMPALA-2663 for correctness purposes.
The predicates are generated in cases where we can ensure that
they will be assigned only by the parent scan, and no other
plan node.

The conditions are as follows:
- collection table ref is relative and non-correlated
- collection table ref represents the rhs of an inner/cross/semi join
- collection table ref's parent tuple is not outer joined

Change-Id: Ie975ce139a103285c4e9f93c59ce1f1d2aa71767
Reviewed-on: http://gerrit.cloudera.org:8080/2399
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Silvius Rus <srus@cloudera.com>
Tested-by: Internal Jenkins
2016-03-02 23:23:05 -08:00
Tim Armstrong
6cdcdb12ff Test for IMPALA-2987
Add a custom cluster test that tests for delays in registering data
stream receivers. We add a stress option to artificially delay this
registration to ensure that it can be handled correctly.

Change-Id: Id5f5746b6023c301bacfa305c525846cdde822c9
Reviewed-on: http://gerrit.cloudera.org:8080/2306
Tested-by: Internal Jenkins
Reviewed-by: Silvius Rus <srus@cloudera.com>
2016-03-02 23:23:04 -08:00
Alex Behm
a303f25256 IMPALA-3071: Fix assignment of On-clause predicates belonging to an inner join.
The bug: On-clause predicates belonging to an inner join were not always assigned
correctly if they referenced an outer-joined tuple. Specifically, our logic
for detecting whether a predicate can be assigned below an outer join if also
left at the outer-join node was not correct, and so we assigned the predicate
below the join, but did not also leave it at the outer join.

The fix: Assign an inner join On-clause conjunct that references an outer-joined
tuple to the join that the On-clause belongs to.

Change-Id: Iffef7718679d48f866fa90fd3257f182cbb385ae
Reviewed-on: http://gerrit.cloudera.org:8080/2309
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-29 22:22:41 -08:00
Juan Yu
c9b33ddf63 IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.

Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2016-02-28 21:31:37 -08:00
Dimitris Tsirogiannis
2c37d99fed IMPALA-3089: Perform static partition pruning in the FE with disjunctive
BETWEEN predicates

This commit fixes an issue where the slow path is employed during static
partition pruning for disjunctive BETWEEN predicates, introducing
significant latency during planning, especially for tables with a large
number of partitions.

Change-Id: I66ef566fa176a859d126d49152921a176a491b0a
Reviewed-on: http://gerrit.cloudera.org:8080/2320
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-02-26 15:37:24 -08:00
Alex Behm
5c0e1fa1e8 IMPALA-2974: Use Type.toSql() instead of toString() in ALTER TABLE CHANGE COLUMN.
Change-Id: I140bdea755e44d3f2ceb4a8f5e288faaddaa963f
Reviewed-on: http://gerrit.cloudera.org:8080/2285
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-26 15:37:24 -08:00
Dimitris Tsirogiannis
197eb43477 IMPALA-3074: AnalysisError when runtime filter has incompatible source
and target exprs

This commit fixes an issue where an AnalysisError is thrown when a
runtime filter has incompatible source and target exprs. This is
triggered when a runtime filter has multiple candidate target scan nodes
not all of which produce a target expr which is cast-compatible with the
source expr.

Change-Id: I544c8fc66915f684ba24d20de525563638c4039d
Reviewed-on: http://gerrit.cloudera.org:8080/2307
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-02-24 19:54:40 -08:00
Dimitris Tsirogiannis
d3b92b0d9f IMPALA-3039: Restrict the number of runtime filters generated
This commit adds a query option, MAX_NUM_RUNTIME_FILTERS, to restrict
the number of runtime filters generated per query. If more than
MAX_NUM_RUNTIME_FILTERS are generated, the runtime filters are sorted by
the selectivity of the associated source join nodes and the
MAX_NUM_RUNTIME_FILTERS most selective filters are applied. Also with
this commit, non-selective filters are automatically discarded, irrespective
of the value of MAX_NUM_RUNTIME_FILTERS.
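
The selection rule described above can be sketched roughly as follows. This
is an illustration only, not the planner code: the RuntimeFilter class, the
select_filters() helper and the 0.5 cutoff for "non-selective" are all
hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RuntimeFilter:
    filter_id: int
    # Fraction of rows the source join is expected to keep (lower = more selective).
    join_selectivity: float


def select_filters(filters, max_num_filters, max_selectivity=0.5):
    """Keep at most max_num_filters of the most selective filters.

    max_selectivity is a hypothetical cutoff: filters above it are dropped
    regardless of the limit, mirroring the "non-selective filters are
    discarded" rule.
    """
    candidates = [f for f in filters if f.join_selectivity <= max_selectivity]
    candidates.sort(key=lambda f: f.join_selectivity)
    return candidates[:max_num_filters]
```

With four filters of selectivity 0.9, 0.01, 0.2 and 0.05 and a limit of 2,
this sketch keeps the 0.01 and 0.05 filters and drops the 0.9 filter even
when the limit is raised.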

Change-Id: Ifd41ef6919a6d2b283a8801861a7179c96ed87c6
Reviewed-on: http://gerrit.cloudera.org:8080/2262
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-02-24 19:54:40 -08:00
Alex Behm
2c8f41b7d4 IMPALA-2832: Fix cloning of FunctionCallExpr.
The bug was that we were not properly cloning the params
of a FunctionCallExpr. In a CTAS we analyze the underlying
query stmt twice, the first time on a clone of the original
stmt. The problem was that the first analysis affected the
second analysis due to an improper clone, leading to missing
slots in a scan because the corresponding SlotRefs were
already analyzed.

Change-Id: I0025c0ee54b2f2cb3ba470b26a9de5aa5a3a3ade
Reviewed-on: http://gerrit.cloudera.org:8080/2291
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-24 13:31:00 -08:00
Tim Armstrong
52362d4079 IMPALA-3047: separate create table test with nested types
We need to skip queries that select from tables with nested types if
running with the old aggs and joins. To achieve this, move the failing
test to a separate test and use the skip decorator.

Change-Id: Iaf1351c711b524be66a99084657926909425cbff
Reviewed-on: http://gerrit.cloudera.org:8080/2272
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-24 13:31:00 -08:00
Alex Behm
a99e17457b Fix a non-deterministic test in complex-types-file-formats.test.
Change-Id: I98cc3045a6a6131dba8b0a475d5d51de7bdba455
Reviewed-on: http://gerrit.cloudera.org:8080/2268
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-02-22 20:16:24 -08:00
Alex Behm
c6fd5a0fe4 IMPALA-2844: Allow count(*) on RC files with complex types.
This patch also fixes the incorrect error message reported
in the JIRA.

Change-Id: I2c7b732767d154c36bc7189df5177d27a35d0d7b
Reviewed-on: http://gerrit.cloudera.org:8080/2267
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-22 20:16:24 -08:00
Alex Behm
8b32cbb904 IMPALA-2820: Support unquoted keywords as struct-field names.
After this patch structs can be parsed/created with field names
that are regular identifiers or keywords, even if unquoted.
This fix is needed for parsing type strings stored in the
Hive Metastore which could contain unquoted identifiers that
correspond to Impala keywords.

The parser changes required an upgrade of Cup and its Maven plugin.
In the old version, the generated parser would not compile because
of a giant method that exceeded the JVM maximum allowed size for a
single method.

Change-Id: Ic989c7afd034216f6db4c8f9f3901c025cceb524
Reviewed-on: http://gerrit.cloudera.org:8080/2249
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-22 20:16:24 -08:00
Bharath Vissapragada
ef0dac661c IMPALA-2843: Persist hive udfs across catalog restarts
This commit adds a new feature to persist hive/java udfs across
catalog restarts. IMPALA-1748 already added this for non-java
udfs by storing them in parameters map of the Db object and
reading them back at catalog startup. However we follow a
different approach for hive udfs by converting them to Hive's
function format and adding them as hive functions to the metastore.
This makes it possible to share udfs between hive and Impala as the
udfs added from one service are accessible to the other. This commit
takes care of format conversions between hive and impala, so the user
can add a function just once in either of the services.

Background: Hive and impala treat udfs differently. Hive resolves the
evaluate function in the udf class at runtime depending on the data
types of the input arguments. So user can add one function by name and
can pass any arguments to it as long as there is a compatible evaluate
function in the udf class. However Impala takes the input types of the
udf as a part of function definition (that maps to only one evaluate
function) and loads the function only for those set of input argument
types. If we have multiple 'evaluate' methods, we need to add multiple
functions one for each of them.

This commit adds new variants of CREATE | DROP FUNCTION to Impala that
let the user create and drop hive/java udfs without input argument
types or return types. Catalog takes care of loading/dropping the udf
signatures corresponding to each "evaluate" method in the udf symbol
class. The syntax is as follows,

CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts>
DROP FUNCTION [IF EXISTS] <function name>

Examples:

CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;

The older way of creating hive/java udfs with specific signature is still supported,
however they are *not* persisted across restarts. So a restart of catalog can
wipe them out. Additionally this commit also loads all the compatible java udfs
added outside of Impala and they needn't be separately loaded. One thing
to note here is that the functions added using the new CREATE FUNCTION
can only be dropped using the new DROP FUNCTION syntax (without
signature). The same rule applies for the java udfs added using the old
CREATE FUNCTION syntax (with signature).

Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d
Reviewed-on: http://gerrit.cloudera.org:8080/2250
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 23:04:03 -08:00
Marcell Szabo
8135ef6eaa IMPALA-2641: Add IF EXISTS clause to TRUNCATE TABLE statement
Change-Id: I3169390b0e04f07fb4ea53d987d86a76482d7e9d
Reviewed-on: http://gerrit.cloudera.org:8080/1905
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 14:08:58 +00:00
Skye Wanderman-Milne
5a81d2db88 IMPALA-2184: don't inline timestamp methods with try/catch blocks in IR
We do not have exceptions enabled for codegen'd code, so exceptions
thrown by functions called by codegen'd functions cannot be caught by
the codegen'd functions. TimestampValue::UnixTimeToPtime() has a
try/catch around boost::posix_time::ptime_from_tm(), but since it was
inlined into the TimestampFunctions::FromUnix() IR the try/catch
didn't work. This patch moves the UnixTimeToPtime() implementation to
the .cc file so it doesn't get included in the IR. It does the same
for TimestampParser::Parse() in case it gets inlined into IR code as
well.

Change-Id: Ic0af73629e1e3b6bf18cbf5d832973712b068527
Reviewed-on: http://gerrit.cloudera.org:8080/2210
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 00:03:23 -08:00
Bharath Vissapragada
1b40a83903 IMPALA-2382: Add support for Hive udfs returning primitive types
Hive allows udfs with primitive data types as return values (along
with Writables) and input arguments. This commit adds this support
for Impala.

Change-Id: I2ec24eab5a824772a8618d7fb97ae5c7ea2a0e39
Reviewed-on: http://gerrit.cloudera.org:8080/2207
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 00:03:22 -08:00
Lars Volker
6b566a2d35 IMPALA-3004: Fix QueryTest tests
Test files in testdata/workloads/functional-query/queries/QueryTest
are parsed by test_file_parser.py, which used to ignore everything
before the first ==== line as a file header. This change fixes all
affected files.

This change also modifies the test file parser to forbid headers
starting with what looks like a subsection title ('----'), which
should prevent the reintroduction of similar errors in the future.
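
A minimal sketch of the header rule described above, assuming a simplified
file layout in which sections are separated by lines consisting of '===='.
The function name split_test_sections() is hypothetical and this is not the
code in test_file_parser.py.

```python
def split_test_sections(text):
    """Split test-file text into (header_lines, sections).

    Everything before the first '====' line is treated as a header. A header
    whose first non-blank line starts with '----' looks like a test
    subsection that is missing its leading '====', so it is rejected instead
    of being silently ignored.
    """
    lines = text.splitlines()
    try:
        first = lines.index("====")
    except ValueError:
        first = len(lines)
    header = lines[:first]

    for line in header:
        if line.strip():
            if line.startswith("----"):
                raise ValueError("header looks like a subsection: %r" % line)
            break

    sections, current = [], []
    for line in lines[first + 1:]:
        if line == "====":
            sections.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        sections.append("\n".join(current))
    return header, sections
```

A file that begins directly with '---- QUERY' raises an error in this
sketch, which is the class of mistake the parser change is meant to catch.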

Change-Id: Iaa1bc5ffd02782e24289c7843dcb35401c334519
Reviewed-on: http://gerrit.cloudera.org:8080/2220
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 00:03:15 -08:00
Michael Ho
f9232c98b0 IMPALA-3018: Fix AllocBuffer() and CopyStringVal() to handle empty strings.
AllocBuffer() and CopyStringVal() are two helper functions used by
various UDAs to allocate buffers for StringVal during their Init()
and Update() functions. Previously, these functions assumed that
the buffer length is always greater than 0. That turned out to be
an invalid assumption. This change removes this assumption and
handles zero-length StringVal by initializing its 'ptr' to NULL and
'len' to 0. A new test is also added to exercise this case.

Change-Id: Ia1e4140376c65ca3c734c40ecc3cce15b8bf2d3f
Reviewed-on: http://gerrit.cloudera.org:8080/2211
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-02-18 01:25:10 -08:00
Skye Wanderman-Milne
9aeb77023f IMPALA-2993: don't check for "Failed to allocate buffer for collection" error
This test query is supposed to check the error path for when a
collection buffer cannot be allocated. However, it's flaky because the
collection allocations are not very big (< 2KB), so it's possible for
a different operator to trigger OOM.

I think the correct solution is to create a test file that contains
very large collections, so a large collection allocation will trigger
OOM, rather than many small collection allocations. For now though,
let's disable the specific collection allocation check to unblock the
build, even though we risk losing coverage.

Change-Id: Iab4c9b605186926c522cf692246a37882fbdfcdb
Reviewed-on: http://gerrit.cloudera.org:8080/2208
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2016-02-18 01:25:10 -08:00
Matthew Jacobs
35ad46c1ce IMPALA-2996: Mem limit too high for expected OOM test failure
A regression test for IMPALA-2265, IMPALA-2559 expected a
query to fail with an OOM but the mem limit is now too high.
This reduces the mem limit of the test case to be as low as
it can be without failing to set up the operators.

Change-Id: I056c3ad4067e5466e3690c3b4d597b9815a7a234
Reviewed-on: http://gerrit.cloudera.org:8080/2186
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 45ba3109e752dfdeefdf5627a5d57079f73b24c9)
2016-02-17 20:22:14 -08:00
Tim Armstrong
212bea529f IMPALA-2994: Temporary workaround for flaky spilling test
The test was recently reenabled in commit
71a0a7d998702781ae44270f8c742b10c34c0efc.

Continue running the test but loosen the memory limit and don't check
the runtime profile. The memory limits for this set of tests needs
revisiting in any case.

Change-Id: I195e8ad3b67c8ff85d5d15c2646a13f5feb57553
Reviewed-on: http://gerrit.cloudera.org:8080/2183
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 51632f39a45ba9deac9b86bbdb14ff10cbee35ac)
2016-02-17 20:21:57 -08:00
Henry Robinson
2212240106 IMPALA-2552: Runtime filter forwarding between joins and scans
This patch adds the ability for operators to compute and forward bitmap
filters from one operator to another, across fragment and machine
boundaries.

Filters are provided as part of the plan from the frontend. In this
patch hash join nodes produce filters from their build side input, and
propagate them, via the query's coordinator, to the scan nodes which
provide the probe side for that join. The scan nodes may then filter
their rows before they are sent to the join, reducing the amount of work
the join has to do.

Filters are attached to the local RuntimeState's RuntimeFilterBank by
the join node. When complete, they are asynchronously sent to the
coordinator via a new UpdateFilter() RPC. The coordinator maintains a
routing table that maps incoming filters to their recipient
backends. For partitioned joins, the filters must be aggregated from all
providers. The coordinator performs this aggregation and transmits the
completed filter only when all inputs have been received.
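
The aggregation step can be pictured with the sketch below. It assumes that
partial filters are combined with a bitwise OR, which is the usual way to
union Bloom filters of equal size; the patch description does not state how
the coordinator combines them. FilterAggregator is a hypothetical name and
the filters are modelled as plain Python integers.

```python
class FilterAggregator:
    """Collects partial filters and releases the result when all have arrived."""

    def __init__(self, num_producers):
        self.pending = num_producers
        self.bits = 0

    def update(self, partial_bits):
        """Merge one producer's filter; return the full filter once complete."""
        if self.pending == 0:
            raise RuntimeError("all filters already received")
        self.bits |= partial_bits  # assumed merge rule: bitwise OR
        self.pending -= 1
        return self.bits if self.pending == 0 else None
```

Nothing is returned until the last producer reports, matching the statement
that the completed filter is transmitted only when all inputs have been
received.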

In this patch, filtering can occur in up to four places in a scan:

1. Before initial scan ranges are issued (all file formats, partition
   columns only)
2. Before each scan range is processed (all file formats, partition
   columns only)
3. Before a row group is processed (Parquet, partition columns only)
4. During assembly of every row (Parquet, any column)

This patch also replaces the existing bitmap-based filters with Bloom
Filter based ones.

The Bloom Filters are statically sized to have an expected false positive rate of
0.1 on 2^20 distinct items. This yields Bloom Filters of 1MB in
size. This is configurable by setting --bloom_filter_size, and we
will perform tests to determine a good default. The query option
RUNTIME_BLOOM_FILTER_SIZE can override the command-line flag on a
per-query basis.
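
As a rough check of the 1MB figure, the sketch below applies the textbook
Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits, and then rounds
up to a power of two. Both the use of that formula and the power-of-two
rounding are assumptions made for this example, not details taken from the
patch.

```python
import math


def bloom_filter_bytes(num_items, fp_rate):
    """Size in bytes of a Bloom filter, rounded up to a power-of-two bit count."""
    # Classic optimal size in bits for the target false positive rate.
    bits = math.ceil(-num_items * math.log(fp_rate) / (math.log(2) ** 2))
    # Assumed rounding step: next power of two.
    bits_pow2 = 1 << (bits - 1).bit_length()
    return bits_pow2 // 8


print(bloom_filter_bytes(2 ** 20, 0.1))  # 1048576
```

For 2^20 items and a 0.1 false positive rate the unrounded size is about
5.03 million bits (roughly 0.6MB), which rounds up to 2^23 bits, i.e. 1MB.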

This patch also simplifies and improves the memory handling for
allocated filters by the RuntimeFilterBank. New filters are tracked
through the query memory tracker, and owned by the fragment instance's
RuntimeState::obj_pool().

This patch also adds a simple heuristic to disable filter creation based
on estimated false-positive rate for the Bloom Filter. By default the
maximum FP rate is set to 75%. It can be controlled by setting
--max_filter_error_rate.

Finally, this patch adds short-circuit publication for filters that are
broadcast, and does so always even when distributed runtime filter
propagation is disabled.

To avoid cross-compilation problems, bloom-filter.h was rewritten in C++98.

Change-Id: Icea03a87cf1705c1b4aa46f86f13141c4b58da10
Reviewed-on: http://gerrit.cloudera.org:8080/1861
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
2016-02-13 16:19:41 +00:00
Tim Armstrong
1c102d9d8e Reenable tests that were disabled for IMPALA-1305
A couple of tests were disabled because of IMPALA-1305. Now that the fix
is in, those tests can be reenabled. I ran them in a loop to make sure
that they weren't flaky.

Also fix the spelling mistake in the file name.

Change-Id: I1bfcc619911a92d93b871be3a14852aa11f78da9
Reviewed-on: http://gerrit.cloudera.org:8080/2150
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-13 10:08:13 +00:00
Alex Behm
d7ee6fa7a4 IMPALA-2663: Filter out tuples with empty collections in scan.
We now generate predicates for filtering out empty collections
directly in the parent scan that materializes the collections.
This optimization is conservatively applied only for uncorrelated
relative table references because that makes it safe/easy to determine
the join type (the optimization is incorrect for outer and anti joins).

The change provides a substantial improvement for queries that
have selective predicates on nested collections, or for data sets
that naturally have many empty collections. The performance improvement
comes from:
(1) The new predicates are assigned to a scan, so we get multi-threading.
(2) We avoid expensive subplan iterations for collections that would
    yield an empty subplan result anyway.

Performance measurements on 10-node using nested TPCH-300 on some of
the queries originally mentioned in the JIRA:

TPCH-Q12, 10x speedup
Before: 111s
After:   11s

TPCH-Q7
Before: 205s
After:  128s

TPCH-Q5
Before: 48s
After:  40s

TPCH-Q3
Before: 18s
After:  14s

The following microbenchmark query designed to highlight the
improvement also gets a ~10x speedup.

select c_custkey, o_orderkey
from customer c, c.c_orders
where o_orderkey = 1884930

Before: 11.3s
After:   1.8s

Change-Id: I0d0dc90442a61d62cc8f7dad186560490b62441a
Reviewed-on: http://gerrit.cloudera.org:8080/2118
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-12 00:23:43 +00:00
Dimitris Tsirogiannis
c943d6ab7d IMPALA-2552: Add support for runtime filter propagation (FE)
This commit adds support for runtime filter propagation in the frontend.
During planning, the frontend computes a set of filters that are
constructed by join operators and are applied at scan operators in order
to filter scanned tuples or scan ranges. The filters are identified
from equi-join predicates by traversing the single-node plan tree in a
top-down fashion.

A query option, termed enable_runtime_filter_propagation, is added to
enable/disable runtime filter propagation (disabled by default). When
runtime filter propagation is enabled, the output of EXPLAIN is modified
to include information about the runtime filters that are constructed/applied.
Also, an event is added to the query timeline to track the time spent in the planner
while computing runtime filters.

Testing:

Functional planner tests are added.

Change-Id: Id79a38313051d95da32c897b176a40d26b0dda1d
Reviewed-on: http://gerrit.cloudera.org:8080/1532
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
2016-02-12 00:11:45 +00:00
Tim Armstrong
2c2670e389 IMPALA-1305: streaming pre-aggregations
Aggregations are implemented as a distributed pre-aggregation, an
exchange, then a final aggregation that produces the results of the
aggregation. In many cases the pre-aggregation significantly reduces the
amount of data to be exchanged. However, in other cases, the
preaggregation does not greatly reduce the amount of data exchanged or
can use a lot of memory and starve other operators that would benefit
more from the additional memory.

In these cases we would be better off "passing through" some input tuples
by transforming them into intermediate tuples without aggregating them.

This patch adds a streaming pre-aggregation mode to
PartitionedAggregationNode that tries to aggregate input rows with a
hash table, but can switch to passing through the input tuples (after
transforming them into the appropriate tuple format). It does this if
it hits a memory limit or if the aggregation is not sufficiently
reducing the node's output (specifically, if the number of aggregated
rows in the hash table is more than half the number of unaggregated rows
consumed by the pre-aggregation). Pre-aggregations never need to spill
because they can pass through rows when under memory pressure.
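
The reduction check can be illustrated with the toy COUNT(*) pre-aggregation
below. It is a sketch under stated assumptions, not the
PartitionedAggregationNode logic: the function name and the
min_rows_before_check warm-up parameter are hypothetical, and the memory
limit trigger is left out.

```python
def streaming_preagg_count(keys, min_rows_before_check=4):
    """Toy streaming pre-aggregation computing per-key counts.

    Returns (hash_table, passed_through). Rows whose key is already in the
    hash table are always aggregated. A row with a new key is passed through
    as an intermediate (key, 1) tuple when the table holds more than half as
    many groups as rows consumed so far, i.e. the reduction is poor.
    """
    table = {}
    passed_through = []
    consumed = 0
    for key in keys:
        consumed += 1
        if key in table:
            table[key] += 1
            continue
        poor_reduction = (consumed >= min_rows_before_check
                          and len(table) > consumed / 2)
        if poor_reduction:
            passed_through.append((key, 1))
        else:
            table[key] = 1
    return table, passed_through
```

Because the check is re-evaluated for every row, an early decision to pass
rows through is reversed in this sketch as soon as the ratio improves.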

This initial implementation is quite conservative: it retains the
partitioning of the previous implementation because switching to a
single partition proved to regress performance of some queries while
improving others. It also always keeps hash tables around and updates
them with matching input rows so that reduction statistics are updated
and early decisions to pass through data can be reversed.  Future work
could explore different approaches within the new framework to get
larger performance gains. Currently we see significant performance
benefits for queries with a very low reduction factor, e.g. group by on
a nearly unique column.

Includes codegen support for the passthrough streaming.

Adds a query option, disable_streaming_preaggregations, in case a user
wants to revert to the old behaviour.

Adds TPC-H tests to exercise the new passthrough code path and updates
planner tests to include the new [STREAMING] detail added by the planner.

Change-Id: Ia40525340cba89a8c4e70164ae11447e96494664
Reviewed-on: http://gerrit.cloudera.org:8080/1698
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
2016-02-11 19:03:51 +00:00
Skye Wanderman-Milne
039bd44fdf IMPALA-2688: decimal codegen support in aggregations
This patch implements codegen support for aggregations with decimal
input and intermediate type. For the following benchmark query:

SELECT l_discount, count(*) AS cnt
FROM biglineitem
GROUP BY l_discount
HAVING cnt > 9999999999999

Query time went from 8.85s to 3.74s (2.4x faster).

Change-Id: I25934fcd6324e5bf1fa6859496107bf2ec68b8d3
Reviewed-on: http://gerrit.cloudera.org:8080/2050
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2016-02-11 02:32:22 +00:00
Anuj Phadke
d787e1e3a7 IMPALA-2425: Broadcast join hint not enforced when low memory limit is set.
Broadcast joins are disabled if the size of the rhs hash table exceeds
the per node mem_limit. This change forces a broadcast join if the
broadcast join hint is enforced.

Change-Id: Iff9bd4d01736c48e52306ac79f74ab6ef0938f2a
Reviewed-on: http://gerrit.cloudera.org:8080/1967
Reviewed-by: Huaisi Xu <hxu@cloudera.com>
Tested-by: Internal Jenkins
2016-02-10 11:30:19 +00:00
Alex Behm
9a9886ee37 IMPALA-2950: Fully resolve exprs before wrapping with TupleIsNullPredicates.
The bug: In SingleNodePlanner.createInlineViewPlan() we need to wrap some
exprs with TupleIsNullPredicates to preserve correctness if the inline view
is outer joined. The bug was that we used to perform this wrapping on
the rhs of the inline view's smap, and not the final output smap after
those rhs exprs have been resolved against the physical output of the
inline view's plan root. As a result, the TupleIsNullWrapping did not
work correctly for deeply nested inline views with exprs that require
wrapping at various nesting levels.

The fix: Resolve the exprs against the physical output of the inline view's
plan root before performing the TupleIsNullPredicate wrapping.

Change-Id: I183bba6a36bf5e19a88687ed8c82977ae769ddf4
Reviewed-on: http://gerrit.cloudera.org:8080/2092
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-10 07:16:58 +00:00
Michael Ho
40f75fb1ba IMPALA-2925: Fix flaky tests in test_alloc_fail_update()
test_alloc_fail_update() aims to stress memory allocation
failure in the Update(), Serialize() and/or Finalize() functions
of UDAs. However, this test included some UDFs which allocated
memory in their Init() functions and not during their Update()
functions. This change removes those UDFs from the test.

Change-Id: I1ecc7e838e34ebc9ea3c878fee8ea2497b5fa23e
Reviewed-on: http://gerrit.cloudera.org:8080/2005
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-02-10 00:54:11 +00:00
Lars Volker
f9c718e4ea IMPALA-2959: Fix S3 failure caused by broken regex
IMPALA-2862 fixed parsing for regular expressions in the result
verifier. This change fixes a test that had a broken regular expression,
which was not caught by the exhaustive test suite.

I searched for tests with a similar issue but couldn't find any:
git grep "regex:[^,]\+'"

Change-Id: I3aaca6bdfdc1eaab715929aa5fc6b64e6c969656
Reviewed-on: http://gerrit.cloudera.org:8080/2089
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-08 20:14:05 +00:00
Alex Behm
ecf46a5af8 IMPALA-976: Improvements to scan and join cardinality estimates.
1. Improved join cardinality estimation.
   For each equi join predicate we try to determine whether it is
   a foreign/primary key (FK/PK) join condition, and either use a
   special FK/PK estimation or a generic estimation method. We
   maintain the minimum cardinality for each method separately,
   and finally return in order of preference:
   - the FK/PK estimate, if there was at least one FK/PK predicate
   - the generic estimate, if there was at least one predicate with
     sufficient stats
   - otherwise, we optimistically assume a FK/PK join with a join
     selectivity of 1, and return the left-hand side cardinality
2. More robust handling of conjuncts with unknown selectivities,
   and conjuncts that are not independent. Uses exponential backoff.
3. More accurate broadcast vs. partitioned join cost estimation.
   We now account for the 4 byte per-tuple overhead when serializing
   rows over an exchange. This change is especially helpful in cases
   where one side of the join has no materialized slots, i.e., it
   has a row size of 0, and an exchange used to appear free.
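
To illustrate the exponential backoff mentioned in item 2, the sketch below
uses one common formulation: sort the selectivities from most to least
selective and dampen each successive one with a halving exponent. The exact
exponents used by the planner are not given here, so the formula and the
combined_selectivity() name are assumptions.

```python
def combined_selectivity(selectivities):
    """Combine conjunct selectivities without assuming full independence.

    The most selective conjunct counts fully; the i-th next one is raised to
    the power 1 / 2**i, so additional conjuncts reduce the estimate less and
    less.
    """
    result = 1.0
    for i, sel in enumerate(sorted(selectivities)):
        result *= sel ** (1.0 / (2 ** i))
    return result
```

For selectivities 0.01 and 0.25 this gives 0.01 * 0.25^(1/2) = 0.005, which
is less aggressive than the 0.0025 produced by assuming independence.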

We are obviously not done with improving join cardinality estimates.
This patch is merely a step in the right direction, in particular,
the code and behavior are now more explicit and easier to reason about
than before, and better reflects the original intent (i.e., fixes the
IMPALA-976 bug).

Change-Id: I00d8e8230e2844cb807d128d82b35ee78db7d774
Reviewed-on: http://gerrit.cloudera.org:8080/1668
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-06 09:26:46 +00:00
Michael Ho
3d7a4477ee IMPALA-2948: Fix a bug in the planner when fast partition key scan is enabled
When the query option OPTIMIZE_PARTITION_KEY_SCANS is true, we may
acquire the partition key values from the metadata and generate a
union node containing constant expressions only. There is a bug in
the planner when generating the union node as it skips evaluating
the constant expressions for unmaterialized slots but union node
expects an entry in the constant expression lists for each slot
in the tuple descriptor even if the slot is not materialized.

This change fixes the problem by inserting dummy null values
in the constant expression list for unmaterialized slots and lets
the union node filter them out. A test is also added to verify
the fix.

Change-Id: I9ed49dca0101b96bd9b20e6d1e5b1d56f654e911
Reviewed-on: http://gerrit.cloudera.org:8080/2067
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-06 05:28:28 +00:00
Alex Behm
733d135212 IMPALA-852,IMPALA-2215: Analyze HAVING clause before aggregation.
In SelectStmt.analyzeAggregation(), we need to analyze the HAVING clause
first so we can check if it contains aggregates.
Also, we need to analyze/register it even if we are not computing aggregates.

Change-Id: Ieedfb64bf9a8f1390c0231a8b4aa25120ee5542b
Reviewed-on: http://gerrit.cloudera.org:8080/2066
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-06 01:31:34 +00:00
Alex Behm
ba1ad352a6 IMPALA-2926: Fix off-by-one bug in SelectNode::CopyRows().
The bug was that we were not updating child_row_idx_
when the output batch was at capacity, leading us to
double count that last child_row_idx_, and incorrectly
returning extra rows.

Change-Id: I85b2f1c146861ec7756887b0d2c574365d90233e
Reviewed-on: http://gerrit.cloudera.org:8080/2044
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-05 19:23:37 +00:00
Alex Behm
50bf474187 IMPALA-2906: Fix an edge case with materializing TupleIsNullPredicates in analytic sorts.
The bug: In order to preserve tuple nullability information through analytic sorts
we materialize the relevant expressions that contain TupleIsNullPredicates, and
make appropriate changes to the analytic sort's output smap. However, in some
edge cases, we incorrectly materialized an expr with a TupleIsNullPredicate
that could not be evaluated at that sort node because the tuple ids referenced
by the expr were not produced by the sort's input. For example, this scenario
was possible when a constant expr was wrapped in a TupleIsNullPredicate, and our
isBoundByTupleIds() check failed to filter out the expr from materialization
at the analytic sort.

The fix: Our existing code in the AnalyticPlanner already does the right thing.
We were simply missing the implementation of TupleIsNullPredicate.isBoundByTupleIds().

Change-Id: I72774f698545220922dd8ffbfa514aa87d26f97d
Reviewed-on: http://gerrit.cloudera.org:8080/2008
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-05 19:15:53 +00:00
Dimitris Tsirogiannis
ccf1f8f73f IMPALA-2734: Correlated EXISTS subqueries with HAVING clause return wrong results
This commit fixes an issue where wrong results are returned if an EXISTS subquery
contains a HAVING clause and non-equality correlated binary predicates. This case does
not have a valid rewrite as the HAVING clause needs to be applied after the correlated
predicates have been evaluated. With this fix, we detect cases like this and throw an
AnalysisException.

Change-Id: I159f956e2b01f408601829b5d2afcf11d76bedcd
Reviewed-on: http://gerrit.cloudera.org:8080/1927
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-02-04 01:06:03 +00:00
Jim Apple
7fc739f6d6 IMPALA-2897: Fix equality comparisons on null build-side rows.
If a hash table stores null build-side rows, then it must treat null
expressions in build side rows as being equal. Otherwise, long collision
chains can accumulate, as rows with the same nulls will have the same
hash values but not compare equal.

Equality between build rows and probe rows stays the same.
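A toy Python model of the effect, assuming a chained hash table that keeps one entry per distinct key. The helper names and the `nulls_equal` flag are hypothetical and do not mirror Impala's hash table API.

```python
def rows_equal(a, b, nulls_equal):
    """Compare two rows column by column; None stands in for SQL NULL."""
    for x, y in zip(a, b):
        if x is None or y is None:
            # NULLs only match each other, and only when nulls_equal is set.
            if not (nulls_equal and x is None and y is None):
                return False
        elif x != y:
            return False
    return True


def count_distinct_entries(build_rows, nulls_equal):
    """Insert build rows into hash buckets, keeping one entry per distinct key."""
    buckets = {}
    for row in build_rows:
        chain = buckets.setdefault(hash(row), [])
        if not any(rows_equal(row, other, nulls_equal) for other in chain):
            chain.append(row)
    return sum(len(chain) for chain in buckets.values())
```

With 100 copies of the row (None, 1), the table holds 100 entries in one chain when NULLs never compare equal, and a single entry when build-side NULLs are treated as equal.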

Change-Id: I5f11addca7dc97408f6eb89de5082657333d17b9
Reviewed-on: http://gerrit.cloudera.org:8080/1956
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-02-03 20:46:32 +00:00
Lars Volker
4fc7f15376 IMPALA-2862: Fix regex parsing in test result verifier
Test results can be verified using regular expressions. The extraction
of the regular expression substring from the expected test results had a
bug where only the first character of an expression was considered. This
led to wrong but undetected test results.
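A minimal sketch of this kind of extraction, assuming expected rows mark a regular expression with a "row_regex:" prefix. The prefix constant and both function names are assumptions for illustration, not the verifier's actual code.

```python
import re

ROW_REGEX_PREFIX = "row_regex:"  # assumed marker for regex rows


def extract_regex(expected_row):
    """Return the full regex that follows the prefix, or None for literal rows."""
    if not expected_row.startswith(ROW_REGEX_PREFIX):
        return None
    # Take the whole remainder of the line, not just its first character.
    return expected_row[len(ROW_REGEX_PREFIX):].strip()


def row_matches(expected_row, actual_row):
    """Compare an actual row against a literal or regex expected row."""
    pattern = extract_regex(expected_row)
    if pattern is None:
        return expected_row == actual_row
    return re.fullmatch(pattern, actual_row) is not None
```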

Change-Id: Ia670da6e0758455a86dc44744b96b9465d890af3
Reviewed-on: http://gerrit.cloudera.org:8080/1818
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-02 21:55:57 +00:00
Bharath Vissapragada
aed3505c8d IMPALA-1651: CREATE TABLE LIKE shouldn't inherit hdfs caching settings from source table
Change-Id: Ia5dba8ac463d088b50e1d16a7b5db1941d7c6989
Reviewed-on: http://gerrit.cloudera.org:8080/1917
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-01-28 13:40:32 +00:00
Dimitris Tsirogiannis
b78b37cbdc IMPALA-1480: Slow DDL statements with large number of partitions
This commit improves the performance of DDL statements on tables with
a large number of partitions. Previously, the catalog would force-reload
the entire table metadata during the execution of DDL and insert
statements, causing significant delays for tables with a large number of
partitions. With this commit the catalog reuses any cached table
entries to partially reload table metadata for only those partitions
that have been modified. With this change we've improved the performance
of some DDL and insert statements by at least 4-5X.

This commit also adds basic table-level locking to protect table
metadata from concurrent DDL operations.
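The idea can be sketched in Python as follows. This is a simplified model and not the catalog's Java code: the CachedTable class, the `source` mapping that stands in for the metastore, and the load counter are all hypothetical.

```python
import threading


class CachedTable:
    """Toy cached table: partition name -> metadata, guarded by one lock."""

    def __init__(self, partitions):
        self.partitions = dict(partitions)
        self.lock = threading.Lock()  # table-level lock for DDL operations
        self.loads = 0                # counts per-partition metadata loads

    def _load_partition(self, name, source):
        self.loads += 1
        return source[name]

    def reload(self, source, modified):
        """Reload only the partitions named in `modified`; reuse the rest."""
        with self.lock:
            for name in modified:
                if name in source:
                    self.partitions[name] = self._load_partition(name, source)
                else:
                    # The partition was dropped at the source.
                    self.partitions.pop(name, None)
```

With 1,000 cached partitions and one modified partition, reload() performs a single metadata load instead of 1,000.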

Preliminary performance measurements
-----------------------------------
Workload: insert into table partition () select ... limit 10
Iterations: 10

Num partitions  OLD (avg time sec)  NEW (avg time sec)
1K              1.15                0.45
5K              3.65                0.9
10K             5.75                1.38
15K             10.1                2.02
30K             25.4                4.46

Workload: alter table partition() set location...
Iterations: 10

Num partitions  OLD (avg time sec)  NEW (avg time sec)
1K              0.8                 0.47
5K              4.3                 0.71
10K             7.1                 1.2
15K             13.2                1.8
30K             26.8                3.4

Change-Id: I4da7fb6df0a71162b0cb60e6025a4019cb9572bf
Reviewed-on: http://gerrit.cloudera.org:8080/1706
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-01-28 09:18:57 +00:00
Alex Behm
0687c54792 IMPALA-2894: Move regression test into a different .test file.
We cannot run certain nested types queries with the legacy joins/aggs,
so to fix a build I just moved a recently added test into a different
.test file that already does not run with legacy joins/aggs.

Change-Id: I0ec0e61535ad01333129bd49beca4aa481f04d74
Reviewed-on: http://gerrit.cloudera.org:8080/1918
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-01-27 20:41:45 +00:00
Dimitris Tsirogiannis
6b3c9cac45 IMPALA-2870: Fix failing metadata.test_ddl.TestDdlStatements.test_create_table test
This commit fixes a DDL test that was failing because some newly added test cases were
using a database that had been dropped by another test case. The temporary fix is to use
fully qualified names for the specified tables.

Change-Id: I3bb022e2497283faeb84c85f922cda95beca2a32
Reviewed-on: http://gerrit.cloudera.org:8080/1909
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-01-27 01:51:42 +00:00
Michael Ho
0853ea1a7d IMPALA-2499: Evaluate a SELECT block using partition metadata.
This patch implements an optimization for evaluating table refs
with metadata instead of table scans for queries which satisfy
the following conditions:

(1) All scan slots being materialized are partition columns.
(2) The SELECT block only contains aggregate expressions bound
    by these slots. The aggregate expressions should have
    distinct semantics such as SELECT MIN(X) from T; or
    SELECT COUNT(DISTINCT x) FROM T; If there are no aggregate
    expressions in the SELECT block, the query block must contain
    grouping expressions.

If the above conditions are satisfied, the scan nodes in the plan
of the SELECT block will be replaced with union nodes which materialize
the partition key values.

The query "select min(year), max(month) from functional.alltypes;"
went from 440ms on average to 10ms (44x speed up) with this change.
The speed-up depends on the number of rows per partition.

The following are plans before and after this change for the
following query:
select month, min(year) from functional.alltypes group by month

Before:

01:AGGREGATE [FINALIZE]
|  output: min(year)
|  group by: month
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB

After:

01:AGGREGATE [FINALIZE]
|  output: min(year)
|  group by: month
|
00:UNION
   constant-operands=24

This optimization is enabled by the query option 'optimize_partition_key_scans'.
Note that there are some caveats with this optimization. In particular,
the returned values may be inconsistent with those of the conventional
plans in the following two cases when this optimization is enabled.

1. If a user deletes a file without doing a refresh, the metadata
   becomes stale. The conventional plan will return an error in the
   scan (due to missing files). With this optimization, the partition
   key values of deleted partitions may be returned.

2. With the conventional plan, an empty partition will not be included
   in the evaluation of the return values. A partition is empty if
   either (1) it contains no files or (2) its files contain no rows.
   This optimization may return different values in the second case
   above when there are no rows in the file.
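The second caveat can be illustrated with a toy Python comparison. Both functions and the `num_rows` field are hypothetical; they only model the difference between reading rows and reading partition metadata.

```python
def min_year_from_scan(partitions):
    """Conventional plan: a partition contributes one value per row it holds."""
    years = [p["year"] for p in partitions for _ in range(p["num_rows"])]
    return min(years) if years else None


def min_year_from_metadata(partitions):
    """Metadata plan: every listed partition contributes its key value."""
    years = [p["year"] for p in partitions]
    return min(years) if years else None


# The 2008 partition has a file, but that file contains no rows.
partitions = [
    {"year": 2008, "num_rows": 0},
    {"year": 2009, "num_rows": 5},
]
```

Here the scan-based result is 2009 while the metadata-based result is 2008, which is the kind of inconsistency described above.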

Due to the potential inconsistencies above, users need to opt in
to this optimization.

Change-Id: I30d4c7dab7610a30773fc60044499c468684dc9a
Reviewed-on: http://gerrit.cloudera.org:8080/1638
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 22:06:26 +00:00
Michael Ho
f3e7274342 IMPALA-2711: Fix memory leak in Rand().
MathFunctions::RandPrepare() allocates a 4-byte seed and
stores it in the FunctionContext's thread local state.
However, it was never freed. This change fixes the problem
by adding a close function for Rand() so it has a chance to
free the seed. A new test is also added to verify the fix.

Change-Id: Ibcc2e1ca0d052b86defe80aad471f9fdaac5a453
Reviewed-on: http://gerrit.cloudera.org:8080/1855
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 11:53:38 +00:00
Alex Behm
95951a36e8 IMPALA-2539: Unmark collections slots of empty union operands.
Change-Id: I401f9b9a5e5457120600a7cb5b54f84adb8477f7
Reviewed-on: http://gerrit.cloudera.org:8080/1895
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 11:32:40 +00:00
Michael Ho
968c61c940 IMPALA-2824: Restore query options after each test.
A failed test case inside a test file will leave the rest of
the test cases in the file unexecuted. Some test cases may
modify some query options such as memory limit and then
restore them in the subsequent test cases in the same file.
The failure of those test cases will leave the query options
modified, causing cascading failures to other test cases
which aren't expected to be run with the modified query
options (e.g. lowered memory limit). This problem may lead
to broken builds which are recorded in IMPALA-2724 and
IMPALA-2824.

This change fixes the problem above by checking if a test
case modifies any query option and, if so, restoring those
modified query options to their default values. This change
makes the assumption that a test should not modify an option
specified in its test vector so it's safe to restore the
modified query options to their default values.
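A minimal sketch of the restore logic, assuming a client object with a set_option() method. FakeClient, DEFAULT_OPTIONS and run_test_case() are hypothetical stand-ins and not the test framework's real classes.

```python
DEFAULT_OPTIONS = {"mem_limit": "0", "batch_size": "0"}  # hypothetical defaults


class FakeClient:
    """Stand-in for a query client that holds session query options."""

    def __init__(self):
        self.options = dict(DEFAULT_OPTIONS)

    def set_option(self, name, value):
        self.options[name] = value


def run_test_case(client, set_options, body):
    """Apply options, run the test body, then always restore the defaults."""
    modified = []
    try:
        for name, value in set_options.items():
            client.set_option(name, value)
            modified.append(name)
        body(client)
    finally:
        # Runs even when body() raises, so later test cases start clean.
        for name in modified:
            client.set_option(name, DEFAULT_OPTIONS[name])
```

Restoring in a finally block means a failing test case cannot leak a lowered memory limit into the test cases that follow it.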

Change-Id: Ib88d1dcb6a65183e1afc8eef0c764179a9f6a8ce
Reviewed-on: http://gerrit.cloudera.org:8080/1774
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 03:13:05 +00:00
Lars Volker
0aec98f674 IMPALA-2749: Fix decimal multiplication overflow
When multiplying double and decimal values, we used to cast all doubles
to decimals before doing the multiplication. Due to the precision of two
decimals being added during multiplication, the effective value range of
the resulting decimal type could become very small and overflows could
happen.
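A rough Python sketch of why the value range shrinks. It assumes that the result precision and scale are the sums of the operands' precision and scale, each capped at a maximum precision of 38, and that a double is cast to a hypothetical DECIMAL(38,17); the exact rules used by the analyzer may differ.

```python
MAX_PRECISION = 38  # maximum DECIMAL precision


def multiply_type(p1, s1, p2, s2):
    """Result (precision, scale) of a multiplication under the assumed rule."""
    precision = min(MAX_PRECISION, p1 + p2)
    scale = min(MAX_PRECISION, s1 + s2)
    return precision, scale


def integer_digits(precision, scale):
    """Digits left of the decimal point that the type can represent."""
    return precision - scale


# Hypothetical: a double cast to DECIMAL(38,17) times a DECIMAL(38,20) column.
result = multiply_type(38, 17, 38, 20)
```

Under these assumptions the result type is DECIMAL(38,37), which leaves a single digit left of the decimal point, so most products overflow.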

This change switches the behavior to cast to double precision types when
at least one of the input operands is of type float or double. In such
cases we will not have exact results in general and we assume the user
would normally not expect exact results from an inherently inexact
datatype.

Change-Id: Idd28c5471506c68a860beb0778d98c8d25825f9f
Reviewed-on: http://gerrit.cloudera.org:8080/1820
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-01-23 23:59:27 +00:00
Juan Yu
708ab3c669 IMPALA-2565: Planner tests are flaky due to file size mismatches.
Fix flaky test by ignoring file size in explain plan comparison.
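One way to ignore file sizes when comparing explain plans is to normalize the "size=" field before the comparison. The sketch below is an illustration in Python under that assumption; the regular expression and function names are hypothetical and not the planner test's actual code.

```python
import re

SIZE_RE = re.compile(r"size=\S+")  # matches fields such as size=478.45KB


def normalize_plan(plan):
    """Strip whitespace and mask file sizes on every line of a plan."""
    return [SIZE_RE.sub("size=<ignored>", line.strip()) for line in plan.splitlines()]


def plans_match(expected, actual):
    """Compare two explain plans while ignoring file size differences."""
    return normalize_plan(expected) == normalize_plan(actual)
```

Other fields, such as the partition and file counts, still take part in the comparison.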

Change-Id: I38871e5e16a6b60860aed4ea89c108fecdfd60d0
Reviewed-on: http://gerrit.cloudera.org:8080/1767
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-01-23 09:38:40 +00:00