Commit Graph

1097 Commits

Author SHA1 Message Date
Tim Armstrong
a280b93a37 IMPALA-2070: include the database comment when showing databases
As part of this change, refactor catalog and frontend functions to return
TDatabase/Db objects instead of just the string names of databases -
this required a lot of method/variable renamings.

Add test for creating database with comment. Modify existing tests
that assumed only a single column in SHOW DATABASES results.

Change-Id: I400e99b0aa60df24e7f051040074e2ab184163bf
Reviewed-on: http://gerrit.cloudera.org:8080/620
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-12-04 01:40:41 +00:00
Michael Ho
ba0bd1d0da IMPALA-2612: Free local allocations once for every row batch when building hash tables.
When building hash tables for the build side in partitioned
hash join or aggregation, we will evaluate the build or probe
side expressions to compute the hash values for each TupleRow.
Evaluation of certain expressions (e.g. CastToChar) requires
"local" memory allocation. "Local" memory allocation is supposed
to be freed after processing each row batch.

However, the calls to free local allocations are missing in
PartitionedHashJoinNode::BuildHashTableInternal() and
PartitionedAggregationNode::ProcessStream(). This causes all
"local" memory allocation to accumulate potentially for the
entire duration of the query or until GetNext() is called.
This may lead to unnecessary memory allocation failures when the
memory limit is exceeded.

This patch calls ExecNode::FreeLocalAllocations() at least once
per row-batch when building hash tables. It also adds the missing
checks for the query status in the loop building hash tables.
Please note that QueryMaintenance() isn't called due to its
overhead in memory limit checks.
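
The effect of freeing once per row batch can be sketched in Python. This is only an illustration of the pattern described above, not the Impala C++ code: the class ExprEvaluator, its byte counters, and build_hash_table() are hypothetical names, and "allocation" is modelled as a simple counter.

```python
class ExprEvaluator:
    """Hypothetical stand-in for expression evaluation that needs scratch memory."""

    def __init__(self):
        self.local_bytes = 0       # scratch memory currently held
        self.peak_local_bytes = 0  # high-water mark, for illustration only

    def hash_row(self, row):
        # Pretend that evaluating the expression allocates len(row) scratch bytes.
        self.local_bytes += len(row)
        self.peak_local_bytes = max(self.peak_local_bytes, self.local_bytes)
        return hash(row)

    def free_local_allocations(self):
        self.local_bytes = 0


def build_hash_table(batches, evaluator):
    """Insert every row into a dict, releasing scratch memory after each batch."""
    table = {}
    for batch in batches:
        for row in batch:
            table.setdefault(evaluator.hash_row(row), []).append(row)
        # Without this call, scratch memory would grow for the whole build.
        evaluator.free_local_allocations()
    return table
```

With two batches of two 2-byte rows each, the peak scratch usage in this model stays at one batch's worth (4 bytes) rather than the total (8 bytes).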

Change-Id: Idbeab043a45b0aaf6b6a8c560882bd1474a1216d
Reviewed-on: http://gerrit.cloudera.org:8080/1448
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2015-11-26 03:21:46 +00:00
Alex Behm
096073472f IMPALA-1459: Fix migration/assignment of On-clause predicates inside inline views.
The bug: Our predicate assignment logic used to rely on a flag isWhereClauseConjunct
set in Exprs to determine whether assigning a predicate at a certain plan node
was correct or not. This flag was intended to distinguish between predicates from
the On-clause of a join and other predicates, and the bug was that we used
!isWhereClauseConjunct to imply that a predicate originated from an On-clause.
For example, predicates migrated into inline views apply to the post-join,
post-aggregation, post-analytic result of the inline view, but the existing
flag and logic were insufficient to correctly assign predicates coming
from the On-clause of an enclosing block.

This patch removes the isWhereClauseConjunct flag in favor of an isOnClauseConjunct
flag which can more directly capture the originally intended logic.

This patch also adds a test to cover the same issue reported in IMPALA-2665.

Change-Id: I9ff9086b8a4e0bd2090bf88319cb917245ae73d2
Reviewed-on: http://gerrit.cloudera.org:8080/1453
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-11-20 04:53:28 +00:00
Tim Armstrong
643c800d62 IMPALA-2473: reduce scanner memory usage
This patch reduces memory usage of scanners by adjusting how batch
capacity is checked and handled and by freeing unneeded memory.

Change RowBatch::AtCapacity(MemPool) so that batches with no rows
cannot hold onto an unbounded amount of memory - instead they
will pass these batches up operator tree so that the resources
can be freed.

The Parquet scanner also only checked capacity every 1024 rows.
With large rows (e.g. nested collections), it can overrun the
intended 8mb limit. It also didn't include the MemPool usage
in its checks. After the change the scanner will produce smaller
batches if rows contain large nested collections or strings.
I benchmarked this with a scan of the nested TPC-H customers
tables. The row batch size decreased from ~16MB to ~8MB. If the
nested collections were larger this would be more drastic.
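
A rough Python model of the capacity check described above, assuming a batch counts as full when either its row count or its variable-length memory usage reaches a limit. The names RowBatch, at_capacity() and make_batches() are hypothetical and only mirror the idea; the 8 MB constant is the intended limit mentioned in the message, and checking after every row is a simplification.

```python
MAX_BATCH_BYTES = 8 * 1024 * 1024  # the intended 8 MB limit from the message


class RowBatch:
    """Hypothetical batch that tracks row count and variable-length memory."""

    def __init__(self, capacity_rows):
        self.capacity_rows = capacity_rows
        self.num_rows = 0
        self.mem_pool_bytes = 0

    def add_row(self, var_len_bytes):
        self.num_rows += 1
        self.mem_pool_bytes += var_len_bytes

    def at_capacity(self):
        # Full if either the row limit or the memory limit has been reached.
        return (self.num_rows >= self.capacity_rows
                or self.mem_pool_bytes >= MAX_BATCH_BYTES)


def make_batches(row_sizes, capacity_rows=1024):
    """Split rows into batches, checking capacity after every row."""
    batches = []
    batch = RowBatch(capacity_rows)
    for size in row_sizes:
        batch.add_row(size)
        if batch.at_capacity():
            batches.append(batch)
            batch = RowBatch(capacity_rows)
    if batch.num_rows > 0:
        batches.append(batch)
    return batches
```

In this model, small rows still fill batches to the row limit, while rows carrying several megabytes of nested data close a batch after only a few rows.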

Also pass at capacity up the tree if no rows passed the conjuncts in
the DataSourceScanNode and Parquet scanner so that resources can be
freed.

HdfsTableSink is modified to avoid the incorrect assumption that a batch
only has 0 rows at eos. It is also refactored to pass a related flag as
an argument to make the semantics clearer.

Two simple benchmarks (one column and many columns) show no change
in scanner performance:
 > set num_scanner_threads=1;
 > select count(l_orderkey) from biglineitem;
 > select count(l_orderkey), count(l_partkey), count(l_suppkey),
   count(l_returnflag), count(l_quantity), count(l_linenumber),
   count(l_extendedprice), count(l_linestatus), count(l_shipdate),
   count(l_commitdate) from biglineitem;

Change-Id: I3b79671ffd3af50a2dc20c643b06cc353ba13503
Reviewed-on: http://gerrit.cloudera.org:8080/1239
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-11-19 22:57:05 +00:00
Taras Bobrovytsky
22df1fe1ca Random nested schema and data generation
Change-Id: Ie89f140ed389cd877a84ffe2df892853ac9897f2
Reviewed-on: http://gerrit.cloudera.org:8080/1167
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-11-14 05:19:32 +00:00
Amos Bird
12e57c5b32 IMPALA-2196: Implement DESCRIBE DATABASE [FORMATTED|EXTENDED] <db_name>.
This commit implements DESCRIBE DATABASE [FORMATTED|EXTENDED] <db_name>.
Without FORMATTED|EXTENDED this statement only prints the database's location and
comment. With FORMATTED|EXTENDED it will output all the properties of the database
(e.g. OWNER and PARAMETERS). Currently we only retrieve privileges stored in
hive metastore.

This commit also implements DESCRIBE EXTENDED <table>, which is the same
as DESCRIBE FORMATTED <table>, for consistency.

Change-Id: I2a101ec0e3d27b344fcb521eb00e5bdbcbac8986
Reviewed-on: http://gerrit.cloudera.org:8080/804
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-11-09 21:13:38 +00:00
Dan Hecht
82c7dbe265 IMPALA-2559: Fix check failed: sorter_runs_.back()->is_pinned_
Status was ignored by the callers of CreateMerger(), which caused the
sorter to continue in a bad state, leading to this DCHECK.

Change-Id: I1aae11d142cdbf3923289768e296dbf340f049e9
Reviewed-on: http://gerrit.cloudera.org:8080/1367
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-11-07 01:50:57 +00:00
Michael Ho
eb479d040e IMPALA-1714: Check if tables or partitions specified in "CREATE/ALTER TABLE ... CACHED IN ..." statements are cacheable.
This change adds more analysis checks to verify the location
of the table or partition specified in a "CREATE/ALTER TABLE
... CACHED IN ..." statement can actually be cached. Caching
is only supported for HDFS locations.

If table-wide caching is enabled for a table, adding a partition
at an uncacheable location will be disallowed for that table
unless the attribute "UNCACHED" is explicitly specified.

Enabling table-wide caching for a table at an uncacheable
location or a table with partitions at uncacheable locations
will also be disallowed. However, caching can still be enabled
for individual partitions whose underlying locations
support caching.

Change-Id: I2299c9285126f4b035360f2ef902147188ccd5f1
Reviewed-on: http://gerrit.cloudera.org:8080/1373
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2015-11-06 23:30:14 +00:00
Bharath Vissapragada
084b9b1692 IMPALA-2432: Add query endtime to impalad's lineage
This commit adds query endtime to impalad's lineage log entries consumed
by navigator. The lineage graph is constructed in the frontend and is then
passed to the backend as a serialized thrift object. When the query terminates
(includes cancellations and aborts), the backend appends the query endtime
("endTime") to the lineage graph and generates the lineage log entry in
JSON format.

Change-Id: I2236e98895ae9a159ad6e78b0e18e3622fdc3306
Reviewed-on: http://gerrit.cloudera.org:8080/934
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2015-11-04 08:39:12 +00:00
Skye Wanderman-Milne
dd2eb951d7 IMPALA-2558: DCHECK in parquet scanner after block read error
There was an incorrect DCHECK in the parquet scanner. If abort_on_error
is false, the intended behaviour is to skip to the next row group, but
the DCHECK assumed that execution should have aborted if a parse error
was encountered.

This also:
- Fixes a DCHECK after an empty row group. InitColumns() would try to
  create empty scan ranges for the column readers.
- Uses metadata_range_->file() instead of stream_->filename() in the
  scanner. InitColumns() was using stream_->filename() in error
  messages, which used to work but now stream_ is set to NULL before
  calling InitColumns().

Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701
Reviewed-on: http://gerrit.cloudera.org:8080/1257
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-30 22:35:57 +00:00
Jim Apple
19b6bf0201 IMPALA-2226: Throw AnalysisError if table properties are too large (for the Hive metastore)
This only enforces the defaults in Hive. Users who manually choose to
change the schema in Hive may unnecessarily trigger the new analysis
exceptions added in this commit. The Hive issue tracking the length
restrictions is

https://issues.apache.org/jira/browse/HIVE-9815

Change-Id: Ia30f286193fe63e51a10f0c19f12b848c4b02f34
Reviewed-on: http://gerrit.cloudera.org:8080/721
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2015-10-29 19:39:25 +00:00
Michael Ho
34a94c2503 IMPALA-2404: Implements built-in function regexp_match_count
This patch implements a new built-in function
regexp_match_count. This function returns the number of
matching occurrences in input.

The regexp_match_count() function has the following syntax:

int = regexp_match_count(string input, string pattern)
int = regexp_match_count(string input, string pattern,
    int start_pos, string flags)

The input value specifies the string on which the regular
expression is processed.

The pattern value specifies the regular expression.

The start_pos value specifies the character position
at which to start the search for a match. It is set
to 1 by default if it's not specified.

The flags value (if specified) dictates the behavior of
the regular expression matcher:

m: Specifies that the input data might contain more than
one line so that the '^' and the '$' matches should take
that into account.

i: Specifies that the regex matcher is case insensitive.

c: Specifies that the regex matcher is case sensitive.

n: Specifies that the '.' character matches newlines.

By default, the flag value is set to 'c'. Note that the
flags are consistent with other existing built-in functions
(e.g. regexp_like) so certain flags in IBM netezza such as
's' are not supported to avoid confusion.
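
The semantics above can be approximated in Python as a sketch. It uses Python's re module rather than RE2, so details such as empty matches may differ from Impala. The flag mapping follows the list above; letting a later flag override an earlier one and raising ValueError for unsupported input are assumptions of this sketch.

```python
import re


def regexp_match_count(text, pattern, start_pos=1, flags="c"):
    """Count non-overlapping matches of pattern in text from a 1-based start_pos."""
    re_flags = 0
    for flag in flags:
        if flag == "m":
            re_flags |= re.MULTILINE      # '^' and '$' match at line boundaries
        elif flag == "i":
            re_flags |= re.IGNORECASE     # case-insensitive matching
        elif flag == "c":
            re_flags &= ~re.IGNORECASE    # case-sensitive matching (the default)
        elif flag == "n":
            re_flags |= re.DOTALL         # '.' also matches newlines
        else:
            raise ValueError(f"unsupported flag: {flag!r}")
    if start_pos < 1:
        raise ValueError("start_pos must be >= 1")
    compiled = re.compile(pattern, re_flags)
    # finditer takes a 0-based offset, hence start_pos - 1.
    return sum(1 for _ in compiled.finditer(text, start_pos - 1))
```

For example, in this sketch regexp_match_count("aXbxcX", "x") counts only the lowercase occurrence, while passing flags "i" counts all three.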

Change-Id: Ib33ece0448f78e6a60bf215640f11b5049e47bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1248
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-10-27 10:11:13 +00:00
Michael Ho
f0c2742641 IMPALA-2004: Implement "SHOW CREATE" for udfs and udas.
This patch extends the SHOW statement to also support
user-defined functions and user-defined aggregate functions.
The syntax of the new SHOW statements is as follows:

SHOW CREATE [AGGREGATE] FUNCTION [<db_name>.]<func_name>;

<db_name> and <func_name> are the names of the database
and udf/uda respectively.

Sample outputs of the new SHOW statements are as follows:

Query: show create function fn
+------------------------------------------------------------------+
| result                                                           |
+------------------------------------------------------------------+
| CREATE FUNCTION default.fn()                                     |
|  RETURNS INT                                                     |
|  LOCATION 'hdfs://localhost:20500/test-warehouse/libTestUdfs.so' |
|  SYMBOL='_Z2FnPN10impala_udf15FunctionContextE'                  |
|                                                                  |
+------------------------------------------------------------------+

Query: show create aggregate function agg_fn
+------------------------------------------------------------------------------------------+
| result                                                                                   |
+------------------------------------------------------------------------------------------+
| CREATE AGGREGATE FUNCTION default.agg_fn(INT)                                            |
|  RETURNS BIGINT                                                                          |
|  LOCATION 'hdfs://localhost:20500/test-warehouse/libudasample.so'                        |
|  UPDATE_FN='_Z11CountUpdatePN10impala_udf15FunctionContextERKNS_6IntValEPNS_9BigIntValE' |
|  INIT_FN='_Z9CountInitPN10impala_udf15FunctionContextEPNS_9BigIntValE'                   |
|  MERGE_FN='_Z10CountMergePN10impala_udf15FunctionContextERKNS_9BigIntValEPS2_'           |
|  FINALIZE_FN='_Z13CountFinalizePN10impala_udf15FunctionContextERKNS_9BigIntValE'         |
|                                                                                          |
+------------------------------------------------------------------------------------------+

Please note that all the overloaded functions which match
the given function name and category will be printed.

This patch also extends the python test infrastructure to
support expected results which include newline characters.
A new subsection comment called 'MULTI_LINE' has been added
for the 'RESULT' section. With this comment, a test can
include its multi-line output inside [ ] and the content
inside [ ] will be treated as a single line, including the
newline character.

Change-Id: Idbe433eeaf5e24ed55c31d905fea2a6160c46011
Reviewed-on: http://gerrit.cloudera.org:8080/1271
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-23 05:11:07 +00:00
Matthew Jacobs
1cd95f79f0 Add partitions to table_no_newline_part IF NOT EXISTS
Loading functional data may fail if table_no_newline_part
partitions already exist. We should only ADD with IF NOT
EXISTS to handle this case as we do with other tables.

Change-Id: I5fe5c318d2cbbd5b5419394212b94a2fe7d386ce
Reviewed-on: http://gerrit.cloudera.org:8080/1261
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-17 00:52:53 +00:00
Martin Grund
65772cf9ce IMPALA-2527: Disable small-query optimization for collection types
When scanning tables with collection types the limit applied to the scan
node is not sufficient to safely enable the small-query
optimization. This patch adds an additional check to the
MaxRowsProcessedVisitor that will abort checking the number of processed
rows once a scan node accesses a collection type.

Change-Id: Ic43baf3f97acfb8d7b53b0591c215046179d18b3
Reviewed-on: http://gerrit.cloudera.org:8080/1235
Reviewed-by: Silvius Rus <srus@cloudera.com>
Tested-by: Internal Jenkins
2015-10-14 17:51:38 -07:00
Bharath Vissapragada
4ed0742f3e IMPALA-2310: Add PURGE option to DROP TABLE/ALTER TBL DROP PART
This commit adds PURGE option to DROP TABLE/ALTER TABLE DROP
PARTITION statements. Following is the usage:

1. DROP TABLE <tablename> takes an optional argument PURGE. Adding
purge purges the table data by skipping trash, if configured.

  DROP TABLE [<database>.]<tablename> [IF EXISTS] [PURGE]

2. PURGE is also supported with alter table drop partition query
with the following syntax. If specified, impala purges the partition
data by skipping trash.

  ALTER TABLE [<database>.]<tablename> DROP PARTITION [IF EXISTS] [PURGE]

This patch also helps the use case where Trash and the data directories
are in different encryption zones, in which case we cannot move the data
during ALTER/DROP. The PURGE option can then be used to skip the trash and
make sure data is actually deleted.

Change-Id: I64bf71d660b719896c32e0f3a7ab768f30ec7b3b
(cherry picked from commit 585d4f8d9e809f3bf194018dd161a22d3f144270)
Reviewed-on: http://gerrit.cloudera.org:8080/1244
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-10-14 17:51:37 -07:00
Michael Ho
26690ff1cd IMPALA-2204: Underscore in like predicate does not work for
multi-line text

This patch fixes the option passed to the RE2 regex matcher
so that it will count the newline character '\n' as a valid
candidate for '.'. Previously, that option was set to false
by default, causing multi-line text to fail to match against
patterns with wildcard characters in them. This patch also adds
some tests to address these cases and fixes some typos in
like-predicate.h.
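
The behaviour being fixed can be reproduced with a small Python sketch that translates a LIKE pattern into a regular expression. It uses Python's re module instead of RE2, ignores LIKE escape characters, and the helper names like_to_regex() and like() are hypothetical; the dot_matches_newline parameter stands in for the regex option mentioned above.

```python
import re


def like_to_regex(like_pattern):
    """Translate '%' and '_' wildcards into regex; escape everything else."""
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(text, like_pattern, dot_matches_newline=True):
    """Return True if the whole text matches the LIKE pattern."""
    flags = re.DOTALL if dot_matches_newline else 0
    return re.fullmatch(like_to_regex(like_pattern), text, flags) is not None
```

With dot_matches_newline=False the sketch shows the old symptom: like("a\nb", "a_b", dot_matches_newline=False) is False, whereas the default returns True.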

Change-Id: I25367623f587bf151e4c87cc7cb6aec3cd57e41a
Reviewed-on: http://gerrit.cloudera.org:8080/1172
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2015-10-14 19:16:48 +00:00
Dan Hecht
84c4c2ce86 IMPALA-2480, IMPALA-2519: Don't force IO-buffer on probe side when spilling PHJ
This fixes a regression introduced with:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce
min mem needed for PAGG/PHJ

Prior to that change, as soon as any partition's stream overflowed
its small buffers, all partitions' streams would be switched
immediately to IO-buffers, which would be satisfied by the initial
buffer "reservation".

After that change, individual streams are switched to IO-buffers on
demand as they overflow their small buffers.  However, that change
also made it so that Partition::Spill() would eagerly switch that
partition's streams to IO-buffers, and fail the query if the buffer
is not available.  The buffer may not be available because the
reserved buffers may be in use by other partition's streams.

We don't need to fail the query if the switch to IO-buffers in
Partition::Spill() fails.  Instead, we should just let the streams
switch on demand as they fill up the small buffers.  When that
happens, if the IO buffer is not available, then we already have a
mechanism to pick partitions to spill until we can get the IO-buffer
(in the worst case it means working our way back down to the initial
reservation).  See AppendRowStreamFull() and BuildHashTables().

The symptom of this regression was that some queries would fail at a
lower memory limit than before.

Also revert the max_block_mgr_memory values back to their originals.

Additional testing: loop custom_cluster/spilling.py.  We should also
remeasure minimum memory required by queries after this change.

Change-Id: I11add15540606d42cd64f2af99f4e96140ae8bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1228
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-10-12 14:41:08 -07:00
Matthew Jacobs
16759d7989 IMPALA-2529: expr test case fails on non-partitioned HJ
Failure due to an issue with NULL tuples (IMPALA-2375)
where NULL tuples come from the right side of a left
outer join where the right side comes from an inline view
which produces 0 slots (e.g. the view selects a constant).
The HJ doesn't handle them correctly because the planner
inserts an IsTupleNull expr. This isn't an issue for the
PHJ because the BufferedTupleStream returns non-NULL Tuple*
ptrs even for tuples with no slots.

Per IMPALA-2375, we're going to address this after 2.3, so
moving this test case into joins-partitioned.test which only
runs on the PHJ.

Change-Id: I64cb7e8ffd60f3379aa8860135db5af8e66d686f
Reviewed-on: http://gerrit.cloudera.org:8080/1231
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-10-12 14:41:05 -07:00
Dan Hecht
0144fc3be6 IMPALA-2530: fix flaky test_analytic_fns test case
Commit: IMPALA-2265: Sorter was not checking the returned Status of
PrepareRead

added a new test case that sets a mem_limit. The mem_limit was
calibrated against functional_parquet tables, but during exhaustive
this test is run against other formats.  Other scanners will use
different amounts of memory causing this test to fail in exhaustive.
Fix the query to always run exactly as it was tuned.

Change-Id: I8140653825cb4f303ad569f087757148c756e42d
Reviewed-on: http://gerrit.cloudera.org:8080/1230
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-10-12 14:41:04 -07:00
Sailesh Mukil
c66ea8ad03 IMPALA-2514: DCHECK on destroying an ExprContext
If an error occurs while processing an Expr, only that ExprContext
is closed and the others are still left open. This leads to a DCHECK
during teardown of the query because the destructor of ExprContext
expects it to be closed.

In this patch, we close all the ExprContexts if an error occurs in any
one.
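
As a minimal Python sketch of that cleanup pattern (the ExprContext class here is a hypothetical stand-in, and exceptions take the place of Impala's Status return values):

```python
class ExprContext:
    """Hypothetical context that must be closed before it is destroyed."""

    def __init__(self, name, fail_on_open=False):
        self.name = name
        self.fail_on_open = fail_on_open
        self.closed = False

    def open(self):
        if self.fail_on_open:
            raise RuntimeError(f"error while processing expr {self.name}")

    def close(self):
        self.closed = True  # safe to call more than once


def open_all(contexts):
    """Open every context; if any one fails, close all of them before re-raising."""
    try:
        for ctx in contexts:
            ctx.open()
    except Exception:
        for ctx in contexts:
            ctx.close()
        raise
```

Closing every context, not just the failing one, is what keeps the teardown invariant (each context is closed before destruction) intact in this model.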

Change-Id: Ic748bfd213511c314c59594d075048f6c6d82073
Reviewed-on: http://gerrit.cloudera.org:8080/1222
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-10-12 14:41:00 -07:00
Skye Wanderman-Milne
b669800432 IMPALA-2484: make array allocation failure test less flaky
This query was occasionally succeeding. This patch lowers the mem
limit so it is less likely to succeed. It also adds the specific
"Failed to allocate buffer" message to the expected error so we don't
accidentally lose coverage. This test could still become flaky in the
future (I'm not sure 4mb is guaranteed to work every time, plus it
could change in the future), but I can't think of a better solution
than to continue adjusting it or only rely on more targeted unit
tests.

Change-Id: Id585f0e3b2c77a0279efffe9dd8b8b4472225730
Reviewed-on: http://gerrit.cloudera.org:8080/1207
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:47:48 -07:00
Skye Wanderman-Milne
cfe1e38d6e IMPALA-2495: make Expr::IsConstant() recurse on children
Before, Expr::IsConstant() manually specified the constant Expr
classes, but TupleIsNullPredicate and AnalyticExpr overrode
IsConstant() to always return false (which Expr::IsConstant() didn't
specify). This meant that unless the TupleIsNullPredicate was the root
expr, TupleIsNullPredicate::IsConstant() would never be called and
Expr::IsConstant() would return true. This patch changes
Expr::IsConstant() to recurse on its children, rather than having it
contain the constant logic for all expr types.
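
The recursive approach can be sketched in Python as follows. The class names echo the ones mentioned above, but this is a simplified model, not the Impala class hierarchy; Literal, SlotRef and FunctionCall are illustrative assumptions.

```python
class Expr:
    """Base expression: constant only if every child is constant."""

    def __init__(self, *children):
        self.children = list(children)

    def is_constant(self):
        # Recursing means an override anywhere in the tree is always consulted.
        return all(child.is_constant() for child in self.children)


class Literal(Expr):
    pass  # no children, so the base implementation returns True


class FunctionCall(Expr):
    pass  # constant exactly when all of its arguments are constant


class SlotRef(Expr):
    def is_constant(self):
        return False  # depends on the input row


class TupleIsNullPredicate(Expr):
    def is_constant(self):
        return False  # depends on the input row
```

In this model FunctionCall(Literal(), TupleIsNullPredicate()).is_constant() returns False even though the non-constant predicate is not the root expr, which is the case the message describes.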

Change-Id: I756eb945e04c791eff39c33305fe78d957ec29f4
Reviewed-on: http://gerrit.cloudera.org:8080/1214
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:47:46 -07:00
Alex Behm
bbaef98281 IMPALA-2445: Preserve chain of table refs until end of computeParentAndSubplanRefs().
The bug: While separating the parent table refs from the subplan refs, we immediately
changed the left table link of a chosen ref. However, we rely on the link structure
to correctly determine the required table ref ids, so in some cases we missed
a required table ref id due to a broken table ref chain.

The fix: Preserve the original chain of table refs until the end of
computeParentAndSubplanRefs().

TODO:
This fix has the unfortunate consequence of making the plan for nested TPCH Q21
worse. It should be possible to change the planning logic to be correct and
also get the original better plan for Q21, but this needs some more thought.

Change-Id: Ib8dc13c950f7783b62ce6ab7c8a6f534f9a9bb31
Reviewed-on: http://gerrit.cloudera.org:8080/1177
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:
	testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
2015-10-09 16:47:00 -07:00
Ippokratis Pandis
49b588a714 IMPALA-2265: Sorter was not checking the returned Status of PrepareRead
The sorter was dropping on the floor the returned Status of the
PrepareRead() calls. PrepareRead() tries to Pin() blocks. In some
queries with large sorts, those Pin() calls could fail with OOM,
but because the sorter was ignoring the returned Status it would
happily put the unpinned block in the vector of blocks and eventually
seg fault, because the buffer_desc_ of that block was NULL.

This patch fixes this problem and adds a test that eventually we may
want to move to the exhaustive build because it takes quite some time.
It also changes the comments of the sorter class to the doxygen style.

Change-Id: Icad48bcfbb97a68f2d51b015a37a7345ebb5e479
Reviewed-on: http://gerrit.cloudera.org:8080/1156
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:42:03 -07:00
Dimitris Tsirogiannis
38f29b048d IMPALA-2474: PlannerTest fails due to nested types file size mismatch
(part2)

With this commit we use regex when comparing the file sizes of table
'tpch_nested_parquet.region' in the PlannerTest.

Change-Id: I03fa177c9d36d60bcb5ce7eece8a5a7c98bb7985
Reviewed-on: http://gerrit.cloudera.org:8080/1216
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:
	testdata/workloads/functional-planner/queries/PlannerTest/nested-collections.test
	testdata/workloads/functional-planner/queries/PlannerTest/tpch-nested.test
2015-10-09 16:39:28 -07:00
Ippokratis Pandis
f1ef5170cb IMPALA-2168: Do not try to access streams of repartitioned spilled partition in right-joins
In case of right joins (right outer, right anti and full outer), if
a spilled partition was repartitioned we would try to access its
build rows stream, even though that was already set to NULL leading
to SEGV.

Change-Id: Ia570333c62a4da1152d8d47be9176ac024ba3f5f
Reviewed-on: http://gerrit.cloudera.org:8080/1209
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:33:14 -07:00
Sailesh Mukil
277a92a14a IMPALA-2479: Failure in TestParquet.test_verify_runtime_profile
The test_verify_runtime_profile test failed during C5.5 builds and
GVMs because this test relies on the table lineitem_multiblock to
have 3 blocks. However, due to the rules to load the data not being
followed in the functional_schema_template.sql file, the table ended
up being stored with only one block.

This change moves the data load to the end of create-load-data.sh
file which would load the data even for snapshots.

Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be
Reviewed-on: http://gerrit.cloudera.org:8080/1153
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-10-08 15:16:35 -07:00
Dan Hecht
13738e26e6 IMPALA-2474: PlannerTest fails due to nested types file size mismatch
For some reason, on each full data load, Hive seems to slightly change
the file size of the tpch_nested.customer files.  The PlannerTest
result comparator already supports regex:, so use that to regex
away the file size to get the builds working.

Change-Id: If84ac71bc3a309407efa6c597be71f83993c5533
Reviewed-on: http://gerrit.cloudera.org:8080/1148
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-08 15:16:33 -07:00
Sailesh Mukil
9f5fbdfb9a IMPALA-2497: Flaky test in analytic_fns.test
The test introduced as a part of IMPALA-2457 was flaky as the order
of the results was not enforced and could change. This patch forces
ordering and fixes this bug.

Change-Id: Ib03d2d93818f33835b267347dcd3a94062aa7475
Reviewed-on: http://gerrit.cloudera.org:8080/1189
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-10-07 14:47:42 -07:00
Juan Yu
41509ce3c1 IMPALA-2477: Parquet metadata randomly 'appears stale'
Stream::ReadBytes() could fail for reasons other than
'stale metadata'. Add an error code check to make sure
Impala returns the proper error message.

It also fixes IMPALA-2488: metadata.test_stale_metadata
fails on non-HDFS filesystems.

Change-Id: I9a25df3fb49f721bf68d1b07f42a96ce170abbaa
Reviewed-on: http://gerrit.cloudera.org:8080/1166
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-10-07 14:47:41 -07:00
Matthew Jacobs
38bc1c77b8 IMPALA-2375: Disabling/moving tests that don't work with the old HJ
Change-Id: I6d1d0d0edd3b60e854130c4d8b9fcbe765c1aba0
Reviewed-on: http://gerrit.cloudera.org:8080/1173
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-07 14:47:40 -07:00
Alex Behm
c153d094d4 IMPALA-2478: Unset the expr id of bound conjuncts.
The bug: When assigning bound collection conjuncts to scan, we
incorrectly marked the source of the bound conjunct as assigned
even though the source conjunct must also be evaluated by a join
(source conjunct is a where-clause conjunct bound by an
outer-joined tuple). The root issue was that a bound conjunct
retained the expr id of its source.

The fix: Unset the expr id of bound conjuncts to prevent callers
from inadvertently marking the source conjunct as assigned.

Change-Id: Ica775adfc551d9fc0457a2392c4988cb2eb7de72
Reviewed-on: http://gerrit.cloudera.org:8080/1149
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-07 14:47:38 -07:00
Sailesh Mukil
f06497e1d6 IMPALA-2457: PERCENT_RANK() returns NaN for row group with 1 row
The analytic function PERCENT_RANK() is implemented as a query
rewrite in the analysis stage. An edge case where the count of the
rows returned is 1 resulted in a divide by zero which returned NaN
as the result. This patch fixes that by creating a conditional
expression.
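
The edge case can be illustrated with the standard definition of PERCENT_RANK, (rank - 1) / (rows in partition - 1). The Python function below is a sketch of the guarded computation; the exact expression produced by the rewrite is not shown in this message, so returning 0 for a single-row partition follows the standard SQL definition rather than the patch itself.

```python
def percent_rank(rank, partition_count):
    """PERCENT_RANK for a row with the given 1-based rank in its partition."""
    if partition_count == 1:
        # Guard the (count - 1) denominator: a lone row has percent rank 0.
        return 0.0
    return (rank - 1) / (partition_count - 1)
```

Without the guard, a single-row partition would evaluate 0 / 0, which is the NaN result described above.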

Change-Id: Ic8d826363e4108e0246b8e844355f3382a4a3193
Reviewed-on: http://gerrit.cloudera.org:8080/1131
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-10-07 14:47:37 -07:00
Dimitris Tsirogiannis
a4d24954b5 IMPALA-2446: Fix wrong predicate assignment in outer joins
This commit fixes an issue where a predicate from the WHERE clause that
can be evaluated at a join node is incorrectly assigned to that node's
join conjuncts even if it is an outer join, thereby causing the join to
return wrong results.

Change-Id: Ibf83e4e2c7b618532b3635b312a70a2fa12a0286
Reviewed-on: http://gerrit.cloudera.org:8080/1129
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-06 10:54:10 -07:00
Alex Behm
d5e0e2eebc IMPALA-2456: For hash joins inside a subplan, open child(0) before doing the build.
The bug: A query with a subplan containing a hash join with unnest nodes
on both the build and probe sides would not project the collection-typed
slots referenced in unnest nodes of the probe side. The reason is that
we used to first complete the hash join build before opening the probe
side. Since the build does a deep copy, those collection-typed slots
to be unnested in the probe side would not be projected.

Example query that exhibited the bug:

subplan
  hash join
    nested-loop join
      singular row src
    unnest t.c1
  unnest t.c2
scan t

The tuple of 't' has two collection-typed slots, one for 't.c1', and
another for 't.c2'. If the hash join completes the build without
opening the probe side, then the 't.c2' slot would not be projected
and deep copied into the build-side hash table. That collection
would then be returned in GetNext() of the hash join.

The fix: For hash joins inside a subplan, open child(0) before doing
the build.

Change-Id: I569107b5ecafdbb75f3562707947ecc73951140c
Reviewed-on: http://gerrit.cloudera.org:8080/1128
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-10-06 10:54:08 -07:00
Alex Behm
b89d69da90 Nested Types: Enforce and test maximum nesting depth of 100.
The limit of 100 was determined empirically by generating deeply
nested Parquet and Avro files and then trying to run queries with
and without subplans over them (one absolute table ref vs. all relative
table refs for maximally nested subplans).

Based on those experiments we can handle up to 200 levels of nesting,
but the queries get very slow. At 300 levels, we exceed the stack space
due to the recursive implementation of the scan. Also, we decode the
rep/def levels of Parquet as uint8_t. I settled on 100 because it is
safe, future-proof and reasonably high for most practical cases.
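
A minimal sketch of such a depth check, assuming a hypothetical type
representation (nested tuples whose first element is the kind and whose
remaining elements are child types). The function names are illustrative
and are not the identifiers used in the patch. The walk is iterative so
that the check itself cannot exhaust the stack on deeply nested input.

```python
MAX_NESTING_DEPTH = 100  # limit stated in the commit message


def nesting_depth(type_desc):
    """Return how many complex types enclose the deepest leaf.

    Hypothetical representation: a scalar is a string such as "int";
    a complex type is a tuple such as ("array", child) or
    ("map", key_type, value_type).
    """
    deepest = 0
    stack = [(type_desc, 0)]
    while stack:
        node, depth = stack.pop()
        deepest = max(deepest, depth)
        if isinstance(node, str):
            continue  # scalar leaf, nothing below it
        for child in node[1:]:
            stack.append((child, depth + 1))
    return deepest


def check_nesting_depth(type_desc):
    """Raise ValueError if the type is nested deeper than the limit."""
    depth = nesting_depth(type_desc)
    if depth > MAX_NESTING_DEPTH:
        raise ValueError(
            "type nesting depth %d exceeds limit of %d"
            % (depth, MAX_NESTING_DEPTH))
    return depth


def nested_array(levels):
    """Build array<array<...<int>>> with the given number of levels."""
    type_desc = "int"
    for _ in range(levels):
        type_desc = ("array", type_desc)
    return type_desc
```

With this sketch, check_nesting_depth(nested_array(100)) passes and
check_nesting_depth(nested_array(101)) raises ValueError.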

Change-Id: Iebdfa96a6dd6060387e38eaedb8ddf0f9901ac24
Reviewed-on: http://gerrit.cloudera.org:8080/905
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:54 -07:00
Alex Behm
06c96e4074 IMPALA-2349,IMPALA-2412: Planner fixes to subplan ordering.
IMPALA-2349:
The bug was that we were not adding the parent tuple ids of a relative or
correlated table ref to the list of required tuple ids if that table ref
itself depended on a relative table ref (nested subplans).

This patch simplifies and fixes the planning with straight_join. The
required tuple ids are properly set, and the ordering requirement is
enforced by adding the last parent table ref's id to the list of
required table ref ids.

IMPALA-2412:
The bug was that we were relying on both the required materialized tuple
ids as well as the table ref ids to determine whether a table ref belongs
in a subplan at a certain level. However, as the existing comments in
the code actually already state, the subplan placement should be determined
only based on whether the required parent tuple ids are materialized.

The correct join/subplan ordering is independent, and is handled by
the required table ref ids.

Change-Id: I922fcbd0039242bf5940534d667926cdbdf72946
Reviewed-on: http://gerrit.cloudera.org:8080/907
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:51 -07:00
Dan Hecht
749c19c5fc Revert "IMPALA-2474: update file sizes in planner tests"
This reverts commit 54409d82a42623e156cd775810b9a76fc5ae7407.

Change-Id: I47b7a73a7a9741f27b13e8983efc9e3ddf0d67f7
Reviewed-on: http://gerrit.cloudera.org:8080/1147
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Dan Hecht <dhecht@cloudera.com>
2015-10-05 11:30:49 -07:00
Dan Hecht
86d0780b07 Move IMPALA-2259 test case to nested-types-runtime.test
These tests are failing in the old agg/join nightly.  This test
references a nested types table, so we need to skip it when the old
agg/join is enabled.  So, move it to a place where this already happens.

Change-Id: I09400760fd0b7506e4c127bbef92ab413d5d8615
Reviewed-on: http://gerrit.cloudera.org:8080/1143
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:46 -07:00
Dan Hecht
b2ba536563 IMPALA-2474: update file sizes in planner tests
The 5.5.x nightly data load build failed due to file sizes. Adjust
the file sizes to try to get the build passing again. We really
need to make the planner test smarter, but let's do that on
cdh5-trunk.

Change-Id: I51de031e771b75edcd2b00f1c24f9d0e21b3cf98
Reviewed-on: http://gerrit.cloudera.org:8080/1145
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Dan Hecht <dhecht@cloudera.com>
2015-10-05 11:30:45 -07:00
Tim Armstrong
5bbe2fe23d IMPALA-2469: decimal table not found in exhaustive build
Decimal tables are not generated for all file formats. The
analytic-fns test is run on some of these formats, so fails
when it cannot find the decimal_tiny table. Move the test to
the decimal test that handles this correctly.

Change-Id: Ic23b21ed90496fcc9f2f84cfd3dd92899d00498b
2015-10-05 11:30:41 -07:00
Sailesh Mukil
7778b3ded5 IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.
Impala supports reading Parquet files with multiple row groups but
with possible performance degradation due to remote reads. This patch
maximizes scan locality by allowing multiple impalads to scan the
rowgroups in their local splits. Each impalad starts a new scan range
for each split local to it if that split contains row group(s) that
need to be scanned.

Change-Id: Iaecc5fb8e89364780bc59dbfa9ae51d0d124d16e
Reviewed-on: http://gerrit.cloudera.org:8080/908
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:39 -07:00
Dan Hecht
df5271e7c1 Fix flaky test_spilling.py test case
The commit:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ

recently lowered a couple of limits from 100m to 40m, which appears to
be too aggressive based on occasional gvm failures.  Let's bump
it back up.

Change-Id: I2c3cc24841cf3a305785890329d77e4e9e74f6e5
Reviewed-on: http://gerrit.cloudera.org:8080/1125
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:34 -07:00
Alex Behm
6866c3045b IMPALA-2430: Mark unused collection-typed slots as non-materialized.
The bug: During planning, when generating an EmptySetNode for a query
block (or a portion thereof) that contained relative table refs we still
populated the corresponding collection-typed slots in the parent scan,
ultimately hitting a sanity DCHECK in the BE.

The fix: Since those collection-typed slots are never used, the
corresponding parent scan should not materialize them. When creating
an EmptySetNode we mark the appropriate collection-typed slots
as non-materialized.

Change-Id: If0b9c37c46c0e27be7f1b47c395c8c90b499e323
Reviewed-on: http://gerrit.cloudera.org:8080/1092
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:43 -07:00
Tim Armstrong
75887730cb IMPALA-2233: avoid loss of precision in function arguments
This patch changes the resolution of overloaded functions so that we
prefer functions where there is no loss of precision in argument types.
Previously, the logic would happily convert DECIMAL to FLOAT even if
there was a more suitable overload available.  E.g. greatest(TINYINT,
DECIMAL) was resolved to greatest(FLOAT...) instead of greatest(DECIMAL).
This only changes behaviour when no overload exactly matches the argument
types, but the arguments can be converted with no loss of precision,
e.g. TINYINT to DECIMAL.

This patch introduces a conceptual distinction between strict and
non-strict compatibility. All contexts aside from function matching
use non-strict to support the current behavior of implicitly casting
decimals to floats/doubles.

This patch also makes resolution of overloaded functions consistent
regardless of what order functions were added to the Db - overloads are
checked in a canonical order.

Switching to this canonical order revealed further problems with
overload resolution where the correct overload was selected only because
of the order in which it was added to the database. For example, the
logic equally preferred resolving fn(STRING, TINYINT) to
fn(TIMESTAMP, INT) or fn(STRING, INT). This required changes to the
compatibility matrix.

Various cleanup and simplification of the type compatibility logic is
also included.
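
A minimal sketch of strict-first overload resolution over a canonical
order. Everything here is an assumption made for illustration: the cast
tables are a small hypothetical subset rather than Impala's compatibility
matrix, sorting signatures alphabetically stands in for the canonical
order, and the fallback to non-strict matching when no lossless overload
exists is assumed rather than taken from the patch.

```python
# Hypothetical subset of implicit casts that lose no precision.
STRICT_CASTS = {
    ("TINYINT", "INT"), ("TINYINT", "FLOAT"), ("TINYINT", "DOUBLE"),
    ("TINYINT", "DECIMAL"), ("INT", "DOUBLE"), ("INT", "DECIMAL"),
    ("FLOAT", "DOUBLE"),
}
# Casts that are allowed only under non-strict compatibility.
LOSSY_CASTS = {("DECIMAL", "FLOAT"), ("DECIMAL", "DOUBLE")}


def compatible(src, dst, strict):
    """True if an argument of type src may be passed as type dst."""
    if src == dst or (src, dst) in STRICT_CASTS:
        return True
    return not strict and (src, dst) in LOSSY_CASTS


def resolve(overloads, arg_types):
    """Pick a signature (tuple of type names) for the given arguments.

    Candidates are visited in sorted order, so the result does not depend
    on the order in which overloads were registered.
    """
    args = tuple(arg_types)
    candidates = sorted(sig for sig in overloads if len(sig) == len(args))
    if args in candidates:
        return args  # exact match always wins
    # Prefer lossless conversions, then fall back to lossy ones.
    for strict in (True, False):
        for sig in candidates:
            if all(compatible(a, p, strict) for a, p in zip(args, sig)):
                return sig
    return None
```

For example, resolve([("FLOAT", "FLOAT"), ("DECIMAL", "DECIMAL")],
["TINYINT", "DECIMAL"]) returns ("DECIMAL", "DECIMAL") in this sketch,
whichever order the two overloads are listed in.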

Change-Id: I50e657c78cdcb925b616b5b088b801510020e255
Reviewed-on: http://gerrit.cloudera.org:8080/845
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:40 -07:00
Skye Wanderman-Milne
68fef6a5bf IMPALA-2213: make Parquet scanner fail query if the file size metadata is stale
This patch changes the Parquet scanner to check whether it can read the
full footer scan range. If it cannot, the file has been overwritten by a
shorter file without refreshing the table metadata. Previously it would
hit a DCHECK. This patch adds a test for this case, as well as the case
where the new file is longer than the metadata states (which fails
with an existing error).
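
A minimal sketch of the short-file check, assuming the file contents are
available as a bytes object and that the footer scan range covers the
last footer_size bytes of the length recorded in the metadata. The names
and the default range size are hypothetical and are not taken from the
scanner.

```python
FOOTER_SIZE = 100 * 1024  # hypothetical footer scan range size, in bytes


class StaleMetadataError(Exception):
    """Raised when the file is shorter than the table metadata claims."""


def read_footer_range(file_bytes, metadata_file_len, footer_size=FOOTER_SIZE):
    """Return the footer scan range, or fail if it cannot be read in full.

    Only the short-file case is detected here. A file that is longer than
    the metadata states still yields a full-length read and has to be
    caught by other validation.
    """
    start = max(0, metadata_file_len - footer_size)
    expected_len = metadata_file_len - start
    footer = file_bytes[start:metadata_file_len]
    if len(footer) < expected_len:
        raise StaleMetadataError(
            "read %d of %d footer bytes; metadata says the file is %d bytes"
            % (len(footer), expected_len, metadata_file_len))
    return footer
```

The sketch fails the read with an error instead of asserting, which is
the behavior change the patch describes.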

Change-Id: Ie2031ac2dc90e4f2573bd3ca8a3709db60424f07
Reviewed-on: http://gerrit.cloudera.org:8080/1084
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-10-01 13:58:39 -07:00
Skye Wanderman-Milne
0c5e6a804f IMPALA-2443: add support for more Parquet array encodings
This patch adds full support for the various Parquet array encodings,
as well as tests that use files from
https://github.com/apache/hive/tree/master/data/files. This should
allow us to read any existing array data.

Change-Id: I3d22ae237b1dc82ee75a83c1d4890d76316fadee
Reviewed-on: http://gerrit.cloudera.org:8080/826
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:37 -07:00
Matthew Jacobs
851056489d IMPALA-2440: Fix old HJ full outer join with no rows
When a full outer join runs on the old (non-partitioned)
HashJoinNode and any join fragment has 0 build rows and 0
probe rows, an extra null row will be produced.

Change-Id: I75373edc4f6b3b0c23afba3c1fa363c613f23507
Reviewed-on: http://gerrit.cloudera.org:8080/1068
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:47 -07:00
Alex Behm
cb713840b7 IMPALA-2434: Always set the eos return value in SubplanNode::GetNext().
The bug was that SubplanNode::GetNext() was not explicitly setting the returned
eos to false if eos had not been reached yet. As a result, a UnionNode with
a SubplanNode as its second operand could return fewer rows than expected because
the eos was carried over from the previous union operand, and only a single batch
was returned from the SubplanNode.

The fix is to always set the eos return value in SubplanNode::GetNext().
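
A minimal sketch of the carry-over, using a hypothetical Python model in
which a mutable flag object stands in for the eos out-parameter shared
across union operands. The classes are illustrative and are not the
backend code. Each source is assumed to hold at least one batch.

```python
class EosFlag:
    """Stand-in for an eos out-parameter that the caller reuses."""

    def __init__(self):
        self.value = False


class BatchSource:
    """Returns one batch per get_next() call."""

    def __init__(self, batches, always_set_eos):
        self._batches = list(batches)
        self._pos = 0
        self._always_set_eos = always_set_eos

    def get_next(self, eos):
        batch = self._batches[self._pos]
        self._pos += 1
        if self._pos == len(self._batches):
            eos.value = True
        elif self._always_set_eos:
            # The fix: explicitly report that more batches remain.
            eos.value = False
        # Without the elif branch, eos keeps whatever the caller passed in.
        return batch


def union_all(children):
    """Drain each child in turn, reusing one eos flag for all of them."""
    eos = EosFlag()
    rows = []
    for child in children:
        while True:
            rows.extend(child.get_next(eos))
            if eos.value:
                break
    return rows
```

When the second operand leaves eos untouched, the stale True left by the
first operand ends the loop after a single batch. When it always sets
eos, every batch is returned.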

Change-Id: I9f7516d7b740b9e089ea29b9fe416f3f47314e2c
Reviewed-on: http://gerrit.cloudera.org:8080/1076
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:44 -07:00