Commit Graph

494 Commits

Author SHA1 Message Date
Dimitris Tsirogiannis
dd5ecb9deb IMPALA-1960: Illegal reference to non-materialized tuple when query has
an empty select-project-join block

This commit fixes an issue where an aggregation expr may reference a
non-materialized slot if the query contains an empty select-project-join
block. This fix ensures that all the exprs in an aggregation reference
materialized slots/tuples.

Change-Id: Ic2cc9818061b3f06ab1d1cebf4e604352c2df6d1
Reviewed-on: http://gerrit.cloudera.org:8080/348
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-04-21 23:29:14 +00:00
Henry Robinson
f22b8659fd IMPALA-1595: Add 'location' to SHOW [TABLE STATS|PARTITIONS] for HDFS tables
This patch adds a 'location' column to the output of SHOW TABLE STATS /
SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE
SET LOCATION commands, particularly for partitions, and is easier to
identify than the output of DESCRIBE FORMATTED.

Some existing tests in alter-table.test have been updated to include
checking the location output before and after a SET LOCATION
command. The tests in show.test have also been updated to check for the
location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a
generic regex to avoid overly verbose tests.

Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb
Reviewed-on: http://gerrit.cloudera.org:8080/316
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2015-04-21 19:27:50 +00:00
Alex Behm
7067a5d94d IMPALA-1519: Fix wrapping of exprs via a TupleIsNullPredicate with analytics.
The bug:
Analytic functions introduced a few challenges in properly wrapping
exprs with TupleIsNullPredicates when substituting exprs from outer-joined
inline views.

1. The logical to physical tuple mapping during the plan generation of analytics
invalidated the tuple ids originally set in upstream TupleIsNullPredicates
introduced during analysis (e.g., in the result exprs).

2. TupleIsNullPredicates require specific tuple ids for evaluation.
Since a sort node materializes a new tuple, it's impossible to evaluate
TupleIsNullPredicates referring to a sort's input after the sort.
Non-analytic sorts handle this case during analysis by materializing
the result of that select block. However, analytic sorts used to only materialize
the slots of materialized tuple ids of the input plan node.

The fixes:

1. Move the TupleIsNullPredicate wrapping from the inline-view analysis into
the inline-view planning. This avoids the original problem because all physical
output tuples are known during plan generation. This simple change has a few
subtle consequences: First, we must rely on the plan root's output smap for
substituting the final result exprs, and *not* use the top-level base table smap
generated during analysis. Second, during plan generation we must use an inline
view's smap (and *not* its base table smap) for generating the output smap of its
plan such that we can properly wrap the rhs exprs in TupleIsNullPredicates
at every level.
This change also fixes IMPALA-1946 by deferring the TupleIsNullWrapping to
planning time.

2. To preserve the information whether an input tuple was null or not at an
analytic sort, we materialize TupleIsNullPredicates, which are then substituted
by a SlotRef into the sort's tuple in ancestor nodes.

This patch also cleans up and consolidates the code used for wrapping exprs into
TupleIsNullPredicate itself.

Change-Id: I5c6d142bdf9c99ece2a564e557d4ffe22ac90865
Reviewed-on: http://gerrit.cloudera.org:8080/317
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-04-14 23:33:20 +00:00
Dimitris Tsirogiannis
d8e5bbe2da IMPALA-1949: Analysis exception when a binary operator contains an IN
operator with values

This commit fixes an issue where a query is not successfully analyzed if an
IN operator with values appears in a binary predicate.

Change-Id: Ia3b83803a553b9a3b3489382fc53978a720c4b4f
Reviewed-on: http://gerrit.cloudera.org:8080/334
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-04-14 03:54:33 +00:00
Dimitris Tsirogiannis
4eceeacf16 IMPALA-1550: Invalid rewrite when EXISTS subqueries contain aggregate
functions

This commit fixes an issue where a [NOT] EXISTS subquery that contains
an aggregate function will sometimes be incorrectly rewritten into a
join, thereby returning incorrect results.
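
One way such a rewrite can go wrong is sketched below. The commit does not say which case triggered the bug, so this is only an illustration of the general hazard: an aggregate without GROUP BY always yields exactly one row, so EXISTS over it is always true, whereas a join keeps only the matching rows. The sketch uses SQLite through Python's standard library rather than Impala, and the tables t1 and t2 are hypothetical.

```python
import sqlite3

# Hypothetical tables; SQLite stands in for Impala here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER);
    CREATE TABLE t2 (id INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1);
""")

# count(*) without GROUP BY returns one row even when nothing matches,
# so the EXISTS predicate is true for every row of t1.
exists_rows = conn.execute(
    "SELECT id FROM t1 WHERE EXISTS "
    "(SELECT count(*) FROM t2 WHERE t2.id = t1.id) ORDER BY id"
).fetchall()

# A naive rewrite into a join keeps only rows with a match in t2.
join_rows = conn.execute(
    "SELECT DISTINCT t1.id FROM t1 JOIN t2 ON t2.id = t1.id ORDER BY t1.id"
).fetchall()

print(exists_rows)  # [(1,), (2,), (3,)]
print(join_rows)    # [(1,)]
```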

Change-Id: I18b211d76ee3de77d8061603ff5bb1fbceae2e60
Reviewed-on: http://gerrit.cloudera.org:8080/266
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-04-02 19:11:00 +00:00
Juan Yu
e121bc9b0a IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line.
I did a local benchmark and there's minimal performance impact (<1%).

Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-03-20 19:58:50 -07:00
Skye Wanderman-Milne
9d6586cdb8 Addendum to IMPALA-1755 patch
This patch introduces SetLookup functionality for timestamp and
decimal types, as well as addressing remaining code review comments.

Change-Id: Ied40d2d55adbdea891ff2ab97b30f0d3986645f9
Reviewed-on: http://gerrit.cloudera.org:8080/245
Tested-by: Internal Jenkins
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2015-03-20 14:37:23 -07:00
Matthew Jacobs
e8527ddb8e IMPALA-1888: FIRST_VALUE may produce incorrect results with preceding windows
Fixes a bug where FIRST_VALUE may produce incorrect results (or a DCHECK
failure in debug) when there is a window like "ROWS X PRECEDING Y PRECEDING",
such that X < Y and X > the size of a partition.

For windows with an end boundary that is PRECEDING (i.e.
the entire window is before a row), there is some special handling between
partitions, and the logic was not correct in some corner cases for FIRST_VALUE.

Change-Id: Ied5d440684e99dcaf60b47489c90300891f09b91
Reviewed-on: http://gerrit.cloudera.org:8080/236
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-03-20 14:37:19 -07:00
Skye Wanderman-Milne
5118c55a0a IMPALA-1810: IN predicate was not comparing DecimalVals correctly
The IN predicate wasn't using the decimal type when comparing decimal
values. I benchmarked this on a modified version of TPCDS-Q8 (i.e. a
query with a huge decimal IN predicate) and there is a ~5% performance
degradation with codegen enabled (surprisingly, there appears to be a
slight performance gain with codegen disabled). We should be able to
remove this penalty when we add constant injection via codegen.

Change-Id: Ie1296fd50c68d06a343701442da49fe8d3cd16dd
Reviewed-on: http://gerrit.cloudera.org:8080/230
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2015-03-20 14:37:18 -07:00
Alex Behm
745e64a096 IMPALA-1837: Handle truncation when implicitly casting a literal to a decimal.
Implicit casting to decimals allows truncating digits from the left of the
decimal point (see TypesUtil). A literal that is implicitly cast to a decimal
with truncation is wrapped into a CastExpr so the BE can evaluate it and report
a warning. This behavior is consistent with casting/overflow of non-constant
exprs that return decimal.
IMPALA-1837: Without the CastExpr wrapping, such literals can exceed the max
expected byte size sent to the BE in toThrift().

Change-Id: Icd7b8751b39b8031832eec04bd8eac7d7000ddf8
Reviewed-on: http://gerrit.cloudera.org:8080/195
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-03-11 19:58:58 -07:00
Ippokratis Pandis
e36c436fa6 Adding tests with right joins and duplicates
Those tests were added as part of the new hash table implementation, as we didn't have
tests with right joins and duplicates (and other conjuncts) as well as aggregation
distinct queries with group bys on multiple columns. Adding them as a separate
patch to improve testing coverage in the 2.2 release branch.

Change-Id: Id1b4f27fa6e587b2031635974ac9d2d39a1b015a
Reviewed-on: http://gerrit.cloudera.org:8080/193
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-03-11 16:39:40 -07:00
Dan Hecht
25b54eac1e S3: Fix test_multiple_filesystems.py
The filesizes changed slightly, causing the S3 CI build to fail.
Let's regex the file sizes in the compute stats expected results.

Change-Id: Ie95bdf3a253a28aa2b6f3deb281948780ca2cc6a
Reviewed-on: http://gerrit.cloudera.org:8080/200
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Dan Hecht <dhecht@cloudera.com>
2015-03-11 16:39:39 -07:00
Dan Hecht
2916132283 S3: enable more tests for S3
As needed, fix up file paths and other misc things to get
more test cases running against S3.

Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad
Reviewed-on: http://gerrit.cloudera.org:8080/167
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-11 16:39:39 -07:00
Alex Behm
adb19deece Re-enable tests that had been temporarily removed to unblock the full data load.
The following commits disabled tests to unblock the full data load:
a00a9a5e53f7a8e7a1e3c931ea0e4b7db21c6f00
bf29d06f2e53bb924d250275d51f5ccd1213531d

This patch re-enables those tests and adds new tests to guard against
regressions to HIVE-6308.

Unfortunately, we cannot completely remove the analysis check for HIVE-6308
in our code, because there is still one case where COMPUTE STATS will fail on
a Hive-created Avro table: If there is a mismatch in column names between
the Avro schema and the column defs given to a CREATE TABLE in Hive.

Change-Id: I81ae6b526db02fdfc634e09eeb9d12036e2adfdd
Reviewed-on: http://gerrit.cloudera.org:8080/180
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-03-11 16:39:38 -07:00
Dan Hecht
5aa8195534 S3: add end-to-end test for multiple filesystems
Verify DDL and queries when a table spans multiple filesystems
and across tables that live on different filesystems.

Change-Id: I4258bebae4a5a2758666f5c2e283eb2d205c995e
Reviewed-on: http://gerrit.cloudera.org:8080/166
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-11 16:39:38 -07:00
Matthew Jacobs
7216d09fe7 IMPALA-1808: AnalyticEvalNode cannot handle partition/order by exprs with NaN
Analytic function evaluation was broken when partition or order by exprs
evaluated to NaN (i.e. 0/0). We were relying on the comparison of the
current row with the previous row to be equal (i.e. x == x), but x != x
if x is NaN, and in the case of the very first row in the stream, some
logic breaks if x != x. The fix is to handle the very first row specially.
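
The idea can be sketched roughly as follows. This is not the AnalyticEvalNode code; partition_starts is a hypothetical helper, and treating NaN keys as equal to each other for grouping is an assumption of the sketch. The point is that the first row is never compared against a previous row.

```python
import math

def partition_starts(keys):
    """Return the indices at which a new partition begins.

    Hypothetical helper: row 0 always starts a partition, so it is never
    compared against a nonexistent previous row. NaN keys are treated as
    equal to each other (an assumption of this sketch).
    """
    def same(a, b):
        both_nan = (
            isinstance(a, float) and isinstance(b, float)
            and math.isnan(a) and math.isnan(b)
        )
        return both_nan or a == b

    starts = []
    for i, key in enumerate(keys):
        if i == 0 or not same(keys[i - 1], key):
            starts.append(i)
    return starts

nan = float("nan")
print(nan == nan)                                   # False
print(partition_starts([nan, nan, 1.0, 1.0, 2.0]))  # [0, 2, 4]
```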

Change-Id: I1c33445d55a70c7f107f05eeadef272b7973ee11
Reviewed-on: http://gerrit.cloudera.org:8080/179
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2015-03-11 16:39:37 -07:00
Dimitris Tsirogiannis
c88d179413 IMPALA-1636: Generalize index-based partition pruning to allow constant
expressions

This commit enables fast partition pruning for cases where constant
expressions appear in binary or IN predicates. During partition pruning,
the constant expressions are evaluated in the BE and are replaced by the
computed results as LiteralExprs.

Change-Id: Ie8a2accf260391117559dc6c0a565f907c516478
Reviewed-on: http://gerrit.cloudera.org:8080/144
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-03-07 09:51:27 +00:00
Henry Robinson
146fe64a26 IMPALA-1615: Don't drop row count during DROP INCREMENTAL STATS
Change-Id: I1ae23ca9d70eeb58a3c7c8c59fb633832edcff58
Reviewed-on: http://gerrit.cloudera.org:8080/148
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-03-05 20:22:49 +00:00
Dan Hecht
41e3b6b61e S3: fix grant_revoke test to run against S3
1) Fix up locations to take FILESYSTEM_PREFIX into account
   so we can run the test against non-default FS.
2) Fix up results and catch sections.
3) Since S3 doesn't support INSERT, split the test into
   another version that expects different results for the
   INSERT part.  The rest of the test is identical, and
   we can remove this new .test file once INSERT is supported.

Change-Id: I50d21048b846aa985d1eefc50fc33bda05ebe509
Reviewed-on: http://gerrit.cloudera.org:8080/146
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-05 18:16:45 +00:00
zuowang
16792b28be IMPALA-1437: Implement SHOW FILES IN <table>
Query: SHOW FILES IN db.table

Result:
+---------------------------------------------+------+---------------------+
| path                                        | size | partition           |
+---------------------------------------------+------+---------------------+
| hdfs://namenode/path/to/partition/file1.dat | 128B | year=2010, month=11 |
| hdfs://namenode/path/to/partition/file2.dat | 256B | year=2010, month=12 |
| hdfs://namenode/path/to/partition/file3.dat | 1.3G | year=2011, month=1  |
+---------------------------------------------+------+---------------------+

Query: SHOW FILES IN db.table PARTITION(year=2010, month=12)

Result:
+---------------------------------------------+------+---------------------+
| path                                        | size | partition           |
+---------------------------------------------+------+---------------------+
| hdfs://namenode/path/to/partition/file2.dat | 256B | year=2010, month=12 |
+---------------------------------------------+------+---------------------+

Only HDFS tables are supported. Other kinds of tables throw an exception.

The PARTITION clause is optional. An exception is thrown if the specified partition cannot be found.

Change-Id: I6480ed87ab6cdfb02a60bffa72a8047a161f92ab
Reviewed-on: http://gerrit.cloudera.org:8080/19
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2015-03-05 05:13:50 +00:00
Matthew Jacobs
27209e4cb1 Fix exhaustive tests: move analytic fn tests using decimal_tbl to decimal.test
Change-Id: Iaaa5bd59b27d2db2736874e96d38cb823f6e4a56
Reviewed-on: http://gerrit.cloudera.org:8080/147
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-03-05 03:05:49 +00:00
Matthew Jacobs
0c8022b9bc IMPALA-1559: FIRST_VALUE rewrite fn type might not match slot type
In some cases, the lookup for the first_value_rewrite function could
return a different fn with the wrong (yet technically compatible) type.

first_value_rewrite takes 2 parameters, the first is the parameter to
first_value, and the second is an integer explicitly added by the FE during
analysis. first_value_rewrite has signatures where the first parameter can
be of any type and the second is always a BIGINT. When the FE inserts the
second parameter, it may not actually be a BIGINT (e.g. it could be a
TINYINT, SMALLINT, etc.). In this case, the function arguments will not
match one of the registered signatures exactly and some compatible function
will be returned. For example, FIRST_VALUE(1.1) has a DECIMAL parameter
and if a TINYINT/SMALLINT/INT parameter is added for the rewrite, then
the first_value_rewrite fn lookup happened to match the fn taking a
FLOAT and BIGINT. (Ideally DECIMAL would not be implicitly castable to
a FLOAT/DOUBLE, but NumericLiterals do allow this casting.)

As a result, the agg fn actually returned a FLOAT while the AnalyticExpr
was of the type DECIMAL, and the analytic tuple contained a DECIMAL slot
which would throw a DCHECK in the BE (or perhaps crash in retail builds).

This fixes the issue by setting the NumericLiterals to be explicitlyCast,
i.e. the type will not change if re-analyzed. Then the correct fn signature
is found.

Change-Id: I1cefa3e29734ae647bd690263bb63f08f10ea8b9
Reviewed-on: http://gerrit.cloudera.org:8080/136
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-03-04 05:22:53 +00:00
Dan Hecht
c8fb10f50a S3: Some more work toward enabling additional S3 test coverage
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3 to help see what coverage is missing.  Soon
we'll be reworking some tests and/or adding new tests to get back the
important gaps.

Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables.  This is a step toward enabling some
more tests against S3.

Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this fails since the ms db is in
use.

Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-03 08:29:13 +00:00
Matthew Jacobs
296d1bba2f IMPALA-1562: AnalyticEvalNode not properly handling nullable tuples
When an analytic fn does not contain partition or order by exprs (i.e. empty
OVER() clause), we should not be comparing the previous and current rows.
It is not necessary because the analytic fn is applied to the entire input,
and attempting to access the child tuples could reference invalid memory
because there might be nullable tuples. When there are either partition or
orderby exprs, then there is a sort node preceding the analytic node and the
sort node always produces non-null tuples (though tuples may have all null
slots).

Change-Id: I5788295682b4c9a1dd8a3078e11da5767f12214c
Reviewed-on: http://gerrit.cloudera.org:8080/129
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-02-28 05:01:15 +00:00
Ippokratis Pandis
d58aedff42 IMPALA-1820: Start with small pages for hash tables during repartitioning
The change of the PARTITION_FANOUT from 32 to 16 exposed a pathological case due to
the lack of coordination across concurrently executing spilling nodes of the same query.
In particular, when we repartition a partition we try to initialize hash tables for the
new partitions. But each hash table needs a block (for the nodes). In case there were not
any IO-sized blocks available, because they had been consumed by other nodes, we would get
into a loop trying to repartition those smaller partitions that couldn't initialize their
hash table. These additional repartitions would, among other things, need additional blocks for
the new streams. The resulting partitions would end up being very small, yet we would still fail the
query upon reaching the MAX_PARTITION_DEPTH limit, which was fixed at 4.

This patch fixes the problem by initializing the hash tables during repartitions with
small pages. That is, the hash tables always first use a 64KB and a 512KB block for their
nodes before switching to IO-sized blocks. This helps the partitioning algorithm to
finish when we end up with partitions that can fit in those small pages. The performance
may not be optimal, still the memory consumption is lower and the algorithm finishes. For
example, without this patch and with PARTITION_FANOUT == 16 in order to run TPC-H Q18 and
Q20 we needed 3.4GB and 3.1GB respectively. With this patch TPC-H Q18 needs ~1GB and Q20
975MB.

This patch also removes the restriction of stopping repartitioning when we are reaching
4 levels of repartitioning. Instead, whenever we repartition we compare the size of
the input partition to the size of the largest new partition. If there is no reduction
on the size we stop the algorithm. Otherwise, we keep on repartitioning. That should
help in cases of skew (e.g. due to bad hashing). There is a new MAX_PARTITION_DEPTH limit
of 16. It is very unlikely we will ever hit this limit.
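
The new stopping rule can be sketched roughly as below. The function name and its arguments are hypothetical and sizes are in arbitrary units. The message does not spell out what happens once the depth limit is reached, so the sketch simply stops repartitioning at that point.

```python
MAX_PARTITION_DEPTH = 16  # the new limit named in this commit message

def should_keep_repartitioning(input_size, new_partition_sizes, depth):
    """Decide whether another repartitioning pass is worthwhile.

    Hypothetical helper: stop when the largest new partition is no
    smaller than the input partition (no reduction, e.g. under skew),
    or when the depth limit has been reached.
    """
    if depth >= MAX_PARTITION_DEPTH:
        return False
    return max(new_partition_sizes) < input_size

print(should_keep_repartitioning(1000, [300, 400, 300], depth=4))  # True
print(should_keep_repartitioning(1000, [1000, 0, 0], depth=4))     # False
```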

Change-Id: Ib33fece10585448bc2d07bb39d0535d78b168ccc
Reviewed-on: http://gerrit.cloudera.org:8080/119
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-02-28 00:42:04 +00:00
Dan Hecht
99d3caacb7 Enable local filesystem tables
The S3 work really enabled any Hadoop FileSystem to work with Impala,
but a small tweak is needed for LocalFileSystem due to how the Hadoop
Path code deals with URIs that don't have an authority component.

While we aren't claiming support for arbitrary FileSystems at this
time, it is useful to test this.  Since the S3 testing is done as a
nightly test rather than pre-checkin, we can use the LocalFileSystem to
regression test that:

1) Impala can access table data living on a secondary filesystem,
   i.e. not the filesystem specified by fs.defaultFS.
2) Impala does not make assumptions that the filesystem has type
   DistributedFileSystem.

Change-Id: Ie9b858ea440c9b3b332602e034c8052b168c57da
Reviewed-on: http://gerrit.cloudera.org:8080/121
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-02-27 18:48:56 +00:00
Alex Behm
37ca6b81ae IMPALA-1567: Ignore 'hidden' files with special suffixes.
Currently, we only consider files hidden if they have the special
prefixes "." or "_". However, some tools use special suffixes
to indicate a file is being operated on, and should be considered
invisible.

This patch adds the following hidden suffixes:
'.tmp' - Flume's default for temp files
'.copying' - hdfs put may produce these
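
A minimal sketch of the resulting check, using only the prefixes and suffixes listed above. The function name is hypothetical, and case-sensitive matching is an assumption of the sketch.

```python
# Prefixes and suffixes taken from this commit message.
HIDDEN_PREFIXES = (".", "_")
HIDDEN_SUFFIXES = (".tmp", ".copying")

def is_hidden(file_name):
    """Return True if the file should be treated as invisible.

    Hypothetical helper; matching is case-sensitive by assumption.
    """
    return file_name.startswith(HIDDEN_PREFIXES) or file_name.endswith(HIDDEN_SUFFIXES)

print(is_hidden("part-00000.tmp"))  # True
print(is_hidden("_SUCCESS"))        # True
print(is_hidden("part-00000"))      # False
```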

Change-Id: I151eafd0286fa91e062407e12dd71cfddd442430
Reviewed-on: http://gerrit.cloudera.org:8080/80
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-02-24 10:55:22 +00:00
Alex Behm
ffd124b48e IMPALA-1711: Save, drop and restore col stats when renaming a table across dbs.
The HMS does not properly migrate column stats when moving a table across
databases (HIVE-9720). Apart from losing the stats, this issue also prevents
the newly renamed table from being dropped.

To work around the issue, this patch manually drops+adds the column stats
when renaming a table across databases.

Change-Id: If901c5d1e9a6b2cedc35034a537f18c361c8ffa1
Reviewed-on: http://gerrit.cloudera.org:8080/72
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-02-24 00:10:35 +00:00
ishaan
8369c3b13b Remove explicit references to functional_hbase tables from .test files.
Additionally, this patch also disabled the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 or isilon.

Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-02-23 23:32:41 +00:00
Alex Behm
ad6b9364c0 IMPALA-1629: Compute stats properly updates CHAR/VARCHAR column stats.
The problem was that VARCHAR was not present in a few switch statements
for updating/populating column stats in various places.

Change-Id: I0b2b316b734d27a7ff08701b0986014be2473443
Reviewed-on: http://gerrit.cloudera.org:8080/65
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-02-19 22:53:42 +00:00
casey
dbc504fad1 IMPALA-1579: UNIX_TIMESTAMP() should return BIGINTs instead of INTs
This should fix the last y2k38 problem. Previously calling
unix_timestamp() with an input of '2038-01-19 03:14:08' or later would
return a negative value due to a 32 bit int overflow. This patch
switches from 32 to 64 bit ints.
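
The overflow can be reproduced with a short sketch. The to_int32 helper is hypothetical and only emulates two's-complement wraparound of a signed 32-bit integer. The timestamp is interpreted as UTC.

```python
import struct
from datetime import datetime, timezone

def to_int32(value):
    """Wrap an integer to signed 32 bits (two's complement), as a 32-bit int would."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]

ts = datetime(2038, 1, 19, 3, 14, 8, tzinfo=timezone.utc)
seconds = int(ts.timestamp())

print(seconds)            # 2147483648, i.e. 2**31
print(to_int32(seconds))  # -2147483648, the negative value seen with 32-bit ints
```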

Change-Id: Ic9180887d6c828f6ecd25435be86fd0bd52d3f0d
Reviewed-on: http://gerrit.cloudera.org:8080/61
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-02-16 00:59:34 +00:00
Martin Grund
5578c0668f IMPALA-1750: Fix Validate HDFS Cache Parameters for Upgrades
In the case that a cached table was created in a version of Impala that
did not have the property for the cache replication factor, the loading
of the table will fail until the table is un-cached and cached again.

This patch fixes this behavior and ignores this missing parameter.

Change-Id: I118020dd5bd7fb203d91853d5ef946f2c4c8a695
Reviewed-on: http://gerrit.cloudera.org:8080/48
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-02-14 00:54:22 +00:00
Martin Grund
fdafbc5709 IMPALA-1645 and IMPALA-1632: Verify Cache Directives
When a table is loaded in the catalog, we will now perform a check to
verify that the cache directive ID and cache replication factor are still
valid and the data is current.

If the cache directive no longer exists, we issue an error message
and mark the table / partition as uncached. Furthermore, the replication
factor is updated with the information from the actual cache directive.

In the case of insert statement there is a special situation as the
catalog update is happening synchronously and will try to access the
cache directive information that might be stale. Thus in this insert
path, we catch the possible not found exception and reset the caching
information.

Change-Id: I882041ce5395b8a3d17e9fc2750053393340df65
Reviewed-on: http://gerrit.cloudera.org:8080/40
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-02-11 03:35:46 +00:00
Juan Yu
a7e95e0992 IMPALA-1614: Compute stats fails if table name starts with number
Change-Id: Iedac1ec0207a6e7b68ff9575c7c8473bbaf394cf
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5908
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: jenkins
2015-02-04 12:28:54 -08:00
casey
fd09294b74 TimestampValue refactor and cleanup (part 2) (IMPALA-1623)
This is preparation for fixing IMPALA-97. These changes are mostly
non-functional to bring the code closer to styling standards.

The biggest functional changes should be:

1) IMPALA-1623 was caused by a misuse of a constructor and that code
   didn't compile after the refactor so the bug was fixed.
2) TimestampValue.Hash() seems to have been hashing the time twice
   instead of the time and date.
3) Timings using TimestampValue.time() would not be accurate when
   crossing midnight (time and date are separate fields).
4) Timings using local time should use UTC to avoid daylight savings
   problems.
5) Use system monotonic clock in util/time.h.

Some timings may still be affected by #3 & 4 above but fixing those
isn't the purpose of this change.

Change-Id: I26056c876c4361e898acc3656aa98abf6f153a6b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5779
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
2015-02-03 01:49:55 -08:00
Martin Grund
b1bbc2e15f IMPALA-1587: Patch to fix exhaustive test for compute stats
This is a fix for the exhaustive test of compute stats that was
introduced by the new layout of the show partitions output. In the
exhaustive test one column too many was added to the expected output.

Change-Id: Ie83c114cd8ac1a711da64de3c82578020eb332af
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5865
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
2015-01-27 22:33:25 -08:00
Martin Grund
cee1e84c1e IMPALA-1587: Extending caching directives for multiple replicas
This patch adds the possibility to specify the number of replicas that
should be cached in main memory. This can be useful in high QPS
scenarios as the majority of the load is no longer the single cached
replica, but a set of cached replicas. While the cache replication
factor can be larger than the block replication factor on disk, the
difference will be ignored by HDFS until more replicas become
available.

This extends the current syntax for specifying the cache pool in the
following way:

   cached in 'poolName'

is extended with the optional replication factor

   cached in 'poolName' with replication = XX

By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent independent of the cache pool
that is used to cache the partition.

To review changes and status of the replication factor for tables and
partitions the replication factor is part of output of the "show
partitions" command.

Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-01-26 20:30:59 -08:00
Dan Hecht
3735ea94a0 S3: Don't seek/read past file end
DistributedFileSystem is lenient about seeking past the end of the file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error on this condition.  That leads to a scary looking message in the
query warnings.

So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). There were a couple of kinds of AllocateScanRange()
callsites that needed to be fixed up:

1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.

2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess.  We were already doing this
in some places but not everywhere.

3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors.  This will
give better diagnostics for corrupt files.

Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).
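
The clamping described above can be sketched as follows. clamp_scan_range is a hypothetical helper, not the AllocateScanRange() signature. It trims a guessed length to the known file length and rejects ranges that start past the end of the file.

```python
def clamp_scan_range(offset, length, file_length):
    """Return (offset, length) trimmed so the range ends within the file.

    Hypothetical helper: file_length is the length recorded in the file
    descriptor, which may be stale, so this is a sanity bound and not a
    safety guarantee.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    if offset > file_length:
        raise ValueError("scan range starts past the end of the file")
    return offset, min(length, file_length - offset)

print(clamp_scan_range(900, 200, 1000))  # (900, 100)
print(clamp_scan_range(0, 500, 1000))    # (0, 500)
```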

Additionally, this change revealed a pre-existing problem with files
that have multiple row-groups.  The first time through InitColumns(),
stream_ was set to NULL.  But, stream_->filename could potentially be
accessed when constructing error statuses for subsequent row-groups.

Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
2015-01-08 16:19:35 -08:00
Skye Wanderman-Milne
cfd4ff2546 IMPALA-1589: allow up to 8 non-variadic arguments in the interpreted UDF path
Change-Id: Ie17763366311554ee1a58ed6b8a8d40973ae20d9
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5604
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-12-16 18:53:16 -08:00
Skye Wanderman-Milne
0f8ebbc5be Add missing "drop function" from Hive UDF test
Change-Id: Ieb073556be1a68de0a7b5574b43608c2f368c290
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5553
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
2014-12-10 23:02:47 -08:00
Dimitris Tsirogiannis
57132bf021 IMPALA-1535: Partition pruning with NULL
This commit fixes the issue where partition pruning returns wrong
results when a binary predicate contains a NULL literal.
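
For context, a comparison against NULL is never true under standard SQL three-valued logic, so an equality predicate with a NULL literal should select no partitions. The sketch below illustrates that rule only. It is not the Impala pruning code, and both function names are hypothetical.

```python
def eval_eq(left, right):
    """SQL '=' under three-valued logic: a NULL (None) operand yields NULL."""
    if left is None or right is None:
        return None
    return left == right

def prune(partition_values, literal):
    """Keep only the partitions for which the predicate is definitely true."""
    return [v for v in partition_values if eval_eq(v, literal) is True]

print(prune([2010, 2011, None], 2010))  # [2010]
print(prune([2010, 2011, None], None))  # []
```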

Change-Id: I24c647184dcef49d12d6ff422e28667777df7784
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5443
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5549
2014-12-10 17:33:11 -08:00
Alex Behm
325f5a4551 [CDH5] Fix exhaustive test runs: Correct malformed test section.
Change-Id: Ief7128b8d21144199c629ee002c81b0930d2fc14
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5496
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-12-04 18:23:00 -08:00
Alex Behm
f696861c5c Throw error on unrecognized test sections.
Our .test file parser used to not abort tests when there
is a malformed test/section. This patch changes that behavior
to report an error and treat the test as failed.

Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.

Arguably, the test file parser should be more flexible in which places
to accept comments, but this patch does not address that problem.

Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-12-02 18:08:09 -08:00
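Note on f696861c5c: a minimal sketch of the stricter behavior this commit describes, where an unrecognized section is an error instead of being skipped. It is not the actual parser. The `---- NAME` section header syntax, the set of valid section names, and the function name `parse_test_case` are assumptions made for illustration.

```python
# Hypothetical set of recognized section names.
VALID_SECTIONS = {"QUERY", "RESULTS", "TYPES"}


def parse_test_case(text, valid_sections=VALID_SECTIONS):
    """Split one test case into named sections, failing on unknown ones."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("---- "):
            current = line[5:].strip()
            if current not in valid_sections:
                # Strict behavior: report the problem instead of skipping it.
                raise ValueError(f"Unrecognized test section: {current!r}")
            sections[current] = []
        elif current is None:
            if line.strip():
                raise ValueError(f"Text outside of any section: {line!r}")
        else:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


parsed = parse_test_case("---- QUERY\nselect 1\n---- RESULTS\n1")
```

A lenient parser would drop a misspelled section such as `---- RESULT` without any signal, which is how a malformed test can go unexecuted without anyone noticing.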
Henry Robinson
13b9cdd6b0 More test coverage for incremental stats
Change-Id: I17778dcf019c2a219baa678211f221b7e04813bb
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5446
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 2f5ec47e0b5dd26bc9dfe884481d3a316201be2d)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5460
Reviewed-by: Henry Robinson <henry@cloudera.com>
2014-12-01 17:26:39 -08:00
Alex Behm
c0f2e043b4 Fix exhaustive test runs: Preserve types when substituting root output exprs.
A recent change (3ccee71) to fix resetAnalysisState() of NullLiterals
exposed another bug during exhaustive test runs.
For insert queries into Parquet, the types in the schema of the generated
Parquet files are based on the insert exprs, correctly assuming that
the FE handles all the necessary casting to make sure the Parquet file
schema and the table schema match.
Since we apply an smap on the output exprs towards the end of planning,
NullLiterals were reset to the NULL_TYPE, causing the Parquet schema
to incorrectly have BOOLEAN columns (we cast naked NULL_LITERALS to
BOOLEAN in toThrift()), leading to a mismatch of the Parquet schema
and the table schema. Subsequent queries on such a table failed,
correctly reporting a type mismatch.

The fix is to preserve types when doing the substitution on the output exprs.

Change-Id: I135f1b826b06a6a200df7b73343d2eb1fb4b7b80
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5453
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5455
2014-11-30 01:08:08 -08:00
Henry Robinson
98064c4da8 Fix crash when columns are dropped between compute stats calls
FinalizePartitionedColumnStats() should have iterated over the list of
columns present in the table, rather than the list in the existing stats
data structure. If a column had been dropped but was still present in the
old structure, we could index off the end of an array.

Change-Id: Ib1ab7690ffae05afff826b9d1a15871337691739
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5437
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
(cherry picked from commit cee8305cd2878c8f00622d39ddd43b7a5dfbbc0d)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5447
Reviewed-by: Henry Robinson <henry@cloudera.com>
2014-11-26 23:25:11 -08:00
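Note on 98064c4da8: a simplified sketch of the iteration order the fix describes, driving the loop from the columns currently in the table instead of from the stale stats structure. This is not the real FinalizePartitionedColumnStats(); the function name, the dict-based stats layout, and the `ndv` field are hypothetical.

```python
def finalize_column_stats(table_columns, old_stats):
    """Build one stats slot per column that still exists in the table."""
    # The output is sized by the current table schema.
    result = [None] * len(table_columns)
    # Iterating over table_columns keeps every index within bounds.
    # Iterating over old_stats could yield more entries than there are
    # slots if a column was dropped after the stats were computed.
    for i, name in enumerate(table_columns):
        result[i] = old_stats.get(name)
    return result


old_stats = {"a": {"ndv": 10}, "b": {"ndv": 5}, "c": {"ndv": 7}}
current_columns = ["a", "c"]  # column "b" was dropped
finalized = finalize_column_stats(current_columns, old_stats)
```

The stale entry for the dropped column is ignored because nothing in the current schema asks for it.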
Dimitris Tsirogiannis
30a5d1d452 IMPALA-1526: Invalid tuple idx from IsNullPredicate exprs cause Impala
to crash

This commit fixes the issue where the tuple ids of semi-joined tuples
are incorrectly added to the list of materialized tuple ids of
IsNullPredicate exprs, causing Impala to crash. The fix is to exclude
semi-joined tuple ids from the list of materialized tuple ids of select
statements.

Change-Id: I93712be9d03dd54dc9172f51a5ba99e85aa05455
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5405
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5434
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
2014-11-26 14:42:54 -08:00
Matthew Jacobs
e004307bbe IMPALA-1419, IMPALA-1542: Fix NullLiteral to reset its type in resetAnalysisState
Queries with arithmetic exprs containing a NullLiteral child failed (IMPALA-1419)
or crashed (IMPALA-1542) because re-analysis of these exprs was incorrect.

Change-Id: Ice3461aed53863123bcf8f38af123d89ad3b7d6a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5429
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
2014-11-26 14:29:48 -08:00
Alex Behm
4ad15bb2be IMPALA-1524: Materialize all tuples produced by an EmptySetNode.
Change-Id: I3b151ace464c67634104f84f7223c948fed8909e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5406
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c2959485a066b5c0b40e8b0790d526726236d0c9)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5409
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-11-25 23:21:02 -08:00
Skye Wanderman-Milne
390e773a44 rand() is not a constant expr
Also fixes a bug in Expr::DebugString()

Change-Id: I32b53072755781d0858481187864d2319b9ae1cb
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5400
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 6de9fab17a5032dd7c9d1ef6b8071703c67d223f)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5425
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-11-25 18:38:27 -08:00