Commit Graph

5231 Commits

Author SHA1 Message Date
Dimitris Tsirogiannis
3db5ced4ce IMPALA-3726: Add support for Kudu-specific column options
This commit adds support for Kudu-specific column options in CREATE
TABLE statements. The syntax is:
CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [, ....])
where option is:
| NULL
| NOT NULL
| ENCODING encoding_val
| COMPRESSION compression_algorithm
| DEFAULT expr
| BLOCK_SIZE num

The output of the SHOW CREATE TABLE statement was altered to include all the specified
column options for Kudu tables.

Change-Id: I727b9ae1b7b2387db752b58081398dd3f3449c02
Reviewed-on: http://gerrit.cloudera.org:8080/5026
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-18 11:41:01 +00:00
Attila Jeges
60414f0633 IMPALA-4278: Don't abort Catalog startup quickly if HMS is not present
This change introduces a new catalogd startup option
(init_first_metastore_client_timeout_seconds) that specifies the
time in seconds catalogd should spend on retrying to establish a
connection to HMS the first time on startup before giving up and
exiting fatally.

Setting this startup option to a value that is greater than the HMS
startup time will allow CM to start Impala at the same time or even
before HMS.

The default value of init_first_metastore_client_timeout_seconds is
120 seconds.

Change-Id: I546d8fe9836004832ae40110c9fe22b3e704e11b
Reviewed-on: http://gerrit.cloudera.org:8080/5095
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-11-18 03:12:12 +00:00
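The retry behaviour described in 60414f0633 can be sketched roughly as follows. This is a minimal Python illustration, not catalogd's actual code: `connect_with_timeout`, its parameters, and the retry interval are hypothetical names chosen for the example. Only the 120-second default is taken from the commit message.

```python
import time


def connect_with_timeout(connect, timeout_seconds=120, retry_interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or timeout_seconds have elapsed.

    All names here are hypothetical; only the 120-second default mirrors
    the commit message.
    """
    deadline = clock() + timeout_seconds
    while True:
        try:
            return connect()
        except ConnectionError:
            if clock() >= deadline:
                raise  # give up: the caller would exit fatally
            sleep(retry_interval)
```

Passing the clock and sleep functions as parameters is only a convenience so the loop can be exercised without real waiting.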
Jim Apple
76719de5d1 Don't overwrite user's .ssh/config file when bootstrapping
From bash's manual page on redirecting with '>'

    Redirection of output causes the file whose name results from the
    expansion of word to be opened for writing on file descriptor n,
    or the standard output (file descriptor 1) if n is not specified.
    If the file does not exist it is created; if it does exist it is
    truncated to zero size.

Change-Id: I0d1a56441fcb5a2a2aed043fc1ece866c5d8287a
Reviewed-on: http://gerrit.cloudera.org:8080/4967
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Impala Public Jenkins
2016-11-18 03:06:52 +00:00
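The truncation behaviour quoted in 76719de5d1 has a direct analogue in Python's file modes, sketched below. The commit message does not say how the bootstrap script was changed; appending is shown here only as one way to avoid clobbering an existing file, and the temporary path is a stand-in for a real `.ssh/config`.

```python
import os
import tempfile

# Stand-in for ~/.ssh/config; a temp file so nothing real is touched.
path = os.path.join(tempfile.mkdtemp(), "config")

with open(path, "w") as f:   # like `> file`: create or truncate
    f.write("Host existing\n")

with open(path, "a") as f:   # like `>> file`: create or append
    f.write("Host added\n")

with open(path) as f:
    after_append = f.read()

with open(path, "w") as f:   # truncates: the earlier contents are lost
    f.write("Host only\n")

with open(path) as f:
    after_truncate = f.read()
```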
Alex Behm
263f222557 IMPALA-4490: Only generate runtime filters for hash join nodes.
Change-Id: I167725e260bd0f91c2bfc164eb044321192d5b95
Reviewed-on: http://gerrit.cloudera.org:8080/5117
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-18 00:26:35 +00:00
Jim Apple
3be0f122a5 IMPALA-3398: Add docs to main Impala branch.
These are refugees from doc_prototype. They can be rendered with the
DITA Open Toolkit version 2.3.3 by:

/tmp/dita-ot-2.3.3/bin/dita \
  -i impala.ditamap \
  -f html5 \
  -o $(mktemp -d) \
  -filter impala_html.ditaval

Change-Id: I8861e99adc446f659a04463ca78c79200669484f
Reviewed-on: http://gerrit.cloudera.org:8080/5014
Reviewed-by: John Russell <jrussell@cloudera.com>
Tested-by: John Russell <jrussell@cloudera.com>
2016-11-17 22:38:44 +00:00
Tim Armstrong
46f5ad48e3 IMPALA-3202: refactor scratch file management into TmpFileMgr
This is a pure refactoring patch that moves all of the logic
for allocating scratch file ranges into TmpFileMgr in anticipation of
this logic being used by the new BufferPool.

There should be no behavioural changes.

Also remove a bunch of TODOs that we're not going to fix.

Change-Id: I0c56c195f3f28d520034f8c384494e566635fc62
Reviewed-on: http://gerrit.cloudera.org:8080/4898
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 21:56:14 +00:00
Jim Apple
45ac10aa40 IMPALA-4476: Use unique_database to stop races in test_udfs.py
These tests have been failing nondeterministically on larger machines
with 16 cores. This change should stop races in hadoop fs -put and
drop/create function.

Change-Id: I520a8b817ad7e32dba299c2535033f55f1bd1c84
Reviewed-on: http://gerrit.cloudera.org:8080/5124
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Impala Public Jenkins
2016-11-17 21:35:56 +00:00
Henry Robinson
2648bfbd90 Improve message output from run-step.sh
run-step prints a message to tell the reader what it's doing. However,
that message wasn't flushed so that run-step could print OK or FAILED on
the same line. The result was that long-running steps wouldn't print
anything to the log until they were done, at least in Jenkins contexts.

This patch changes it so that the message is flushed, and then the
result is printed on a separate line (including the time it took to run
the step).

  $ run-step "Hello world!" helloworld.out sleep 5
  Hello world! (logging to /tmp/helloworld.out)...
      OK (Took: 0 min 5 sec)

Change-Id: Iaced729f0ef6aa93174cd90b1516d3c34fe41a22
Reviewed-on: http://gerrit.cloudera.org:8080/5116
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 09:35:14 +00:00
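The flush-then-report pattern described in 2648bfbd90 might look like this in Python. It is a sketch of the idea only, not a port of run-step.sh: the function names are hypothetical, logging the step's output to a file is omitted, and the layout of the FAILED line is an assumption.

```python
import subprocess
import sys
import time


def format_duration(seconds):
    """Format elapsed seconds as 'M min S sec', as in the sample output."""
    minutes, secs = divmod(int(seconds), 60)
    return "%d min %d sec" % (minutes, secs)


def run_step(message, command):
    """Print message immediately, run command, then report on a new line."""
    print("%s..." % message, flush=True)  # flushed before the step runs
    start = time.monotonic()
    returncode = subprocess.call(command)
    elapsed = format_duration(time.monotonic() - start)
    if returncode == 0:
        print("    OK (Took: %s)" % elapsed)
    else:
        print("    FAILED (Took: %s)" % elapsed)  # assumed layout
    return returncode == 0
```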
Taras Bobrovytsky
eb8120d218 IMPALA-3812: Fix error message for unsupported types
Before this patch an unclear error message was returned if DATE or
DATETIME appeared in the select list after a star expansion. This was
because DATE and DATETIME PrimitiveType was serialized as INVALID_TYPE.
This is fixed by serializing correctly.

Change-Id: I9019b4bfd219f94e554c795befd3ff5e39706ea9
Reviewed-on: http://gerrit.cloudera.org:8080/4859
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 05:31:34 +00:00
Tim Armstrong
0ab3d7691e IMPALA-4392: restore PeakMemoryUsage to DataSink profiles
The join build sink patches refactored the DataSink interface and
inadvertently removed this counter from the profile.
The problem was that the sink MemTracker was not initialized with the
sink's profile.

The fix is for the sink to create the MemTracker itself.

Testing:
Ran core tests. Manually checked profile to make sure the counter
appeared in HdfsTableSink, DataStreamSender, etc.

Change-Id: Iaa5db623a84c47d5904033ec26aece74f500a2c9
Reviewed-on: http://gerrit.cloudera.org:8080/4969
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 04:37:56 +00:00
Matthew Jacobs
77a2941a42 IMPALA-3713,IMPALA-4439: Fix Kudu DML shell reporting
Adds support in the shell to report the number of modified
rows for all DML operations, as well as the number of rows
with errors.

Testing: Added shell tests.

Change-Id: I3d3d7aa8d176e03ea58fb00f2a81fb3e34965aa1
Reviewed-on: http://gerrit.cloudera.org:8080/5103
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 04:13:25 +00:00
Thomas Tauber-Marshall
3833707dbd IMPALA-4466: Improve Kudu CRUD test coverage
The results in the test files were verified by hand.

This patch also introduces a new test section 'DML_RESULTS', which
takes the name of a table as a comment and the contents of the
table as its body and then verifies that the body matches the
actual contents of the table. This makes it easy to check that a
DML operation has the desired effect on the contents of a table,
rather than always having to add another test case that runs a
select on the table. For now, this section cannot be used in a
test along with the RESULTS or ERRORS sections.

TODO: Refactor the DML test case handling (IMPALA-4471)

Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6
Reviewed-on: http://gerrit.cloudera.org:8080/4953
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 02:54:30 +00:00
Dan Hecht
ab0d21ab79 IMPALA-4493: fix string-compare-test when using clang
Only the 0 value or sign bit is specified in the return
value for strncmp(), so fix up the test accordingly.

Testing:
- verified the new test still reproduces IMPALA-4436
- verified the new test passes under ASAN build

Change-Id: I5d82ac2bff33fdbf66275fcfc6558c4bc29de5e7
Reviewed-on: http://gerrit.cloudera.org:8080/5110
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 01:46:23 +00:00
Alex Behm
f5e660dd6e IMPALA-4470: Avoid creating a NumericLiteral from NaN/infinity/-0.
Our NumericLiteral is backed by a BigDecimal which cannot
represent the special float values NaN, infinity or negative zero.
As a result, when evaluating constant expressions from the FE we
hit an exception when trying to create a NumericLiteral from
a NaN or infinity value. Before, negative zero would silently
get converted to zero which is dangerous.

The fix is to treat the expr evaluation as a failure and not
replace the constant Expr with a LiteralExpr.

Change-Id: I8243b2ee9fa9c470d078b385583f2f48b606a230
Reviewed-on: http://gerrit.cloudera.org:8080/5050
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-16 23:55:42 +00:00
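The problem described in f5e660dd6e can be illustrated in Python, using `fractions.Fraction` as a stand-in for Java's BigDecimal (an assumption for the sake of the example: both are exact types with no NaN, infinity, or signed zero). The helper name `to_exact_literal` is hypothetical; returning None plays the role of "treat the evaluation as a failure and keep the original expression".

```python
import math
from fractions import Fraction


def to_exact_literal(value):
    """Return an exact Fraction for value, or None if it has no exact form."""
    if math.isnan(value) or math.isinf(value):
        return None
    if value == 0 and math.copysign(1.0, value) < 0:
        return None  # negative zero would silently become plain zero
    return Fraction(value)
```

Without the negative-zero check, `Fraction(-0.0)` converts without error and the sign is lost, which is the silent conversion the commit calls dangerous.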
Matthew Jacobs
107fc4e9f9 IMPALA-4477: Upgrade Kudu version to latest master
Change the toolchain build and Kudu version to use
the latest master, using Kudu commit e836ac.

Change-Id: I49f8582cc3c0f776167fe3decf4236345ba78bd3
Reviewed-on: http://gerrit.cloudera.org:8080/5106
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-11-16 21:57:37 +00:00
Alex Behm
0a654b3186 Run MT_DOP tests on all file formats.
Change-Id: I28d5bcc48bbe32fb970b41daa919096061a05beb
Reviewed-on: http://gerrit.cloudera.org:8080/5025
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 23:47:40 +00:00
Jim Apple
9434e38abb clang-tidy should tidy tests; fix alignas error in clang builds.
run_clang_tidy.sh was mistakenly using -notests, which doesn't even
compile the tests, rather than -skiptests, which compiles (but does
not run) the backend tests.

When I discovered this, I also found that all clang builds (including
tidy and asan) had been broken by my previous alignas commit
(10a4c5a2e4). This patch fixes that as
well.

Change-Id: Ib7066039f78d7ee039db619b96e25665b4d63503
Reviewed-on: http://gerrit.cloudera.org:8080/5094
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 23:33:45 +00:00
Michael Ho
38ee3b6942 IMPALA-4444: Transfer row group resources to row batch on scan failure
Previously, if any column reader fails in HdfsParquetScanner::AssembleRows(),
the memory pools associated with the ScratchTupleBatch will be freed. This
is problematic as ScratchTupleBatch may contain memory pools which are still
referenced by row batches shipped upstream. This is possible because memory
pools used by parquet column readers (e.g. decompressor_pool_) won't be
transferred to a ScratchTupleBatch until the data page is exhausted. So,
the memory pools of the previous data page are always attached to the
ScratchTupleBatch of the current data page. On a scan failure, it's not
necessarily safe to free the memory pool attached to the current ScratchTupleBatch.

This patch fixes the problem above by transferring the memory pool and other
resources associated with a row group to the current row batch in the parquet
scanner on scan failure so it can eventually be freed by upstream operators as
the row batch is consumed.

Change-Id: Id70df470e98dd96284fd176bfbb946e9637ad126
Reviewed-on: http://gerrit.cloudera.org:8080/5052
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 23:02:50 +00:00
Dan Hecht
6937fa9a4c IMPALA-4436: StringValue::StringCompare() should match strncmp()
According to the C standard, strncmp() interprets characters as
unsigned, whereas StringCompare() uses char (which happens to be
signed).  This means that for values greater than 127, they don't give
the same result (which is especially bad considering StringCompare()
falls back to strncmp(), and so the answer depends on the mismatched
position).

Fix StringCompare() to interpret as unsigned char.

Change-Id: Ic0750f98d8c5ef7d0c0ea279cd1f80b4acbad1be
Reviewed-on: http://gerrit.cloudera.org:8080/5083
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 20:47:18 +00:00
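The mismatch described in 6937fa9a4c can be reproduced with a small Python model. This is not Impala's code: the helper names are hypothetical, and Python's `bytes` comparison (which orders bytes as unsigned values) stands in for strncmp().

```python
def as_signed(byte):
    """Reinterpret an unsigned byte (0..255) as a signed 8-bit value."""
    return byte - 256 if byte > 127 else byte


def compare_unsigned(a, b):
    """Three-way compare of two bytes objects, treating bytes as unsigned."""
    return (a > b) - (a < b)


def compare_signed(a, b):
    """Three-way compare treating each byte as a signed char."""
    sa = [as_signed(x) for x in a]
    sb = [as_signed(x) for x in b]
    return (sa > sb) - (sa < sb)
```

The two functions agree on ASCII input and disagree as soon as a byte above 127 takes part in the comparison, for example `b"\x80"` against `b"\x7f"`.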
Jim Apple
4b774880c9 Increase wait times for startup of Hive and its Metastore
On Ubuntu 14.04 on AWS EC2 m4.4x instances, these components
frequently take more than 30 seconds to start. I have seen the HMS
take more than 90 seconds; this patch sets a more conservative timeout
default.

Change-Id: I43eb8646cca495578c8f9730faa04812957d2917
Reviewed-on: http://gerrit.cloudera.org:8080/5068
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 20:35:01 +00:00
Jim Apple
b3cbc960a7 IMPALA-4434: In Python, ''.split('\n') is [''], which has length 1
This test simply may have never been run in GMT or UTC - it appears to
have an easy-to-make off-by-one error.

Change-Id: Iac4943085b0693deb380499cd0e141eb672bead8
Reviewed-on: http://gerrit.cloudera.org:8080/5061
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 15:29:26 +00:00
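The off-by-one named in the title of b3cbc960a7 is easy to reproduce. The snippet below shows the behaviour and two common ways to count zero lines for empty input. The commit message does not say which fix the test adopted, so both alternatives are illustrative only.

```python
empty_output = ""
lines = empty_output.split("\n")
print(lines, len(lines))

# splitlines() and a filtered split both give zero lines for empty input.
no_lines = empty_output.splitlines()
filtered = [line for line in empty_output.split("\n") if line]
```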
Alex Behm
91b5264e52 IMPALA-4479: Use correct isSet() thrift function when evaluating constant bool exprs.
Change-Id: Ie3ba195a5241ca630bd0cf71b83d423733b06546
Reviewed-on: http://gerrit.cloudera.org:8080/5088
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 11:17:43 +00:00
Thomas Tauber-Marshall
e6e2baea33 IMPALA-4372: 'Describe formatted' returns types in upper case
A recent change caused 'describe formatted' to display the types
in all upper case, but we want 'describe formatted' to match Hive's
'describe' output, which displays the types in lower case.

This patch also fixes several problems with test_describe_formatted,
which was encountering an error but reporting success.

Change-Id: I274b97d4d1247244247fb38a5ca7f4c10bba8d22
Reviewed-on: http://gerrit.cloudera.org:8080/4861
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 05:38:12 +00:00
Jim Apple
0ea4a666dc IMPALA-4433: Always generate testdata using the same time zone setting
Before this change, testdata was generated using the
java.util.TimeZone.getDefault() TimeZone of the machine it was running
on.  This patch standardizes on "America/Los_Angeles", which matches
the existing expected results in the end-to-end tests.

Change-Id: Iaf7cc796e44e9ff64880f9ae852f40961592f279
Reviewed-on: http://gerrit.cloudera.org:8080/5058
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 04:18:33 +00:00
Sailesh
f4a5d863c3 IMPALA-4465: Don't hold process wide lock while serializing Runtime Profile in GetRuntimeProfileStr()
This patch changes the code so that the query_exec_state_map_lock_
is not held while serializing the RuntimeProfile, since that is a
pretty expensive operation and happens at least once per query. This
change makes a lot of client calls have less lock contention including
Webserver calls and query registration/unregistration.

Change-Id: I3ad8a1d6644259f177dfb3b29b3ba1ad6a76210a
Reviewed-on: http://gerrit.cloudera.org:8080/5035
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 03:33:14 +00:00
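The locking change in f4a5d863c3 follows a common pattern: look up the entry under the map lock, then do the expensive work after releasing it. A minimal Python sketch, with a hypothetical `QueryRegistry` class and JSON standing in for profile serialization:

```python
import json
import threading


class QueryRegistry:
    """Hypothetical stand-in for a map of query states guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._profiles = {}

    def register(self, query_id, profile):
        with self._lock:
            self._profiles[query_id] = profile

    def get_profile_str(self, query_id):
        # Hold the map lock only long enough to look up the entry.
        with self._lock:
            profile = self._profiles.get(query_id)
        if profile is None:
            return None
        # The expensive serialization runs without the map lock held.
        return json.dumps(profile, sort_keys=True)
```

In a real system the looked-up object also has to stay alive and be safe to read after the map lock is released; the sketch does not address that.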
Amos Bird
628685ae74 IMPALA-1654: General partition exprs in DDL operations.
This commit handles partition related DDL in a more general way. We can
now use compound predicates to specify a list of partitions in
statements like ALTER TABLE DROP PARTITION and COMPUTE INCREMENTAL
STATS, etc. It will also make sure some statements only accept one
partition at a time, such as PARTITION SET LOCATION and LOAD DATA. ALTER
TABLE ADD PARTITION remains using the old PartitionKeyValue's logic.

The changed partition related DDLs are as follows,

Table: p (i int) partitioned by (j int, k string)
Partitions:
+-------+---+-------+--------+------+--------------+-------------------+
| j     | k | #Rows | #Files | Size | Bytes Cached | Cache Replication |
+-------+---+-------+--------+------+--------------+-------------------+
| 1     | a | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| 1     | b | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| 1     | c | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| 2     | d | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| 2     | e | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| 2     | f | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| Total |   | -1    | 0      | 0B   | 0B           |                   |
+-------+---+-------+--------+------+--------------+-------------------+

1. show files in p partition (j<2, k='a');
2. alter table p partition (j<2, k in ("b","c")) set cached in 'testPool';

// j can appear more than once,
3.1. alter table p partition (j<2, j>0, k<>"d") set uncached;
// it is the same as
3.2. alter table p partition (j<2 and j>0, not k="e") set uncached;
// we can also do 'or'
3.3. alter table p partition (j<2 or j>0, k like "%") set uncached;

// missing 'k' matches all values of k
4. alter table p partition (j<2) set fileformat textfile;
5. alter table p partition (k rlike ".*") set serdeproperties ("k"="v");
6. alter table p partition (j is not null) set tblproperties ("k"="v");
7. alter table p drop partition (j<2);
8. compute incremental stats p partition(j<2);

The remaining old partition related DDLs are as follows,

1. load data inpath '/path/from' into table p partition (j=2, k="d");
2. alter table p add partition (j=2, k="g");
3. alter table p partition (j=2, k="g") set location '/path/to';
4. insert into p partition (j=2, k="g") values (1), (2), (3);

General partition expressions or partially specified partition specs
allow partition predicates to match an empty partition set, whether or
not 'IF EXISTS' is specified.

Examples:

[localhost.localdomain:21000] >
alter table p drop partition (j=2, k="f");
Query: alter table p drop partition (j=2, k="f")
+-------------------------+
| summary                 |
+-------------------------+
| Dropped 1 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.78s
[localhost.localdomain:21000] >
alter table p drop partition (j=2, k<"f");
Query: alter table p drop partition (j=2, k<"f")
+-------------------------+
| summary                 |
+-------------------------+
| Dropped 2 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.41s
[localhost.localdomain:21000] >
alter table p drop partition (k="a");
Query: alter table p drop partition (k="a")
+-------------------------+
| summary                 |
+-------------------------+
| Dropped 1 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.25s
[localhost.localdomain:21000] > show partitions p;
Query: show partitions p
+-------+---+-------+--------+------+--------------+-------------------+
| j     | k | #Rows | #Files | Size | Bytes Cached | Cache Replication |
+-------+---+-------+--------+------+--------------+-------------------+
| 1     | b | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| 1     | c | -1    | 0      | 0B   | NOT CACHED   | NOT CACHED        |
| Total |   | -1    | 0      | 0B   | 0B           |                   |
+-------+---+-------+--------+------+--------------+-------------------+
Fetched 3 row(s) in 0.01s

Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f
Reviewed-on: http://gerrit.cloudera.org:8080/3942
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 03:27:36 +00:00
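The example session in 628685ae74 can be modelled with plain predicates over (j, k) pairs. This is only an illustration of how a partition predicate selects a set of partitions; `drop_partitions` is a hypothetical helper, not part of Impala.

```python
partitions = [(1, "a"), (1, "b"), (1, "c"), (2, "d"), (2, "e"), (2, "f")]


def drop_partitions(parts, predicate):
    """Return (kept, dropped_count) after dropping every matching partition."""
    kept = [p for p in parts if not predicate(*p)]
    return kept, len(parts) - len(kept)


# Mirrors the three DROP PARTITION statements in the example session.
partitions, n1 = drop_partitions(partitions, lambda j, k: j == 2 and k == "f")
partitions, n2 = drop_partitions(partitions, lambda j, k: j == 2 and k < "f")
partitions, n3 = drop_partitions(partitions, lambda j, k: k == "a")
print(n1, n2, n3, partitions)
```

A predicate that matches nothing simply yields a dropped count of zero, which corresponds to the empty partition set mentioned in the commit message.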
Bharath Vissapragada
3f2f008ac4 IMPALA-3552: Make incremental stats max serialized size configurable
The fix "IMPALA-2648/IMPALA-2664" introduced a conservative limitation
on the maximum serialized size of incremental stats. As a side effect,
some users with very large tables are experiencing regressions
especially when they upgrade impala and the serialized size goes
beyond 200MB.

To mitigate the issue, the change introduces a new gflag,
'inc_stats_size_limit_bytes' to make the max serialized size
configurable, which allows impala users to specify their own maximum
serialized size. Default value for inc_stats_size_limit_bytes is
200MB.

The change introduces a TBackendGflags class to pass the gflags from
backend to the Frontend and the Catalog via thrift.  This also revamps
existing query options to use the TBackendConfig.

Change-Id: I33684725a61eabc67237503e61178305d37d3cb5
Reviewed-on: http://gerrit.cloudera.org:8080/4867
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 03:22:11 +00:00
Michael Ho
cac02d6b76 IMPALA-4452: Always call AggFnEvaluator::Open() before AggFnEvaluator::Init()
As part of the fix for IMPALA-2379, the expression contexts of
aggregation function evaluators are expected to be opened before
their initFn() are called so constant arguments can be accessed
in initFn(). However, the legacy aggregation node wasn't updated
to follow this order for singleton result tuple (i.e. no group-by).

This patch fixes the problem by deferring the creation of the
singleton tuple to a point in AggregationNode::Open() after the
expression contexts of all aggregate function evaluators have
been opened. PartitionedAggregationNode() was already updated
to follow this order.

This patch also fixes a minor bug in which uninitialized entries
of agg_fn_ctxs_[] may be accessed in AggregationNode::Close()
if AggregationNode::Prepare() fails.

Change-Id: I2f261dee47821c517d8dbe1babf4112462d85807
Reviewed-on: http://gerrit.cloudera.org:8080/5049
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-11-14 22:38:09 +00:00
Jim Apple
10a4c5a2e4 IMPALA-4480: zero_length_region_ must be as aligned as max_align_t
MemPool::TryAllocateAligned returns memory that might be that aligned,
and it returns &MemPool::zero_length_region_ when called to allocate a
block of size 0.

While I'm here, do some things to make diagnosing test failures from
terminal output easier.

Change-Id: Ia31b27e38897f357478c4eedaab0c787e731b2d4
Reviewed-on: http://gerrit.cloudera.org:8080/5062
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-14 21:25:42 +00:00
Matthew Jacobs
4258b9f09e IMPALA-4477: Upgrade Kudu version to latest master
Change the toolchain build and Kudu version to use the
latest master, using Kudu commit 88b023.

Change-Id: I21c5bc0d28df83cd2e57cd30b6ab416e0d430775
Reviewed-on: http://gerrit.cloudera.org:8080/5054
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-13 02:52:33 +00:00
Thomas Tauber-Marshall
d15f86cb6f IMPALA-4454: test_kudu.TestShowCreateTable flaky
The cause of the flakiness is Kudu CREATE TABLE operations
that are sometimes taking a long time, leading to timeouts
in the hiveserver2 connection. This patch adds the ability
for tests using the 'conn' pytest fixture to specify a
timeout to connect(), and sets a timeout of 5 minutes for
this test.

Change-Id: I2727c27ff66140ac4043bcad332cd4e1d72b255f
Reviewed-on: http://gerrit.cloudera.org:8080/5040
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-11 20:04:01 +00:00
Jim Apple
c68734ad65 IMPALA-4455: MemPoolTest.TryAllocateAligned failure: sizeof v. alignof
This was testing all memory alignments up to and including
sizeof(max_align_t), but the standard says nothing about that. It does
say things about alignof(max_align_t), including that malloc() returns
memory at least that aligned.

In both gcc and clang on our currently supported platforms,
max_align_t has sizeof == 32 and alignof == 16, so this test expected
an alignment that malloc was not guaranteed to provide.

Change-Id: Ic2dbabcb9af2874d8ed354738243dfca9c492b08
Reviewed-on: http://gerrit.cloudera.org:8080/5022
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Jim Apple <jbapple@cloudera.com>
2016-11-11 03:44:07 +00:00
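The distinction drawn in c68734ad65 between size and alignment can be observed from Python through ctypes. ctypes does not expose max_align_t, so the sketch uses a hypothetical struct whose size exceeds its alignment because of trailing padding. The exact numbers printed depend on the platform.

```python
import ctypes


class Sample(ctypes.Structure):
    """Hypothetical struct; padding makes its size exceed its alignment."""
    _fields_ = [("value", ctypes.c_longdouble), ("tag", ctypes.c_char)]


size = ctypes.sizeof(Sample)
alignment = ctypes.alignment(Sample)
print("sizeof =", size, "alignof =", alignment)
```

A test that loops over alignments up to the size rather than up to the alignment would ask for more than an allocator has to provide, which is the mistake the commit describes.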
David Knupp
b14f319708 IMPALA-4461: Make sure data gets loaded for wide hbase tables.
This patch reverts a change that broke the exhaustive suite of Impala
tests. The change was introduced here:

ce4c5f6743

The original problem was that data load was failing when run against a
remote cluster, due to a 4000 byte max for SERDEPROPERTIES.PARAM_VALUE,
a limitation that is well described in HIVE-1364. Locally, when we load
data, we work around the issue here:

https://github.com/apache/incubator-impala/blob/master/bin/create-test-configuration.sh#L99

When testing on a remote CDH cluster, however, this "fix" never gets applied.
(It also assumes the database will always be postgres.)

I made this change without realizing its full effect, or appreciating
exactly how exhaustive our exhaustive test suite really is. Another
solution will need to be found for the case of remote cluster testing,
but this should unblock the local build for now.

As far as testing, I ran the full suite of tests in query_test/
test_scanners.py, and they all pass after removing these lines.

Change-Id: If2148d6546789c6c53c8e045717081b24ce76689
Reviewed-on: http://gerrit.cloudera.org:8080/5033
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-11 00:37:59 +00:00
Alex Behm
0aeb68050b IMPALA-1286: Extract common conjuncts from disjunctions.
Adds a new ExprRewriteRule to extract common conjuncts from
disjunctions.

Examples:
(a AND b AND c) OR (b AND d) ==> b AND ((a AND c) OR (d))
(a AND b) OR (a AND b) ==> a AND b
(a AND b AND c) OR (c) ==> c

Adds a new query option ENABLE_EXPR_REWRITES to enable/disable
non-essential expr rewrites in the FE. Note that some rewrites
are required, e.g., BetweenToCompoundRule. Disabling the rewrites
is useful for testing, in particular, to make sure that the exprs
specified in expr-test.cc are executed as written.

Testing: Added a new unit test in ExprRewriteRulesTest.

Change-Id: I3cf9b950afaa3fd753d1b09ba5e540b5258940ad
Reviewed-on: http://gerrit.cloudera.org:8080/4877
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 09:44:59 +00:00
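The rewrite described in 0aeb68050b can be sketched in Python on a simplified representation in which each disjunct is a list of conjunct names. This is an illustration of the idea, not the ExprRewriteRule added by the commit; `extract_common_conjuncts` and its return shape are hypothetical.

```python
def extract_common_conjuncts(disjuncts):
    """Factor conjuncts shared by every disjunct out of an OR.

    disjuncts is a list of lists of conjunct names, e.g.
    [["a", "b", "c"], ["b", "d"]] for (a AND b AND c) OR (b AND d).
    Returns (common, remaining): the predicate is
    AND(common) AND OR(AND(r) for r in remaining). An empty remaining
    list means the OR part is trivially true and can be dropped.
    """
    common = [c for c in disjuncts[0] if all(c in d for d in disjuncts[1:])]
    remaining = []
    for d in disjuncts:
        rest = [c for c in d if c not in common]
        if not rest:
            return common, []  # one disjunct is fully covered by common
        if rest not in remaining:
            remaining.append(rest)
    return common, remaining
```

Applied to the three examples in the commit message, the function returns `(["b"], [["a", "c"], ["d"]])`, `(["a", "b"], [])` and `(["c"], [])` respectively.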
Matthew Jacobs
cfac09de10 IMPALA-3710: Kudu DML should ignore conflicts, pt2
Second part of IMPALA-3710, which removed the IGNORE DML
option and changed the following errors on Kudu DML
operations to be ignored:
1) INSERT where the PK already exists
2) UPDATE/DELETE where the PK doesn't exist

This changes other data-related errors to be ignored as
well:
3) NULLs in non-nullable columns, i.e. null constraint
  violations.
4) Rows with PKs that are in an 'uncovered range'.

It became clear that we can't differentiate between (3) and
(4) because both return a Kudu 'NotFound' error code. The
Impala error codes have been simplified as well: we just
report a generic KUDU_NOT_FOUND error in these cases.

This also adds some metadata to the thrift report sent to
the coordinator from sinks so the total number of rows with
errors can be added to the profile. Note that this does not
include a breakdown of error counts by type/code because we
cannot differentiate between all of these cases yet.

An upcoming change will add this new info to the beeswax
interface and show it in the shell output (IMPALA-3713).

Testing: Updated kudu_crud tests to check the number of rows
with errors.

Change-Id: I4eb1ad91dc355ea51de261c3a14df0f9d28c879c
Reviewed-on: http://gerrit.cloudera.org:8080/4985
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 06:43:41 +00:00
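The "ignore and count" semantics described in cfac09de10 can be modelled with a dict standing in for a Kudu table. This is a toy model with a hypothetical `apply_inserts` helper. It covers only the duplicate-primary-key case and says nothing about how Impala or Kudu implement it.

```python
def apply_inserts(table, rows):
    """Insert rows keyed by primary key, skipping rows whose key exists.

    table is a dict standing in for a Kudu table (hypothetical model).
    Returns (rows_modified, rows_with_errors).
    """
    modified = errors = 0
    for key, value in rows:
        if key in table:
            errors += 1  # the conflict is counted, not fatal
            continue
        table[key] = value
        modified += 1
    return modified, errors
```

The two counters correspond to the number of modified rows and the number of rows with errors that the commit says are reported back to the coordinator.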
Tim Armstrong
d7246d64c7 IMPALA-1430,IMPALA-4108: codegen all builtin aggregate functions
This change enables codegen for all builtin aggregate functions,
e.g. timestamp functions and group_concat.

There are several parts to the change:
* Adding support for generic UDAs. Previously the codegen code did not
  handle multiple input arguments or NULL return values.
* Defaulting to using the UDA interface when there is not a special
  codegen path (we have implementations of all builtin aggregate
  functions for the interpreted path).
* Remove all the logic to disable codegen for the special cases that now
  are supported.

Also fix the generation of code to get/set NULL bits since I needed
to add functionality there anyway.

Testing:
Add tests that check that codegen was enabled for builtin aggregate
functions. Also fix some gaps in the preexisting tests.

Also add tests for UDAs that check input/output nulls are handled
correctly, in anticipation of enabling codegen for arbitrary UDAs.
The tests are run with both codegen enabled and disabled. To avoid
flaky tests, we switch the UDF tests to use "unique_database".

Perf:
Ran local TPC-H and targeted perf. Spent a lot of time on TPC-H Q1,
since my original approach regressed it ~5%. In the end the problem was
to do with the ordering of loads/stores to the slot and null bit in the
generated code: the previous version of the code exploited some
properties of the particular aggregate function. I ended up replicating
this behaviour to avoid regressing perf.

Change-Id: Id9dc21d1d676505d3617e1e4f37557397c4fb260
Reviewed-on: http://gerrit.cloudera.org:8080/4655
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 03:27:12 +00:00
Jim Apple
6775893894 IMPALA-4447: Rein in overly broad sed that dirties the tree
This patch fixes a sed expression to make sure it only alters the code
it is meant to alter, not the comment describing the code.

Tested with tests/run-tests.py query_test/test_udfs.py

Change-Id: I51a0498d24b7fccc05b6183123501766cb36f85e
Reviewed-on: http://gerrit.cloudera.org:8080/5008
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 02:44:36 +00:00
Tim Armstrong
27e9f0aea1 IMPALA-4446: expr-test fails under ASAN
Various places in the LikePredicate code assumed StringVal is
null-terminated. There is no such guarantee. By coincidence string
literals were sometimes backed by std::string storage that was
null-terminated, so this bug was latent until recently.

Testing:
Was able to reproduce the failure locally under ASAN, now the test
passes. Running the full ASAN tests to verify, but putting this up
for review first to unbreak the build sooner.

Change-Id: I0ac10d34dd6463ab52e41de1002ef065cfe63a20
Reviewed-on: http://gerrit.cloudera.org:8080/5000
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 02:03:24 +00:00
aphadke
f3d23be478 IMPALA-4258: Remove duplicated and unused test macros
Macros defined in test-macros.h are either duplicated in gtest-util.h
or are unused anywhere in the code. This change deletes test-macros.h

Change-Id: I08539d7e46b89d7e0a4338510b65f9867814c275
Reviewed-on: http://gerrit.cloudera.org:8080/4917
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 01:23:41 +00:00
Jim Apple
4af2ea4fe4 IMPALA-4438: Serialize test_failpoints.py to reduce memory pressure
On EC2 c3.4xlarge instances, with 8 cores and 30GB RAM, this test could
trigger the Linux OOM killer by running tests in parallel. This patch
switches to serial execution, which makes the test take four minutes,
rather than one to two minutes.

Change-Id: Iea4a588e1228d38f90387a077cbe530257636b7d
Reviewed-on: http://gerrit.cloudera.org:8080/4999
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 01:06:33 +00:00
Jim Apple
ae24bf2850 Add -build_shared_libs for default build for speed.
This is already recommended by the wiki:

https://cwiki.apache.org/confluence/display/IMPALA/Building+Impala

Change-Id: Ic83db07e59ff339dcce7362bd296ebcfd60b71d6
Reviewed-on: http://gerrit.cloudera.org:8080/4970
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 00:36:28 +00:00
Matthew Jacobs
08d89a5cc3 IMPALA-3710: Kudu DML should ignore conflicts by default
Removes the non-standard IGNORE syntax that was allowed for
DML into Kudu tables to indicate that certain errors should
be ignored, i.e. not fail the query and continue. However,
because there is no way to 'roll back' mutations that
occurred before an error occurs, tables are left in an
inconsistent state and it's difficult to know what rows were
successfully modified vs which rows were not. Instead, this
change makes it so that we always 'ignore' these conflicts,
i.e. a 'best effort'. In the future, when Kudu will provide
the mechanisms Impala needs to provide a notion of isolation
levels, then Impala will be able to provide options for more
traditional semantics.

After this change, the following errors are ignored:
* INSERT where the PK already exists
* UPDATE/DELETE where the PK doesn't exist
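
The two ignored cases above can be illustrated with a toy in-memory model.
This is only a sketch of the 'best effort' semantics, not Impala or Kudu
code: the dict-backed table, the apply_dml() helper, and the tuple format
for operations are all hypothetical names chosen for the example.

```python
# Toy model of 'best effort' DML: key conflicts are skipped, not fatal.
# The table is a plain dict keyed by primary key (an assumption for
# illustration; nothing here is an Impala or Kudu API).

def apply_dml(table, ops):
    """Apply (kind, pk, value) operations, ignoring key conflicts.

    Returns a (rows_modified, conflicts_ignored) tuple.
    """
    modified = 0
    ignored = 0
    for kind, pk, value in ops:
        if kind == "INSERT":
            if pk in table:          # PK already exists: ignore
                ignored += 1
                continue
            table[pk] = value
        elif kind == "UPDATE":
            if pk not in table:      # PK doesn't exist: ignore
                ignored += 1
                continue
            table[pk] = value
        elif kind == "DELETE":
            if pk not in table:      # PK doesn't exist: ignore
                ignored += 1
                continue
            del table[pk]
        else:
            raise ValueError("unknown operation: %s" % kind)
        modified += 1
    return modified, ignored


table = {1: "a"}
result = apply_dml(table, [
    ("INSERT", 1, "dup"),   # conflict, skipped
    ("INSERT", 2, "b"),     # applied
    ("UPDATE", 3, "c"),     # conflict, skipped
    ("DELETE", 2, None),    # applied
])
```

Earlier successful mutations stay in place when a later operation
conflicts, which is why the modified-row count is worth reporting.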

Another follow-up patch will change other violations to be
handled in this way as well, e.g. nulls inserted in
non-nullable cols.

Reporting:
The number of rows inserted is reported to the coordinator,
which makes the aggregate available to the shell and via the
profile.
TODO: Return rows modified for INSERT via HS2 (IMPALA-1789).
TODO: Return rows modified for other CRUD (beeswax+hs2) (IMPALA-3713).
TODO: Return error counts for specific warnings (IMPALA-4416).

Testing:
Updated tests. Ran all functional tests. More tests will be
needed when other conflicts are handled in the same way.

Change-Id: I83b5beaa982d006da4997a2af061ef7c22cad3f1
Reviewed-on: http://gerrit.cloudera.org:8080/4911
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 20:34:00 +00:00
Martin Grund
ce4c5f6743 IMPALA-4365: Enabling end-to-end tests on a remote cluster
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] <cm_host address>

Options:
  -h, --help            show this help message and exit
  --snapshot-file=SNAPSHOT_FILE
                        Path to the test-warehouse archive
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not
                        set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
                        passwordless SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster

Testing:

This patch is being submitted with the understanding that there are
still cleanup issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 10:16:55 +00:00
Tim Armstrong
ef689edf36 IMPALA-4437: fix crash in disk-io-mgr
This fixes another issue where the 'buffer_' field was not set to NULL
on an error, triggering a DCHECK.

Testing:
Added a unit test that triggers the bug on the two different codepaths
that I fixed.

Change-Id: Ib76cf5ba8d368b2b37bdc1d2133b8ddcb39f9e00
Reviewed-on: http://gerrit.cloudera.org:8080/4979
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 08:29:58 +00:00
Tim Armstrong
51b1310681 IMPALA-3872: allow providing PyPi mirror for python packages
We still rely on the python.org json API, which doesn't seem to be
mirrored (instead there's a html-based index format implemented by
the mirrors).

The mirror can be provided by setting the PYPI_MIRROR environment
variable. The default is "https://pypi.python.org".
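
A minimal sketch of how such a lookup could work, assuming a plain
environment read with a fallback. The pypi_mirror() helper and the
DEFAULT_PYPI_MIRROR constant are hypothetical names; only the variable
name PYPI_MIRROR and the default URL come from the description above.

```python
import os

# Default taken from the commit description.
DEFAULT_PYPI_MIRROR = "https://pypi.python.org"


def pypi_mirror(environ=None):
    """Return the mirror base URL, falling back to the default.

    'environ' is injectable so the lookup can be exercised without
    touching the real process environment.
    """
    env = os.environ if environ is None else environ
    return env.get("PYPI_MIRROR", DEFAULT_PYPI_MIRROR)
```

For example, pypi_mirror({"PYPI_MIRROR": "http://mirror.example"}) yields
the override, while pypi_mirror({}) yields the default.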

Change-Id: Ibc11f010332c0225121c86c9930e35c7ac01409c
Reviewed-on: http://gerrit.cloudera.org:8080/4770
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 05:34:50 +00:00
Tim Armstrong
381e719065 IMPALA-4266: Java udf returning string can give incorrect results
The memory management of string results was wrong: strings returned from
Exprs must live until the next time FreeLocalAllocations() is called.
Otherwise the buffer holding the string is freed or reused by the next
UDF call. The fix is to copy string values into a buffer with the
right lifetime.

Testing:
Added a regression test based on Bharath's example that reproduced the
bug reliably.

Change-Id: I705d271814cb1143f67d8a12f4fd87bab7a8e161
Reviewed-on: http://gerrit.cloudera.org:8080/4941
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 02:47:11 +00:00
Tim Armstrong
10fa472fa6 IMPALA-4302,IMPALA-2379: constant expr arg fixes
This patch fixes two issues around handling of constant expr args.
The patches are combined because they touch some of the same code
and depend on some of the same memory management cleanup.

First, it fixes IMPALA-2379, where constant expr args were not visible
to UDAFs. The issue is that the input exprs need to be opened before
calling the UDAF Init() function.

Second, it avoids overhead from repeated evaluation of constant
arguments for ScalarFnCall expressions on both the codegen'd and
interpreted paths. A common example is an IN predicate with a
long list of constant values.

The interpreted path was inefficient because it always evaluated all
children expressions. Instead in this patch constant args are
evaluated once and cached. The memory management of the AnyVal*
objects was somewhat nebulous - adjusted it so that they're allocated
from ExprContext::mem_pool_, which has the correct lifetime.
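
The evaluate-once-and-cache idea for the interpreted path can be sketched
as follows. This is a simplified illustration in Python rather than the
actual C++ change: the Const, Column, and InPredicate classes and the
open() hook are hypothetical stand-ins, and memory management is not
modelled at all.

```python
class Const:
    """A constant argument; counts how often it is evaluated."""

    def __init__(self, value):
        self.value = value
        self.evals = 0

    def is_constant(self):
        return True

    def evaluate(self, row):
        self.evals += 1
        return self.value


class Column:
    """A per-row argument that reads a field from the row."""

    def __init__(self, name):
        self.name = name

    def is_constant(self):
        return False

    def evaluate(self, row):
        return row[self.name]


class InPredicate:
    """probe IN (candidates...), with constant candidates cached."""

    def __init__(self, probe, candidates):
        self.probe = probe
        self.candidates = candidates
        self._cached = None

    def open(self):
        # Evaluate each constant argument exactly once, up front.
        self._cached = [
            c.evaluate(None) if c.is_constant() else None
            for c in self.candidates
        ]

    def evaluate(self, row):
        value = self.probe.evaluate(row)
        for child, cached in zip(self.candidates, self._cached):
            # Reuse the cached value for constants; only non-constant
            # children are re-evaluated per row.
            candidate = cached if child.is_constant() else child.evaluate(row)
            if candidate == value:
                return True
        return False


consts = [Const(v) for v in (1, 2, 3)]
pred = InPredicate(Column("x"), consts)
pred.open()
results = [pred.evaluate({"x": v}) for v in (2, 5, 3)]
```

After three rows each constant has still been evaluated only once, which
is the saving for long IN lists.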

The codegen'd path was inefficient only with varargs - with fixed
arguments the LLVM optimiser is able to infer after inlining that
the expressions are constant and remove all evaluation. However,
for varargs it stores the vararg values into a heap-allocated buffer.
The LLVM optimiser is unable to remove these stores because they
have a side-effect that is visible to code outside the function.

The codegen'd path is improved by evaluating varargs into an automatic
buffer that can be optimised out. We also make a small related change
to bake the string constants into the codegen'd code.

Testing:
Ran exhaustive build.

Added regression test for IMPALA-2379 and MemPool test for aligned
allocation. Added a test for IN predicates with constant strings.

Perf:
Added a targeted query that demonstrates the improvement. Also manually
validated the non-codegen'd perf. Also ran TPC-H and targeted perf
queries locally - didn't see any significant changes.

+--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| Workload           | Query                         | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
+--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| TARGETED-PERF(_20) | primitive_filter_in_predicate | parquet / none / none | 1.19   | 9.82        | I -87.85%  |   3.82%   |   0.71%        | 1           | 10    |
+--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+

(I) Improvement: TARGETED-PERF(_20) primitive_filter_in_predicate [parquet / none / none] (9.82s -> 1.19s [-87.85%])
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+
| Operator     | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%) | Max      | Base Max | Delta(Max) | #Hosts | #Rows  | Est #Rows |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+
| 01:AGGREGATE | 14.39%     | 155.88ms | 214.61ms | -27.37%    |   2.68%   | 163.38ms | 227.53ms | -28.19%    | 1      | 1      | 1         |
| 00:SCAN HDFS | 85.60%     | 927.46ms | 9.43s    | -90.16%    |   4.49%   | 1.01s    | 9.50s    | -89.42%    | 1      | 13.77K | 14.05K    |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+

Change-Id: I45c3ed8c9d7a61e94a9b9d6c316e8a53d9ff6c24
Reviewed-on: http://gerrit.cloudera.org:8080/4838
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 02:44:51 +00:00
Sailesh
e3483c44a3 IMPALA-4441: Divide-by-zero in RuntimeProfile::SummaryStatsCounter::SetStats
This patch anticipates the case where total_num_values_ can be 0 and
makes sure a divide-by-zero is not possible.
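
The shape of such a guard, sketched in Python for illustration only. The
summary_stats() helper is hypothetical, and returning zeros for the empty
case is an assumption: the description does not say what values the
actual fix reports when there are no samples.

```python
def summary_stats(values):
    """Compute min/max/avg, guarding against an empty sample set."""
    total_num_values = len(values)
    if total_num_values == 0:
        # Assumed behaviour: report zeros instead of dividing by zero.
        return {"min": 0, "max": 0, "avg": 0}
    return {
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / total_num_values,
    }
```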

Change-Id: I33f1e6fb45505dce7d79497d1632c5f63a409151
Reviewed-on: http://gerrit.cloudera.org:8080/4975
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 00:55:47 +00:00
Matthew Jacobs
1a99b78227 IMPALA-4442: Fix FE ParserTests UnsatisfiedLinkError
In some development environments, the ParserTests may always fail with a
java.lang.UnsatisfiedLinkError:

org.apache.impala.service.FeSupport.NativeGetStartupOptions()[B
  at o.a.i.service.FeSupport.NativeGetStartupOptions(Native Method)
  at o.a.i.service.FeSupport.GetStartupOptions(FeSupport.java:268)
  at o.a.i.common.RuntimeEnv.<init>(RuntimeEnv.java:47)
  at o.a.i.common.RuntimeEnv.<clinit>(RuntimeEnv.java:34)
  at o.a.i.testutil.TestUtils.assumeKuduIsSupported(TestUtils.java:288)
  at o.a.i.analysis.ParserTest.TestKuduUpdate(ParserTest.java:1697)

I believe the issue is related to some static loading of
classes and/or libraries in Java because changing the
ParserTest to initialize the Frontend makes the error go
away. I haven't been able to pin-point the exact issue with
loading, but it makes sense that the ParserTest should
initialize the Frontend static state if it will be called by
libfesupport later since it seems to be an issue affecting
some environments and not others, i.e. subject to
environmental factors.

This fixes the issue by changing ParserTest to extend
FrontendTestBase which initializes the Frontend class
statically.

Change-Id: I1828504f79c51679f9ca07176bffbe248d450e87
Reviewed-on: http://gerrit.cloudera.org:8080/4976
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 00:31:35 +00:00
Thomas Tauber-Marshall
3be4b3efd0 IMPALA-1169: Admission control info on the queries debug webpage
This patch adds a new event, 'Queued', to the query event log to
indicate when a query is queued by the admission controller. This
means that queries on the '/queries' page that are currently
queued will display this as their 'Last Event', making it possible
to see which queries are currently queued.

It also adds a column to show the resource pool associated with
the queries, and it updates the wording of the first event that
gets marked for each query from 'Start execution' to 'Query
submitted', since this is before planning and admission control
and therefore execution hasn't actually started yet.

Change-Id: I504e3c829a14318721e3a42de6281bcc578f7283
Reviewed-on: http://gerrit.cloudera.org:8080/4756
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-11-07 23:26:02 +00:00