Commit Graph

446 Commits

Author SHA1 Message Date
Alex Behm
e57fd2d831 IMPALA-3491: Use unique_database fixture in test_local_fs.py
Testing: Ran hdfs/core and localfs/core private builds.

Change-Id: I0720458882ac3b1138deccf9af0ee57bf2eed7dc
Reviewed-on: http://gerrit.cloudera.org:8080/3334
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-08 16:30:32 -07:00
Alex Behm
025fd3bd7f IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.

The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.

There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.

Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Reviewed-on: http://gerrit.cloudera.org:8080/3299
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-07 17:29:59 -07:00
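
The zero-count check described in IMPALA-3646 above can be illustrated with a minimal Python sketch. This is not Impala's C++ decoder: the function names are hypothetical, and the header layout (a varint whose low bit selects a literal or repeated run, with literal runs counted in groups of 8) is assumed from the Parquet RLE/bit-packed hybrid encoding.

```python
def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 varint from buf starting at pos."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def read_run_header(buf, pos):
    """Return (kind, count, next_pos) for one run header.

    A decoded count of 0 is rejected as corrupt data instead of being
    handed to the rest of the decoder.
    """
    header, pos = read_uleb128(buf, pos)
    is_literal = header & 1
    count = header >> 1
    if is_literal:
        count *= 8  # assumed: literal runs are stored in groups of 8 values
    if count == 0:
        raise ValueError("corrupt RLE data: run count of 0")
    return ("literal" if is_literal else "repeat"), count, pos
```

Raising an error at the header keeps the invalid count from reaching any loop that expects to make progress on every run.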
Alex Behm
95064359cc IMPALA-3491: Use unique_database fixture in test_delimited_text.py.
Testing: Ran the test locally 10 times in a loop on exhaustive.

Change-Id: Idedd5f03984e41a4b3ebf271e50863e980c66cb6
Reviewed-on: http://gerrit.cloudera.org:8080/3096
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-07 09:34:30 -07:00
Tim Armstrong
d23e5505c8 IMPALA-3670: fix sorter buffer mgmt bugs
Also make test_scratch_disk.py more deterministic, by using
max_block_mgr_memory, which doesn't include scanner memory.
The fixed test_scratch_disk.py exercises the other sorter bug,
which occurs when scratch cannot be written.

Testing:
Added a test that does a sort with various memory limits and consumes
the whole output of the sorter (we have many tests of sorts with limits
but limited coverage of sorts without limits).  Ran an exhaustive test
run before posting for review.

This added test reproduced one of the sorter bugs, where var-len blocks
were not always attached to the output batch. The other bug was
reproduced by the test change in IMPALA-3669: test_scratch_disk fix.

Change-Id: Ia1a0ddffa0a5b157ab86a376b7b7360a923698d6
Reviewed-on: http://gerrit.cloudera.org:8080/3315
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-06 22:34:19 -07:00
Tim Armstrong
37ec25396f IMPALA-3344: Simplify sorter and document/enforce invariants.
Clarify relationships between classes, clean up the previous mess
where every class was friends with the other so there's an actual
distinction between public and private members. TupleIterator
is now no longer tied to TupleSorter, just Run.

Document and enforce invariants in many cases.

Factor out some functions from large functions.

Simplify and document iterator logic.

Make management of buffers when iterating over output stream more
explicitly correct: either use MarkNeedToReturn() or attach block
to the batch as appropriate. The SortedRunMerger didn't handle
resource transfer correctly, except if all the memory came from
the batch's MemPool. This patch fixes the cases when resources
are attached to the batches, but not the 'need_to_return' case.
Document that SortedRunMerger requires 'deep_copy_input' to be true
if batches can have the 'need_to_return' flag set.

Also use the atomic block exchange operation when moving between
blocks in unpinned runs to prevent pin failures at that point.
I have explicitly avoided changing the hairy block management logic
when allocating buffers for merging; that will need addressing in
a follow-up patch.

Add a SpilledRuns counter so that it's more explicit that spilling
occurred.

Testing:
Added some tests for corner cases with empty and NULL strings.
Fixed a test that previously failed with OOM but now succeeds.

Performance:
Benchmarking against old code initially revealed some regressions from
changes in inlining. Force inlining the TupleComparator::operator() and
iterator Next()/Prev() functions helped and performance seems similar or
slightly better on the targeted orderby benchmarks.

Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a
Reviewed-on: http://gerrit.cloudera.org:8080/2826
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-02 21:33:08 -07:00
Alex Behm
32c40f9c5d Remove redundant test in test_avro_schema_resolution.py
Change-Id: I7123cd5e19d79122af3b4fef2c092442b7a098f1
Reviewed-on: http://gerrit.cloudera.org:8080/3095
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-31 23:32:11 -07:00
Huaisi Xu
816735a032 IMPALA-3092: Set default value to NULL in AvroSchemaConverter
This change ensures that Avro tables created without column definitions
remain queryable if columns are added via ALTER TABLE. The bug was that
when synthesizing an Avro schema from the column definitions we used to
not add default values.

Change-Id: Ib86e9ba1f4329b285ae14ee299365f7291a7410e
Reviewed-on: http://gerrit.cloudera.org:8080/3219
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-31 23:32:11 -07:00
Sailesh Mukil
6f1fe4ebe7 IMPALA-3577, IMPALA-3486: Partitions on multiple filesystems breaks with S3_SKIP_INSERT_STAGING
The HdfsTableSink usually creates an HDFS connection to the filesystem
that the base table resides in. However, if we create a partition in
a FS different than that of the base table and set
S3_SKIP_INSERT_STAGING to "true", the table sink will try to write to
a different filesystem with the wrong filesystem connector.

This patch allows the table sink itself to work with different
filesystems by getting rid of a single FS connector and getting a
connector per partition.

This also reenables the multiple_filesystems test and modifies it to
use the unique_database fixture so that parallel runs on the same
bucket do not clash and end up in failures.

This patch also introduces a SECONDARY_FILESYSTEM environment variable
which will be set by the test to allow S3, Isilon and the localFS to
be used as the secondary filesystems.

All jobs with HDFS as the default filesystem need to set the
appropriate environment for S3 and Isilon, i.e. the following:
 - export AWS_SECRET_ACCESS_KEY
 - export AWS_ACCESS_KEY_ID
 - export SECONDARY_FILESYSTEM (to whatever filesystem needs to be
   tested)

TODO: SECONDARY_FILESYSTEM and FILESYSTEM_PREFIX and NAMENODE have a
lot of similarities. Need to clean them up in a following patch.

Change-Id: Ib13b610eb9efb68c83894786cea862d7eae43aa7
Reviewed-on: http://gerrit.cloudera.org:8080/3146
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-05-31 23:32:11 -07:00
Matthew Jacobs
f413e236a8 IMPALA-3579: Strict handling of numeric overflow in text parsing
Adds a query option 'strict_mode' which treats integer and
floating-point overflows as parse errors. In the past,
overflows were ignored and the max value was returned. When
this query option is set, overflowing values are treated as if
they were completely invalid data, i.e. NULL is returned.
When abort_on_error is enabled, this means the query is
aborted.

Notes:
* DECIMAL overflow/underflow is already treated as an error.
* The handling in text-converter treats underflows the same
  as overflows, so they would result in the same behavior.
  However, floating point parsing never returns an underflow
  today.
* We may also want to handle numeric values that are truncated
  when parsing to integer types, e.g. 10.5 -> 10.

Change-Id: I7409c31ec0cb6fe0b2d9842b9f58fe1670914836
Reviewed-on: http://gerrit.cloudera.org:8080/3150
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:20 -07:00
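
The behavior described in IMPALA-3579 above can be sketched in Python for a 32-bit integer column. The function name is hypothetical, and clamping negative overflows to the minimum value is an assumption; the commit message only mentions the max value.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def parse_int32(text, strict_mode=False):
    """Parse text as a 32-bit integer; None stands in for SQL NULL."""
    try:
        value = int(text)
    except ValueError:
        return None  # completely invalid data is always NULL
    if INT32_MIN <= value <= INT32_MAX:
        return value
    if strict_mode:
        return None  # overflow is treated like invalid data
    # Legacy behavior: the overflow is ignored and the value is clamped.
    return INT32_MAX if value > INT32_MAX else INT32_MIN
```

With abort_on_error enabled, the caller would abort the query wherever this sketch returns None for a parse error.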
Bharath Vissapragada
49610e2cfa IMPALA-3314/IMPALA-3513: Fix querying tables/partitions altered to Avro format
Bug: Impalads crash if we query an Avro table with stale metadata

Cause: This happens because avroSchema_ is not set in HdfsTable,
which is not propagated to the avro scanner and it doesn't have
appropriate checks to make sure the schema is non-null.

The patch fixes the following.

1. Avro scanner should gracefully handle the case where the avro schema
   is not set. Appropriate null checks and a meaningful error message have
   been added.

2. This is a special case with multi-fileformat partitioned tables.
   avroSchema_ should be set in HdfsTable even if any subset of the
   partitions are backed by avro. Without this patch, we only set it
   if the base table file format is Avro.

Change-Id: I09262d3a7b85a2263c721f3beafd0cab2a1bdf4b
Reviewed-on: http://gerrit.cloudera.org:8080/3136
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:20 -07:00
Matthew Jacobs
f067929f3a IMPALA-3535: Ignore invalid per-pool default query options
In 2.5 we added the ability to set per-pool default query
options. A string of key-value pairs can be specified with a
pool configuration. However, if any options fail to parse,
then all the options are ignored. We want that behavior (and
returning an error) when parsing the process-wide default
query options on startup and when parsing the options sent
from a client (e.g. in beeswax server) because an error can
be returned immediately for the triggering action at that
time (i.e. starting the impalad or submitting a query with
the options set). This behavior is bad for the pool default
query options because (a) the configuration is set by the
administrator and there's nothing we can do until a query is
submitted and (b) one invalid option shouldn't mean that
other valid options aren't set.

Change-Id: If04733b775963091b0314c65286df126fd812358
Reviewed-on: http://gerrit.cloudera.org:8080/3056
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-05-17 10:09:05 -07:00
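
A minimal Python sketch of the lenient parsing that IMPALA-3535 above describes for pool defaults. The comma-separated key=value format, the function name, and the validator table are assumptions made for illustration, not Impala's implementation.

```python
def parse_pool_options(option_string, known_options):
    """Parse 'k1=v1,k2=v2' and keep every option that is valid.

    known_options maps a lowercase option name to a converter that
    raises ValueError on bad input. Invalid pairs are collected and
    returned rather than discarding the whole string.
    """
    parsed = {}
    invalid = []
    for pair in option_string.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        key = key.strip().lower()
        convert = known_options.get(key)
        if not sep or convert is None:
            invalid.append(pair)
            continue
        try:
            parsed[key] = convert(value.strip())
        except ValueError:
            invalid.append(pair)
    return parsed, invalid
```

A strict caller (startup flags, client-supplied options) could still reject the input whenever the second return value is non-empty, while the pool-default path logs it and applies the valid options.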
Casey Ching
e61b5bc119 IMPALA-3511: Fix race setting up TestKuduOperations
A couple of tests could both attempt to create/destroy the same
database if they were running in parallel. Several other related
tests were marked as requiring serial execution; these needed to be
marked for serial execution as well.

Change-Id: If0573a755cd371363c2e43c001d5c1ba499793c6
Reviewed-on: http://gerrit.cloudera.org:8080/3063
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-05-14 01:30:01 -07:00
Skye Wanderman-Milne
9174dee395 IMPALA-1578: fix text scanner to handle "\r\n" delimiters split across blocks
This patch modifies HdfsTextScanner to specifically check for split
"\r\n" delimiters when the scan range ends with '\r'. If there does
turn out to be a split delimiter, the next tuple is considered the
responsibility of the next scan range's scanner, as if the delimiter
appeared fully in the second scan range. This should not affect the
overall performance characteristics of the text scanner since it
already must do a remote read past the end of the scan range to read
the last tuple.

Change-Id: Id42b441674bb21517ad2788b99942a4b5dc55420
Reviewed-on: http://gerrit.cloudera.org:8080/2803
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 23:06:36 -07:00
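
The ownership rule in IMPALA-1578 above can be modeled with a toy Python function. It is a simplification, not HdfsTextScanner's logic: it assumes fixed-size scan ranges and assigns each tuple to the range that holds the last byte of the delimiter preceding it, so a "\r\n" split across a boundary counts as belonging to the second range.

```python
import re


def tuple_owners(data, range_len):
    """Return [(tuple_bytes, owning_range_index), ...] for a text file.

    Toy model: the first tuple belongs to range 0; every other tuple
    belongs to the range containing the final byte of the delimiter
    that precedes it.
    """
    owners = []
    start = 0
    owner = 0
    # "\r\n" is listed first so it is matched as a single delimiter.
    for match in re.finditer(rb"\r\n|\r|\n", data):
        owners.append((data[start:match.start()], owner))
        start = match.end()
        owner = (match.end() - 1) // range_len
    if start < len(data):
        owners.append((data[start:], owner))
    return owners
```

In `b"a,1\r\nb,2\r\n"` with `range_len=4`, the first range ends on '\r' and the '\n' opens the second range, so the tuple `b,2` is assigned to range 1.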
Dan Hecht
a0d4249652 IMPALA-3337: fix "Cancelled" warnings when LIMIT clause is specified
The cancelled status is propagated in scanner threads to cause them to
shut down once the limit has been satisfied, but depending on the code
path and when abort_on_error=false, this internal status would sometimes
incorrectly end up in the error log. Fix this by factoring out the
abort_on_error handling code so that it's handled more consistently
across scanners. Parquet, RC, and Avro all suffered from this bug.

Testing: exhaustive

Change-Id: I4a91a22608e346ca21a23ea66c855eae54bbced6
Reviewed-on: http://gerrit.cloudera.org:8080/2964
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:57 -07:00
Sailesh Mukil
ed7f5ebf53 IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and have the same function
signatures as PyWebHdfsClient by deriving from a base abstract class
BaseFileSystem so that they can be used interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.

P.S: Currently, the test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS performance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:49 -07:00
Huaisi Xu
99879aad47 IMPALA-3385: Fix crashes on accessing error_log
We used to check whether error_log was empty with error_log.empty(),
but this may return false even when some of its members have empty
messages, which crashes Impala.

This commit:

1. Remove error_log from HdfsScanNode::ProcessSplit() since the
logs may contain irrelevant errors. This also prevents a
potential race condition in which error_log is checked and the if
clause is entered, but error_log is later changed by other threads.
Also prevents crashes when error_log has a cleared entry.

2. Hold locks and return a copy of the
RuntimeState::error_log_, fixing a race in the coordinator.

Change-Id: I3a7e3d22e26147ada780aae5aed1f2e25a515afc
Reviewed-on: http://gerrit.cloudera.org:8080/2829
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Huaisi Xu <hxu@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:48 -07:00
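
Item 2 of IMPALA-3385 above (hold the lock and return a copy) follows a common pattern that can be sketched in Python. The class below is a hypothetical stand-in, not Impala's C++ RuntimeState.

```python
import threading


class RuntimeState:
    """Hypothetical stand-in that guards an error log with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._error_log = []

    def log_error(self, message):
        with self._lock:
            self._error_log.append(message)

    def error_log(self):
        # Return a snapshot taken under the lock. Callers never see the
        # live list, so later writers cannot change it underneath them.
        with self._lock:
            return list(self._error_log)
```

The copy costs an allocation per call, which is the price for letting readers inspect the log without holding the lock.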
Alex Behm
bce6b2b422 IMPALA-2736: Basic column-wise slot materialization in Parquet scanner.
This change is a first step towards a more efficient Parquet scanner.
The focus is on presenting the new code flow that materializes
the table-level slots in a column-wise fashion, without going deep
into actually improving scan efficiency.

After these changes there are several obvious places that should
be optimized to realize efficiency gains.

Summary of changes
- the table-level tuples are materialized in a column-wise fashion
  with new ColumnReader::ReadValueBatch() functions
- this is done by materializing a 'scratch' batch, and transferring
  scratch tuples that survive filters/conjuncts to the output batch
- the tuples of nested collections are still materialized in
  a row-wise fashion using the ColumnReader::ReadValue() function,
  just as before

Mini benchmark
I ran the following queries on a single impalad before and after my
change using a synthetic 'huge_lineitem' table.
I modified hdfs-scan-node.cc to set the number of rows of any row
batch to 0 to focus the measurement on the scan time.

Query options:
set num_scanner_threads=1;
set disable_codegen=true;
set num_nodes=1;

select * from huge_lineitem;
Before: 22.39s
After:  18.50s

select * from huge_lineitem where l_linenumber < 0;
Before: 25.11s
After:  20.56s

select * from huge_lineitem where l_linenumber % 2 = 0;
Before: 26.32s
After:  21.82s

Change-Id: I72a613fa805c542e39df20588fb25c57b5f139aa
Reviewed-on: http://gerrit.cloudera.org:8080/2779
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:48 -07:00
Lars Volker
b5570da405 IMPALA-1740: Add support for skip.header.line.count.
HIVE-5795 introduced a parameter skip.header.line.count to skip header
lines from input files. This change introduces the capability to skip
an arbitrary number of header lines from csv input files on hdfs. The
size of the total file header must be smaller than
max_scan_range_length, otherwise an error will be reported. This is
necessary because scan ranges are not read in disk order, so there is
no way of identifying header lines except by counting from the start
of the first scan range.

[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1  | 1  |
| 2  | 2  |
| 3  | 3  |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1   | c2   |
+------+------+
| NULL | NULL |
| 1    | 1    |
| 2    | 2    |
| 3    | 3    |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2

Fetched 4 row(s) in 0.41s

Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:46 -07:00
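
The header-skipping constraint in IMPALA-1740 above can be sketched in Python: the header lines are counted from the start of the first scan range, and it is an error if they do not all fit inside it. The function name and the use of "\n" as the only line terminator are assumptions for illustration.

```python
def skip_header(first_range, header_line_count):
    """Drop header_line_count lines from the start of the first scan range.

    first_range holds the bytes of the first scan range of the file.
    Raises ValueError if the header extends past the end of that range.
    """
    pos = 0
    for _ in range(header_line_count):
        newline = first_range.find(b"\n", pos)
        if newline == -1:
            raise ValueError("header does not fit in the first scan range")
        pos = newline + 1
    return first_range[pos:]
```

Only the scanner of the first scan range would call this; the other ranges cannot know how many lines precede them, which is why the header has to fit in the first range.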
Michael Brown
2826df93d2 IMPALA-3427: test_insert_wide_table: use unique_database fixture
test_insert_wide_table, when run in parallel, attempts to DROP/CREATE
tables with the same name in the same database. A race thus exists in
which tests can stomp on each other's tables.

The fix is to use the unique_database fixture, in which each test can
run in parallel within its own database, and thus its own table.

Change-Id: If3b22f1b089d20c6024d6d680af82f102342394e
Reviewed-on: http://gerrit.cloudera.org:8080/2871
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Michael Brown <mikeb@cloudera.com>
2016-05-12 14:17:44 -07:00
Anuj Phadke
a915293109 IMPALA-1850: Allow fs.defaultFS to be set to a non-HDFS filesystem
This change whitelists the supported filesystems which can be set
as Default FS for Impala to run on.
This patch configures Impala to use S3 as the default filesystem, rather
than a secondary filesystem as before.

Change-Id: I2f45bef6c94ece634045acb906d12591587ccfed
Reviewed-on: http://gerrit.cloudera.org:8080/1121
Reviewed-by: anujphadke <aphadke@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:40 -07:00
Tim Armstrong
89aa6597f4 IMPALA-3354: bad sorter pivot selection on some inputs
Switch to a median of three random tuples that should be very robust to
a range of inputs. It may be slightly worse than the existing pivot
selection on some inputs where the original algorithm is close to
optimal (e.g. already sorted inputs), but should be typically
better overall.

Always recurse on the smaller partition: this prevents stack
overflow even with bad pivot selection.

The overhead is minimal - in profiles for small sorts I'm seeing pivot
selection take at most 0.5% of CPU time.

The improved pivot selection gives modest improvements of 2-5% on the
targeted perf order by benchmarks on a single node run with TPC-H
scale factor 20.

Change-Id: Iae50112b6deca3d6268e18b6f4daae1af279b452
Reviewed-on: http://gerrit.cloudera.org:8080/2824
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:39 -07:00
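
The two ideas in IMPALA-3354 above, a median of three random samples as the pivot and recursion only into the smaller partition, can be sketched as a generic Python quicksort. This is an illustration of the technique, not a port of Impala's sorter; the partitioning scheme shown is an assumption.

```python
import random


def median_of_three(items, lo, hi, rng):
    """Median of three elements sampled at random from items[lo..hi]."""
    samples = [items[rng.randint(lo, hi)] for _ in range(3)]
    samples.sort()
    return samples[1]


def _sort(items, lo, hi, rng):
    while lo < hi:
        pivot = median_of_three(items, lo, hi, rng)
        i, j = lo, hi
        while i <= j:
            while items[i] < pivot:
                i += 1
            while items[j] > pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        # Recurse into the smaller side and loop on the larger one, so
        # the recursion depth stays logarithmic even with bad pivots.
        if j - lo < hi - i:
            _sort(items, lo, j, rng)
            lo = i
        else:
            _sort(items, i, hi, rng)
            hi = j


def quicksort(items, rng=None):
    """Sort items in place and return the list."""
    if rng is None:
        rng = random.Random()
    _sort(items, 0, len(items) - 1, rng)
    return items
```

Because each recursive call covers at most half of the current range, the stack depth is bounded by roughly log2(n) frames regardless of how unbalanced the partitions are.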
Henry Robinson
6629e79f32 IMPALA-3077: Don't run spilling / nested tests without PHJ
Change-Id: Ide5e20f05b14aa19a0f570398712ac9297b525eb
Reviewed-on: http://gerrit.cloudera.org:8080/2822
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:36 -07:00
Tim Armstrong
31d4103416 IMPALA-3317: fix crash in sorter when spilling zero-length strings
The sorter converts string pointers to block offsets when spilling.
There was a subtle bug in the logic that assumed if the offset was
past the end of the current block, the data must necessarily be in
the next block. This is not true for zero-length strings, because
there is no backing storage so the pointer can point to the byte
after the end of the block.

This patch fixes the bug by using a simpler offset encoding scheme
that packs the block number into the upper 32 bits and the offset
within the block into the lower 32 bits.

It also slightly refactors the functions so that the method signatures
and types are more consistent with the rest of the impala codebase.

Also fix a bug with handling of multiple query options in tests.

Change-Id: I5f64593e94d367d6b6efb61a8b86e35516f18839
Reviewed-on: http://gerrit.cloudera.org:8080/2780
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:36 -07:00
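
The offset encoding described in IMPALA-3317 above (block number in the upper 32 bits, offset within the block in the lower 32 bits) can be sketched in Python. The function names are hypothetical.

```python
def pack_offset(block_index, offset_in_block):
    """Pack a block number and an in-block offset into one 64-bit value."""
    if not (0 <= block_index < 2**32 and 0 <= offset_in_block < 2**32):
        raise ValueError("block index and offset must each fit in 32 bits")
    return (block_index << 32) | offset_in_block


def unpack_offset(packed):
    """Inverse of pack_offset: return (block_index, offset_in_block)."""
    return packed >> 32, packed & 0xFFFFFFFF
```

With this scheme a zero-length string that points one byte past the end of block 0 (for example offset 8 in an 8-byte block) stays distinct from offset 0 of block 1, which a single flat offset cannot express.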
Henry Robinson
630f1ab270 IMPALA-3367: Ensure runtime filters tests run on 3 nodes
Runtime filter tests require that each scan gets assigned to three
backends. In our test environment (with Impala daemons on the same
machine) concurrent load can cause these scans to get assigned to only
two nodes, which breaks the runtime filter tests. See IMPALA-2479 for
more details.

This patch changes the tests to run sequentially.

Change-Id: I1ca43ca0f215539d909a1dcabee5e87f2437420b
Reviewed-on: http://gerrit.cloudera.org:8080/2802
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:33 -07:00
Tim Armstrong
8e0267f2a9 IMPALA-3328: xfail TPC-H q9 if memory limit exceeded
The test is flaky due to nondeterministic memory consumption under ASAN.
Xfail until we have more concrete guarantees on mem usage.

Change-Id: Ieefcb8f8ecc179f483f6d06af80c814fe0ef728e
Reviewed-on: http://gerrit.cloudera.org:8080/2770
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:30 -07:00
Sailesh Mukil
c083c79888 IMPALA-3256: TestUdfs.test_libs_with_same_filenames failure
There was an observed race between TestUdfs.test_java_udfs and
TestUdfs.test_libs_with_same_filenames, because they both used the
same database name.

This patch just changes the name of the database used by
test_libs_with_same_filenames.

Change-Id: Icc38cbe720a3b9d864935416eb10612171132e17
Reviewed-on: http://gerrit.cloudera.org:8080/2767
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:29 -07:00
casey
687a4373e0 Make test_low_mem_limit_low_selectivity_scan faster
The test took ~7 mins to run on my machine. It was spending about half its
time re-creating the same data set over and over. This may also help
with the test failures during continuous integration.

All that needed to be done was to change the setup/teardown from method
level to class level. This should have no effect on the test coverage.

Change-Id: Ic874a7e9c8305bef3e16df03a807493f069e9265
Reviewed-on: http://gerrit.cloudera.org:8080/2766
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:29 -07:00
casey
0d9028dd49 IMPALA-3179: Fix alter table properties for Kudu tables
This is one of the merge follow up tasks. It seems like there was just a
line missing to copy the metastore data into the Kudu table object. The
HDFS table class does the same thing as in this change.

Change-Id: I51c9942f2f398afb7dff2485da759a185ad7505f
Reviewed-on: http://gerrit.cloudera.org:8080/2728
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-04-12 14:03:44 -07:00
Skye Wanderman-Milne
4208fdacc1 IMPALA-3301: TestParquet::test_resolution_by_name fails on legacy join/agg nodes
Change-Id: I7ce4c1f28966a1b14c25df228338ef4db6851a5b
Reviewed-on: http://gerrit.cloudera.org:8080/2723
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-04-12 14:02:35 -07:00
Henry Robinson
449901fac6 IMPALA-3283: Disable runtime filter tests for local filesystems
The runtime filter tests assume 3 scans for alltypes* tables. For local
filesystems this isn't a correct assumption. Fixing the tests to be
resilient to different number of scans is hard, and filters aren't
dependent on the filesystem implementation, so let's just disable them.

Change-Id: Ibcd18c7e69355cde70e13e5190ed2503adb7532b
Reviewed-on: http://gerrit.cloudera.org:8080/2688
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2016-04-05 17:37:32 +00:00
Skye Wanderman-Milne
9b51b2b6e6 IMPALA-2835: introduce PARQUET_FALLBACK_SCHEMA_RESOLUTION query option
This patch introduces a new query option,
PARQUET_FALLBACK_SCHEMA_RESOLUTION which allows Parquet files' schemas
to be resolved by either name or position.  It's "fallback" because
eventually field IDs will be the primary schema resolution scheme, and
we don't want to create an option that we will have to change the name
of later. The default is still by position. I chose to do a query
option because it will make testing easier and also be easier to
diagnose resolution problems quickly in the field. If users want to
switch the default behavior to be by name (like Hive), they can use
the --default_query_options flag.

This patch also introduces a new test section, SHELL, which can be
used to execute shell commands in a .test file. This is useful for
copying files into test tables.

Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6
Reviewed-on: http://gerrit.cloudera.org:8080/2384
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2016-04-02 04:04:25 +00:00
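
The difference between the two resolution modes in IMPALA-2835 above can be sketched in Python for flat schemas. The function name is hypothetical, and case-insensitive name matching is an assumption; nested types are ignored.

```python
def resolve_columns(table_columns, file_columns, mode="position"):
    """Map each table column to a file column index, or None if missing."""
    if mode == "position":
        # The i-th table column reads the i-th file column, if present.
        return [i if i < len(file_columns) else None
                for i in range(len(table_columns))]
    if mode == "name":
        by_name = {name.lower(): i for i, name in enumerate(file_columns)}
        return [by_name.get(name.lower()) for name in table_columns]
    raise ValueError("unknown resolution mode: %r" % (mode,))
```

For a table declared as (id, name) and a file written as (name, id, extra), positional resolution reads the wrong columns while resolution by name swaps them back.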
Casey Ching
3d5a21c047 Update another Kudu test to wait for modifying operations
I checked that none of the other .test files contain "insert", "update",
or "delete". This should be the last one. Eventually we should have a
solution that provides a more robust way to "read-your-writes".

Change-Id: I6561c32b1558e0a8685392cc5739e8d2e7900e7c
Reviewed-on: http://gerrit.cloudera.org:8080/2680
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-03-31 12:30:19 +00:00
Michael Brown
b74e57a312 IMPALA-2650: UDF EE tests: use unique databases in some tests
Some of the end-to-end tests in query_test/test_udfs.py create UDFs in
the default database and leave them there. Other tests (e.g.,
test_functions_ddl) polling the default database and expecting to find
no UDFs will fail. It turns out this wouldn't happen in our Jenkins
builds (see IMPALA-2650 for more details as to why), but it manifests
itself with repeated impala-py.test runs in specific order.

The fix is to create the UDFs in databases unique to the test cases.
This leaves the default database pristine during these tests.

Testing:

Before, the following sequence of impala-py.test commands would cause
any subsequent runs of test_functions_ddl to fail:

$ # simulate a subset of serial tests that expect default DB not to have UDFs
$ impala-py.test -m "execute_serially" --workload_exploration_strategy \
    functional-query:exhaustive -k test_functions_ddl metadata/test_ddl.py
PASS
$ # simulate a subset of parallel tests that create UDFs in default DB
$ impala-py.test -n4 -m "not execute_serially" --workload_exploration_strategy \
    functional-query:exhaustive query_test/test_udfs.py
PASS
$ # rerun a subset of serial tests that passed before
$ impala-py.test -m "execute_serially" --workload_exploration_strategy \
    functional-query:exhaustive -k test_functions_ddl metadata/test_ddl.py
FAIL, because test_udfs left UDFs.

Now, I can run these over and over, and they pass.

Change-Id: Id4a8b4764fa310efaa4f6c6f06f64a4e18e44173
Reviewed-on: http://gerrit.cloudera.org:8080/2610
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-03-30 04:50:15 +00:00
Casey Ching
39a28185e8 Re-enable Kudu in build using client stubs when needed
The stubs in Impala broke during the merge commit. This commit removes
the stubs in hopes of improving robustness of the build. The original
problem (Kudu clients are only available for some OSs) is now addressed
by moving the stubbing into a dummy Kudu client. The dummy client only
allows linking to succeed; if any client method is called, Impala will
crash. Before calling any such method, Kudu availability must be
checked.

Change-Id: I4bf1c964faf21722137adc4f7ba7f78654f0f712
Reviewed-on: http://gerrit.cloudera.org:8080/2585
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-03-29 23:57:54 +00:00
Sailesh Mukil
76b674850f IMPALA-2466: Add more tests for the HDFS parquet scanner.
These tests functionally test whether the following type of files
are able to be scanned properly:

1) Add a parquet file with multiple blocks such that each node has to
   scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group
   that spans the entire file. Only one scan range should do any work
   in this case.

Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-03-25 13:10:15 +00:00
Jim Apple
c0397911b5 IMPALA-2973: Loosen bound on join timer test
Code coverage builds are slower than release or debug builds. This
patch gives test_hash_join_timer extra time so that this test doesn't
cause code coverage builds to fail in Jenkins.

Change-Id: I5598e073d779f744d79c5292e80a8ed8f6aa9548
Reviewed-on: http://gerrit.cloudera.org:8080/2608
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-03-23 22:25:13 +00:00
Skye Wanderman-Milne
a75d7dd282 IMPALA-3081: increase mem limit (again) for TestWideRow
LZO occasionally needs > 80MB to run this query. We still need to
investigate why it needs so much memory (the scan alone takes > 70MB),
but for now bump the mem limit again.

Change-Id: Ifdb7c6d33ab2de0f06e7322aa8f8ba107da84d49
Reviewed-on: http://gerrit.cloudera.org:8080/2602
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-03-23 05:14:29 +00:00
Henry Robinson
b3937295fb Runtime filters tests
This patch adds functional tests for runtime filters. It relies on
setting RUNTIME_FILTER_WAIT_TIME_MS high enough to ensure that filters
are received.

To make the test files more readable, this patch also adds a new COMMENT
section to the test syntax, and allows blank spaces between queries so
that the separation of different test cases can be made more obvious.

Currently missing is a test for disabling probe-side filters based on
selectivity, as we lack suitable tables to trigger the disable condition.

Change-Id: I94d617c6d23ffa394a6eb7ead56f1cfb701e0d90
Reviewed-on: http://gerrit.cloudera.org:8080/2603
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-03-23 04:07:14 +00:00
Skye Wanderman-Milne
e999617719 IMPALA-3195: explicitly specify unique database in TestScanners.test_annotate_utf8_option
test_annotate_utf8_option was failing in the exhaustive build because
some other test was switching the session database from 'default' to
'functional_parquet'. This patch amends the test to use the new
'unique_database' fixture, which both ensures the table will be
created in the expected database and handles table cleanup.

Change-Id: I981475581340edc0be68fa2813f5e63990202399
Reviewed-on: http://gerrit.cloudera.org:8080/2586
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2016-03-22 06:43:51 +00:00
Skye Wanderman-Milne
a78f3a8ca5 IMPALA-2069: add USE_UTF8_PARQUET_STRINGS query option
This option toggles whether the parquet writer will use the UTF8
annotation for string columns. This patch includes a test that writes
a table with or without this option, then verifies that the annotation
is or isn't present using a new get_parquet_metadata Python utility.

Change-Id: I030c9f5c6272e09c1ce133f66234e3cfb26b68d4
Reviewed-on: http://gerrit.cloudera.org:8080/2531
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-03-17 05:58:39 +00:00
Michael Brown
58219eac2c IMPALA-2537: EE tests: create and use unique database fixture
To speed up tests and reduce flakiness, introduce a pytest fixture
whereby a test maintainer may request a database unique to his test.
Such databases are suitable for tests that need to create tables within
Python test code. Because the database name is unique to the test, the
test can create any tables within that database it wants without fear
that the same tables will be picked up by another test. Unique databases
effectively guarantee a unique namespace for tables.

To generate the database name, we use the CRC32 checksum of the test's
so-called pytest test ID. This ID is a long string containing the test's
module path, class (if applicable), function name, and parameter set
(e.g., vector). We then concatenate the CRC32 checksum with the test
function name, so that it's easier to identify the test to which the
database belongs. The test author may also override the prefix by
parametrizing the fixture.
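
The naming scheme described above can be sketched roughly as follows. This
is only an illustration, not the fixture's actual code: the helper name
unique_database_name, the underscore separator, and the hexadecimal
formatting of the checksum are all assumptions.

```python
import zlib


def unique_database_name(test_id, function_name, prefix=None):
    """Build a database name from a pytest test ID (hypothetical helper)."""
    # CRC32 of the full test ID; the mask keeps the value unsigned on
    # every Python version.
    checksum = zlib.crc32(test_id.encode("utf-8")) & 0xFFFFFFFF
    # By default the test function name is the prefix; a caller may
    # override it, mirroring the parametrized fixture.
    name_prefix = prefix if prefix is not None else function_name
    # Assumed layout: <prefix>_<8 hex digits of the checksum>.
    return "{0}_{1:08x}".format(name_prefix, checksum)
```

Because the checksum covers the whole test ID, including the parameter set,
two parametrizations of the same test function get different database names
while keeping the same readable prefix.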

We then use a pytest fixture to create the database, hand the name to
the test using the fixture, and clean up the database automatically
after the test completes.

The command `impala-py.test --fixtures` executed from the tests/
directory explains the full usage.

Finally, we modify a few tests to show how test maintainers can use this
fixture.

Not supported here are databases used by .test files, creation of hive
databases, databases with special CREATE parameters such as LOCATION and
COMMENT, or asking the fixture to create multiple databases. Also not
supported would be attempted parallel runs of the same test with the
same test parameters.

Testing:

1. Manual testing of the fixture usage, both in vanilla and
   parametrized context.

2. Manual runs of the tests modified.

3. An exhaustive exploration strategy test run.

Change-Id: I74d200da8a59379388e1edfbb849828f92a1b3b7
Reviewed-on: http://gerrit.cloudera.org:8080/1821
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-03-16 18:29:57 +00:00
David Alves
7381304a23 Merge branch 'feature/kudu' into cdh5-trunk
This is the final merge commit that merges the 'feature/kudu' branch
into cdh5-trunk.

Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a
2016-03-13 19:28:43 -07:00
casey
804cfbdd64 Get and use Kudu from the toolchain by default
This is for review purposes only. This patch will be merged with David's
big merge patch.

Changes:
1) Make Kudu compilation dependent on the OS since not all OSs support
   Kudu.
2) Only run Kudu related tests when Kudu is supported (see #1).
3) Look for Kudu locally, but in a different location. To use a local
   build of Kudu, set KUDU_BUILD_DIR to the path Kudu was built in and
   set KUDU_CLIENT_DIR to the path Kudu was installed in.
   Example:
     git clone https://github.com/cloudera/kudu.git
     ...build 3rd party etc...
     mkdir -p $KUDU_BUILD_DIR
     cd $KUDU_BUILD_DIR
     cmake <path to Kudu source dir>
     make
     DESTDIR=$KUDU_CLIENT_DIR make install
4) Look for Kudu in the toolchain if not using a local Kudu build.
5) Add Kudu service startup scripts. The Kudu in the toolchain is
   actually a parcel that has been renamed (the contents were not
   modified in any way), which means the Kudu service binaries are there.
   Those binaries are now used to run the Kudu service.

Change-Id: I3db88cbd27f2ea2394f011bc8d1face37411ed58
2016-03-11 11:38:05 -08:00
David Alves
82222abaf5 Merge branch 'feature/kudu' into cdh5-trunk
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183

This patch is not a pure merge patch in the sense that it goes beyond conflict
resolution to also address reviews of the 'feature/kudu' branch as a whole.

The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/

Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
2016-03-11 11:37:58 -08:00
Tim Armstrong
368d7be7e6 IMPALA-2728: reenable mem limit test now that it is stable
The test has not xfailed in a long time, so we believe that various
memory usage fixes have fixed the flakiness.

Change-Id: Idff06791e9d880cc8ddf54c0c977a556d3701bea
Reviewed-on: http://gerrit.cloudera.org:8080/2442
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-03-04 07:59:09 +00:00
Juan Yu
c9b33ddf63 IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.
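
The streaming, multi-stream behavior can be illustrated with Python's
standard bz2 module. This is only a sketch of the general technique, not
the decoder changed by this patch; the function name iter_multistream_bz2
and the chunk size are assumptions.

```python
import bz2
import io


def iter_multistream_bz2(fileobj, chunk_size=64 * 1024):
    """Yield decompressed data from a file holding one or more bz2 streams."""
    decompressor = bz2.BZ2Decompressor()
    while True:
        # Read a bounded chunk so the whole file is never held in memory.
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            data = decompressor.decompress(chunk)
            if data:
                yield data
            if decompressor.eof:
                # End of one stream: any leftover bytes belong to the next
                # stream, so continue with a fresh decompressor.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                # The whole chunk was consumed by the current stream.
                chunk = b""
```

A single BZ2Decompressor stops at the end of the first stream and leaves
the rest in unused_data, which reproduces the "only reads the first stream"
symptom if the leftover bytes are ignored.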

Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2016-02-28 21:31:37 -08:00
Bharath Vissapragada
393c65de6d IMPALA-3070: Disable test_hive_udfs_missing_jar on local file system
Local file system builds do not allow running more than one impalad
in parallel. This test relies on running multiple impalads and hence
is disabled for such builds.

Change-Id: I93fe6ae37018885ede4838f5a2ce0bf11148c4e6
Reviewed-on: http://gerrit.cloudera.org:8080/2315
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-26 15:37:24 -08:00
Jim Apple
7f4db1e091 IMPALA-2973: Weaken lower bound on hash join timing.
Because of IMPALA-2407, we use Linux's COARSE clockid_t on EC2.
However, this has a resolution between 1 and 10 milliseconds, and so
cannot be trusted to measure lower bounds as low as those in
test_hash_join_timer.py.

This is a temporary workaround.

Change-Id: I1332b1d9aede129ea6c508e40e20960fad9414a8
Reviewed-on: http://gerrit.cloudera.org:8080/2298
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-02-24 13:31:00 -08:00
Juan Yu
97af107729 IMPALA-2914: Fix DCHECK Check failed: HasDateOrTime()
Some TimestampValue conversion functions assume that the caller
ensures the TimestampValue instance has a valid date or time,
but that is not always true. Change those functions to return the
result in an output parameter and return a boolean indicating
whether the conversion succeeded.

Change-Id: I7a68a1e14d9c4ee5d83da760d4d76c20c36bc359
(cherry picked from commit 47d8977f5976b9be405f44add966820138fbda6f)
Reviewed-on: http://gerrit.cloudera.org:8080/2195
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2016-02-24 13:31:00 -08:00
Bharath Vissapragada
ef0dac661c IMPALA-2843: Persist hive udfs across catalog restarts
This commit adds a new feature to persist hive/java udfs across
catalog restarts. IMPALA-1748 already added this for non-java
udfs by storing them in parameters map of the Db object and
reading them back at catalog startup. However we follow a
different approach for hive udfs by converting them to Hive's
function format and adding them as hive functions to the metastore.
This makes it possible to share udfs between hive and Impala, as the
udfs added from one service are accessible to the other. This commit
takes care of format conversions between hive and impala, so the user
only needs to add a function once in either of the services.

Background: Hive and impala treat udfs differently. Hive resolves the
evaluate function in the udf class at runtime depending on the data
types of the input arguments. So the user can add one function by name
and can pass any arguments to it as long as there is a compatible
evaluate function in the udf class. However, Impala takes the input
types of the udf as a part of the function definition (which maps to
only one evaluate function) and loads the function only for that set of
input argument types. If we have multiple 'evaluate' methods, we need to
add multiple functions, one for each of them.
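
As a rough illustration of the difference, the expansion from one udf name
to one signature per 'evaluate' method can be sketched as below. The helper
expand_udf_signatures, its input format, and the signature string layout
are hypothetical and are not taken from the catalog code.

```python
def expand_udf_signatures(function_name, evaluate_overloads):
    """Return one signature string per 'evaluate' overload (hypothetical).

    evaluate_overloads is a list of (argument_types, return_type) pairs,
    one pair for each 'evaluate' method found in the udf class.
    """
    signatures = []
    for arg_types, return_type in evaluate_overloads:
        # Each overload becomes a separate function with fixed input types.
        signatures.append("{0}({1}) RETURNS {2}".format(
            function_name, ", ".join(arg_types), return_type))
    return signatures
```

For a class with two 'evaluate' methods, a single name therefore maps to
two separately registered functions.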

This commit adds new variants of CREATE | DROP FUNCTION to Impala which
let the user create and drop hive/java udfs without input argument
types or return types. The catalog takes care of loading/dropping the udf
signatures corresponding to each "evaluate" method in the udf symbol
class. The syntax is as follows:

CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts>
DROP FUNCTION [IF EXISTS] <function name>

Examples:

CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;

The older way of creating hive/java udfs with a specific signature is still
supported; however, such udfs are *not* persisted across restarts, so a restart
of the catalog can wipe them out. Additionally, this commit also loads all the
compatible java udfs added outside of Impala, so they needn't be separately
loaded. One thing
to note here is that the functions added using the new CREATE FUNCTION
can only be dropped using the new DROP FUNCTION syntax (without
signature). The same rule applies for the java udfs added using the old
CREATE FUNCTION syntax (with signature).

Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d
Reviewed-on: http://gerrit.cloudera.org:8080/2250
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 23:04:03 -08:00