When the data loading finishes, it is possible for some HDFS blocks to
be under replicated. If impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Resubmitted with filesystem type check.
Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8
Reviewed-on: http://gerrit.cloudera.org:8080/8916
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Using fsck breaks non-HDFS builds: local, S3, and Isilon.
This reverts commit 5a7c10ec3d.
Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc
Reviewed-on: http://gerrit.cloudera.org:8080/8880
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under replicated. If impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322
Reviewed-on: http://gerrit.cloudera.org:8080/8846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The two Kudu loads and Hive UDFs can all run in parallel. This
should shave about 4 minutes off of the data load. (Current
timings are 3.5, 4, and 0.6 minutes, see below.)
I've run dataload with this change many times.
Loading Kudu functional (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu.log)...
Loading workload 'functional-query' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 3 min 29 sec)
Loading Kudu TPCH (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu-tpch.log)...
Loading workload 'tpch' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 4 min 0 sec)
Loading Hive UDFs (logging to /home/ubuntu/Impala/logs/data_loading/build-and-copy-hive-udfs.log)...
Loading Hive UDFs OK (Took: 0 min 41 sec)
Change-Id: I7e93ee5a77ec9271b980b88bef7ad512ecbe0407
Reviewed-on: http://gerrit.cloudera.org:8080/8822
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This reverts commit e4f585240a.
Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes though, the data load script is against a remote cluster,
and those cases, the data load process is now broken.
Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
This is a revert of a revert, re-enabling parallel data load. It avoid
the race condition by explicitly configuring the temporary directory in
question in load-data.py.
When the parallel data load change went in, we discovered
a race with a signature of:
java.io.FileNotFoundException: File
/tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist
The number in this path is milliseconds since the epoch, and the race
occurs when two queries submitted to HiveServer2, running with the local
runner, hit the same millisecond time stamp. The upstream bug is
https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the
symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which
is now marked as a dupe).
I've tested this by running data load 5 times on the same machines
where it failed before. I also ran data load manually and inspected
the system to make sure that the temporary directories are getting
created as expected in /tmp/impala-data-load-*.
Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10
Reviewed-on: http://gerrit.cloudera.org:8080/8405
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
We may be seeing a race with errors like "java.io.FileNotFoundException:
File /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist".
This reverts commit e020c37106.
Change-Id: I46da93f4315a5a4bdaa96fa464cb51922bd6c419
Reviewed-on: http://gerrit.cloudera.org:8080/8386
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.
Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.
This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.
This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the DATA LOAD LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.
Any other location that could use DATA LOAD LOCAL is
also switched over to use it. This includes the
testescape* tables which now print the appropriate
DATA LOAD commands as a result of text_delims_table.py.
Any location that already uses DATA LOAD LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the LOAD_DEPENDENT_HIVE section.
Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.
Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpcds and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)
To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.
I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.
The resulting log output is currently like so (but without the
timestamps):
15:58:04 Started Loading functional-query data in background; pid 8105.
15:58:04 Started Loading TPC-H data in background; pid 8106.
15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04 Started Loading TPC-DS data in background; pid 8107.
15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)
I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format
Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
To make statements execute, some clients require always appending
a semi-colon to the end. The workaround is quite simple.
Change-Id: Id8b9f3dde4445513f1f389785a002c6cc6b3dada
Reviewed-on: http://gerrit.cloudera.org:8080/8132
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:
- Managed by Cloudera Manager (CM)
- GPL Extras need to be installed
- KMS and KeyTrustee installed and available as a service
- SERDEPROPERTIES in the Hive DB modified to accept wide tables
- Hive warehouse dir points to /test-warehouse
The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This insures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)
It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.
Usage: remote_data_load.py [options] <cm_host address>
Options:
-h, --help show this help message and exit
--snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
--cm-user=CM_USER Cloudera Manager admin user
--cm-pass=CM_PASS Cloudera Manager admin user password
--gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
--ssh-user=SSH_USER System user on the remote machine with
passwordless SSH configured.
--no-load Do not try to load the snapshot
--exploration-strategy=EXPLORATION_STRATEGY
--test Run end-to-end tests against cluster
Testing:
This patch is being submitted with the understanding that there are
still clean up issues that need to be addressed in the remote data
load script, for which JIRA's have been filed.
However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.
Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
* Change run-step to output full log path
* Change text to say "Computing table stats" rather than "Computing
HBase stats" when running compute-table-stats.sh
Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07
Reviewed-on: http://gerrit.cloudera.org:8080/4825
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
With this commit we enable loading of TPC-H data in Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries by using round()
function and update the test results.
Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in decompressor and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.
Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.
In addition, the byte size of a tuple is still assumed to be
within 32-bit. If it needs to be upgraded to 64-bit, it will be
done in a separate change.
A new test has been added to test the decompression of a 2GB
snappy block compressed text file.
Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f
Reviewed-on: http://gerrit.cloudera.org:8080/3575
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Both `find -executable` and the Bash "&>>" operator are too new to be
supported on RHEL5. Both have reasonable workarounds, so prefer them.
Note that this may not be the exhaustive list of such "modern"
conventions, but RHEL5 isn't working end-to-end, so we can't identify
all of them in a single commit yet.
Testing:
Before, the RHEL5 build would fail quite early here. Now, data load
succeeds and most of the backend tests successfully run.
Change-Id: I7438bed908d8026327923607238808122212d2d8
Reviewed-on: http://gerrit.cloudera.org:8080/3531
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Internal Jenkins
When a LOCATION that does not have the scheme specified is used,
the default FS is used as the filesystem scheme.
The default FS is set as 'file:/tmp' for localFS runs, however the
Hadoop library seems to ignore the '/tmp' part of the defaultFS for
locations without schemes and just uses 'file:'.
So the test warehouse is in: 'file:/tmp/test-warehouse'
However, the scripts access '/test-warehouse' without the scheme which
hadoop translates to: 'file:/test-warehouse' which does not exist.
This change disables metadata loading on local filesystem if there is
a schema change detected just as it is done in S3 and Isilon too.
Change-Id: Ie404079aeb2f837ac8b03244b2019e2c8ee9f221
Reviewed-on: http://gerrit.cloudera.org:8080/3384
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Sailesh Mukil <sailesh@cloudera.com>
Added checks/error handling:
* Negative string lengths while decoding dictionary or data page.
* Buffer overruns while decoding dictionary or data page.
* Some metadata FILECHECKs were converted to statuses.
Testing:
Unit tests for:
* decoding of strings with negative lengths
* truncation of all parquet types
* dictionary creation correctly handling error returns from Decode().
End-to-end tests for handling of negative string lengths in
dictionary- and plain-encoded data in corrupt files, and for
handling of buffer overruns for string data. The corrupted
parquet files were generated by hacking Impala's parquet
writer to write invalid lengths, and by hacking it to
write plain-encoded data instead of dictionary-encoded
data by default.
Performance:
set num_nodes=1;
set num_scanner_threads=1;
select * from biglineitem where l_orderkey = -1;
I inspected MaterializeTupleTime. Before the average was 8.24s and after
was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%).
Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece
Reviewed-on: http://gerrit.cloudera.org:8080/3387
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
If a stale snapshot is detected, the full data load proceeds even
if the option to skip data load was set. A check is added to fail
immediately if this happens for isilon or s3 because the full data
load will not work on these filesystems currently.
Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52
Reviewed-on: http://gerrit.cloudera.org:8080/2811
Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com>
Tested-by: Internal Jenkins
This failure happens on filesystems other than HDFS because as a
part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the
new directories that the patch tries to create in create-load-data.
Change-Id: I8de74db93893c5273ccc9c687f608959628f5004
Reviewed-on: http://gerrit.cloudera.org:8080/2644
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
The 20 lines we dump currently are often not enough to
diagnose a failure quickly. Increasing to 50 lines.
Printing 50 lines is also consistent with our run-step
script which also prints 50 lines.
Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73
Reviewed-on: http://gerrit.cloudera.org:8080/2632
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
All logs, test results and SQL files generated during data
loading and testing are now consolidated under a single new
directory $IMPALA_HOME/logs. The goal is to simplify archiving
in Jenkins runs and debugging.
The new structure is as follows:
$IMPALA_HOME/logs/cluster
- logs of Hadoop components and Impala
$IMPALA_HOME/logs/data_loading
- logs and SQL files produced in data loading
$IMPALA_HOME/logs/fe_tests
- logs and test output of Frontend unit tests
$IMPALA_HOME/logs/be_tests
- logs and test output of Backend unit tests
$IMPALA_HOME/logs/ee_tests
- logs and test output of end-to-end tests
$IMPALA_HOME/logs/custom_cluster_tests
- logs and test output of custom cluster tests
I tested this change with a full data load which
was successful.
Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa
Reviewed-on: http://gerrit.cloudera.org:8080/2456
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
These tests functionally test whether the following type of files
are able to be scanned properly:
1) Add a parquet file with multiple blocks such that each node has to
scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group
that spans the entire file. Only one scan range should do any work
in this case.
Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183
This patch is not a pure merge patch in the sense that goes beyond conflict
resolution to also address reviews to the 'feature/kudu' branch as a whole.
The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/
Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
In commit 960808 I forgot to update the data-loading script for the
conversion of a shell script to a python script. It turns out there were
a couple of other little problems too. I checked manually that the data
was loaded after these changes.
Change-Id: Id81fc423348515ab446835868025cb839c77f52c
Reviewed-on: http://gerrit.cloudera.org:8080/1851
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Log output of data loading steps to files only print to stdout
if there is an actual failure. The output of some steps is very noisy,
and some steps even have output that looks like errors.
This is implemented with a run-step helper function in bash that handles
redirection and logging. Any bash command can be prefixed with run-step
<step description> <log file name> to redirect the output to a log file.
Sample output is:
Starting Impala cluster (logging to start-impala-cluster.log)... OK
Setting up HDFS environment (logging to setup-hdfs-env.log)... OK
Skipped loading the metadata.
Loading HBase data only (logging to load-hbase-only.log)... OK
Loading Hive UDFs (logging to build-and-copy-hive-udfs.log)... OK
Running custom post-load steps (logging to custom-post-load-steps.log)... OK
Caching test tables (logging to cache-test-tables.log)... OK
Loading external data sources (logging to load-ext-data-source.log)... OK
Splitting HBase (logging to create-hbase.log)... OK
Change-Id: I6396540858c408b084039a87efc81e1004626f39
Reviewed-on: http://gerrit.cloudera.org:8080/1760
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Maven's INFO log level is very verbose and includes a lot of progress
information that is minimally useful.
Maven doesn't have an option to output only ERROR and WARNING log
messages. As a workaround, use grep to filter out the majority of the
output (only warnings, errors, tests, and success/failure).
Also add a header with relevant info about the maven command:
targets and working directory.
Change-Id: I828b870edc2fc80a6460e6ed594d507c46e69c82
Reviewed-on: http://gerrit.cloudera.org:8080/1752
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch allows passing additional cluster startup flags.
This is needed when building with optimizations in release
mode as the default cluster startup would only pick up a
debug build.
Change-Id: Ib98d6814558f2d82bdeac0e3cce1fb7db048c459
Reviewed-on: http://gerrit.cloudera.org:8080/1775
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
The original error reporting relied on $0 being accessible from the
current working dir, which failed if a script changed the working dir
and $0 was relative. This updates the error reporting command to cd back
to the original dir before accessing $0.
Change-Id: I2185af66e35e29b41dbe1bb08de24200bacea8a1
Reviewed-on: http://gerrit.cloudera.org:8080/1666
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Changes:
1) Consistently use "set -euo pipefail".
2) When an error happens, print the file and line.
3) Consolidated some of the kill scripts.
4) Added better error messages to the load data script.
5) Changed use of #!/bin/sh to bash.
Change-Id: I14fef66c46c1b4461859382ba3fd0dee0fbcdce1
Reviewed-on: http://gerrit.cloudera.org:8080/1620
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Allow Impala to start only with a running HMS (and no additional services like HDFS,
HBase, Hive, YARN) and use the local file system.
Skip all tests that need these services, use HDFS caching or assume that multiple impalads
are running.
To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions since this is the location where the test data will be extracted.
Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS 1348 tests passed
S3 1157 tests passed
Local Filesystem 1161 tests passed
Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
The test_verify_runtime_profile test failed during C5.5 builds and
GVMs because this test relies on the table lineitem_multiblock to
have 3 blocks. However, due to the rules to load the data not being
followed in the functional_schema_template.sql file, the table ended
up being stored with only one block.
This change moves the data load to the end of create-load-data.sh
file which would load the data even for snapshots.
Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be
Reviewed-on: http://gerrit.cloudera.org:8080/1153
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.
Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
A script is added that generates two parquet files with nested data.
One file has modern nested types encoding and the other one has
legacy encoding. This data will be used for testing nested types
support for "create table like file" statement.
Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Patch c0c9fbdf57df667f63632437f612a63baf1534dd: "Load Kudu as part of
the normal data loading workflow" passed the build when it was first
introduced as it had introduced changes to the datasets directory
which cause the metadata loading not to be skipped. However it failed
all subsequent times as there were no further changes to the metadata
directory.
This patch makes data loading for Kudu run independently of whether
metadata load is skipped or not since a new Kudu cluster is now created
on each build.
This patch also removes one last reference to 'functional_kudu.liketbl'
in AnalyzeDDLTest since we don't create/load data for that table anymore.
Change-Id: Ibe9acc7da17062ac317dff06a8c57dd87cf566d6
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7110
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: David Alves <david.alves@cloudera.com>
This patch enables running Impala tests against Isilon as the default file system. The
intention is to run tests against a realistic deployment, i.e, Isilon replacing HDFS as
the underlying filesystem.
Specifically, it does the following:
- Adds a new environment variable DEFAULT_FS, which points to HDFS by default.
- Makes the fs.defaultFs property in core-site.xml use the DEFAULT_FS environment
variable, such that all clients talk to Isilon implicitly.
- Unset FILESYSTEM_PREFIX when the TARGET_FILESYSTEM is Isilon, since path prefixes
are no longer needed.
- Only starts the Hive Metastore and the Impala service stack when running
tests against Isilon.
We don't start KMS/HBase because they're not relevant to Isilon. We also don't
start YARN, Hive and LLama because hive queries are disabled with Isilon.
The scripts that start/stop Hive, YARN and Llama should be modified to point to a
filesystem other than HDFS in the future.
Change-Id: Id66bfb160fe57f66a64a089b465b536c6c514b63
Reviewed-on: http://gerrit.cloudera.org:8080/449
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
The database will be used for testing in the future.
Change-Id: I60b54b36db9493a5bea308151b4027cd47d73047
Reviewed-on: http://gerrit.cloudera.org:8080/400
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch adds a basic implementation for a Kudu table, scan node and
supports simple DDL operations. Similar to "normal" HDFS tables, the
DDL statements executed in the Hive metastore can be propagated to
Kudu. Othewise, the Kudu table behaves similarily to a HBase table.
The syntax to create a table stored in Kudu is:
create table kudu (id int, name string, age int)
tblproperties (
'storage_handler' =
'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'kudu',
'kudu.master_addresses' = '0.0.0.0:7051',
'kudu.key_columns' = 'id,name');
The 'storage_handler' attribute is fixed and used to identify a table
backed by Kudu. The storage handler attribute is required, to make sure
that Hive will not create a directory on HDFS for this table.
The 'table_name' and 'master_addresses' properties define
the properties of the physical persistence in Kudu. The 'key_columns'
defines a list of columns that should be used as a (composite) key.
A Kudu table can be created either as a managed or un-managed (external)
table. Creating an external table behaves similar to an external HDFS
table, if the table does not exist in Kudu create it, if it exists use
it and check schema compatibility. If an external table is deleted, only
delete the Hive table. If a managed table is created, the Kudu table
must not exist and if it is deleted the Kudu table is deleted as well.
TODO: Allow creation of external table without specifying columns.
Change-Id: I794abf6abe30ace4426c53f77676ae1dcb4341ec
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6358
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
As needed, fix up file paths and other misc things to get
more test cases running against S3.
Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad
Reviewed-on: http://gerrit.cloudera.org:8080/167
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
After the hive/hdfs rebase, the indexed lzo file names changed. This patch uses a
wildcard rather than a specific file name to protect against such changes. It's safe
because the test simply expects a partition that does not have index files.
Change-Id: I6d32609b62df83fe2a8ef935d7ca6506ecff5e0d
Reviewed-on: http://gerrit.cloudera.org:8080/150
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
First change for IMPALA-1209 to address Impala limitations when
using HDFS encryption. This adds a KMS process to the testdata
cluster. This was tested manually by creating a key and an
encryption zone.
Change-Id: I499154506386f04e71c5371b128c10868b1e1318
Reviewed-on: http://gerrit.cloudera.org:8080/41
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This patch enables loading data to s3 instead of hdfs. It is preliminary in nature,
as such, there are a few caveats:
- The fe tests do not work.
- Only loading from a test-warehouse snapshot and metastore snapshot is enabled.
- Until hive works with s3, only a subset of all the tests will work.
Change-Id: Ia66a5f836b4245e3b022a49de805eec337a51324
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5851
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Previously, when we started all the services, we created an HBase table from hive to avoid
a replication bug. This had the side-effect of creating a test-warehouse directory in
hdfs. After that check was removed, we no longer create the test-warehouse, causing the
full-data-load build to fail.
Change-Id: I75479562d33e08c79ad155c615cecb5b91c0eab6
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5904
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins