If a stale snapshot is detected, the full data load proceeds even
if the option to skip data load was set. A check is added to fail
immediately if this happens for isilon or s3 because the full data
load will not work on these filesystems currently.
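
A minimal sketch of such a guard, under stated assumptions: the function name
check_full_load_supported and its "stale" argument are hypothetical; only
TARGET_FILESYSTEM and the isilon/s3 restriction come from the description.

```shell
# Hypothetical guard: returns non-zero when a stale snapshot would force a
# full data load on a filesystem where the full load is not supported.
# $1 is "true" if the snapshot was detected as stale (assumed flag).
check_full_load_supported() {
  local snapshot_is_stale="$1"
  local target_fs="${TARGET_FILESYSTEM:-hdfs}"
  if [ "${snapshot_is_stale}" = "true" ]; then
    case "${target_fs}" in
      isilon|s3)
        echo "ERROR: stale snapshot detected; full data load is not" \
             "supported on ${target_fs}" >&2
        return 1
        ;;
    esac
  fi
  return 0
}
```

Calling the guard before the load starts lets the build fail immediately
instead of partway through a load that cannot succeed.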
Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52
Reviewed-on: http://gerrit.cloudera.org:8080/2811
Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com>
Tested-by: Internal Jenkins
This failure happens on filesystems other than HDFS because as a
part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the
new directories that the patch tries to create in create-load-data.
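
As an illustration of the fix, directory paths can be built through a small
helper that always applies the prefix. The helper name prefixed_path and the
example directory are assumptions, not the code in the patch.

```shell
# Hypothetical helper: build a warehouse path that honors FILESYSTEM_PREFIX.
# The prefix is assumed to be empty on HDFS and set to a filesystem URI
# otherwise.
prefixed_path() {
  printf '%s%s\n' "${FILESYSTEM_PREFIX:-}" "$1"
}

# A data-loading script might then create directories like this
# (not executed here; the directory name is a placeholder):
#   hadoop fs -mkdir -p "$(prefixed_path /test-warehouse/some_new_dir)"
```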
Change-Id: I8de74db93893c5273ccc9c687f608959628f5004
Reviewed-on: http://gerrit.cloudera.org:8080/2644
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
The 20 lines we dump currently are often not enough to
diagnose a failure quickly. Increasing to 50 lines.
Printing 50 lines is also consistent with our run-step
script which also prints 50 lines.
Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73
Reviewed-on: http://gerrit.cloudera.org:8080/2632
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
All logs, test results and SQL files generated during data
loading and testing are now consolidated under a single new
directory $IMPALA_HOME/logs. The goal is to simplify archiving
in Jenkins runs and debugging.
The new structure is as follows:
$IMPALA_HOME/logs/cluster
- logs of Hadoop components and Impala
$IMPALA_HOME/logs/data_loading
- logs and SQL files produced in data loading
$IMPALA_HOME/logs/fe_tests
- logs and test output of Frontend unit tests
$IMPALA_HOME/logs/be_tests
- logs and test output of Backend unit tests
$IMPALA_HOME/logs/ee_tests
- logs and test output of end-to-end tests
$IMPALA_HOME/logs/custom_cluster_tests
- logs and test output of custom cluster tests
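
The layout above could be created with a loop like the following. This is a
sketch only; create_log_dirs is an assumed name and the actual scripts may
create the directories differently.

```shell
# Sketch: create the consolidated log layout under $IMPALA_HOME/logs and
# print the top-level log directory.
create_log_dirs() {
  local logs_dir="${IMPALA_HOME:?IMPALA_HOME must be set}/logs"
  local sub
  for sub in cluster data_loading fe_tests be_tests ee_tests \
             custom_cluster_tests; do
    mkdir -p "${logs_dir}/${sub}"
  done
  echo "${logs_dir}"
}
```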
I tested this change with a full data load which
was successful.
Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa
Reviewed-on: http://gerrit.cloudera.org:8080/2456
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
These tests verify that the following types of files can be scanned
properly:
1) Add a parquet file with multiple blocks such that each node has to
scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group
that spans the entire file. Only one scan range should do any work
in this case.
Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183
This patch is not a pure merge patch in the sense that it goes beyond conflict
resolution to also address reviews to the 'feature/kudu' branch as a whole.
The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/
Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
In commit 960808 I forgot to update the data-loading script for the
conversion of a shell script to a python script. It turns out there were
a couple of other little problems too. I checked manually that the data
was loaded after these changes.
Change-Id: Id81fc423348515ab446835868025cb839c77f52c
Reviewed-on: http://gerrit.cloudera.org:8080/1851
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Log the output of data loading steps to files, and print to stdout only
if there is an actual failure. The output of some steps is very noisy,
and some steps even have output that looks like errors.
This is implemented with a run-step helper function in bash that handles
redirection and logging. Any bash command can be prefixed with run-step
<step description> <log file name> to redirect the output to a log file.
Sample output is:
Starting Impala cluster (logging to start-impala-cluster.log)... OK
Setting up HDFS environment (logging to setup-hdfs-env.log)... OK
Skipped loading the metadata.
Loading HBase data only (logging to load-hbase-only.log)... OK
Loading Hive UDFs (logging to build-and-copy-hive-udfs.log)... OK
Running custom post-load steps (logging to custom-post-load-steps.log)... OK
Caching test tables (logging to cache-test-tables.log)... OK
Loading external data sources (logging to load-ext-data-source.log)... OK
Splitting HBase (logging to create-hbase.log)... OK
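
A minimal sketch of such a helper, assuming a LOG_DIR variable and a 50-line
tail on failure. The function is spelled run_step here for portability; the
messages and names are illustrative and may differ from the real helper.

```shell
# Assumed location for step logs; defaults to a fresh temp directory.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"

# Usage: run_step <step description> <log file name> <command> [args...]
# Runs the command with stdout and stderr redirected to the log file and
# prints a one-line status. On failure, prints the tail of the log and
# returns the exit code of the command.
run_step() {
  local msg="$1" log="${LOG_DIR}/$2"
  shift 2
  printf '%s (logging to %s)... ' "${msg}" "$(basename "${log}")"
  if "$@" > "${log}" 2>&1; then
    echo "OK"
  else
    local rc=$?
    echo "FAILED"
    echo "'$*' failed. Tail of log:"
    tail -n 50 "${log}"
    return "${rc}"
  fi
}
```

For example, run_step "Caching test tables" cache-test-tables.log some_command
would print a single status line unless some_command fails.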
Change-Id: I6396540858c408b084039a87efc81e1004626f39
Reviewed-on: http://gerrit.cloudera.org:8080/1760
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
Maven's INFO log level is very verbose and includes a lot of progress
information that is minimally useful.
Maven doesn't have an option to output only ERROR and WARNING log
messages. As a workaround, use grep to filter out the majority of the
output (only warnings, errors, tests, and success/failure).
Also add a header with relevant info about the maven command:
targets and working directory.
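
The filtering idea can be sketched as follows. The grep patterns and the
function names filter_mvn_output and mvn_quiet are assumptions; the exact
patterns used by the patch may differ.

```shell
# Sketch: keep only warnings, errors, test summaries and the build result
# from Maven output read on stdin.
filter_mvn_output() {
  grep -E -e '^\[(WARNING|ERROR)\]' \
          -e 'Tests run:' \
          -e 'BUILD (SUCCESS|FAILURE)'
}

# Assumed wrapper: print a header with the targets and working directory,
# then run Maven through the filter.
mvn_quiet() {
  echo "========================================"
  echo "Running mvn $*"
  echo "Directory: $(pwd)"
  echo "========================================"
  # pipefail makes the pipeline fail when mvn itself fails.
  ( set -o pipefail; mvn "$@" | filter_mvn_output )
}
```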
Change-Id: I828b870edc2fc80a6460e6ed594d507c46e69c82
Reviewed-on: http://gerrit.cloudera.org:8080/1752
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch allows passing additional cluster startup flags.
This is needed when building with optimizations in release
mode as the default cluster startup would only pick up a
debug build.
Change-Id: Ib98d6814558f2d82bdeac0e3cce1fb7db048c459
Reviewed-on: http://gerrit.cloudera.org:8080/1775
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
The original error reporting relied on $0 being accessible from the
current working dir, which failed if a script changed the working dir
and $0 was relative. This updates the error reporting command to cd back
to the original dir before accessing $0.
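
A sketch of the approach, with assumed names (ORIGINAL_DIR, report_error,
install_error_trap); the real error reporting command may be structured
differently.

```shell
# Remember the directory the script was started from.
ORIGINAL_DIR="$(pwd)"

report_error() {
  local line="$1"
  local script
  # cd back to the starting directory first, so a relative $0 resolves.
  script="$(cd "${ORIGINAL_DIR}" && cd "$(dirname -- "$0")" && pwd)/$(basename -- "$0")"
  echo "Error in ${script} at line ${line}" >&2
}

install_error_trap() {
  # ERR traps are a bash feature; LINENO is expanded when the trap fires.
  trap 'report_error "${LINENO}"' ERR
}
```

A script would call install_error_trap once near the top, before any cd.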
Change-Id: I2185af66e35e29b41dbe1bb08de24200bacea8a1
Reviewed-on: http://gerrit.cloudera.org:8080/1666
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Changes:
1) Consistently use "set -euo pipefail".
2) When an error happens, print the file and line.
3) Consolidated some of the kill scripts.
4) Added better error messages to the load data script.
5) Changed use of #!/bin/sh to bash.
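
To illustrate why "set -euo pipefail" matters, the sketch below runs a small
script in a child bash. The function name strict_demo is hypothetical and
exists only for this example.

```shell
# Sketch: with -e and pipefail, a failure anywhere in a pipeline aborts the
# script. Running it in a child bash keeps the strict options out of the
# caller.
strict_demo() {
  bash -c '
    set -euo pipefail
    # Without pipefail this pipeline would succeed because "cat" succeeds.
    false | cat
    echo "not reached"
  '
}
```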
Change-Id: I14fef66c46c1b4461859382ba3fd0dee0fbcdce1
Reviewed-on: http://gerrit.cloudera.org:8080/1620
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Allow Impala to start only with a running HMS (and no additional services like HDFS,
HBase, Hive, YARN) and use the local file system.
Skip all tests that need these services, use HDFS caching or assume that multiple impalads
are running.
To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions since this is the location where the test data will be extracted.
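
For example, the environment could be prepared as follows. The directory
below is a placeholder; any location writable by the current user should
work.

```shell
# Sketch: environment for running against the local filesystem.
export TARGET_FILESYSTEM=local
export WAREHOUSE_LOCATION_PREFIX="${TMPDIR:-/tmp}/impala-local-warehouse"

# The test data is extracted here, so the directory must be writable.
mkdir -p "${WAREHOUSE_LOCATION_PREFIX}"
[ -w "${WAREHOUSE_LOCATION_PREFIX}" ] || \
  echo "WARNING: ${WAREHOUSE_LOCATION_PREFIX} is not writable" >&2
```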
Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS 1348 tests passed
S3 1157 tests passed
Local Filesystem 1161 tests passed
Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
The test_verify_runtime_profile test failed during C5.5 builds and
GVMs because this test relies on the table lineitem_multiblock to
have 3 blocks. However, due to the rules to load the data not being
followed in the functional_schema_template.sql file, the table ended
up being stored with only one block.
This change moves the data load to the end of the create-load-data.sh
file, so the data is loaded even when a snapshot is used.
Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be
Reviewed-on: http://gerrit.cloudera.org:8080/1153
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.
Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
A script is added that generates two parquet files with nested data.
One file has modern nested types encoding and the other one has
legacy encoding. This data will be used for testing nested types
support for "create table like file" statement.
Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Patch c0c9fbdf57df667f63632437f612a63baf1534dd: "Load Kudu as part of
the normal data loading workflow" passed the build when it was first
introduced as it had introduced changes to the datasets directory
which caused the metadata loading not to be skipped. However, it failed
all subsequent times as there were no further changes to the metadata
directory.
This patch makes data loading for Kudu run independently of whether
metadata load is skipped or not since a new Kudu cluster is now created
on each build.
This patch also removes one last reference to 'functional_kudu.liketbl'
in AnalyzeDDLTest since we don't create/load data for that table anymore.
Change-Id: Ibe9acc7da17062ac317dff06a8c57dd87cf566d6
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7110
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: David Alves <david.alves@cloudera.com>
This patch enables running Impala tests against Isilon as the default file system. The
intention is to run tests against a realistic deployment, i.e., Isilon replacing HDFS as
the underlying filesystem.
Specifically, it does the following:
- Adds a new environment variable DEFAULT_FS, which points to HDFS by default.
- Makes the fs.defaultFS property in core-site.xml use the DEFAULT_FS environment
variable, such that all clients talk to Isilon implicitly.
- Unsets FILESYSTEM_PREFIX when the TARGET_FILESYSTEM is Isilon, since path prefixes
are no longer needed.
- Only starts the Hive Metastore and the Impala service stack when running
tests against Isilon.
We don't start KMS/HBase because they're not relevant to Isilon. We also don't
start YARN, Hive and Llama because Hive queries are disabled with Isilon.
The scripts that start/stop Hive, YARN and Llama should be modified to point to a
filesystem other than HDFS in the future.
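
The environment handling described above might look roughly like this. The
function name configure_default_fs is hypothetical, and the HDFS default
address is an assumption for a local minicluster.

```shell
# Sketch: default DEFAULT_FS to HDFS and drop the path prefix on Isilon.
configure_default_fs() {
  : "${TARGET_FILESYSTEM:=hdfs}"
  : "${DEFAULT_FS:=hdfs://localhost:20500}"   # assumed minicluster address
  if [ "${TARGET_FILESYSTEM}" = "isilon" ]; then
    # Clients reach Isilon through fs.defaultFS, so no path prefix is needed.
    unset FILESYSTEM_PREFIX
  fi
  export TARGET_FILESYSTEM DEFAULT_FS
}
```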
Change-Id: Id66bfb160fe57f66a64a089b465b536c6c514b63
Reviewed-on: http://gerrit.cloudera.org:8080/449
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
The database will be used for testing in the future.
Change-Id: I60b54b36db9493a5bea308151b4027cd47d73047
Reviewed-on: http://gerrit.cloudera.org:8080/400
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch adds a basic implementation for a Kudu table, scan node and
supports simple DDL operations. Similar to "normal" HDFS tables, the
DDL statements executed in the Hive metastore can be propagated to
Kudu. Otherwise, the Kudu table behaves similarly to an HBase table.
The syntax to create a table stored in Kudu is:
create table kudu (id int, name string, age int)
tblproperties (
'storage_handler' =
'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'kudu',
'kudu.master_addresses' = '0.0.0.0:7051',
'kudu.key_columns' = 'id,name');
The 'storage_handler' attribute is fixed and used to identify a table
backed by Kudu. The storage handler attribute is required, to make sure
that Hive will not create a directory on HDFS for this table.
The 'table_name' and 'master_addresses' properties define
the properties of the physical persistence in Kudu. The 'key_columns'
defines a list of columns that should be used as a (composite) key.
A Kudu table can be created either as a managed or un-managed (external)
table. Creating an external table behaves similarly to creating an external
HDFS table: if the table does not exist in Kudu, it is created; if it exists,
it is used and its schema is checked for compatibility. If an external table
is deleted, only the Hive table is deleted. If a managed table is created,
the Kudu table must not already exist, and if the managed table is deleted,
the Kudu table is deleted as well.
TODO: Allow creation of external table without specifying columns.
Change-Id: I794abf6abe30ace4426c53f77676ae1dcb4341ec
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6358
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
As needed, fix up file paths and other misc things to get
more test cases running against S3.
Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad
Reviewed-on: http://gerrit.cloudera.org:8080/167
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
After the hive/hdfs rebase, the indexed lzo file names changed. This patch uses a
wildcard rather than a specific file name to protect against such changes. It's safe
because the test simply expects a partition that does not have index files.
Change-Id: I6d32609b62df83fe2a8ef935d7ca6506ecff5e0d
Reviewed-on: http://gerrit.cloudera.org:8080/150
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
First change for IMPALA-1209 to address Impala limitations when
using HDFS encryption. This adds a KMS process to the testdata
cluster. This was tested manually by creating a key and an
encryption zone.
Change-Id: I499154506386f04e71c5371b128c10868b1e1318
Reviewed-on: http://gerrit.cloudera.org:8080/41
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This patch enables loading data to s3 instead of hdfs. It is preliminary in nature,
as such, there are a few caveats:
- The fe tests do not work.
- Only loading from a test-warehouse snapshot and metastore snapshot is enabled.
- Until hive works with s3, only a subset of all the tests will work.
Change-Id: Ia66a5f836b4245e3b022a49de805eec337a51324
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5851
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Previously, when we started all the services, we created an HBase table from hive to avoid
a replication bug. This had the side-effect of creating a test-warehouse directory in
hdfs. After that check was removed, we no longer create the test-warehouse, causing the
full-data-load build to fail.
Change-Id: I75479562d33e08c79ad155c615cecb5b91c0eab6
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5904
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This commit adds the ability to only load the metastore snapshot, with the assumption that
the hdfs data is already loaded. It also adds the ability to specify some
buildall parameters via the environment.
Change-Id: I4a07d4cf3a63479c377d4be79c4a2140c2a52fb8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5665
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This patch contains the following changes:
- Add a metastore_snapshot_file parameter to build.sh
- Enable skipping loading the metadata.
- create-load-data.sh is refactored into functions.
- A lot of scripts source impala-config, which creates a lot of log spew. This has now
been muted.
- Unnecessary log spew from compute-table-stats has been muted.
- build_thirdparty.sh determines its parallelism from the system; it was previously
hard-coded to 4.
- Only force load data of the particular dataset if a schema change is detected.
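
As an illustration of the parallelism change, the job count can be derived
from the machine. The function name detect_parallelism and the fallback
chain are assumptions, not the code in build_thirdparty.sh.

```shell
# Sketch: derive build parallelism from the machine instead of hard-coding 4.
# Falls back to getconf, then to 4, if nproc is unavailable.
detect_parallelism() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4
}

# Example use (not executed here):
#   make -j"$(detect_parallelism)"
```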
Change-Id: I909336451e5c1ca57d21f040eb94c0e831546837
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5540
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Adds fixes and tests for Hive CHAR & VARCHAR compatibility.
Also fixes a bug in tuple materialization for VARCHAR and non in-lined CHAR.
Change-Id: I400b089cb8ddba2e264ef9f2e37956b2ceaaf9fb
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4054
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
This is the first iteration of a kerberized development environment.
All the daemons start and use kerberos, with the sole exception of the
hive metastore. This is sufficient to test impala authentication.
When buildall.sh is run using '-kerberize', it will stop before
loading data or attempting to run tests.
Loading data into the cluster is known to not work at this time, the
root causes being that Beeline -> HiveServer2 -> MapReduce throws
errors, and Beeline -> HiveServer2 -> HBase has problems. These are
left for later work.
However, the impala daemons will happily authenticate using kerberos
both from clients (like the impala shell) and amongst each other.
This means that if you can get data into the mini-cluster, you could
query it.
Usage:
* Supply a '-kerberize' option to buildall.sh, or
* Supply a '-kerberize' option to create-test-configuration.sh, then
'run-all.sh -format', re-source impala-config.sh, and then start
impala daemons as usual. You must reformat the cluster because
kerberizing it will change the ownership of all files in HDFS.
Notable changes:
* Added clean start/stop script for the llama-minikdc
* Creation of Kerberized HDFS - namenode and datanodes
* Kerberized HBase (and Zookeeper)
* Kerberized Hive (minus the MetaStore)
* Kerberized Impala
* Loading of data very nearly working
Still to go:
* Kerberize the MetaStore
* Get data loading working
* Run all tests
* The unknown unknowns
* Extensive testing
Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019
Reviewed-by: Michael Yoder <myoder@cloudera.com>
Tested-by: jenkins
Changes include:
* Fix compile errors due to new column stats API and other stats related
fixes.
* Temporarily disable JDBC tests due to new serialization format in Hive .13
* Disable view compatibility tests until we can get them to work in Hive .13
* Test fixes due to Hive's type checking for partition column values
Change-Id: I05cc6a95976e0e037be79d91bc330a06d2fdc46c
Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
Run testdata/bin/create-load-data.sh after pulling this patch.
Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
This patch checks the test-warehouse's stored githash (if it exists) to determine if the
current patch has changed the schema of a table. If a change is detected, we force load
all the data.
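
One way to express such a check, as a sketch only: the function name
needs_force_load, its arguments, and the use of "git diff" against a schema
path are assumptions about how the comparison could be done.

```shell
# Usage: needs_force_load <stored_githash> <repo_dir> <schema_path>
# Returns 0 when a data load should be forced, 1 otherwise.
needs_force_load() {
  local stored="$1" repo="$2" schema_path="$3"
  # No stored hash: nothing to compare against, so do not force a load here.
  [ -n "${stored}" ] || return 1
  # git diff --quiet exits 1 when the path differs between the two commits.
  if git -C "${repo}" diff --quiet "${stored}" HEAD -- "${schema_path}"; then
    return 1   # no schema change
  fi
  return 0     # schema changed (or the stored hash is unknown): force load
}
```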
Change-Id: I314f9f3364d3e6b2d66de38a9e6d9f57c4e279a7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3049
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED
When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored within the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.
When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.
It is desirable to know which cache pools exist early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).
Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from accessing the table during this period we will wait for the cache
requests to complete in the background and once they have finished the table metadata
will be automatically refreshed.
Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
When updating partition metadata as part of COMPUTE STATS we would previously
attempt to update all partitions at once. This could lead to HMS socket timeouts
and also could run into issues if there were > 32K partitions.
In this change we now update the partitions in batches, with a max size of 500
partitions per batch. We also compare whether the row count has changed and only
update partitions that have been modified.
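
The batching pattern can be illustrated as follows. This shows only the
arithmetic of splitting a list into batches of at most 500 and is not the
actual implementation; update_batch is a stand-in for the real metastore
call, and both function names are hypothetical.

```shell
# Stand-in for the call that updates one batch of partitions.
update_batch() {
  echo "updating $# partitions"
}

# Usage: update_in_batches <batch_size> <partition>...
# Calls update_batch repeatedly with at most <batch_size> arguments.
update_in_batches() {
  local batch_size="$1"
  shift
  local n
  while [ "$#" -gt 0 ]; do
    n="$#"
    if [ "${n}" -gt "${batch_size}" ]; then
      n="${batch_size}"
    fi
    update_batch "${@:1:${n}}"
    shift "${n}"
  done
}
```

With 1200 partitions and a batch size of 500, this issues three calls of
500, 500 and 200 partitions.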
Change-Id: If7bfcc30f86fc2fdd79855b981067ac29a47b5e1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1913
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1918
This updates how Impala fetches partition metadata from the Hive Metastore to fetch
partitions in batches, rather than all at once. This helps reduce the load on the
HMS and also lets Impala scale to above 32K partitions. The downside is that it
may require additional RPCs to get all the partitions.
This is done by first querying the metastore to get all the partition names that
exist, then splitting the list of names into separate batches to get the actual
partition metadata.
Impala uses a default size of 1000 partitions per batch, but it can be configured
by setting the 'hive.metastore.batch.retrieve.table.partition.max' parameter
in the hive-site.xml config file.
Change-Id: Ide0ec30ef8a9e00f79c26551aa8e5e7814c73034
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1662
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1698
Impala reserves resources from YARN via Llama and handles resource
preemptions by cancelling affected queries. Adds the Impala Resource
Broker for interacting with Llama. Refactors scheduler and coordinator
to move fragment-to-host assignment logic into scheduler. Local test
setup uses MiniLLama.
Change-Id: Ic7b0fe43de52d30f4207b4e65cce7e6a294e54e1
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().
Also adds test cases for IMPALA-720.
Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb