run-step prints a message to tell the reader what it's doing. However,
that message wasn't flushed so that run-step could print OK or FAILED on
the same line. The result was that long-running steps wouldn't print
anything to the log until they were done, at least in Jenkins contexts.
This patch changes it so that the message is flushed, and then the
result is printed on a separate line (including the time it took to run
the step).
$ run-step "Hello world!" helloworld.out sleep 5
Hello world! (logging to /tmp/helloworld.out)...
OK (Took: 0 min 5 sec)
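For illustration only, the flush-then-report behavior can be sketched in
Python. This is not the run-step code itself (run-step is a Bash helper);
the names timed_step and format_elapsed are hypothetical, and the step is
modeled as a plain callable.

```python
import time


def format_elapsed(seconds):
    """Format a duration like the sample output: '(Took: 0 min 5 sec)'."""
    minutes, secs = divmod(int(seconds), 60)
    return "(Took: {} min {} sec)".format(minutes, secs)


def timed_step(description, log_path, step):
    """Hypothetical stand-in for run-step; `step` returns True on success."""
    # Flush immediately so the message shows up before a long step finishes.
    print("{} (logging to {})...".format(description, log_path), flush=True)
    start = time.monotonic()
    ok = step()
    # The result is printed on its own line, with the time the step took.
    status = "OK" if ok else "FAILED"
    elapsed = format_elapsed(time.monotonic() - start)
    print("{} {}".format(status, elapsed), flush=True)
    return ok
```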
Change-Id: Iaced729f0ef6aa93174cd90b1516d3c34fe41a22
Reviewed-on: http://gerrit.cloudera.org:8080/5116
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
On Ubuntu 14.04 on AWS EC2 m4.4x instances, these components
frequently take more than 30 seconds to start. I have seen the HMS
take more than 90 seconds; this patch sets a more conservative timeout
default.
Change-Id: I43eb8646cca495578c8f9730faa04812957d2917
Reviewed-on: http://gerrit.cloudera.org:8080/5068
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes a sed expression to make sure it only alters the code
it is meant to alter, not the comment describing the code.
Tested with tests/run-tests.py query_test/test_udfs.py
Change-Id: I51a0498d24b7fccc05b6183123501766cb36f85e
Reviewed-on: http://gerrit.cloudera.org:8080/5008
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:
- Managed by Cloudera Manager (CM)
- GPL Extras need to be installed
- KMS and KeyTrustee installed and available as a service
- SERDEPROPERTIES in the Hive DB modified to accept wide tables
- Hive warehouse dir points to /test-warehouse
The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)
It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.
Usage: remote_data_load.py [options] <cm_host address>
Options:
-h, --help show this help message and exit
--snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
--cm-user=CM_USER Cloudera Manager admin user
--cm-pass=CM_PASS Cloudera Manager admin user password
--gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
--ssh-user=SSH_USER System user on the remote machine with
passwordless SSH configured.
--no-load Do not try to load the snapshot
--exploration-strategy=EXPLORATION_STRATEGY
--test Run end-to-end tests against cluster
Testing:
This patch is being submitted with the understanding that there are
still cleanup issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.
However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.
Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
* Change run-step to output full log path
* Change text to say "Computing table stats" rather than "Computing
HBase stats" when running compute-table-stats.sh
Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07
Reviewed-on: http://gerrit.cloudera.org:8080/4825
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This commit modifies the stress test framework to run TPC-H and TPC-DS
workloads against Kudu. The following changes are included in this
commit:
1. Created template files with DDL and DML statements for loading TPC-H and
TPC-DS data in Kudu
2. Created a script (load-tpc-kudu.py) to load data in Kudu. The
script is invoked by the stress test runner to load test data in an
existing Impala/Kudu cluster (both local and CM-managed clusters are
supported).
3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL
files with TPC-H queries for Kudu were added in a previous patch.
4. Modified the stress test runner to take additional parameters
specific to Kudu (e.g. kudu master addr)
The stress test runner for Kudu was tested on EC2 clusters for both TPC-H
and TPC-DS workloads.
Missing functionality:
* No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu.
* Not all supported TPC-DS queries are included. Currently, only the
TPC-DS queries from the testdata/workloads/tpcds/queries directory
were modified to run against Kudu.
Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34
Reviewed-on: http://gerrit.cloudera.org:8080/4327
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.
Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU
Changes:
1) Remove the requirement to specify table properties such as key
columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
time of creation in Impala.
4) Disallow table properties that could conflict with an existing
table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
addresses. The flag is used as the default value for the table
property kudu_master_addresses but it can still be overridden
using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
wasn't implemented for Kudu tables and silently ignored. The Kudu
tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
Kudu); the existence of the other delegate and the use of delegates in
general have led to confusion. The Kudu delegate only exists to provide
functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
standard. When used at the column level, only one column can be
marked as a key. When used at the table level, multiple columns can
be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
The old "kudu.key_columns" table property is no longer accepted
though it is still used internally. "PRIMARY" is now a keyword.
The ident style declaration is used for "KEY" because it is also used
for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
The table property "kudu.table_name" is optional for managed tables
and is required for external tables. If for a managed table a Kudu
table name is not provided, a table name will be generated based
on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
of HMS when a table is loaded or refreshed. Table/column metadata
are cached in the catalog and are stored in HMS in order to be
able to use table and column statistics.
Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Fixes a regression in the data load process that had been introduced
by commit 75a857c. To make check-schema-diff.sh work from anywhere,
we need to specify the git-dir and work-tree arguments everywhere we
call git.
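A minimal sketch of the idea, written in Python rather than the shell
used by check-schema-diff.sh. The helper name git_command is
hypothetical; --git-dir and --work-tree are standard git options.

```python
import os


def git_command(repo_root, *args):
    """Build a git command line that does not depend on the current directory."""
    return [
        "git",
        # Point git at the repository metadata explicitly...
        "--git-dir={}".format(os.path.join(repo_root, ".git")),
        # ...and at the checked-out files, so the caller's cwd is irrelevant.
        "--work-tree={}".format(repo_root),
    ] + list(args)
```

The resulting list can be passed to subprocess.check_output() from any
working directory.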
Change-Id: I32e0dce2c10c443763a038aa3b64b1c123ed62ad
Reviewed-on: http://gerrit.cloudera.org:8080/4726
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*
A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:
find . | grep "\.java\|\.py\|\.h\|\.cc" | \
xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g'
along with some manual fixes.
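For readers less familiar with sed, the same substitution can be
expressed with Python's re module. This is only an equivalent sketch of
the expression above, not a tool that was used for the patch; the
function name rename_package is hypothetical.

```python
import re

# First separator: any single character; second separator: a literal dot.
# This mirrors the two capture groups in the sed expression.
_PACKAGE_RE = re.compile(r"com(.)cloudera(\.)impala")


def rename_package(text):
    """Rewrite com<sep>cloudera.impala to org<sep>apache.impala."""
    return _PACKAGE_RE.sub(r"org\1apache\2impala", text)
```

References to other components, such as com.cloudera.kudu, do not match
the pattern and are left alone.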
After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/
Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
By ASF rules, we can't have JARs in releases. The releases are just
tarballs of the repo.
This patch removes from the repo the single JAR there, which was a
version of a JAR that is built during data load, with one string
changed. The JAR is used only for testing.
Instead of building that jar with the different string and saving the
result in git, data loading will now build the jar twice, with one Java
source file slightly changed.
Change-Id: Icee7b8c32b08e064dea4a14624acff6021ef5ce1
Reviewed-on: http://gerrit.cloudera.org:8080/4499
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
We used to include a step in run-hbase.sh for calling a python
script that queried Zookeeper to see if the HBase master was up.
The original script was problematic, so we stopped using it during
our mini-cluster HBase start up procedure.
HBase start up issues continue to plague us, however. This patch
reintroduces a Zookeeper check, with the following updates:
- replace the original script with check-hbase-nodes.py
- query the correct node /hbase/master, not just /hbase/rs
- use the python Zookeeper library kazoo, rather than calling
out to the shell and parsing the return string
- since we are moving toward testing on a remote cluster, also
add the capability to pass in the address for the host that
provides the Zookeeper and HBase services
- add an additional check that the HDFS service is running,
because of an edge case where the HBase master can briefly
start without a cluster running.
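The polling part of such a check could look roughly like the sketch
below. It is not the contents of check-hbase-nodes.py: the function
name, the timeout, and the poll interval are assumptions. The client is
duck-typed; with kazoo it would be a started KazooClient, whose exists()
returns None when the node is absent.

```python
import time


def wait_for_znode(zk, path="/hbase/master", timeout_sec=30, poll_sec=1.0):
    """Poll until `path` exists in ZooKeeper; return False on timeout."""
    deadline = time.time() + timeout_sec
    while True:
        # exists() is assumed to return None for a missing node.
        if zk.exists(path) is not None:
            return True
        if time.time() >= deadline:
            return False
        time.sleep(poll_sec)
```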
In addition to the expected tests, this script was also tested
under the conditions of IMPALA-4088, whereby the HBase RegionServer
is running, but the master fails because another listening process
has already taken its TCP port (60010) during startup.
Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3
Reviewed-on: http://gerrit.cloudera.org:8080/4348
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Adds initial support for the functional-query test workload
for Kudu tables.
There are a few issues that make loading the functional
schema difficult on Kudu:
1) Kudu tables must have one or more columns that together
constitute a unique primary key.
a) Primary key columns must currently be the first columns
in the table definition (KUDU-1271).
b) Primary key columns cannot be nullable (KUDU-1570).
2) Kudu tables must be specified with distribution
parameters.
(1) limits the tables that can be loaded without ugly
workarounds. This patch only includes important tables that
are used for relevant tests, most notably the alltypes*
family. In particular, alltypesagg is important but it does
not have a set of columns that are non-nullable and form a unique
primary key. As a result, that table is created in Kudu with
a different name and an additional BIGINT column for a PK
that is a unique index and is generated at data loading time
using the ROW_NUMBER analytic function. A view is then
wrapped around the underlying table that matches the
alltypesagg schema exactly. When KUDU-1570 is resolved, this
can be simplified.
(2) requires some additional considerations and custom
syntax. As a result, the DDL to create the tables is
explicitly specified in CREATE_KUDU sections in the
functional_schema_constraints.csv, and an additional
DEPENDENT_LOAD_KUDU section was added to specify custom data
loading DML that differs from the existing DEPENDENT_LOAD.
TODO: IMPALA-4005: generate_schema_statements.py needs refactoring
Tests that are not relevant or not yet supported have been
marked with xfail and a skip where appropriate.
TODO: Support remaining functional tables/tests when possible.
Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc
Reviewed-on: http://gerrit.cloudera.org:8080/4175
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This commit fixes two issues in toSql() of DistributeParam:
1. string literals were not quoted
2. range partition split rows were not printed.
Besides, this commit fixes a small issue in run-hive-server.sh
Change-Id: I984a63a24f02670347b0e1efceb864d265d1f931
Reviewed-on: http://gerrit.cloudera.org:8080/4195
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
As of Kudu 0.9, DISTRIBUTE BY is now required when creating
a new Kudu table. Create table analysis, data loading, and
tests are updated to reflect this.
This also bumps the Kudu version to 0.10.0.
Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e
Reviewed-on: http://gerrit.cloudera.org:8080/3987
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
The Bug: Prior to this patch, a DCHECK was used to verify that the
underlying memory pool for the scratch batch was empty in a count based
scenario. For IMPALA-3964 (where a count(*) is performed on a nested
collection), if a Parquet column chunk is compressed, upon reading each
new data page it would be decompressed and eventually placed in to the
underlying scratch batch memory pool causing the aforementioned DCHECK
to fail. This was not picked up in the test suite as the TPCH nested
Parquet data is not compressed.
The Fix: Removed the erroneous DCHECK. Added logic to determine if any
memory in the scratch batch needs to be freed (due to the transfer that
occurs from the decompressed data pool), if so, it will be done.
Augmented the load_nested.py script to snappy compress each of the
tables within the 'tpch_nested_parquet' database. This is consistent with
how the flat TPCH Parquet data set is stored. Regarding test coverage,
there are already a number of tests that will perform nested collection
counts against the tables in the 'tpch_nested_parquet' database. For
uncompressed nested Parquet, the 'test_nested_types.py' test suite
leverages the 'ComplexTypesTbl' table to provide good coverage.
Change-Id: Id0955c85d18dfba4bd29a35ec95d0355da050607
Reviewed-on: http://gerrit.cloudera.org:8080/3940
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
With this commit we enable loading of TPC-H data in Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries to use the round()
function and update the test results.
Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Also fix a stale comment in the avro scanner header.
The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.
The basic problem is that the way test file sections were parsed
conflated an empty result section with a non-empty result section that
had a single empty string. I.e.:
---- RESULTS
====
vs
---- RESULTS

====
both got resolved to [''].
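The conflation can be reproduced with a small Python sketch. Neither
function is the actual test file parser; naive_rows only shows one way
the two cases can collapse, and parse_results_section shows a
line-based alternative that keeps them apart.

```python
def naive_rows(body_lines):
    """Join and re-split the section body; loses the empty/[''] distinction."""
    return "\n".join(body_lines).split("\n")


def parse_results_section(lines):
    """Return the rows between '---- RESULTS' and the '====' terminator."""
    rows = []
    in_results = False
    for line in lines:
        if line.startswith("---- RESULTS"):
            in_results = True
        elif line.startswith("===="):
            break
        elif in_results:
            # An empty line inside the section is a real row: ''.
            rows.append(line)
    return rows
```

With the line-based version, a section with no body yields an empty
list, while a section containing one empty line yields [''].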
Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in the decompressor and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.
Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.
In addition, the byte size of a tuple is still assumed to be
within 32-bit. If it needs to be upgraded to 64-bit, it will be
done in a separate change.
A new test has been added to test the decompression of a 2GB
snappy block compressed text file.
Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f
Reviewed-on: http://gerrit.cloudera.org:8080/3575
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Both `find -executable` and the Bash "&>>" operator are too new to be
supported on RHEL5. Both have reasonable workarounds, so prefer them.
Note that this may not be the exhaustive list of such "modern"
conventions, but RHEL5 isn't working end-to-end, so we can't identify
all of them in a single commit yet.
Testing:
Before, the RHEL5 build would fail quite early here. Now, data load
succeeds and most of the backend tests successfully run.
Change-Id: I7438bed908d8026327923607238808122212d2d8
Reviewed-on: http://gerrit.cloudera.org:8080/3531
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Internal Jenkins
When a LOCATION that does not have the scheme specified is used,
the default FS is used as the filesystem scheme.
The default FS is set as 'file:/tmp' for localFS runs, however the
Hadoop library seems to ignore the '/tmp' part of the defaultFS for
locations without schemes and just uses 'file:'.
So the test warehouse is in: 'file:/tmp/test-warehouse'
However, the scripts access '/test-warehouse' without the scheme which
hadoop translates to: 'file:/test-warehouse' which does not exist.
This change disables metadata loading on the local filesystem if a
schema change is detected, just as is already done for S3 and Isilon.
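The path resolution described above can be modeled with a few lines of
Python. This is a simplified model of the observed behavior, not
Hadoop's actual resolution code; the function name resolve_location is
hypothetical.

```python
from urllib.parse import urlparse


def resolve_location(location, default_fs):
    """Qualify a scheme-less location with only the scheme of the default FS."""
    if urlparse(location).scheme:
        # Fully qualified locations are used as given.
        return location
    # Only the scheme is taken from the default FS; its path part is dropped.
    scheme = urlparse(default_fs).scheme
    return "{}:{}".format(scheme, location)
```

Under this model, '/test-warehouse' with a default FS of 'file:/tmp'
resolves to 'file:/test-warehouse' rather than
'file:/tmp/test-warehouse'.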
Change-Id: Ie404079aeb2f837ac8b03244b2019e2c8ee9f221
Reviewed-on: http://gerrit.cloudera.org:8080/3384
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Sailesh Mukil <sailesh@cloudera.com>
Added checks/error handling:
* Negative string lengths while decoding dictionary or data page.
* Buffer overruns while decoding dictionary or data page.
* Some metadata FILECHECKs were converted to statuses.
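To make the first two checks concrete, here is a small Python sketch of
decoding plain-encoded Parquet strings (a 4-byte little-endian length
followed by that many bytes). It is not the scanner code from this
patch, which is C++; CorruptDataError and decode_plain_strings are
hypothetical names.

```python
import struct


class CorruptDataError(Exception):
    """Raised instead of reading past the buffer or trusting a bad length."""


def decode_plain_strings(buf, num_values):
    """Decode `num_values` length-prefixed strings from `buf`."""
    values = []
    offset = 0
    for _ in range(num_values):
        if offset + 4 > len(buf):
            raise CorruptDataError("buffer overrun reading length")
        (length,) = struct.unpack_from("<i", buf, offset)
        offset += 4
        if length < 0:
            raise CorruptDataError("negative string length %d" % length)
        if offset + length > len(buf):
            raise CorruptDataError("buffer overrun reading string data")
        values.append(bytes(buf[offset:offset + length]))
        offset += length
    return values
```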
Testing:
Unit tests for:
* decoding of strings with negative lengths
* truncation of all parquet types
* dictionary creation correctly handling error returns from Decode().
End-to-end tests for handling of negative string lengths in
dictionary- and plain-encoded data in corrupt files, and for
handling of buffer overruns for string data. The corrupted
parquet files were generated by hacking Impala's parquet
writer to write invalid lengths, and by hacking it to
write plain-encoded data instead of dictionary-encoded
data by default.
Performance:
set num_nodes=1;
set num_scanner_threads=1;
select * from biglineitem where l_orderkey = -1;
I inspected MaterializeTupleTime. Before the average was 8.24s and after
was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%).
Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece
Reviewed-on: http://gerrit.cloudera.org:8080/3387
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Maven has been downloading the postgres JDBC driver
all along. So, let's use the one in fe/target/dependency
instead of the one in thirdparty.
Change-Id: I76bce18fd308890e66615c8d08d5e58f02a8a132
Reviewed-on: http://gerrit.cloudera.org:8080/3232
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change whitelists the supported filesystems which can be set
as Default FS for Impala to run on.
This patch configures Impala to use S3 as the default filesystem, rather
than a secondary filesystem as before.
Change-Id: I2f45bef6c94ece634045acb906d12591587ccfed
Reviewed-on: http://gerrit.cloudera.org:8080/1121
Reviewed-by: anujphadke <aphadke@cloudera.com>
Tested-by: Internal Jenkins
If a stale snapshot is detected, the full data load proceeds even
if the option to skip data load was set. A check is added to fail
immediately if this happens for isilon or s3 because the full data
load will not work on these filesystems currently.
Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52
Reviewed-on: http://gerrit.cloudera.org:8080/2811
Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com>
Tested-by: Internal Jenkins
Before this fix our "Waiting for something to happen" print
output would be buffered and dumped all at once when the
event we were waiting for succeeded or we hit a timeout.
After this fix the output of "print" is displayed on
the console immediately, as was originally intended.
Change-Id: Icf341e81d0d459504918ae7c9e88918fe5e16c59
Reviewed-on: http://gerrit.cloudera.org:8080/2810
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
I. Start HBase per directions
1. https://hbase.apache.org/book.html#_configuration_files mentions a
'regionservers' file that points to a list of hosts on which to start
HBase RegionServers. When HBase starts in our mini-cluster there are
messages printed like this:
cat: /home/mikeb/Impala/fe/src/test/resources/regionservers: No such file or directory
The presence of this file now starts a single RegionServer and takes the
place of RegionServer 1 in the "additional region servers" startup, a
separate call.
2. The additional RegionServers are started but now we only start 2 from
index 2. See https://hbase.apache.org/book.html#quickstart_pseudo
There are still 3 total RegionServers using the same ports as before. We
are simply configuring our settings as directed in the documentation.
There were mentions in testdata/bin/run-hbase.sh of a "hbase race". One
possible such bug is https://issues.apache.org/jira/browse/HBASE-5780
which has been fixed for a while. I've removed the check to wait for
that Master, though I have not removed the Python script that does the
waiting. We could remove that later after we let this patch bake.
Also, https://issues.apache.org/jira/browse/HBASE-4467 has been marked
"not a problem", so I've removed references to that.
II. Implement HBase start retry
If starting either HBase Master or additional RegionServers fails, kill
all of HBase and try again. Do this for some number of attempts.
In order to keep errexit ("set -e") happy, we expect the possibility of
some of the startup attempts failing. We use control flow in those
cases. In the last case, errexit can fail on our behalf.
There is some code duplication here, but because Bash can't give us a
stack trace on failure, and only a line number, I chose not to use
functions to handle reuse. We don't really have functions anywhere else
at the moment, either.
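The retry logic has roughly the following shape, shown here in Python
for brevity. The real implementation is inline Bash in run-hbase.sh;
the function names and the default of 3 attempts are assumptions.

```python
def start_with_retries(start, kill_all, max_attempts=3):
    """Call start() until it succeeds; return the attempt number that worked."""
    for attempt in range(1, max_attempts + 1):
        try:
            start()
            return attempt
        except Exception:
            if attempt == max_attempts:
                # Last attempt: let the failure propagate, much like
                # errexit failing on our behalf.
                raise
            # Otherwise tear everything down and try again from scratch.
            kill_all()
```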
Testing:
It's pretty difficult to try to trigger a real "HBase fails to start"
situation. I tested my changes by faking HBase failures, both when
starting up the Master and first RegionServer, and also starting
subsequent RegionServers.
Multiple private builds have passed.
Change-Id: Ib1d055a8a9098ce24e2f31b969501b6e090eab19
Reviewed-on: http://gerrit.cloudera.org:8080/2804
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
Previously Kudu would only be started when the test configuration was
the standard mini-cluster. That led to failures during data loading when
testing without the mini-cluster (ex: local file system). Kudu doesn't
require any other services so now it'll be started for all test
environments.
Change-Id: I92643ca6ef1acdbf4d4cd2fa5faf9ac97a3f0865
Reviewed-on: http://gerrit.cloudera.org:8080/2690
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
This failure happens on filesystems other than HDFS because as a
part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the
new directories that the patch tries to create in create-load-data.
Change-Id: I8de74db93893c5273ccc9c687f608959628f5004
Reviewed-on: http://gerrit.cloudera.org:8080/2644
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
The 20 lines we dump currently are often not enough to
diagnose a failure quickly. Increasing to 50 lines.
Printing 50 lines is also consistent with our run-step
script which also prints 50 lines.
Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73
Reviewed-on: http://gerrit.cloudera.org:8080/2632
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
All logs, test results and SQL files generated during data
loading and testing are now consolidated under a single new
directory $IMPALA_HOME/logs. The goal is to simplify archiving
in Jenkins runs and debugging.
The new structure is as follows:
$IMPALA_HOME/logs/cluster
- logs of Hadoop components and Impala
$IMPALA_HOME/logs/data_loading
- logs and SQL files produced in data loading
$IMPALA_HOME/logs/fe_tests
- logs and test output of Frontend unit tests
$IMPALA_HOME/logs/be_tests
- logs and test output of Backend unit tests
$IMPALA_HOME/logs/ee_tests
- logs and test output of end-to-end tests
$IMPALA_HOME/logs/custom_cluster_tests
- logs and test output of custom cluster tests
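As an illustration, the layout above could be created with a helper
like the following. This is a sketch, not code from the patch; the
names LOG_SUBDIRS and create_log_dirs are hypothetical.

```python
import os

# Subdirectories of $IMPALA_HOME/logs, as listed above.
LOG_SUBDIRS = (
    "cluster",
    "data_loading",
    "fe_tests",
    "be_tests",
    "ee_tests",
    "custom_cluster_tests",
)


def create_log_dirs(impala_home):
    """Create the log directories and return a name -> path mapping."""
    logs_root = os.path.join(impala_home, "logs")
    paths = {}
    for name in LOG_SUBDIRS:
        path = os.path.join(logs_root, name)
        os.makedirs(path, exist_ok=True)
        paths[name] = path
    return paths
```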
I tested this change with a full data load which
was successful.
Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa
Reviewed-on: http://gerrit.cloudera.org:8080/2456
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
These tests check that the following types of files can be scanned
properly:
1) Add a parquet file with multiple blocks such that each node has to
scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group
that spans the entire file. Only one scan range should do any work
in this case.
Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This is for review purposes only. This patch will be merged with David's
big merge patch.
Changes:
1) Make Kudu compilation dependent on the OS since not all OSs support
Kudu.
2) Only run Kudu related tests when Kudu is supported (see #1).
3) Look for Kudu locally, but in a different location. To use a local
build of Kudu, set KUDU_BUILD_DIR to the path Kudu was built in and
set KUDU_CLIENT_DIR to the path Kudu was installed in.
Example:
git clone https://github.com/cloudera/kudu.git
...build 3rd party etc...
mkdir -p $KUDU_BUILD_DIR
cd $KUDU_BUILD_DIR
cmake <path to Kudu source dir>
make
DESTDIR=$KUDU_CLIENT_DIR make install
4) Look for Kudu in the toolchain if not using a local Kudu build.
5) Add Kudu service startup scripts. The Kudu in the toolchain is
actually a parcel that has been renamed (the contents were not
modified in any way), which means the Kudu service binaries are there.
Those binaries are now used to run the Kudu service.
Change-Id: I3db88cbd27f2ea2394f011bc8d1face37411ed58
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183
This patch is not a pure merge patch in the sense that it goes beyond conflict
resolution to also address reviews to the 'feature/kudu' branch as a whole.
The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/
Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
Previously, we tried to dynamically name the metastore db. With the introduction of
metastore snapshots, this is no longer necessary and may cause naming ambiguity if the
Impala repository has a non-standard directory structure.
This patch uses a constant name - impala_hive - defined as an environment variable in
impala-config.
Change-Id: Iadc59db8c538113171c9c2b8cea3ef3f6b3bd4fc
Reviewed-on: http://gerrit.cloudera.org:8080/517
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch is required for updating thirdparty.
Sentry does not ship with the Postgres JDBC driver anymore,
so we need to point it to ours in thirdparty. Sentry picks
up JARs from the HADOOP_CLASSPATH and not the CLASSPATH,
so this patch adds the JDBC driver there in run-sentry-service.sh.
Change-Id: Iee950dfcd2839b4ca0fc827a45da2a9386c4404d
Reviewed-on: http://gerrit.cloudera.org:8080/1991
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Use psql -q to suppress verbose output during metastore creation.
Also use -q instead of redirection everywhere for consistency.
Change-Id: I539da86a50d18546474b2cfdc848f992745a7875
Reviewed-on: http://gerrit.cloudera.org:8080/1884
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
In commit 960808 I forgot to update the data-loading script for the
conversion of a shell script to a python script. It turns out there were
a couple of other little problems too. I checked manually that the data
was loaded after these changes.
Change-Id: Id81fc423348515ab446835868025cb839c77f52c
Reviewed-on: http://gerrit.cloudera.org:8080/1851
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
The major changes are:
1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
remote or local cluster. This also moves and consolidates some
Cloudera Manager utilities that were in the stress test.
7) Clean up the wrappers around impyla. That stuff was getting
messy.
Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Log output of data loading steps to files and only print to stdout
if there is an actual failure. The output of some steps is very noisy,
and some steps even have output that looks like errors.
This is implemented with a run-step helper function in bash that handles
redirection and logging. Any bash command can be prefixed with run-step
<step description> <log file name> to redirect the output to a log file.
Sample output is:
Starting Impala cluster (logging to start-impala-cluster.log)... OK
Setting up HDFS environment (logging to setup-hdfs-env.log)... OK
Skipped loading the metadata.
Loading HBase data only (logging to load-hbase-only.log)... OK
Loading Hive UDFs (logging to build-and-copy-hive-udfs.log)... OK
Running custom post-load steps (logging to custom-post-load-steps.log)... OK
Caching test tables (logging to cache-test-tables.log)... OK
Loading external data sources (logging to load-ext-data-source.log)... OK
Splitting HBase (logging to create-hbase.log)... OK
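The redirect-and-report pattern can be sketched in Python as below. The
actual helper is a Bash function, so this is only an illustration of
the behavior; the names run_step and tail are hypothetical, and the
50-line tail on failure is an assumption about how much of the log is
shown.

```python
import subprocess
import sys


def tail(path, num_lines=50):
    """Return the last `num_lines` lines of a log file."""
    with open(path, errors="replace") as f:
        return f.readlines()[-num_lines:]


def run_step(description, log_path, command):
    """Run `command` with all output redirected to `log_path`."""
    sys.stdout.write("{} (logging to {})... ".format(description, log_path))
    sys.stdout.flush()
    with open(log_path, "w") as log:
        status = subprocess.call(command, stdout=log, stderr=subprocess.STDOUT)
    if status == 0:
        print("OK")
        return True
    # Only on failure does any of the step's output reach stdout.
    print("FAILED")
    sys.stdout.writelines(tail(log_path))
    return False
```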
Change-Id: I6396540858c408b084039a87efc81e1004626f39
Reviewed-on: http://gerrit.cloudera.org:8080/1760
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
This adds a new 'latest' symlink in be/build that links to the latest
build configuration. This makes our script behave better as we don't
need to hard-code specific build types but can rather depend on sensible
defaults.
This patch addresses this issue in the cluster startup and a script that
is executed in the context of data loading. There might be more places
but so far my search did not yield any additional places where we rely
on a hardcoded path.
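One way to maintain such a symlink is sketched below in Python. The
build system may well do this differently; the function name
update_latest_symlink and the temporary link name are assumptions.

```python
import os


def update_latest_symlink(build_root, build_type):
    """Point <build_root>/latest at the <build_type> directory."""
    link = os.path.join(build_root, "latest")
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    # A relative target keeps the link valid if the tree is moved.
    os.symlink(build_type, tmp_link)
    # Renaming over the old link swaps it in a single step.
    os.replace(tmp_link, link)
    return link
```

Scripts can then refer to be/build/latest instead of a specific build
type directory.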
Change-Id: Ic814a1bef1d3088b2f8c1c34f25e2112b74315f8
Reviewed-on: http://gerrit.cloudera.org:8080/1797
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Use mvn-quiet.sh in a couple of places it was missed.
Fix mvn warnings.
Provide -q flag to git clean to prevent it reporting all of the files it
deletes.
Change-Id: I77ec2265bf35f64ab1ac76b0a253e67c5f97eccd
Reviewed-on: http://gerrit.cloudera.org:8080/1804
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Maven's INFO log level is very verbose and includes a lot of progress
information that is minimally useful.
Maven doesn't have an option to output only ERROR and WARNING log
messages. As a workaround, use grep to filter out the majority of the
output, keeping only warnings, errors, test results, and success/failure.
Also add a header with relevant info about the maven command:
targets and working directory.
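The filtering idea can be illustrated in Python. The pattern below is
an assumption about which Maven lines are worth keeping; it is not the
grep expression used by the patch, and filter_maven_output is a
hypothetical name.

```python
import re

# Keep warnings, errors, test summaries, and the final build status.
_KEEP = re.compile(r"^\[(WARNING|ERROR)\]|Tests run:|BUILD (SUCCESS|FAILURE)")


def filter_maven_output(lines):
    """Drop INFO-level progress noise from Maven output."""
    return [line for line in lines if _KEEP.search(line)]
```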
Change-Id: I828b870edc2fc80a6460e6ed594d507c46e69c82
Reviewed-on: http://gerrit.cloudera.org:8080/1752
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch allows passing additional cluster startup flags.
This is needed when building with optimizations in release
mode as the default cluster startup would only pick up a
debug build.
Change-Id: Ib98d6814558f2d82bdeac0e3cce1fb7db048c459
Reviewed-on: http://gerrit.cloudera.org:8080/1775
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
The original error reporting relied on $0 being accessible from the
current working dir, which failed if a script changed the working dir
and $0 was relative. This updates the error reporting command to cd back
to the original dir before accessing $0.
Change-Id: I2185af66e35e29b41dbe1bb08de24200bacea8a1
Reviewed-on: http://gerrit.cloudera.org:8080/1666
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins