Commit Graph


Henry Robinson
2648bfbd90 Improve message output from run-step.sh
run-step prints a message to tell the reader what it's doing. However,
that message was left unflushed so that run-step could print OK or
FAILED on the same line. The result was that long-running steps wouldn't
print anything to the log until they were done, at least in Jenkins
contexts.

This patch changes it so that the message is flushed, and then the
result is printed on a separate line (including the time it took to run
the step).

  $ run-step "Hello world!" helloworld.out sleep 5
  Hello world! (logging to /tmp/helloworld.out)...
      OK (Took: 0 min 5 sec)
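
A minimal sketch of the new pattern, with illustrative names and paths
(not the actual run-step.sh):

  run-step() {
    local msg=$1 log=$2
    shift 2
    echo "$msg (logging to $log)..."  # full line, so it reaches the log right away
    local start=$SECONDS
    if "$@" > "$log" 2>&1; then
      local elapsed=$((SECONDS - start))
      echo "    OK (Took: $((elapsed / 60)) min $((elapsed % 60)) sec)"
    else
      echo "    FAILED (see $log)"
      return 1
    fi
  }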

Change-Id: Iaced729f0ef6aa93174cd90b1516d3c34fe41a22
Reviewed-on: http://gerrit.cloudera.org:8080/5116
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-17 09:35:14 +00:00
Jim Apple
4b774880c9 Increase wait times for startup of Hive and its Metastore
On Ubuntu 14.04 on AWS EC2 m4.4x instances, these components
frequently take more than 30 seconds to start. I have seen the HMS
take more than 90 seconds; this patch sets a more conservative timeout
default.
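
A sketch of the kind of wait loop involved; the timeout handling and
port (9083 is the standard metastore port) are illustrative, not the
actual script:

  wait_for_metastore() {
    local timeout_secs=${1:-120}
    for ((i = 0; i < timeout_secs; i++)); do
      if nc -z localhost 9083; then
        return 0
      fi
      sleep 1
    done
    echo "Hive Metastore did not start within ${timeout_secs}s" >&2
    return 1
  }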

Change-Id: I43eb8646cca495578c8f9730faa04812957d2917
Reviewed-on: http://gerrit.cloudera.org:8080/5068
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-11-15 20:35:01 +00:00
Jim Apple
6775893894 IMPALA-4447: Rein in overly broad sed that dirties the tree
This patch fixes a sed expression to make sure it only alters the code
it is meant to alter, not the comment describing the code.
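
The general shape of such a fix, with hypothetical file and pattern
names (not the actual expression from this patch):

  # Too broad: rewrites every line, including the "--" comment describing it.
  sed -i 's|__UDF_LIB__|/test-warehouse/udfs|g' udf-test.sql
  # Narrower: skip comment lines so only the code itself is altered.
  sed -i '/^--/! s|__UDF_LIB__|/test-warehouse/udfs|g' udf-test.sql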

Tested with tests/run-tests.py query_test/test_udfs.py

Change-Id: I51a0498d24b7fccc05b6183123501766cb36f85e
Reviewed-on: http://gerrit.cloudera.org:8080/5008
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-09 02:44:36 +00:00
Martin Grund
ce4c5f6743 IMPALA-4365: Enabling end-to-end tests on a remote cluster
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:

  - Managed by Cloudera Manager (CM)
  - GPL Extras need to be installed
  - KMS and KeyTrustee installed and available as a service
  - SERDEPROPERTIES in the Hive DB modified to accept wide tables
  - Hive warehouse dir points to /test-warehouse

The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)

It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.

Usage: remote_data_load.py [options] <cm_host address>

Options:
  -h, --help            show this help message and exit
  --snapshot-file=SNAPSHOT_FILE
                        Path to the test-warehouse archive
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not
                        set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
                        passwordless SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster
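
A hypothetical invocation, with placeholder host names and snapshot
path:

  ./remote_data_load.py --cm-user=admin --cm-pass=admin \
      --snapshot-file=/tmp/test-warehouse-SNAPSHOT.tar.gz \
      --gateway=gateway.example.com cm-host.example.com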

Testing:

This patch is being submitted with the understanding that there are
still clean up issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.

However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.

Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-11-08 10:16:55 +00:00
Tim Armstrong
0dbfe169b7 IMPALA-4277: remove unneeded LegacyTCLIService
Change-Id: I827590b19dc542f6256ae2e0d541eaa32a76520b
Reviewed-on: http://gerrit.cloudera.org:8080/4844
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-10-26 02:34:01 +00:00
Henry Robinson
e0a3272129 Minor compute stats script fixes
* Change run-step to output full log path
* Change text to say "Computing table stats" rather than "Computing
  HBase stats" when running compute-table-stats.sh

Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07
Reviewed-on: http://gerrit.cloudera.org:8080/4825
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-25 00:13:54 +00:00
Dimitris Tsirogiannis
8a49ceaae5 IMPALA-3739: Enable stress tests on Kudu
This commit modifies the stress test framework to run TPC-H and TPC-DS
workloads against Kudu. The following changes are included in this
commit:
1. Created template files with DDL and DML statements for loading TPC-H and
   TPC-DS data in Kudu
2. Created a script (load-tpc-kudu.py) to load data in Kudu. The
   script is invoked by the stress test runner to load test data in an
   existing Impala/Kudu cluster (both local and CM-managed clusters are
   supported).
3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL
   files with TPC-H queries for Kudu were added in a previous patch.
4. Modified the stress test runner to take additional parameters
   specific to Kudu (e.g. kudu master addr)

The stress test runner for Kudu was tested on EC2 clusters for both TPC-H
and TPC-DS workloads.

Missing functionality:
* No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu.
* Not all supported TPC-DS queries are included. Currently, only the
  TPC-DS queries from the testdata/workloads/tpcds/queries directory
  were modified to run against Kudu.

Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34
Reviewed-on: http://gerrit.cloudera.org:8080/4327
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 11:01:37 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses, but it can still be overridden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and silently ignored. The Kudu
   tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu); the existence of the other delegate and the use of delegates in
   general has led to confusion. The Kudu delegate only existed to provide
   functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.
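
For illustration, a hypothetical external-table counterpart to the
syntax example above (per change (3), the Kudu table must already
exist; the table names here are placeholders):

  impala-shell -q "CREATE EXTERNAL TABLE foo_ext STORED AS KUDU \
      TBLPROPERTIES('kudu.table_name' = 'foo')"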

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00
David Knupp
05b91a973c IMPALA-4294: Make check-schema-diff.sh executable from anywhere.
Fixes a regression in the data load process that had been introduced
by commit 75a857c. To make check-schema-diff.sh work from anywhere,
we need to specify the git-dir and work-tree arguments everywhere we
call git.
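
The pattern, roughly (the revision arguments are illustrative):

  git --git-dir="$IMPALA_HOME/.git" --work-tree="$IMPALA_HOME" \
      diff --stat HEAD~1 HEAD -- testdata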

Change-Id: I32e0dce2c10c443763a038aa3b64b1c123ed62ad
Reviewed-on: http://gerrit.cloudera.org:8080/4726
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-10-15 04:05:04 +00:00
Thomas Tauber-Marshall
b2c2fe7813 IMPALA-3786: Replace "cloudera" with "apache" (part 2)
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*

A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:

find . | grep "\.java\|\.py\|\.h\|\.cc" | \
  xargs sed -i 's/com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g'

along with some manual fixes.

After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
  eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/

Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-09-29 21:14:13 +00:00
Jim Apple
57fcbf7a28 IMPALA-4171: Remove JAR from repo.
By ASF rules, we can't have JARs in releases. The releases are just
tarballs of the repo.

This patch removes the repo's single JAR, which was a
version of a JAR that is built during data load, with one string
changed. The JAR is used only for testing.

Instead of building that jar with the different string and saving the
result in git, data loading will now build the jar twice, with one Java
source file slightly changed.

Change-Id: Icee7b8c32b08e064dea4a14624acff6021ef5ce1
Reviewed-on: http://gerrit.cloudera.org:8080/4499
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-22 02:00:50 +00:00
David Knupp
a42d18dcc3 IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh
We used to include a step in run-hbase.sh for calling a python
script that queried Zookeeper to see if the HBase master was up.
The original script was problematic, so we stopped using it during
our mini-cluster HBase start up procedure.

HBase start up issues continue to plague us, however. This patch
reintroduces a Zookeeper check, with the following updates:

- replace the original script with check-hbase-nodes.py
- query the correct node /hbase/master, not just /hbase/rs
- use the Python Zookeeper library kazoo, rather than calling
  out to the shell and parsing the return string (see the sketch
  after this list)
- since we are moving toward testing on a remote cluster, also
  add the capability to pass in the address for the host that
  provides the Zookeeper and HBase services
- add an additional check that the HDFS service is running,
  because of an edge case where the HBase master can briefly
  start without a cluster running.
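
A minimal sketch of the kind of check the script performs, via kazoo
(the host, port, and timeout here are illustrative):

python - <<'EOF'
from kazoo.client import KazooClient
zk = KazooClient(hosts='localhost:2181')
zk.start(timeout=30)
if zk.exists('/hbase/master') is None:
    raise SystemExit('HBase master znode not found')
zk.stop()
EOF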

In addition to the expected tests, this script was also tested
under the conditions of IMPALA-4088, whereby the HBase RegionServer
is running, but the master fails because another listening process
has already taken its TCP port (60010) during startup.

Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3
Reviewed-on: http://gerrit.cloudera.org:8080/4348
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-15 00:02:22 +00:00
Matthew Jacobs
c7fa03286b IMPALA-3718: Support subset of functional-query for Kudu
Adds initial support for the functional-query test workload
for Kudu tables.

There are a few issues that make loading the functional
schema difficult on Kudu:
 1) Kudu tables must have one or more columns that together
    constitute a unique primary key.
   a) Primary key columns must currently be the first columns
      in the table definition (KUDU-1271).
   b) Primary key columns cannot be nullable (KUDU-1570).
 2) Kudu tables must be specified with distribution
    parameters.

(1) limits the tables that can be loaded without ugly
workarounds. This patch only includes important tables that
are used for relevant tests, most notably the alltypes*
family. In particular, alltypesagg is important but it does
not have a set of columns that are non-nullable and form a unique
primary key. As a result, that table is created in Kudu with
a different name and an additional BIGINT column for a PK
that is a unique index and is generated at data loading time
using the ROW_NUMBER analytic function. A view is then
wrapped around the underlying table that matches the
alltypesagg schema exactly. When KUDU-1570 is resolved, this
can be simplified.
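
An illustrative shape for that data-loading DML; the database, table,
and column names here are hypothetical, not the actual generated
statement:

  impala-shell -q "INSERT INTO functional_kudu.alltypesagg_idx
      SELECT ROW_NUMBER() OVER (ORDER BY id, day) AS pk, a.*
      FROM functional.alltypesagg a"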

(2) requires some additional considerations and custom
syntax. As a result, the DDL to create the tables is
explicitly specified in CREATE_KUDU sections in the
functional_schema_constraints.csv, and an additional
DEPENDENT_LOAD_KUDU section was added to specify custom data
loading DML that differs from the existing DEPENDENT_LOAD.

TODO: IMPALA-4005: generate_schema_statements.py needs refactoring

Tests that are not relevant or not yet supported have been
marked with xfail and a skip where appropriate.

TODO: Support remaining functional tables/tests when possible.

Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc
Reviewed-on: http://gerrit.cloudera.org:8080/4175
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:11:04 +00:00
Jim Apple
bd2947329e IMPALA-4110: Clean up issues found by Apache RAT.
Change-Id: I5bfe77f9a871018e7a67553ed270e2df53006962
Reviewed-on: http://gerrit.cloudera.org:8080/4361
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:09:24 +00:00
Zoltan Ivanfi
b35689d7d9 Minor enhancements to helper scripts.
- run-all-tests.sh: survive non-fatal failures when calling ulimit
  (sketched after this list).
- copy-udfs-udas.sh: respect $MAKE_CMD instead of blindly using make.
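
The ulimit change follows the usual errexit-tolerant shape, roughly:

  ulimit -c unlimited || true  # a failure here should not abort the test run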

Change-Id: Ic90bd0048786c799a8ac435de4303ed399ac1223
Reviewed-on: http://gerrit.cloudera.org:8080/4304
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-09-05 15:17:22 +00:00
Yuanhao Luo
052d3cc8dd IMPALA-4056: Fix toSql() of DistributeParam
This commit fixes two issues in toSql() of DistributeParam:
1. string literals were not quoted
2. range partition split rows were not printed.
In addition, this commit fixes a small issue in run-hive-server.sh.

Change-Id: I984a63a24f02670347b0e1efceb864d265d1f931
Reviewed-on: http://gerrit.cloudera.org:8080/4195
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 20:11:27 +00:00
Matthew Jacobs
d113205cee IMPALA-3650: DISTRIBUTE BY required for managed Kudu tables
As of Kudu 0.9, DISTRIBUTE BY is now required when creating
a new Kudu table. Create table analysis, data loading, and
tests are updated to reflect this.

This also bumps the Kudu version to 0.10.0.

Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e
Reviewed-on: http://gerrit.cloudera.org:8080/3987
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 02:14:39 +00:00
Christopher Channing
90a6b3206e IMPALA-3964: Fix crash when a count(*) is performed on a nested collection.
The Bug: Prior to this patch, a DCHECK was used to verify that the
underlying memory pool for the scratch batch was empty in a count-based
scenario. For IMPALA-3964 (where a count(*) is performed on a nested
collection), if a Parquet column chunk is compressed, upon reading each
new data page it would be decompressed and eventually placed into the
underlying scratch batch memory pool causing the aforementioned DCHECK
to fail. This was not picked up in the test suite as the TPCH nested
Parquet data is not compressed.

The Fix: Removed the erroneous DCHECK. Added logic to determine if any
memory in the scratch batch needs to be freed (due to the transfer that
occurs from the decompressed data pool); if so, it will be done.
Augmented the load_nested.py script to snappy-compress each of the
tables within the 'tpch_nested_parquet' database. This is consistent with
how the flat TPCH Parquet data set is stored. Regarding test coverage,
there are already a number of tests that will perform nested collection
counts against the tables in the 'tpch_nested_parquet' database. For
uncompressed nested Parquet, the 'test_nested_types.py' test suite
leverages the 'ComplexTypesTbl' table to provide good coverage.

Change-Id: Id0955c85d18dfba4bd29a35ec95d0355da050607
Reviewed-on: http://gerrit.cloudera.org:8080/3940
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-16 11:54:04 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Dimitris Tsirogiannis
6fbd35fa87 Enable TPC-H workload for Kudu tables
With this commit we enable loading of TPC-H data in Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries to use the round()
function and update the test results.

Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:35:11 +00:00
Tim Armstrong
bc8c55afcd IMPALA-3729: batch_size=1 coverage for avro scanner
Also fix a stale comment in the avro scanner header.

The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.

The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:

---- RESULTS
====

vs
---- RESULTS

====

both got resolved to [''].

Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-07-19 23:30:02 -07:00
Michael Ho
ed5ec6772f IMPALA-1619: Support 64-bit allocations.
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in the decompressor, and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.

Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.

In addition, the byte size of a tuple is still assumed to be
within 32-bit. If it needs to be upgraded to 64-bit, it will be
done in a separate change.

A new test has been added to test the decompression of a 2GB
snappy block compressed text file.

Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f
Reviewed-on: http://gerrit.cloudera.org:8080/3575
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-08 15:42:09 -07:00
Michael Brown
08e8de73b2 IMPALA-3806: remove a few modern shell idioms to improve RHEL5 support
Both `find -executable` and the Bash "&>>" operator are too new to be
supported on RHEL5. Both have reasonable workarounds, so prefer the
workarounds.
Note that this may not be the exhaustive list of such "modern"
conventions, but RHEL5 isn't working end-to-end, so we can't identify
all of them in a single commit yet.
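
The two substitutions, roughly (the file names are illustrative):

  find . -type f -perm -u+x          # instead of: find . -type f -executable
  ./buildall.sh >> build.log 2>&1    # instead of: ./buildall.sh &>> build.log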

Testing:

Before, the RHEL5 build would fail quite early here. Now, data load
succeeds and most of the backend tests successfully run.

Change-Id: I7438bed908d8026327923607238808122212d2d8
Reviewed-on: http://gerrit.cloudera.org:8080/3531
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Internal Jenkins
2016-07-05 13:37:26 -07:00
Sailesh Mukil
73595d8f40 IMPALA-3737: Local filesystem build failed loading custom schemas
When a LOCATION that does not have the scheme specified is used,
the default FS is used as the filesystem scheme.
The default FS is set to 'file:/tmp' for localFS runs; however, the
Hadoop library seems to ignore the '/tmp' part of the defaultFS for
locations without schemes and just uses 'file:'.
So the test warehouse is in: 'file:/tmp/test-warehouse'
However, the scripts access '/test-warehouse' without the scheme, which
Hadoop translates to 'file:/test-warehouse', a path that does not exist.

This change disables metadata loading on the local filesystem if a
schema change is detected, just as is done for S3 and Isilon.

Change-Id: Ie404079aeb2f837ac8b03244b2019e2c8ee9f221
Reviewed-on: http://gerrit.cloudera.org:8080/3384
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Sailesh Mukil <sailesh@cloudera.com>
2016-06-16 17:34:34 -07:00
Tim Armstrong
547be27e77 IMPALA-3745: parquet invalid data handling
Added checks/error handling:
* Negative string lengths while decoding dictionary or data page.
* Buffer overruns while decoding dictionary or data page.
* Some metadata FILECHECKs were converted to statuses.

Testing:
Unit tests for:
* decoding of strings with negative lengths
* truncation of all parquet types
* dictionary creation correctly handling error returns from Decode().

End-to-end tests for handling of negative string lengths in
dictionary- and plain-encoded data in corrupt files, and for
handling of buffer overruns for string data. The corrupted
parquet files were generated by hacking Impala's parquet
writer to write invalid lengths, and by hacking it to
write plain-encoded data instead of dictionary-encoded
data by default.

Performance:
set num_nodes=1;
set num_scanner_threads=1;
select * from biglineitem where l_orderkey = -1;

I inspected MaterializeTupleTime. Before the average was 8.24s and after
was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%).

Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece
Reviewed-on: http://gerrit.cloudera.org:8080/3387
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-06-15 21:33:39 -07:00
Michael Ho
a485d44fd0 IMPALA-3223: Update path to Postgres JDBC driver
Maven has been downloading the postgres JDBC driver
all along. So, let's use the one in fe/target/dependency
instead of the one in thirdparty.

Change-Id: I76bce18fd308890e66615c8d08d5e58f02a8a132
Reviewed-on: http://gerrit.cloudera.org:8080/3232
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-27 08:42:17 -07:00
Anuj Phadke
a915293109 IMPALA-1850: Allow fs.defaultFS to be set to a non-HDFS filesystem
This change whitelists the supported filesystems that can be set
as the default FS for Impala to run on.
This patch configures Impala to use S3 as the default filesystem, rather
than a secondary filesystem as before.

Change-Id: I2f45bef6c94ece634045acb906d12591587ccfed
Reviewed-on: http://gerrit.cloudera.org:8080/1121
Reviewed-by: anujphadke <aphadke@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:40 -07:00
Harrison Sheinblatt
1058163f70 IMPALA-2276: Isilon and s3 builds must fail with stale snapshot
If a stale snapshot is detected, the full data load proceeds even
if the option to skip data load was set.  A check is added to fail
immediately if this happens for Isilon or S3, because the full data
load will not work on these filesystems currently.

Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52
Reviewed-on: http://gerrit.cloudera.org:8080/2811
Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:37 -07:00
Alex Behm
50314b5b2c Set PYTHONUNBUFFERED in wait-for-* scripts.
Before this fix our "Waiting for something to happen" print
output would be buffered and dumped all at once when the
event we were waiting for succeeded or we hit a timeout.

After this fix the output of "print" is displayed on
the console immediately, as was originally intended.
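
The core of the fix, as applied to the scripts' environment (a
one-line sketch):

  export PYTHONUNBUFFERED=1  # Python writes each print through immediately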

Change-Id: Icf341e81d0d459504918ae7c9e88918fe5e16c59
Reviewed-on: http://gerrit.cloudera.org:8080/2810
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:35 -07:00
Michael Brown
6d8c075d9c IMPALA-1996: Start HBase per directions in documentation; Implement HBase startup retry
I. Start HBase per directions

1. https://hbase.apache.org/book.html#_configuration_files mentions a
'regionservers' file that points to a list of hosts on which to start
HBase RegionServers. When HBase starts in our mini-cluster there are
messages printed like this:

cat: /home/mikeb/Impala/fe/src/test/resources/regionservers: No such file or directory

The presence of this file now starts a single RegionServer and takes the
place of RegionServer 1 in the "additional region servers" startup, a
separate call.

2. The additional RegionServers are started, but now we start only two,
beginning at index 2. See https://hbase.apache.org/book.html#quickstart_pseudo

There are still 3 total RegionServers using the same ports as before. We
are simply configuring our settings as directed in the documentation.

There were mentions in testdata/bin/run-hbase.sh of a "hbase race". One
possible such bug is https://issues.apache.org/jira/browse/HBASE-5780
which has been fixed for a while. I've removed the check to wait for
that Master, though I have not removed the Python script that does the
waiting. We could remove that later after we let this patch bake.

Also, https://issues.apache.org/jira/browse/HBASE-4467 has been marked
"not a problem", so I've removed references to that.

II. Implement HBase start retry

If starting either HBase Master or additional RegionServers fails, kill
all of HBase and try again.  Do this for some number of attempts.

In order to keep errexit ("set -e") happy, we expect the possibility of
some of the startup attempts failing. We use control flow in those
cases. In the last case, errexit can fail on our behalf.
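
A sketch of the retry shape under "set -e"; the function names are
hypothetical, not the actual script:

  started=false
  for attempt in 1 2; do
    if start_hbase_cluster; then  # a failure inside "if" does not trip errexit
      started=true
      break
    fi
    kill_hbase_cluster  # kill all of HBase before retrying
  done
  if [ "$started" = false ]; then
    start_hbase_cluster  # last attempt runs bare: errexit fails on our behalf
  fi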

There is some code duplication here, but because Bash can't give us a
stack trace on failure, and only a line number, I chose not to use
functions to handle reuse. We don't really have functions anywhere else
at the moment, either.

Testing:

It's pretty difficult to try to trigger a real "HBase fails to start"
situation. I tested my changes by faking HBase failures, both when
starting up the Master and first RegionServer, and also starting
subsequent RegionServers.

Multiple private builds have passed.

Change-Id: Ib1d055a8a9098ce24e2f31b969501b6e090eab19
Reviewed-on: http://gerrit.cloudera.org:8080/2804
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:33 -07:00
Lars Volker
ee8c309187 Fix typo in load-test-warehouse-snapshot.sh
Change-Id: I2ef9b32cbc56819f80db864a6590a9a7b2732c9c
Reviewed-on: http://gerrit.cloudera.org:8080/2310
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-04-12 14:03:44 -07:00
Casey Ching
9d43aac6ce IMPALA-3274: Always start Kudu for testing
Previously Kudu would only be started when the test configuration was
the standard mini-cluster. That led to failures during data loading when
testing without the mini-cluster (ex: local file system). Kudu doesn't
require any other services so now it'll be started for all test
environments.

Change-Id: I92643ca6ef1acdbf4d4cd2fa5faf9ac97a3f0865
Reviewed-on: http://gerrit.cloudera.org:8080/2690
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-04-12 14:02:35 -07:00
Sailesh Mukil
49a73cd598 IMPALA-3249: Failed to mkdirs on core-local-filesystem build.
This failure happens on filesystems other than HDFS because, as part
of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the
new directories that the patch tries to create in create-load-data.
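
The fix pattern, roughly (the directory name is illustrative):

  hadoop fs -mkdir -p "${FILESYSTEM_PREFIX}/test-warehouse/some_new_dir"
  # rather than the bare: hadoop fs -mkdir -p /test-warehouse/some_new_dir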

Change-Id: I8de74db93893c5273ccc9c687f608959628f5004
Reviewed-on: http://gerrit.cloudera.org:8080/2644
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-03-30 00:03:45 +00:00
Alex Behm
b2ccb17c21 Print last 50 lines of log if data loading fails.
The 20 lines we dump currently are often not enough to
diagnose a failure quickly. Increasing to 50 lines.

Printing 50 lines is also consistent with our run-step
script, which also prints 50 lines.

Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73
Reviewed-on: http://gerrit.cloudera.org:8080/2632
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-03-28 20:22:18 +00:00
Alex Behm
7e76e92bef Consolidate test and cluster logs under a single directory.
All logs, test results and SQL files generated during data
loading and testing are now consolidated under a single new
directory $IMPALA_HOME/logs. The goal is to simplify archiving
in Jenkins runs and debugging.

The new structure is as follows:

$IMPALA_HOME/logs/cluster
- logs of Hadoop components and Impala

$IMPALA_HOME/logs/data_loading
- logs and SQL files produced in data loading

$IMPALA_HOME/logs/fe_tests
- logs and test output of Frontend unit tests

$IMPALA_HOME/logs/be_tests
- logs and test output of Backend unit tests

$IMPALA_HOME/logs/ee_tests
- logs and test output of end-to-end tests

$IMPALA_HOME/logs/custom_cluster_tests
- logs and test output of custom cluster tests

I tested this change with a full data load which
was successful.

Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa
Reviewed-on: http://gerrit.cloudera.org:8080/2456
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-03-28 19:23:22 +00:00
Sailesh Mukil
76b674850f IMPALA-2466: Add more tests for the HDFS parquet scanner.
These tests functionally test whether the following types of files
can be scanned properly:

1) Add a parquet file with multiple blocks such that each node has to
   scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group
   that spans the entire file. Only one scan range should do any work
   in this case.

Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-03-25 13:10:15 +00:00
Casey Ching
432a76e4dd Temporarily disable Kudu support
Change-Id: I9aeb808a9898972788cb1d5d071619d8c64b514c
Reviewed-on: http://gerrit.cloudera.org:8080/2551
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-03-16 00:15:34 +00:00
casey
804cfbdd64 Get and use Kudu from the toolchain by default
This is for review purposes only. This patch will be merged with David's
big merge patch.

Changes:
1) Make Kudu compilation dependent on the OS since not all OSs support
   Kudu.
2) Only run Kudu related tests when Kudu is supported (see #1).
3) Look for Kudu locally, but in a different location. To use a local
   build of Kudu, set KUDU_BUILD_DIR to the path Kudu was built in and
   set KUDU_CLIENT_DIR to the path Kudu was installed in.
   Example:
     git clone https://github.com/cloudera/kudu.git
     ...build 3rd party etc...
     mkdir -p $KUDU_BUILD_DIR
     cd $KUDU_BUILD_DIR
     cmake <path to Kudu source dir>
     make
     DESTDIR=$KUDU_CLIENT_DIR make install
4) Look for Kudu in the toolchain if not using a local Kudu build.
5) Add Kudu service startup scripts. The Kudu in the toolchain is
   actually a parcel that has been renamed (the contents were not
   modified in any way), that mean the Kudu service binaries are there.
   Those binaries are now used to run the Kudu service.

Change-Id: I3db88cbd27f2ea2394f011bc8d1face37411ed58
2016-03-11 11:38:05 -08:00
David Alves
82222abaf5 Merge branch 'feature/kudu' into cdh5-trunk
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183

This patch is not a pure merge patch, in the sense that it goes beyond
conflict resolution to also address reviews of the 'feature/kudu' branch
as a whole.

The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/

Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
2016-03-11 11:37:58 -08:00
ishaan
db4bd623ed IMPALA-2131: The metastore database name should be a global constant.
Previously, we tried to dynamically name the metastore db. With the introduction of
metastore snapshots, this is no longer necessary and may cause naming ambiguity if the
Impala repository has a non-standard directory structure.

This patch uses a constant name - impala_hive - defined as an environment variable in
impala-config.

Change-Id: Iadc59db8c538113171c9c2b8cea3ef3f6b3bd4fc
Reviewed-on: http://gerrit.cloudera.org:8080/517
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-03 00:58:50 +00:00
Alex Behm
194377f7d8 Add the Postgres JDBC JAR to the HADOOP_CLASSPATH for Sentry.
This patch is required for updating thirdparty.

Sentry does not ship with the Postgres JDBC driver anymore,
so we need to point it to ours in thirdparty. Sentry picks
up JARs from the HADOOP_CLASSPATH and not the CLASSPATH,
so this patch adds the JDBC driver there in run-sentry-service.sh.

Change-Id: Iee950dfcd2839b4ca0fc827a45da2a9386c4404d
Reviewed-on: http://gerrit.cloudera.org:8080/1991
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-03 00:24:59 +00:00
Tim Armstrong
0c6628af18 Use psql -q consistently
Use psql -q to suppress verbose output during metastore creation.
Also use -q instead of redirection everywhere for consistency.
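
The pattern, roughly (the database and schema file names are
illustrative):

  psql -q -U hiveuser -d "$METASTORE_DB" -f hive-schema.sql
  # rather than: psql -U hiveuser -d "$METASTORE_DB" -f hive-schema.sql > /dev/null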

Change-Id: I539da86a50d18546474b2cfdc848f992745a7875
Reviewed-on: http://gerrit.cloudera.org:8080/1884
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 21:15:04 +00:00
Casey Ching
72d1889c08 IMPALA-2873: Fix nested TPC-H data loading
In commit 960808 I forgot to update the data-loading script when a
shell script was converted to a Python script. It turns out there were
a couple of other little problems too. I checked manually that the data
was loaded after these changes.

Change-Id: Id81fc423348515ab446835868025cb839c77f52c
Reviewed-on: http://gerrit.cloudera.org:8080/1851
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-01-21 05:42:17 +00:00
Casey Ching
f288867833 Stress test: Various changes
The major changes are:

1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
   random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
   remote or local cluster. This also moves and consolidates some
   Cloudera Manager utilities that were in the stress test.
7) Cleanup the wrappers around impyla. That stuff was getting
   messy.

Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-01-20 23:00:25 +00:00
Tim Armstrong
43de306d17 Log data loading and cluster setup to file
Log the output of data loading steps to files and only print to stdout
if there is an actual failure. The output of some steps is very noisy,
and some steps even have output that looks like errors.

This is implemented with a run-step helper function in bash that handles
redirection and logging. Any bash command can be prefixed with run-step
<step description> <log file name> to redirect the output to a log file.

Sample output is:

Starting Impala cluster (logging to start-impala-cluster.log)... OK
Setting up HDFS environment (logging to setup-hdfs-env.log)... OK
Skipped loading the metadata.
Loading HBase data only (logging to load-hbase-only.log)... OK
Loading Hive UDFs (logging to build-and-copy-hive-udfs.log)... OK
Running custom post-load steps (logging to custom-post-load-steps.log)... OK
Caching test tables (logging to cache-test-tables.log)... OK
Loading external data sources (logging to load-ext-data-source.log)... OK
Splitting HBase (logging to create-hbase.log)... OK

Change-Id: I6396540858c408b084039a87efc81e1004626f39
Reviewed-on: http://gerrit.cloudera.org:8080/1760
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2016-01-20 04:38:19 +00:00
Martin Grund
4648a39823 Introduce 'latest' build symlink
This adds a new 'latest' symlink in be/build that links to the latest
build configuration. This makes our scripts behave better, as we don't
need to hard-code specific build types but can rather depend on sensible
defaults.

This patch addresses this issue in the cluster startup and a script that
is executed in the context of data loading. There might be more places,
but so far my search did not yield any additional spots where we rely
on a hardcoded path.
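
The idea, sketched (the actual link management may differ):

  ln -sfn "$IMPALA_HOME/be/build/debug" "$IMPALA_HOME/be/build/latest"
  # scripts then reference be/build/latest/... regardless of build type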

Change-Id: Ic814a1bef1d3088b2f8c1c34f25e2112b74315f8
Reviewed-on: http://gerrit.cloudera.org:8080/1797
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-17 09:04:53 +00:00
Tim Armstrong
1832bb285e Misc changes to reduce build noise
Use mvn-quiet.sh in a couple of places it was missed.

Fix mvn warnings.

Provide -q flag to git clean to prevent it reporting all of the files it
deletes.

Change-Id: I77ec2265bf35f64ab1ac76b0a253e67c5f97eccd
Reviewed-on: http://gerrit.cloudera.org:8080/1804
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-17 08:50:07 +00:00
Tim Armstrong
f13dfcbddc Suppress maven info logging
Maven's INFO log level is very verbose and includes a lot of progress
information that is minimally useful.

Maven doesn't have an option to output only ERROR and WARNING log
messages. As a workaround, use grep to filter out the majority of the
output (only warnings, errors, tests, and success/failure).

Also add a header with relevant info about the maven command:
targets and working directory.
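
Roughly the workaround; the exact grep pattern here is illustrative:

  echo "Running mvn $* in $(pwd)"
  mvn "$@" 2>&1 | grep -E -e WARNING -e ERROR -e 'Tests run' \
      -e 'BUILD (SUCCESS|FAILURE)'
  # (a real script would also check PIPESTATUS so mvn failures aren't masked)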

Change-Id: I828b870edc2fc80a6460e6ed594d507c46e69c82
Reviewed-on: http://gerrit.cloudera.org:8080/1752
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-01-15 19:38:46 +00:00
Martin Grund
d51f20fa1f Passing cluster startup flags
This patch allows passing additional cluster startup flags.
This is needed when building with optimizations in release
mode as the default cluster startup would only pick up a
debug build.

Change-Id: Ib98d6814558f2d82bdeac0e3cce1fb7db048c459
Reviewed-on: http://gerrit.cloudera.org:8080/1775
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2016-01-14 16:48:43 +00:00
Casey Ching
cfb1ab5c2c IMPALA-2781: Fix shell error reporting after chdir
The original error reporting relied on $0 being accessible from the
current working dir, which failed if a script changed the working dir
and $0 was relative. This updates the error reporting command to cd back
to the original dir before accessing $0.
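
A sketch of the approach, with illustrative names:

  STARTING_DIR=$(pwd)
  on_error() {
    cd "$STARTING_DIR"  # $0 may be relative to where the script was launched
    echo "Error in $0 at line $1: $(sed -n "${1}p" "$0")" >&2
  }
  trap 'on_error $LINENO' ERR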

Change-Id: I2185af66e35e29b41dbe1bb08de24200bacea8a1
Reviewed-on: http://gerrit.cloudera.org:8080/1666
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-01-14 07:10:54 +00:00