impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 18:02:33 -05:00

Author	SHA1	Message	Date
Taras Bobrovytsky	bd6d2df730	IMPALA-5527: Add nested testdata flattener The TableFlattener takes a nested dataset and creates an equivalent unnested dataset. The unnested dataset is saved as Parquet. When an array or map is encountered in the original table, the flattener creates a new table and adds an id column to it which references the row in the parent table. Joining on the id column should produce the original dataset. The flattened dataset should be loaded into Postgres in order to run the query generator (in nested types mode) on it. There is a script that automates generation, flattening and loading random data into Postgres and Impala: testdata/bin/generate-load-nested.sh -f Testing: - ran ./testdata/bin/generate-load-nested.sh -f and random nested data was generated and flattened as expected. Change-Id: I7e7a8e53ada9274759a3e2128b97bec292c129c6 Reviewed-on: http://gerrit.cloudera.org:8080/5787 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-17 03:18:06 +00:00
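The flattening idea in the IMPALA-5527 message can be illustrated with a minimal sketch. This is not the actual TableFlattener (which works on real nested tables and writes Parquet); the function name `flatten_table`, the child-table naming scheme, and the `id` / `parent_id` / `item` column names are all hypothetical, and only arrays (Python lists) are handled.

```python
def flatten_table(table_name, rows):
    """Split rows containing nested lists into a parent table plus child tables.

    Hypothetical illustration only: each nested list becomes its own table
    whose rows carry a parent_id pointing back at the parent row's id.
    """
    tables = {table_name: []}
    for row_id, row in enumerate(rows):
        flat_row = {"id": row_id}
        for column, value in row.items():
            if isinstance(value, list):
                # Nested collection: move its items into a separate child table.
                child_name = f"{table_name}_{column}"
                child_rows = tables.setdefault(child_name, [])
                for item in value:
                    child_rows.append({"parent_id": row_id, "item": item})
            else:
                flat_row[column] = value
        tables[table_name].append(flat_row)
    return tables
```

Joining a child table to its parent on `parent_id = id` recovers which items belonged to which original row, which is the property the commit message relies on.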
Lars Volker	467ccd1950	IMPALA-5223: Add waiting for HBase Zookeeper nodes to retry loop Occasionally we'd see HBase fail to startup properly on CentOS 7 clusters. The symptom was that HBase would not open the required nodes in zookeeper, signaling its readiness. As a workaround, this change includes waiting for the Zookeeper nodes into the retry logic. Change-Id: Id8dbdff4ad02cac1322e7d580e0a6971daf6ea28 Reviewed-on: http://gerrit.cloudera.org:8080/7159 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: anujphadke <aphadke@cloudera.com> Reviewed-by: David Knupp <dknupp@cloudera.com> Tested-by: Lars Volker <lv@cloudera.com>	2017-06-13 05:57:49 +00:00
Jakub Kukul	0992a6afda	IMPALA-2525: Treat parquet ENUMs as STRINGs when creating impala tables. Change-Id: Ia7a2e20c3ab83eb3fac422c3b33c117856fec475 Reviewed-on: http://gerrit.cloudera.org:8080/6550 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-07 02:51:54 +00:00
Jim Apple	07a7138817	Add a script to test performance on a developer machine This is a migration from an old and broken script from another repository. Example use: bin/single_node_perf_run.py --ninja --workloads targeted-perf \ --load --scale 4 --iterations 20 --num_impalads 3 \ --start_minicluster --query_names PERF_AGG-Q3 \ $(git rev-parse HEAD~1) $(git rev-parse HEAD) The script can load data, run benchmarks, and compare the statistics of those runs for significant differences in performance. It glues together buildall.sh, bin/load-data.py, bin/run-workload.py, and tests/benchmark/report_benchmark_results.py. Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9 Reviewed-on: http://gerrit.cloudera.org:8080/6818 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-05-31 08:10:48 +00:00
Michael Ho	f15589573b	IMPALA-5376: Loads all TPC-DS tables This change loads the missing tables in TPC-DS. In addition, it also fixes up the loading of the partitioned table store_sales so all partitions will be loaded. The existing TPC-DS queries are also updated to use the parameters for qualification runs as noted in the TPC-DS specification. Some hard-coded partition filters were also removed. They were there due to the lack of dynamic partitioning in the past. Some missing TPC-DS queries are also added to this change, including query28 which discovered the infamous IMPALA-5251. Having all tables in TPC-DS available paves the way for us to include all supported TPCDS queries in our functional testing. Due to the change in the data, planner tests and the E2E tests have different results than before. The results of E2E tests were compared against the run done with Netezza and Vertica. The divergence were all due to the truncation behavior of decimal types in DECIMAL_V1. Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9 Reviewed-on: http://gerrit.cloudera.org:8080/6877 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-27 05:19:53 +00:00
Lars Volker	12f3ecceab	IMPALA-5287: Test skip.header.line.count on gzip This change fixed IMPALA-4873 by adding the capability to supply a dict 'test_file_vars' to run_test_case(). Keys in this dict will be replaced with their values inside test queries before they are executed. Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70 Reviewed-on: http://gerrit.cloudera.org:8080/6817 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-09 01:36:46 +00:00
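The `test_file_vars` substitution described in the IMPALA-5287 message can be pictured roughly as follows. This is a sketch under assumptions, not the code in `run_test_case()`: the helper name `substitute_test_file_vars` and the `$NAME`-style key used in the usage note are hypothetical.

```python
def substitute_test_file_vars(query, test_file_vars):
    """Replace every key of test_file_vars found in the query with its value.

    Hypothetical helper: plain string replacement applied before execution.
    """
    for key, value in test_file_vars.items():
        query = query.replace(key, str(value))
    return query
```

For example, `substitute_test_file_vars("select * from $DB.t", {"$DB": "tmp_db"})` yields `"select * from tmp_db.t"`.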
Jim Apple	374f1121da	IMPALA-3224: De-Cloudera non-docs JIRA URLs John Russell is planning to fix the URLS in docs in a separate commit. Fixed using: (git ls-files \| xargs replace \ 'https://issues.cloudera.org/browse/IMPALA' 'IMPALA' --) && \ git checkout HEAD docs Change-Id: I28ea06e89341de234f9005fdc72a2e43f0ab8182 Reviewed-on: http://gerrit.cloudera.org:8080/6487 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-05-07 04:44:57 +00:00
Michael Brown	8b459dffec	IMPALA-5162,IMPALA-5163: stress test support on secure clusters This patch adds support for running the stress test (concurrent_select.py) and loading nested data (load_nested.py) into a Kerberized, SSL-enabled Impala cluster. It assumes the calling user already has a valid Kerberos ticket. One way to do that is: 1. Get access to a keytab and krb5.config 2. Set KRB5_CONFIG and KRB5CCNAME appropriately 3. Run kinit(1) 4. Run load_nested.py and/or concurrent_select.py within this environment. Because our Python clients already support Kerberos and SSL, we simply need to make sure to use the correct options when calling the entry points and initializing the clients: Impala: Impyla Hive: Impyla HDFS: hdfs.ext.kerberos.KerberosClient With this patch, I was able to manually do a short concurrent_select.py run against a secure cluster without connection or auth errors, and I was able to do the same with load_nested.py for a cluster that already had TPC-H loaded. Follow-ons for future cleanup work: IMPALA-5263: support CA bundles when running stress test against SSL'd Impala IMPALA-5264: fix InsecurePlatformWarning under stress test with SSL Change-Id: I0daad57bb8ceeb5071b75125f11c1997ed7e0179 Reviewed-on: http://gerrit.cloudera.org:8080/6763 Reviewed-by: Matthew Mulder <mmulder@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-02 04:56:01 +00:00
David Knupp	894bb77855	IMPALA-4839: Remove implicit 'localhost' for KUDU_MASTER_HOSTS The Kudu query tests were failing on a remote cluster because the Kudu master was always set to '127.0.0.1', with no way to override it. This patch corrects the issue with a number of changes: - Add a pytest command line option to specify an arbitrary Kudu master - Consolidate the place where the default Kudu master is derived. It had been stored both in the env and in tests/common/__init__.py, with different files looking to different places. For now, just look to the env, and remove the value from __init__.py. - The kudu_client test fixture in conftest.py was using the connect() method from impala.dbapi (part of the Impyla library), without specifying the host param. In the absence of that, the default value is 'localhost', so add the host param to the connect() call. - Define the various defaults for pytest config as constants at the top of conftest.py. Change-Id: I9df71480a165f4ce21ae3edab6ce7227fbf76f77 Reviewed-on: http://gerrit.cloudera.org:8080/5877 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-14 21:51:39 +00:00
David Knupp	226a2e6332	IMPALA-4684: Handle Zookeeper ConnectionLoss exceptions This is the second patch to address IMPALA-4684. The first patch exposed a transient Zookeeper connection error on RHEL7. This patch introduces a retry (up to 3 times), and somewhat better logging. Tested by running tests against an RHEL7 instance and confirming that all HBase nodes start up. Change-Id: I44b4eec342addcfe489f94c332bbe14225c9968c Reviewed-on: http://gerrit.cloudera.org:8080/5554 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-12-22 01:18:56 +00:00
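A bounded retry of the kind this message describes might look like the sketch below. It is not the patch itself: `TransientConnectionError` is a hypothetical stand-in for the ZooKeeper connection-loss exception, `call_with_retries` is an invented name, and "up to 3 times" is read here as three attempts in total, which is an assumption.

```python
import time


class TransientConnectionError(Exception):
    """Hypothetical stand-in for a transient ZooKeeper connection error."""


def call_with_retries(check, max_tries=3, delay_seconds=0.0):
    """Call check(); on a transient error, retry until max_tries is reached."""
    for attempt in range(1, max_tries + 1):
        try:
            return check()
        except TransientConnectionError:
            if attempt == max_tries:
                # Out of attempts: let the caller see the original error.
                raise
            time.sleep(delay_seconds)
```

Only the transient error type is caught, so unexpected failures still surface immediately instead of being retried.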
David Knupp	73146a0a46	IMPALA-4684: Wrap checking for HBase nodes in a try/finally block If an exception (other than NoNodeError) was raised while checking for HBase nodes, we weren't cleanly stopping the ZooKeeper client, which in turn created a second exception when the connection was closed. The second exception masked the original error condition. Tested by forcibly raising unexpected errors while checking for HBase nodes. Change-Id: I46a74d018f9169385a9f10a85718044c31a24dbc Reviewed-on: http://gerrit.cloudera.org:8080/5547 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-12-21 00:22:16 +00:00
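The try/finally pattern named in this message can be sketched as follows. `FakeZkClient` is a hypothetical stand-in for a real ZooKeeper client (it only mimics `start`, `exists`, and `stop`), and `find_missing_nodes` is an invented name; the point is only that `stop()` runs even when the check raises.

```python
class FakeZkClient:
    """Hypothetical stand-in for a ZooKeeper client, used for illustration."""

    def __init__(self, existing_paths, fail=False):
        self.existing_paths = set(existing_paths)
        self.fail = fail
        self.stopped = False

    def start(self):
        pass

    def exists(self, path):
        if self.fail:
            raise RuntimeError("unexpected error while checking " + path)
        return path in self.existing_paths

    def stop(self):
        self.stopped = True


def find_missing_nodes(client, paths):
    """Return the paths that do not exist, always stopping the client."""
    client.start()
    try:
        return [path for path in paths if not client.exists(path)]
    finally:
        # Runs on success and on any exception, so the original error
        # propagates without a second failure from an unclosed client.
        client.stop()
```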
Tim Armstrong	44ae9fcead	IMPALA-4649: add a mechanism to pass flags into make Testing: Tested that buildall.sh works as expected. Built locally with IMPALA_MAKE_FLAGS unset to confirm I didn't break anything. Built locally with IMPALA_MAKE_FLAGS=--load-average=$IMPALA_BUILD_THREADS and looked at "ps auxf" output to confirm it's passed through. Change-Id: I17b13cbaf395f962762d5cff3d650ffb077934a4 Reviewed-on: http://gerrit.cloudera.org:8080/5480 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2016-12-15 21:37:17 +00:00
Dan Burkert	f83652c1da	Replace INTO N BUCKETS with PARTITIONS N in CREATE TABLE This commit also removes the now unused `DISTRIBUTE`, `SPLIT`, and `BUCKETS` keywords that were going to be newly released in Impala 2.6, but are now unused. Additionally, a few remaining uses of the `DISTRIBUTE BY` syntax has been switched to `PARTITION BY`. Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922 Reviewed-on: http://gerrit.cloudera.org:8080/5382 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-12-07 07:31:16 +00:00
Dimitris Tsirogiannis	cba93f1ac3	IMPALA-4561: Replace DISTRIBUTE BY with PARTITION BY in CREATE TABLE Change-Id: I0e07c41eabb4c8cb95754cf04293cbd9e03d6ab2 Reviewed-on: http://gerrit.cloudera.org:8080/5317 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-12-06 10:41:53 +00:00
Henry Robinson	2648bfbd90	Improve message output from run-step.sh run-step prints a message to tell the reader what it's doing. However, that message wasn't flushed so that run-step could print OK or FAILED on the same line. The result was that long-running steps wouldn't print anything to the log until they were done, at least in Jenkins contexts. This patch changes it so that the message is flushed, and then the result is printed on a separate line (including the time it took to run the step). $ run-step "Hello world!" helloworld.out sleep 5 Hello world! (logging to /tmp/helloworld.out)... OK (Took: 0 min 5 sec) Change-Id: Iaced729f0ef6aa93174cd90b1516d3c34fe41a22 Reviewed-on: http://gerrit.cloudera.org:8080/5116 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-17 09:35:14 +00:00
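The run-step.sh behavior described above (flush the message first, then print the result and elapsed time on a separate line) is a shell change; the same idea is sketched here in Python for illustration. The function name `run_step` and its argument layout are assumptions and do not mirror the script's real interface.

```python
import subprocess
import sys
import time


def run_step(message, log_path, command):
    """Run command, logging its output to log_path; report OK or FAILED."""
    # Flush right away so the message is visible while a long step runs.
    print(f"{message} (logging to {log_path})... ", flush=True)
    start = time.time()
    with open(log_path, "w") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    elapsed = int(time.time() - start)
    status = "OK" if result.returncode == 0 else "FAILED"
    # The result goes on its own line, together with the elapsed time.
    print(f"    {status} (Took: {elapsed // 60} min {elapsed % 60} sec)")
    return result.returncode == 0


if __name__ == "__main__":
    run_step("Hello world!", "helloworld.out", [sys.executable, "-c", "print('hi')"])
```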
Jim Apple	4b774880c9	Increase wait times for startup of Hive and its Metastore On Ubuntu 14.04 on AWS EC2 m4.4x, instances, these components frequently take more than 30 seconds to start. I have seen the HMS take more than 90 seconds; this patch sets a more conservative timeout default. Change-Id: I43eb8646cca495578c8f9730faa04812957d2917 Reviewed-on: http://gerrit.cloudera.org:8080/5068 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 20:35:01 +00:00
Jim Apple	6775893894	IMPALA-4447: Rein in overly broad sed that dirties the tree This patch fixes a sed expression to make sure it only alters the code it is meant to alter, not the comment describing the code. Tested with tests/run-tests.py query_test/test_udfs.py Change-Id: I51a0498d24b7fccc05b6183123501766cb36f85e Reviewed-on: http://gerrit.cloudera.org:8080/5008 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 02:44:36 +00:00
Martin Grund	ce4c5f6743	IMPALA-4365: Enabling end-to-end tests on a remote cluster This patch lays the groundwork for loading data and running end-to-end tests on a remote CDH cluster. The requirements for the cluster to run the tests are: - Managed by Cloudera Manager (CM) - GPL Extras need to be installed - KMS and KeyTrustee installed and available as a service - SERDEPROPERTIES in the Hive DB modified to accept wide tables - Hive warehouse dir points to /test-warehouse The actual data loading is done via a new script, remote_data_load.py, which takes the CM host as an argument. It can be run from a client machine that is not a node of the cluster, but it needs to have the Impala repo checked out and Impala built. This insures that all of the necessary data load scripts are available, as well as setting up the environment properly (client binaries like beeline and the hbase shell are available, python libraries like cm_api are installed, necessary environment variables are defined, etc.) It should be noted that running remote_data_load.py will overwrite any local XML config files with the configurations downloaded from the remote cluster. Usage: remote_data_load.py [options] <cm_host address> Options: -h, --help show this help message and exit --snapshot-file=SNAPSHOT_FILE Path to the test-warehouse archive --cm-user=CM_USER Cloudera Manager admin user --cm-pass=CM_PASS Cloudera Manager admin user password --gateway=GATEWAY Gateway host to upload the data from. If not set, uses the CM host as gateway. --ssh-user=SSH_USER System user on the remote machine with passwordless SSH configured. --no-load Do not try to load the snapshot --exploration-strategy=EXPLORATION_STRATEGY --test Run end-to-end tests against cluster Testing: This patch is being submitted with the understanding that there are still clean up issues that need to be addressed in the remote data load script, for which JIRA's have been filed. 
However, since many of the existing build scripts also had to be modified, it is more important to make sure that no regressions were inadvertently introduced into the existing data load process. Loading data to a local mini-cluster was checked repeatedly while this patch was being developed, as well as running it against the Jenkins job that provides the test-warehouse snapshot used by the many other Impala CI builds that run daily. Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9 Reviewed-on: http://gerrit.cloudera.org:8080/4769 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 10:16:55 +00:00
Tim Armstrong	0dbfe169b7	IMPALA-4277: remove unneeded LegacyTCLIService Change-Id: I827590b19dc542f6256ae2e0d541eaa32a76520b Reviewed-on: http://gerrit.cloudera.org:8080/4844 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-10-26 02:34:01 +00:00
Henry Robinson	e0a3272129	Minor compute stats script fixes * Change run-step to output full log path * Change text to say "Computing table stats" rather than "Computing HBase stats" when running compute-table-stats.sh Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07 Reviewed-on: http://gerrit.cloudera.org:8080/4825 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 00:13:54 +00:00
Dimitris Tsirogiannis	8a49ceaae5	IMPALA-3739: Enable stress tests on Kudu This commit modifies the stress test framework to run TPC-H and TPC-DS workloads against Kudu. The following changes are included in this commit: 1. Created template files with DDL and DML statements for loading TPC-H and TPC-DS data in Kudu 2. Created a script (load-tpc-kudu.py) to load data in Kudu. The script is invoked by the stress test runner to load test data in an existing Impala/Kudu cluster (both local and CM-managed clusters are supported). 3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL files with TPC-H queries for Kudu were added in a previous patch. 4. Modified the stress test runner to take additional parameters specific to Kudu (e.g. kudu master addr) The stress test runner for Kudu was tested on EC2 clusters for both TPC-H and TPC-DS workloads. Missing functionality: * No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu. * Not all supported TPC-DS queries are included. Currently, only the TPC-DS queries from the testdata/workloads/tpcds/queries directory were modified to run against Kudu. Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Reviewed-on: http://gerrit.cloudera.org:8080/4327 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 11:01:37 +00:00
Dimitris Tsirogiannis	041fa6d946	IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables With this commit we simplify the syntax and handling of CREATE TABLE statements for both managed and external Kudu tables. Syntax example: CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b)) DISTRIBUTE BY HASH (a) INTO 3 BUCKETS, RANGE (b) SPLIT ROWS (('abc', 'def')) STORED AS KUDU Changes: 1) Remove the requirement to specify table properties such as key columns in tblproperties. 2) Read table schema (column definitions, primary keys, and distribution schemes) from Kudu instead of the HMS. 3) For external tables, the Kudu table is now required to exist at the time of creation in Impala. 4) Disallow table properties that could conflict with an existing table. Ex: key_columns cannot be specified. 5) Add KUDU as a file format. 6) Add a startup flag to impalad to specify the default Kudu master addresses. The flag is used as the default value for the table property kudu_master_addresses but it can still be overriden using TBLPROPERTIES. 7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE wasn't implemented for Kudu tables and silently ignored. The Kudu tables wouldn't be removed in Kudu. 8) Remove DDL delegates. There was only one functional delegate (for Kudu) the existence of the other delegate and the use of delegates in general has led to confusion. The Kudu delegate only exists to provide functionality missing from Hive. 9) Add PRIMARY KEY at the column and table level. This syntax is fairly standard. When used at the column level, only one column can be marked as a key. When used at the table level, multiple columns can be used as a key. Only Kudu tables are allowed to use PRIMARY KEY. The old "kudu.key_columns" table property is no longer accepted though it is still used internally. "PRIMARY" is now a keyword. The ident style declaration is used for "KEY" because it is also used for nested map types. 
10) For managed tables, infer a Kudu table name if none was given. The table property "kudu.table_name" is optional for managed tables and is required for external tables. If for a managed table a Kudu table name is not provided, a table name will be generated based on the HMS database and table name. 11) Use Kudu master as the source of truth for table metadata instead of HMS when a table is loaded or refreshed. Table/column metadata are cached in the catalog and are stored in HMS in order to be able to use table and column statistics. Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Reviewed-on: http://gerrit.cloudera.org:8080/4414 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 10:52:25 +00:00
David Knupp	05b91a973c	IMPALA-4294: Make check-schema-diff.sh executable from anywhere. Fixes a regression in the data load process that had been introduced by commit `75a857c`. To make check-schema-diff.sh work from anywhere, we need to specify the git-dir and work-tree arguments everywhere we call git. Change-Id: I32e0dce2c10c443763a038aa3b64b1c123ed62ad Reviewed-on: http://gerrit.cloudera.org:8080/4726 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-10-15 04:05:04 +00:00
Thomas Tauber-Marshall	b2c2fe7813	IMPALA-3786: Replace "cloudera" with "apache" (part 2) As part of the ASF transition, we need to replace references to Cloudera in Impala with references to Apache. This primarily means changing Java package names from com.cloudera.impala.* to org.apache.impala.* A prior patch renamed all the files as necessary, and this patch performs the actual code changes. Most of the changes in this patch were generated with some commands of the form: find . \| grep "\.java\\|\.py\\|\.h\\|\.cc" \| \ xargs sed -i s/'com$.$cloudera$\.$impala/org\1apache\2impala/g along with some manual fixes. After this patch, the remaining references to Cloudera in the repo mostly fall into the categories: - External components that have cloudera in their own package names, eg. com.cloudera.kudu/llama - URLs, eg. https://repository.cloudera.com/ Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2 Reviewed-on: http://gerrit.cloudera.org:8080/3937 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-09-29 21:14:13 +00:00
Jim Apple	57fcbf7a28	IMPALA-4171: Remove JAR from repo. By ASF rules, we can't have JARs in releases. The releases are just tarballs of the repo. This patch removes from the repo the single JAR there, which was a version of a JAR that is built during data load, with one string changed. The JAR is used only for testing. Instead of building that jar with the different string and saving the result in git, data loading will now build the jar twice, with one Java source file slightly changed. Change-Id: Icee7b8c32b08e064dea4a14624acff6021ef5ce1 Reviewed-on: http://gerrit.cloudera.org:8080/4499 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-22 02:00:50 +00:00
David Knupp	a42d18dcc3	IMPALA-2013: Reintroduce steps for checking HBase health in run-hbase.sh We used to include a step in run-hbase.sh for calling a python script that queried Zookeeper to see if the HBase master was up. The original script was problematic, so we stopped using it during our mini-cluster HBase start up procedure. HBase start up issues continue to plague us, however. This patch reintroduces a Zookeeper check, with the following updates: - replace the original script with check-hbase-nodes.py - query the correct node /hbase/master, not just /hbase/rs - use the python Zookeeper library kazoo, rather than calling out to the shell and parsing the return string - since we are moving toward testing on a remote cluster, also add the capability to pass in the address for the host that provides the Zookeeper and HBase services - add an additional check that the HDFS service is running, because of an edge case where the HBase master can briefly start without a cluster running. In addition to the expected tests, this script was also tested under the conditions of IMPALA-4088, whereby the HBase RegionServer is running, but the master fails because another listening process has already taken its TCP port (60010) during startup. Change-Id: I9b81f3cfb6ea0ba7b18ce5fcd5d268f515c8b0c3 Reviewed-on: http://gerrit.cloudera.org:8080/4348 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-15 00:02:22 +00:00
Matthew Jacobs	c7fa03286b	IMPALA-3718: Support subset of functional-query for Kudu Adds initial support for the functional-query test workload for Kudu tables. There are a few issues that make loading the functional schema difficult on Kudu: 1) Kudu tables must have one or more columns that together constitute a unique primary key. a) Primary key columns must currently be the first columns in the table definition (KUDU-1271). b) Primary key columns cannot be nullable (KUDU-1570). 2) Kudu tables must be specified with distribution parameters. (1) limits the tables that can be loaded without ugly workarounds. This patch only includes important tables that are used for relevant tests, most notably the alltypes* family. In particular, alltypesagg is important but it does not have a set of columns that are non-nullable and form a unique primary key. As a result, that table is created in Kudu with a different name and an additional BIGINT column for a PK that is a unique index and is generated at data loading time using the ROW_NUMBER analytic function. A view is then wrapped around the underlying table that matches the alltypesagg schema exactly. When KUDU-1570 is resolved, this can be simplified. (2) requires some additional considerations and custom syntax. As a result, the DDL to create the tables is explicitly specified in CREATE_KUDU sections in the functional_schema_constraints.csv, and an additional DEPENDENT_LOAD_KUDU section was added to specify custom data loading DML that differs from the existing DEPENDENT_LOAD. TODO: IMPALA-4005: generate_schema_statements.py needs refactoring Tests that are not relevant or not yet supported have been marked with xfail and a skip where appropriate. TODO: Support remaining functional tables/tests when possible. Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc Reviewed-on: http://gerrit.cloudera.org:8080/4175 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-09-14 22:11:04 +00:00
Jim Apple	bd2947329e	IMPALA-4110: Clean up issues found by Apache RAT. Change-Id: I5bfe77f9a871018e7a67553ed270e2df53006962 Reviewed-on: http://gerrit.cloudera.org:8080/4361 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-14 22:09:24 +00:00
Zoltan Ivanfi	b35689d7d9	Minor enhancements to helper scripts. - run-all-tests.sh: survive non-fatal failures when calling ulimit. - copy-udfs-udas.sh: respect $MAKE_CMD instead of blindly using make. Change-Id: Ic90bd0048786c799a8ac435de4303ed399ac1223 Reviewed-on: http://gerrit.cloudera.org:8080/4304 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-09-05 15:17:22 +00:00
Yuanhao Luo	052d3cc8dd	IMPALA-4056: Fix toSql() of DistributeParam This commit fixes two issues in toSql() of DistributeParam: 1. string literals were not quoted 2. range partition split rows were not printed. Besides, this commit fixes a small issue in run-hive-server.sh Change-Id: I984a63a24f02670347b0e1efceb864d265d1f931 Reviewed-on: http://gerrit.cloudera.org:8080/4195 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-02 20:11:27 +00:00
Matthew Jacobs	d113205cee	IMPALA-3650: DISTRIBUTE BY required for managed Kudu tables As of Kudu 0.9, DISTRIBUTE BY is now required when creating a new Kudu table. Create table analysis, data loading, and tests are updated to reflect this. This also bumps the Kudu version to 0.10.0. Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e Reviewed-on: http://gerrit.cloudera.org:8080/3987 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-08-19 02:14:39 +00:00
Christopher Channing	90a6b3206e	IMPALA-3964: Fix crash when a count() is performed on a nested collection. The Bug: Prior to this patch, a DCHECK was used to verify that the underlying memory pool for the scratch batch was empty in a count based scenario. For IMPALA-3964 (where a count() is performed on a nested collection), if a Parquet column chunk is compressed, upon reading each new data page it would be decompressed and eventually placed in to the underlying scratch batch memory pool causing the aforementioned DCHECK to fail. This was not picked up in the test suite as the TPCH nested Parquet data is not compressed. The Fix: Removed the erroneous DCHECK. Added logic to determine if any memory in the scratch batch needs to be freed (due to the transfer that occurs from the decompressed data pool), if so, it will be done. Augmented the load_nested.py script to snappy compress each of the tables within the 'tpch_nested_parquet' database. This is consistent with how the flat TPCH Parquet data set is stored. Regarding test coverage, there are already a number of tests that will perform nested collection counts against the tables in the 'tpch_nested_parquet' database. For uncompressed nested Parquet, the 'test_nested_types.py' test suite leverages the 'ComplexTypesTbl' table to provide good coverage. Change-Id: Id0955c85d18dfba4bd29a35ec95d0355da050607 Reviewed-on: http://gerrit.cloudera.org:8080/3940 Reviewed-by: Michael Ho <kwho@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-16 11:54:04 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt \| xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files_txt \| xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Dimitris Tsirogiannis	6fbd35fa87	Enable TPC-H workload for Kudu tables With this commit we enable loading of TPC-H data in Kudu tables and running the 22 TPC-H queries against Kudu. Since Kudu doesn't support the decimal data type, we had to modify the queries by using round() function and update the test results. Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289 Reviewed-on: http://gerrit.cloudera.org:8080/3789 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-07-28 04:35:11 +00:00
Tim Armstrong	bc8c55afcd	IMPALA-3729: batch_size=1 coverage for avro scanner Also fix a stale comment in the avro scanner header. The main work here is to fix the handling of empty result sets in the test result verifier. This is a problem because we wanted to verify that the results in the test file were a superset of the rows returned, and this was thrown off by superfluous '' rows in the expected and actual result sets. The basic problem is that the way test file sections were parsed conflated an empty result section with a non-empty result section that had a single empty string. I.e.: ---- RESULTS ==== vs ---- RESULTS ==== both got resolved to ['']. Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4 Reviewed-on: http://gerrit.cloudera.org:8080/3413 Tested-by: Internal Jenkins Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-07-19 23:30:02 -07:00
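The conflation described in the IMPALA-3729 message (an empty RESULTS section and a section holding one empty row both parsing to `['']`) is easy to reproduce with plain string splitting. The sketch below is an illustration only, not Impala's test file parser; both function names are hypothetical.

```python
def parse_rows_naive(section_body):
    """Naive parse: an empty body and a body of one blank line look the same."""
    return section_body.rstrip("\n").split("\n")


def parse_rows(section_body):
    """Parse that keeps 'no rows' distinct from 'one empty-string row'."""
    return section_body.splitlines()
```

With the naive version both `""` and `"\n"` come back as `['']`, whereas `parse_rows("")` gives `[]` and `parse_rows("\n")` gives `['']`, so a superset check over expected and actual rows is no longer thrown off by a phantom empty row.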
Michael Ho	ed5ec6772f	IMPALA-1619: Support 64-bit allocations. This change extends MemPool, FreePool and StringBuffer to support 64-bit allocations, fixes a bug in decompressor and extends various places in the code to support 64-bit allocation sizes. With this change, the text scanner can now decompress compressed files larger than 1GB. Note that the UDF interfaces FunctionContext::Allocate() and FunctionContext::Reallocate() still use 32-bit for the input argument to avoid breaking compatibility. In addition, the byte size of a tuple is still assumed to be within 32-bit. If it needs to be upgraded to 64-bit, it will be done in a separate change. A new test has been added to test the decompression of a 2GB snappy block compressed text file. Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f Reviewed-on: http://gerrit.cloudera.org:8080/3575 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-07-08 15:42:09 -07:00
Michael Brown	08e8de73b2	IMPALA-3806: remove a few modern shell idioms to improve RHEL5 support Both `find -executable` and the Bash "&>>" operator are too new to be supported on RHEL5. Both have reasonable workarounds, so prefer them. Note that this may not be the exhaustive list of such "modern" conventions, but RHEL5 isn't working end-to-end, so we can't identify all of them in a single commit yet. Testing: Before, the RHEL5 build would fail quite early here. Now, data load succeeds and most of the backend tests successfully run. Change-Id: I7438bed908d8026327923607238808122212d2d8 Reviewed-on: http://gerrit.cloudera.org:8080/3531 Reviewed-by: David Knupp <dknupp@cloudera.com> Tested-by: Internal Jenkins	2016-07-05 13:37:26 -07:00
Sailesh Mukil	73595d8f40	IMPALA-3737: Local filesystem build failed loading custom schemas When a LOCATION that does not have the scheme specified is used, the default FS is used as the filesystem scheme. The default FS is set as 'file:/tmp' for localFS runs, however the Hadoop library seems to ignore the '/tmp' part of the defaultFS for locations without schemes and just uses 'file:'. So the test warehouse is in: 'file:/tmp/test-warehouse' However, the scripts access '/test-warehouse' without the scheme which hadoop translates to: 'file:/test-warehouse' which does not exist. This change disables metadata loading on local filesystem if there is a schema change detected just as it is done in S3 and Isilon too. Change-Id: Ie404079aeb2f837ac8b03244b2019e2c8ee9f221 Reviewed-on: http://gerrit.cloudera.org:8080/3384 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Sailesh Mukil <sailesh@cloudera.com>	2016-06-16 17:34:34 -07:00
Tim Armstrong	547be27e77	IMPALA-3745: parquet invalid data handling Added checks/error handling: * Negative string lengths while decoding dictionary or data page. * Buffer overruns while decoding dictionary or data page. * Some metadata FILECHECKs were converted to statuses. Testing: Unit tests for: * decoding of strings with negative lengths * truncation of all parquet types * dictionary creation correctly handling error returns from Decode(). End-to-end tests for handling of negative string lengths in dictionary- and plain-encoded data in corrupt files, and for handling of buffer overruns for string data. The corrupted parquet files were generated by hacking Impala's parquet writer to write invalid lengths, and by hacking it to write plain-encoded data instead of dictionary-encoded data by default. Performance: set num_nodes=1; set num_scanner_threads=1; select * from biglineitem where l_orderkey = -1; I inspected MaterializeTupleTime. Before the average was 8.24s and after was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%). Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece Reviewed-on: http://gerrit.cloudera.org:8080/3387 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-06-15 21:33:39 -07:00
Michael Ho	a485d44fd0	IMPALA-3223: Update path to Postgres JDBC driver Maven has been downloading the postgres JDBC driver all along. So, let's use the one in fe/target/dependency instead of the one in thirdparty. Change-Id: I76bce18fd308890e66615c8d08d5e58f02a8a132 Reviewed-on: http://gerrit.cloudera.org:8080/3232 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-27 08:42:17 -07:00
Anuj Phadke	a915293109	IMPALA-1850: Allow fs.defaultFS to be set to a non-HDFS filesystem This change whitelists the supported filesystems which can be set as Default FS for Impala to run on. This patch configures Impala to use S3 as the default filesystem, rather than a secondary filesystem as before. Change-Id: I2f45bef6c94ece634045acb906d12591587ccfed Reviewed-on: http://gerrit.cloudera.org:8080/1121 Reviewed-by: anujphadke <aphadke@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:40 -07:00
Harrison Sheinblatt	1058163f70	IMPALA-2276: Isilon and s3 builds must fail with stale snapshot If a stale snapshot is detected, the full data load proceeds even if the option to skip data load was set. A check is added to fail immediately if this happens for isilon or s3 because the full data load will not work on these filesystems currently. Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52 Reviewed-on: http://gerrit.cloudera.org:8080/2811 Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com> Tested-by: Internal Jenkins	2016-05-12 14:17:37 -07:00
Alex Behm	50314b5b2c	Set PYTHONUNBUFFERED in wait-for-* scripts. Before this fix our "Waiting for something to happen" print output would be buffered and dumped all at once when the event we were waiting for succeeded or we hit a timeout. After this fix the output of "print" is displayed on the console immediately, as was originally intended. Change-Id: Icf341e81d0d459504918ae7c9e88918fe5e16c59 Reviewed-on: http://gerrit.cloudera.org:8080/2810 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:35 -07:00
Michael Brown	6d8c075d9c	IMPALA-1996: Start HBase per directions in documentation; Implement HBase startup retry I. Start HBase per directions 1. https://hbase.apache.org/book.html#_configuration_files mentions a 'regionservers' file that points to a list of hosts on which to start HBase RegionServers. When HBase starts in our mini-cluster there are messages printed like this: cat: /home/mikeb/Impala/fe/src/test/resources/regionservers: No such file or directory The presence of this file now starts a single RegionServer and takes the place of RegionServer 1 in the "additional region servers" startup, a separate call. 2. The additional RegionServers are started but now we only start 2 from index 2. See https://hbase.apache.org/book.html#quickstart_pseudo There are still 3 total RegionServers using the same ports as before. We are simply configuring our settings as directed in the documentation. There were mentions in testdata/bin/run-hbase.sh of a "hbase race". One possible such bug is https://issues.apache.org/jira/browse/HBASE-5780 which has been fixed for a while. I've removed the check to wait for that Master, though I have not removed the Python script that does the waiting. We could remove that later after we let this patch bake. Also, https://issues.apache.org/jira/browse/HBASE-4467 has been marked "not a problem", so I've removed references to that. II. Implement HBase start retry If starting either HBase Master or additional RegionServers fails, kill all of HBase and try again. Do this for some number of attempts. In order to keep errexit ("set -e") happy, we expect the possibility of some of the startup attempts failing. We use control flow in those cases. In the last case, errexit can fail on our behalf. There is some code duplication here, but because Bash can't give us a stack trace on failure, and only a line number, I chose not to use functions to handle reuse. We don't really have functions anywhere else at the moment, either. 
Testing: It's pretty difficult to try to trigger a real "HBase fails to start" situation. I tested my changes by faking HBase failures, both when starting up the Master and first RegionServer, and also starting subsequent RegionServers. Multiple private builds have passed. Change-Id: Ib1d055a8a9098ce24e2f31b969501b6e090eab19 Reviewed-on: http://gerrit.cloudera.org:8080/2804 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:33 -07:00
Lars Volker	ee8c309187	Fix typo in load-test-warehouse-snapshot.sh Change-Id: I2ef9b32cbc56819f80db864a6590a9a7b2732c9c Reviewed-on: http://gerrit.cloudera.org:8080/2310 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Casey Ching	9d43aac6ce	IMPALA-3274: Always start Kudu for testing Previously Kudu would only be started when the test configuration was the standard mini-cluster. That led to failures during data loading when testing without the mini-cluster (ex: local file system). Kudu doesn't require any other services so now it'll be started for all test environments. Change-Id: I92643ca6ef1acdbf4d4cd2fa5faf9ac97a3f0865 Reviewed-on: http://gerrit.cloudera.org:8080/2690 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:02:35 -07:00
Sailesh Mukil	49a73cd598	IMPALA-3249: Failed to mkdirs on core-local-filesystem build. This failure happens on filesystems other than HDFS because as a part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the new directories that the patch tries to create in create-load-data. Change-Id: I8de74db93893c5273ccc9c687f608959628f5004 Reviewed-on: http://gerrit.cloudera.org:8080/2644 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-30 00:03:45 +00:00
Alex Behm	b2ccb17c21	Print last 50 lines of log if data loading fails. The 20 lines we dump currently are often not enough to diagnose a failure quickly. Increasing to 50 lines. Printing 50 lines is also consistent with our run-step script which also prints 50 lines. Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73 Reviewed-on: http://gerrit.cloudera.org:8080/2632 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 20:22:18 +00:00
Alex Behm	7e76e92bef	Consolidate test and cluster logs under a single directory. All logs, test results and SQL files generated during data loading and testing are now consolidated under a single new directory $IMPALA_HOME/logs. The goal is to simplify archiving in Jenkins runs and debugging. The new structure is as follows: $IMPALA_HOME/logs/cluster - logs of Hadoop components and Impala $IMPALA_HOME/logs/data_loading - logs and SQL files produced in data loading $IMPALA_HOME/logs/fe_tests - logs and test output of Frontend unit tests $IMPALA_HOME/logs/be_tests - logs and test output of Backend unit tests $IMPALA_HOME/logs/ee_tests - logs and test output of end-to-end tests $IMPALA_HOME/logs/custom_cluster_tests - logs and test output of custom cluster tests I tested this change with a full data load which was successful. Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa Reviewed-on: http://gerrit.cloudera.org:8080/2456 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 19:23:22 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
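
The parsing problem described in commit bc8c55afcd (IMPALA-3729) can be illustrated with a small sketch. This is not Impala's actual test-file parser: the function names naive_rows and parse_rows are hypothetical, and the section layout (a '---- RESULTS' header line closed by a '====' line) is assumed from the commit message. It only shows how splitting a stripped body maps both an empty section and a section holding one empty row to [''], and how keeping the lines between the two markers tells them apart.

```python
HEADER = "---- RESULTS"   # assumed section header, taken from the commit message
FOOTER = "===="           # assumed section terminator


def naive_rows(section_text):
    """Hypothetical parser that shows the conflation: strip, then split."""
    body = section_text[len(HEADER) + 1:-len(FOOTER)]
    # "".split("\n") yields [''], so an empty body looks like one empty row.
    return body.rstrip("\n").split("\n")


def parse_rows(section_text):
    """Hypothetical parser that keeps exactly the lines between the markers."""
    lines = section_text.split("\n")
    if lines[0] != HEADER or lines[-1] != FOOTER:
        raise ValueError("not a RESULTS section")
    return lines[1:-1]


empty_section = "---- RESULTS\n===="
single_empty_row = "---- RESULTS\n\n===="

# Both inputs collapse to the same value with the naive approach.
print(naive_rows(empty_section), naive_rows(single_empty_row))
# The line-based approach distinguishes zero rows from one empty row.
print(parse_rows(empty_section), parse_rows(single_empty_row))
```

With the line-based variant, a superset check over expected and actual rows is no longer skewed by a spurious '' row when the section is actually empty.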

320 Commits