impala

mirror of https://github.com/apache/impala.git synced 2026-01-03 15:00:52 -05:00

Author	SHA1	Message	Date
Tianyi Wang	c4d950b9e9	IMPALA-3887: Wait for HDFS replication in data loading When the data loading finishes, it is possible for some HDFS blocks to be under replicated. If impala gets the metadata before the replication is done, some tests may fail. This patch adds a replication waiting step in the data loading script. Resubmitted with filesystem type check. Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8 Reviewed-on: http://gerrit.cloudera.org:8080/8916 Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-09 03:24:36 +00:00
David Knupp	2fb11fb732	Revert "IMPALA-3887: Wait for HDFS replication in data loading" Using fsck breaks non-HDFS builds: local, S3, and Isilon. This reverts commit `5a7c10ec3d`. Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc Reviewed-on: http://gerrit.cloudera.org:8080/8880 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-19 23:26:29 +00:00
Tianyi Wang	5a7c10ec3d	IMPALA-3887: Wait for HDFS replication in data loading When the data loading finishes, it is possible for some HDFS blocks to be under replicated. If impala gets the metadata before the replication is done, some tests may fail. This patch adds a replication waiting step in the data loading script. Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322 Reviewed-on: http://gerrit.cloudera.org:8080/8846 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:53:56 +00:00
Philip Zeyliger	11dbb3952a	IMPALA-6070: Parallelize another bit of data load. The two Kudu loads and Hive UDFs can all run in parallel. This should shave about 4 minutes off of the data load. (Current timings are 3.5, 4, and 0.6 minutes, see below.) I've run dataload with this change many times. Loading Kudu functional (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu.log)... Loading workload 'functional-query' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 3 min 29 sec) Loading Kudu TPCH (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu-tpch.log)... Loading workload 'tpch' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 4 min 0 sec) Loading Hive UDFs (logging to /home/ubuntu/Impala/logs/data_loading/build-and-copy-hive-udfs.log)... Loading Hive UDFs OK (Took: 0 min 41 sec) Change-Id: I7e93ee5a77ec9271b980b88bef7ad512ecbe0407 Reviewed-on: http://gerrit.cloudera.org:8080/8822 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-14 02:28:40 +00:00
David Knupp	d1c9510001	Revert "IMPALA-6068: Fix dataload for complextypes_fileformat" This reverts commit `e4f585240a`. Among other things, that commit replaced hdfs command line calls with "LOAD DATA LOCAL INPATH" using Hive. However, doing so presumes that the minicluster is the only test environment. Sometimes though, the data load script is against a remote cluster, and those cases, the data load process is now broken. Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2 Reviewed-on: http://gerrit.cloudera.org:8080/8653 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Reviewed-by: Zach Amsden <zamsden@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-28 02:57:04 +00:00
Philip Zeyliger	76111ce168	IMPALA-6108, IMPALA-6070: Parallel data load (re-instated). This is a revert of a revert, re-enabling parallel data load. It avoid the race condition by explicitly configuring the temporary directory in question in load-data.py. When the parallel data load change went in, we discovered a race with a signature of: java.io.FileNotFoundException: File /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist The number in this path is milliseconds since the epoch, and the race occurs when two queries submitted to HiveServer2, running with the local runner, hit the same millisecond time stamp. The upstream bug is https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which is now marked as a dupe). I've tested this by running data load 5 times on the same machines where it failed before. I also ran data load manually and inspected the system to make sure that the temporary directories are getting created as expected in /tmp/impala-data-load-*. Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10 Reviewed-on: http://gerrit.cloudera.org:8080/8405 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-02 00:40:19 +00:00
Philip Zeyliger	e301ca6418	IMPALA-6108: Revert "IMPALA-6070: Parallel data load." We may be seeing a race with errors like "java.io.FileNotFoundException: File /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist". This reverts commit `e020c37106`. Change-Id: I46da93f4315a5a4bdaa96fa464cb51922bd6c419 Reviewed-on: http://gerrit.cloudera.org:8080/8386 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-10-26 02:07:50 +00:00
Joe McDonnell	e4f585240a	IMPALA-6068: Fix dataload for complextypes_fileformat Dataload typically follows a pattern of loading data into a text version of a table, and then using an insert overwrite from the text table to populate the table for other file formats. This insert is always done in Impala for Parquet and Kudu. Otherwise it runs in Hive. Since Impala doesn't support writing nested data, the population of complextypes_fileformat tries to hack the insert to run in Hive by including it in the ALTER part of the table definition. ALTER runs immediately after CREATE and always runs in Hive. The problem is that ALTER also runs before the base table (functional.complextypes_fileformat) is populated. The insert succeeds, but it is inserting zero rows. This code change introduces a way to force the Parquet load to run using Hive. This lets complextypes_fileformat specify that the insert should happen in Hive and fixes the ordering so that the table is populated correctly. This is also useful for loading custom Parquet files into Parquet tables. Hive supports the DATA LOAD LOCAL syntax, which can read a file from the local filesystem. This means that several locations that currently use the hdfs commandline can be modified to use this SQL. This change speeds up dataload by a few minutes, as it avoids the overhead of the hdfs commandline. Any other location that could use DATA LOAD LOCAL is also switched over to use it. This includes the testescape* tables which now print the appropriate DATA LOAD commands as a result of text_delims_table.py. Any location that already uses DATA LOAD LOCAL is also switched to indicate that it must run in Hive. Any location that was doing an HDFS command in the LOAD section is moved to the LOAD_DEPENDENT_HIVE section. Testing: Ran dataload and core tests. Also verified that functional_parquet.complextypes_fileformat has rows. Change-Id: I7152306b2907198204a6d8d282a0bad561129b82 Reviewed-on: http://gerrit.cloudera.org:8080/8350 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2017-10-25 03:43:26 +00:00
Philip Zeyliger	e020c37106	IMPALA-6070: Parallel data load. This commit loads functional-query, TPC-H data, and TPC-DS data in parallel. In parallel, these take about 37 minutes, dominated by functional-query. Serially, these take about 30 minutes more, namely the 13 minutes of tpcds and 16 minutes of tpcds. This works out nicely because CPU usage during data load is very low in aggregate. (We don't sustain more than 1 CPU of load, whereas build machines are likely to have many CPUs.) To do this, I added support to run-step.sh to have a notion of a backgroundable task, and support waiting for all tasks. I also increased the heapsize of our HiveServer2 server. When datasets were being loaded in parallel, we ran out of memory at 256MB of heap. The resulting log output is currently like so (but without the timestamps): 15:58:04 Started Loading functional-query data in background; pid 8105. 15:58:04 Started Loading TPC-H data in background; pid 8106. 15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)... 15:58:04 Started Loading TPC-DS data in background; pid 8107. 15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)... 15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)... 16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec) 16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec) 16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec) I tested dataloading with the following command on an 8-core, 32GB machine. I saw 19GB of available memory during my run: ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac Reviewed-on: http://gerrit.cloudera.org:8080/8320 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-10-25 00:00:25 +00:00
Zachary Amsden	439f245d34	IMPALA-5975: Work around broken beeline clients To make statements execute, some clients require always appending a semi-colon to the end. The workaround is quite simple. Change-Id: Id8b9f3dde4445513f1f389785a002c6cc6b3dada Reviewed-on: http://gerrit.cloudera.org:8080/8132 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2017-09-27 03:27:45 +00:00
Jakub Kukul	0992a6afda	IMPALA-2525: Treat parquet ENUMs as STRINGs when creating impala tables. Change-Id: Ia7a2e20c3ab83eb3fac422c3b33c117856fec475 Reviewed-on: http://gerrit.cloudera.org:8080/6550 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-07 02:51:54 +00:00
Jim Apple	374f1121da	IMPALA-3224: De-Cloudera non-docs JIRA URLs John Russell is planning to fix the URLS in docs in a separate commit. Fixed using: (git ls-files \| xargs replace \ 'https://issues.cloudera.org/browse/IMPALA' 'IMPALA' --) && \ git checkout HEAD docs Change-Id: I28ea06e89341de234f9005fdc72a2e43f0ab8182 Reviewed-on: http://gerrit.cloudera.org:8080/6487 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-05-07 04:44:57 +00:00
Martin Grund	ce4c5f6743	IMPALA-4365: Enabling end-to-end tests on a remote cluster This patch lays the groundwork for loading data and running end-to-end tests on a remote CDH cluster. The requirements for the cluster to run the tests are: - Managed by Cloudera Manager (CM) - GPL Extras need to be installed - KMS and KeyTrustee installed and available as a service - SERDEPROPERTIES in the Hive DB modified to accept wide tables - Hive warehouse dir points to /test-warehouse The actual data loading is done via a new script, remote_data_load.py, which takes the CM host as an argument. It can be run from a client machine that is not a node of the cluster, but it needs to have the Impala repo checked out and Impala built. This insures that all of the necessary data load scripts are available, as well as setting up the environment properly (client binaries like beeline and the hbase shell are available, python libraries like cm_api are installed, necessary environment variables are defined, etc.) It should be noted that running remote_data_load.py will overwrite any local XML config files with the configurations downloaded from the remote cluster. Usage: remote_data_load.py [options] <cm_host address> Options: -h, --help show this help message and exit --snapshot-file=SNAPSHOT_FILE Path to the test-warehouse archive --cm-user=CM_USER Cloudera Manager admin user --cm-pass=CM_PASS Cloudera Manager admin user password --gateway=GATEWAY Gateway host to upload the data from. If not set, uses the CM host as gateway. --ssh-user=SSH_USER System user on the remote machine with passwordless SSH configured. --no-load Do not try to load the snapshot --exploration-strategy=EXPLORATION_STRATEGY --test Run end-to-end tests against cluster Testing: This patch is being submitted with the understanding that there are still clean up issues that need to be addressed in the remote data load script, for which JIRA's have been filed. However, since many of the existing build scripts also had to be modified, it is more important to make sure that no regressions were inadvertently introduced into the existing data load process. Loading data to a local mini-cluster was checked repeatedly while this patch was being developed, as well as running it against the Jenkins job that provides the test-warehouse snapshot used by the many other Impala CI builds that run daily. Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9 Reviewed-on: http://gerrit.cloudera.org:8080/4769 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 10:16:55 +00:00
Henry Robinson	e0a3272129	Minor compute stats script fixes * Change run-step to output full log path * Change text to say "Computing table stats" rather than "Computing HBase stats" when running compute-table-stats.sh Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07 Reviewed-on: http://gerrit.cloudera.org:8080/4825 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 00:13:54 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt \| xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files_txt \| xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Dimitris Tsirogiannis	6fbd35fa87	Enable TPC-H workload for Kudu tables With this commit we enable loading of TPC-H data in Kudu tables and running the 22 TPC-H queries against Kudu. Since Kudu doesn't support the decimal data type, we had to modify the queries by using round() function and update the test results. Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289 Reviewed-on: http://gerrit.cloudera.org:8080/3789 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-07-28 04:35:11 +00:00
Michael Ho	ed5ec6772f	IMPALA-1619: Support 64-bit allocations. This change extends MemPool, FreePool and StringBuffer to support 64-bit allocations, fixes a bug in decompressor and extends various places in the code to support 64-bit allocation sizes. With this change, the text scanner can now decompress compressed files larger than 1GB. Note that the UDF interfaces FunctionContext::Allocate() and FunctionContext::Reallocate() still use 32-bit for the input argument to avoid breaking compatibility. In addition, the byte size of a tuple is still assumed to be within 32-bit. If it needs to be upgraded to 64-bit, it will be done in a separate change. A new test has been added to test the decompression of a 2GB snappy block compressed text file. Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f Reviewed-on: http://gerrit.cloudera.org:8080/3575 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-07-08 15:42:09 -07:00
Michael Brown	08e8de73b2	IMPALA-3806: remove a few modern shell idioms to improve RHEL5 support Both `find -executable` and the Bash "&>>" operator are too new to be supported on RHEL5. Both have reasonable workarounds, so prefer them. Note that this may not be the exhaustive list of such "modern" conventions, but RHEL5 isn't working end-to-end, so we can't identify all of them in a single commit yet. Testing: Before, the RHEL5 build would fail quite early here. Now, data load succeeds and most of the backend tests successfully run. Change-Id: I7438bed908d8026327923607238808122212d2d8 Reviewed-on: http://gerrit.cloudera.org:8080/3531 Reviewed-by: David Knupp <dknupp@cloudera.com> Tested-by: Internal Jenkins	2016-07-05 13:37:26 -07:00
Sailesh Mukil	73595d8f40	IMPALA-3737: Local filesystem build failed loading custom schemas When a LOCATION that does not have the scheme specified is used, the default FS is used as the filesystem scheme. The default FS is set as 'file:/tmp' for localFS runs, however the Hadoop library seems to ignore the '/tmp' part of the defaultFS for locations without schemes and just uses 'file:'. So the test warehouse is in: 'file:/tmp/test-warehouse' However, the scripts access '/test-warehouse' without the scheme which hadoop translates to: 'file:/test-warehouse' which does not exist. This change disables metadata loading on local filesystem if there is a schema change detected just as it is done in S3 and Isilon too. Change-Id: Ie404079aeb2f837ac8b03244b2019e2c8ee9f221 Reviewed-on: http://gerrit.cloudera.org:8080/3384 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Sailesh Mukil <sailesh@cloudera.com>	2016-06-16 17:34:34 -07:00
Tim Armstrong	547be27e77	IMPALA-3745: parquet invalid data handling Added checks/error handling: * Negative string lengths while decoding dictionary or data page. * Buffer overruns while decoding dictionary or data page. * Some metadata FILECHECKs were converted to statuses. Testing: Unit tests for: * decoding of strings with negative lengths * truncation of all parquet types * dictionary creation correctly handling error returns from Decode(). End-to-end tests for handling of negative string lengths in dictionary- and plain-encoded data in corrupt files, and for handling of buffer overruns for string data. The corrupted parquet files were generated by hacking Impala's parquet writer to write invalid lengths, and by hacking it to write plain-encoded data instead of dictionary-encoded data by default. Performance: set num_nodes=1; set num_scanner_threads=1; select * from biglineitem where l_orderkey = -1; I inspected MaterializeTupleTime. Before the average was 8.24s and after was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%). Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece Reviewed-on: http://gerrit.cloudera.org:8080/3387 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-06-15 21:33:39 -07:00
Harrison Sheinblatt	1058163f70	IMPALA-2276: Isilon and s3 builds must fail with stale snapshot If a stale snapshot is detected, the full data load proceeds even if the option to skip data load was set. A check is added to fail immediately if this happens for isilon or s3 because the full data load will not work on these filesystems currently. Change-Id: I98faaa4a66e5715bd86289a56d199599b9011f52 Reviewed-on: http://gerrit.cloudera.org:8080/2811 Reviewed-by: Harrison Sheinblatt <hs7@hotmail.com> Tested-by: Internal Jenkins	2016-05-12 14:17:37 -07:00
Sailesh Mukil	49a73cd598	IMPALA-3249: Failed to mkdirs on core-local-filesystem build. This failure happens on filesystems other than HDFS because as a part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the new directories that the patch tries to create in create-load-data. Change-Id: I8de74db93893c5273ccc9c687f608959628f5004 Reviewed-on: http://gerrit.cloudera.org:8080/2644 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-30 00:03:45 +00:00
Alex Behm	b2ccb17c21	Print last 50 lines of log if data loading fails. The 20 lines we dump currently are often not enough to diagnose a failure quickly. Increasing to 50 lines. Printing 50 lines is also consistent with our run-step script which also prints 50 lines. Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73 Reviewed-on: http://gerrit.cloudera.org:8080/2632 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 20:22:18 +00:00
Alex Behm	7e76e92bef	Consolidate test and cluster logs under a single directory. All logs, test results and SQL files generated during data loading and testing are now consolidated under a single new directory $IMPALA_HOME/logs. The goal is to simplify archiving in Jenkins runs and debugging. The new structure is as follows: $IMPALA_HOME/logs/cluster - logs of Hadoop components and Impala $IMPALA_HOME/logs/data_loading - logs and SQL files produced in data loading $IMPALA_HOME/logs/fe_tests - logs and test output of Frontend unit tests $IMPALA_HOME/logs/be_tests - logs and test output of Backend unit tests $IMPALA_HOME/logs/ee_tests - logs and test output of end-to-end tests $IMPALA_HOME/logs/custom_cluster_tests - logs and test output of custom cluster tests I tested this change with a full data load which was successful. Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa Reviewed-on: http://gerrit.cloudera.org:8080/2456 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 19:23:22 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
Casey Ching	432a76e4dd	Temporarily disable Kudu support Change-Id: I9aeb808a9898972788cb1d5d071619d8c64b514c Reviewed-on: http://gerrit.cloudera.org:8080/2551 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-16 00:15:34 +00:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Casey Ching	72d1889c08	IMPALA-2873: Fix nested TPC-H data loading In commit 960808 I forgot to update the data-loading script for the conversion of a shell script to a python script. It turns out there were a couple of other little problems too. I checked manually that the data was loaded after these changes. Change-Id: Id81fc423348515ab446835868025cb839c77f52c Reviewed-on: http://gerrit.cloudera.org:8080/1851 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-01-21 05:42:17 +00:00
Tim Armstrong	43de306d17	Log data loading and cluster setup to file Log output of data loading steps to files only print to stdout if there is an actual failure. The output of some steps is very noisy, and some steps even have output that looks like errors. This is implemented with a run-step helper function in bash that handles redirection and logging. Any bash command can be prefixed with run-step <step description> <log file name> to redirect the output to a log file. Sample output is: Starting Impala cluster (logging to start-impala-cluster.log)... OK Setting up HDFS environment (logging to setup-hdfs-env.log)... OK Skipped loading the metadata. Loading HBase data only (logging to load-hbase-only.log)... OK Loading Hive UDFs (logging to build-and-copy-hive-udfs.log)... OK Running custom post-load steps (logging to custom-post-load-steps.log)... OK Caching test tables (logging to cache-test-tables.log)... OK Loading external data sources (logging to load-ext-data-source.log)... OK Splitting HBase (logging to create-hbase.log)... OK Change-Id: I6396540858c408b084039a87efc81e1004626f39 Reviewed-on: http://gerrit.cloudera.org:8080/1760 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-01-20 04:38:19 +00:00
Tim Armstrong	f13dfcbddc	Suppress maven info logging Maven's INFO log level is very verbose and includes a lot of progress information that is minimally useful. Maven doesn't have an option to output only ERROR and WARNING log messages. As a workaround, use grep to filter out the majority of the output (only warnings, errors, tests, and success/failure). Also add a header with relevant info about the maven command: targets and working directory. Change-Id: I828b870edc2fc80a6460e6ed594d507c46e69c82 Reviewed-on: http://gerrit.cloudera.org:8080/1752 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-01-15 19:38:46 +00:00
Martin Grund	d51f20fa1f	Passing cluster startup flags This patch allows passing additional cluster startup flags. This is needed when building with optimizations in release mode as the default cluster startup would only pick up a debug build. Change-Id: Ib98d6814558f2d82bdeac0e3cce1fb7db048c459 Reviewed-on: http://gerrit.cloudera.org:8080/1775 Tested-by: Internal Jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2016-01-14 16:48:43 +00:00
Casey Ching	cfb1ab5c2c	IMPALA-2781: Fix shell error reporting after chdir The original error reporting relied on $0 being accessible from the current working dir, which failed if a script changed the working dir and $0 was relative. This updates the error reporting command to cd back to the original dir before accessing $0. Change-Id: I2185af66e35e29b41dbe1bb08de24200bacea8a1 Reviewed-on: http://gerrit.cloudera.org:8080/1666 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-01-14 07:10:54 +00:00
Casey Ching	e2bfb6ae2f	Misc improvements to shell scripts about error reporting Changes: 1) Consistently use "set -euo pipefail". 2) When an error happens, print the file and line. 3) Consolidated some of the kill scripts. 4) Added better error messages to the load data script. 5) Changed use of #!/bin/sh to bash. Change-Id: I14fef66c46c1b4461859382ba3fd0dee0fbcdce1 Reviewed-on: http://gerrit.cloudera.org:8080/1620 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-12-17 18:25:27 +00:00
Vlad Berindei	b6c20b2a40	Allow Impala to run against local filesystem. Allow Impala to start only with a running HMS (and no additional services like HDFS, HBase, Hive, YARN) and use the local file system. Skip all tests that need these services, use HDFS caching or assume that multiple impalads are running. To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has permissions since this is the location where the test data will be extracted. Test coverage (with core strategy) in comparison with HDFS and S3: HDFS 1348 tests passed S3 1157 tests passed Local Filesystem 1161 tests passed Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03 Reviewed-on: http://gerrit.cloudera.org:8080/1352 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins Readability: Alex Behm <alex.behm@cloudera.com>	2015-12-05 06:48:32 +00:00
Sailesh Mukil	277a92a14a	IMPALA-2479: Failure in TestParquet.test_verify_runtime_profile The test_verify_runtime_profile test failed during C5.5 builds and GVMs because this test relies on the table lineitem_multiblock to have 3 blocks. However, due to the rules to load the data not being followed in the functional_schema_template.sql file, the table ended up being stored with only one block. This change moves the data load to the end of create-load-data.sh file which would load the data even for snapshots. Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be Reviewed-on: http://gerrit.cloudera.org:8080/1153 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2015-10-08 15:16:35 -07:00
Taras Bobrovytsky	b8b7930377	Add nested types support to Create Table Like File Add support for creating a table based on a parquet file which contains arrays, structs and/or maps. Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae Reviewed-on: http://gerrit.cloudera.org:8080/582 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-22 01:46:26 +00:00
Taras Bobrovytsky	3c9ceb1a2b	Add Parquet nested schemas to testdata A script is added that generates two parquet files with nested data. One file has modern nested types encoding and the other one has legacy encoding. This data will be used for testing nested types support for "create table like file" statement. Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef Reviewed-on: http://gerrit.cloudera.org:8080/610 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-13 10:25:39 +00:00
Casey Ching	d202d6a967	Use "impala-python" (virtualenv) instead of system python Python tests and infra scripts will now use "python" from the virtualenv via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now that python 2.6 and a dependable set of third-party libraries are available but that is not done as part of this commit. Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f Reviewed-on: http://gerrit.cloudera.org:8080/603 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-06 02:09:09 +00:00
David Alves	b911c7e97e	Always create the kudu tables even if we skipped metadata load Patch c0c9fbdf57df667f63632437f612a63baf1534dd: "Load Kudu as part of the normal data loading workflow" passed the build when it was first introduced as it had introduced changes to the datasets directory which cause the metadata loading not to be skipped. However it failed all subsequent times as there were no further changes to the metadata directory. This patch makes data loading for Kudu run independently of whether metadata load is skipped or not since a new Kudu cluster is now created on each build. This patch also removes one last reference to 'functional_kudu.liketbl' in AnalyzeDDLTest since we don't create/load data for that table anymore. Change-Id: Ibe9acc7da17062ac317dff06a8c57dd87cf566d6 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/7110 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: David Alves <david.alves@cloudera.com>	2015-07-12 23:02:08 -07:00
ishaan	377214c469	Use Isilon as the default file system when running Isilon tests. This patch enables running Impala tests against Isilon as the default file system. The intention is to run tests against a realistic deployment, i.e, Isilon replacing HDFS as the underlying filesystem. Specifically, it does the following: - Adds a new environment variable DEFAULT_FS, which points to HDFS by default. - Makes the fs.defaultFs property in core-site.xml use the DEFAULT_FS environment variable, such that all clients talk to Isilon implicitly. - Unset FILESYSTEM_PREFIX when the TARGET_FILESYSTEM is Isilon, since path prefixes are no longer needed. - Only starts the Hive Metastore and the Impala service stack when running tests against Isilon. We don't start KMS/HBase because they're not relevant to Isilon. We also don't start YARN, Hive and LLama because hive queries are disabled with Isilon. The scripts that start/stop Hive, YARN and Llama should be modified to point to a filesystem other than HDFS in the future. Change-Id: Id66bfb160fe57f66a64a089b465b536c6c514b63 Reviewed-on: http://gerrit.cloudera.org:8080/449 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-06-11 01:23:11 +00:00
Casey Ching	060f08ef69	Add tpch_nested_parquet database The database will be used for testing in the future. Change-Id: I60b54b36db9493a5bea308151b4027cd47d73047 Reviewed-on: http://gerrit.cloudera.org:8080/400 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-06-04 21:18:36 +00:00
Martin Grund	55f6457a28	Kudu Table and Kudu Scan Node This patch adds a basic implementation for a Kudu table, scan node and supports simple DDL operations. Similar to "normal" HDFS tables, the DDL statements executed in the Hive metastore can be propagated to Kudu. Othewise, the Kudu table behaves similarily to a HBase table. The syntax to create a table stored in Kudu is: create table kudu (id int, name string, age int) tblproperties ( 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler', 'kudu.table_name' = 'kudu', 'kudu.master_addresses' = '0.0.0.0:7051', 'kudu.key_columns' = 'id,name'); The 'storage_handler' attribute is fixed and used to identify a table backed by Kudu. The storage handler attribute is required, to make sure that Hive will not create a directory on HDFS for this table. The 'table_name' and 'master_addresses' properties define the properties of the physical persistence in Kudu. The 'key_columns' defines a list of columns that should be used as a (composite) key. A Kudu table can be created either as a managed or un-managed (external) table. Creating an external table behaves similar to an external HDFS table, if the table does not exist in Kudu create it, if it exists use it and check schema compatibility. If an external table is deleted, only delete the Hive table. If a managed table is created, the Kudu table must not exist and if it is deleted the Kudu table is deleted as well. TODO: Allow creation of external table without specifying columns. Change-Id: I794abf6abe30ace4426c53f77676ae1dcb4341ec Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6358 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-06-01 16:51:53 -07:00
Martin Grund	e95349676b	Re-enabling data loading from snapshot. Change-Id: Ia8929a492cb97ed5a21bc4bd3d793e91a2076be3 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6752 Reviewed-by: Martin Grund <mgrund@cloudera.com> Tested-by: jenkins	2015-06-01 16:01:10 -07:00
Martin Grund	0edbe5004a	FIXME: Ignore schema changes in build for now Change-Id: Ib57272c8ec49323bf0506af92b9c6e8f4e3ba5a4 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6362 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-06-01 15:41:52 -07:00
Alex Behm	1bd3eca22f	Quietly resolve dependencies in Jenkins runs to avoid log spew. Change-Id: If38a683785f3c6c9d92f762a2dfd86f009ce9d84 Reviewed-on: http://gerrit.cloudera.org:8080/392 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-05-19 09:12:43 +00:00
Dan Hecht	2916132283	S3: enable more tests for S3 As needed, fix up file paths and other misc things to get more test cases running against S3. Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad Reviewed-on: http://gerrit.cloudera.org:8080/167 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-03-11 16:39:39 -07:00
ishaan	4a9adfd685	Fix the full data load build by not hardcoding the lzo index file's name. After the hive/hdfs rebase, the indexed lzo file names changed. This patch uses a wildcard rather than a specific file name to protect against such changes. It's safe because the test simply expects a partition that does not have index files. Change-Id: I6d32609b62df83fe2a8ef935d7ca6506ecff5e0d Reviewed-on: http://gerrit.cloudera.org:8080/150 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-03-05 09:52:34 +00:00
Matthew Jacobs	835d6dbef4	IMPALA-1209: Add KMS service to testdata cluster (pt1) First change for IMPALA-1209 to address Impala limitations when using HDFS encryption. This adds a KMS process to the testdata cluster. This was tested manually by creating a key and an encryption zone. Change-Id: I499154506386f04e71c5371b128c10868b1e1318 Reviewed-on: http://gerrit.cloudera.org:8080/41 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2015-02-13 20:46:14 +00:00
ishaan	2386fb84a8	Enable the data loading infrastructure to switch the underlying file system. This patch enables loading data to s3 instead of hdfs. It is preliminary in nature, as such, there are a few caveats: - The fe tests do not work. - Only loading from a test-warehouse snapshot and metastore snapshot is enabled. - Until hive works with s3, only a subset of all the tests will work. Change-Id: Ia66a5f836b4245e3b022a49de805eec337a51324 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5851 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2015-02-03 01:02:42 -08:00
ishaan	5ac46af786	Fix the full data load path by explicitly creating the test-warehouse directory in hdfs. Previously, when we started all the services, we created an HBase table from hive to avoid a replication bug. This had the side-effect of creating a test-warehouse directory in hdfs. After that check was removed, we no longer create the test-warehouse, causing the full-data-load build to fail. Change-Id: I75479562d33e08c79ad155c615cecb5b91c0eab6 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5904 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2015-02-03 00:51:49 -08:00

1 2 3

119 Commits