This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory settings and parallelism, and trying to speed
things up.
Bug fixes:
* Embarrassingly, I was still skipping thrift-server-test in the backend
tests. This was a mistake in handling feedback from my last review.
* I made the timeline a little bit taller so that it clips less.
Adding workloads:
* I added the RAT licensing check.
* I added exhaustive runs. This led me to model the suites a little
bit more in Python, with a class representing a suite and the
data associated with it. It's not perfect and still
coupled with the entrypoint.sh shell script, but it feels
workable. As part of adding exhaustive tests, I had
to re-work the timeout handling, since different
suites now meaningfully have different timeouts.
Speed ups:
* To speed up test runs, I added a mechanism to split py.test suites into
multiple shards with a py.test argument (a minimal sketch of such a hook
appears after this list). This involved a little bit of work in
conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.
Furthermore, I moved a bit more logic about managing the
list of suites into Python.
* Doing the full build with "-notests" and only building
the backend tests in the relevant target that needs them. This speeds
up "docker commit" significantly by removing about 20GB from the
container. I had to indicate that expr-codegen-test depends on
expr-codegen-test-ir, which was missing.
* I sped up copying the Kudu data: previously I did
both a move and a copy; now I'm doing a move followed by a move. One
of the moves is cross-filesystem and therefore slow, but this halves the
amount of copying.
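Roughly, the sharding hook in conftest.py might look like the sketch below;
the option names and the hashing scheme here are illustrative, not
necessarily what the change uses:

import zlib

def pytest_addoption(parser):
    parser.addoption("--shard", type=int, default=0,
                     help="Index of the shard to run (0-based).")
    parser.addoption("--num-shards", type=int, default=1,
                     help="Total number of shards to split the suite into.")

def pytest_collection_modifyitems(config, items):
    # Keep only the tests whose node id hashes into this shard.
    num_shards = config.getoption("--num-shards")
    if num_shards <= 1:
        return
    shard = config.getoption("--shard")
    items[:] = [item for item in items
                if zlib.crc32(item.nodeid.encode()) % num_shards == shard]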
Memory usage:
* I tweaked the memlimit_gb settings to have a higher default. I've been
tuning these empirically to get the tests running well on c4.8xlarge and
m4.10xlarge.
The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick; a rough sketch of it
appears after this list) and by observing peak container total_rss, I found
that several JVMs were running without Xmx settings. I added Xms/Xmx settings
in a few places:
* The non-first Impalad does very little JVM work, so having
an Xmx keeps it small, even in the parallel tests.
* Datanodes do work, but they essentially were never garbage
collecting, because JVM defaults let them use up to 1/4th
the machine memory. (I observed this based on RSS at the
end of the run; nothing fancier.) Adding Xms/Xmx settings
helped.
* Similarly, I piped the settings through to HBase.
A few daemons still run without resource limitations, but they don't
seem to be a problem.
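The "memory_usage" helper mentioned above is a shell function built on
ps/sort/awk; a rough Python equivalent, for illustration only, is:

import subprocess
from collections import defaultdict

def memory_usage(top_n=15):
    # Sum resident set size (in KB) per command name and print the largest.
    out = subprocess.check_output(["ps", "-eo", "rss=,comm="],
                                  universal_newlines=True)
    rss_by_command = defaultdict(int)
    for line in out.splitlines():
        rss_kb, _, command = line.strip().partition(" ")
        if rss_kb.isdigit():
            rss_by_command[command.strip()] += int(rss_kb)
    for command, rss_kb in sorted(rss_by_command.items(),
                                  key=lambda kv: kv[1], reverse=True)[:top_n]:
        print("%6d MB  %s" % (rss_kb // 1024, command))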
Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
HDFS commandline calls can be expensive due to JVM
startup and other costs. Since most HDFS commandline
calls can take multiple paths, one way to reduce
execution time is to consolidate multiple HDFS
commands into a single HDFS call. Since HDFS put
commands will follow symbolic links and can copy
recursively, this can allow for further consolidation
by creating the full directory structure and
copying it in a single HDFS call.
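In Python terms, the idea is roughly the following sketch (the paths and the
helper are made up; the real dataload code is shell):

import os
import subprocess
import tempfile

def upload_in_one_call(local_paths, hdfs_dest_dir):
    # Stage everything into one local tree via symlinks, then issue a single
    # put. The destination directory is assumed to already exist.
    staging = tempfile.mkdtemp(prefix="hdfs-staging-")
    for path in local_paths:
        os.symlink(os.path.abspath(path),
                   os.path.join(staging, os.path.basename(path)))
    sources = [os.path.join(staging, name) for name in os.listdir(staging)]
    # One JVM startup instead of one per path; put follows the symlinks and
    # copies directories recursively.
    subprocess.check_call(["hdfs", "dfs", "-put"] + sources + [hdfs_dest_dir])

# e.g. upload_in_one_call(["/tmp/hive-udfs.jar", "/tmp/avro_schemas"],
#                         "/test-warehouse/staging")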
This change applies several of these optimizations throughout
the dataload codepath. It saves a few seconds here
and there:
Loading Hive Builtins: 1:10 -> 0:30
Loading custom schemas: 0:35 -> 0:20
Loading Hive UDFs: 0:45 -> 0:25
Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Reviewed-on: http://gerrit.cloudera.org:8080/10120
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
testdata/bin/create-load-data.sh runs bin/load-data.py for
functional/exhaustive, tpch/core, and tpcds/core in a
first phase, then it loads functional and tpch for Kudu
in a second phase. For a full dataload, this second phase
is not necessary. functional/exhaustive and tpch/core
already include Kudu.
This avoids the second phase when doing a full dataload.
The second phase is still necessary when loading from
a snapshot, and this does not change that behavior.
This saves a couple of minutes off a full dataload.
Change-Id: Ic023d230f99126ed37795106c38faae5f0cb608e
Reviewed-on: http://gerrit.cloudera.org:8080/10128
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The patch writes the output of Sentry to
$IMPALA_CLUSTER_LOGS_DIR/sentry/sentry.out, following the
same convention as other service output logs.
Testing:
- Injected a failure into the run-sentry-service.sh script to verify that the
error message was captured
Change-Id: I76627bb5b986a548ec6e4f12b555bd6fc8c4dab8
Reviewed-on: http://gerrit.cloudera.org:8080/10064
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to run these separate Hive SQL
files in parallel. For correctness, the text
version of all tables must be loaded before any
of the other file formats.
load-data.py runs the DDLs to create the tables in Impala,
also in parallel. Previously, there were some minor
dependencies requiring the text tables to be created
prior to creating the other table formats. This
changes the definitions of some tables in
testdata/datasets/functional/functional_schema_template.sql
to remove those dependencies. Now, the DDLs for the
text tables can run in parallel with the other file formats.
To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.
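A minimal sketch of the fixed-size pool (the beeline invocation and port are
assumptions for illustration; the real load-data.py logic is more involved):

import multiprocessing
import subprocess

def run_hive_sql_file(sql_file):
    # Each execution logs to its own file; see below.
    with open(sql_file + ".log", "w") as log:
        subprocess.check_call(["beeline", "-u", "jdbc:hive2://localhost:11050",
                               "-f", sql_file], stdout=log, stderr=log)
    return sql_file

def load_in_parallel(text_sql_files, other_format_sql_files, num_processes=4):
    pool = multiprocessing.Pool(processes=num_processes)
    # The text tables must finish loading before any other file format starts,
    # since the other formats are populated from the text versions.
    pool.map(run_hive_sql_file, text_sql_files)
    pool.map(run_hive_sql_file, other_format_sql_files)
    pool.close()
    pool.join()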
This also modifies the locations that do invalidate to
use refresh where possible and eliminate global
invalidates.
For debuggability, different SQL executions output to
different log files rather than to standard out. If an
error occurs, this will point out the relevant log
file.
This saves about 10-15 minutes on dataload (including
for GVO).
Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc reader, tracks memory consumption of
the reader, and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing to ORC
tables is not yet supported.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust against corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Allows running the tests that make up the "core" suite in about 2 hours.
By comparison, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend
tends to run in about 3.5 hours.
This commit:
* Adds "echo" statements in a few places, to facilitate timing.
* Adds --skip-parallel/--skip-serial flags to run-tests.py,
and exposes them in run-all-tests.sh.
* Marks TestRuntimeFilters as a serial test. This test runs
queries that need > 1GB of memory, and, combined with
other tests running in parallel, can kill the parallel test
suite.
* Adds "test-with-docker.py", which runs a full build, data load,
and executes tests inside of Docker containers, generating
a timeline at the end. In short, one container is used
to do the build and data load, and then this container is
re-used to run various tests in parallel. All logs are
left on the host system.
Besides the obvious win of getting test results more quickly, this
commit serves as an example of how to get various bits of Impala
development working inside of Docker containers. For example, Kudu
relies on atomic rename of directories, which isn't available in most
Docker filesystems, and entrypoint.sh works around it.
In addition, the timeline generated by the build suggests where further
optimizations can be made. Most obviously, dataload eats up a precious
~30-50 minutes, on a largely idle machine.
This work is significantly CPU and memory hungry. It was developed on a
32-core, 120GB RAM Google Compute Engine machine. I've worked out
parallelism configurations such that it runs nicely on 60GB of RAM
(c4.8xlarge) and over 100GB (e.g., m4.10xlarge, which has 160GB). There is
some simple logic to guess values for the knobs, and the knobs can also be
set explicitly. By and
large, EC2 and GCE price machines linearly, so, if CPU usage can be kept
up, it's not wasteful to run on bigger machines.
Change-Id: I82052ef31979564968effef13a3c6af0d5c62767
Reviewed-on: http://gerrit.cloudera.org:8080/9085
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit slightly loosens the coupling between IMPALA_HIVE_VERSION
and "hive.version" in the Maven sense.
Cherry-picks: not for 2.x
Change-Id: Ifbe6f5208b4ad0ffc9cbfe4e93d712ce698beb23
Reviewed-on: http://gerrit.cloudera.org:8080/9925
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
When an impalad is in executor-only mode, it receives no
catalog updates. As a result, lib-cache entries are never
refreshed. A consequence is that UDF queries can return
incorrect results or may not run due to resolution issues.
Both cases are caused by the executor using a stale copy
of the lib file. For incorrect results, an old version of
the method may be used. Resolution issues can come up if
a method is added to a lib file.
The solution in this change is to capture the coordinator's
view of the lib file's last modified time when planning.
This last modified time is then shipped with the plan to
executors. Executors must then use both the lib file path
and the last modified time as a key for the lib-cache.
If the coordinator's last modified time is more recent than
the executor's lib-cache entry, then the entry is refreshed.
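Conceptually, the keying and refresh rule are as in the sketch below (the real
lib-cache is C++ inside impalad; this is only an illustration):

class LibCache(object):
    def __init__(self):
        self._entries = {}  # lib file path -> (last modified time, handle)

    def get(self, path, coord_last_modified_time, load_fn):
        entry = self._entries.get(path)
        if entry is None or entry[0] < coord_last_modified_time:
            # The coordinator saw a newer copy at plan time: refresh the entry.
            entry = (coord_last_modified_time, load_fn(path))
            self._entries[path] = entry
        return entry[1]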
Brief discussion of alternatives:
- lib-cache always checks last modified time
+ easy/local change to lib-cache
- adds an fs lookup always. rejected for this reason
- keep the last modified time in the catalog
- bound on staleness is too loose. consider the case where
fn's f1, f2, f3 are created with last modified times of
t1, t2, t3. treat the fn's last modified time as a low-watermark;
if the cache entry has a more recent time, use it. Such a scheme
would allow the version at t2 to persist. An old fn may keep the
state from converging to the latest. This could end up with strange
cases where different versions of the lib are used across executors
for a single query.
In contrast, the change in this patch relies on the statestore to
push versions forward at all coordinators, and so pushes all
versions at all caches forward as well.
Testing:
- added an e2e custom cluster test
Change-Id: Icf740ea8c6a47e671427d30b4d139cb8507b7ff6
Reviewed-on: http://gerrit.cloudera.org:8080/9697
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Adds support for building against two sets of Hadoop ecosystem
components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE,
which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for
Hadoop 3, Hive 2, and so on).
We intend (in a trivial follow-on change soon) to make 3 the new default
and to explicitly deprecate 2, but this change does not switch the
default yet. We support both to facilitate a smoother transition, but
support for profile 2 will be removed soon in the Impala 3.x line.
The switch is done at build time, following the pattern from IMPALA-5184
(build fe against both Hive 1 & 2 APIs). Switching back and forth
requires running 'cmake' again. Doing this at build-time avoids
complicating the Java code with classloader configuration.
There are relatively few incompatible APIs. This implementation
encapsulates that by extracting some Java code into
fe/src/compat-minicluster-profile-{2,3}. This follows the
pattern established by IMPALA-5184 (build fe against both Hive 1 & 2
APIs), but, to avoid a proliferation of directories, I've
consolidated the Hive changes into the same directory structure.
For Maven, I introduced Maven "profiles" to handle the two cases where
the dependencies (and exclusions) differ. These are driven by the
$IMPALA_MINICLUSTER_PROFILE environment variable.
For Sentry, exception class names changed. We work around this by adding
"isSentry...(Exception)" methods with two different implementations.
Sentry is also doing some odd shading, whereby some exceptions are
"sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism
to create a SentryAuthProvider is slightly different. The easiest way to
see the differences is to run:
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java
diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java
The Sentry work is based on a change by Zach Amsden.
In addition, we recently added an explicit "refresh" permission. In
Sentry 2, this required creating an ImpalaPrivilegeModel to capture
that. It's a slight customization of Hive's equivalent class.
For Parquet, the difference is even more mechanical. The package names
gone from "parquet" to "org.apache.parquet". The affected code
was extracted into ParquetHelper, but only one copy exists. The second
copy is generated at build-time using sed.
In the rare cases where we need to behave differently at runtime,
MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates
what version we were built against. One of the cases is the results
expected by various frontend tests. I avoided the issue by translating
one error string into another, which handled the divergence in one place,
rather than complicating the several locations which look for "No
FileSystem for scheme..." errors.
The HBase APIs we use for splitting regions at test time changed.
This patch includes a re-write of that code for the new APIs. This
piece was contributed by Zach Amsden.
To work with newer versions of dependencies, I updated the version of
httpcomponents.core we use to 4.4.9.
We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase
binaries to s3://native-toolchain, and amended the shell scripts to
launch the right things. There are minor mechanical differences. Some
of this was based on earlier work by Joe McDonnell and Zach Amsden.
Hive's logging changed in Hive 2, necessitating creating a
log4j2.properties template and using it appropriately. Furthermore,
Hadoop 3's new shell script re-writes do a certain amount of classpath
de-duplication, causing some issues with locating the relevant logging
configurations. Accommodations exist in the code to deal with that.
parquet-filtering.test was updated to turn off stats filtering. Older
Hive didn't write Parquet statistics, but newer Hive does. By turning
off stats filtering, we test what the test had intended to test.
For views-compatibility.test, it seems that Hive 2 has fixed certain
bugs that we were testing for in Hive. I've added a
HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that.
For AuthorizationTest, different Hive versions show slightly different
things for extended output.
To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing
to see here.
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%)
rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%)
rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%)
CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This
was done manually, but without changing anything except what Java required in
terms of accessibility and boilerplate.
rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%)
copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%)
Testing: Ran core & exhaustive tests with both profiles.
Cherry-picks: not for 2.x.
Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7
Reviewed-on: http://gerrit.cloudera.org:8080/9716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
In wait-hdfs-replication, the frequent, eager restarts might slow the
HDFS replication down. HDFS should be restarted only if no progress is
made in a certain amount of time, and we should wait longer before
failing the data loading.
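The restart-on-stall logic is roughly the following (a sketch; the real script
is shell, and the helpers and thresholds here are illustrative):

import time

def wait_for_replication(count_under_replicated_blocks, restart_hdfs,
                         no_progress_timeout_s=300, overall_timeout_s=1800):
    deadline = time.time() + overall_timeout_s
    best = count_under_replicated_blocks()  # e.g. parsed from "hdfs fsck /"
    last_progress = time.time()
    while best > 0:
        if time.time() > deadline:
            raise RuntimeError("HDFS replication did not finish in time")
        if time.time() - last_progress > no_progress_timeout_s:
            restart_hdfs()  # Only restart when replication has stalled.
            last_progress = time.time()
        time.sleep(10)
        current = count_under_replicated_blocks()
        if current < best:
            best = current
            last_progress = time.time()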
Testing: It's tested with a fake HDFS fsck script.
Change-Id: Ib059480254643dc032731b4b3c55204a93b61e77
Reviewed-on: http://gerrit.cloudera.org:8080/9698
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The retries in split-hbase.sh don't work in the common case,
because $MINIKDC_PRINC_HIVE is not set in non-kerberized (common)
environments. The regular data load scripts (create-load-data.sh)
have code to manage that, but split-hbase.sh blindly forges ahead,
leading to errors like:
/home/impdev/Impala/testdata/bin/split-hbase.sh: line 49: MINIKDC_PRINC_HIVE: unbound variable
Error in /home/impdev/Impala/testdata/bin/create-load-data.sh at line 48: LOAD_DATA_ARGS=""
Since this hasn't been working, I opted to remove it entirely, as a failure on
the line where HBase splitting actually failed would be significantly more
useful than the error here. A search of mailing lists suggested that I was at
least the second person to have run into this. (In my case, I did break HBase
splitting, but it took me a second to identify the error, since the log was
spammed with unrelated information relating to the cluster restart.)
Testing: core tests.
Change-Id: I715891c9e744f21002330c3ae3ebc14095d94ffd
Reviewed-on: http://gerrit.cloudera.org:8080/9588
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
HDFS sometimes fails to fully replicate all the blocks in 30 seconds
and no progress is made. This patch tries to restart HDFS several times
before aborting the data loading.
Change-Id: Iefd4c2fc6c287f054e385de52bdc42b0bdbd7915
Reviewed-on: http://gerrit.cloudera.org:8080/9469
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
When loading from an up-to-date snapshot, dataload will
load all of the metadata and load data into HDFS. Then,
it will skip load-data.py for functional/exhaustive,
tpch/core, and tpcds/core. It will invoke a special
round of load-data.py calls to populate Kudu tables,
and it always runs these with a force reload.
However, when loading from an old snapshot, dataload will
still load all of the metadata and load the data into
HDFS, but then it will still invoke load-data.py for
functional/exhaustive, tpch/core, and tpcds/core.
These invocations mostly do DDLs with very few load
statements. However, these invocations are a problem
for Kudu. The metadata of Impala tables referencing
Kudu entities has been imported along with all the other
metadata, but the Kudu entities have not been created, as
they are separate from HDFS. This means that Kudu tables
are not really valid in this circumstance.
Since Kudu has been added to the list of data formats
for tpch/core (see IMPALA-6475), load-data.py with
tpch/core will attempt to insert into these invalid
Kudu tables.
To avoid this, always force reload any Kudu tables.
generate-schema-statements.py will always generate a
drop table statement before any create of a Kudu table.
This guarantees that the create will also create the
corresponding Kudu entity.
Change-Id: I2d07f3513c543e2590f2f62b96b37472316868ee
Reviewed-on: http://gerrit.cloudera.org:8080/9445
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload currently executes bin/load-data.py for TPC-H,
TPC-DS, and functional-query concurrently. One of the final
steps for bin/load-data.py is to run a global "invalidate
metadata". Global "invalidate metadata" commands are known
to cause problems on concurrent systems. See IMPALA-5087.
For dataload, if TPC-H executes "invalidate metadata" while
TPC-DS is still creating tables and adding partitions,
the TPC-DS executor might erroneously believe that a table
does not exist.
This changes dataload to invalidate metadata at an
individual table level rather than globally. This
prevents the concurrency issue.
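The per-table invalidation amounts to something like the sketch below
(assuming an Impyla connection; the host, port, and helper are illustrative):

from impala.dbapi import connect

def invalidate_loaded_tables(db_name, table_names, host="localhost", port=21050):
    conn = connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        for table in table_names:
            # Scoped to one table, so a concurrently loading workload is not
            # affected the way a global "invalidate metadata" would affect it.
            cursor.execute("invalidate metadata %s.%s" % (db_name, table))
        cursor.close()
    finally:
        conn.close()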
This also changes the names of some of the intermediate
SQL files generated by generate-schema-statements.py
and consumed by load-data.py to make them less confusing.
Change-Id: Ibc3a6d8a674a0bf6b02069bfe8a5e12034335b1f
Reviewed-on: http://gerrit.cloudera.org:8080/9009
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under-replicated. If Impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Resubmitted with a filesystem type check.
Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8
Reviewed-on: http://gerrit.cloudera.org:8080/8916
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Using fsck breaks non-HDFS builds: local, S3, and Isilon.
This reverts commit 5a7c10ec3d.
Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc
Reviewed-on: http://gerrit.cloudera.org:8080/8880
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under-replicated. If Impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322
Reviewed-on: http://gerrit.cloudera.org:8080/8846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The two Kudu loads and the Hive UDF load can all run in parallel. This
should shave about 4 minutes off the data load. (Current
timings are 3.5, 4, and 0.6 minutes; see below.)
I've run dataload with this change many times.
Loading Kudu functional (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu.log)...
Loading workload 'functional-query' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 3 min 29 sec)
Loading Kudu TPCH (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu-tpch.log)...
Loading workload 'tpch' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 4 min 0 sec)
Loading Hive UDFs (logging to /home/ubuntu/Impala/logs/data_loading/build-and-copy-hive-udfs.log)...
Loading Hive UDFs OK (Took: 0 min 41 sec)
Change-Id: I7e93ee5a77ec9271b980b88bef7ad512ecbe0407
Reviewed-on: http://gerrit.cloudera.org:8080/8822
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
I re-created the original patch for IMPALA-6068, but only
performed what I believe to be the limited legal transformation
of data load: DEPENDENT_LOAD -> DEPENDENT_LOAD_HIVE.
Any place that directly uploads via hadoop or hdfs commands
was left alone, since changing it can't be proven to be correct.
Change-Id: I6c242cca209a7138b10ad517076707709b5cd204
Testing: Doing a full data load. I mistakenly changed a variable
name causing the first two dry-runs to fail.
Reviewed-on: http://gerrit.cloudera.org:8080/8690
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Zach Amsden <zamsden@cloudera.com>
This reverts commit e4f585240a.
Among other things, that commit replaced hdfs command line calls
with "LOAD DATA LOCAL INPATH" using Hive. However, doing so
presumes that the minicluster is the only test environment.
Sometimes, though, the data load script runs against a remote cluster,
and in those cases the data load process is now broken.
Change-Id: I6dc419934d2953eb950b14d090d7895ec57aa9f2
Reviewed-on: http://gerrit.cloudera.org:8080/8653
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
With this commit, $IMPALA_MAVEN_OPTIONS is used by bin/mvn-quiet.sh
to configure Maven slightly. The default is no extra options.
This is handy for giving Maven a settings file with the "-s" flag, to
control, for example, repositories and their mirrors. In fact, I
considered exposing IMPALA_MAVEN_SETTINGS_FILE explicitly, but decided
that the generic option would be as good.
It's useful to customize how Maven works, especially
to provide a settings file with repository mirrors.
Change-Id: I2c62185476fd2388c7cda8884276b79a77370127
Reviewed-on: http://gerrit.cloudera.org:8080/8496
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This is a revert of a revert, re-enabling parallel data load. It avoids
the race condition by explicitly configuring the temporary directory in
question in load-data.py.
When the parallel data load change went in, we discovered
a race with a signature of:
java.io.FileNotFoundException: File
/tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist
The number in this path is milliseconds since the epoch, and the race
occurs when two queries submitted to HiveServer2, running with the local
runner, hit the same millisecond time stamp. The upstream bug is
https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the
symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which
is now marked as a dupe).
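The fix amounts to giving each Hive invocation its own scratch directory,
along these lines (the Hadoop property name below is an assumption for
illustration, not necessarily the one the patch sets):

import tempfile

def hive_args_with_private_local_dir(sql_file):
    local_dir = tempfile.mkdtemp(prefix="impala-data-load-")
    # A per-invocation directory means two parallel jobs can no longer collide
    # on the same millisecond-timestamp path.
    return ["beeline",
            "--hiveconf", "mapreduce.cluster.local.dir=%s" % local_dir,
            "-f", sql_file]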
I've tested this by running data load 5 times on the same machines
where it failed before. I also ran data load manually and inspected
the system to make sure that the temporary directories are getting
created as expected in /tmp/impala-data-load-*.
Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10
Reviewed-on: http://gerrit.cloudera.org:8080/8405
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
We may be seeing a race with errors like "java.io.FileNotFoundException:
File /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist".
This reverts commit e020c37106.
Change-Id: I46da93f4315a5a4bdaa96fa464cb51922bd6c419
Reviewed-on: http://gerrit.cloudera.org:8080/8386
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.
Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.
This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.
This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the LOAD DATA LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.
Any other location that could use LOAD DATA LOCAL is
also switched over to use it. This includes the
testescape* tables, which now print the appropriate
LOAD DATA commands as a result of text_delims_table.py.
Any location that already uses LOAD DATA LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the DEPENDENT_LOAD_HIVE section.
Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.
Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)
To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.
I also increased the heapsize of our HiveServer2. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.
The resulting log output is currently like so (but without the
timestamps):
15:58:04 Started Loading functional-query data in background; pid 8105.
15:58:04 Started Loading TPC-H data in background; pid 8106.
15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04 Started Loading TPC-DS data in background; pid 8107.
15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)
I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format
Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.
This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:
old, cold: 7m44s
new, cold: 1m42s
old, warm: 1m23s
new, warm: 48s
The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.
Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Reviewed-on: http://gerrit.cloudera.org:8080/8354
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
Some clients require a trailing semicolon for statements to
execute. The workaround is quite simple.
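Roughly (a hypothetical helper, not the exact code in the patch):

def ensure_trailing_semicolon(statement):
    statement = statement.rstrip()
    return statement if statement.endswith(";") else statement + ";"

assert ensure_trailing_semicolon("SELECT 1") == "SELECT 1;"
assert ensure_trailing_semicolon("SELECT 1; ") == "SELECT 1;"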
Change-Id: Id8b9f3dde4445513f1f389785a002c6cc6b3dada
Reviewed-on: http://gerrit.cloudera.org:8080/8132
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
Partition pruning has two mechanisms:
1) Simple predicates (e.g. binary predicates of the form
<SlotRef> <op> <LiteralExpr>) can be used to derive lists
of matching partition ids directly from the
partition key values. This is handled directly in the FE
and is very efficient for supported simple predicates.
2) General expr evaluation of predicates using the BE (via
FeSupport). This works for all predicates, so is the
mechanism used for predicates not supported by (1).
The issue was that (1) was being used when a binary
predicate contained an implicit cast on the SlotRef. While
this is OK when being evaluated by the BE, the simple
mechanism in (1) would not be able to match the partition
key values with the predicate literal because the partition
key values cannot be cast in the FE.
The fix is to force binary predicates involving a cast to be
evaluated in the BE.
Testing: A planner test was added to demonstrate the
expected partition pruning occurs.
Some modifications were made to the functional schema table
stringpartitionkey, so it will be necessary to reload those
tables:
load-data.py -w functional-query --table_names=stringpartitionkey
Change-Id: I94f597a6589f5e34d2b74abcd29be77c4161cd99
Reviewed-on: http://gerrit.cloudera.org:8080/7521
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
The TableFlattener takes a nested dataset and creates an equivalent
unnested dataset. The unnested dataset is saved as Parquet.
When an array or map is encountered in the original table, the flattener
creates a new table and adds an id column to it which references the row
in the parent table. Joining on the id column should produce the
original dataset.
The flattened dataset should be loaded into Postgres in order to run the
query generator (in nested types mode) on it. There is a script that
automates generating, flattening, and loading random data into Postgres
and Impala:
testdata/bin/generate-load-nested.sh -f
Testing:
- ran ./testdata/bin/generate-load-nested.sh -f and random nested data
was generated and flattened as expected.
Change-Id: I7e7a8e53ada9274759a3e2128b97bec292c129c6
Reviewed-on: http://gerrit.cloudera.org:8080/5787
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Occasionally we'd see HBase fail to start up properly on CentOS 7
clusters. The symptom was that HBase would not open the required nodes
in ZooKeeper that signal its readiness.
As a workaround, this change folds waiting for the ZooKeeper nodes
into the retry logic.
Change-Id: Id8dbdff4ad02cac1322e7d580e0a6971daf6ea28
Reviewed-on: http://gerrit.cloudera.org:8080/7159
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: anujphadke <aphadke@cloudera.com>
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
This is a migration from an old and broken script from another
repository. Example use:
bin/single_node_perf_run.py --ninja --workloads targeted-perf \
--load --scale 4 --iterations 20 --num_impalads 3 \
--start_minicluster --query_names PERF_AGG-Q3 \
$(git rev-parse HEAD~1) $(git rev-parse HEAD)
The script can load data, run benchmarks, and compare the statistics
of those runs for significant differences in performance. It glues
together buildall.sh, bin/load-data.py, bin/run-workload.py, and
tests/benchmark/report_benchmark_results.py.
Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9
Reviewed-on: http://gerrit.cloudera.org:8080/6818
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This change loads the missing tables in TPC-DS. In addition,
it fixes up the loading of the partitioned table store_sales
so all partitions will be loaded. The existing TPC-DS queries are
also updated to use the parameters for qualification runs as noted
in the TPC-DS specification. Some hard-coded partition filters were
also removed. They were there due to the lack of dynamic partitioning
in the past. Some missing TPC-DS queries are also added to this change,
including query28 which discovered the infamous IMPALA-5251.
Having all tables in TPC-DS available paves the way for us to include
all supported TPCDS queries in our functional testing. Due to the change
in the data, planner tests and the E2E tests have different results than
before. The results of E2E tests were compared against the run done with
Netezza and Vertica. The divergences were all due to the truncation behavior
of decimal types in DECIMAL_V1.
Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Reviewed-on: http://gerrit.cloudera.org:8080/6877
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes IMPALA-4873 by adding the capability to supply a dict
'test_file_vars' to run_test_case(). Keys in this dict will be replaced
with their values inside test queries before they are executed.
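A minimal sketch of the substitution (the real run_test_case() has more
context around it):

def substitute_test_file_vars(query_text, test_file_vars=None):
    if test_file_vars:
        for key, value in test_file_vars.items():
            query_text = query_text.replace(key, value)
    return query_text

# e.g. substitute_test_file_vars("select * from $UNIQUE_DB.alltypes",
#                                {"$UNIQUE_DB": "test_db_1234"})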
Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70
Reviewed-on: http://gerrit.cloudera.org:8080/6817
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds support for running the stress test
(concurrent_select.py) and loading nested data (load_nested.py) into a
Kerberized, SSL-enabled Impala cluster. It assumes the calling user
already has a valid Kerberos ticket. One way to do that is:
1. Get access to a keytab and krb5.config
2. Set KRB5_CONFIG and KRB5CCNAME appropriately
3. Run kinit(1)
4. Run load_nested.py and/or concurrent_select.py within this
environment.
Because our Python clients already support Kerberos and SSL, we simply
need to make sure to use the correct options when calling the entry
points and initializing the clients:
Impala: Impyla
Hive: Impyla
HDFS: hdfs.ext.kerberos.KerberosClient
With this patch, I was able to manually do a short concurrent_select.py
run against a secure cluster without connection or auth errors, and I
was able to do the same with load_nested.py for a cluster that already
had TPC-H loaded.
Follow-ons for future cleanup work:
IMPALA-5263: support CA bundles when running stress test against SSL'd
Impala
IMPALA-5264: fix InsecurePlatformWarning under stress test with SSL
Change-Id: I0daad57bb8ceeb5071b75125f11c1997ed7e0179
Reviewed-on: http://gerrit.cloudera.org:8080/6763
Reviewed-by: Matthew Mulder <mmulder@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The Kudu query tests were failing on a remote cluster because the Kudu
master was always set to '127.0.0.1', with no way to override it.
This patch corrects the issue with a number of changes:
- Add a pytest command line option to specify an arbitrary Kudu master
(a minimal sketch of this option follows the list below)
- Consolidate the place where the default Kudu master is derived. It
had been stored both in the env and in tests/common/__init__.py,
with different files looking to different places. For now, just look
to the env, and remove the value from __init__.py.
- The kudu_client test fixture in conftest.py was using the connect()
method from impala.dbapi (part of the Impyla library), without
specifying the host param. In the absence of that, the default value
is 'localhost', so add the host param to the connect() call.
- Define the various defaults for pytest config as constants at the top
of conftest.py.
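A minimal sketch of the conftest.py option (the option and environment
variable names here are illustrative):

import os

DEFAULT_KUDU_MASTER_HOST = os.environ.get("KUDU_MASTER", "localhost")

def pytest_addoption(parser):
    parser.addoption("--kudu_master", default=DEFAULT_KUDU_MASTER_HOST,
                     help="Host of the Kudu master to run Kudu tests against.")

# Fixtures that previously relied on connect()'s 'localhost' default then pass
# the host explicitly, e.g. connect(host=request.config.option.kudu_master).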
Change-Id: I9df71480a165f4ce21ae3edab6ce7227fbf76f77
Reviewed-on: http://gerrit.cloudera.org:8080/5877
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This is the second patch to address IMPALA-4684. The first patch exposed
a transient Zookeeper connection error on RHEL7. This patch introduces a
retry (up to 3 times), and somewhat better logging.
Tested by running tests against an RHEL7 instance and confirming that
all HBase nodes start up.
Change-Id: I44b4eec342addcfe489f94c332bbe14225c9968c
Reviewed-on: http://gerrit.cloudera.org:8080/5554
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
If an exception (other than NoNodeError) was raised while checking for
HBase nodes, we weren't cleanly stopping the ZooKeeper client, which
in turn created a second exception when the connection was closed.
The second exception masked the original error condition.
Tested by forcibly raising unexpected errors while checking for HBase
nodes.
Change-Id: I46a74d018f9169385a9f10a85718044c31a24dbc
Reviewed-on: http://gerrit.cloudera.org:8080/5547
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Testing:
Tested that buildall.sh works as expected. Built locally with
IMPALA_MAKE_FLAGS unset to confirm I didn't break anything.
Built locally with
IMPALA_MAKE_FLAGS=--load-average=$IMPALA_BUILD_THREADS and looked
at "ps auxf" output to confirm it's passed through.
Change-Id: I17b13cbaf395f962762d5cff3d650ffb077934a4
Reviewed-on: http://gerrit.cloudera.org:8080/5480
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This commit also removes the `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords that were going to be newly released in Impala 2.6,
but are now unused. Additionally, a few remaining uses of the
`DISTRIBUTE BY` syntax have been switched to `PARTITION BY`.
Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
run-step prints a message to tell the reader what it's doing. However,
that message wasn't flushed, because run-step held the line open so it could
print OK or FAILED on the same line. The result was that long-running steps
wouldn't print anything to the log until they were done, at least in Jenkins
contexts.
This patch changes it so that the message is flushed, and then the
result is printed on a separate line (including the time it took to run
the step).
$ run-step "Hello world!" helloworld.out sleep 5
Hello world! (logging to /tmp/helloworld.out)...
OK (Took: 0 min 5 sec)
Change-Id: Iaced729f0ef6aa93174cd90b1516d3c34fe41a22
Reviewed-on: http://gerrit.cloudera.org:8080/5116
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
On Ubuntu 14.04 on AWS EC2 m4.4x instances, these components
frequently take more than 30 seconds to start. I have seen the HMS
take more than 90 seconds; this patch sets a more conservative timeout
default.
Change-Id: I43eb8646cca495578c8f9730faa04812957d2917
Reviewed-on: http://gerrit.cloudera.org:8080/5068
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes a sed expression to make sure it only alters the code
it is meant to alter, not the comment describing the code.
Tested with tests/run-tests.py query_test/test_udfs.py
Change-Id: I51a0498d24b7fccc05b6183123501766cb36f85e
Reviewed-on: http://gerrit.cloudera.org:8080/5008
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:
- Managed by Cloudera Manager (CM)
- GPL Extras need to be installed
- KMS and KeyTrustee installed and available as a service
- SERDEPROPERTIES in the Hive DB modified to accept wide tables
- Hive warehouse dir points to /test-warehouse
The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available, as well as setting up the
environment properly (client binaries like beeline and the hbase shell
are available, python libraries like cm_api are installed, necessary
environment variables are defined, etc.)
It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.
Usage: remote_data_load.py [options] <cm_host address>
Options:
-h, --help show this help message and exit
--snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
--cm-user=CM_USER Cloudera Manager admin user
--cm-pass=CM_PASS Cloudera Manager admin user password
--gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
--ssh-user=SSH_USER System user on the remote machine with
passwordless SSH configured.
--no-load Do not try to load the snapshot
--exploration-strategy=EXPLORATION_STRATEGY
--test Run end-to-end tests against cluster
Testing:
This patch is being submitted with the understanding that there are
still cleanup issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.
However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as well as running it against the Jenkins job
that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.
Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
* Change run-step to output full log path
* Change text to say "Computing table stats" rather than "Computing
HBase stats" when running compute-table-stats.sh
Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07
Reviewed-on: http://gerrit.cloudera.org:8080/4825
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins