impala

mirror of https://github.com/apache/impala.git synced 2026-02-02 06:00:36 -05:00

Author	SHA1	Message	Date
Joe McDonnell	09a297a270	IMPALA-10503: Use larger yarn containers for dataload This increases the 'yarn.app.mapreduce.am.resource.mb' parameter to 2GB in yarn-site.xml. This reduces the frequency of dataload hitting the container size limit on the docker-based tests and seems likely to address other problems related to the container size. Testing: - Ran docker-based tests - Ran GVO successfully - Ran a debug core job on a different machine configuration Change-Id: I06567ffc44fa378be7c8cf4008f138b47b68d931 Reviewed-on: http://gerrit.cloudera.org:8080/17201 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>	2022-05-22 19:58:15 +00:00
Zoltan Borok-Nagy	7f1ce039be	IMPALA-11154: Idle Kudu daemons consume too much CPU Due to KUDU-1973 kudu-tservers produce high CPU consumption (see also KUDU-3134) when there is a high number of table replicas. This means that in the Impala dev environment the CPU consumption can be around 15-20% per kudu-tserver (there are 3 kudu-tservers) when all the Kudu tables are loaded. Setting the value to 3 seconds lowers CPU usage to ~5% per kudu-terver. Testing: * ran exhaustive tests Change-Id: Ieb4de56540f5a7dc860bf6e27d9a5c0e4f4b3d26 Reviewed-on: http://gerrit.cloudera.org:8080/18290 Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-03-08 18:35:04 +00:00
Joe McDonnell	7b490eed5b	IMPALA-10951 (preparation): Update Kudu to a more recent version As part of moving to a newer protobuf, this updates the Kudu version to get the fix for KUDU-3334. With this newer Kudu version, Clang builds hit an error while linking: lib/libLLVMCodeGen.a(TargetPassConfig.cpp.o):TargetPassConfig.cpp: function llvm::TargetPassConfig::createRegAllocPass(bool): error: relocation refers to global symbol "std::call_once<void (&)()>(std::once_flag&, void (&)())::{lambda()#2}::_FUN()", which is defined in a discarded section section group signature: "_ZZSt9call_onceIRFvvEJEEvRSt9once_flagOT_DpOT0_ENKUlvE0_clEv" prevailing definition is from ../../build/debug/security/libsecurity.a(openssl_util.cc.o) (This is from a newer binutils that will be pursued separately.) As a hack to get around this error, this adds the calloncehack shared library. The shared library publicly defines the symbol that was coming from kudu_client. By linking it ahead of kudu_client, the linker uses that rather than the one from kudu_client. This fixes the Clang builds. The new Kudu also requires a minor change to the flags for tserver startup. Testing: - Ran debug tests and verified calloncehack is not used - Ran ASAN tests Change-Id: Ieccbe284f11445e1de792352ebc7c9e1fa2ca0c3 Reviewed-on: http://gerrit.cloudera.org:8080/18129 Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-01-07 01:44:58 +00:00
Fucun Chu	157086cb80	IMPALA-10771: Add Tencent COS support This patch adds support for COS(Cloud Object Storage). Using the hadoop-cos, the implementation is similar to other remote FileSystems. New flags for COS: - num_cos_io_threads: Number of COS I/O threads. Defaults to be 16. Follow-up: - Support for caching COS file handles will be addressed in IMPALA-10772. - test_concurrent_inserts and test_failing_inserts in test_acid_stress.py are skipped due to slow file listing on COS (IMPALA-10773). Tests: - Upload hdfs test data to a COS bucket. Modify all locations in HMS DB to point to the COS bucket. Remove some hdfs caching params. Run CORE tests. Change-Id: Idce135a7591d1b4c74425e365525be3086a39821 Reviewed-on: http://gerrit.cloudera.org:8080/17503 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-12-08 16:32:02 +00:00
wzhou-code	fcaea30b15	IMPALA-10557: Support Kudu's multi-row transaction Kudu added multi-row transaction so Impala could run query that inserts multiple rows into Kudu's table in the context of a single transaction. Kudu provides new Java/C++ client APIs to open/commit/rollback transaction, create session with transaction, serialize/deserialize metadata of transaction object. Kudu transaction object has built-in heartbeater. This patch added Impala support to use Kudu's multiple-row transaction. - Added a new query option to enable Kudu's transaction. - When the query option is set, a new Kudu transaction should be started for "insert", "CTAS" and "UPDATE/UPSERT/DELETE" statements by Impala's frontend of coordinator. - The Kudu transaction objects are kept in KuduTransactionManager until the transactions are going to be aborted or committed. - Frontend serialize the transaction metadata into a transaction token and pass to executors. - Executors deserialize the transaction token and ingest via that transaction handle. For Kudu session in the context of a transaction, return the first error if there are any pending errors for the Kudu session so that the Kudu transaction will be aborted. Since Kudu does not support transaction for "UPDATE/UPSERT/DELETE" statements now, Kudu returns error which causes transaction to be aborted. - Coordinator commits the transaction if everything goes well. Otherwise, aborts the transaction. Also changed code to store KuduClient as shared pointer since KuduClient has to be passed as a shared pointer when KuduTransaction::Deserialize() is called. Testing: - Added new e-to-e tests for Kudu transaction. - Passed core test. Change-Id: I876ada48991afdff5d61b5d6a0417571aba7cb34 Reviewed-on: http://gerrit.cloudera.org:8080/17553 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-06-24 20:34:45 +00:00
stiga-huang	2dfc68d852	IMPALA-7712: Support Google Cloud Storage This patch adds support for GCS(Google Cloud Storage). Using the gcs-connector, the implementation is similar to other remote FileSystems. New flags for GCS: - num_gcs_io_threads: Number of GCS I/O threads. Defaults to be 16. Follow-up: - Support for spilling to GCS will be addressed in IMPALA-10561. - Support for caching GCS file handles will be addressed in IMPALA-10568. - test_concurrent_inserts and test_failing_inserts in test_acid_stress.py are skipped due to slow file listing on GCS (IMPALA-10562). - Some tests are skipped due to issues introduced by /etc/hosts setting on GCE instances (IMPALA-10563). Tests: - Compile and create hdfs test data on a GCE instance. Upload test data to a GCS bucket. Modify all locations in HMS DB to point to the GCS bucket. Remove some hdfs caching params. Run CORE tests. - Compile and load snapshot data to a GCS bucket. Run CORE tests. Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b Reviewed-on: http://gerrit.cloudera.org:8080/17121 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-13 11:20:08 +00:00
fifteencai	a0a25a61c3	IMPALA-10193: Limit the memory usage for the whole test cluster This patch introduces a new approach of limiting the memory usage for both mini-cluster and CDH cluster. Without this limit, clusters are prone to getting killed when running in docker containers with a lower mem limit than host's memory size. i.e. The mini-cluster may running in a container with 32GB limitted by CGROUPS, while the host machine has 128GB. Under this circumstance, if the container is started with '-privileged' command argument, both mini and CDH clusters compute their mem_limit according to 128GB rather than 32GB. They will be killed when attempting to apply for extra resource. Currently, the mem-limit estimating algorithms for Impalad and Node Manager are different: for Impalad: mem_limit = 0.7 * sys_mem / cluster_size (default is 3) for Node Manager: 1. Leave aside 24GB, then fit the left into threasholds below. 2. The bare limit is 4GB and maximum limit 48GB In headge of over-consumption, we - Added a new environment variable IMPALA_CLUSTER_MAX_MEM_GB - Modified the algorithm in 'bin/start-impala-cluster.py', making it taking IMPALA_CLUSTER_MAX_MEM_GB rather than sys_mem into account. - Modified the logic in 'testdata/cluster/node_templates/common/etc/hadoop/conf/yarn-site.xml.py' Similarly, making IMPALA_CLUSTER_MAX_MEM_GB substitutes for sys_mem . Testing: this patch worked in a 32GB docker container running on a 128GB host machine. All 1188 unit tests get passed. Change-Id: I8537fd748e279d5a0e689872aeb4dbfd0c84dc93 Reviewed-on: http://gerrit.cloudera.org:8080/16522 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-10-01 08:38:18 +00:00
Joe McDonnell	f15a311065	IMPALA-9709: Remove Impala-lzo from the development environment This removes Impala-lzo from the Impala development environment. Impala-lzo is not built as part of the Impala build. The LZO plugin is no longer loaded. LZO tables are not loaded during dataload, and LZO is no longer tested. This removes some obsolete scan APIs that were only used by Impala-lzo. With this commit, Impala-lzo would require code changes to build against Impala. The plugin infrastructure is not removed, and this leaves some LZO support code in place. If someone were to decide to revive Impala-lzo, they would still be able to load it as a plugin and get the same functionality as before. This plugin support may be removed later. Testing: - Dryrun of GVO - Modified TestPartitionMetadataUncompressedTextOnly's test_unsupported_text_compression() to add LZO case Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e Reviewed-on: http://gerrit.cloudera.org:8080/15814 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2020-06-15 23:42:12 +00:00
Joe McDonnell	f241fd08ac	IMPALA-9731: Remove USE_CDP_HIVE=false and Hive 2 support Impala 4 moved to using CDP versions for components, which involves adopting Hive 3. This removes the old code supporting CDH components and Hive 2. Specifically, it does the following: 1. Remove USE_CDP_HIVE and default to the values from USE_CDP_HIVE=true. USE_CDP_HIVE now has no effect on the Impala environment. This also means that bin/jenkins/build-all-flag-combinations.sh no longer include USE_CDP_HIVE=false as a configuration. 2. Remove USE_CDH_KUDU and default to getting Impala from the native toolchain. 3. Ban IMPALA_HIVE_MAJOR_VERSION<3 and remove related code, including the IMPALA_HIVE_MAJOR_VERSION=2 maven profile in fe/pom.xml. There is a fair amount of code that still references the Hive major version. Upstream Hive is now working on Hive 4, so there is a high likelihood that we'll need some code to deal with that transition. This leaves some code (such as maven profiles) and test logic in place. Change-Id: Id85e849beaf4e19dda4092874185462abd2ec608 Reviewed-on: http://gerrit.cloudera.org:8080/15869 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-05-07 22:14:39 +00:00
Joe McDonnell	4386c1b44a	IMPALA-9677: Fix frontend tests using a non-existent S3 bucket With HADOOP-16711, Hadoop added extra validation during the initialization of S3AFileSystem that verified that the caller had permissions on the S3 bucket specified. Some frontend tests use non-existent S3 buckets in URIs to check analysis behavior. These started to fail with the new validation. This changes the core-site.xml configuration to disable the new validation by setting fs.s3a.bucket.probe=1. This is equivalent to the old behavior, and it can now run the frontend tests without AWS credentials. Testing: - Hand tested the failing tests (AnalyzeDDLTest, ExplainTest, PlannerTest) - Ran core job on USE_CDP_HIVE=true and USE_CDP_HIVE=false Change-Id: Id61ffbf686f8b7827e7fbf13167cfc1dfc06a325 Reviewed-on: http://gerrit.cloudera.org:8080/15799 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Anurag Mantripragada <anurag@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2020-04-24 16:44:33 +00:00
Grant Henke	208d9d6896	IMPALA-9577: [test] Use `system_unsync` time for Kudu test clusters Recently Kudu made enhancements to time source configuration and adjusted the time source for local clusters/tests to `system_unsync`. This patch mirrors that behavior in Impala test clusters given there is no need to require NTP-synchronized clock for a test where all the participating Kudu masters and tablet servers are run at the same node using the same local wallclock. See the Kudu commit here for details: `eb2b70d4b9` While making this change, I removed all ntp related packages and special handling as they should not be needed in a development environment any more. I also added curl and gawk which were missing in my Docker ubuntu environment and broke my testing. Testing: I tested with the steps below using Docker for Mac: docker rm impala-dev docker volume rm impala docker run --privileged --interactive --tty --name impala-dev -v impala:/home -p 25000:25000 -p 25010:25010 -p 25020:25020 ubuntu:16.04 /bin/bash apt-get update apt-get install sudo adduser --disabled-password --gecos '' impdev echo 'impdev ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers su - impdev cd ~ sudo apt-get --yes install git git clone https://git-wip-us.apache.org/repos/asf/impala.git ~/Impala cd ~/Impala export IMPALA_HOME=`pwd` git remote add fork https://github.com/granthenke/impala.git git fetch fork git checkout kudu-system-time $IMPALA_HOME/bin/bootstrap_development.sh source $IMPALA_HOME/bin/impala-config.sh (pushd fe && mvn -fae test -Dtest=AnalyzeDDLTest) (pushd fe && mvn -fae test -Dtest=AnalyzeKuduDDLTest) $IMPALA_HOME/bin/start-impala-cluster.py ./tests/run-tests.py query_test/test_kudu.py Change-Id: Id99e5cb58ab988c3ad4f98484be8db193d5eaf99 Reviewed-on: http://gerrit.cloudera.org:8080/15568 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Alexey Serbin <aserbin@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-03-31 19:38:10 +00:00
Tim Armstrong	6f150d383c	IMPALA-9361: manually configured kerberized minicluster The kerberized minicluster is enabled by setting IMPALA_KERBERIZE=true in impala-config-*.sh. After setting it you must run ./bin/create-test-configuration.sh then restart minicluster. This adds a script to partially automate setup of a local KDC, in lieu of the unmaintained minikdc support (which has been ripped out). Testing: I was able to run some queries against pre-created HDFS tables with kerberos enabled. Change-Id: Ib34101d132e9c9d59da14537edf7d096f25e9bee Reviewed-on: http://gerrit.cloudera.org:8080/15159 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-02-08 05:16:12 +00:00
Joe McDonnell	0163a10332	IMPALA-9068: Use different directories for external vs managed warehouse Hive 3 changed the typical storage model for tables to split them between two directories: - hive.metastore.warehouse.dir stores managed tables (which is now defined to be only transactional tables) - hive.metastore.warehouse.external.dir stores external tables (everything that is not a transactional table) In more recent commits of Hive, there is now validation that the external tables cannot be stored in the managed directory. In order to adopt these newer versions of Hive, we need to use separate directories for external vs managed warehouses. Most of our test tables are not transactional, so they would reside in the external directory. To keep the test changes small, this uses /test-warehouse for the external directory and /test-warehouse/managed for the managed directory. Having the managed directory be a subdirectory of /test-warehouse means that the data snapshot code should not need to change. The Hive 2 configuration doesn't change as it does not have this concept. Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION to 7 for USE_CDP_HIVE=true. This means that dataload will uses a separate location for data as compared to USE_CDP_HIVE=false. That should reduce conflicts between the two configurations. Testing: - Ran exhaustive tests with USE_CDP_HIVE=false - Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version) - Verified that dataload succeeds and tests are able to run with a newer Hive version. Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec Reviewed-on: http://gerrit.cloudera.org:8080/15026 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-01-24 17:29:15 +00:00
Joe McDonnell	fa7d91fd30	IMPALA-9241: Remove pid files on successful shutdown of minicluster The minicluster init scripts currently keep track of pids for HDFS, YARN, etc by writing the pid to files for each service. It uses the pid in the file to see what is running and needs to shutdown or start. Currently, it does not remove the pid file after the minicluster shuts down. This means that it would be reading a zombie pid from the pid file to see if the service is already running. If the pid is reused by something else, it can fail to start up a necessary service. This removes the pid files when the minicluster components shut down successfully. Change-Id: I5b14d74df8061b6595b9897df9c9667e3f569e34 Reviewed-on: http://gerrit.cloudera.org:8080/14950 Reviewed-by: Andrew Sherman <asherman@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-12-28 02:16:56 +00:00
Hao Hao	d26aae5f2d	IMPALA-8504: Support CREATE TABLE statement with Kudu/HMS integration This commit adds support for the syntax of CREATE TABLE (and CTAS) statements for managed Kudu tables with Kudu/HMS integration. A follow up patch will address the actual handling of CREATE TABLE statement with Kudu/HMS integration. For a managed table the syntax remains the same. However, the detailed changes includes: 1) Kudu table will always be created with the new Kudu storage handler 'org.apache.kudu.hive.KuduStorageHandler' even when Kudu/HMS integration is disabled. The legacy storage handler will be eventually deprecated. 2) When Kudu/HMS integration is enabled, the Kudu table underneath the managed HMS table will follow the naming convention 'db_name.table_name' instead of 'impala::db_name.table_name'. 3) Add 'kudu.table_id' table property to be used with Kudu/HMS integration. This commit also extracts Kudu-related DDL parsing and analyzing tests, so that they can be run with or without Kudu/HMS integration enabled. Change-Id: I465673d749221bd5f3772814b1c22c2673a53f5c Reviewed-on: http://gerrit.cloudera.org:8080/13318 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Thomas Marshall <tmarshall@cloudera.com>	2019-05-31 23:33:51 +00:00
Joe McDonnell	6b09612e76	IMPALA-8344: Add support for running the minicluster with S3Guard Some tests can fail on S3 due to some operations that are eventually consistent. S3Guard stores extra metadata in a DynamoDB to solve several consistency issues. This adds support for running the minicluster on S3 with S3Guard. S3Guard is configured by the following environment variables: S3GUARD_ENABLED: defaults to false, set to true to enable S3Guard S3GUARD_DYNAMODB_TABLE: name of the DynamoDB table to use. This must be exclusively owned by this minicluster. The dataload scripts initialize this table and will purge entries if the table already exists. The table should be in the same region as the S3_BUCKET for the minicluster. S3GUARD_DYNAMODB_REGION - AWS region for S3GUARD_DYNAMODB_TABLE These environment variables only impact S3 configurations. The support comes from three pieces: 1. Configuration changes in core-site.xml to add the appropriate parameters. 2. Updating dataload to initialize/purge the s3guard dynamodb table and import data appropriately. 3. Update tests to manipulate files through the HDFS command line rather than through s3 utilities. This takes the filesystem utility code for ABFS (which actually calls HDFS command line), makes it generic, and uses it for S3Guard. Testing: - Ran multiple rounds of s3 tests - Aborted tests in the middle and restarted the s3 tests (to test the s3guard reinitialization code) Change-Id: I3c748529a494bb6e70fec96dc031523ff79bf61d Reviewed-on: http://gerrit.cloudera.org:8080/13020 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Sahil Takiar <stakiar@cloudera.com>	2019-05-23 18:25:46 +00:00
Todd Lipcon	17daa6efb9	IMPALA-8369 (part 2): Hive 3: switch to Tez-on-YARN execution This switches away from Tez local mode to tez-on-YARN. After spending a couple of days trying to debug issues with Tez local mode, it seemed like it was just going to be too much of a lift. This patch switches on the starting of a Yarn RM and NM when USE_CDP_HIVE is enabled. It also switches to a new yarn-site.xml with a minimized set of configurations, generated by the new python templating. In order for everything to work properly I also had to update the Hadoop dependency to come from CDP instead of CDH when using CDP Hive. Otherwise, the classpath of the launched Tez containers had conflicting versions of various Hadoop classes which caused tasks to fail. I verified that this fixes concurrent query execution by running queries in parallel in two beeline sessions. With local mode, these queries would periodically fail due to various races (HIVE-21682). I'm also able to get farther along in data loading. Change-Id: If96064f271582b2790a3cfb3d135f3834d46c41d Reviewed-on: http://gerrit.cloudera.org:8080/13224 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Todd Lipcon <todd@apache.org>	2019-05-10 13:42:55 +00:00
Tim Armstrong	236b9194d3	IMPALA-7988: support loading data with dockerized Impalas This patch does the work to load data and run some end-to-end query tests on a dockerised cluster. Changes were required in start-impala-cluster.py/ImpalaCluster and in some configuration files. ImpalaCluster is used for various things, including discovering service ports and testing for cluster readiness. This patch adds basic support and uses it from start-impala-cluster.py to check for cluster readiness. Some logic is moved from start-impala-cluster.py to ImpalaCluster. Limitations: * We're fairly inconsistent about whether services listen only on a single interface (e.g. loopback, traditionally) or whether it listens on all interfaces. This doesn't fix all of those issues. E.g. HDFS datanodes listen on all interfaces to work around some issues. * Many tests don't pass yet, particularly those using ImpalaCluster(), which isn't initialised with the appropriate docker arguments. Testing: Did a full data load locally using a dockerised Impala cluster: START_CLUSTER_ARGS="--docker_network=impala-cluster" \ TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster" \ ./buildall.sh -format -testdata -ninja -notests -skiptests -noclean Ran a selection of end-to-end tests touching HDFS, Kudu and HBase tables after I loaded data locally. Ran exhaustive tests with non-dockerised impala cluster. Change-Id: I98fb9c4f5a3a3bb15c7809eab28ec8e5f63ff517 Reviewed-on: http://gerrit.cloudera.org:8080/12189 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-01-18 21:33:16 +00:00
Tim Armstrong	ff628d2b13	IMPALA-7986,IMPALA-7987: run daemons in docker containers This refactors start-impala-cluster.py to allow multiple implementations of the minicluster operations like start and stop. There are now two classes implementing the same set of operations - MiniClusterOperations and DockerMiniClusterOperations. The docker versions start and stop the containers added in IMPALA-7948. With some configuration (see instructions below), the containers can connect back to services (HDFS, HMS, Kudu, Sentry, etc) running on the host. Config generation was modified so that services optionally communicate via the docker bridge network rather than loopback (the host's loopback interface is not accessible to the containers). Notes: * I improved the container build to regenerate containers when cluster configs are regenerated (previously the containers could have stale configs). * Switch from CMD to ENTRYPOINT to allow passing in arguments to "docker run" without clobbering default args. * Python 2.6 is not supported for this code path. This only affects CentOS 6, which has limited support for docker anyway. * I deferred implementing wait_for_cluster(), since the existing code requires surgery to abstract out assumptions about locating processes and web UI ports - see IMPALA-7988. How to use: ========== Create a docker network to use for internal cluster communication, e.g.: docker network create -d bridge --gateway=172.17.0.1 \ --subnet=172.17.0.1/16 impala-cluster Add the gateway address of the docker network you created to impala-config-local.sh, e.g.: export INTERNAL_LISTEN_HOST=172.17.0.1 export DEFAULT_FS=hdfs://${INTERNAL_LISTEN_HOST}:20500 Regenerate configs and docker images: . bin/impala-config.sh ./bin/create-test-configuration.sh ninja -j $IMPALA_BUILD_THREADS docker_images Restart the minicluster and Impala services to pick up the config: ./testdata/bin/run-all.sh start-impala-cluster.py --docker_network impala-cluster You can connect with impala-shell and run some queries. You will likely run into issues, particularly if running against an existing data load, since "localhost" or "127.0.0.1" get baked into HMS table definitions. Testing: Ran exhaustive tests (not using Docker) to make sure I didn't break anything. Change-Id: I5975cced33fa93df43101dd47d19b8af12e93d11 Reviewed-on: http://gerrit.cloudera.org:8080/12095 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-01-18 04:56:49 +00:00
Sean Mackrory	7a022cf36a	IMPALA-7681. Add Azure Blob File System (ADLS Gen2) support. HADOOP-15407 adds a new FileSystem implementation called "ABFS" for the ADLS Gen2 service. It's in the hadoop-azure module as a replacement for WASB. Filesystem semantics should be the same, so skipped tests and other behavior changes have simply mirrored what is done for ADLS Gen1 by default. Tests skipped on ADLS Gen1 due to eventual consistency of the Python client can be run against ADLS Gen2. Change-Id: I5120b071760e7655e78902dce8483f8f54de445d Reviewed-on: http://gerrit.cloudera.org:8080/11630 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-10-20 06:43:00 +00:00
Thomas Tauber-Marshall	85f3bb0178	IMPALA-7499: build against CDH Kudu This patch transitions from pulling in Kudu (libkudu_client.so and the minicluster tarballs) from the toolchain to instead pull Kudu in with the other CDH components. For OSes where the CDH binaries are not provided but the toolchain binaries are (only Ubuntu 14), we set USE_CDH_KUDU to false to continue to download the toolchain binaries. We also continue to use the toolchain binaries to build the client stub for OSes where KUDU_IS_SUPPORTED is false. This patch also fixes an issue in bootstrap_toolchain.py where we were using the wrong g++ to compile the Kudu stub. Testing: - Verified building and running Impala works as expected for supported combinations of KUDU_IS_SUPPORTED/USE_CDH_KUDU Change-Id: If6e1048438b6d09a1b38c58371d6212bb6dcc06c Reviewed-on: http://gerrit.cloudera.org:8080/11363 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-09-11 01:01:06 +00:00
Fredy Wijaya	a203733fac	IMPALA-7295: Remove IMPALA_MINICLUSTER_PROFILE=2 This patch removes the use of IMPALA_MINICLUSTER_PROFILE. The code that uses IMPALA_MINICLUSTER_PROFILE=2 is removed and it defaults to code from IMPALA_MINICLUSTER_PROFILE=3. In order to reduce having too many code changes in this patch, there is no code change for the shims. The shims for IMPALA_MINICLUSTER_PROFILE=3 automatically become the default implementation. Testing: - Ran core and exhaustive tests Change-Id: Iba4a81165b3d2012dc04d4115454372c41e39f08 Reviewed-on: http://gerrit.cloudera.org:8080/10940 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-07-14 01:03:18 +00:00
Tianyi Wang	00519a68d2	IMPALA-7169: Prevent HDFS from checkpointing trash until 3000 AD HDFS trash checkpointing renames files in the trash folder and breaks impala tests. Impala set the trash checkpointing interval to 1440 to try to postpone it for 24 hours. Unfortunately that told HDFS to do it when the UNIX time is a multiple of 1440 * 60 and it broke trash-related tests run around midnight in GMT. This patch sets the interval to 541728000 so that HDFS won't do the checkpointing until Jan 1st 3000, and HDFS will checkpoint every 1030 years after that. Change-Id: I9452f7e44c7679f86a947cd20115c078757223d8 Reviewed-on: http://gerrit.cloudera.org:8080/10742 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-06-20 01:32:36 +00:00
Philip Zeyliger	5b824408af	IMPALA-7035: Configure jceks.key.serialFilter for KMS. Configures a Java property for KMS to account for JDK 8u171's security fixes. I was seeing impala-py.test tests/metadata/test_hdfs_encryption.py fail with the following error: AssertionError: Error creating encryption zone: RemoteException: Can't recover key for testkey1 from keystore file:/home/impdev/Impala/testdata/cluster/cdh6/node-1/data/kms.keystore The issue is described in HDFS-13494, and I imagine it'll be fixed in due time. In the meanwhile, setting this property seems to do the trick. Change-Id: I2d21c9cce3b91e8fd8b2b4f1cda75e3958c977d5 Reviewed-on: http://gerrit.cloudera.org:8080/10418 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-05-17 22:15:32 +00:00
Tianyi Wang	13a1acd7e4	IMPALA-7003: Deflake erasure coding data loading Erasure coding data loading is flaky in two ways: 1. HBase sometimes doesn't work because of HBase-19369 2. Nested data loading sometimes fails because the HDFS namenode cannot find enough good datanodes. For problem 1, this patch enables erasure coding only on /test-warehouse directory. For problem 2, this patch sets dfs.namenode.redundancy.considerLoad to false, preventing namenode from excluding heavily-loaded datanodes. Change-Id: I219106cd3ec7ffab7a834700f2a722b165e5f66c Reviewed-on: http://gerrit.cloudera.org:8080/10362 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-05-15 23:59:58 +00:00
Taras Bobrovytsky	c05696dd6a	IMPALA-6949: Add the option to start the minicluster with EC enabled In this patch we add the "ERASURE_CODING" enviornment variable. If we enable it, a cluster with 5 data nodes will be created during data loading and HDFS will be started with erasure coding enabled. Testing: I ran the core build, and verified that erasure coding gets enabled in HDFS. Many of our EE tests failed however. Cherry-picks: not for 2.x Change-Id: I397aed491354be21b0a8441ca671232dca25146c Reviewed-on: http://gerrit.cloudera.org:8080/10275 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-05-05 01:20:59 +00:00
Philip Zeyliger	2e6a63e31e	IMPALA-6070: Further improvements to test-with-docker. This commit tackles a few additions and improvements to test-with-docker. In general, I'm adding workloads (e.g., exhaustive, rat-check), tuning memory setting and parallelism, and trying to speed things up. Bug fixes: * Embarassingly, I was still skipping thrift-server-test in the backend tests. This was a mistake in handling feedback from my last review. * I made the timeline a little bit taller to clip less. Adding workloads: * I added the RAT licensing check. * I added exhaustive runs. This led me to model the suites a little bit more in Python, with a class representing a suite with a bunch of data about the suite. It's not perfect and still coupled with the entrypoint.sh shell script, but it feels workable. As part of adding exhaustive tests, I had to re-work the timeout handling, since now different suites meaningfully have different timeouts. Speed ups: * To speed up test runs, I added a mechanism to split py.test suites into multiple shards with a py.test argument. This involved a little bit of work in conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh. Furthermore, I moved a bit more logic about managing the list of suites into Python. * Doing the full build with "-notests" and only building the backend tests in the relevant target that needs them. This speeds up "docker commit" significantly by removing about 20GB from the container. I had to indicates that expr-codegen-test depends on expr-codegen-test-ir, which was missing. * I sped up copying the Kudu data: previously I did both a move and a copy; now I'm doing a move followed by a move. One of the moves is cross-filesystem so is slow, but this does half the amount of copying. Memory usage: * I tweaked the memlimit_gb settings to have a higher default. I've been fighting empirically to have the tests run well on c4.8xlarge and m4.10xlarge. The more memory a minicluster and test suite run uses, the fewer parallel suites we can run. By observing the peak processes at the tail of a run (with a new "memory_usage" function that uses a ps/sort/awk trick) and by observing peak container total_rss, I found that we had several JVMs that didn't have Xmx settings set. I added Xms/Xmx settings in a few places: * The non-first Impalad does very little JVM work, so having an Xmx keeps it small, even in the parallel tests. * Datanodes do work, but they essentially were never garbage collecting, because JVM defaults let them use up to 1/4th the machine memory. (I observed this based on RSS at the end of the run; nothing fancier.) Adding Xms/Xmx settings helped. * Similarly, I piped the settings through to HBase. A few daemons still run without resource limitations, but they don't seem to be a problem. Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c Reviewed-on: http://gerrit.cloudera.org:8080/10123 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-26 20:47:29 +00:00
stiga-huang	818cd8fa27	IMPALA-5717: Support for reading ORC data files This patch integrates the orc library into Impala and implements HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner supplies input needed from the orc-reader, tracks memory consumption of the reader and transfers the reader's output (orc::ColumnVectorBatch) into impala::RowBatch. The ORC version we used is release-1.4.3. A startup option --enable_orc_scanner is added for this feature. It's set to true by default. Setting it to false will fail queries on ORC tables. Currently, we only support reading primitive types. Writing into ORC table has not been supported neither. Tests - Most of the end-to-end tests can run on ORC format. - Add tpcds, tpch tests for ORC. - Add some ORC specific tests. - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library is not robust for corrupt files (ORC-315). Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4 Reviewed-on: http://gerrit.cloudera.org:8080/9134 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-11 05:13:02 +00:00
Philip Zeyliger	783de170c9	IMPALA-4277: Support multiple versions of Hadoop ecosystem Adds support for building against two sets of Hadoop ecosystem components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE, which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for Hadoop 3, Hive 2, and so on). We intend (in a trivial follow-on change soon) to make 3 the new default and to explicitly deprecate 2, but this change only does not switch the default yet. We support both to facilitate a smoother transition, but support will be removed soon in the Impala 3.x line. The switch is done at build time, following the pattern from IMPALA-5184 (build fe against both Hive 1 & 2 APIs). Switching back and forth requires running 'cmake' again. Doing this at build-time avoids complicating the Java code with classloader configuration. There are relatively few incompatible APIs. This implementation encapsulates that by extracting some Java code into fe/src/compat-minicluminicluster-profile-{2,3}. (This follows the pattern established by IMPALA-5184, but, to avoid a proliferation of directories, I've moved the Hive files into the same tree.) pattern from IMPALA-5184 (build fe against both Hive 1 & 2 APIs). I consolidated the Hive changes into the same directory structure. For Maven, I introduced Maven "profiles" to handle the two cases where the dependencies (and exclusions) differ. These are driven by the $IMPALA_MINICLUSTER_PROFILE environment variable. For Sentry, exception class names changed. We work around this by adding "isSentry...(Exception)" methods with two different implementations. Sentry is also doing some odd shading, whereby some exceptions are "sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism to create a SentryAuthProvider is slightly different. The easiest way to see the differences is to run: diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java The Sentry work is based on a change by Zach Amsden. In addition, we recently added an explicit "refresh" permission. In Sentry 2, this required creating an ImpalaPrivilegeModel to capture that. It's a slight customization of Hive's equivalent class. For Parquet, the difference is even more mechanical. The package names gone from "parquet" to "org.apache.parquet". The affected code was extracted into ParquetHelper, but only one copy exists. The second copy is generated at build-time using sed. In the rare cases where we need to behave differently at runtime, MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates what version we were built aginst. One of the cases is the results expected by various frontend tests. I avoided the issue by translating one error string into another, which handled the diversion in one place, rather than complicating the several locations which look for "No FileSystem for scheme..." errors. The HBase APIs we use for splitting regions at test time changed. This patch includes a re-write of that code for the new APIs. This piece was contributed by Zach Amsden. To work with newer versions of dependencies, I updated the version of httpcomponents.core we use to 4.4.9. We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase binaries to s3://native-toolchain, and amended the shell scripts to launch the right things. There are minor mechanical differences. Some of this was based on earlier work by Joe McDonnell and Zach Amsden. Hive's logging is changed in Hive 2, necessitating creating a log4j2.properties template and using it appropriately. Furthermore, Hadoop3's new shell script re-writes do a certain amount of classpath de-duplication, causing some issues with locating the relevant logging configurations. Accomodations exist in the code to deal with that. parquet-filtering.test was updated to turn off stats filtering. Older Hive didn't write Parquet statistics, but newer Hive does. By turning off stats filtering, we test what the test had intended to test. For views-compatibility.test, it seems that Hive 2 has fixed certain bugs that we were testing for in Hive. I've added a HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that. For AuthorizationTest, different hive versions show slightly different things for extended output. To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing to see here. rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%) rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%) CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This was done manually, but without changing anything except what Java required in terms of accessibility and boilerplate. rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%) copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%) Testing: Ran core & exhaustive tests with both profiles. Cherry-picks: not for 2.x. Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7 Reviewed-on: http://gerrit.cloudera.org:8080/9716 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-23 20:56:00 +00:00
Bharath Vissapragada	20daa4d516	IMPALA-6384: RequestPoolService should honor custom group mapping config Due to the way in which we instantiate fair scheduler allocation loader, we donot read the config overrides from the HDFS config files. This is an unexpected behavior from users' POV since we typically support overrides like custom user -> group mapping via HDFS config (for ex: LDAPGroupsMapping) that eventually affects the query -> pool assignment. Fix: This patch loads the hadoop default configuration so that the underlying QueuePlacementPolicy is based on user specified overrides. Testing (manual): Changed the core-site.xml to use LDAPGroupsMapping instead of the default ShellBasedUnixGroupsMapping and confirmed that the correct group mapping plugin is loaded, by adding additional logging. Also, modified TestRequestPoolService to assert that the core-site xml overrides are loaded. Change-Id: Ibb93870c0cc37e2432a643a274931f1d3d13fb96 Reviewed-on: http://gerrit.cloudera.org:8080/9000 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-11 22:52:29 +00:00
Laszlo Gaal	e81b7c6b68	IMPALA-6067: Enable S3 access via IAM roles for EC2 VMs For some time Impala in a production environment has been able to access data stored in Amazon S3 buckets using credentials specified in a number of ways: - storing Amazon access keys in environment variables or in core-site.xml. - using proprietary management tools to store Amazon access keys securely - using Amazon IAM roles bound to VMs running in EC2. The development minicluster environment used the first approach, which risked leaking these keys. This change enables Impala builds to use IAM roles to access S3 buckets when running on an Amazon EC2 virtual machine. The changes mainly ensure that environment variables carrying the traditional AWS credentials do not conflict with credentials supplied by the IAM role attached to the VM instance. IAM role based credentials are accessible through the EC2 instance-property mechanism; for further details see Amazon's docs at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html#instance-metadata-security-credentials The change also removes the remaining references to the s3n: provider. In the FE tests all URIs referring to s3n: are replaced with their s3a: equivalents, except for a single negative test in AnalyzeStmtsTest.java, which is removed. In addition to the code changes, the s3n: and s3a: credential properties are also removed from core-site.xml.tmpl. The s3a: provider can pick up AWS S3 credentials from environment variables or IAM properties bound to the VM instance, which is a more flexible approach. As environment variables have precedence over IAM roles, care must be taken when managing the canonical environment variables carrying AWS credentials. There are two requirements to be reconciled: 1. The FE tests have code that examines s3a: URIs; this code needs existing, but not necessarily valid AWS credentials. 2. When the Impala test suite is executed on an EC2 VM, AWS credentials can be supplied via IAM roles. These credentials can be used only if the AWS_* environment variables are unset (do not exist). The tradeoff is managed following these rules: 1. When AWS_* environment variables are set before invoking the Impala configuration scripts, their value is preserved and the config scripts ensure that the variables are exported. 2. If the AWS_* variables are missing or empty, they will be unset to ensure that credentials supplied by Amazon's IAM roles can be accessed, 3. except if the scripts are running outside of EC2 (so there can be no IAM roles) and TARGET_FILESYSTEM is not set "s3". This combination is most often the case on a developer's local workstation. In this case the AWS_* credential variables are forcibly set to dummy values to allow the FE tests to succeed. The removal of S3 credential parameters from core-site.xml[.tmpl] also allows users to set up their own credentials there, the config scripts will not change those settings. Environment variables carrying AWS security credentials will be set up according to the following table: Instance: Running outside EC2 \|\| Running in EC2 \| --------------------+--------+--------++--------+--------+ TARGET_FILESYSTEM \| S3 \| not S3 \|\| S3 \| not S3 \| --------------------+--------+--------++--------+--------+ \| \| \|\| \| \| empty \| unset \| dummy \|\| unset \| unset \| AWS_* \| \| \|\| \| \| env --------------+--------+--------++--------+--------+ var \| \| \|\| \| \| not empty \| export \| export \|\| export \| export \| \| \| \|\| \| \| --------------------+--------+--------++--------+--------+ Legend: unset: the variable is unset export: the variable is exported with its current value dummy: the variable is set to a preset dummy value and exported Running on an EC2 VM is indicated by setting RUNNING_IN_EC2 to "true" and exporting it before impala_config.sh is invoked. The change also moves the logic performing the S3 access checks into a separate script file: bin/check-s3-access.sh. This file now contains all the S3-specific logic and network access to check if the requested S3 bucket can be accessed. Testing: Performed local builds for HDFS as well as automated builds against HDFS and S3, using both IAM roles and explicit AWS_* credentials for authentication. Verified that FE tests that parse s3a: URLs are still successful in all these combinations (when they are run). Change-Id: I14cd9d4453a91baad3c379aa7e4944993fca95ae Reviewed-on: http://gerrit.cloudera.org:8080/8294 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Reviewed-by: Zach Amsden <zamsden@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-09 01:43:01 +00:00
Sailesh Mukil	50bd015f2d	IMPALA-5333: Add support for Impala to work with ADLS This patch leverages the AdlFileSystem in Hadoop to allow Impala to talk to the Azure Data Lake Store. This patch has functional changes as well as adds test infrastructure for testing Impala over ADLS. We do not support ACLs on ADLS since the Hadoop ADLS connector does not integrate ADLS ACLs with Hadoop users/groups. For testing, we use the azure-data-lake-store-python client from Microsoft. This client seems to have some consistency issues. For example, a drop table through Impala will delete the files in ADLS, however, listing that directory through the python client immediately after the drop, will still show the files. This behavior is unexpected since ADLS claims to be strongly consistent. Some tests have been skipped due to this limitation with the tag SkipIfADLS.slow_client. Tracked by IMPALA-5335. The azure-data-lake-store-python client also only works on CentOS 6.6 and over, so the python dependencies for Azure will not be downloaded when the TARGET_FILESYSTEM is not "adls". While running ADLS tests, the expectation will be that it runs on a machine that is at least running CentOS 6.6. Note: This is only a test limitation, not a functional one. Clusters with older OSes like CentOS 6.4 will still work with ADLS. Added another dependency to bootstrap_build.sh for the ADLS Python client. Testing: Ran core tests with and without TARGET_FILESYSTEM as 'adls' to make sure that all tests pass and that nothing breaks. Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542 Reviewed-on: http://gerrit.cloudera.org:8080/6910 Tested-by: Impala Public Jenkins Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>	2017-05-25 19:35:24 +00:00
Matthew Jacobs	d04f96b990	IMPALA-5301: Set Kudu minicluster memory limit By default, Kudu assumes it has 80% of system memory which is far too high for the minicluster. This sets a mem limit of 2gb and lowers the limit of the block cache. These values were tested on a gerrit-verify-dryrun job as well as an exhaustive run. This patch also simplifies TestKuduMemLimits which was unnecessarily creating a large table during test execution. Change-Id: I7fd7e1cd9dc781aaa672a2c68c845cb57ec885d5 Reviewed-on: http://gerrit.cloudera.org:8080/6844 Reviewed-by: Todd Lipcon <todd@apache.org> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-17 23:44:37 +00:00
Laszlo Gaal	9e7fb830fd	IMPALA-4088: Assign fix values to the minicluster server ports The minicluster setup logic assigned fixed port numbers to several but not all listening sockets of the data nodes. This change assigns similar port ranges to all the listening ports that were so far allowed to pick their own port numbers, interfering with other components, e.g. HBase. Change-Id: Iecf312873b7026c52b0ac0e71adbecab181925a0 Reviewed-on: http://gerrit.cloudera.org:8080/6531 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-04-07 22:57:16 +00:00
Lars Volker	85d7f5eb2b	IMPALA-4733: Change HBase ports to non-ephemeral We've seen repeated test failures because HBase tries to bind to ports in the ephemeral port range, which sometimes would already be occupied by outgoing connections of other proccesses. This change changes the ports to the new default HBase ports (HBASE-10123): HBase Master Port: 60000 -> 16000 HBase Master Web UI Port: 60010 -> 16010 HBase ReqionServer Port: 60020 -> 16020 HBase ReqionServer Web UI Port: 60030 -> 16030 HBase Status Multicast Port: 60100 -> 16100 This made it necessary to change the default KMS port, too (HADOOP-12811): KMS HTTP port: 16000 -> 9600 Change-Id: I6f8af325e34b6e352afd75ce5ddd2446ce73d857 Reviewed-on: http://gerrit.cloudera.org:8080/6524 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins	2017-04-04 00:28:29 +00:00
Jim Apple	5a2c50a163	IMPALA-4230: ASF policy issues from 2.7.0 rc3. In our IPMC vote to release 2.7.0 rc3, Justing Mclean pointed out a number of issues of compliance with ASF policy. He asked: 1. "Please place build instruction and supported platforms in the README. The wiki may change over time and that may make it difficult to build older versions." 2. Remove binary file llvm-ir/test-loop.bc 3. Add be/src/gutil/valgrind.h, shell/ext-py/sqlparse-0.1.14/sqlparse/pipeline.py and cmake_modules/FindJNI.cmake, normalize.css (embedded in bootstrap.css) to LICENSE.txt 4. Fix be/src/thirdparty/squeasel/squeasel* in LICENSE.txt 5. Remove outdated copyright lines from HBase (see https://issues.apache.org/jira/browse/HBASE-3870) 6. Remove duplicate jquery notice from LICENSE.txt Change-Id: I30ff77d7ac28ce67511c200764fba19ae69922e0 Reviewed-on: http://gerrit.cloudera.org:8080/4582 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-10-19 23:59:02 +00:00
Henry Robinson	19de09ab7d	IMPALA-4160: Remove Llama support. Alas, poor Llama! I knew him, Impala: a system of infinite jest, of most excellent fancy: we hath borne him on our back a thousand times; and now, how abhorred in my imagination it is! Done: * Removed QueryResourceMgr, ResourceBroker, CGroupsMgr * Removed untested 'offline' mode and NM failure detection from ImpalaServer * Removed all Llama-related Thrift files * Removed RM-related arguments to MemTracker constructors * Deprecated all RM-related flags, printing a warning if enable_rm is set * Removed expansion logic from MemTracker * Removed VCore logic from QuerySchedule * Removed all reservation-related logic from Scheduler * Removed RM metric descriptions * Various misc. small class changes Not done: * Remove RM flags (--enable_rm etc.) * Remove RM query options * Changes to RequestPoolService (see IMPALA-4159) * Remove estimates of VCores / memory from plan Change-Id: Icfb14209e31f6608bb7b8a33789e00411a6447ef Reviewed-on: http://gerrit.cloudera.org:8080/4445 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-09-20 23:50:43 +00:00
Yuanhao Luo	6e4064d942	IMPALA-4074: Configuration items duplicate in template of YARN Remove duplicate configuration items "yarn.nodemanager.local-dirs" and "yarn.nodemanager.log-dirs" from template configuration of YARN. Change-Id: I81d6d019d4982cb35932b1d45c376b215ec5bcc6 Reviewed-on: http://gerrit.cloudera.org:8080/4311 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-09-16 00:38:16 +00:00
Matthew Jacobs	32199105f7	Bump Kudu version to 1.0-RC1 and add support for new OSes Change-Id: Ibbe554d6782212f91db07757f429c5571a7a44da Reviewed-on: http://gerrit.cloudera.org:8080/4420 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-09-16 00:14:15 +00:00
Jim Apple	bd2947329e	IMPALA-4110: Clean up issues found by Apache RAT. Change-Id: I5bfe77f9a871018e7a67553ed270e2df53006962 Reviewed-on: http://gerrit.cloudera.org:8080/4361 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-14 22:09:24 +00:00
Bharath Vissapragada	d19751669a	IMPALA-3680: Cleanup the scan range state after failed hdfs cache reads Currently we don't reset the file read offset if ZCR fails. Due to this, when we switch to the normal read path, we hit the eosr of the scan-range even before reading the expected data length. If both the ReadFromCache() and ReadRange() calls fail without reading any data, we end up creating a whole list of scan-ranges, each with size 1KB (DEFAULT_READ_PAST_SIZE) assuming we are reading past the scan range. This gives a huge performance hit. This patch just calls ScanRange::Close() after the failed cache reads to clean up the file system state so that the re-reads start from beginning of the scan range. This was hit as a part of debugging IMPALA-3679, where the queries on 1gb cached data were running ~20x slower compared to non-cached runs. Change-Id: I0a9ea19dd8571b01d2cd5b87da1c259219f6297a Reviewed-on: http://gerrit.cloudera.org:8080/3313 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Bharath Vissapragada <bharathv@cloudera.com>	2016-07-05 13:37:26 -07:00
Michael Ho	6e71e903ff	IMPALA-3223: Supports download of CDH components from S3. This change updates the toolchain bootstrapping script to download the CDH components (hadoop, hbase, hive, llama, llama-minikdc and sentry) from the toolchain S3 bucket to the toolchain directory if the environment variable $DOWNLOAD_CDH_COMPONENTS is true. By default, it is false which means the CDH components in the thirdparty directory will be used instead. To build the ASF tree(https://git-wip-us.apache.org/repos/asf?p=incubator-impala.git), set $DOWNLOAD_CDH_COMPONENTS to true. Currently, the CDH components in S3 are snapshots from the thirdparty directory at 688d0efcd38731e8e27a8236dbdca21c8fd571a1. Once the integration jenkins job (impala-cdh5-trunk-core-integration) is modified to upload the latest stable builds to the S3 buckets, we can remove the thirdparty directory and always use the CDH components in the toolchain directory. Note that bootstrap_toolchain.py will not overwrite existing directories in the toolchain directory. To force a refresh of cpmponents in the toolchain directory, a user should delete the cached copy in the toolchain directory and execute bootstrap_toolchain.py again. This behavior allows users to develop locally without network connection once the toolchain has been bootstrapped. Change-Id: I16fa79db0005554cc0a116e74775647ba99f8dda Reviewed-on: http://gerrit.cloudera.org:8080/3333 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-06-21 00:37:53 -07:00
Taras Bobrovytsky	bee5375502	Add kill cluster marker to KMS If PID files of each process in the mini cluster get deleted for some reason, it should still possible to kill them because each process is marked with "-DIBelongToTheMiniCluster". It turns out that the KMS process was not being marked. This patch fixes this. Change-Id: I0398dec94be3ae91548d11a79c1d5eec0ad3dadb Reviewed-on: http://gerrit.cloudera.org:8080/3354 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>	2016-06-13 15:32:17 -07:00
Sailesh Mukil	d2c3c8711b	IMPALA-2021: S3: Flaky tests: impala-s3 job sometimes encounters I/O error 255 Through emprical analysis, it was determined that setting the maximum number of connections to S3 as 1500 was optimal for functionality and performance. The hadoop set default of 15 connections could lead us to have deadlocks as our parquet scanner requires that we have multiple concurrent open connections proportional to the number of columns that we are scanning. Setting it to this high a value does not seem to have any negative implications. This has also been found to fix the Error(255): Unknown errors. Change-Id: Ide6f1326d5155b2e5f4da3a3f23df3f3d40c5a8d Reviewed-on: http://gerrit.cloudera.org:8080/3114 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-23 08:40:19 -07:00
Bharath Vissapragada	3092c96619	IMPALA-2660: Respect auth_to_local configs from hdfs configs This patch implements a new feature to read the auth_to_local configs from hdfs configuration files, using the parameter hadoop.security.auth_to_local. This is done by modifying the User#getShortName() method to use its hdfs equivalent. This patch includes an end to end authorization test using sentry where we add specific auth_to_local setting for a certain user and test if the sentry authorization passes for this user after applying these rules. Given we don't have tests that run on a kerberized min-cluster, this patch adds a hack to load this configuration during even on non-kerberized 'test runs'. However this feature is disabled by default to preserve the existing behavior. To enable it, 1. Use kerberos as authentication mechanism (by setting --principal) and 2. Add "--load_auth_to_local_rules=true" to the cluster startup args Change-Id: I76485b83c14ba26f6fce66e5f83e8014667829e0 Reviewed-on: http://gerrit.cloudera.org:8080/2800 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:18:01 -07:00
Tim Armstrong	8e64273fee	Fix Kudu hole punch check to work if /tmp is on different fs /tmp isn't necessarily on the same filesystem as the Kudu data directory. Fix the check so that it checks the actual Kudu directory. Change-Id: Ic6aa27569a0650db7dcf5759952cd50c8e47f8c9 Reviewed-on: http://gerrit.cloudera.org:8080/2967 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:56 -07:00
casey	a27946e696	Improve mini-cluster usability (testdata/cluster/admin) Changes: 1) Previously when a service would fail, the user would have to find the the log file and open it. Now the end of the log is dumped to stdout. 2) Add start, stop, and restart commands to the "admin" script. For example now you can run testdata/cluster/admin restart kudu 3) Wait up to 120 seconds for services to shutdown. The timeout is the same as for the Impala processes. If the services fail to stop an error will be raised. Change-Id: I537ea5656df2081d4f1f27a9f3fcef4547fdc2fe Reviewed-on: http://gerrit.cloudera.org:8080/2751 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:37 -07:00
casey	6f4a5e6bb0	Kudu: Use -block_manager=file if "hole punching" isn't supported By default Kudu requires the underlying file system to support hole punching. If support isn't there Kudu will fail to start. People using such a file system can instead start Kudu with -block_manager=file. Before starting Kudu in the local mini-cluster, the "fallocate" command will be used to automatically determine if the special flag is needed. Note, users who need this must run bin/create-test-configuration.sh after pulling in this commit. This also fixes a bug in the delete_kudu_data() in the cluster admin script. A directory name was incorrect. Change-Id: I1ca7fedb367444c41e462b72b0b76091ee94e27c Reviewed-on: http://gerrit.cloudera.org:8080/2750 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:36 -07:00
casey	cef87e39dc	Updates for new Kudu toolchain layout and upgrade Kudu The directory structure of the newer Kudu toolchain artifacts has changed. Now the root directory is split into /release and /debug. A few little updates are needed to the build and service scripts. Since the toolchain no longer provides stubs for platforms that Kudu doesn't support the stubs need to be generated. This will be done as part of the toolchain bootstrapping. Also this upgrades Kudu to 0.8 RC1. Developers will need to run bin/create-test-configuration.sh after pulling in this change. Otherwise the Kudu service will fail to start. Change-Id: I625903bd92afece0ad819a96fc275d5812b5eb2a Reviewed-on: http://gerrit.cloudera.org:8080/2720 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:35 -07:00
Casey Ching	9bb1b8a366	Kudu: Disable fsnyc in the mini-cluster The Kudu team recommended disabling this for testing purposes. This should help with timeouts in cloud machines (ec2/gce). Disabling fsyncs could lead to data loss if the system crashed before the OS had a chance to write the data to disk. Our test setups don't need that level of reliability. Change-Id: I72fd85ce5c4bc71f071b854ea6a9ebe60fc1305f Reviewed-on: http://gerrit.cloudera.org:8080/2734 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:43 -07:00

1 2

71 Commits