Mirror of https://github.com/apache/impala.git (synced 2026-01-06 06:01:03 -05:00)
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:
- Managed by Cloudera Manager (CM)
- GPL Extras installed
- KMS and KeyTrustee installed and available as a service
- SERDEPROPERTIES in the Hive DB modified to accept wide tables
- Hive warehouse dir points to /test-warehouse
The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available and that the environment is
set up properly (client binaries like beeline and the hbase shell are
available, Python libraries like cm_api are installed, necessary
environment variables are defined, etc.).
Note that running remote_data_load.py overwrites any local XML config
files with the configurations downloaded from the remote cluster.
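Because the local XML config files are overwritten, it may be worth saving copies before a run. A minimal sketch, assuming the files sit in a single directory; `backup_xml_configs`, `CONF_DIR`, and the demo file are hypothetical names used for illustration and are not part of this patch:

```shell
#!/bin/bash
set -euo pipefail

# Copy every *.xml file in a directory into a timestamped backup directory
# created next to it, then print the backup directory path.
backup_xml_configs() {
  local conf_dir="$1"
  local backup_dir="${conf_dir}.bak.$(date +%Y%m%d%H%M%S)"
  local f
  mkdir -p "${backup_dir}"
  for f in "${conf_dir}"/*.xml; do
    # If nothing matches, the glob stays literal; skip it.
    [ -e "${f}" ] || continue
    cp -p "${f}" "${backup_dir}/"
  done
  echo "${backup_dir}"
}

# Demo on a throwaway directory. CONF_DIR is an assumed stand-in for
# wherever the local XML config files actually live.
CONF_DIR="$(mktemp -d)"
echo "<configuration/>" > "${CONF_DIR}/hive-site.xml"
BACKUP_DIR="$(backup_xml_configs "${CONF_DIR}")"
ls "${BACKUP_DIR}"
```

The originals are copied rather than moved, so the working tree is unchanged until the data load itself rewrites it.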
Usage: remote_data_load.py [options] <cm_host address>

Options:
  -h, --help            show this help message and exit
  --snapshot-file=SNAPSHOT_FILE
                        Path to the test-warehouse archive
  --cm-user=CM_USER     Cloudera Manager admin user
  --cm-pass=CM_PASS     Cloudera Manager admin user password
  --gateway=GATEWAY     Gateway host to upload the data from. If not
                        set, uses the CM host as gateway.
  --ssh-user=SSH_USER   System user on the remote machine with
                        passwordless SSH configured.
  --no-load             Do not try to load the snapshot
  --exploration-strategy=EXPLORATION_STRATEGY
  --test                Run end-to-end tests against cluster
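As an illustration of how the options combine, here is a dry-run sketch that only assembles and prints a command line. The host names, user names, and snapshot path are placeholders rather than values from this patch, and the script's location in the checkout is not assumed:

```shell
#!/bin/bash
set -euo pipefail

# Placeholder values (assumptions for illustration only).
CM_HOST="cm.example.com"
GATEWAY="gateway.example.com"
SNAPSHOT="/tmp/test-warehouse-snapshot.tar.gz"

# Build the argument list as an array so each option stays a single word.
cmd=(remote_data_load.py
     --snapshot-file="${SNAPSHOT}"
     --cm-user=admin
     --gateway="${GATEWAY}"
     --ssh-user=systest
     --test
     "${CM_HOST}")

# Print instead of executing; run "${cmd[@]}" directly to perform the load.
echo "${cmd[@]}"
```

Omitting `--gateway` would make the CM host act as the gateway, and adding `--no-load` would skip loading the snapshot, per the help text.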
Testing:
This patch is being submitted with the understanding that there are
still cleanup issues that need to be addressed in the remote data
load script, for which JIRAs have been filed.
However, since many of the existing build scripts also had to be
modified, it is more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, and the patch was also run against the Jenkins
job that provides the test-warehouse snapshot used by the many other
Impala CI builds that run daily.
Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
52 lines
2.1 KiB
Bash
Executable File
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# Runs compute table stats over a curated set of Impala test tables.
#
set -euo pipefail
trap 'echo Error in $0 at line $LINENO: $(cd "'$PWD'" && awk "NR == $LINENO" $0)' ERR

. ${IMPALA_HOME}/bin/impala-config.sh > /dev/null 2>&1

# TODO: We need a better way of managing how these get set. See:
# https://issues.cloudera.org/browse/IMPALA-4346
IMPALAD=${IMPALAD:-localhost:21000}

COMPUTE_STATS_SCRIPT="${IMPALA_HOME}/tests/util/compute_table_stats.py --impalad=${IMPALAD}"

# Run compute stats over as many of the tables used in the Planner tests as possible.
${COMPUTE_STATS_SCRIPT} --db_names=functional \
    --table_names="alltypes,alltypesagg,alltypesaggmultifilesnopart,alltypesaggnonulls,
    alltypessmall,alltypestiny,jointbl,dimtbl"

# We cannot load HBase on s3 and isilon yet.
if [ "${TARGET_FILESYSTEM}" = "hdfs" ]; then
  ${COMPUTE_STATS_SCRIPT} --db_names=functional_hbase \
    --table_names="alltypessmall,stringids"
fi
${COMPUTE_STATS_SCRIPT} --db_names=tpch,tpch_parquet \
    --table_names=customer,lineitem,nation,orders,part,partsupp,region,supplier
${COMPUTE_STATS_SCRIPT} --db_names=tpch_nested_parquet
${COMPUTE_STATS_SCRIPT} --db_names=tpcds

if "$KUDU_IS_SUPPORTED"; then
  ${COMPUTE_STATS_SCRIPT} --db_names=functional_kudu
  ${COMPUTE_STATS_SCRIPT} --db_names=tpch_kudu
fi
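
The script above relies on two shell idioms that are easy to misread: the default-if-unset expansion used for `IMPALAD`, and `if "$KUDU_IS_SUPPORTED"`, which executes the variable's value as a command. The standalone sketch below shows both in isolation. The host name and the `DEFAULTED`, `OVERRIDDEN`, and `KUDU_STATS` variables are hypothetical and exist only for this illustration:

```shell
#!/bin/bash
set -euo pipefail

# 1. Default-if-unset: with IMPALAD unset, the fallback value is used.
unset IMPALAD
DEFAULTED=${IMPALAD:-localhost:21000}
echo "default:  ${DEFAULTED}"

# 2. A value set by the caller wins over the fallback (hypothetical host).
IMPALAD="impalad.example.com:21000"
OVERRIDDEN=${IMPALAD:-localhost:21000}
echo "override: ${OVERRIDDEN}"

# 3. `if "$VAR"` runs the value as a command, so the variable must hold the
#    literal word true or false; any other value would be looked up as a
#    command name.
KUDU_IS_SUPPORTED=false
if "$KUDU_IS_SUPPORTED"; then
  KUDU_STATS="computed"
else
  KUDU_STATS="skipped"
fi
echo "kudu stats: ${KUDU_STATS}"
```

Because the script runs under `set -u`, `TARGET_FILESYSTEM` and `KUDU_IS_SUPPORTED` must already be defined when it reaches them; only `IMPALAD` has a fallback.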