Files
impala/tests/comparison/leopard
Joe McDonnell 1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python)

Testing:
 - Ran core job
 - Ran build + dataload on Centos 7, Redhat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
..
2017-09-21 22:57:26 +00:00

Summary
-------

This package runs the query generator continuously. Compares Impala and Postgres results
for a randomly generated query and produces several reports per day. Reports are
displayed on a web page which allows the user to conveniently examine the discovered
issues. The user can also start a custom run against a private Impala branch using the
web interface.

Requirements
------------

Docker -- A docker image with Impala and PostgresQL installed and at
    least one reference database loaded into PostgresQL. data_generator.py is a useful
    tool to migrate data from Impala into PostgresQL.

To get started, run ./controller.py and ./front_end.py. You should be able to view the
web page at http://localhost:5000. Results and logs are saved to /tmp/query_gen


Basic Configuration
-------------------

The following are useful environment variables for running the
controller and Docker images within it.

DOCKER_USER - user *within* the Impala Docker container who owns the
Impala source tree and test data.

DOCKER_PASSWORD - password for the user *within* the Impala Docker
container.

TARGET_HOST - host system on which Docker Engine is running. This is the
host that the controller will use to issue Docker commands like "docker
run".

TARGET_HOST_USERNAME - username for controller process to use to SSH
into TARGET_HOST. Via Fabric, one can either type a password or use SSH
keys.

DOCKER_IMAGE_NAME - image to pull via "docker pull"


External Volume Configuration
-----------------------------

To run Leopard against Impala with Kudu, we need to work around
KUDU-1419. KUDU-1419 is likely to occur if your Docker Storage Engine is
AUFS, or maybe others.  The easiest way to overcome this is to mount an
external Docker volume that contain the necessary test data.  To try to
handle this automatically, you can export any or all of the environment
variables, depending on your host and container setups:

DOCKER_IMPALA_USER_UID, DOCKER_IMPALA_USER_GID - numeric UID and GID for
the owner of the Impala test data (testdata/cluster from an Impala
source checkout) within your Docker container. Numeric IDs are needed,
because there is no guarantee the symbolic owner and group on the
container match the IDs on the target host.

HOST_TESTDATA_EXTERNAL_VOLUME_PATH - path on TARGET_HOST where the
external volume will reside. This is the destination for rsync to warm
the volume and the left-hand side of "docker run -v".

DOCKER_TESTDATA_VOLUME_PATH - path on your Docker container to the
testdata/cluster Impala directory. This is source for rsync to warm the
volume and the right-hand side of "docker run -v".

HOST_TO_DOCKER_SSH_KEY - name of private key on TARGET_HOST for use with
rsync so as to "warm" the external volume automatically.

You are encouraged to configure your container in such a way that rsync
with passwordless SSH is possible so as to create the external volume
using the environment variables above.

To do that, this is a handy guide on how to use rsync with SSH keys:

https://www.guyrutenberg.com/2014/01/14/restricting-ssh-access-to-rsync/