This patch adds support for running the stress test
(concurrent_select.py) and loading nested data (load_nested.py) into a
Kerberized, SSL-enabled Impala cluster. It assumes the calling user
already has a valid Kerberos ticket. One way to do that is:
1. Get access to a keytab and krb5.conf
2. Set KRB5_CONFIG and KRB5CCNAME appropriately
3. Run kinit(1)
4. Run load_nested.py and/or concurrent_select.py within this
environment.
Because our Python clients already support Kerberos and SSL, we only
need to pass the correct options when calling the entry points and
initializing the clients (a sketch follows the list below):
Impala: Impyla
Hive: Impyla
HDFS: hdfs.ext.kerberos.KerberosClient
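As a rough sketch (not the patch itself; hostnames and ports below
are placeholders), the three clients can be opened against a
Kerberized, SSL-enabled cluster like this:

  from impala.dbapi import connect
  from hdfs.ext.kerberos import KerberosClient

  # Impala and HiveServer2 via Impyla: GSSAPI auth plus SSL.
  impala_conn = connect(host='impalad.example.com', port=21050,
                        auth_mechanism='GSSAPI', use_ssl=True,
                        kerberos_service_name='impala')
  hive_conn = connect(host='hs2.example.com', port=10000,
                      auth_mechanism='GSSAPI', use_ssl=True,
                      kerberos_service_name='hive')

  # WebHDFS over HTTPS; the client picks up the Kerberos ticket
  # from the environment (SPNEGO).
  hdfs_client = KerberosClient('https://namenode.example.com:50470')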
With this patch, I manually ran a short concurrent_select.py session
against a secure cluster without connection or auth errors, and did
the same with load_nested.py against a cluster that already had TPC-H
loaded.
Follow-ons for future cleanup work:
IMPALA-5263: support CA bundles when running stress test against SSL'd
Impala
IMPALA-5264: fix InsecurePlatformWarning under stress test with SSL
Change-Id: I0daad57bb8ceeb5071b75125f11c1997ed7e0179
Reviewed-on: http://gerrit.cloudera.org:8080/6763
Reviewed-by: Matthew Mulder <mmulder@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This patch lays the groundwork for loading data and running end-to-end
tests on a remote CDH cluster. The requirements for the cluster to run
the tests are:
- Managed by Cloudera Manager (CM)
- GPL Extras installed
- KMS and KeyTrustee installed and available as a service
- SERDEPROPERTIES in the Hive DB modified to accept wide tables
- Hive warehouse dir points to /test-warehouse
The actual data loading is done via a new script, remote_data_load.py,
which takes the CM host as an argument. It can be run from a client
machine that is not a node of the cluster, but it needs to have the
Impala repo checked out and Impala built. This ensures that all of the
necessary data load scripts are available and that the environment is
set up properly (client binaries like beeline and the hbase shell
are available, Python libraries like cm_api are installed, necessary
environment variables are defined, etc.)
It should be noted that running remote_data_load.py will overwrite
any local XML config files with the configurations downloaded from
the remote cluster.
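As an illustration of the CM interaction (hypothetical host and
credentials; a sketch, not the script's actual code), the cm_api
package mentioned above lets a client machine talk to the CM host that
remote_data_load.py takes as its argument:

  from cm_api.api_client import ApiResource

  # Connect to Cloudera Manager and enumerate the managed clusters.
  api = ApiResource('cm-host.example.com', username='admin',
                    password='admin')
  for cluster in api.get_all_clusters():
      print(cluster.name)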
Usage: remote_data_load.py [options] <cm_host address>
Options:
-h, --help show this help message and exit
--snapshot-file=SNAPSHOT_FILE
Path to the test-warehouse archive
--cm-user=CM_USER Cloudera Manager admin user
--cm-pass=CM_PASS Cloudera Manager admin user password
--gateway=GATEWAY Gateway host to upload the data from. If not
set, uses the CM host as gateway.
--ssh-user=SSH_USER System user on the remote machine with
passwordless SSH configured.
--no-load Do not try to load the snapshot
--exploration-strategy=EXPLORATION_STRATEGY
--test Run end-to-end tests against cluster
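Example invocation (hypothetical host, credentials, and snapshot
path):

  remote_data_load.py --cm-user=admin --cm-pass=admin \
      --snapshot-file=/tmp/test-warehouse.tar.gz cm-host.example.com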
Testing:
This patch is being submitted with the understanding that there are
still cleanup issues to be addressed in the remote data load script,
for which JIRAs have been filed.
However, since many of the existing build scripts also had to be
modified, it was more important to make sure that no regressions were
inadvertently introduced into the existing data load process. Loading
data to a local mini-cluster was checked repeatedly while this patch
was being developed, as was the Jenkins job that provides the
test-warehouse snapshot used by many other Impala CI builds that run
daily.
Change-Id: I1f443a1728a1d28168090c6f54e82dec2cb073e9
Reviewed-on: http://gerrit.cloudera.org:8080/4769
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
The Bug: Prior to this patch, a DCHECK verified that the underlying
memory pool for the scratch batch was empty in a count-based scenario.
For IMPALA-3964 (where a count(*) is performed on a nested
collection), if a Parquet column chunk is compressed, each new data
page would be decompressed and eventually placed into the underlying
scratch batch memory pool, causing the aforementioned DCHECK to fail.
This was not caught by the test suite because the TPC-H nested Parquet
data is not compressed.
The Fix: Removed the erroneous DCHECK. Added logic to determine
whether any memory in the scratch batch needs to be freed (due to the
transfer from the decompressed data pool) and, if so, to free it.
Augmented the load_nested.py script to snappy-compress each of the
tables within the 'tpch_nested_parquet' database, consistent with how
the flat TPC-H Parquet data set is stored. Regarding test coverage, a
number of existing tests already perform nested collection counts
against the tables in the 'tpch_nested_parquet' database. For
uncompressed nested Parquet, the 'test_nested_types.py' test suite
leverages the 'ComplexTypesTbl' table to provide good coverage.
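For reference, a hedged sketch of the kind of statements the
augmented load_nested.py issues (the exact SQL and table names in the
script may differ; the host is a placeholder):

  from impala.dbapi import connect

  conn = connect(host='impalad.example.com', port=21050)
  cursor = conn.cursor()
  # COMPRESSION_CODEC controls how subsequent Parquet writes are
  # encoded for this session.
  cursor.execute('SET COMPRESSION_CODEC=snappy')
  # Rewriting a table re-encodes its Parquet files with the new codec.
  cursor.execute('CREATE TABLE tpch_nested_parquet.customer_snappy '
                 'STORED AS PARQUET AS '
                 'SELECT * FROM tpch_nested_parquet.customer')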
Change-Id: Id0955c85d18dfba4bd29a35ec95d0355da050607
Reviewed-on: http://gerrit.cloudera.org:8080/3940
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace the existing ASF license text with the one given on the
website, or add it where it is missing.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
In commit 960808 I forgot to update the data-loading script for the
conversion of a shell script to a Python script. It turns out there were
a couple of other little problems too. I checked manually that the data
was loaded after these changes.
Change-Id: Id81fc423348515ab446835868025cb839c77f52c
Reviewed-on: http://gerrit.cloudera.org:8080/1851
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
The major changes are:
1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support Kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
random data generator was converted to use MapReduce for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
remote or local cluster (see the sketch after this list). This
also moves and consolidates some Cloudera Manager utilities that
were in the stress test.
7) Clean up the wrappers around Impyla; they were getting messy.
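To illustrate #6 (class and method names here are illustrative, not
necessarily the ones in the patch), the abstraction boils down to one
interface with a local and a CM-backed implementation:

  import subprocess

  class Cluster(object):
      """Something data loading and the stress test can run against."""
      def shell(self, cmd):
          raise NotImplementedError

  class MiniCluster(Cluster):
      """The local development mini-cluster."""
      def shell(self, cmd):
          return subprocess.check_output(cmd, shell=True)

  class CmCluster(Cluster):
      """A remote cluster managed by Cloudera Manager."""
      def __init__(self, cm_host):
          from cm_api.api_client import ApiResource
          self.cm = ApiResource(cm_host, username='admin',
                                password='admin')
      def shell(self, cmd):
          # Would run the command over SSH on a cluster node; elided.
          raise NotImplementedError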
Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins