To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3
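The first pipeline above can be mirrored in Python if the filter needs
to be reused from a script. This is only a sketch:
remaining_py2_references is a hypothetical helper that is not part of
this change, and its exclusion list simply copies the grep -v patterns.

```python
# Hypothetical helper mirroring the first grep pipeline; not part of the patch.
EXCLUDED = ("impala-python3", "impala-python-common", "init-impala-python")


def remaining_py2_references(grep_lines):
    """Return lines that mention impala-python but none of the excluded names."""
    return [
        line for line in grep_lines
        if "impala-python" in line
        and not any(name in line for name in EXCLUDED)
    ]
```

It could be fed the output of "git grep impala-python" split into lines.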
This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
doesn't have a main function, it removes the hash-bang and makes
sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
(or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
replaced by the cm-client pypi package and interfaces have changed.
Rather than migrating the code (which hasn't been used in years), this
deletes the old code and stops installing cm-api into the virtualenv.
The code can be restored and revamped if there is any interest in
interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
bit-rotted. Some pieces can be run manually, but it can't be fully
verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
version that supports Python 3. The newest version of kazoo requires
upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
needing other upgrades.
The two remaining uses of impala-python are:
- bin/cmake_aux/create_virtualenv.sh
- bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.
The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).
Testing:
- Ran core job
- Ran build + dataload on Centos 7, Redhat 8
- Manual testing of individual scripts (except some bitrotted areas like the
random query generator)
Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Removes deprecated ImpalaHttpClient constructor that supported port and
path as it has been deprecated since at least 2020 and appears unused.
Removes cert_file and key_file as they were also never used, and if
required must now be passed in via ssl_context.
Updates TSSLSocket fixes for Thrift 0.16 and Python 3.12. _validate_cert
was removed by Thrift 0.16, but everything worked because Thrift used
ssl.match_hostname instead. With Python 3.12 ssl.match_hostname no
longer exists so we rely on OpenSSL to handle verification with
ssl.PROTOCOL_TLS_CLIENT.
Only uses ssl.PROTOCOL_TLS_CLIENT when match_hostname is unavailable to
avoid changing existing behavior. THRIFT-792 identifies that TSocket
suppresses connection errors, where we would otherwise see SSL hostname
verification errors like
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED]
certificate verify failed: IP address mismatch, certificate is not
valid for '::1'. (_ssl.c:1131)
Python 2.7.9 and 3.2 are minimum required versions; both have been EOL
for several years.
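The selection rule described above can be sketched with the standard
ssl module alone. The function names below are hypothetical and this is
not the TSSLSocket code; it only illustrates that a context created with
ssl.PROTOCOL_TLS_CLIENT has hostname checking and certificate
verification enabled by default.

```python
import ssl


def openssl_must_verify_hostname(ssl_module=ssl):
    """True when the interpreter no longer provides ssl.match_hostname."""
    return not hasattr(ssl_module, "match_hostname")


def make_hostname_checking_context(ca_file=None):
    """Build a client context in which OpenSSL itself verifies the hostname."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # PROTOCOL_TLS_CLIENT enables both settings by default; the asserts
    # only document that expectation.
    assert context.check_hostname
    assert context.verify_mode == ssl.CERT_REQUIRED
    if ca_file is not None:
        context.load_verify_locations(cafile=ca_file)
    return context
```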
Testing:
- ran custom_cluster/{test_client_ssl.py,test_ipv6.py} on Ubuntu 24 with
Python 3.12, OpenSSL 3.0.13.
- ran custom_cluster/test_client_ssl.py on RHEL 7.9 with Python 2.7.5
and Python 3.6.8, OpenSSL 1.0.2k-fips.
- adds test that hostname checking is configured.
Change-Id: I046a9010ac4cb1f7d705935054b306cddaf8bdc7
Reviewed-on: http://gerrit.cloudera.org:8080/23519
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
TestImpalaShell.test_cancellation started failing when run with Python
3.9 with the following error message:
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'>
This patch is a quick fix that changes the stderr write from
print() to os.write(). Note that the thread-safety issue within
_signal_handler in impala_shell.py during query cancellation still
remains.
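The idea behind the fix can be sketched as follows. print() goes
through the buffered sys.stderr object, which refuses reentrant calls,
while os.write() goes straight to the file descriptor. write_all is a
hypothetical name and the real change in impala_shell.py may look
different.

```python
import os


def write_all(fd, message):
    """Write message directly to a file descriptor, bypassing buffered I/O."""
    data = message.encode("utf-8")
    while data:
        # os.write may write fewer bytes than requested, so loop until done.
        written = os.write(fd, data)
        data = data[written:]
```

A signal handler could call it as write_all(sys.stderr.fileno(), text).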
Testing:
Run and pass test_cancellation in RHEL9 with Python 3.9.
Change-Id: I5403c7b8126b1a35ea841496fdfb6eb93e83376e
Reviewed-on: http://gerrit.cloudera.org:8080/23416
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true
reveals some tests that require adjustment. This patch makes those
adjustments, which mostly revolve around encoding differences and the
string vs bytes distinction in Python3. This patch also switches the
default to run pytest with Python3 by setting
IMPALA_USE_PYTHON3_TESTS=true. The following are the details:
Change hash() function in conftest.py to crc32() to produce
deterministic hash. Hash randomization is enabled by default since
Python 3.3 (see
https://docs.python.org/3/reference/datamodel.html#object.__hash__).
This causes test sharding (like --shard_tests=1/2) to produce an
inconsistent set of tests per shard. Always restart minicluster during
custom cluster tests if --shard_tests argument is set, because test
order may change and affect test correctness, depending on whether
running on fresh minicluster or not.
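A minimal sketch of deterministic sharding with crc32 follows. The
function names and the way a test id maps to a shard are assumptions
for illustration, not the conftest.py code.

```python
import zlib


def shard_index(test_id, num_shards):
    """Map a test id to a shard in a way that is stable across runs."""
    # Unlike hash(), crc32 is not affected by per-process hash randomization.
    return zlib.crc32(test_id.encode("utf-8")) % num_shards


def select_shard(test_ids, shard, num_shards):
    """Return the tests for a 1-based shard number, as in --shard_tests=1/2."""
    return [t for t in test_ids if shard_index(t, num_shards) == shard - 1]
```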
Moved one test case from delimited-latin-text.test to
test_delimited_text.py for easier binary comparison.
Add bytes_to_str() as a utility function to decode bytes in Python3.
This is often needed when inspecting the return value of
subprocess.check_output() as a string.
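A helper of this kind might look like the sketch below. The name
decode_output and the UTF-8 default are assumptions; the actual
bytes_to_str() may differ.

```python
import subprocess


def decode_output(value, encoding="utf-8"):
    """Return text for bytes input and pass str input through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def run_and_capture(argv):
    """Run a command and return its stdout as text."""
    return decode_output(subprocess.check_output(argv))
```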
Implement DataTypeMetaclass.__lt__ to substitute
DataTypeMetaclass.__cmp__ that is ignored in Python3 (see
https://peps.python.org/pep-0207/).
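The effect of defining __lt__ on a metaclass can be seen in this small
sketch. The class names and the ordering by class name are illustrative
only; the ordering rule used by DataTypeMetaclass is not shown in this
message.

```python
class OrderedByName(type):
    """Metaclass that makes the classes themselves sortable."""

    def __lt__(cls, other):
        # Python 3 ignores __cmp__; sorted() only needs __lt__.
        return cls.__name__ < other.__name__


class Char(metaclass=OrderedByName):
    pass


class Int(metaclass=OrderedByName):
    pass


ordered = sorted([Int, Char])
```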
Fix WEB_CERT_ERR difference in test_ipv6.py.
Fix trivial integer parsing in test_restart_services.py.
Fix various encoding issues in test_saml2_sso.py,
test_shell_commandline.py, and test_shell_interactive.py.
Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1.
Switch to binary comparison in test_iceberg.py where needed.
Specify text mode when calling tempfile.NamedTemporaryFile().
Simplify create_impala_shell_executable_dimension to skip testing dev
and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The reason
is that several UTF-8 related tests in test_shell_commandline.py break
in the Python3 pytest + Python2 impala-shell combo. This skipping
already happens automatically on build OSes without a system Python2,
like RHEL9 (IMPALA_SYSTEM_PYTHON2 env var is empty).
Removed unused vector argument and fixed some trivial flake8 issues.
Several tests require modification due to intermittent issues in
Python3 pytest. These include:
Add _run_query_with_client() in test_ranger.py to allow reusing a single
Impala client for running several queries. Ensure clients are closed
when the test is done. Mark several tests in test_ranger.py with
SkipIfFS.hive because they run queries through beeline + HiveServer2,
but the Ozone and S3 build environments do not start HiveServer2 by
default.
Increase the sleep period from 0.1 to 0.5 seconds per iteration in
test_statestore.py and mark TestStatestore to execute serially. This is
because TServer appears to shut down more slowly when run concurrently
with other tests. Handle the deprecation of Thread.setDaemon() as well.
Always set force_restart=True for each test method in TestLoggingCore,
TestShellInteractiveReconnect, and TestQueryRetries to prevent them from
reusing the minicluster from the previous test method. Some of these
tests damage the minicluster (kill impalad) and will produce a minidump
if the metrics verifier for the next test fails to detect a healthy
minicluster state.
Testing:
Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true.
Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4
Reviewed-on: http://gerrit.cloudera.org:8080/23319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
make_shell_tarball.sh previously used the system Python to copy the
required packages to each extra Python version. This caused version
mismatches as the compiled modules are not portable between Python
versions. This change modifies the logic to use the cache of the
provided extra version instead of the system Python.
Tests:
- manually validated that each provided extra Python contains its
respective compiled dependencies
Change-Id: Iaee9b3a98b73fd1faf1b7c8ba4b388722add6fb4
Reviewed-on: http://gerrit.cloudera.org:8080/23160
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Main changes:
- added flag external_interface to override hostname for
beeswax/hs2/hs2-http port to allow testing ipv6 on these
interfaces without forcing ipv6 on internal communication
- compile Squeasel with USE_IPV6 to allow ipv6 on webui (webui
interface can be configured with existing flag webserver_interface)
- fixed the handling of [<ipv6addr>]:<port> style addresses in
impala-shell (e.g. [::1]:21050) and test framework
- improved handling of custom clusters in test framework to
allow webui/ImpalaTestSuite's clients to work with non
standard settings (also fixes these clients with SSL)
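Parsing of bracketed addresses such as [::1]:21050 can be sketched as
below. split_host_port is a hypothetical name, and treating an
unbracketed value with several colons as a bare IPv6 literal is an
assumption, not necessarily what impala-shell does.

```python
def split_host_port(address, default_port=None):
    """Split 'host:port', '[ipv6]:port' or a bare host into (host, port)."""
    if address.startswith("["):
        end = address.find("]")
        if end == -1:
            raise ValueError("missing ']' in %r" % address)
        host = address[1:end]
        rest = address[end + 1:]
        if not rest:
            return host, default_port
        if not rest.startswith(":"):
            raise ValueError("unexpected text after ']' in %r" % address)
        return host, int(rest[1:])
    if address.count(":") == 1:
        host, port = address.split(":")
        return host, int(port)
    # No colon: bare host name. Several colons: unbracketed IPv6 literal.
    return address, default_port
```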
Using ipv4 vs ipv6 vs dual stack can be configured by setting
the interface to bind to with flag webserver_interface and
external_interface. The Thrift server behind hs2/hs2-http/beeswax
only accepts a single host name and uses the first address
returned by getaddrinfo() that it can successfully bind to. This
means that unless an ipv6 address is used (like ::1) the behavior
will depend on the order of addresses returned by getaddrinfo():
63b7a263fc/lib/cpp/src/thrift/transport/TServerSocket.cpp (L481)
For dual stack the only way currently is to bind to "::",
as the Thrift server can only listen on a single socket.
Testing:
- added custom cluster tests for ipv6 only/dual interface
with and without SSL
- manually tested in dual stack environment with client on a
different host
- among clients impala-shell and impyla are tested, but not
JDBC/ODBC
- no tests yet on truly ipv6 only environment, as internal
communication (e.g. krpc) is not ready for ipv6
To test manually the dev cluster can be started with ipv6 support:
dual mode:
bin/start-impala-cluster.py --impalad_args="--external_interface=:: --webserver_interface=::" --catalogd_args="--webserver_interface=::" --state_store_args="--webserver_interface=::"
ipv6 only:
bin/start-impala-cluster.py --impalad_args="--external_interface=::1 --webserver_interface=::1" --catalogd_args="--webserver_interface=::1" --state_store_args="--webserver_interface=::1"
Change-Id: I51ac66c568cc9bb06f4a3915db07a53c100109b6
Reviewed-on: http://gerrit.cloudera.org:8080/22527
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the support to fetch access tokens
from the OAuth Server using the OAuth client_id and
client_secret if the access token is not provided.
It covers the flow: client_credentials.
The client_secret can either be passed as a file or
entered at a prompt.
Added a test-only impala-shell param, oauth_mock_response_cmd,
to mock the OAuth server response.
Also suppressed existing option hs2_x_forward from the
impala --help output.
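For reference, a client_credentials token request as defined by RFC
6749 section 4.4 can be built as in the sketch below. The function name,
the https scheme, and sending the credentials in the form body (rather
than in an HTTP Basic header) are assumptions; this message does not say
which variant impala-shell uses. The sketch only builds the request and
does not send it.

```python
from urllib.parse import urlencode


def build_token_request(server, endpoint, client_id, client_secret):
    """Return (url, headers, body) for a client_credentials token request."""
    url = "https://%s%s" % (server, endpoint)
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    # urlencode percent-encodes its output, so ASCII is sufficient here.
    return url, headers, body.encode("ascii")
```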
Testing(okta oauth server):
- Added custom_cluster tests in test_shell_jwt_auth.py:
test_oauth_auth_with_clientid_and_secret_success
test_oauth_auth_with_clientid_and_secret_failure
- Tested manually by providing --user <user> and
--oauth_client_secret_cmd="cat password_file.txt"
- Tested manually by providing --user <user> and no
--oauth_client_secret_cmd, thereby prompting the user
to enter the client_secret.
Example command: impala-shell.sh -a
--auth_creds_ok_in_clear --protocol="hs2-http"
--oauth_client_id="client_id"
--oauth_client_secret_cmd="cat client_secret.txt"
--oauth_server="dev.us.auth01.com"
--oauth_endpoint="/oauth/token"
Change-Id: I84e26d54f6a53696660728efb239ffd43de4c55d
Reviewed-on: http://gerrit.cloudera.org:8080/22424
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The tarball packaging for impala-shell ships support for
multiple Python versions (including both Python 2 and Python 3).
The impala-shell script determines the Python to use
and uses the corresponding installation. Historically, impala-shell
has preferred the "python" executable (which can be Python 2) to
the "python3" executable. Since Python 2 is deprecated, this flips
the preference to prefer "python3" to "python".
This continues to respect IMPALA_PYTHON_EXECUTABLE as before, but
it adds an IMPALA_SHELL_PYTHON_FALLBACK variable to determine
whether to fall back to the regular logic. This defaults to
true, allowing fallback, to maintain existing behavior. The
shell end-to-end tests set this to false to lock in the
Python version.
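One reading of the selection order is sketched below in Python, although
the real logic lives in the impala-shell launcher script. The function
name, the injected is_available callback, and the exact handling of an
unusable IMPALA_PYTHON_EXECUTABLE are assumptions.

```python
def choose_python(env, is_available):
    """Pick an interpreter name.

    env is a mapping like os.environ; is_available(name) reports whether
    that interpreter can be used.
    """
    explicit = env.get("IMPALA_PYTHON_EXECUTABLE")
    if explicit and is_available(explicit):
        return explicit
    fallback = env.get("IMPALA_SHELL_PYTHON_FALLBACK", "true").lower() == "true"
    if explicit and not fallback:
        raise RuntimeError("%s is not usable and fallback is off" % explicit)
    # Regular logic: prefer python3 over python.
    for candidate in ("python3", "python"):
        if is_available(candidate):
            return candidate
    raise RuntimeError("no usable Python interpreter found")
```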
Testing:
- Ran shell tests
Change-Id: If0e32e8eee672e4dc66e725722f5150cd1e4c9a6
Reviewed-on: http://gerrit.cloudera.org:8080/22953
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
When running the shell in a terminal with live_progress=true, live
progress overwrites its output by using the ANSI up character to
rewrite lines with updates on the query progress. On Python 3,
we found that the updates to clear the live progress were overwriting
the actual output in the terminal, e.g.
+----------+
| count(*) |
+----------+
Fetched 1 row(s) in 5.20s
To avoid this, the live progress lines need to be fully flushed to stderr
before starting to output the result to stdout. This adds a flush call
in OverwritingStdErrOutputStream::clear() to force this.
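The pattern can be sketched with a simplified stand-in class. This is
not the OverwritingStdErrOutputStream code; the class name and the use
of the erase-line escape are assumptions for illustration.

```python
class OverwritingStream(object):
    """Simplified stand-in that rewrites its previous output in place."""

    UP = "\x1b[A"      # ANSI: move the cursor up one line
    ERASE = "\x1b[2K"  # ANSI: erase the entire current line

    def __init__(self, stream):
        self.stream = stream
        self.last_line_count = 0

    def write(self, text):
        """Replace the previously written lines with text."""
        self.clear()
        self.stream.write(text)
        self.last_line_count = text.count("\n")

    def clear(self):
        """Erase the previous lines and flush before anything else is printed."""
        self.stream.write((self.UP + self.ERASE) * self.last_line_count)
        self.last_line_count = 0
        # Without this flush the escape codes can reach the terminal after
        # the query results that are written to stdout.
        self.stream.flush()
```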
Testing:
- Hand tested queries with live progress
- Added test that redirects stdout and stderr to the same file and
verifies that no ANSI up character comes after the query output
Change-Id: Id2e21224253f76b2a04767a57b3ade49ce2c914f
Reviewed-on: http://gerrit.cloudera.org:8080/22941
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Python 3 changed the behavior of imports with PEP328. Existing
imports become absolute unless they use the new relative import
syntax. This adapts the impala-shell code to use absolute
imports, fixing issues where it is imported from our test code.
There are several parts to this:
1. It moves impala shell code into shell/impala_shell.
This matches the directory structure of the PyPi package.
2. It changes the imports in the shell code to be
absolute paths (i.e. impala_shell.foo rather than foo).
This fixes issues with Python 3 absolute imports.
It also eliminates the need for ugly hacks in the PyPi
package's __init__.py.
3. This changes Thrift generation to put it directly in
$IMPALA_HOME/shell rather than $IMPALA_HOME/shell/gen-py.
This means that the generated Thrift code is rooted in
the same directory as the shell code.
4. This changes the PYTHONPATH to include $IMPALA_HOME/shell
and not $IMPALA_HOME/shell/gen-py. This means that the
test code is using the same import paths as the pypi
package.
With all of these changes, the source code is very close
to the directory structure of the PyPi package. As long as
CMake has generated the thrift files and the Python version
file, only a few differences remain. This removes those
differences by moving the setup.py / MANIFEST.in and other
files from the packaging directory to the top-level
shell/ directory. This means that one can pip install
directly from the source code. i.e. pip install $IMPALA_HOME/shell
This also moves the shell tarball generation script to the
packaging directory and changes bin/impala-shell.sh to use
Python 3.
This sorts the imports using isort for the affected Python files.
Testing:
- Ran a regular core job with Python 2
- Ran a core job with Python 3 and verified that the absolute
import issues are gone.
Change-Id: Ica75a24fa6bcb78999b9b6f4f4356951b81c3124
Reviewed-on: http://gerrit.cloudera.org:8080/22330
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
In beeswax all statements with the exception of USE print
'Fetched X row(s) in Ys', while in HS2 some statements (REFRESH,
INVALIDATE METADATA) do not print it. While these statements always
return 0 rows, the amount of time spent with the statement can be
useful.
This patch modifies impala-shell to print the elapsed time for such a
query, even if the query is not expected to return result metadata.
Added --beeswax_compat_num_rows option in impala-shell. It defaults to
False. If this option is set (True), 'Fetched 0 row(s) in' will be
printed for all Impala protocols, just like beeswax. One exception is
the USE query, which will remain silent.
Testing:
- Added test_beeswax_compat_num_rows in test_shell_interactive.py.
- Pass test_shell_interactive.py.
Change-Id: Id76ede98c514f73ff1dfa123a0d951e80e7508b4
Reviewed-on: http://gerrit.cloudera.org:8080/22813
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This puts all of the thrift-generated python code into the
impala_thrift_gen package. This is similar to what Impyla
does for its thrift-generated python code, except that it
uses the impala_thrift_gen package rather than impala._thrift_gen.
This is a preparatory patch for fixing the absolute import
issues.
This patches all of the thrift files to add the python namespace.
This has code to apply the patching to the thirdparty thrift
files (hive_metastore.thrift, fb303.thrift) to do the same.
Putting all the generated python into a package makes it easier
to understand where the imports are getting code. When the
subsequent change rearranges the shell code, the thrift generated
code can stay in a separate directory.
This uses isort to sort the imports for the affected Python files
with the provided .isort.cfg file. This also adds an impala-isort
shell script to make it easy to run.
Testing:
- Ran a core job
Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0
Reviewed-on: http://gerrit.cloudera.org:8080/20169
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
This patch implements building the exec summary table for
ImpylaHS2Connection. It adds a fetch_exec_summary argument in
ImpalaConnection.execute(). If this argument is True, an exec summary
table will be added into the returned result object.
fetch_exec_summary is also implemented for BeeswaxConnection. Thus,
BeeswaxConnection no longer fetches the exec summary on every call.
Tests that validate the exec summary table are updated to set
fetch_exec_summary=True and migrated to test against hs2 protocol.
Change TestExecutorGroup._set_query_options() to do query option setting
through hs2_client iconfig instead of SET query. Some flake8 issues are
addressed as well.
Move build_exec_summary_table to separate exec_summary.py file. Tweak it
a bit to return early if given TExecSummary is empty.
Fixed bug in ImpalaBeeswaxClient.fetch_results() where fetch will not
happen at all if discard_result argument is True.
Testing:
- Run and pass affected tests locally.
Change-Id: I7d88f78e58eeda29ce21e7828884c7a129d7efe6
Reviewed-on: http://gerrit.cloudera.org:8080/22626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change impala-shell could not be installed on Python 3.11
due to a compilation failure in python-sasl. Checked installation
on Python 3.11/3.12/3.13.
Also bumps impyla version to 0.21a2.
Change-Id: I4efdd105e489e1d0a996d156fb7efbb6fad8da7d
Reviewed-on: http://gerrit.cloudera.org:8080/22593
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
This patch added OAuth support with following functionality:
* Load and parse OAuth JWKS from configured JSON file or url.
* Read the OAuth Access token from the HTTP Header which is
the same format as JWT Authorization Bearer token.
* Verify the OAuth's signature with public key in JWKS.
* Get the username out of the payload of OAuth Access token.
* If kerberos or ldap is enabled, then both jwt and oauth are
supported together. Else only one of jwt or oauth is supported.
This has been a pre existing flow for jwt. So OAuth will follow
the same policy.
* Impala Shell side changes: OAuth options -a and --oauth_cmd
Testing:
- Added 3 custom cluster be tests in test_shell_jwt_auth.py:
- test_oauth_auth_valid: authenticate with valid token.
- test_oauth_auth_expired: authentication failure with
expired token.
- test_oauth_auth_invalid_jwk: authentication failure with
valid signature but expired.
- Added 1 custom cluster fe test in JwtWebserverTest.java
- testWebserverOAuthAuth: Basic tests for OAuth
- Added 1 custom cluster fe test in LdapHS2Test.java
- testHiveserver2JwtAndOAuthAuth: tests all combinations of
jwt and oauth token verification with separate jwks keys.
- Manually tested with a valid, invalid and expired oauth
access token.
- Passed core run.
Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6
Reviewed-on: http://gerrit.cloudera.org:8080/21728
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the shell tarball maintains its own packaging code
and directory layout. This is very complicated and currently has
several Python packages directly checked into our repository.
To simplify it, this changes the shell tarball to be based on
pip installing the pypi package. Specifically, the new directory
structure for an unpacked shell tarball is:
impala-shell-4.5.0-SNAPSHOT/
impala-shell
install_py${PYTHON_VERSION}/
install_py${ANOTHER_PYTHON_VERSION}/
For example, install_py2.7 is the Python 2.7 pip install of impala-shell.
install_py3.8 is a Python 3.8 pip install of impala-shell. This means
that the impala-shell script simply picks the install for the
specified version of python and uses that pip install directory.
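The directory naming can be expressed as the sketch below. The real
selection happens in the impala-shell launcher script; install_dir_name
is a hypothetical Python equivalent.

```python
import sys


def install_dir_name(version_info=None):
    """Return the pip install directory name for a Python version."""
    major, minor = (version_info or sys.version_info)[:2]
    return "install_py%d.%d" % (major, minor)
```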
To make this more consistent across different Linux distributions, this
upgrades pip in the virtualenv to the latest.
With this, ext-py and pkg_resources.py can be removed.
This requires rearranging the shell build code. Specifically, this splits
out the code that generates impala_build_version.py so that it can run
before generating the pypi package. The shell tarball now has a dependency
on the pypi package and must run after it.
This builds on Michael Smith's work from IMPALA-11399.
Testing:
- Ran shell tests locally
- Built on Centos 7, Redhat 8 & 9, Ubuntu 20 & 22, SLES 15
Change-Id: Ifbb66ab2c5bc7180221f98d9bf5e38d62f4ac036
Reviewed-on: http://gerrit.cloudera.org:8080/20171
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Calls to both of these RPC endpoints were previously logged at
VLOG_RPC (or VLOG(2)). This patch changes the log level to VLOG_QUERY
(or VLOG(1)). This is helpful because both RPCs are usually called after
query execution completes, while the query handle is not yet released.
They are also rarely called by clients, so they will not be too noisy.
The missing query driver log in GetAllQueryHandles is moved to its
caller, where the log message is clarified.
ImpalaShell._execute_stmt() is also modified to call
get_runtime_profile() only if the show_profile option is true.
Testing:
- Using impala-shell, run a TPC-DS query followed by 'profile' and
summary command. Verify that logs are printed, both with beeswax and
HS2 protocol.
- Pass core tests.
Change-Id: I90ef7d0fadd81c58ec1072e53430f51fea146cf1
Reviewed-on: http://gerrit.cloudera.org:8080/22085
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch unifies the duplicated exec summary code used by the python
beeswax clients: one used by the shell in impala_shell.py and one used
by tests in impala_beeswax.py. The code that has progressed furthest is
the one in shell/impala_client.py, which can print the correct exec
summary table for MT_DOP>0 queries. It is made into a dedicated
build_exec_summary_table function in impala_client.py, and
impala_beeswax.py then imports it from impala_client.py.
This patch also fixes several flake8 issues around the modified files.
Testing:
- Manually run TPC-DS Q74 in impala-shell and then type "summary"
command. Confirm that plan tree is displayed properly.
- Run single_node_perf_run.py over branches that produce different
TPC-DS Q74 plan trees. Confirm that the plan trees are displayed
correctly in performance_result.txt
Change-Id: Ica57c90dd571d9ac74d76d9830da26c7fe20c74f
Reviewed-on: http://gerrit.cloudera.org:8080/22060
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
This change prints the timestamp of an exception or warning that
occurs during execution of a query via impala-shell.
The timestamp uses the timezone of the machine running impala-shell.
Example:
Query submitted at: 2024-08-22 16:17:57 (Coordinator: http://host:25000)
Query state can be monitored at:
http://localhost:25000/query_plan?query_id=e04dcc55e560d1ee:11173fe800000000
^C Cancelling Query
Opened TCP connection to localhost:21050
2024-08-22 16:17:58 [Exception] type=<class 'socket.error'> in FetchResults.
[Errno 4] Interrupted system call
2024-08-22 16:17:58 [Warning] Cancelling Query
2024-08-22 16:17:58 [Warning] close session RPC failed: <class
'shell_exceptions.QueryCancelledByShellException'>
Opened TCP connection to localhost:21050
[localhost:21050] default>
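The message prefix shown in the example can be reproduced with the
sketch below. format_shell_message is a hypothetical name; the format
string is inferred from the sample output.

```python
from datetime import datetime


def format_shell_message(level, text, now=None):
    """Prefix a warning or exception message with a local timestamp."""
    # datetime.now() returns naive local time, i.e. the shell machine's zone.
    now = now or datetime.now()
    return "%s [%s] %s" % (now.strftime("%Y-%m-%d %H:%M:%S"), level, text)
```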
Change-Id: I4abbd02aa9f61210b0333495bf191e72c22a5944
Reviewed-on: http://gerrit.cloudera.org:8080/21426
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
MERGE statement is a DML command that allows users to perform
conditional insert, update, or delete operations on a target table based
on the results of a join with a source table. This change adds MERGE
statement parsing and an Iceberg-specific semantic analysis, planning,
and execution. The parsing grammar follows the SQL standard: it accepts
the same syntax as Hive, Spark, and Trino by supporting an arbitrary
number of WHEN clauses, with or without conditions, and accepting inline
views as source.
Example:
'MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED AND t.id < 100 THEN UPDATE SET column1 = s.column1
WHEN MATCHED AND t.id > 100 THEN DELETE
WHEN MATCHED THEN UPDATE SET column1 = "value"
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.column1);'
The Iceberg-specific analysis, planning, and execution are based on a
concept that was previously used for UPDATE: The analyzer creates a
SELECT statement with all target and source columns (including
Iceberg's virtual columns) and a 'row_present' column that defines
whether the source, the target, or both rows are present in the result
set after joining the two table references by the ON clause. The join
condition should be an equi-join, as it is a FULL OUTER JOIN, and Impala
currently supports only equi-joins in this case. The joining order is
forced by a query hint, which guarantees that the target table is always
on the left side.
A new IcebergMergeNode is added at the planning phase; this node does
the row-level filtering for each MATCHED / NOT MATCHED case. The
'row_present' column decides which case group will be evaluated: if
both sides are available, the matched cases are evaluated over the row;
if only the source side is present, the not matched cases and their
filter expressions are evaluated. If one of the cases matches, then the
execution evaluates the result expressions into the output row batch,
and an auxiliary tuple stores the merge action. The merge action is
a flag for the newly added IcebergMergeSink; this sink routes each
incoming row from IcebergMergeNode to its respective destination. Each
row could go to the delete sink, insert sink, or to both sinks.
Target-side duplicate records are filtered during IcebergMergeNode's
execution: if a target-side duplicate is detected, the whole
statement's execution is stopped and the error is reported back to the
user.
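The case selection and sink routing described above can be modelled in a
few lines of Python. This is a conceptual sketch only: the names are
hypothetical and the mapping of UPDATE to both sinks is an assumption
that this message does not state explicitly.

```python
BOTH, SOURCE_ONLY, TARGET_ONLY = "both", "source", "target"


def pick_case(row_present, row, matched_cases, not_matched_cases):
    """Return the action of the first case whose filter accepts the row."""
    if row_present == BOTH:
        cases = matched_cases
    elif row_present == SOURCE_ONLY:
        cases = not_matched_cases
    else:
        return None  # target-only rows are left untouched
    for predicate, action in cases:
        if predicate(row):
            return action
    return None


def sinks_for(action):
    """Map a merge action to the sinks that receive the row."""
    routing = {
        "DELETE": ("delete",),
        "INSERT": ("insert",),
        "UPDATE": ("delete", "insert"),  # assumption: old row out, new row in
    }
    return routing.get(action, ())
```

With the WHEN clauses from the example statement, a matched row with
id 150 selects DELETE and an unmatched source row selects INSERT.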
Added tests:
- Parser tests
- Analyzer tests
- Unit test for WHEN NOT MATCHED INSERT column collation
- Planner tests for partitioned/sorted cases
- Authorization tests
- E2E tests
Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268
Reviewed-on: http://gerrit.cloudera.org:8080/21423
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When using hs2-http protocol, http messages from Impala clients may pass
through one or more proxies before reaching the Impala coordinator.
This can make it harder to track the origin of the http messages. The
'X-Forwarded-For' header is added to or edited by HTTP proxies when
forwarding a request, so it may contain multiple source addresses. Add
the value of this header to the runtime profile so that it can be
observed.
Impala will truncate the 'X-Forwarded-For' header value at 8096
characters. Apart from this, Impala does not do any verification or
sanitization of this value, so its value should only be trusted if the
deployment environment protects against spoofing.
A good reference for understanding the use of 'X-Forwarded-For' is
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For
This patch does not address the cases where http proxies insert
multiple 'X-Forwarded-For' headers. This issue is tracked in
IMPALA-13335.
TESTING: add an option '--hs2_x_forward' to impala-shell which will
set the 'X-Forwarded-For' header. Add tests which verify that the value
is set in the profile, and that a long value is truncated correctly.
Change-Id: I2e010cfb09674c5d043ef915347c3836696e03cf
Reviewed-on: http://gerrit.cloudera.org:8080/21700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, Impala does an execute call, then the client polls
waiting for the operation to finish (or error out). The client
sleeps between polls, and this sleep time can be a substantial
percentage of a short query's execution time.
To reduce this client side sleep, this implements long polling to
provide an option to wait for query completion on the server side.
This is controlled by the long_polling_time_ms query option. If
set to greater than zero, status RPCs will wait for query
completion for up to that amount of time. This defaults to off (0ms).
Both Beeswax and HS2 add a wait for query completion in their
get status calls (get_state for Beeswax, GetOperationStatus for HS2).
This doesn't wait in the execute RPC calls (e.g. query for Beeswax,
ExecuteStatement for HS2), because neither includes the query status
in the response. The client will always need to do a separate status
RPC.
This modifies impala-shell and the beeswax client to avoid doing a
sleep if the get_state/GetOperationStatus calls take longer than
they would have slept. In other words, if they would have slept 50ms,
then they skip that sleep if the RPC to the server took longer than
50ms. This allows the client to maintain its sleep behavior with
older Impalas that don't use long polling while adapting properly
to systems that do have long polling. This has the added benefit
that it adjusts for high latency to the server. This
does not change any of the sleep times.
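A minimal sketch of the client-side behavior described above, not the
actual impala-shell code. The get_state callable, the state names and
the injectable clock/sleep parameters are hypothetical stand-ins for
the status RPC and its timing:

```python
import time


def wait_for_completion(get_state, sleep_s=0.05,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll get_state() until it reports a terminal state.

    get_state is a hypothetical stand-in for the status RPC
    (get_state for Beeswax, GetOperationStatus for HS2).
    """
    while True:
        start = clock()
        state = get_state()
        rpc_time = clock() - start
        if state in ("FINISHED", "ERROR"):  # illustrative state names
            return state
        # Skip the client-side sleep when the RPC itself already took at
        # least as long as we would have slept (long polling on the
        # server, or simply high latency).
        if rpc_time < sleep_s:
            sleep(sleep_s)
```

Against an older server the RPC returns quickly and the loop sleeps as
before; against a server with long polling enabled the RPC blocks and
the sleep is skipped.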
Testing:
- This adds a test case in test_hs2.py to verify the long
polling behavior
Change-Id: I72ca595c5dd8a33b936f078f7f7faa5b3f0f337d
Reviewed-on: http://gerrit.cloudera.org:8080/19205
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the query id to the error messages in both
- the result of the `get_log()` RPC, and
- the error message in an RPC response
before they are returned to the client, so that the users can easily
figure out the errored queries on the client side.
To achieve this, the query id of the thread debug info is set in the
RPC handler method, and is retrieved from the thread debug info each
time the error reporting function or `get_log()` gets called.
Due to the change of the error message format, some checks in the
impala-shell.py are adapted to keep them valid.
Testing:
- Added helper function `error_msg_expected()` to check whether an
error message is expected. It is stricter than only using the `in`
operator.
- Added helper function `error_msg_equal()` to check if two error
messages are equal regardless of the query ids.
- Various test cases are adapted to match the new error message format.
- `ImpalaBeeswaxException`, which is used in tests only, is simplified
so that it has the same error message format as the exceptions for
HS2.
- Added an assertion to the case of killing and restarting a worker
in the custom cluster test to ensure that the query id is in
the error message in the client log retrieved with `get_log()`.
Change-Id: I67e659681e36162cad1d9684189106f8eedbf092
Reviewed-on: http://gerrit.cloudera.org:8080/21587
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
It can be useful to get a stacktrace for a running impala-shell
for debugging. This uses Python 3's faulthandler to handle the
SIGUSR1, so it prints a stacktrace for all threads when it
receives SIGUSR1.
This does not implement an equivalent functionality for Python 2.
Python 2 doesn't have the faulthandler library, and hand tests
showed that sending SIGUSR1 to Python 2 impala-shell can interrupt
network calls and abort a running query.
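For illustration, registering such a handler with the standard library
could look roughly like this. The function name and the `out`
parameter are hypothetical; only faulthandler.register() itself is the
real API:

```python
import faulthandler
import signal
import sys


def install_stacktrace_handler(out=None):
    """Dump the tracebacks of all threads to `out` on SIGUSR1.

    Returns False on platforms without SIGUSR1 (e.g. Windows).
    `out` must be a real file object with a file descriptor and must
    stay open for as long as the handler is registered.
    """
    if not hasattr(signal, "SIGUSR1"):
        return False
    if out is None:
        out = sys.stderr
    faulthandler.register(signal.SIGUSR1, file=out, all_threads=True)
    return True
```

With the handler installed, `kill -USR1 <pid>` prints the stack of
every thread without terminating the process.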
Testing:
- Added a test that verifies the stacktrace is printed and a
running query succeeds.
Change-Id: If7dae2686b65a1a4f02488abadca3b3c90e48bf1
Reviewed-on: http://gerrit.cloudera.org:8080/21611
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Updates impala-shell to preserve all cookies by default, defined as
setting 'http_cookie_names=*'. Prior behavior of restricting cookies to
a user-specified list is preserved when 'http_cookie_names' is given any
value besides '*'. Setting 'http_cookie_names=' prevents any cookies
from being preserved.
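The three cases can be sketched as follows. This is an illustration of
the rule, assuming the option value is a comma-separated list of names;
the helper name is hypothetical and the real matching in impala-shell
may differ (for example in case handling):

```python
def should_preserve(cookie_name, http_cookie_names="*"):
    """Return True if a cookie should be kept under the given setting."""
    if http_cookie_names == "*":
        # Default: preserve every cookie.
        return True
    # Any other value is treated as an explicit list; an empty value
    # yields an empty list, so no cookie is preserved.
    allowed = [n.strip() for n in http_cookie_names.split(",") if n.strip()]
    return cookie_name in allowed
```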
Adds verbose output that prints all cookies that are preserved by the
HTTP client.
Existing cookie tests with LDAP still work. Adds a test where Impala
returns an extra cookie, and test verifies that verbose mode prints all
expected cookies.
Change-Id: Ic81f790288460b086ab218e6701e8115a996dfa7
Reviewed-on: http://gerrit.cloudera.org:8080/19827
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
The issue occurred in Python 3 when 0 rows were deleted from Iceberg.
It could also happen in other DMLs with older Impala servers where
TDmlResult.rows_deleted was not set. See the Jira for details of
the error.
Testing:
Extended shell tests for Kudu DML reporting to also cover Iceberg.
Change-Id: I5812b8006b9cacf34a7a0dbbc89a486d8b454438
Reviewed-on: http://gerrit.cloudera.org:8080/21284
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If the Impala server has an older version that does not contain
IMPALA-12048 then TExecProgress.total_fragment_instances will be
None, leading to an error when checking total_fragment_instances > 0.
Note that this issue only occurs with Python 3; in Python 2, None > 0
returns False.
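A None-safe form of the check might look like this (a sketch; the
helper name is hypothetical and the actual fix may be written
differently):

```python
def has_fragment_progress(total_fragment_instances):
    """True only when the server reported a positive instance count.

    In Python 3, `None > 0` raises TypeError, so None has to be ruled
    out before the comparison.
    """
    return (total_fragment_instances is not None
            and total_fragment_instances > 0)
```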
Testing:
- Manually checked with a modified Impala that doesn't set
total_fragment_instances. Only the scanner progress bar is shown
in this case.
Change-Id: Ic6562ff6c908bfebd09b7612bc5bcbd92623a8e6
Reviewed-on: http://gerrit.cloudera.org:8080/21256
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zihao Ye <eyizoha@163.com>
This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
* UPDATE value of partitioning column
* UPDATE table that went through partition evolution
* Table has SORT BY properties
The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.
This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.
To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.
The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:
UPDATE tbl SET k = 5 WHERE i % 100 = 11;
The generated source statement will be ==>
SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
FROM tbl WHERE i % 100 = 11;
Then we create two table sinks that refer to expressions from the above
source statement:
Insert sink (i, s, 5)
Delete sink (INPUT__FILE__NAME, FILE__POSITION)
The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUT__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.
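The forwarding idea can be modeled in a few lines of Python. This is a
conceptual toy, not the backend implementation: ListSink, MultiSink and
the slot indexes are hypothetical names chosen for the example, with
row layout (i, s, 5, INPUT__FILE__NAME, FILE__POSITION):

```python
class ListSink:
    """Toy sink that keeps only the tuple slots it was configured with."""

    def __init__(self, slots):
        self.slots = slots
        self.rows = []

    def send(self, batch):
        for row in batch:
            self.rows.append(tuple(row[i] for i in self.slots))


class MultiSink:
    """Forwards every row batch, unchanged, to all registered sinks."""

    def __init__(self, sinks):
        self.sinks = sinks

    def send(self, batch):
        for sink in self.sinks:
            sink.send(batch)


insert_sink = ListSink((0, 1, 2))  # (i, s, 5)
delete_sink = ListSink((3, 4))     # (INPUT__FILE__NAME, FILE__POSITION)
multi = MultiSink([insert_sink, delete_sink])
multi.send([(111, "a", 5, "f1.parquet", 0),
            (211, "b", 5, "f1.parquet", 7)])
```

Each sink sees the full tuple and picks out only the expressions it
needs, which is why a single sorted stream can feed both the data files
and the position delete files.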
For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.
Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.
Why does this patch have the limitations?
- Because we are shuffling and sorting rows based on the delete
records and their partitions. This means that the new data files
might not get written in an efficient way, e.g. there will be
too many of them, or we will need to keep too many open file
handles during writing.
Also, if the table has SORT BY properties, we cannot respect
it as the input rows are ordered in a way to favor the position
deletes.
Part 3 will introduce a buffering writer for position delete
files. This means we will shuffle and sort records based on
the data records' partitions and SORT BY properties while
delete records get buffered and written out at the end (sorted
by file_path and position). In some edge cases the delete records
might not get written efficiently, but it is a smaller problem
than inefficient data files.
Testing:
* negative tests
* planner tests
* update all supported data types
* partitioned tables
* Impala/Hive interop tests
* authz tests
* concurrent tests
Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Reviewed-on: http://gerrit.cloudera.org:8080/20677
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
in python 3 environment when kerberos_host_fqdn option is used
In Python 2, the sasl layer does not accept unicode strings,
so we have to explicitly encode the kerberos_host_fqdn string
to ascii. However, this is not the case in python 3, where
we have to omit the encode, because if we don't do this,
impala-shell wants to use the following service principal
during Kerberos auth:
my_service_name/b'my.kerberos.host.fqdn'@MY.REALM
instead of the correct one, which is:
my_service_name/my.kerberos.host.fqdn@MY.REALM
(This is because the output of the encode function
is a byte array in python 3.)
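The effect is easy to reproduce. In the sketch below, kerberos_host()
and principal() are hypothetical helpers; the principal is really
assembled inside the sasl layer, and the format string only imitates
that step:

```python
import sys


def kerberos_host(fqdn):
    """Return the host name in the form the sasl layer expects."""
    if sys.version_info.major < 3:
        # Python 2: the sasl layer does not accept unicode strings.
        return fqdn.encode("ascii")
    # Python 3: keep the str; encoding would produce a bytes object.
    return fqdn


def principal(service, fqdn, realm):
    # Illustrative only: mimics how a bytes value leaks into the name.
    return "{0}/{1}@{2}".format(service, kerberos_host(fqdn), realm)
```

Formatting a bytes object into a str in Python 3 yields its repr, which
is where the stray b'...' in the principal comes from.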
Tested with new unit tests and with a snapshot build
manually in CDP PVC DS.
Change-Id: I8b157d76824ad67faf531a529256a8afe2ab9d49
Reviewed-on: http://gerrit.cloudera.org:8080/20691
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
This patch modifies the dynamic query progress reporting in impala-shell
by adding an extra query progress bar below the scan progress bar.
The query progress is calculated using the number of completed fragment
instances divided by the total number of fragment instances. Compared to
the scan progress, which is calculated based on completed scan ranges
divided by the total scan ranges, the query progress provides a more
accurate reflection of the actual completion progress of the query.
Particularly for computationally intensive queries involving complex
aggregations or sorting, such as tpcds query78, there is often
additional computation time required after the scanning is complete. In
such cases, displaying only 100% scan progress would be inaccurate.
Change-Id: I11a704885505442b7499a026fcee3b86696cd064
Reviewed-on: http://gerrit.cloudera.org:8080/20672
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Clarifies the behavior when building the impala-shell tarball if one of the
system pythons is also included in IMPALA_EXTRA_PACKAGE_PYTHONS. System
python will always replace the same version from
IMPALA_EXTRA_PACKAGE_PYTHONS, as system pythons are appended to the end.
Updates make_shell_tarball to delete the old ext-py install when it
would be replaced rather than relying on 'pip --upgrade', and iterates
by python executable first to make that possible.
Change-Id: I629bdab38d98c8c4232d4cae7b0429a5118d9ff7
Reviewed-on: http://gerrit.cloudera.org:8080/20687
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds IMPALA_EXTRA_PACKAGE_PYTHONS to build impala-shell tarball
dependencies for additional Python targets. That can be used to build a
tarball that supports multiple Python 3 minor versions at once.
Updates the impala-shell script to provide a clear error message when
attempting to use the tarball with a Python version that it hasn't been
built for.
Change-Id: I13720a9e3c50f348bef41f5e91f810204e416f13
Reviewed-on: http://gerrit.cloudera.org:8080/20617
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
When impala-shell receives binary data with the HS2 protocol, it uses a
stringifier to decode it. In Python 3, 'str' on binary data wraps it in
"b'...'"; to get equivalent output to 'str' in Python 2, we need to
decode as UTF-8 and handle errors.
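A sketch of such a stringifier follows. The function name is
hypothetical, and errors="replace" is an assumption: the message above
only says that decode errors are handled, not how:

```python
def stringify_binary(value):
    """Render a binary column value the way Python 2's str() would."""
    if isinstance(value, bytes):
        # str(b"abc") would give "b'abc'" in Python 3, so decode instead.
        # Invalid UTF-8 sequences are replaced rather than raising.
        return value.decode("utf-8", errors="replace")
    return str(value)
```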
Adds a test case for how impala-shell formats binary data.
Change-Id: I9222cd1ac081a38ab2b37d58628faac0812695ec
Reviewed-on: http://gerrit.cloudera.org:8080/20624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In some build environments, the impala-shell Python 3
virtualenv install fails due to interactions with
shell/pkg_resources.py. This doesn't reproduce in the standard
development environment, but it is consistent. It seems to
be related to invoking a command in ${IMPALA_HOME}/shell
and the pkg_resources.py being in that directory.
To avoid any interactions, this moves shell/pkg_resources.py
to shell/legacy/pkg_resources.py. This keeps it off of the
path for the failing command, and it also keeps it off of
our PYTHONPATH (which includes ${IMPALA_HOME}/shell).
Testing:
- Ran a build in the affected build environment
- Ran a core job
Change-Id: Id8f2d8a8472c7bb405bf88673ed9779e23cde1d6
Reviewed-on: http://gerrit.cloudera.org:8080/20468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala Shell gets cookies from an HTTPMessage object formed from a
response to an HTTP message. The format of cookies in the message
differs across the python versions. In Python 2 the HTTPMessage is a
mimetools.Message object, and the Set-Cookie values all appear in a
single header, separated by newlines. In Python 3 the HTTPMessage is an
email.message.Message, and the Set-Cookie values appear as duplicate
headers.
Add platform dependent code to get_all_matching_cookies() that loads
cookies from all the Set-Cookie headers.
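For the Python 3 side, collecting the duplicate headers could look like
the sketch below. It covers only the email.message.Message case; the
helper names are hypothetical and this is not the body of
get_all_matching_cookies():

```python
from email.message import Message
from http.cookies import SimpleCookie


def all_set_cookie_values(headers):
    """Return every Set-Cookie value from a Python 3 HTTPMessage."""
    # get_all() returns None when the header is absent.
    return headers.get_all("Set-Cookie") or []


def load_cookies(headers):
    cookies = SimpleCookie()
    for value in all_set_cookie_values(headers):
        cookies.load(value)
    return cookies


# Build a message with duplicate Set-Cookie headers, as Python 3 does.
msg = Message()
msg["Set-Cookie"] = "a=1; Path=/"
msg["Set-Cookie"] = "b=2"
cookies = load_cookies(msg)
```

Using headers.get("Set-Cookie") here would return only the first
value, which matches the symptom of a cookie going missing.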
TESTING:
Changed test_get_all_matching_cookies() to build the HTTPMessage
using a new utility method that creates Set-Cookie headers in
the appropriate format for the platform.
Validated that the KNOX_BACKEND-IMPALA cookie is correctly set in
Impala Shell on a Red Hat 9 system using Python 3 (which is how
the problem was first observed).
Change-Id: I057b5c2b9d78e36f32865537d091c4ac0e80d37f
Reviewed-on: http://gerrit.cloudera.org:8080/20216
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The impala-shell tarball ships its external dependencies
by building eggs and including them in the ext-py* directories.
On Redhat 9 and Ubuntu 22, the impala-shell tarball encountered
a regression where the sasl package could not access its
Client class:
Error connecting: AttributeError, module 'sasl' has no attribute 'Client'
This only occurs when using eggs (which are zip files). The virtualenv
installs worked fine. Unpacking the eggs and using the content directly
also avoids the problem.
This reworks the shell tarball to instead build wheels and install
them with 'pip install'. This means that the external dependencies
are not packaged in eggs, and this avoids the issue with sasl. This
is a minimal change to avoid the issue until the shell tarball build
can be reworked more extensively.
Testing:
- Ran shell tests on Redhat 9
Change-Id: I49403979c559b7f8bbe038865c06db6024468d72
Reviewed-on: http://gerrit.cloudera.org:8080/20095
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for Redhat 9 / Ubuntu 22. It updates
to a newer toolchain that has those builds, and it adds
supporting code in bootstrap_system.sh.
Redhat 9 and Ubuntu 22 use python = python3, which requires
various changes to build scripts and tests. Ubuntu 22 uses
Python 3.10, which deprecates certain ssl.PROTOCOL_* constants, so
this adapts test_client_ssl.py to that change until it
can be fully addressed in IMPALA-12219.
Various OpenSSL methods have been deprecated. As a workaround
until these can be addressed properly, this specifies
-Wno-deprecated-declarations. This can be removed once the
code is adapted to the non-deprecated APIs in IMPALA-12226.
Impala crashes with tcmalloc errors unless we update to a newer
gperftools, so this moves to gperftools 2.10. gperftools changed
the default for tcmalloc.aggressive_memory_decommit to off, so
this adapts our code to set it for backend tests. The gperftools
upgrade does not show any performance regression:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 3.08 | -0.64% | 2.20 | -0.37% |
+----------+-----------------------+---------+------------+------------+----------------+
With newer Python versions, the impala-virtualenv command
fails to create a Python 3 virtualenv. This switches to
using Python 3's builtin venv command for Python >=3.6.
Kudu needed a newer version and LLVM required a couple patches.
Testing:
- Ran a core job on Ubuntu 22 and Redhat 9. The tests run
to completion without crashing. There are test failures
that will be addressed in follow-up JIRAs.
- Ran dockerised tests on Ubuntu 22.
- Ran dockerised tests on Ubuntu 20 and Rocky 8.5.
Change-Id: If1fcdb2f8c635ecd6dc7a8a1db81f5f389c78b86
Reviewed-on: http://gerrit.cloudera.org:8080/20073
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Interactive shell tests can hang waiting for input if the
shell process hits errors or exits. For example, the problems
in the sasl package seen in IMPALA-12220 cause test_shell_interactive.py
to hang.
This improves the error detection/handling to avoid hangs for
most common shell errors. Specifically, it adds a check for
the impala-shell process exiting, and it adds a check for
a failure to connect to Impala. Both would previously result
in hangs.
Testing:
- Verified test_shell_interactive.py doesn't hang with hand
tests
- Remove a vital import from impala-shell so it exits instantly
- Simulate a connection problem by overwriting the port
with a non-functional port
- Test on Redhat 9 with the IMPALA-12220 issue
Change-Id: I7556fb687e06b41caa538d8c3231ec9f2ad98162
Reviewed-on: http://gerrit.cloudera.org:8080/20087
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Commit cd9f3f578 aims to suppress logging for the 'thrift' library
within impala-shell. However, it does not work in all cases. This change
moves the fix into the 'main' function, which suppresses the unwanted
message.
Tested by connecting through impala-shell with Python2.7 and Python3.6
with SSL enabled.
Change-Id: I4de95b1b67abe9a0b4637910b0894addddda23d5
Reviewed-on: http://gerrit.cloudera.org:8080/20074
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This pulls in a new toolchain to get a Thrift with
the patch for THRIFT-5705. This fixes an issue where
idle clients using TLS are needlessly disconnected due
to a bug in the read retry count logic inside Thrift.
Tests:
- This modifies test_thrift_socket.py to make it do
more idle polls and check that ImpalaShell is not
disconnected. It fails without the THRIFT-5705 patch
and passes now.
Change-Id: Ifc7704cba032a91b9fd0d5d54d1e0a7e17fb10bb
Reviewed-on: http://gerrit.cloudera.org:8080/19962
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
The previous fetch_size of 10240 turned out to be suboptimal for HS2
server side, likely because it leads to overallocation in result
'std::vector's. Changed to the closest power of 2 size (8192).
With this change RowMaterializationTimer decreased from 3.4s to 2.7s
for "SELECT * FROM tpch_parquet.lineitem".
Change-Id: I34973cb705db53c496b9944c74995b45cf720d46
Reviewed-on: http://gerrit.cloudera.org:8080/19965
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The end time of the exact same rpc call was different between stdout
and the rpc details file because the end time was calculated each
time the details were written out instead of calculating the end time
once and reusing that value.
The duration of each rpc call was being calculated incorrectly.
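The "compute once, reuse" idea can be sketched as below. RpcTiming and
its members are hypothetical names for illustration and do not mirror
the shell's actual bookkeeping:

```python
import time


class RpcTiming:
    """Records the start of an RPC and fixes its end time exactly once."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.start = clock()
        self.end = None

    def finish(self):
        # Only the first call reads the clock; later calls (one per
        # output destination) reuse the stored value.
        if self.end is None:
            self.end = self._clock()
        return self.end

    @property
    def duration(self):
        return self.finish() - self.start
```

Every writer that reports the RPC then sees the same end time and the
same duration.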
Change-Id: Ifd9dec189d0f6fb8713fb1c7b2b6c663e492ef05
Reviewed-on: http://gerrit.cloudera.org:8080/19932
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As __future__.unicode_literals is imported in impala-shell
concatenating an str with a literal leads to decoding the
string with 'ascii' codec which fails if there are non-ascii
characters. Converting the literal to str solves the issue.
Testing:
- added regression test + ran related EE tests
Change-Id: I99b72dd262fc7c382e8baee1dce7592880c84de2
Reviewed-on: http://gerrit.cloudera.org:8080/19893
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Thrift's fastbinary module provides native code that
accelerates the BinaryProtocol. It can make a large
performance difference when using the Hiveserver2
protocol with impala-shell. If the fastbinary is not
working, it silently falls back to interpreted code.
This can happen because the fastbinary couldn't load
a particular library, etc.
This adds a warning on impala-shell startup when
it detects that Thrift's fastbinary is not working.
When bin/impala-shell.sh is modified to use python3,
impala-shell outputs this error (shortened for legibility):
WARNING: Failed to load Thrift's fastbinary module. Thrift's
BinaryProtocol will not be accelerated, which can reduce performance.
Error was '{path to Python2 thrift fastbinary.so}: undefined symbol: _Py_ZeroStruct'
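A startup check of this kind can be approximated as follows. The helper
name is hypothetical, and the sketch assumes the accelerator is
importable as thrift.protocol.fastbinary; the real detection logic in
impala-shell may differ:

```python
import importlib


def native_module_error(module_name="thrift.protocol.fastbinary"):
    """Return None if the module imports cleanly, else the error text."""
    try:
        importlib.import_module(module_name)
    except ImportError as e:
        # Covers both a missing module and a shared object that fails
        # to load (e.g. an undefined symbol).
        return str(e)
    return None
```

A caller would print a warning when the result is not None and then
continue with the slower interpreted protocol code.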
Testing:
- Added a simple test that verifies the impala-shell
does not output the warning
- Outputs warning when Python 2 thrift used for Python 3 shell
Change-Id: Id5d0e5db5cfdf1db4521b00f912b4697a7f646e8
Reviewed-on: http://gerrit.cloudera.org:8080/19806
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This support was modeled after the LDAP authentication.
If JWT authentication is used, the Impala shell enforces the use of the
hs2-http protocol since the JWT is sent via the "Authorization"
HTTP header.
The following flags have been added to the Impala shell:
* -j, --jwt: indicates that JWT authentication will be used
* --jwt_cmd: shell command to run to retrieve the JWT to use for
authentication
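How a --jwt_cmd style option could be consumed is sketched below. Both
helper names are hypothetical, and the header assumes the standard
"Authorization: Bearer <token>" form rather than quoting the shell's
code:

```python
import subprocess


def jwt_from_cmd(jwt_cmd):
    """Run a shell command and return its trimmed stdout as the JWT."""
    out = subprocess.check_output(jwt_cmd, shell=True)
    return out.decode("utf-8").strip()


def bearer_header(token):
    # Assumption: the token travels as a standard HTTP bearer credential.
    return {"Authorization": "Bearer " + token}
```

Reading the token from a command keeps it out of the shell history and
the process list, in the same spirit as the existing LDAP password
command option.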
Testing
New Python tests have been added:
* The shell tests ensure that the various command line arguments are
handled properly. They assert, for example, that only a single
authentication method can be selected and that JWTs cannot be sent
in clear text without the proper arguments.
* The Python custom cluster tests leverage a test JWKS and test JWTs.
Then, a custom Impala cluster is started with the test JWKS. The
Impala shell attempts to authenticate using a valid JWT, an expired
(invalid) JWT, and a valid JWT signed by a different, untrusted JWKS.
These tests also exercise the Impala JWT authentication mechanism and
assert the prometheus JWT auth success and failure metrics are
reported accurately.
Change-Id: I52247f9262c548946269fe5358b549a3e8c86d4c
Reviewed-on: http://gerrit.cloudera.org:8080/19837
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Pip sporadically hits an error when installing impala-shell into
a virtualenv. An example symptom is this (though the issue is
not specific to thrift):
WARNING: Skipping page https://pypi.org/simple/thrift/ because the
GET request got Content-Type: Unknown. The only supported
Content-Types are application/vnd.pypi.simple.v1+json,
application/vnd.pypi.simple.v1+html, and text/html
ERROR: Could not find a version that satisfies the requirement
thrift==0.16.0 (from impala-shell) (from versions: none)
ERROR: No matching distribution found for thrift==0.16.0
It appears that this error can occur when two pip processes
are installing into virtualenvs simultaneously and share a
cache directory. This happens for our impala-shell build,
because we are doing pip install for Python 2 and Python 3
simultaneously. The impala-python/impala-python3 virtualenvs
do not use a cache directory and are not impacted.
This changes the shell's pip install to give the Python 2 and
Python 3 separate cache directories. The cache directories are
placed in ~/.cache like the regular pip cache. These do not
consume much space (a couple MB).
Testing:
- Ran all-build-options-ub2004 ten times without seeing the failure
Change-Id: I3f834b9f8c8cbc09830745ad132677a2fe17e07b
Reviewed-on: http://gerrit.cloudera.org:8080/19813
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Fix various quality-of-life issues with the 'summary' command:
- update regex to correctly match query ID for handling "Query id ...
not found" errors
- fail the command rather than exiting the shell when 'summary' is
called with an incorrect argument (such as 'summary 1')
- provide a useful message rather than print an exception when 'summary
original' is invoked with no failed queries
Testing:
- added new tests for the 'summary' command
Change-Id: I7523d45b27e5e63e1f962fb1f6ebb4f0adc85213
Reviewed-on: http://gerrit.cloudera.org:8080/19797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>