impala

mirror of https://github.com/apache/impala.git synced 2026-02-03 09:00:39 -05:00

Author	SHA1	Message	Date
Balazs Hevele	474af29dc9	IMPALA-572 impala-shell: add option to write profiles to a file Added an argument to write runtime profiles to a given file, after running "profile;" or a query with -p flag set. Usage: impala-shell.sh --profile_output=path/to/file It is also available as a shell option: SET PROFILE_OUTPUT=path/to/file; If no file is provided, the profile will be written to standard output. Change-Id: Id8ce4ddcf013392b3c4d66941f07fb90f9c90c3c Reviewed-on: http://gerrit.cloudera.org:8080/23883 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2026-01-27 19:30:29 +00:00
Michael Smith	eb8c8bd4ce	IMPALA-14674: Implement connect_timeout_ms for HS2-HTTP Implements connect_timeout_ms support for HS2-HTTP. This only applies the timeout while establishing the connection, so later requests do not time out. This is preferable to configuring http_socket_timeout_s, which can result in timing out operations that just take a while to execute. Removed ImpalaHttpClient.setTimeout as it's unused and not part of the specification for TTransportBase. Testing: - updates tests now that HS2-HTTP supports connect_timeout_ms - test_impala_shell_timeout for HS2-HTTP without http_socket_timeout_s - marks test_http_socket_timeout to run serially because it relies on short timeouts; inconsistent behavior can leave a dangling session Change-Id: I9012066fb0d16497f309532021d7b323404b9fb2 Reviewed-on: http://gerrit.cloudera.org:8080/23499 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2026-01-15 00:50:17 +00:00
Michael Smith	ac1c11dd82	IMPALA-14460: Keep http connections open in impala-shell Leave HS2-HTTP connections open and retry on 401 or EPIPE failures to re-use connections, greatly reducing the number of client connections needed with the HS2-HTTP protocol. Adds a 'use_new_http_connection' impala-shell option to restore the old behavior of using a new connection for each rpc. Existing test_shell_interactive_reconnect tests that ImpalaShell - the library implementing the impala-shell CLI - will automatically establish a new connection with all protocols. Prior to this patch, after restarting impalad you'd see 2026-01-06 11:13:08 [Warning] close session RPC failed: <class 'impala_shell.shell_exceptions.RPCException'> ERROR: Invalid session id: be40a2618203ff7b:beacd4b5d28f7692 Connection lost, reconnecting... Warning: --connect_timeout_ms is currently ignored with HTTP transport. Opened TCP connection to localhost:28001 If you instead introduce a load balancer like haproxy and restart the lb, there's no apparent break because impala-shell would always establish a new connection. With this patch, when impalad is restarted we still see the lost session 2026-01-06 11:20:43 [Exception] type=<class 'BrokenPipeError'> in PingImpalaHS2Service. Num remaining tries: 3 [Errno 32] Broken pipe Connection closed, reconnecting... 2026-01-06 11:20:43 [Warning] close session RPC failed: <class 'impala_shell.shell_exceptions.RPCException'> ERROR: Invalid session id: 6e494c76a9a58278:dbb7016cb5999385 Connection lost, reconnecting... Warning: --connect_timeout_ms is currently ignored with HTTP transport. Opened TCP connection to localhost:28000 If the lb is restarted, we now see that the connection is reopened 2026-01-06 11:24:02 [Exception] type=<class 'BrokenPipeError'> in PingImpalaHS2Service. Num remaining tries: 3 [Errno 32] Broken pipe Connection closed, reconnecting... Query: ... 
Triggering a retry due to 401 Unauthorized requires Kerberos, since Basic and Bearer auth always send the Authorization header; it shows 2026-01-06 17:02:27 [Exception] type= <class 'http.client.RemoteDisconnected'> in ExecuteStatement. Remote end closed connection without response 2026-01-06 17:02:27 [Exception] type= <class 'http.client.RemoteDisconnected'> when listing query options. Num remaining tries: 3 Remote end closed connection without response 2026-01-06 17:02:27 [Exception] type=<class 'ConnectionRefusedError'> in ExecuteStatement. [Errno 111] Connection refused 2026-01-06 17:02:27 [Exception] type=<class 'ConnectionRefusedError'> when listing query options. Num remaining tries: 2 [Errno 111] Connection refused Connection closed, reconnecting... Cookies expired, restarting authentication... Preserving cookies: impala.auth Connected to localhost:28005 Updates tests that count RPCs via number of connections as re-use means they're no longer linked. Tests now rely on connection count, which verifies we're re-using connections. Adds testReconnect to use a proxy where we can interrupt the existing connection, which will sometimes trigger "Connection closed, reconnecting..." I didn't find a way to trigger it consistently in this test environment. Adds tests using Kerberos authentication to trigger cookie retry and "Cookie expired, restarting authentication..." Generated-by: Github Copilot (GPT-4.1) Change-Id: Iafb3fc39817e93c691cd993902c6d939a7235a03 Reviewed-on: http://gerrit.cloudera.org:8080/23831 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com>	2026-01-12 18:04:08 +00:00
Riza Suminto	3ed2a82a95	IMPALA-14606: Stop building impala-shell for Python 2 This patch stops setting up and building impala-shell for Python 2. A more thorough cleanup will be done in the future. Testing: Passed build and test/shell/ on RHEL8. Change-Id: Ic7d59b283f4e2f011880ff6221d550b52714a538 Reviewed-on: http://gerrit.cloudera.org:8080/23750 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-12-10 04:40:46 +00:00
Joe McDonnell	1913ab46ed	IMPALA-14501: Migrate most scripts from impala-python to impala-python3 To remove the dependency on Python 2, existing scripts need to use python3 rather than python. These commands find those locations (for impala-python and regular python): git grep impala-python \| grep -v impala-python3 \| grep -v impala-python-common \| grep -v init-impala-python git grep bin/python \| grep -v python3 This removes or switches most of these locations by various means: 1. If a python file has a #!/bin/env impala-python (or python) but doesn't have a main function, it removes the hash-bang and makes sure that the file is not executable. 2. Most scripts can simply switch from impala-python to impala-python3 (or python to python3) with minimal changes. 3. The cm-api pypi package (which doesn't support Python 3) has been replaced by the cm-client pypi package and interfaces have changed. Rather than migrating the code (which hasn't been used in years), this deletes the old code and stops installing cm-api into the virtualenv. The code can be restored and revamped if there is any interest in interacting with CM clusters. 4. This switches tests/comparison over to impala-python3, but this code has bit-rotted. Some pieces can be run manually, but it can't be fully verified with Python 3. It shouldn't hold back the migration on its own. 5. This also replaces locations of impala-python in comments / documentation / READMEs. 6. kazoo (used for interacting with HBase) needed to be upgraded to a version that supports Python 3. The newest version of kazoo requires upgrades of other component versions, so this uses kazoo 2.8.0 to avoid needing other upgrades. The two remaining uses of impala-python are: - bin/cmake_aux/create_virtualenv.sh - bin/impala-env-versioned-python These will be removed separately when we drop Python 2 support completely. In particular, these are useful for testing impala-shell with Python 2 until we stop supporting Python 2 for impala-shell. 
The docker-based tests still use /usr/bin/python, but this can be switched over independently (and doesn't impact impala-python) Testing: - Ran core job - Ran build + dataload on Centos 7, Redhat 8 - Manual testing of individual scripts (except some bitrotted areas like the random query generator) Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc Reviewed-on: http://gerrit.cloudera.org:8080/23468 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2025-10-22 16:30:17 +00:00
Michael Smith	512a73771f	IMPALA-14452: Fix impala-shell SSL with Python 3.12 Removes deprecated ImpalaHttpClient constructor that supported port and path as it has been deprecated since at least 2020 and appears unused. Removes cert_file and key_file as they were also never used, and if required must now be passed in via ssl_context. Updates TSSLSocket fixes for Thrift 0.16 and Python 3.12. _validate_cert was removed by Thrift 0.16, but everything worked because Thrift used ssl.match_hostname instead. With Python 3.12 ssl.match_hostname no longer exists so we rely on OpenSSL to handle verification with ssl.PROTOCOL_TLS_CLIENT. Only uses ssl.PROTOCOL_TLS_CLIENT when match_hostname is unavailable to avoid changing existing behavior. THRIFT-792 identifies that TSocket suppresses connection errors, where we would otherwise see SSL hostname verification errors like ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: IP address mismatch, certificate is not valid for '::1'. (_ssl.c:1131) Python 2.7.9 and 3.2 are minimum required versions; both have been EOL for several years. Testing: - ran custom_cluster/{test_client_ssl.py,test_ipv6.py} on Ubuntu 24 with Python 3.12, OpenSSL 3.0.13. - ran custom_cluster/test_client_ssl.py on RHEL 7.9 with Python 2.7.5 and Python 3.6.8, OpenSSL 1.0.2k-fips. - adds test that hostname checking is configured. Change-Id: I046a9010ac4cb1f7d705935054b306cddaf8bdc7 Reviewed-on: http://gerrit.cloudera.org:8080/23519 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2025-10-20 09:55:22 +00:00
Riza Suminto	0292e296c0	IMPALA-14426: Deflake TestImpalaShell.test_cancellation TestImpalaShell.test_cancellation started failing when run with Python 3.9, with the following error message: RuntimeError: reentrant call inside <_io.BufferedWriter name='<stderr>'> This patch is a quick fix that changes the stderr write from print() to os.write(). Note that the thread-safety issue within _signal_handler in impala_shell.py during query cancellation still remains. Testing: Ran and passed test_cancellation on RHEL9 with Python 3.9. Change-Id: I5403c7b8126b1a35ea841496fdfb6eb93e83376e Reviewed-on: http://gerrit.cloudera.org:8080/23416 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-12 20:49:22 +00:00
Riza Suminto	28cff4022d	IMPALA-14333: Run impala-py.test using Python3 Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true reveals some tests that require adjustment. This patch makes those adjustments, which mostly revolve around encoding differences and string vs bytes types in Python3. This patch also switches the default to run pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true. The following are the details: Change hash() function in conftest.py to crc32() to produce a deterministic hash. Hash randomization is enabled by default since Python 3.3 (see https://docs.python.org/3/reference/datamodel.html#object.__hash__). This causes test sharding (like --shard_tests=1/2) to produce an inconsistent set of tests per shard. Always restart minicluster during custom cluster tests if --shard_tests argument is set, because test order may change and affect test correctness, depending on whether running on a fresh minicluster or not. Moved one test case from delimited-latin-text.test to test_delimited_text.py for easier binary comparison. Add bytes_to_str() as a utility function to decode bytes in Python3. This is often needed when inspecting the return value of subprocess.check_output() as a string. Implement DataTypeMetaclass.__lt__ to substitute DataTypeMetaclass.__cmp__ that is ignored in Python3 (see https://peps.python.org/pep-0207/). Fix WEB_CERT_ERR difference in test_ipv6.py. Fix trivial integer parsing in test_restart_services.py. Fix various encoding issues in test_saml2_sso.py, test_shell_commandline.py, and test_shell_interactive.py. Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1. Switch to binary comparison in test_iceberg.py where needed. Specify text mode when calling tempfile.NamedTemporaryFile(). Simplify create_impala_shell_executable_dimension to skip testing dev and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. 
The reason is that several UTF-8 related tests in test_shell_commandline.py break in the Python3 pytest + Python2 impala-shell combo. This skipping already happens automatically on build OSes without a system Python2 available, like RHEL9 (IMPALA_SYSTEM_PYTHON2 env var is empty). Removed unused vector argument and fixed some trivial flake8 issues. Some test logic required modification due to intermittent issues in Python3 pytest. These include: Add _run_query_with_client() in test_ranger.py to allow reusing a single Impala client for running several queries. Ensure clients are closed when the test is done. Mark several tests in test_ranger.py with SkipIfFS.hive because they run queries through beeline + HiveServer2, but Ozone and S3 build environments do not start HiveServer2 by default. Increase the sleep period from 0.1 to 0.5 seconds per iteration in test_statestore.py and mark TestStatestore to execute serially. This is because TServer appears to shut down more slowly when run concurrently with other tests. Handle the deprecation of Thread.setDaemon() as well. Always set force_restart=True for each test method in TestLoggingCore, TestShellInteractiveReconnect, and TestQueryRetries to prevent them from reusing the minicluster from a previous test method. Some of these tests disrupt the minicluster (kill impalad) and will produce a minidump if the metrics verifier for the next test fails to detect a healthy minicluster state. Testing: Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true. Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4 Reviewed-on: http://gerrit.cloudera.org:8080/23319 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-03 10:01:29 +00:00
Csaba Ringhofer	6d7c076086	IMPALA-14338: Update six to 1.17.0 to fix impala-shell on Python 3.12+ Change-Id: Iaa04b20f767f2ca74ee680151d029e304363994e Reviewed-on: http://gerrit.cloudera.org:8080/23327 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-08-22 06:43:08 +00:00
Peter Rozsa	3b0d6277d5	IMPALA-14215: Fix tarball creation for extra Python versions make_shell_tarball.sh previously used the system Python to copy the required packages to each extra Python version. This caused version mismatches as the compiled modules are not portable between Python versions. This change modifies the logic to use the cache of the provided extra version instead of the system Python. Tests: - manually validated that each provided extra Python contains its respective compiled dependencies Change-Id: Iaee9b3a98b73fd1faf1b7c8ba4b388722add6fb4 Reviewed-on: http://gerrit.cloudera.org:8080/23160 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-07-12 13:55:51 +00:00
Csaba Ringhofer	5cca1aa9e5	IMPALA-13820: add ipv6 support for webui/hs2/hs2-http/beeswax Main changes: - added flag external_interface to override hostname for beeswax/hs2/hs2-http port to allow testing ipv6 on these interfaces without forcing ipv6 on internal communication - compile Squeasel with USE_IPV6 to allow ipv6 on webui (webui interface can be configured with existing flag webserver_interface) - fixed the handling of [<ipv6addr>]:<port> style addresses in impala-shell (e.g. [::1]:21050) and test framework - improved handling of custom clusters in test framework to allow webui/ImpalaTestSuite's clients to work with non-standard settings (also fixes these clients with SSL) Using ipv4 vs ipv6 vs dual stack can be configured by setting the interface to bind to with flag webserver_interface and external_interface. The Thrift server behind hs2/hs2-http/beeswax only accepts a single host name and uses the first address returned by getaddrinfo() that it can successfully bind to. This means that unless an ipv6 address is used (like ::1) the behavior will depend on the order of addresses returned by getaddrinfo(): `63b7a263fc/lib/cpp/src/thrift/transport/TServerSocket.cpp (L481)` For dual stack the only way currently is to bind to "::", as the Thrift server can only listen on a single socket. Testing: - added custom cluster tests for ipv6 only/dual interface with and without SSL - manually tested in dual stack environment with client on a different host - among clients impala-shell and impyla are tested, but not JDBC/ODBC - no tests yet on truly ipv6 only environment, as internal communication (e.g. 
krpc) is not ready for ipv6 To test manually the dev cluster can be started with ipv6 support: dual mode: bin/start-impala-cluster.py --impalad_args="--external_interface=:: --webserver_interface=::" --catalogd_args="--webserver_interface=::" --state_store_args="--webserver_interface=::" ipv6 only: bin/start-impala-cluster.py --impalad_args="--external_interface=::1 --webserver_interface=::1" --catalogd_args="--webserver_interface=::1" --state_store_args="--webserver_interface=::1" Change-Id: I51ac66c568cc9bb06f4a3915db07a53c100109b6 Reviewed-on: http://gerrit.cloudera.org:8080/22527 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-06-21 14:00:31 +00:00
gaurav1086	3781132ef6	IMPALA-13675: OAuth AuthN Support for Impala Shell This patch adds the support to fetch access tokens from the OAuth Server using the OAuth client_id and client_secret if the access token is not provided. It covers the flow: client_credentials. The client_secret can either be passed as a file or be prompted to enter. Added a test param for impala shell oauth_mock_response_cmd to mock oauth server response only to be used for testing. Also suppressed existing option hs2_x_forward from the impala --help output. Testing(okta oauth server): - Added custom_cluster tests in test_shell_jwt_auth.py: test_oauth_auth_with_clientid_and_secret_success test_oauth_auth_with_clientid_and_secret_failure - Tested manually by providing --user <user> and --oauth_client_secret_cmd="cat password_file.txt" - Tested manually by providing --user <user> and no --oauth_client_secret_cmd, thereby prompting the user to enter the client_secret. Example command: impala-shell.sh -a --auth_creds_ok_in_clear --protocol="hs2-http" --oauth_client_id="client_id" --oauth_client_secret_cmd="cat client_secret.txt" --oauth_server="dev.us.auth01.com" --oauth_endpoint="/oauth/token" Change-Id: I84e26d54f6a53696660728efb239ffd43de4c55d Reviewed-on: http://gerrit.cloudera.org:8080/22424 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-06-05 21:15:47 +00:00
Joe McDonnell	cbb35ebccd	IMPALA-13326: Prefer python3 for tarball packaged impala-shell The tarball packaging for impala-shell ships support for multiple Python versions (including both Python 2 and Python 3). In the impala-shell script, it determines the python to use and uses the corresponding installation. Historically, impala-shell has preferred the "python" executable (which can be Python 2) to the "python3" executable. Since Python 2 is deprecated, this flips the preference to prefer "python3" to "python". This continues to respect IMPALA_PYTHON_EXECUTABLE as before, but it adds an IMPALA_SHELL_PYTHON_FALLBACK variable to determine whether to fall back to the regular logic. This defaults to true, allowing fallback, to maintain existing behavior. The shell end-to-end tests set this to false to lock in the Python version. Testing: - Ran shell tests Change-Id: If0e32e8eee672e4dc66e725722f5150cd1e4c9a6 Reviewed-on: http://gerrit.cloudera.org:8080/22953 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com>	2025-05-28 23:22:12 +00:00
Joe McDonnell	f4e7551094	IMPALA-14087: Fix shell live_progress output display issue on Python 3 When running the shell in a terminal with live_progress=true, live progress overwrites its output by using the ANSI up character to rewrite lines with updates on the query progress. On Python 3, we found that the updates to clear the live progress were overwriting the actual output in the terminal. e.g. +----------+ \| count(*) \| +----------+ Fetched 1 row(s) in 5.20s To avoid this, the live progress lines need to be fully flushed to stderr before starting to output the result to stdout. This adds a flush call in OverwritingStdErrOutputStream::clear() to force this. Testing: - Hand tested queries with live progress - Added test that redirects stdout and stderr to the same file and verifies that no ANSI up character comes after the query output Change-Id: Id2e21224253f76b2a04767a57b3ade49ce2c914f Reviewed-on: http://gerrit.cloudera.org:8080/22941 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>	2025-05-24 04:29:14 +00:00
Joe McDonnell	ea0969a772	IMPALA-11980 (part 2): Fix absolute import issues for impala_shell Python 3 changed the behavior of imports with PEP328. Existing imports become absolute unless they use the new relative import syntax. This adapts the impala-shell code to use absolute imports, fixing issues where it is imported from our test code. There are several parts to this: 1. It moves impala shell code into shell/impala_shell. This matches the directory structure of the PyPi package. 2. It changes the imports in the shell code to be absolute paths (i.e. impala_shell.foo rather than foo). This fixes issues with Python 3 absolute imports. It also eliminates the need for ugly hacks in the PyPi package's __init__.py. 3. This changes Thrift generation to put it directly in $IMPALA_HOME/shell rather than $IMPALA_HOME/shell/gen-py. This means that the generated Thrift code is rooted in the same directory as the shell code. 4. This changes the PYTHONPATH to include $IMPALA_HOME/shell and not $IMPALA_HOME/shell/gen-py. This means that the test code is using the same import paths as the pypi package. With all of these changes, the source code is very close to the directory structure of the PyPi package. As long as CMake has generated the thrift files and the Python version file, only a few differences remain. This removes those differences by moving the setup.py / MANIFEST.in and other files from the packaging directory to the top-level shell/ directory. This means that one can pip install directly from the source code. i.e. pip install $IMPALA_HOME/shell This also moves the shell tarball generation script to the packaging directory and changes bin/impala-shell.sh to use Python 3. This sorts the imports using isort for the affected Python files. Testing: - Ran a regular core job with Python 2 - Ran a core job with Python 3 and verified that the absolute import issues are gone. 
Change-Id: Ica75a24fa6bcb78999b9b6f4f4356951b81c3124 Reviewed-on: http://gerrit.cloudera.org:8080/22330 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Riza Suminto <riza.suminto@cloudera.com>	2025-05-21 15:14:11 +00:00
Riza Suminto	96ae16b60b	IMPALA-13584: Add option to show num row report in impala-shell In beeswax all statements with the exception of USE print 'Fetched X row(s) in Ys', while in HS2 some statements (REFRESH, INVALIDATE METADATA) do not print it. While these statements always return 0 rows, the amount of time spent on the statement can be useful. This patch modifies impala-shell to print the elapsed time for such a query, even if the query is not expected to return result metadata. Added --beeswax_compat_num_rows option in impala-shell. It defaults to False. If this option is set (True), 'Fetched 0 row(s) in' will be printed for all Impala protocols, just like beeswax. One exception is the USE query, which will remain silent. Testing: - Added test_beeswax_compat_num_rows in test_shell_interactive.py. - Pass test_shell_interactive.py. Change-Id: Id76ede98c514f73ff1dfa123a0d951e80e7508b4 Reviewed-on: http://gerrit.cloudera.org:8080/22813 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-04-28 19:13:39 +00:00
Csaba Ringhofer	98f15044a1	IMPALA-13746: Fix long ldap password handling in impala-shell+hs2-http Before this patch impala-shell inserted a \n char after every 76 bytes. The fix is to switch to a different function for encoding. The exact semantics of base64 functions is described in https://docs.python.org/3/library/base64.html Based on impyla fix https://github.com/cloudera/impyla/pull/562 by https://github.com/paulmayer (released in Impyla 0.21a3) Change-Id: I4d73d682cf2d1843d9801ef71b99d551b79deb19 Reviewed-on: http://gerrit.cloudera.org:8080/22780 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>	2025-04-17 06:17:51 +00:00
Joe McDonnell	c5a0ec8bdf	IMPALA-11980 (part 1): Put all thrift-generated python code into the impala_thrift_gen package This puts all of the thrift-generated python code into the impala_thrift_gen package. This is similar to what Impyla does for its thrift-generated python code, except that it uses the impala_thrift_gen package rather than impala._thrift_gen. This is a preparatory patch for fixing the absolute import issues. This patches all of the thrift files to add the python namespace. This has code to apply the patching to the thirdparty thrift files (hive_metastore.thrift, fb303.thrift) to do the same. Putting all the generated python into a package makes it easier to understand where the imports are getting code. When the subsequent change rearranges the shell code, the thrift generated code can stay in a separate directory. This uses isort to sort the imports for the affected Python files with the provided .isort.cfg file. This also adds an impala-isort shell script to make it easy to run. Testing: - Ran a core job Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0 Reviewed-on: http://gerrit.cloudera.org:8080/20169 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>	2025-04-15 17:03:02 +00:00
Riza Suminto	e73e2d40da	IMPALA-13864: Implement ImpylaHS2ResultSet.exec_summary This patch implements building the exec summary table for ImpylaHS2Connection. It adds a fetch_exec_summary argument to ImpalaConnection.execute(). If this argument is True, an exec summary table will be added to the returned result object. fetch_exec_summary is also implemented for BeeswaxConnection. Thus, BeeswaxConnection will no longer always fetch the exec summary by default. Tests that validate the exec summary table are updated to set fetch_exec_summary=True and migrated to test against the hs2 protocol. Change TestExecutorGroup._set_query_options() to do query option setting through hs2_client iconfig instead of SET query. Some flake8 issues are addressed as well. Move build_exec_summary_table to a separate exec_summary.py file. Tweak it a bit to return early if the given TExecSummary is empty. Fixed a bug in ImpalaBeeswaxClient.fetch_results() where the fetch would not happen at all if the discard_result argument is True. Testing: - Run and pass affected tests locally. Change-Id: I7d88f78e58eeda29ce21e7828884c7a129d7efe6 Reviewed-on: http://gerrit.cloudera.org:8080/22626 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-03-24 22:34:20 +00:00
Csaba Ringhofer	9437f9fd16	IMPALA-12656: Bump sasl to 0.4a1 to allow Python3.11+ in impala-shell Before this change impala-shell could not be installed on Python 3.11 due to a compilation failure in python-sasl. Checked installation on Python 3.11/3.12/3.13. Also bumps impyla version to 0.21a2. Change-Id: I4efdd105e489e1d0a996d156fb7efbb6fad8da7d Reviewed-on: http://gerrit.cloudera.org:8080/22593 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>	2025-03-12 19:33:09 +00:00
gaurav1086	c3cbd79b56	IMPALA-13288: OAuth AuthN Support for Impala This patch added OAuth support with following functionality: * Load and parse OAuth JWKS from configured JSON file or url. * Read the OAuth Access token from the HTTP Header which is the same format as JWT Authorization Bearer token. * Verify the OAuth's signature with public key in JWKS. * Get the username out of the payload of OAuth Access token. * If kerberos or ldap is enabled, then both jwt and oauth are supported together. Else only one of jwt or oauth is supported. This has been a pre existing flow for jwt. So OAuth will follow the same policy. * Impala Shell side changes: OAuth options -a and --oauth_cmd Testing: - Added 3 custom cluster be test in test_shell_jwt_auth.py: - test_oauth_auth_valid: authenticate with valid token. - test_oauth_auth_expired: authentication failure with expired token. - test_oauth_auth_invalid_jwk: authentication failure with valid signature but expired. - Added 1 custom cluster fe test in JwtWebserverTest.java - testWebserverOAuthAuth: Basic tests for OAuth - Added 1 custom cluster fe test in LdapHS2Test.java - testHiveserver2JwtAndOAuthAuth: tests all combinations of jwt and oauth token verification with separate jwks keys. - Manually tested with a valid, invalid and expired oauth access token. - Passed core run. Change-Id: I65dc8db917476b0f0d29b659b9fa51ebaf45b7a6 Reviewed-on: http://gerrit.cloudera.org:8080/21728 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-01-15 03:32:57 +00:00
Joe McDonnell	aefd1b0920	IMPALA-13551: Produce the shell tarball by pip installing impala-shell Currently, the shell tarball maintains its own packaging code and directory layout. This is very complicated and currently has several Python packages directly checked into our repository. To simplify it, this changes the shell tarball to be based on pip installing the pypi package. Specifically, the new directory structure for an unpack shell tarball is: impala-shell-4.5.0-SNAPSHOT/ impala-shell install_py${PYTHON_VERSION}/ install_py${ANOTHER_PYTHON_VERSION}/ For example, install_py2.7 is the Python 2.7 pip install of impala-shell. install_py3.8 is a Python 3.8 pip install of impala-shell. This means that the impala-shell script simply picks the install for the specified version of python and uses that pip install directory. To make this more consistent across different Linux distributions, this upgrades pip in the virtualenv to the latest. With this, ext-py and pkg_resources.py can be removed. This requires rearranging the shell build code. Specifically, this splits out the code that generates impala_build_version.py so that it can run before generating the pypi package. The shell tarball now has a dependency on the pypi package and must run after it. This builds on Michael Smith's work from IMPALA-11399. Testing: - Ran shell tests locally - Built on Centos 7, Redhat 8 & 9, Ubuntu 20 & 22, SLES 15 Change-Id: Ifbb66ab2c5bc7180221f98d9bf5e38d62f4ac036 Reviewed-on: http://gerrit.cloudera.org:8080/20171 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-12-17 22:52:01 +00:00
Riza Suminto	aac375eb20	IMPALA-13556: Log GetRuntimeProfile and GetExecSummary at VLOG_QUERY Calls to both of these RPC endpoints were previously logged at VLOG_RPC (or VLOG(2)). This patch changes the log level to VLOG_QUERY (or VLOG(1)). This is helpful because both RPCs are usually called after query execution completes, but the query handle is not released yet. They are also rarely called by clients, so they will not be too noisy. Missing query driver log in GetAllQueryHandles is moved to its caller, where the log message is clarified. ImpalaShell._execute_stmt() is also modified to call get_runtime_profile() only if show_profile option is true. Testing: - Using impala-shell, run a TPC-DS query followed by 'profile' and 'summary' commands. Verify that logs are printed, both with beeswax and HS2 protocol. - Pass core tests. Change-Id: I90ef7d0fadd81c58ec1072e53430f51fea146cf1 Reviewed-on: http://gerrit.cloudera.org:8080/22085 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-11-22 05:23:29 +00:00
Riza Suminto	1f35747ea3	IMPALA-5792: Eliminate duplicate beeswax python code This patch unifies the duplicated exec summary code used by the python beeswax clients: one used by the shell in impala_shell.py and one used by tests in impala_beeswax.py. The code that has progressed furthest is the one in shell/impala_client.py, which is the one that can print the correct exec summary table for MT_DOP>0 queries. It is made into a dedicated build_exec_summary_table function in impala_client.py, and then impala_beeswax.py imports it from impala_client.py. This patch also fixes several flake8 issues around the modified files. Testing: - Manually run TPC-DS Q74 in impala-shell and then type the "summary" command. Confirm that the plan tree is displayed properly. - Run single_node_perf_run.py over branches that produce different TPC-DS Q74 plan trees. Confirm that the plan trees are displayed correctly in performance_result.txt Change-Id: Ica57c90dd571d9ac74d76d9830da26c7fe20c74f Reviewed-on: http://gerrit.cloudera.org:8080/22060 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>	2024-11-14 11:19:19 +00:00
Saurabh Katiyal	2535e79491	IMPALA-12216: Print timestamp for impala-shell errors This change prints the timestamp of an exception or warning that occurred during execution of a query via impala-shell. The timestamp uses the timezone of the machine running impala-shell. Example: Query submitted at: 2024-08-22 16:17:57 (Coordinator: http://host:25000) Query state can be monitored at: http://localhost:25000/query_plan?query_id=e04dcc55e560d1ee:11173fe800000000 ^C Cancelling Query Opened TCP connection to localhost:21050 2024-08-22 16:17:58 [Exception] type=<class 'socket.error'> in FetchResults. [Errno 4] Interrupted system call 2024-08-22 16:17:58 [Warning] Cancelling Query 2024-08-22 16:17:58 [Warning] close session RPC failed: <class 'shell_exceptions.QueryCancelledByShellException'> Opened TCP connection to localhost:21050 [localhost:21050] default> Change-Id: I4abbd02aa9f61210b0333495bf191e72c22a5944 Reviewed-on: http://gerrit.cloudera.org:8080/21426 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-10-12 01:43:59 +00:00
Peter Rozsa	a0aaf338ae	IMPALA-12732: Add support for MERGE statements for Iceberg tables MERGE statement is a DML command that allows users to perform conditional insert, update, or delete operations on a target table based on the results of a join with a source table. This change adds MERGE statement parsing and an Iceberg-specific semantic analysis, planning, and execution. The parsing grammar follows the SQL standard, it accepts the same syntax as Hive, Spark, and Trino by supporting arbitrary number of WHEN clauses, with conditions or without and accepting inline views as source. Example: 'MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED AND t.id < 100 THEN UPDATE SET column1 = s.column1 WHEN MATCHED AND t.id > 100 THEN DELETE WHEN MATCHED THEN UPDATE SET column1 = "value" WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.column1);' The Iceberg-specific analysis, planning, and execution are based on a concept that was previously used for UPDATE: The analyzer creates a SELECT statement with all target and source columns (including Iceberg's virtual columns) and a 'row_present' column that defines whether the source, the target, or both rows are present in the result set after joining the two table references by the ON clause. The join condition should be an equi-join, as it is a FULL OUTER JOIN, and Impala currently supports only equi-joins in this case. The joining order is forced by a query hint, this guarantees that the target table is always on the left side. A new, IcebergMergeNode is added at planning phase, this node does the row-level filtering for each MATCHED/ NOT MATCHED cases. The 'row_present' column decides which case group will be evaluated; if both sides are available, the matched cases, if only the source side matches then the not matched cases and their filter expressions will be evaluated over the row. 
If one of the cases match, then the execution evaluates the result expressions into the output row batch, and an auxiliary tuple will store the merge action. The merge action is a flag for the newly added IcebergMergeSink; this sink will route each incoming row from IcebergMergeNode to their respective destination. Each row could go to the delete sink, insert sink, or to both sinks. Target-side duplicate records are filtered during IcebergMergeNode's execution, if one target table-side duplicate is detected, the whole statement's execution is stopped and the error is reported back to the user. Added tests: - Parser tests - Analyzer tests - Unit test for WHEN NOT MATCHED INSERT column collation - Planner tests for partitioned/sorted cases - Authorization tests - E2E tests Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268 Reviewed-on: http://gerrit.cloudera.org:8080/21423 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-09-05 01:01:05 +00:00
Andrew Sherman	3bdf1d2648	IMPALA-13310 Add the value of the http 'X-Forwarded-For' header to the runtime profile When using hs2-http protocol, http messages from Impala clients may pass through one or more proxies before reaching the Impala coordinator. This can make it harder to track the origin of the http messages. The 'X-Forwarded-For' header is added to or edited by HTTP proxies when forwarding a request, so it may contain multiple source addresses. Add the value of this header to the runtime profile so that it can be observed. Impala will truncate the 'X-Forwarded-For' header value at 8096 characters. Apart from this, Impala does not do any verification or sanitization of this value, so its value should only be trusted if the deployment environment protects against spoofing. A good reference for understanding the use of 'X-Forwarded-For' is https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For This patch does not address the cases where http proxies insert multiple 'X-Forwarded-For' headers. This issue is tracked in IMPALA-13335. TESTING: add an option '--hs2_x_forward' to impala-shell which will set the 'X-Forwarded-For' header. Add tests which verify that the value is set in the profile, and that a long value is truncated correctly. Change-Id: I2e010cfb09674c5d043ef915347c3836696e03cf Reviewed-on: http://gerrit.cloudera.org:8080/21700 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-08-28 05:56:27 +00:00
Joe McDonnell	d8a8412c2b	IMPALA-13294: Add support for long polling to avoid client side wait Currently, Impala does an execute call, then the client polls waiting for the operation to finish (or error out). The client sleeps between polls, and this sleep time can be a substantial percentage of a short query's execution time. To reduce this client side sleep, this implements long polling to provide an option to wait for query completion on the server side. This is controlled by the long_polling_time_ms query option. If set to greater than zero, status RPCs will wait for query completion for up to that amount of time. This defaults to off (0ms). Both Beeswax and HS2 add a wait for query completion in their get status calls (get_state for Beeswax, GetOperationStatus for HS2). This doesn't wait in the execute RPC calls (e.g. query for Beeswax, ExecuteStatement for HS2), because neither includes the query status in the response. The client will always need to do a separate status RPC. This modifies impala-shell and the beeswax client to avoid doing a sleep if the get_state/GetOperationStatus calls take longer than they would have slept. In other words, if they would have slept 50ms, then they skip that sleep if the RPC to the server took longer than 50ms. This allows the client to maintain its sleep behavior with older Impalas that don't use long polling while adapting properly to systems that do have long polling. This has the added benefit that it also adjusts for high latency to the server as well. This does not change any of the sleep times. Testing: - This adds a test case in test_hs2.py to verify the long polling behavior Change-Id: I72ca595c5dd8a33b936f078f7f7faa5b3f0f337d Reviewed-on: http://gerrit.cloudera.org:8080/19205 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-08-13 03:49:45 +00:00
Xuebin Su	ad868b9947	IMPALA-13115: Add query id to error messages This patch adds the query id to the error messages in both - the result of the `get_log()` RPC, and - the error message in an RPC response before they are returned to the client, so that the users can easily figure out the errored queries on the client side. To achieve this, the query id of the thread debug info is set in the RPC handler method, and is retrieved from the thread debug info each time the error reporting function or `get_log()` gets called. Due to the change of the error message format, some checks in the impala-shell.py are adapted to keep them valid. Testing: - Added helper function `error_msg_expected()` to check whether an error message is expected. It is stricter than only using the `in` operator. - Added helper function `error_msg_equal()` to check if two error messages are equal regardless of the query ids. - Various test cases are adapted to match the new error message format. - `ImpalaBeeswaxException`, which is used in tests only, is simplified so that it has the same error message format as the exceptions for HS2. - Added an assertion to the case of killing and restarting a worker in the custom cluster test to ensure that the query id is in the error message in the client log retrieved with `get_log()`. Change-Id: I67e659681e36162cad1d9684189106f8eedbf092 Reviewed-on: http://gerrit.cloudera.org:8080/21587 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-08-08 14:11:04 +00:00
Joe McDonnell	2b98e5fb95	IMPALA-13230: Dump stacktrace for impala-shell when it receives SIGUSR1 It can be useful to get a stacktrace for a running impala-shell for debugging. This uses Python 3's faulthandler to handle the SIGUSR1, so it prints a stacktrace for all threads when it receives SIGUSR1. This does not implement an equivalent functionality for Python 2. Python 2 doesn't have the faulthandler library, and hand tests showed that sending SIGUSR1 to Python 2 impala-shell can interrupt network calls and abort a running query. Testing: - Added a test that verifies the stacktrace is printed and a running query succeeds. Change-Id: If7dae2686b65a1a4f02488abadca3b3c90e48bf1 Reviewed-on: http://gerrit.cloudera.org:8080/21611 Reviewed-by: Yida Wu <wydbaggio000@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com>	2024-08-02 18:07:20 +00:00
Michael Smith	100693d5ad	IMPALA-12093: impala-shell to preserve all cookies Updates impala-shell to preserve all cookies by default, defined as setting 'http_cookie_names=*'. Prior behavior of restricting cookies to a user-specified list is preserved when 'http_cookie_names' is given any value besides '*'. Setting 'http_cookie_names=' prevents any cookies from being preserved. Adds verbose output that prints all cookies that are preserved by the HTTP client. Existing cookie tests with LDAP still work. Adds a test where Impala returns an extra cookie, and test verifies that verbose mode prints all expected cookies. Change-Id: Ic81f790288460b086ab218e6701e8115a996dfa7 Reviewed-on: http://gerrit.cloudera.org:8080/19827 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2024-06-27 22:36:57 +00:00
Csaba Ringhofer	541fc5ee9e	IMPALA-12990: Fix impala-shell handling of unset rows_deleted The issue occurred in Python 3 when 0 rows were deleted from Iceberg. It could also happen in other DMLs with older Impala servers where TDmlResult.rows_deleted was not set. See the Jira for details of the error. Testing: Extended shell tests for Kudu DML reporting to also cover Iceberg. Change-Id: I5812b8006b9cacf34a7a0dbbc89a486d8b454438 Reviewed-on: http://gerrit.cloudera.org:8080/21284 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-04-17 18:52:25 +00:00
Csaba Ringhofer	5c003cdcda	IMPALA-12978: Fix impala-shell's live progress with older Impalas If the Impala server has an older version that does not contain IMPALA-12048 then TExecProgress.total_fragment_instances will be None, leading to an error when checking total_fragment_instances > 0. Note that this issue only occurs with Python 3; in Python 2, None > 0 returns False. Testing: - Manually checked with a modified Impala that doesn't set total_fragment_instances. Only the scanner progress bar is shown in this case. Change-Id: Ic6562ff6c908bfebd09b7612bc5bcbd92623a8e6 Reviewed-on: http://gerrit.cloudera.org:8080/21256 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Zihao Ye <eyizoha@163.com>	2024-04-09 02:23:05 +00:00
Zoltan Borok-Nagy	e326b3cc0d	IMPALA-12313: (part 2) Limited UPDATE support for Iceberg tables This patch adds limited UPDATE support for Iceberg tables. The limitations mean users cannot update Iceberg tables if any of the following is true: * UPDATE value of partitioning column * UPDATE table that went through partition evolution * Table has SORT BY properties The above limitations will be resolved by part 3. The usual limitations like writing non-Parquet files, using copy-on-write, modifying V1 tables are out of scope of IMPALA-12313. This patch implements UPDATEs with the merge-on-read technique. This means the UPDATE statement writes both data files and delete files. Data files contain the updated records, delete files contain the position delete records of the old data records that have been touched. To achieve the above this patch introduces a new sink: MultiDataSink. We can configure multiple TableSinks for a single MultiDataSink object. During execution, the row batches sent to the MultiDataSink will be forwarded to all the TableSinks that have been registered. The UPDATE statement for an Iceberg table creates a source select statement with all table columns and virtual columns INPUT__FILE__NAME and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema (i int, s string, k int), and we update the table with: UPDATE tbl SET k = 5 WHERE i % 100 = 11; The generated source statement will be ==> SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION FROM tbl WHERE i % 100 = 11; Then we create two table sinks that refer to expressions from the above source statement: Insert sink (i, s, 5) Delete sink (INPUT__FILE__NAME, FILE__POSITION) The tuples in the rowbatch of MultiDataSink contain slots for all the above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION). MultiDataSink forwards each row batch to each registered TableSink. They will pick their relevant expressions from the tuple and write data/delete files. 
The tuples are sorted by INPUT__FILE__NAME and FILE__POSITION because we need to write the delete records in this order. For partitioned tables we need to shuffle and sort the input tuples. In this case we also add virtual columns "PARTITION__SPEC__ID" and "ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle and sort the rows based on them. Data files and delete files are now separated in the DmlExecState, so at the end of the operation we'll have two sets of files. We use these two sets to create a new Iceberg snapshot. Why does this patch have the limitations? - Because we are shuffling and sorting rows based on the delete records and their partitions. This means that the new data files might not get written in an efficient way, e.g. there will be too many of them, or we will need to keep too many open file handles during writing. Also, if the table has SORT BY properties, we cannot respect it as the input rows are ordered in a way to favor the position deletes. Part 3 will introduce a buffering writer for position delete files. This means we will shuffle and sort records based on the data records' partitions and SORT BY properties while delete records get buffered and written out at the end (sorted by file_path and position). In some edge cases the delete records might not get written efficiently, but it is a smaller problem than inefficient data files. Testing: * negative tests * planner tests * update all supported data types * partitioned tables * Impala/Hive interop tests * authz tests * concurrent tests Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08 Reviewed-on: http://gerrit.cloudera.org:8080/20677 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-12-09 03:04:05 +00:00
Gergely Farkas	04bdb4d32c	IMPALA-12552: Fix Kerberos authentication issue that occurs in python 3 environment when kerberos_host_fqdn option is used In Python 2, the sasl layer does not accept unicode strings, so we have to explicitly encode the kerberos_host_fqdn string to ascii. However, this is not the case in python 3, where we have to omit the encode, because if we don't do this, impala-shell wants to use the following service principal during Kerberos auth: my_service_name/b'my.kerberos.host.fqdn'@MY.REALM instead of the correct one, which is: my_service_name/my.kerberos.host.fqdn@MY.REALM (This is because the output of the encode function is a byte array in python 3.) Tested with new unit tests and with a snapshot build manually in CDP PVC DS. Change-Id: I8b157d76824ad67faf531a529256a8afe2ab9d49 Reviewed-on: http://gerrit.cloudera.org:8080/20691 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>	2023-11-17 20:08:42 +00:00
Eyizoha	52ad12bc0c	IMPALA-12544: Add additional query progress reporting for the shell This patch modifies the dynamic query progress reporting in impala-shell by adding an extra query progress bar below the scan progress bar. The query progress is calculated using the number of completed fragment instances divided by the total number of fragment instances. Compared to the scan progress, which is calculated based on completed scan ranges divided by the total scan ranges, the query progress provides a more accurate reflection of the actual completion progress of the query. Particularly for computationally intensive queries involving complex aggregations or sorting, such as tpcds query78, there is often additional computation time required after the scanning is complete. In such cases, displaying only 100% scan progress would be inaccurate. Change-Id: I11a704885505442b7499a026fcee3b86696cd064 Reviewed-on: http://gerrit.cloudera.org:8080/20672 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com>	2023-11-10 16:18:51 +00:00
Michael Smith	3e99dfcd16	IMPALA-12515: Clarify behavior with redundant system python Clarifies the behavior building impala-shell tarball when one of the system pythons is also included in IMPALA_EXTRA_PACKAGE_PYTHONS. System python will always replace the same version from IMPALA_EXTRA_PACKAGE_PYTHONS, as system pythons are appended to the end. Updates make_shell_tarball to delete the old ext-py install when it would be replaced rather than relying on 'pip --upgrade', and iterates by python executable first to make that possible. Change-Id: I629bdab38d98c8c4232d4cae7b0429a5118d9ff7 Reviewed-on: http://gerrit.cloudera.org:8080/20687 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-11-09 03:43:35 +00:00
Michael Smith	12325eb7ec	IMPALA-12515: Build modules for extra pythons Adds IMPALA_EXTRA_PACKAGE_PYTHONS to build impala-shell tarball dependencies for additional Python targets. That can be used to build a tarball that supports multiple Python 3 minor versions at once. Updates the impala-shell script to provide a clear error message when attempting to use the tarball with a Python version that it hasn't been built for. Change-Id: I13720a9e3c50f348bef41f5e91f810204e416f13 Reviewed-on: http://gerrit.cloudera.org:8080/20617 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-11-03 16:50:25 +00:00
Michael Smith	09f15eea78	IMPALA-12517: Decode binary data with Python 3 When impala-shell receives binary data with the HS2 protocol, it uses a stringifier to decode it. In Python 3, 'str' on binary data wraps it in "b'...'"; to get equivalent output to 'str' in Python 2, we need to decode as UTF-8 and handle errors. Adds a test case for how impala-shell formats binary data. Change-Id: I9222cd1ac081a38ab2b37d58628faac0812695ec Reviewed-on: http://gerrit.cloudera.org:8080/20624 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-10-31 22:34:15 +00:00
Joe McDonnell	1a1a84ee23	IMPALA-12434: Isolate pkg_resources.py to its own directory In some build environments, the impala-shell Python 3 virtualenv install fails due to interactions with shell/pkg_resources.py. This doesn't reproduce in the standard development environment, but it is consistent. It seems to be related to invoking a command in ${IMPALA_HOME}/shell and the pkg_resources.py being in that directory. To avoid any interactions, this moves shell/pkg_resources.py to shell/legacy/pkg_resources.py. This keeps it off of the path for the failing command, and it also keeps it off of our PYTHONPATH (which includes ${IMPALA_HOME}/shell). Testing: - Ran a build in the affected build environment - Ran a core job Change-Id: Id8f2d8a8472c7bb405bf88673ed9779e23cde1d6 Reviewed-on: http://gerrit.cloudera.org:8080/20468 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-09-19 04:30:09 +00:00
Andrew Sherman	749d664c60	IMPALA-12294 Fix Cookie handling for Impala Shell with python 3 Impala Shell gets cookies from an HTTPMessage object formed from a response to an HTTP message. The format of cookies in the message differs across the python versions. In Python 2 the HTTPMessage is a mimetools.Message object, and the Set-Cookie values all appear in a single header, separated by newlines. In Python 3 the HTTPMessage is an email.message.Message, and the Set-Cookie values appear as duplicate headers. Add platform dependent code to get_all_matching_cookies() that loads cookies from all the Set-Cookie headers. TESTING: Changed test_get_all_matching_cookies() to build the HTTPMessage using a new utility method that creates Set-Cookie headers in the appropriate format for the platform. Validated that the KNOX_BACKEND-IMPALA cookie is correctly set in Impala Shell on a Red Hat 9 system using Python 3 (which is how the problem was first observed). Change-Id: I057b5c2b9d78e36f32865537d091c4ac0e80d37f Reviewed-on: http://gerrit.cloudera.org:8080/20216 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-07-18 23:52:29 +00:00
Joe McDonnell	07d5a93de6	IMPALA-12220: pip install ext-py dependencies in the shell tarball The impala-shell tarball ships its external dependencies by building eggs and including them in the ext-py* directories. On Redhat 9 and Ubuntu 22, the impala-shell tarball encountered a regression where the sasl package could not access its Client class: Error connecting: AttributeError, module 'sasl' has no attribute 'Client' This only occurs when using eggs (which are zip files). The virtualenv installs worked fine. Unpacking the eggs and using the content directly also avoids the problem. This reworks the shell tarball to instead build wheels and install them with 'pip install'. This means that the external dependencies are not packaged in eggs, and this avoids the issue with sasl. This is a minimal change to avoid the issue until the shell tarball build can be reworked more extensively. Testing: - Ran shell tests on Redhat 9 Change-Id: I49403979c559b7f8bbe038865c06db6024468d72 Reviewed-on: http://gerrit.cloudera.org:8080/20095 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-06-21 05:21:01 +00:00
Joe McDonnell	234d641d7b	IMPALA-11961/IMPALA-12207: Add Redhat 9 / Ubuntu 22 support This adds support for Redhat 9 / Ubuntu 22. It updates to a newer toolchain that has those builds, and it adds supporting code in bootstrap_system.sh. Redhat 9 and Ubuntu 22 use python = python3, which requires various changes to build scripts and tests. Ubuntu 22 uses Python 3.10, which deprecates certain ssl.PROTOCOL_TLS, so this adapts test_client_ssl.py to that change until it can be fully addressed in IMPALA-12219. Various OpenSSL methods have been deprecated. As a workaround until these can be addressed properly, this specifies -Wno-deprecated-declarations. This can be removed once the code is adapted to the non-deprecated APIs in IMPALA-12226. Impala crashes with tcmalloc errors unless we update to a newer gperftools, so this moves to gperftools 2.10. gperftools changed the default for tcmalloc.aggressive_memory_decommit to off, so this adapts our code to set it for backend tests. The gperftools upgrade does not show any performance regression: +----------+-----------------------+---------+------------+------------+----------------+ \| Workload \| File Format \| Avg (s) \| Delta(Avg) \| GeoMean(s) \| Delta(GeoMean) \| +----------+-----------------------+---------+------------+------------+----------------+ \| TPCH(42) \| parquet / none / none \| 3.08 \| -0.64% \| 2.20 \| -0.37% \| +----------+-----------------------+---------+------------+------------+----------------+ With newer Python versions, the impala-virtualenv command fails to create a Python 3 virtualenv. This switches to using Python 3's builtin venv command for Python >=3.6. Kudu needed a newer version and LLVM required a couple patches. Testing: - Ran a core job on Ubuntu 22 and Redhat 9. The tests run to completion without crashing. There are test failures that will be addressed in follow-up JIRAs. - Ran dockerised tests on Ubuntu 22. - Ran dockerised tests on Ubuntu 20 and Rocky 8.5. 
Change-Id: If1fcdb2f8c635ecd6dc7a8a1db81f5f389c78b86 Reviewed-on: http://gerrit.cloudera.org:8080/20073 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-06-21 05:21:01 +00:00
Joe McDonnell	bad064dbea	IMPALA-12224: Improve error handling for shell interactive tests Interactive shell tests can hang waiting for input if the shell process hits errors or exits. For example, the problems in the sasl package seen in IMPALA-12220 cause test_shell_interactive.py to hang. This improves the error detection/handling to avoid hangs for most common shell errors. Specifically, it adds a check for the impala-shell process exiting, and it adds a check for a failure to connect to Impala. Both would previously result in hangs. Testing: - Verified test_shell_interactive.py doesn't hang with hand tests - Remove a vital import from impala-shell so it exits instantly - Simulate a connection problem by overwriting the port with a non-functional port - Test on Redhat 9 with the IMPALA-12220 issue Change-Id: I7556fb687e06b41caa538d8c3231ec9f2ad98162 Reviewed-on: http://gerrit.cloudera.org:8080/20087 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-06-21 05:21:01 +00:00
Vincent Tran	9727b46f3b	IMPALA-11435: Fixup - Suppress logging for 'thrift' in impala-shell Commit `cd9f3f578` aims to suppress logging for the 'thrift' library within impala-shell. However, it does not work in all cases. This change moves the fix into the 'main' function, which suppresses the unwanted message. Tested by connecting through impala-shell with Python2.7 and Python3.6 with SSL enabled. Change-Id: I4de95b1b67abe9a0b4637910b0894addddda23d5 Reviewed-on: http://gerrit.cloudera.org:8080/20074 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-06-15 12:06:26 +00:00
Joe McDonnell	e9fb8e717c	IMPALA-12114: Pull in fix for THRIFT-5705 and add test This pulls in a new toolchain to get a Thrift with the patch for THRIFT-5705. This fixes an issue where idle clients using TLS are needlessly disconnected due to a bug in the read retry count logic inside Thrift. Tests: - This modifies test_thrift_socket.py to make it do more idle polls and check that ImpalaShell is not disconnected. It fails without the THRIFT-5705 patch and passes now. Change-Id: Ifc7704cba032a91b9fd0d5d54d1e0a7e17fb10bb Reviewed-on: http://gerrit.cloudera.org:8080/19962 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Reviewed-by: Andrew Sherman <asherman@cloudera.com>	2023-06-02 15:57:37 +00:00
Csaba Ringhofer	adfa6c83ec	IMPALA-12142: Decrease default fetch_size to 8192 in impala-shell The previous fetch_size of 10240 turned out to be suboptimal for HS2 server side, likely because it leads to overallocation in result 'std::vector's. Changed to the closest power of 2 size (8192). With this change RowMaterializationTimer decreased from 3.4s to 2.7s for "SELECT * FROM tpch_parquet.lineitem". Change-Id: I34973cb705db53c496b9944c74995b45cf720d46 Reviewed-on: http://gerrit.cloudera.org:8080/19965 Reviewed-by: Kurt Deschler <kdeschle@cloudera.com> Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-06-01 22:01:49 +00:00
jasonmfehr	ee300a1af0	IMPALA-12163: Fixes two issues when outputting RPC details. The end time of the exact same rpc call was different between stdout and the rpc details file because the end time was calculated each time the details were written out instead of calculating the end time once and reusing that value. The duration of each rpc call was being calculated incorrectly. Change-Id: Ifd9dec189d0f6fb8713fb1c7b2b6c663e492ef05 Reviewed-on: http://gerrit.cloudera.org:8080/19932 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-05-25 20:51:04 +00:00
Csaba Ringhofer	14035065fa	IMPALA-12145: Fix profiles with non-ascii character in impala-shell (python2) As __future__.unicode_literals is imported in impala-shell concatenating an str with a literal leads to decoding the string with 'ascii' codec which fails if there are non-ascii characters. Converting the literal to str solves the issue. Testing: - added regression test + ran related EE tests Change-Id: I99b72dd262fc7c382e8baee1dce7592880c84de2 Reviewed-on: http://gerrit.cloudera.org:8080/19893 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-05-25 00:33:34 +00:00
Joe McDonnell	451543a2e5	IMPALA-11785: Warn if Thrift fastbinary is not working for impala-shell Thrift's fastbinary module provides native code that accelerates the BinaryProtocol. It can make a large performance difference when using the Hiveserver2 protocol with impala-shell. If the fastbinary is not working, it silently falls back to interpreted code. This can happen because the fastbinary couldn't load a particular library, etc. This adds a warning on impala-shell startup when it detects that Thrift's fastbinary is not working. When bin/impala-shell.sh is modified to use python3, impala-shell outputs this error (shortened for legibility): WARNING: Failed to load Thrift's fastbinary module. Thrift's BinaryProtocol will not be accelerated, which can reduce performance. Error was '{path to Python2 thrift fastbinary.so}: undefined symbol: _Py_ZeroStruct' Testing: - Added a simple test that verifies the impala-shell does not output the warning - Outputs warning when Python 2 thrift used for Python 3 shell Change-Id: Id5d0e5db5cfdf1db4521b00f912b4697a7f646e8 Reviewed-on: http://gerrit.cloudera.org:8080/19806 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-05-23 06:41:02 +00:00

414 Commits