To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3
This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
doesn't have a main function, it removes the hash-bang and makes
sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
(or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
replaced by the cm-client pypi package and interfaces have changed.
Rather than migrating the code (which hasn't been used in years), this
deletes the old code and stops installing cm-api into the virtualenv.
The code can be restored and revamped if there is any interest in
interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
bit-rotted. Some pieces can be run manually, but it can't be fully
verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
version that supports Python 3. The newest version of kazoo requires
upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
needing other upgrades.
The two remaining uses of impala-python are:
- bin/cmake_aux/create_virtualenv.sh
- bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.
The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).
Testing:
- Ran core job
- Ran build + dataload on Centos 7, Redhat 8
- Manual testing of individual scripts (except some bitrotted areas like the
random query generator)
Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true
reveals some tests that require adjustment. This patch makes those
adjustments, which mostly revolve around encoding differences and the
str vs. bytes types in Python 3. This patch also switches the default to
run pytest with Python 3 by setting IMPALA_USE_PYTHON3_TESTS=true. The
details are as follows:
Change the hash() function in conftest.py to crc32() to produce a
deterministic hash. Hash randomization is enabled by default since
Python 3.3 (see
https://docs.python.org/3/reference/datamodel.html#object.__hash__).
This causes test sharding (like --shard_tests=1/2) to produce an
inconsistent set of tests per shard. Always restart the minicluster during
custom cluster tests if the --shard_tests argument is set, because test
order may change and affect test correctness, depending on whether a test
runs on a fresh minicluster or not.
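A minimal sketch of the sharding idea, assuming a hypothetical helper
shard_index() and made-up test IDs (the actual conftest.py logic may
differ). zlib.crc32() depends only on the input bytes, whereas hash() of
a str is salted per process in Python 3:

```python
import zlib


def shard_index(test_id, num_shards):
    """Map a test ID to a shard in [0, num_shards) deterministically.

    Hypothetical helper: crc32 depends only on the bytes of test_id, so
    every pytest process assigns the same test to the same shard.
    """
    return zlib.crc32(test_id.encode("utf-8")) % num_shards


# Made-up test IDs for illustration.
test_ids = ["test_a.py::TestA::test_one", "test_b.py::TestB::test_two"]
shards = {tid: shard_index(tid, 2) for tid in test_ids}
```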
Moved one test case from delimited-latin-text.test to
test_delimited_text.py for easier binary comparison.
Add bytes_to_str() as a utility function to decode bytes in Python3.
This is often needed when inspecting the return value of
subprocess.check_output() as a string.
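A hedged sketch of what such a helper could look like (the real
bytes_to_str() may handle encodings or errors differently; UTF-8 is an
assumption here). It uses a Python child process as a stand-in command:

```python
import subprocess
import sys


def bytes_to_str(value):
    """Sketch: decode bytes to str, pass str through unchanged."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


# check_output() returns bytes in Python 3 unless text mode is requested.
raw = subprocess.check_output([sys.executable, "-c", "print('hello')"])
text = bytes_to_str(raw).strip()
```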
Implement DataTypeMetaclass.__lt__ to replace
DataTypeMetaclass.__cmp__, which is ignored in Python 3 (see
https://peps.python.org/pep-0207/).
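For illustration, a small sketch of ordering classes through a metaclass
__lt__. The class names and the order-by-name rule are hypothetical, and
the Python 3 only metaclass syntax is used for brevity:

```python
class DataTypeMeta(type):
    """Hypothetical metaclass that orders classes by their name."""

    def __lt__(cls, other):
        return cls.__name__ < other.__name__


class Int(metaclass=DataTypeMeta):
    pass


class Char(metaclass=DataTypeMeta):
    pass


# sorted() only needs __lt__, so no __cmp__ is required.
ordered = sorted([Int, Char])
```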
Fix WEB_CERT_ERR difference in test_ipv6.py.
Fix trivial integer parsing in test_restart_services.py.
Fix various encoding issues in test_saml2_sso.py,
test_shell_commandline.py, and test_shell_interactive.py.
Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1.
Switch to binary comparison in test_iceberg.py where needed.
Specify text mode when calling tempfile.NamedTemporaryFile().
Simplify create_impala_shell_executable_dimension to skip testing dev
and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The reason
is that several UTF-8 related tests in test_shell_commandline.py break
in the Python 3 pytest + Python 2 impala-shell combination. This skipping
already happens automatically on build OSes without a system Python 2,
such as RHEL9 (where the IMPALA_SYSTEM_PYTHON2 env var is empty).
Removed unused vector argument and fixed some trivial flake8 issues.
Some test logic requires modification due to intermittent issues in
Python 3 pytest. These include:
Add _run_query_with_client() in test_ranger.py to allow reusing a single
Impala client for running several queries. Ensure clients are closed
when the test is done. Mark several tests in test_ranger.py with
SkipIfFS.hive because they run queries through beeline + HiveServer2,
but the Ozone and S3 build environments do not start HiveServer2 by
default.
Increase the sleep period from 0.1 to 0.5 seconds per iteration in
test_statestore.py and mark TestStatestore to execute serially. This is
because TServer appears to shut down more slowly when run concurrently
with other tests. Handle the deprecation of Thread.setDaemon() as well.
Always set force_restart=True for each test method in TestLoggingCore,
TestShellInteractiveReconnect, and TestQueryRetries to prevent them from
reusing the minicluster from the previous test method. Some of these tests
disrupt the minicluster (kill impalad) and will produce a minidump if the
metrics verifier for the next test fails to detect a healthy minicluster
state.
Testing:
Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true.
Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4
Reviewed-on: http://gerrit.cloudera.org:8080/23319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3 changed the behavior of imports with PEP328. Existing
imports become absolute unless they use the new relative import
syntax. This adapts the impala-shell code to use absolute
imports, fixing issues where it is imported from our test code.
There are several parts to this:
1. It moves impala shell code into shell/impala_shell.
This matches the directory structure of the PyPi package.
2. It changes the imports in the shell code to be
absolute paths (i.e. impala_shell.foo rather than foo).
This fixes issues with Python 3 absolute imports.
It also eliminates the need for ugly hacks in the PyPi
package's __init__.py.
3. This changes Thrift generation to put it directly in
$IMPALA_HOME/shell rather than $IMPALA_HOME/shell/gen-py.
This means that the generated Thrift code is rooted in
the same directory as the shell code.
4. This changes the PYTHONPATH to include $IMPALA_HOME/shell
and not $IMPALA_HOME/shell/gen-py. This means that the
test code is using the same import paths as the pypi
package.
With all of these changes, the source code is very close
to the directory structure of the PyPi package. As long as
CMake has generated the thrift files and the Python version
file, only a few differences remain. This removes those
differences by moving the setup.py / MANIFEST.in and other
files from the packaging directory to the top-level
shell/ directory. This means that one can pip install
directly from the source code, i.e. pip install $IMPALA_HOME/shell
This also moves the shell tarball generation script to the
packaging directory and changes bin/impala-shell.sh to use
Python 3.
This sorts the imports using isort for the affected Python files.
Testing:
- Ran a regular core job with Python 2
- Ran a core job with Python 3 and verified that the absolute
import issues are gone.
Change-Id: Ica75a24fa6bcb78999b9b6f4f4356951b81c3124
Reviewed-on: http://gerrit.cloudera.org:8080/22330
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 for why the workload affected
exploration_strategy(). An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and exhaustive
runs before this change.
get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope in this patch.
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, error_msg_expected() only accepted error messages starting
with the following error prompt:
```
Query <query_id> failed:\n
```
However, for some tests using the Beeswax protocol, the error prompt may
appear in the middle of the error message instead of at its beginning.
Therefore, this patch adapts error_msg_expected() to accept error
messages not starting with the error prompt.
The error_msg_expected() function is renamed to error_msg_startswith()
to better describe its behavior.
Change-Id: Iac3e68bcc36776f7fd6cc9c838dd8da9c3ecf58b
Reviewed-on: http://gerrit.cloudera.org:8080/22468
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
This change prints the timestamp of an exception or warning that
occurs during execution of a query via impala-shell.
The timestamp uses the timezone of the machine running impala-shell.
Example:
Query submitted at: 2024-08-22 16:17:57 (Coordinator: http://host:25000)
Query state can be monitored at:
http://localhost:25000/query_plan?query_id=e04dcc55e560d1ee:11173fe800000000
^C Cancelling Query
Opened TCP connection to localhost:21050
2024-08-22 16:17:58 [Exception] type=<class 'socket.error'> in FetchResults.
[Errno 4] Interrupted system call
2024-08-22 16:17:58 [Warning] Cancelling Query
2024-08-22 16:17:58 [Warning] close session RPC failed: <class
'shell_exceptions.QueryCancelledByShellException'>
Opened TCP connection to localhost:21050
[localhost:21050] default>
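A rough sketch of producing such a prefix in local time. The helper name
format_shell_message() is hypothetical and the format string is inferred
from the example output above:

```python
import time


def format_shell_message(kind, message, now=None):
    """Hypothetical helper: prefix a message with a local-time timestamp.

    'now' is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "{0} [{1}] {2}".format(stamp, kind, message)


line = format_shell_message("Warning", "Cancelling Query")
```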
Change-Id: I4abbd02aa9f61210b0333495bf191e72c22a5944
Reviewed-on: http://gerrit.cloudera.org:8080/21426
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When debugging stale metadata, it is helpful to know which catalog
versions of the tables are used and when catalogd loaded those
versions. This patch exposes this information in the query profile for
each referenced table. E.g.
Original Table Versions: tpch.customer, 2249, 1726052668932, Wed Sep 11 19:04:28 CST 2024
tpch.nation, 2255, 1726052790140, Wed Sep 11 19:06:30 CST 2024
tpch.orders, 2257, 1726052803258, Wed Sep 11 19:06:43 CST 2024
tpch.lineitem, 2254, 1726052785384, Wed Sep 11 19:06:25 CST 2024
tpch.supplier, 2256, 1726052794235, Wed Sep 11 19:06:34 CST 2024
Each line consists of the table name, catalog version, loaded timestamp
and the timestamp string.
Implementation:
The loaded timestamp is updated whenever a CatalogObject updates its
catalog version in catalogd. It's passed to impalads with the
TCatalogObject broadcasted by statestore, or in DDL/DML responses.
Currently, the loaded timestamp is added for table, view, function, data
source, and hdfs cache pool in catalogd. However, only those of table
and view are used in impalad. For the loaded timestamp of other
types, users can check them in the /catalog WebUI of catalogd.
Tests:
- Adds e2e test
Change-Id: I94b2fd59ed5aca664d6db4448c61ad21a88a4f98
Reviewed-on: http://gerrit.cloudera.org:8080/21782
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When using hs2-http protocol, http messages from Impala clients may pass
through one or more proxies before reaching the Impala coordinator.
This can make it harder to track the origin of the http messages. The
'X-Forwarded-For' header is added to or edited by HTTP proxies when
forwarding a request, so it may contain multiple source addresses. Add
the value of this header to the runtime profile so that it can be
observed.
Impala will truncate the 'X-Forwarded-For' header value at 8096
characters. Apart from this, Impala does not do any verification or
sanitization of this value, so its value should only be trusted if the
deployment environment protects against spoofing.
A good reference for understanding the use of 'X-Forwarded-For' is
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For
This patch does not address the cases where http proxies insert
multiple 'X-Forwarded-For' headers. This issue is tracked in
IMPALA-13335.
TESTING: add an option '--hs2_x_forward' to impala-shell which will
set the 'X-Forwarded-For' header. Add tests which verify that the value
is set in the profile, and that a long value is truncated correctly.
Change-Id: I2e010cfb09674c5d043ef915347c3836696e03cf
Reviewed-on: http://gerrit.cloudera.org:8080/21700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the query id to the error messages in both
- the result of the `get_log()` RPC, and
- the error message in an RPC response
before they are returned to the client, so that the users can easily
figure out the errored queries on the client side.
To achieve this, the query id of the thread debug info is set in the
RPC handler method, and is retrieved from the thread debug info each
time the error reporting function or `get_log()` gets called.
Due to the change of the error message format, some checks in the
impala-shell.py are adapted to keep them valid.
Testing:
- Added helper function `error_msg_expected()` to check whether an
error message is expected. It is stricter than only using the `in`
operator.
- Added helper function `error_msg_equal()` to check if two error
messages are equal regardless of the query ids.
- Various test cases are adapted to match the new error message format.
- `ImpalaBeeswaxException`, which is used in tests only, is simplified
so that it has the same error message format as the exceptions for
HS2.
- Added an assertion to the case of killing and restarting a worker
in the custom cluster test to ensure that the query id is in
the error message in the client log retrieved with `get_log()`.
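As an illustration of comparing messages regardless of query ids, here is
a sketch. The function name error_msg_equal_sketch() and the assumed
query id shape (two 16-digit hex halves joined by ':') are not taken from
the patch; the real helper may differ:

```python
import re

# Assumed query id shape, as in "e04dcc55e560d1ee:11173fe800000000".
QUERY_ID_RE = re.compile(r"[0-9a-f]{16}:[0-9a-f]{16}")


def error_msg_equal_sketch(msg1, msg2):
    """Compare two error messages while ignoring the query ids in them."""
    def normalize(msg):
        return QUERY_ID_RE.sub("<query_id>", msg)
    return normalize(msg1) == normalize(msg2)


a = "Query e04dcc55e560d1ee:11173fe800000000 failed:\nTable not found"
b = "Query 1234567890abcdef:fedcba0987654321 failed:\nTable not found"
same = error_msg_equal_sketch(a, b)
```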
Change-Id: I67e659681e36162cad1d9684189106f8eedbf092
Reviewed-on: http://gerrit.cloudera.org:8080/21587
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The ImpalaShell class expects to start impala-shell and interact with it
by sending instructions over stdin and reading the results. This
assumption was incorrect when used for impala-shell batch sessions,
where the process exits on its own. If there's a delay in
ImpalaShell.__init__ - between starting the process and polling to see
that it's running - for a batch process, ImpalaShell will fail the
assertion that process_status is None. This can be easily reproduced by
adding a small (0.1s) sleep after starting the new process.
Most batch runs of impala-shell happen through `run_impala_shell_cmd`.
Updated that function to only wait for a successful connection when
stdin input is supplied. Otherwise the command is assumed to be a batch
function and any failures will be detected during `get_result`. Removed
explicit use of `wait_until_connected` as redundant.
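The race can be illustrated with a short-lived child process standing in
for a batch impala-shell run (a sketch, not the ImpalaShell class
itself):

```python
import subprocess
import sys

# A stand-in for a batch run: a process that exits on its own.
proc = subprocess.Popen([sys.executable, "-c", "print('done')"],
                        stdout=subprocess.PIPE)
out, _ = proc.communicate()  # waits for exit, like collecting results

# poll() is None only while the process is still running, so asserting
# "poll() is None" shortly after startup is a race for batch processes.
status = proc.poll()
```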
Fixed cases in test_config_file that previously ignored WARNING before
the connection string because they did not specify
`wait_until_connected`.
Tested by running shell/test_shell_commandline.py with a 0.1s delay
before ImpalaShell polls.
Change-Id: I24e029b6192a17773760cb44fd7a4f87b71c0aae
Reviewed-on: http://gerrit.cloudera.org:8080/21598
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
The issue occurred in Python 3 when 0 rows were deleted from Iceberg.
It could also happen in other DMLs with older Impala servers where
TDmlResult.rows_deleted was not set. See the Jira for details of
the error.
Testing:
Extended shell tests for Kudu DML reporting to also cover Iceberg.
Change-Id: I5812b8006b9cacf34a7a0dbbc89a486d8b454438
Reviewed-on: http://gerrit.cloudera.org:8080/21284
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch modifies the dynamic query progress reporting in impala-shell
by adding an extra query progress bar below the scan progress bar.
The query progress is calculated using the number of completed fragment
instances divided by the total number of fragment instances. Compared to
the scan progress, which is calculated based on completed scan ranges
divided by the total scan ranges, the query progress provides a more
accurate reflection of the actual completion progress of the query.
Particularly for computationally intensive queries involving complex
aggregations or sorting, such as tpcds query78, there is often
additional computation time required after the scanning is complete. In
such cases, displaying only 100% scan progress would be inaccurate.
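The two ratios can be sketched as follows. The helper name and the
numbers are made up for illustration:

```python
def progress_percent(completed, total):
    """Return completion as a percentage, treating total <= 0 as 0%."""
    if total <= 0:
        return 0.0
    return 100.0 * completed / total


# Made-up numbers: all scan ranges are done, but fragment instances are not.
scan_progress = progress_percent(120, 120)
query_progress = progress_percent(9, 12)
```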
Change-Id: I11a704885505442b7499a026fcee3b86696cd064
Reviewed-on: http://gerrit.cloudera.org:8080/20672
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
When impala-shell receives binary data with the HS2 protocol, it uses a
stringifier to decode it. In Python 3, 'str' on binary data wraps it in
"b'...'"; to get equivalent output to 'str' in Python 2, we need to
decode as UTF-8 and handle errors.
Adds a test case for how impala-shell formats binary data.
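A small sketch of the difference in Python 3. The 'replace' error handler
is an assumption; the patch only says that errors are handled:

```python
raw = b"caf\xc3\xa9\xff"  # valid UTF-8 for "café" plus one invalid byte

wrapped = str(raw)  # Python 3 wraps the bytes in b'...'
decoded = raw.decode("utf-8", errors="replace")  # invalid byte -> U+FFFD
```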
Change-Id: I9222cd1ac081a38ab2b37d58628faac0812695ec
Reviewed-on: http://gerrit.cloudera.org:8080/20624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The end time of the exact same rpc call was different between stdout
and the rpc details file because the end time was calculated each
time the details were written out instead of calculating the end time
once and reusing that value.
The duration of each rpc call was being calculated incorrectly.
Change-Id: Ifd9dec189d0f6fb8713fb1c7b2b6c663e492ef05
Reviewed-on: http://gerrit.cloudera.org:8080/19932
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Because __future__.unicode_literals is imported in impala-shell,
concatenating a str with a literal decodes the string with the
'ascii' codec, which fails if there are non-ASCII
characters. Converting the literal to str solves the issue.
Testing:
- added regression test + ran related EE tests
Change-Id: I99b72dd262fc7c382e8baee1dce7592880c84de2
Reviewed-on: http://gerrit.cloudera.org:8080/19893
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Thrift's fastbinary module provides native code that
accelerates the BinaryProtocol. It can make a large
performance difference when using the HiveServer2
protocol with impala-shell. If fastbinary is not
working, Thrift silently falls back to interpreted code.
This can happen, for example, when fastbinary cannot load
a particular library.
This adds a warning on impala-shell startup when
it detects that Thrift's fastbinary is not working.
When bin/impala-shell.sh is modified to use python3,
impala-shell outputs this warning (shortened for legibility):
WARNING: Failed to load Thrift's fastbinary module. Thrift's
BinaryProtocol will not be accelerated, which can reduce performance.
Error was '{path to Python2 thrift fastbinary.so}: undefined symbol: _Py_ZeroStruct'
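The detection can be sketched generically as an import probe. The helper
name accelerator_warning() and its message are hypothetical; Thrift's C
extension is normally imported as thrift.protocol.fastbinary, but a
deliberately missing module is used here so the sketch runs without
Thrift installed:

```python
import importlib


def accelerator_warning(module_name):
    """Return a warning string if module_name cannot be imported, else None."""
    try:
        importlib.import_module(module_name)
    except Exception as e:  # ImportError also covers undefined-symbol failures
        return ("WARNING: Failed to load {0}. Error was '{1}'"
                .format(module_name, e))
    return None


# Stand-in module name that does not exist.
msg = accelerator_warning("no_such_accelerator_module")
```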
Testing:
- Added a simple test that verifies the impala-shell
does not output the warning
- Outputs warning when Python 2 thrift used for Python 3 shell
Change-Id: Id5d0e5db5cfdf1db4521b00f912b4697a7f646e8
Reviewed-on: http://gerrit.cloudera.org:8080/19806
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Fix various quality-of-life issues with the 'summary' command:
- update regex to correctly match query ID for handling "Query id ...
not found" errors
- fail the command rather than exiting the shell when 'summary' is
called with an incorrect argument (such as 'summary 1')
- provide a useful message rather than print an exception when 'summary
original' is invoked with no failed queries
Testing:
- added new tests for the 'summary' command
Change-Id: I7523d45b27e5e63e1f962fb1f6ebb4f0adc85213
Reviewed-on: http://gerrit.cloudera.org:8080/19797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3 changed some object model methods:
- __nonzero__ was removed in favor of __bool__
- func_dict / func_name were removed in favor of __dict__ / __name__
- The next() method was replaced by __next__
(Code locations should use next(iter) rather than iter.next())
- metaclasses are specified a different way
- Locations that specify __eq__ should also specify __hash__
Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.
This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape
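A compact sketch of the Python 3 method names on a hypothetical class
(Batch is not part of the Impala code base):

```python
class Batch(object):
    """Hypothetical container showing the Python 3 method names."""

    def __init__(self, items):
        self.items = list(items)
        self._pos = 0

    def __bool__(self):  # Python 2 called this __nonzero__
        return len(self.items) > 0

    def __iter__(self):
        return self

    def __next__(self):  # Python 2 called this next()
        if self._pos >= len(self.items):
            raise StopIteration
        self._pos += 1
        return self.items[self._pos - 1]

    def __eq__(self, other):
        return isinstance(other, Batch) and self.items == other.items

    def __hash__(self):  # defining only __eq__ makes the class unhashable
        return hash(tuple(self.items))


b = Batch([1, 2])
first = next(b)  # preferred over b.next()
```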
Testing:
- Ran core tests
- Ran release exhaustive tests
Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.
Python 2:
range(0,5) == [0,1,2,3,4]
True
Python 3:
range(0,5) == [0,1,2,3,4]
False
The fix is to wrap locations with list(). i.e.
Python 3:
list(range(0,5)) == [0,1,2,3,4]
True
Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
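Another consequence of laziness, shown as a short Python 3 sketch: a map
object can only be consumed once, so code that iterates a result twice
needs list():

```python
squares = map(lambda x: x * x, range(3))

first_pass = list(squares)   # consumes the iterator
second_pass = list(squares)  # already exhausted in Python 3

pairs = list(zip("ab", [1, 2]))  # zip is lazy too; list() materializes it
```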
Most of the changes were done via these futurize fixes:
- libfuturize.fixes.fix_xrange_with_import
- lib2to3.fixes.fix_map
- lib2to3.fixes.fix_filter
This eliminates the pylint warnings:
- xrange-builtin
- range-builtin-not-iterating
- map-builtin-not-iterating
- zip-builtin-not-iterating
- filter-builtin-not-iterating
- reduce-builtin
- deprecated-itertools-function
Testing:
- Ran core job
Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
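A short sketch of the two operators, with made-up values. On Python 2 the
first result assumes "from __future__ import division" is in effect:

```python
rows = 7
batch_size = 2

ratio = rows / batch_size          # true division
full_batches = rows // batch_size  # floor division keeps an int

values = [10, 20, 30, 40, 50]
middle = values[len(values) // 2]  # an index must be an int
```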
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Python 3 does not support this old except syntax:
except Exception, e:
Instead, it needs to be:
except Exception as e:
This uses impala-futurize to fix all locations of
the old syntax.
Testing:
- The check-python-syntax.sh no longer shows errors
for except syntax.
Change-Id: I1737281a61fa159c8d91b7d4eea593177c0bd6c9
Reviewed-on: http://gerrit.cloudera.org:8080/19551
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This change adds a new info string to the frontend runtime profile
which contains the referenced tables by the query in a
comma-separated format.
Tests:
- Added tests to check if the referenced tables are enumerated
correctly
- Added test to check if referenced table is filled properly with
different DML statements
Change-Id: Ib474a5c6522032679701103aa225a18edca62f5a
Reviewed-on: http://gerrit.cloudera.org:8080/19401
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When the Impala shell is using the hs2 protocol, it makes multiple RPCs
to the Impala daemon. These calls pass Thrift objects back and forth.
This change adds the '--show_rpc' flag, which outputs the details of the
RPCs to stdout, and the '--rpc_file' flag, which outputs the RPC details
to the specified file path.
RPC details include:
- operation name
- request attempt count
- Impala session/query ids (if applicable)
- call duration
- call status (success/failure)
- request Thrift objects
- response Thrift objects
Certain information is not included in the RPC details:
- Thrift object attributes named 'secret' or 'password'
are redacted.
- Thrift objects with a type of TRowSet or TGetRuntimeProfileResp
are not included as the information contained within them is
already available in the standard output from the Impala shell.
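The redaction rule can be sketched on plain dicts standing in for Thrift
objects. The redact() helper and the '*******' mask are assumptions for
illustration, not the impala-shell implementation:

```python
REDACTED_ATTRS = ("secret", "password")


def redact(obj):
    """Return a copy of a nested dict/list with sensitive values masked."""
    if isinstance(obj, dict):
        return {k: ("*******" if k in REDACTED_ATTRS else redact(v))
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj


# Made-up request contents.
request = {"username": "alice", "password": "hunter2",
           "sessionHandle": {"guid": "abc", "secret": "xyz"}}
safe = redact(request)
```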
Testing:
- Added new tests in the end-to-end test suite.
Change-Id: I36f8dbc96726aa2a573133acbe8a558299381f8b
Reviewed-on: http://gerrit.cloudera.org:8080/19388
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently the maximum number of tries for connecting to the coordinator
is hard-coded to 4 in hs2-http mode. It should be configurable,
especially in environments where the coordinator starts slowly.
This patch adds support for configurable max tries in hs2-http mode
using the new impala-shell config option '--connect_max_tries'.
The default value of '--connect_max_tries' is 4.
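A generic sketch of a bounded retry loop. connect_with_retries(), its
delay parameter, and flaky_connect() are hypothetical and only illustrate
the role of a configurable maximum:

```python
import time


def connect_with_retries(connect, max_tries=4, delay_s=0.0):
    """Call connect() up to max_tries times; re-raise the last failure."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        try:
            return connect()
        except Exception:
            if attempt == max_tries:
                raise  # out of tries: surface the last failure
            time.sleep(delay_s)


# Stand-in for a coordinator that becomes reachable on the third attempt.
attempts = []


def flaky_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("coordinator not ready")
    return "connected"


result = connect_with_retries(flaky_connect, max_tries=4)
```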
Testing:
- Ran e2e shell tests.
- Ran impala-shell with connect_max_tries as 100 before starting
impala coordinator daemon, verified that impala-shell connects to
coordinator after coordinator daemon was started.
Change-Id: I5f7caeb91a69e71a38689785fb1636094295fdb1
Reviewed-on: http://gerrit.cloudera.org:8080/19105
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a shell option called "hs2_fp_format"
which controls the print format of floating-point values in HS2.
It lets the user specify a Python format specification
(https://docs.python.org/2.7/library/string.html#formatspec)
which is parsed and applied to floating-point
column values. The default value is None, in which case the
formatting is the same as before this change.
This option does not support the Beeswax protocol, because Beeswax
converts all of the column values to strings in its response.
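A sketch of applying such a format specification with the built-in
format(). The helper format_float() is hypothetical, and how impala-shell
applies the spec internally is not stated here:

```python
def format_float(value, fp_format=None):
    """Apply a format spec to a float; None keeps the default str() output."""
    if fp_format is None:
        return str(value)
    return format(value, fp_format)  # raises ValueError for an invalid spec


a = format_float(3.14159, ".2f")
b = format_float(1234.5, "e")
c = format_float(0.5)
```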
Tests: command line tests for various formatting options and
for invalid formatting option
Change-Id: I424339266be66437941be8bafaa83fa0f2dfbd4e
Reviewed-on: http://gerrit.cloudera.org:8080/18990
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When running in the Docker-based tests, TestImpalaShell's
test_http_socket_timeout fails with a mismatch in the
error message. The test expected "Operation now in progress",
but in Docker-based tests it throws "Cannot assign requested
address". Since this is testing that a socket timeout of zero
gets an error, it seems reasonable to tolerate this extra
variant.
This modifies the test to allow this error message.
Testing:
- TestImpalaShell.test_http_socket_timeout passes
in the docker-based tests and in a normal core job
Change-Id: If463f1100db673bb916b094c1402f1876342c80e
Reviewed-on: http://gerrit.cloudera.org:8080/18899
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Similar to IMPALA-11332, the current VerticalOutputFormatter
strips trailing whitespace from the last line of output. This
rstrip() was intended to remove an extra newline,
but it also matches other whitespace. This is a
problem for a SQL query like:
select 'Trailing whitespace ';
This changes the rstrip() to rstrip('\n') to
avoid removing the other white space.
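The difference between the two calls, shown on a made-up output line:

```python
line = "Trailing whitespace   \n"

stripped_all = line.rstrip()          # drops the spaces as well
stripped_newline = line.rstrip("\n")  # drops only the newline
```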
Testing:
- Current shell tests pass
- Added a shell test that verifies trailing whitespace
is not being stripped.
Change-Id: Id66162d28498e7bef2933651616cf3df2fb0f354
Reviewed-on: http://gerrit.cloudera.org:8080/18722
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Build Python 3 eggs for the shell tarball so it works with both Python 2
and Python 3. The impala-shell script selects eggs based on the
available Python version.
Inlines thrift for impala-shell so we can easily build Python 2 and
Python 3 versions, consistent with other libraries. The impala-shell
version should always be at least as new as IMPALA_THRIFT_PY_VERSION.
Thrift 0.13.0+ wraps all exceptions during TSocket read/write operations
in TTransportException. Specifically, socket.error exceptions that we
previously received raw are now wrapped. Unwraps them before raising to
preserve prior behavior.
A specific Python version can be selected with IMPALA_PYTHON_EXECUTABLE;
otherwise it will use 'python', and if unavailable try 'python3'.
Adds tests for impala-shell tarball with Python 3.
Change-Id: I94f86de9e2a6303151c2f0e6454b5f629cbc9444
Reviewed-on: http://gerrit.cloudera.org:8080/18653
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When redirecting stdout and stderr to a file, the
existing code can sometimes output the "Fetched X row(s)"
line before finishing the row output. e.g.
impala-shell -B -q "select 1" >> outfile.txt 2>> outfile.txt
The rows output goes to stdout while the control messages
like "Fetched X row(s)" go to stderr. Since stdout can buffer
output, that can delay the output. This adds a flush for
stdout before writing the "Fetched X row(s)" message.
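A minimal sketch of the ordering, assuming a hypothetical helper named
emit_results (the shell's real output path is more involved):

```python
import sys


def emit_results(rows, out=sys.stdout, err=sys.stderr):
    """Write rows to `out`, then the summary line to `err`."""
    for row in rows:
        out.write(row + "\n")
    # Push any buffered rows out before the control message is written,
    # so both streams appear in order when redirected to the same file.
    out.flush()
    err.write("Fetched %d row(s)\n" % len(rows))
```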
Testing:
- Added a shell test that redirects stdout and stderr to
a file and verifies the contents. This consistently
fails without the flush.
- Other shell tests pass
Change-Id: I83f89c110fd90d2d54331c7121e407d9de99146c
Reviewed-on: http://gerrit.cloudera.org:8080/18625
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Sets up a virtualenv with system python to install the impala-shell PyPI
package into. Using system python provides better coverage for Python
versions likely to be used by customers. Runs impala-shell tests using
the PyPI package to provide better coverage for the artifact customers
will use.
Includes a PyPI install in notests_independent_targets because these
seem to be used for Python testing despite -notests.
Change-Id: I384ea6a7dab51945828cca629860400a23fa0c05
Reviewed-on: http://gerrit.cloudera.org:8080/18586
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
In vertical mode, impala-shell prints each row in the following format:
first a line containing the row number, then the row's columns one per
line, each line starting with the column's name and a colon.
To enable it, use the shell option '-E' or '--vertical', or 'set
VERTICAL=true' in interactive mode. To disable it in interactive mode,
use 'set VERTICAL=false'. Note: it is disabled if the '-B' option or
'set WRITE_DELIMITED=true' is specified.
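The layout can be sketched as follows. The function name format_vertical
and the exact header text are assumptions; the header printed by the real
shell may look different:

```python
def format_vertical(column_names, rows):
    """Render rows one column per line, preceded by a numbered header line."""
    lines = []
    for number, row in enumerate(rows, start=1):
        # Hypothetical header layout; only the numbering is the point here.
        lines.append("*** %d. row ***" % number)
        for name, value in zip(column_names, row):
            lines.append("%s: %s" % (name, value))
    return "\n".join(lines)


print(format_vertical(["id", "name"], [(1, "a"), (2, "b")]))
```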
Tests:
Added test methods in test_shell_interactive.py and test_shell_commandline.py.
Change-Id: I5cee48d5a239d6b7c0f51331275524a25130fadf
Reviewed-on: http://gerrit.cloudera.org:8080/18549
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The current CSV output strips trailing
whitespace from the last line of CSV output. This
rstrip() was intended to remove an extra newline,
but it also matches other whitespace. This is a
problem for a SQL query like:
select 'Trailing whitespace ';
This changes the rstrip() to rstrip('\n') to
avoid removing the other white space.
Testing:
- Current shell tests pass
- Added a shell test that verifies trailing whitespace
is not being stripped.
Change-Id: I69d032ca2f581587b0938d0878fdf402fee0d57e
Reviewed-on: http://gerrit.cloudera.org:8080/18580
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When using the --output_file commandline option for
impala-shell, the shell fails with UnicodeDecodeError
if the output contains Unicode characters.
For example, if running this command:
impala-shell -B -q "select '引'" --output_file=output.txt
This fails with:
UnicodeDecodeError : 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
This happens due to an encode('utf-8') call happening
in OutputStream::write() on a string that is already UTF-8 encoded.
This changes the code to skip the encode('utf-8') call for Python 2.
Python 3 is using a string and still needs the encode call.
This is mostly a pragmatic fix to make the code a little bit
more functional, and there is more work to be done to have
clear contracts for the format() methods and clear points
of conversion to/from bytes.
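The version-dependent behavior can be sketched like this. The helper name
to_output_bytes is hypothetical and only illustrates the idea, not the
actual OutputStream code:

```python
import sys


def to_output_bytes(formatted):
    """Return the bytes to write to the output file."""
    if sys.version_info[0] == 2:
        # In Python 2 the formatted output is already UTF-8 encoded bytes;
        # calling encode() again would trigger an implicit ASCII decode.
        return formatted
    # In Python 3 the formatted output is text and must be encoded.
    return formatted.encode('utf-8')
```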
Testing:
- Ran shell tests with Python 2 and Python 3 on Ubuntu 18
- Added a shell test that outputs a Unicode character
to an output file. Without the fix, this test fails.
Change-Id: Ic40be3d530c2694465f7bd2edb0e0586ff0e1fba
Reviewed-on: http://gerrit.cloudera.org:8080/18576
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The original issue is that the strict HS2 shell tests
are not running in precommit or nightly jobs, but they
do run in local developer environments. Investigating
this showed that the shell tests were running with a
weird set of test dimensions that includes
table_format_and_file_extension. That dimension is only
used in test_insert.py::TestInsertFileExtension.
What is happening is that the shell tests and other
locations are running add_test_dimensions() without
calling super(..., cls).add_test_dimensions(). The
behavior is unclear, but there is clearly cross-talk
between the different tests that do this.
This changes all add_test_dimensions() locations to
call super(..., cls).add_test_dimensions() if they
don't already. Each location has been tuned to run
the same set of tests as before (except the shell
tests which now run the strict HS2 tests).
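The pattern looks roughly like this. BaseSuite and ShellSuite are
hypothetical stand-ins for the real test classes, and the dimension names
are made up:

```python
class BaseSuite(object):
    """Hypothetical stand-in for the shared test base class."""

    @classmethod
    def add_test_dimensions(cls):
        # Start each class with its own fresh list of dimensions.
        cls.dimensions = ["base_dimension"]


class ShellSuite(BaseSuite):
    @classmethod
    def add_test_dimensions(cls):
        # Initialize this class's dimensions via the parent first,
        # then add the dimensions specific to this suite.
        super(ShellSuite, cls).add_test_dimensions()
        cls.dimensions.append("protocol")
```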
As part of this, several shell tests need to be
skipped or fixed for strict HS2.
Testing:
- Ran core job
- Ran tests locally to verify the set of tests
didn't change.
Change-Id: Ib20fd479d3b91ed0ed89a0bc5623cd2a5a458614
Reviewed-on: http://gerrit.cloudera.org:8080/18557
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes a few impala-shell Python 3 issues:
1. In ImpalaShell's do_history(), the decode() call needs to be
avoided in Python 3, because in Python 3 the cmd is already
a string and doesn't need further decoding. (IMPALA-11315)
2. TestImpalaShell.test_http_socket_timeout() gets a different
error message in Python 3. It throws "BlockingIOError"
rather than "socket.error". (IMPALA-11316)
3. ImpalaHttpClient.py's code to retrieve the body when
handling an HTTP error needs to have a decode() call
for the body. Otherwise, the body remains bytes and
causes TestImpalaShellInteractive.test_http_interactions_extra()
to fail. (IMPALA-11317)
Testing:
- Ran shell tests in the standard way
- Ran shell tests with the impala-shell executable coming from
a Python 3 virtualenv using the PyPi package
Change-Id: Ie58380a17d7e011f4ce96b27d34717509a0b80a6
Reviewed-on: http://gerrit.cloudera.org:8080/18556
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Thrift 0.11.0 has known issues where Unicode errors are
not handled properly, including one case where the client
can hang. The traditional form factor for impala-shell
uses a patched Thrift that fixes those issues, but the
PyPi package uses the unpatched Thrift 0.11.0.
This modifies the requirements.txt file to use Thrift 0.14.2,
which has fixes for these Unicode issues. Thrift 0.14.2 has
a slightly different error message, so this amends the
allowed error messages in test_utf8_decoding_error_handling().
This is a bit awkward, given that the Python code generation
continues to happen with Thrift 0.11.0. Comparing the
Python code for Thrift 0.11 vs Thrift 0.14, I didn't see
noticeable differences. Given that the client can hang,
this seems worth fixing ahead of the full conversion to
Thrift 0.14 for all of Impala.
Testing:
- Ran the Unicode error handling tests with a PyPi
impala-shell
- Ran the shell tests normally
Change-Id: I63e0a5dda98df20c9184a347397118b1f3529603
Reviewed-on: http://gerrit.cloudera.org:8080/18560
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The strict hs2 protocol mode is broken when fetching large results.
The FetchResults.hasMoreRows field is always returned as false. When
there are no more results, Hive returns an empty batch with no rows.
HIVE-26108 has been filed to support the hasMoreRows field.
Added a framework test that retrieves 1M rows from tpcds. The default
number of rows returned from Hive is 10K so this should be more than
enough to ensure that multiple fetches are done.
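A fetch loop that tolerates this behavior stops on an empty batch instead
of relying on hasMoreRows. This is a sketch only; fetch_all and fetch_batch
are hypothetical names, not the shell's API:

```python
def fetch_all(fetch_batch):
    """Call fetch_batch() until it returns an empty batch; collect all rows."""
    rows = []
    while True:
        batch = fetch_batch()
        if not batch:
            # An empty batch is the only end-of-results signal available.
            break
        rows.extend(batch)
    return rows
```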
Change-Id: Ife436d91e7fe0c30bf020024e20a5d8ad89faa24
Reviewed-on: http://gerrit.cloudera.org:8080/18370
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
In 'hs2-http' mode, the socket timeout is None, which could cause
hang-like symptoms in case of a problematic remote server.
Added support for configurable socket timeout using the new impala-shell
config option '--http_socket_timeout_s'. If a reasonable timeout is
set, impala-shell client can retry in case of connection issues, when
possible. The default value of '--http_socket_timeout_s' is set to None,
to prevent behavior changes for existing clients.
More details on socket timeout here:
https://docs.python.org/3/library/socket.html#socket-timeouts
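As an illustration of the default, the sketch below parses the option and
applies it to a socket. Using argparse and the helper names parse_timeout
and apply_timeout are assumptions; impala-shell's own option handling is
not shown here:

```python
import argparse
import socket


def parse_timeout(argv):
    """Return the timeout in seconds, or None if the option is not given."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--http_socket_timeout_s", type=float, default=None)
    return parser.parse_args(argv).http_socket_timeout_s


def apply_timeout(sock, timeout_s):
    # A timeout of None puts the socket in blocking mode (no timeout).
    sock.settimeout(timeout_s)
```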
Testing:
- Added tests for various timeout values in test_shell_commandline.py
- Ran e2e shell tests.
Change-Id: I29fa4ff96cdcf154c3aac7e43340af60d7d61e94
Reviewed-on: http://gerrit.cloudera.org:8080/18336
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
The insert command was broken for impala-shell in strict_hs2
mode. Callers of close_dml expect it to return two values:
rows returned and error rows. Strict hs2 mode cannot supply these,
since the close does not return the TDmlResult structure, so the
message shown to the end user also had to be changed.
Change-Id: Ibe837c99e54d68d1e27b97f0025e17faf0a2cb9f
Reviewed-on: http://gerrit.cloudera.org:8080/18176
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Impala-shell already uses HS2 protocol to connect to Impalad.
This commit allows impala-shell to connect to any server (for
example, Hive) using the hs2 protocol. This will be done via
the "--strict_hs2_protocol" option.
When the "--strict_hs2_protocol" option is turned on, only features
supported by hs2 will work. For instance, "runtime-profile" is an
impalad specific feature and will be disabled.
The "--strict_hs2_protocol" will only work on servers that abide
by the strict definition of what is supported by HS2. So one will
be able to connect to Hive in this mode, but connections to Impala
will not work. Any feature supported by Hive (e.g. kerberos
authentication) should work as well.
Note: While authentication should work, the test framework is not
set up to create an HS2 server that does authentication at this point
so this feature should be used with caution.
Change-Id: I674a45640a4a7b3c9a577830dbc7b16a89865a9e
Reviewed-on: http://gerrit.cloudera.org:8080/17660
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The issue was that when impala-shell is started in a separate
process and an error is encountered, the process lingers. A
long-running query can then hold on to resources and potentially
affect other tests running on the Impala cluster.
This patch just makes sure that the impala-shell process is killed
regardless of any errors encountered.
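The cleanup pattern can be sketched with try/finally. The helper name
run_with_cleanup and the `interact` callback are hypothetical; the test
framework's own wrapper differs in detail:

```python
import subprocess


def run_with_cleanup(args, interact):
    """Start a child process and make sure it is gone when we return."""
    proc = subprocess.Popen(args)
    try:
        return interact(proc)
    finally:
        # Runs on success and on any exception raised by interact().
        if proc.poll() is None:
            proc.kill()
        proc.wait()
```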
Change-Id: I9f6d22d639921051cde5675fae1845bedb61c8cc
Reviewed-on: http://gerrit.cloudera.org:8080/17768
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to other remote
FileSystems.
New flags for GCS:
- num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.
Follow-up:
- Support for spilling to GCS will be addressed in IMPALA-10561.
- Support for caching GCS file handles will be addressed in
IMPALA-10568.
- test_concurrent_inserts and test_failing_inserts in
test_acid_stress.py are skipped due to slow file listing on
GCS (IMPALA-10562).
- Some tests are skipped due to issues introduced by /etc/hosts setting
on GCE instances (IMPALA-10563).
Tests:
- Compile and create hdfs test data on a GCE instance. Upload test data
to a GCS bucket. Modify all locations in HMS DB to point to the GCS
bucket. Remove some hdfs caching params. Run CORE tests.
- Compile and load snapshot data to a GCS bucket. Run CORE tests.
Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala shell outputs a batch of rows using OutputStream. Inside
OutputStream, output to a file is handled slightly differently from
output that is written to stdout. When writing to stdout we use print()
(which appends a newline) while when writing to a file we use write()
(which adds nothing). This difference was introduced in IMPALA-3343 so
this bug may be a regression introduced then. To ensure that output is
the same in either case we need to add a newline after writing each
batch of rows to a file.
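A minimal sketch of the file path after the fix, assuming a hypothetical
helper named write_batch:

```python
import io


def write_batch(stream, formatted_rows):
    """Write one formatted batch to a file-like stream."""
    stream.write(formatted_rows)
    # print() would have appended this newline; write() does not.
    stream.write("\n")


buf = io.StringIO()
write_batch(buf, "1\n2")
write_batch(buf, "3\n4")
```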
TESTING:
Added a new test for this case.
Change-Id: I078a06c54e0834bc1f898626afbfff4ded579fa9
Reviewed-on: http://gerrit.cloudera.org:8080/16966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A username can be determined for a session via two mechanisms:
* In a secure env, the user is authenticated by LDAP or Kerberos
* In an unsecure env, the client specifies the user name, either
as a parameter to the OpenSession API (HS2) or as a parameter
to the first query run (beeswax)
This patch affects what happens if neither of the above mechanisms
is used. Previously we would end up with the username being an
empty string, but this makes Ranger unhappy. Hive uses the name
"anonymous" in this situation, so we change Impala's behaviour too.
This is configurable by -anonymous_user_name. Setting it to the empty
string (-anonymous_user_name=) reverts to the old behaviour.
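The fallback order can be sketched in a few lines. The server-side logic
is not Python; effective_user and its parameters are hypothetical names
used only to illustrate the rule:

```python
def effective_user(authenticated_user, client_user,
                   anonymous_user_name="anonymous"):
    """Pick the session user, falling back to the configured anonymous name."""
    # Authentication wins, then the client-supplied name, then the fallback.
    return authenticated_user or client_user or anonymous_user_name
```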
Test
* Add an end-to-end test that exercises this via impala-shell for
HS2, HS2-HTTP and beeswax protocols.
* Tweak a couple of existing tests that depended on the previous
behavior.
Change-Id: I6db491231fa22484aed476062b8fe4c8f69130b0
Reviewed-on: http://gerrit.cloudera.org:8080/16902
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In some branches where impala-shell still uses an older version of
thrift, e.g. thrift-0.9.3-p8, test_utf8_decoding_error_handling fails
because the internal string representation of thrift versions lower
than 0.10.0 is still bytes. Strings are not decoded to unicode, so
there are no decoding errors. The test expects bytes that can't be
decoded correctly to be replaced with U+FFFD, so it fails.
This patch improves the test by also accepting the results produced by
older thrift versions, so it can be cherry-picked to older branches.
Tests:
- Verify the test in master branch and a downstream branch that still
uses thrift-0.9.3-p8 in impala-shell.
Change-Id: Ieb0baa9b3a1480673af77f7cc35c05eacf4b449f
Reviewed-on: http://gerrit.cloudera.org:8080/16767
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This test for IMPALA-897 is testing that queries run by Impala Shell
from a script file are closed correctly. This is tested by an assertion
that there is one in-flight query during execution of a script
containing several queries. The test then closes the shell and checks
that there are no in-flight queries. This is the assertion which failed.
Change this assertion to instead wait for the number of in-flight
queries to be zero. This avoids whatever race was causing the flakiness.
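Waiting instead of asserting once can be sketched as a polling loop. The
helper name wait_for_value is an assumption, not the test framework's
actual API:

```python
import time


def wait_for_value(get_value, expected, timeout_s=10.0, interval_s=0.1):
    """Poll get_value() until it equals `expected` or the timeout expires."""
    deadline = time.time() + timeout_s
    while True:
        if get_value() == expected:
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval_s)
```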
Change-Id: Ib0485097c34282523ed0df6faa143fee6f74676d
Reviewed-on: http://gerrit.cloudera.org:8080/16743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, impala-shell depends on thrift-0.11.0-p2, while impala
servers depend on thrift-0.9.3-p8. After 0.10.0, thrift changed its
internal string representation from bytes to unicode (THRIFT-3503) to
support Python3. THRIFT-2087 and THRIFT-5303 are two patches for
specifying an error handling method in decoding utf-8 strings in thrift.
Without them, impala-shell may get an unexpected UnicodeDecodeError when
decoding thrift objects from impala servers. This patch bumps
impala-shell's thrift version to 0.11.0-p4 to include these two patches.
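The effect of choosing an error handling method can be shown with Python's
built-in decoding. This illustrates the general mechanism only, not the
interface added by the Thrift patches:

```python
# Bytes that are not valid UTF-8 (0xff can never appear in UTF-8).
raw = b"abc\xff"

# Strict decoding (the default) raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="replace", the bad byte becomes U+FFFD instead.
replaced = raw.decode("utf-8", errors="replace")
```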
Tests:
- This is a regression from bumping impala-shell's thrift version to
0.11. Added a test to avoid the regression in the future.
Change-Id: I0f9898639b5648658efc2d3c5c0ee4721fb85776
Reviewed-on: http://gerrit.cloudera.org:8080/16700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When the --quiet flag is used with impala-shell, the intention is that
if the query is successful then only the query results should be
printed.
This patch fixes two cases where --quiet was not being respected:
- When using the HTTP transport and --client_connect_timeout_ms is
set, a warning is printed that the timeout is not applied.
- When running in non-interactive mode, a warning is printed that
--live_progress is automatically disabled. This warning is now also
only printed if --live_progress is actually set.
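The second case can be sketched as below. The function name and the
warning text are hypothetical and do not reproduce the shell's actual
message:

```python
import sys


def maybe_disable_live_progress(live_progress, interactive, quiet,
                                err=sys.stderr):
    """Return the effective live_progress setting, warning only when useful."""
    if live_progress and not interactive:
        # Warn only if the user asked for live_progress and --quiet is off.
        if not quiet:
            err.write("Warning: live_progress is disabled in "
                      "non-interactive mode\n")
        return False
    return live_progress if interactive else False
```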
Testing:
- Added a test that runs a simple query with --quiet and confirms the
output is as expected.
Change-Id: I1e94c9445ffba159725bacd6f6bc36f7c91b88fe
Reviewed-on: http://gerrit.cloudera.org:8080/16673
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>