This patch unifies the duplicated exec summary code used by the Python
beeswax clients: one used by the shell in impala_shell.py and one used by
tests in impala_beeswax.py. The code that has progressed furthest is the
one in shell/impala_client.py, which can print a correct exec summary
table for MT_DOP>0 queries. It is made into a dedicated
build_exec_summary_table function in impala_client.py, and
impala_beeswax.py now imports it from impala_client.py.
This patch also fixes several flake8 issues around the modified files.
Testing:
- Manually ran TPC-DS Q74 in impala-shell and then typed the "summary"
command. Confirmed that the plan tree is displayed properly.
- Ran single_node_perf_run.py over branches that produce different
TPC-DS Q74 plan trees. Confirmed that the plan trees are displayed
correctly in performance_result.txt
Change-Id: Ica57c90dd571d9ac74d76d9830da26c7fe20c74f
Reviewed-on: http://gerrit.cloudera.org:8080/22060
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
This change prints the timestamp of any exception or warning that
occurs during execution of a query via impala-shell.
The timestamp uses the timezone of the machine running impala-shell.
Example:
Query submitted at: 2024-08-22 16:17:57 (Coordinator: http://host:25000)
Query state can be monitored at:
http://localhost:25000/query_plan?query_id=e04dcc55e560d1ee:11173fe800000000
^C Cancelling Query
Opened TCP connection to localhost:21050
2024-08-22 16:17:58 [Exception] type=<class 'socket.error'> in FetchResults.
[Errno 4] Interrupted system call
2024-08-22 16:17:58 [Warning] Cancelling Query
2024-08-22 16:17:58 [Warning] close session RPC failed: <class
'shell_exceptions.QueryCancelledByShellException'>
Opened TCP connection to localhost:21050
[localhost:21050] default>
Change-Id: I4abbd02aa9f61210b0333495bf191e72c22a5944
Reviewed-on: http://gerrit.cloudera.org:8080/21426
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When using hs2-http protocol, http messages from Impala clients may pass
through one or more proxies before reaching the Impala coordinator.
This can make it harder to track the origin of the http messages. The
'X-Forwarded-For' header is added to or edited by HTTP proxies when
forwarding a request, so it may contain multiple source addresses. Add
the value of this header to the runtime profile so that it can be
observed.
Impala will truncate the 'X-Forwarded-For' header value at 8096
characters. Apart from this, Impala does not do any verification or
sanitization of this value, so its value should only be trusted if the
deployment environment protects against spoofing.
A good reference for understanding the use of 'X-Forwarded-For' is
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Forwarded-For
This patch does not address the cases where http proxies insert
multiple 'X-Forwarded-For' headers. This issue is tracked in
IMPALA-13335.
TESTING: add an option '--hs2_x_forward' to impala-shell which will
set the 'X-Forwarded-For' header. Add tests which verify that the value
is set in the profile, and that a long value is truncated correctly.
Change-Id: I2e010cfb09674c5d043ef915347c3836696e03cf
Reviewed-on: http://gerrit.cloudera.org:8080/21700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, Impala does an execute call, then the client polls
waiting for the operation to finish (or error out). The client
sleeps between polls, and this sleep time can be a substantial
percentage of a short query's execution time.
To reduce this client side sleep, this implements long polling to
provide an option to wait for query completion on the server side.
This is controlled by the long_polling_time_ms query option. If
set to greater than zero, status RPCs will wait for query
completion for up to that amount of time. This defaults to off (0ms).
Both Beeswax and HS2 add a wait for query completion in their
get status calls (get_state for Beeswax, GetOperationStatus for HS2).
This doesn't wait in the execute RPC calls (e.g. query for Beeswax,
ExecuteStatement for HS2), because neither includes the query status
in the response. The client will always need to do a separate status
RPC.
This modifies impala-shell and the beeswax client to avoid doing a
sleep if the get_state/GetOperationStatus calls take longer than
they would have slept. In other words, if they would have slept 50ms,
then they skip that sleep if the RPC to the server took longer than
50ms. This allows the client to maintain its sleep behavior with
older Impalas that don't use long polling while adapting properly
to systems that do have long polling. This has the added benefit
of adjusting for high latency to the server. This
does not change any of the sleep times.
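The client-side adjustment can be sketched roughly as follows. This is
an illustration only, not the shell's actual code: wait_for_completion,
get_state, the state names and the 50ms default are hypothetical.

```python
import time


def wait_for_completion(get_state, sleep_s=0.05,
                        finished_states=("FINISHED", "EXCEPTION")):
    """Poll get_state() until the query reaches a finished state.

    get_state is a hypothetical callable standing in for the status RPC
    (get_state for Beeswax, GetOperationStatus for HS2).
    """
    while True:
        start = time.time()
        state = get_state()
        rpc_time_s = time.time() - start
        if state in finished_states:
            return state
        # Skip the sleep when the RPC itself already took at least as long
        # as the sleep would have (long polling or a high-latency link).
        if rpc_time_s < sleep_s:
            time.sleep(sleep_s)
```

With an older server the RPC returns quickly and the loop sleeps as
before; with long polling enabled the RPC blocks on the server and the
sleep is skipped.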
Testing:
- This adds a test case in test_hs2.py to verify the long
polling behavior
Change-Id: I72ca595c5dd8a33b936f078f7f7faa5b3f0f337d
Reviewed-on: http://gerrit.cloudera.org:8080/19205
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Updates impala-shell to preserve all cookies by default, defined as
setting 'http_cookie_names=*'. Prior behavior of restricting cookies to
a user-specified list is preserved when 'http_cookie_names' is given any
value besides '*'. Setting 'http_cookie_names=' prevents any cookies
from being preserved.
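The three cases can be sketched as below. The helper name is
hypothetical, and treating any other value as a comma-separated list of
names is an assumption made for this example.

```python
def should_preserve_cookie(cookie_name, http_cookie_names):
    """Return True if a cookie with this name should be kept.

    http_cookie_names is the raw option value: '*' keeps every cookie,
    an empty string keeps none, and anything else is treated here as a
    comma-separated list of cookie names (an assumption).
    """
    if http_cookie_names == "*":
        return True
    allowed = [n.strip() for n in http_cookie_names.split(",") if n.strip()]
    return cookie_name in allowed
```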
Adds verbose output that prints all cookies that are preserved by the
HTTP client.
Existing cookie tests with LDAP still work. Adds a test where Impala
returns an extra cookie, and test verifies that verbose mode prints all
expected cookies.
Change-Id: Ic81f790288460b086ab218e6701e8115a996dfa7
Reviewed-on: http://gerrit.cloudera.org:8080/19827
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
The issue occurred in Python 3 when 0 rows were deleted from Iceberg.
It could also happen in other DMLs with older Impala servers where
TDmlResult.rows_deleted was not set. See the Jira for details of
the error.
Testing:
Extended shell tests for Kudu DML reporting to also cover Iceberg.
Change-Id: I5812b8006b9cacf34a7a0dbbc89a486d8b454438
Reviewed-on: http://gerrit.cloudera.org:8080/21284
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds limited UPDATE support for Iceberg tables. The
limitations mean users cannot update Iceberg tables if any of
the following is true:
* UPDATE value of partitioning column
* UPDATE table that went through partition evolution
* Table has SORT BY properties
The above limitations will be resolved by part 3. The usual limitations
like writing non-Parquet files, using copy-on-write, modifying V1 tables
are out of scope of IMPALA-12313.
This patch implements UPDATEs with the merge-on-read technique. This
means the UPDATE statement writes both data files and delete files.
Data files contain the updated records, delete files contain the
position delete records of the old data records that have been
touched.
To achieve the above this patch introduces a new sink: MultiDataSink.
We can configure multiple TableSinks for a single MultiDataSink object.
During execution, the row batches sent to the MultiDataSink will be
forwarded to all the TableSinks that have been registered.
The UPDATE statement for an Iceberg table creates a source select
statement with all table columns and virtual columns INPUT__FILE__NAME
and FILE__POSITION. E.g. imagine we have a table 'tbl' with schema
(i int, s string, k int), and we update the table with:
UPDATE tbl SET k = 5 WHERE i % 100 = 11;
The generated source statement will be ==>
SELECT i, s, 5, INPUT__FILE__NAME, FILE__POSITION
FROM tbl WHERE i % 100 = 11;
Then we create two table sinks that refer to expressions from the above
source statement:
Insert sink (i, s, 5)
Delete sink (INPUT__FILE__NAME, FILE__POSITION)
The tuples in the rowbatch of MultiDataSink contain slots for all the
above expressions (i, s, 5, INPUT__FILE__NAME, FILE__POSITION).
MultiDataSink forwards each row batch to each registered TableSink.
They will pick their relevant expressions from the tuple and write
data/delete files. The tuples are sorted by INPUT__FILE__NAME and
FILE__POSITION because we need to write the delete records in this
order.
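The forwarding idea can be illustrated with a small Python model. This
is only a sketch of the concept, not the actual sink implementation;
the class layout and the slot_indexes parameter are hypothetical.

```python
class TableSink(object):
    """Toy sink that keeps only the slots it is interested in."""

    def __init__(self, name, slot_indexes):
        self.name = name
        self.slot_indexes = slot_indexes
        self.written = []

    def send(self, row_batch):
        for row in row_batch:
            self.written.append(tuple(row[i] for i in self.slot_indexes))


class MultiDataSink(object):
    """Forwards every row batch to all registered table sinks."""

    def __init__(self, sinks):
        self.sinks = sinks

    def send(self, row_batch):
        for sink in self.sinks:
            sink.send(row_batch)


# Row layout: (i, s, 5, INPUT__FILE__NAME, FILE__POSITION)
insert_sink = TableSink("insert", [0, 1, 2])
delete_sink = TableSink("delete", [3, 4])
multi_sink = MultiDataSink([insert_sink, delete_sink])
multi_sink.send([(111, "a", 5, "f1.parquet", 7)])
```

After the call, the insert sink holds the updated record and the delete
sink holds the file name and position of the old record.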
For partitioned tables we need to shuffle and sort the input tuples.
In this case we also add virtual columns "PARTITION__SPEC__ID" and
"ICEBERG__PARTITION__SERIALIZED" to the source statement and shuffle
and sort the rows based on them.
Data files and delete files are now separated in the DmlExecState, so
at the end of the operation we'll have two sets of files. We use these
two sets to create a new Iceberg snapshot.
Why does this patch have the limitations?
- Because we are shuffling and sorting rows based on the delete
records and their partitions. This means that the new data files
might not get written in an efficient way, e.g. there will be
too many of them, or we will need to keep too many open file
handles during writing.
Also, if the table has SORT BY properties, we cannot respect
it as the input rows are ordered in a way to favor the position
deletes.
Part 3 will introduce a buffering writer for position delete
files. This means we will shuffle and sort records based on
the data records' partitions and SORT BY properties while
delete records get buffered and written out at the end (sorted
by file_path and position). In some edge cases the delete records
might not get written efficiently, but it is a smaller problem
than inefficient data files.
Testing:
* negative tests
* planner tests
* update all supported data types
* partitioned tables
* Impala/Hive interop tests
* authz tests
* concurrent tests
Change-Id: Iff0ef6075a2b6ebe130d15daa389ac1a505a7a08
Reviewed-on: http://gerrit.cloudera.org:8080/20677
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
in python 3 environment when kerberos_host_fqdn option is used
In Python 2, the sasl layer does not accept unicode strings,
so we have to explicitly encode the kerberos_host_fqdn string
to ascii. In Python 3, however, we have to omit the encode,
because otherwise impala-shell tries to use the following
service principal during Kerberos auth:
my_service_name/b'my.kerberos.host.fqdn'@MY.REALM
instead of the correct one, which is:
my_service_name/my.kerberos.host.fqdn@MY.REALM
(This is because the output of the encode function
is a byte array in python 3.)
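A minimal sketch of the version check, assuming hypothetical helper
names (sasl_host, service_principal) that are not part of impala-shell:

```python
import sys


def sasl_host(kerberos_host_fqdn):
    """Return the host value in the form the sasl layer expects."""
    if sys.version_info[0] < 3:
        # Python 2: the sasl layer rejects unicode, so encode to ascii.
        return kerberos_host_fqdn.encode("ascii")
    # Python 3: keep the str. Encoding would produce bytes, and the
    # bytes repr (b'...') would end up inside the principal.
    return kerberos_host_fqdn


def service_principal(service, host, realm):
    return "{0}/{1}@{2}".format(service, host, realm)
```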
Tested with new unit tests and with a snapshot build
manually in CDP PVC DS.
Change-Id: I8b157d76824ad67faf531a529256a8afe2ab9d49
Reviewed-on: http://gerrit.cloudera.org:8080/20691
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
The end time of the exact same rpc call was different between stdout
and the rpc details file because the end time was calculated each
time the details were written out instead of calculating the end time
once and reusing that value.
The duration of each rpc call was being calculated incorrectly.
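One way to guarantee a single end time is to record it once and reuse
it for every output. The class below is a hypothetical sketch and
assumes the duration is simply end time minus start time.

```python
import time


class RpcTiming(object):
    """Records the start time on creation and the end time only once."""

    def __init__(self):
        self.start = time.time()
        self.end = None

    def finish(self):
        # Later calls return the cached value instead of a new timestamp,
        # so stdout and the details file see the same end time.
        if self.end is None:
            self.end = time.time()
        return self.end

    def duration_ms(self):
        return (self.finish() - self.start) * 1000.0
```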
Change-Id: Ifd9dec189d0f6fb8713fb1c7b2b6c663e492ef05
Reviewed-on: http://gerrit.cloudera.org:8080/19932
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This support was modeled after the LDAP authentication.
If JWT authentication is used, the Impala shell enforces the use of the
hs2-http protocol since the JWT is sent via the "Authorization"
HTTP header.
The following flags have been added to the Impala shell:
* -j, --jwt: indicates that JWT authentication will be used
* --jwt_cmd: shell command to run to retrieve the JWT to use for
authentication
Testing
New Python tests have been added:
* The shell tests ensure that the various command line arguments are
handled properly. They assert situations such as allowing only a single
authentication method and refusing to send JWTs in clear text without
the proper arguments.
* The Python custom cluster tests leverage a test JWKS and test JWTs.
Then, a custom Impala cluster is started with the test JWKS. The
Impala shell attempts to authenticate using a valid JWT, an expired
(invalid) JWT, and a valid JWT signed by a different, untrusted JWKS.
These tests also exercise the Impala JWT authentication mechanism and
assert the prometheus JWT auth success and failure metrics are
reported accurately.
Change-Id: I52247f9262c548946269fe5358b549a3e8c86d4c
Reviewed-on: http://gerrit.cloudera.org:8080/19837
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In _do_beeswax_rpc(), the exception handling code tries to recognize the
exception and raise a more meaningful one. However, in the last
case for unknown exceptions, it does nothing, so the method just returns
None. The caller then fails with "'NoneType' object is not iterable",
since it expects the result to be a tuple of two items:
handle, rpc_status = self._do_beeswax_rpc(...)
This patch prints more details of the unknown exception and finally
raises an exception in _do_beeswax_rpc(), so the callers can show more
meaningful errors.
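The failure mode and the fix can be sketched with a simplified stand-in.
do_rpc below is hypothetical and much smaller than the real
_do_beeswax_rpc(); the ValueError branch only represents "a recognized
exception".

```python
def do_rpc(rpc):
    """Simplified stand-in: returns a (result, status) tuple."""
    try:
        return rpc(), "OK"
    except ValueError as e:
        # A recognized failure is re-raised as something more meaningful.
        raise RuntimeError("RPC failed: {0}".format(e))
    except Exception as e:
        # Without the raise below, this branch would fall through and the
        # function would return None, so `handle, status = do_rpc(...)`
        # would fail in the caller with a confusing TypeError.
        print("Caught exception {0}, type={1}".format(e, type(e)))
        raise Exception("Encountered unknown exception")
```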
Tests:
I can't reproduce the error mentioned in the JIRA description, so I
manually modified the code to give _do_beeswax_rpc() a function that
always throws an exception. Here is the console output:
$ impala-shell.sh --protocol=beeswax
[localhost:21000] default> select 1;
Query: select 1
Query submitted at: 2023-04-21 10:24:57 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
Caught exception My error, type=<type 'exceptions.Exception'>
Traceback (most recent call last):
File "/home/quanlong/workspace/Impala/shell/impala_client.py", line 1531, in _do_beeswax_rpc
ret = rpc()
File "/home/quanlong/workspace/Impala/shell/impala_client.py", line 1412, in myFunc
raise Exception("My error")
Exception: My error
Unknown Exception : Encountered unknown exception
Traceback (most recent call last):
File "/home/quanlong/workspace/Impala/shell/impala_shell.py", line 1325, in _execute_stmt
query_str, self.set_query_options)
File "/home/quanlong/workspace/Impala/shell/impala_client.py", line 1414, in execute_query
handle, rpc_status = self._do_beeswax_rpc(myFunc)
File "/home/quanlong/workspace/Impala/shell/impala_client.py", line 1604, in _do_beeswax_rpc
raise Exception("Encountered unknown exception")
Exception: Encountered unknown exception
[Not connected] > Goodbye quanlong
Change-Id: I7d847251d3dab815af2427bf7701d60dc05af659
Reviewed-on: http://gerrit.cloudera.org:8080/19777
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
encodestring has been a deprecated alias to encodebytes since Python
3.1. It was removed in Python 3.9. However encodebytes was only added in
Python 3.1, so we need to test and use the appropriate call for each
version.
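One possible way to pick the right call is to look the function up at
runtime. The wrapper name b64_encode is hypothetical; base64.encodebytes
and base64.encodestring are the standard library functions named above.

```python
import base64


def b64_encode(data):
    """Base64-encode bytes with whichever function this Python provides."""
    # Prefer encodebytes; fall back to the old alias when it is missing.
    encode = getattr(base64, "encodebytes", None) or base64.encodestring
    return encode(data)
```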
Change-Id: If802eafa984a980d4442c4891876140ff9708096
Reviewed-on: http://gerrit.cloudera.org:8080/19635
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
When using the hs2 protocol with the http transport, include several
tracing http headers by default. These headers are:
* X-Request-Id -- client-defined string that identifies the http
  request; this string is meaningful only to the client
* X-Impala-Session-Id -- session id generated by the Impala backend;
  omitted on http calls that occur before this id has been generated
* X-Impala-Query-Id -- query id generated by the Impala backend;
  omitted on http calls that occur before this id has been generated
The Impala shell includes these headers by default. The command
line argument --no_http_tracing has been added to remove these
headers.
The Impala backend logs these headers if they are present on the http
request. The log messages are written out at log level 2 (RPC).
Testing:
- manual testing (verified using debugging proxy and impala logs)
- new python test
Change-Id: I7857eb5ec03eba32e06ec8d4133480f2e958ad2f
Reviewed-on: http://gerrit.cloudera.org:8080/19428
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When the Impala shell is using the hs2 protocol, it makes multiple RPCs
to the Impala daemon. These calls pass Thrift objects back and forth.
This change adds the '--show_rpc' flag, which outputs the details of the
RPCs to stdout, and the '--rpc_file' flag, which outputs the RPC details
to the specified file path.
RPC details include:
- operation name
- request attempt count
- Impala session/query ids (if applicable)
- call duration
- call status (success/failure)
- request Thrift objects
- response Thrift objects
Certain information is not included in the RPC details:
- Thrift object attributes named 'secret' or 'password'
are redacted.
- Thrift objects with a type of TRowSet or TGetRuntimeProfileResp
are not included, as the information contained within them is
already available in the standard output from the Impala shell.
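The redaction rule can be illustrated on plain dictionaries standing in
for Thrift objects. The function name and the mask string are
hypothetical choices for this sketch.

```python
def redact(obj):
    """Return a copy with 'secret' and 'password' attributes masked.

    Plain dicts and lists stand in for Thrift objects here.
    """
    if isinstance(obj, dict):
        return {
            key: "******" if key in ("secret", "password") else redact(value)
            for key, value in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [redact(value) for value in obj]
    return obj
```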
Testing:
- Added new tests in the end-to-end test suite.
Change-Id: I36f8dbc96726aa2a573133acbe8a558299381f8b
Reviewed-on: http://gerrit.cloudera.org:8080/19388
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently the maximum number of tries for connecting to the coordinator
is hard coded to 4 in hs2-http mode. It should be configurable,
especially in environments where the coordinator starts slowly.
This patch adds support for a configurable maximum number of tries in
hs2-http mode using the new impala-shell config option
'--connect_max_tries'.
The default value of '--connect_max_tries' is 4.
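A bounded retry loop of this kind might look like the sketch below.
connect_with_retries, the connect callable and the sleep between tries
are hypothetical; only the default of 4 comes from this change.

```python
import time


def connect_with_retries(connect, max_tries=4, sleep_s=1.0):
    """Call connect() until it succeeds or max_tries attempts were made.

    connect is a hypothetical callable that raises on failure.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for attempt in range(max_tries):
        try:
            return connect()
        except Exception as e:
            last_error = e
            # Do not sleep after the final failed attempt.
            if attempt + 1 < max_tries:
                time.sleep(sleep_s)
    raise last_error
```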
Testing:
- Ran e2e shell tests.
- Ran impala-shell with connect_max_tries as 100 before starting
impala coordinator daemon, verified that impala-shell connects to
coordinator after coordinator daemon was started.
Change-Id: I5f7caeb91a69e71a38689785fb1636094295fdb1
Reviewed-on: http://gerrit.cloudera.org:8080/19105
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a shell option called "hs2_fp_format"
which manipulates the print format of floating-point values in HS2.
It lets the user specify a Python-based format specification
expression (https://docs.python.org/2.7/library/string.html#formatspec)
which will get parsed and applied to floating-point
column values. The default value is None, in which case the
formatting is the same as before this change.
This option does not support the Beeswax protocol, because Beeswax
converts all of the column values to strings in its response.
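The effect of such a format specification can be sketched with Python's
built-in format(). The helper names are hypothetical, and using str()
for the unset case is an assumption about the default formatting.

```python
def format_float(value, fp_format=None):
    """Format a floating-point column value with an optional format spec."""
    if fp_format is None:
        # No spec given: fall back to plain string conversion (assumed).
        return str(value)
    return format(value, fp_format)


def is_valid_fp_format(fp_format):
    """Check a spec up front by applying it to a sample float."""
    try:
        format(1.0, fp_format)
    except ValueError:
        return False
    return True
```

For example, format_float(3.14159, ".2f") returns "3.14".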
Tests: command line tests for various formatting options and
for invalid formatting option
Change-Id: I424339266be66437941be8bafaa83fa0f2dfbd4e
Reviewed-on: http://gerrit.cloudera.org:8080/18990
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Build Python 3 eggs for the shell tarball so it works with both Python 2
and Python 3. The impala-shell script selects eggs based on the
available Python version.
Inlines thrift for impala-shell so we can easily build Python 2 and
Python 3 versions, consistent with other libraries. The impala-shell
version should always be at least as new as IMPALA_THRIFT_PY_VERSION.
Thrift 0.13.0+ wraps all exceptions during TSocket read/write operations
in TTransportException. Specifically, the socket.error exceptions that we
used to get raw are now wrapped. Unwraps them before raising to preserve
prior behavior.
A specific Python version can be selected with IMPALA_PYTHON_EXECUTABLE;
otherwise it will use 'python', and if unavailable try 'python3'.
Adds tests for impala-shell tarball with Python 3.
Change-Id: I94f86de9e2a6303151c2f0e6454b5f629cbc9444
Reviewed-on: http://gerrit.cloudera.org:8080/18653
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch ports the implementation of GSSAPI authentication over http
transport from Impyla (https://github.com/cloudera/impyla/pull/415) to
impala-shell.
The implementation adds a new dependency on 'kerberos' python module,
which is a pip-installed module distributed under Apache License Version
2.
When using impala-shell with Kerberos over http, it is assumed that the
host has a preexisting kinit-cached Kerberos ticket that impala-shell
can pass to the server automatically without the user having to re-enter
the password.
Testing:
- Passed exhaustive tests.
- Tested manually on a real cluster with a full Kerberos setup.
Change-Id: Ia59ba4004490735162adbd468a00a962165c5abd
Reviewed-on: http://gerrit.cloudera.org:8080/18493
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The strict hs2 protocol mode is broken when fetching large results.
The FetchResults.hasMoreRows field is always returned as false. When
there are no more results, Hive returns an empty batch with no rows.
HIVE-26108 has been filed to support the hasMoreRows field.
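A fetch loop that does not depend on hasMoreRows could look like the
sketch below. fetch_all and fetch_batch are hypothetical names;
fetch_batch stands for one FetchResults RPC returning a list of rows.

```python
def fetch_all(fetch_batch):
    """Drain a result set without relying on hasMoreRows."""
    rows = []
    while True:
        batch = fetch_batch()
        if not batch:
            # An empty batch is the end-of-results signal in this mode.
            break
        rows.extend(batch)
    return rows
```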
Added a framework test that retrieves 1M rows from tpcds. The default
number of rows returned from Hive is 10K so this should be more than
enough to ensure that multiple fetches are done.
Change-Id: Ife436d91e7fe0c30bf020024e20a5d8ad89faa24
Reviewed-on: http://gerrit.cloudera.org:8080/18370
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
client
In 'hs2-http' mode, the socket timeout is None, which could cause
hang-like symptoms in case of a problematic remote server.
Added support for configurable socket timeout using the new impala-shell
config option '--http_socket_timeout_s'. If a reasonable timeout is
set, impala-shell client can retry in case of connection issues, when
possible. The default value of '--http_socket_timeout_s' is set to None,
to prevent behavior changes for existing clients.
More details on socket timeout here:
https://docs.python.org/3/library/socket.html#socket-timeouts
Testing:
- Added tests for various timeout values in test_shell_commandline.py
- Ran e2e shell tests.
Change-Id: I29fa4ff96cdcf154c3aac7e43340af60d7d61e94
Reviewed-on: http://gerrit.cloudera.org:8080/18336
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
The insert command was broken for impala-shell in the strict_hs2
mode. close_dml should return two values: rows returned and error
rows. These are not supported by strict hs2 mode since the close
does not return the TDmlResult structure, so the message to
the end user also had to be changed.
Change-Id: Ibe837c99e54d68d1e27b97f0025e17faf0a2cb9f
Reviewed-on: http://gerrit.cloudera.org:8080/18176
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Impala-shell already uses HS2 protocol to connect to Impalad.
This commit allows impala-shell to connect to any server (for
example, Hive) using the hs2 protocol. This will be done via
the "--strict_hs2_protocol" option.
When the "--strict_hs2_protocol" option is turned on, only features
supported by hs2 will work. For instance, "runtime-profile" is an
impalad specific feature and will be disabled.
The "--strict_hs2_protocol" will only work on servers that abide
by the strict definition of what is supported by HS2. So one will
be able to connect to Hive in this mode, but connections to Impala
will not work. Any feature supported by Hive (e.g. kerberos
authentication) should work as well.
Note: While authentication should work, the test framework is not
set up to create an HS2 server that does authentication at this point
so this feature should be used with caution.
Change-Id: I674a45640a4a7b3c9a577830dbc7b16a89865a9e
Reviewed-on: http://gerrit.cloudera.org:8080/17660
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-10234 added support for cookie authentication for LDAP to
impala-shell. But it does not accept user-supplied cookie names via
startup flags, and it retains only one cookie.
In some scenarios, a proxy may manage sessions with
additional HTTP cookies that it adds.
This patch makes cookie support more generic for impala-shell.
It lets the user specify cookie names via a startup flag
"--http_cookie_names" and can retain more than one cookie.
Testing:
- Manually tested multiple cookies in HTTP headers with a
customized Impala server which could send and receive multiple
cookies.
- Passed core test, including new test cases.
Change-Id: I193422d5ec891886a522d82ecb0e9d974132ff2a
Reviewed-on: http://gerrit.cloudera.org:8080/17667
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some changes to impala-shell are needed to make the client more HS2
compatible, including:
- when the fetch returns the bitset containing nulls, a bit that is
absent means the value is not null. Currently this case fails
the query.
- adding fetchType to TCLIServiceThrift structure (though unused
currently in Impala)
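The tolerant null check can be sketched as follows. The function name
is hypothetical, and the bit order (least significant bit first within
each byte) is an assumption made for this example.

```python
def is_null(nulls, row_idx):
    """Return True if row row_idx is marked NULL in the nulls bitset.

    nulls is the bytes value that accompanies a columnar result.
    """
    byte_idx = row_idx // 8
    if byte_idx >= len(nulls):
        # The bitset may be shorter than the column; a missing bit
        # means the value is not NULL.
        return False
    return bool(bytearray(nulls)[byte_idx] & (1 << (row_idx % 8)))
```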
Also a small refactor was done to put the functionality that retrieves
all query options into its own function.
Change-Id: Id3a4c4ce8a5d60db136df1743f32dba22172ee13
Reviewed-on: http://gerrit.cloudera.org:8080/17590
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
This change reduces the runtime of the following command from 8.5s to 1.5s on my
machine:
shell/impala_shell.py -B -q "select * from tpch_parquet.lineitem limit 100000;" --protocol hs2-http > /dev/null
This nearly eliminates the speed difference between hs2 and hs2-http.
The root cause of the original slowness is the large number of
calls to socket.recv(). The query above used to call it 2809090 times,
now it is only 9007.
Testing:
- ran shell tests
Change-Id: If11f287be65b10bee2b0afffea118e3dc70fdbbd
Reviewed-on: http://gerrit.cloudera.org:8080/17346
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
Before this patch Impala mainly used Thrift 0.9.3, but it was
possible to compile Impala shell with Thrift 0.11.0, so the 0.11.0
Thrift lib was already included in the toolchain.
Most of the changes are related to replacing boost:: with std::
shared_ptr-s in cpp code (this is a continuation of patch by Sahil).
The Thrift upgrade also needs an Impyla release with Thrift 0.11.0, as
Impala's test framework relies on Impyla. A thrift_sasl release is also
needed, because it currently pins Thrift version to 0.9.3 for Python 2.
The current patch uses alpha releases from Impyla and thrift_sasl that
use thrift 0.11.0.
Notable side effects:
- old logic to compile thrift for impala-shell with 0.11.0 was removed
- impala_shell's utf8 handling had to be updated as the new 0.11.0
compilation happens with no_utf8strings. This also made things a
bit faster, e.g. the following takes ~0.22s instead of ~0.25s
shell/impala_shell.py \
-B -q "select * from functional_parquet.alltypes;" > /dev/null
- THRIFT-3921 changed the stream operators to print an enum's name
instead of its number, leading to slightly different messages
in some cases.
- "templates" was added to the thrift generator's parameters to avoid
a compilation issue (related to IMPALA-10600). I didn't notice any
change in compilation time. This option generates .tcc files with
templatized readers/writers for Thrift types. Currently we don't
use these, but they could potentially speed up (de)serialization.
Testing:
- ran Impyla's test suite with Python 2 and 3
- ran core tests
Change-Id: Idd13f177b4f7acc07872ea6399035aa180ef6ab6
Reviewed-on: http://gerrit.cloudera.org:8080/17170
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
UnboundLocalError, local variable 'retry_msg' referenced before assign
ImpalaHS2Client._open_session() has a 'retry_msg' variable which was
not initialized in the code-path where retry was disabled. If an
exception was hit with retry disabled, an UnboundLocalError was
raised at runtime.
The fix is to initialize 'retry_msg' in the non retry code-path.
Testing:
- Forced exception in ImpalaHS2Client._open_session() and verified that
proper error message was generated.
- Ran impala-shell e2e and custom cluster tests.
Change-Id: I50a08a62a332de759022d0a4862e74f5a81945d9
Reviewed-on: http://gerrit.cloudera.org:8080/17172
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-8584 added support for cookie authentication to Impala.
This change adds cookie authentication support to impala-shell
as well when using 'hs2-http' protocol.
Testing:
- Unit tests were added to test cookie handling methods.
- Tested e2e manually with nginx HTTP proxy.
TODO:
- Test with Knox HTTP proxy as well.
Change-Id: Icb0bc6e0f58f236866ca9913a2e63d97d5148f51
Reviewed-on: http://gerrit.cloudera.org:8080/16660
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When Impala Shell receives an http error message (that is a message with
http code greater than or equal to 300), it may sleep for a time before
retrying. If the message contains a 'Retry-After' header that has an
integer value, then this will be used as the time for which to sleep.
The implementation is to use a new HttpError exception (similar to that
used in Impyla) which includes more information from the error message
(including the headers) so that catchers of the exception can use the
'Retry-After' header if appropriate.
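The sleep-time decision can be sketched as below. retry_sleep_secs and
the default of one second are hypothetical; only the rule "use
'Retry-After' when it holds an integer" comes from this change.

```python
def retry_sleep_secs(http_code, headers, default_secs=1.0):
    """Return how long to sleep before retrying, or None if not an error."""
    if http_code < 300:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return float(int(retry_after))
        except ValueError:
            # Not an integer (for example an HTTP date): use the default.
            pass
    return default_secs
```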
TESTING:
Hand testing with a proxy that uses the 'Retry-After' header.
Added new tests that use the fault injection framework in
test_hs2_fault_injection.py
Change-Id: I2b4226e7723d585d61deb4d1d6777aac901bfd93
Reviewed-on: http://gerrit.cloudera.org:8080/16702
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When the --quiet flag is used with impala-shell, the intention is that
if the query is successful then only the query results should be
printed.
This patch fixes two cases where --quiet was not being respected:
- When using the HTTP transport and --client_connect_timeout_ms is
set, a warning is printed that the timeout is not applied.
- When running in non-interactive mode, a warning is printed that
--live_progress is automatically disabled. This warning is now also
only printed if --live_progress is actually set.
Testing:
- Added a test that runs a simple query with --quiet and confirms the
output is as expected.
Change-Id: I1e94c9445ffba159725bacd6f6bc36f7c91b88fe
Reviewed-on: http://gerrit.cloudera.org:8080/16673
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends the 'summary' command of impala-shell to support
retrieving the summary of the original query attempt. The new syntax is
SUMMARY [ALL | LATEST | ORIGINAL]
If 'ALL' is specified, both the latest and original summaries are
printed. If 'LATEST' is specified, only the summary of the latest query
attempt is printed. If 'ORIGINAL' is specified, only the summary of the
original query attempt is printed. The default option is 'LATEST'.
Support for this has only been added to HS2 given that Beeswax is being
deprecated soon.
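A rough sketch of how the option could be parsed is shown below. The
function name parse_attempt_option is hypothetical, and treating the
option as case-insensitive is an assumption rather than something this
message states.

```python
VALID_OPTIONS = ("all", "latest", "original")


def parse_attempt_option(args):
  """Map the argument of SUMMARY to 'all', 'latest' or 'original'.

  An empty argument selects the default, 'latest'. Anything else that is
  not a known option raises ValueError.
  """
  option = args.strip().lower()
  if not option:
    return "latest"
  if option in VALID_OPTIONS:
    return option
  raise ValueError("Unknown option: {0}".format(args))
```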
Tests:
- Add new tests in test_shell_interactive.py
Change-Id: I8605dd0eb2d3a2f64f154afb6c2fd34251c1fec2
Reviewed-on: http://gerrit.cloudera.org:8080/16502
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch introduces a new startup flag
--ping_expose_webserver_url (true by default) to control whether
PingImpalaService, PingImpalaHS2Service RPC calls should expose
the debug web url to the client or not.
This is necessary as the debug web UI is not something that
end-users will necessarily have access to.
If the flag is set to false, the RPC calls will return an empty
string instead of the real url signalling that the debug web ui
is not available.
Note that if the webserver is disabled (--enable_webserver flag
is set to false) the RPC calls will behave the same and return an
empty string for the url.
Change-Id: I7ec3e92764d712b8fee63c1f45b038c31c184cfc
Reviewed-on: http://gerrit.cloudera.org:8080/16573
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the impala-shell 'profile' command only returns the profile
for the most recent query attempt. There is no way to get the original
query profile (the profile of the first query attempt that failed) from
the impala-shell.
This patch modifies TGetRuntimeProfileReq and TGetRuntimeProfileResp to
add support for returning both the original and retried profiles for a
retried query. When a query is retried, TGetRuntimeProfileResp currently
contains the profile for the most recent query attempt.
TGetRuntimeProfileReq has a new field called 'include_query_attempts'
and when it is set to true, the TGetRuntimeProfileResp will include all
failed profiles in a new field called failed_profiles /
failed_thrift_profiles.
impala-shell has been modified so the 'profile' command has a new set of
options. The syntax is now:
PROFILE [ALL | LATEST | ORIGINAL]
If 'ALL' is specified, both the latest and original profiles are
printed. If 'LATEST' is specified, only the latest profile is printed.
If 'ORIGINAL' is specified, only the original profile is printed. The
default behavior is equivalent to specifying 'LATEST' (which is the
current behavior before this patch as well).
Support for this has only been added to HS2 given that Beeswax is being
deprecated soon. The new 'profile' options have no effect when the
Beeswax protocol is used.
Most of the code change is in impala-hs2-server and impala-server; a lot
of the GetRuntimeProfile code has been re-factored.
Testing:
* Added new impala-shell tests
* Ran core tests
Change-Id: I89cee02947b311e7bf9c7274f47dfc7214c1bb65
Reviewed-on: http://gerrit.cloudera.org:8080/16406
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Although ds_hll_sketch() returns a string value, the data is not
ASCII-encoded text but a bit sketch. Because of this, the shell
disconnects when it tries to decode the data.
The issue can be reproduced simply, for example by calling unhex
with an input that yields non-UTF-8 bytes.
Example: SELECT unhex("aa");
This patch replaces any characters that cannot be decoded as UTF-8
if we run into a UnicodeDecodeError after fetching the data.
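A minimal sketch of this kind of fallback, assuming the fetched value
arrives as a bytes object (the helper name decode_cell is hypothetical):

```python
def decode_cell(raw):
  """Decode fetched bytes as UTF-8, replacing undecodable bytes."""
  try:
    return raw.decode("utf-8")
  except UnicodeDecodeError:
    # errors="replace" substitutes U+FFFD for each undecodable sequence.
    return raw.decode("utf-8", errors="replace")
```

For the unhex("aa") example, the single byte 0xAA is not valid UTF-8 and
is replaced by U+FFFD instead of raising.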
This solution works with the Thrift 0.9.3 autogenerated gen-py
but still fails with Thrift 0.11.0.
For Thrift 0.11.0 the error is caught and an error message is sent
(this does not work with the beeswax protocol, because it raises a
different error (TypeError), which can occur for other reasons too).
Testing:
- Manual testing with these protocols: 'hs2-http', 'hs2', 'beeswax'
Change-Id: I0c5f1290356e21aed8ca7f896f953541942aed05
Reviewed-on: http://gerrit.cloudera.org:8080/16418
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Although ds_hll_sketch() returns a string value, the data is not
ASCII-encoded text but a bit sketch. Because of this, the shell
disconnects when it tries to decode the data.
The issue can be reproduced simply, for example by calling unhex
with an input that yields non-UTF-8 bytes.
Example: SELECT unhex("aa");
This patch replaces any characters that cannot be decoded as UTF-8
if we run into a UnicodeDecodeError after fetching the data.
This solution works with the Thrift 0.9.3 autogenerated gen-py
but still fails with Thrift 0.11.0.
For Thrift 0.11.0 the error is caught and an error message is sent
(this does not work with the beeswax protocol, because it raises a
different error (TypeError), which can occur for other reasons too).
Testing:
- Manual testing with these protocols: 'hs2-http', 'hs2', 'beeswax'
Change-Id: Ic5cfb907871ca83e5f04a39ca9d7a8e138d711a8
Reviewed-on: http://gerrit.cloudera.org:8080/16305
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
Beeswax clients use get_log() to retrieve the warning/error message
after the query finishes. HS2 clients use GetLog() for the same purpose.
This patch adds the retry information into the returned result if the
query is retried. So clients that print the log can show the original
query failure and the retried query id.
This patch also modifies impala-shell to extract the retried query id
and print the retried query link.
Here's an example of the impala-shell output:
Query: select count(*) from functional.alltypes where bool_col = sleep(60)
Query submitted at: 2020-06-18 21:23:52 (Coordinator: http://quanlong-OptiPlex-BJ:25000)
Query progress can be monitored at: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=7944ffee4d81cdd4:e7f9357a00000000
+----------+
| count(*) |
+----------+
| 3650 |
+----------+
WARNINGS: Original query failed:
Failed due to unreachable impalad(s): quanlong-OptiPlex-BJ:22001
Query has been retried using query id: 934b2734f67a1161:a0dbd60200000000
Retried query link: http://quanlong-OptiPlex-BJ:25000/query_plan?query_id=934b2734f67a1161:a0dbd60200000000
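Extracting the retried query id from the log, as in the output above,
could look something like the sketch below. The regular expression is
derived only from the sample log line shown here, and the function names
and the webserver_url parameter are assumptions for illustration.

```python
import re

# Pattern based on the sample log line in this message.
RETRY_LINE = re.compile(
    r"Query has been retried using query id: ([0-9a-f]+:[0-9a-f]+)")


def retried_query_id(log):
  """Return the retried query id found in the log, or None."""
  match = RETRY_LINE.search(log)
  return match.group(1) if match else None


def retried_query_link(webserver_url, log):
  """Build the query_plan link for the retried query, or None."""
  query_id = retried_query_id(log)
  if query_id is None:
    return None
  return "{0}/query_plan?query_id={1}".format(webserver_url, query_id)
```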
Tests:
- Add tests in test_query_retries.py to verify client logs returned
from GetLog().
- Run test_query_retries.py.
- Manually run queries in impala-shell and kill impalads. Verify
printed messages when the retried queries succeed or fail.
Change-Id: I58cf94f91a0b92eb9a3088bee3894ac157a954dc
Reviewed-on: http://gerrit.cloudera.org:8080/16093
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds the option --fetch_size to the Impala shell. This new option allows
users to specify the fetch size used when issuing fetch RPCs to the
Impala Coordinator (e.g. TFetchResultsReq and BeeswaxService.fetch).
This parameter applies for all client protocols: beeswax, hs2, hs2-http.
The default --fetch_size is set to 10240 (10x the default batch size).
The new --fetch_size parameter is most effective when result spooling is
enabled. When result spooling is disabled, Impala can only return a
single row batch per fetch RPC (so 1024 rows by default). When result
spooling is enabled, Impala can return up to 100 row batches per fetch
request.
Removes some logic in the impala_client.py file that attempts to
simulate a fetch_size. The code would issue multiple fetch requests to
fulfill the given fetch_size. This logic is no longer needed now that
result spooling is available.
Testing:
* Ran core tests
* Added new tests in test_shell_client.py and test_shell_commandline.py
Change-Id: I8dc7962aada6b38795241d067a99bd94fabca57b
Reviewed-on: http://gerrit.cloudera.org:8080/16041
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Sahil Takiar <stakiar@cloudera.com>
This is the main patch for making the impala-shell cross-compatible with
python 2 and python 3. The goal is to wind up with a version of the shell that will
pass python e2e tests irrespective of the version of python used to launch the
shell, under the assumption that the test framework itself will continue to run
with python 2.7.x for the time being.
Notable changes for reviewers to consider:
- With regard to validating the patch, my assumption is that simply passing
the existing set of e2e shell tests is sufficient to confirm that the shell
is functioning properly. No new tests were added.
- A new pytest command line option was added in conftest.py to enable a user
to specify a path to an alternate impala-shell executable to test. It's
possible to use this to point to an instance of the impala-shell that was
installed as a standalone python package in a separate virtualenv.
Example usage:
USE_THRIFT11_GEN_PY=true impala-py.test --shell_executable=/<path to virtualenv>/bin/impala-shell -sv shell/test_shell_commandline.py
The target virtualenv may be based on either python3 or python2. However,
this has no effect on the version of python used to run the test framework,
which remains tied to python 2.7.x for the foreseeable future.
- The $IMPALA_HOME/bin/impala-shell.sh now sets up the impala-shell python
environment independently from bin/set-pythonpath.sh. The default version
of thrift is thrift-0.11.0 (See IMPALA-9489).
- The wording of the header changed a bit to include the python version
used to run the shell.
Starting Impala Shell with no authentication using Python 3.7.5
Opened TCP connection to localhost:21000
...
OR
Starting Impala Shell with LDAP-based authentication using Python 2.7.12
Opened TCP connection to localhost:21000
...
- By far, the biggest hassle has been juggling str versus unicode versus
bytes data types. Python 2.x was fairly loose and inconsistent in
how it dealt with strings. As a quick demo of what I mean:
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> d = 'like a duck'
>>> d == str(d) == bytes(d) == unicode(d) == d.encode('utf-8') == d.decode('utf-8')
True
...and yet there are weird unexpected gotchas.
>>> d.decode('utf-8') == d.encode('utf-8')
True
>>> d.encode('utf-8') == bytearray(d, 'utf-8')
True
>>> d.decode('utf-8') == bytearray(d, 'utf-8') # fails the eq property?
False
As a result, this inconsistency was reflected in the way we handled
strings in the impala-shell code, but things still just worked.
In python3, there's a much clearer distinction between strings and bytes, and
as such, much tighter type consistency is expected by standard libs like
subprocess, re, sqlparse, prettytable, etc., which are used throughout the
shell. Even simple calls that worked in python 2.x:
>>> import re
>>> re.findall('foo', b'foobar')
['foo']
...can throw exceptions in python 3.x:
>>> import re
>>> re.findall('foo', b'foobar')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data0/systest/venvs/py3/lib/python3.7/re.py", line 223, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
Exceptions like this resulted in many, if not most, shell tests failing
under python 3.
What ultimately seemed like a better approach was to try to weed out as many
existing spurious str.encode() and str.decode() calls as I could, and try to
implement what has colloquially been called a "unicode sandwich" -- namely,
"bytes on the outside, unicode on the inside, encode/decode at the edges."
The primary spot in the shell where we call decode() now is when sanitising
input...
args = self.sanitise_input(args.decode('utf-8'))
...and also whenever a library like re required it. Similarly, str.encode()
is primarily used where a library like readline or csv requires it.
- PYTHONIOENCODING needs to be set to utf-8 to override the default setting for
python 2. Without this, piping or redirecting stdout results in unicode errors.
- from __future__ import unicode_literals was added throughout
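The "unicode sandwich" described in the list above can be illustrated
with a small Python 3 sketch. The function handle_line is hypothetical
and is not taken from the shell; it only shows the shape of the pattern:
decode once on the way in, work on text, encode once on the way out.

```python
import re


def handle_line(raw_line):
  """Decode at the input edge, work on text, encode at the output edge."""
  text = raw_line.decode("utf-8")  # bytes -> text, once
  # A text pattern applied to text input, so re does not raise TypeError.
  words = re.findall(r"\w+", text, re.UNICODE)
  result = " ".join(word.upper() for word in words)
  return result.encode("utf-8")  # text -> bytes, once
```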
Testing:
To test the changes, I ran the e2e shell tests the way we always do (against
the normal build tarball), and then I set up a python 3 virtual env with the
shell installed as a package, and manually ran the tests against that.
No effort has been made at this point to come up with a way to integrate
testing of the shell in a python3 environment into our automated test
processes.
Change-Id: Idb004d352fe230a890a6b6356496ba76c2fab615
Reviewed-on: http://gerrit.cloudera.org:8080/15524
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Added retries for idempotent rpcs:
OpenSession, PingImpalaHS2Service, GetResultSetMetadata,
CloseImpalaOperation (non dmls), CancelOperation, GetOperationStatus,
GetRuntimeProfile, GetExecSummary, GetLog
Retries were also added to the 'set all' query execution and subsequent
result fetch in the ImpalaHS2Client._open_session()
The retries are only supported for hs2-http protocol and enabled by
default. At most there are 3 retries for a failed rpc. There is a sleep
duration of 'n' seconds after nth retry.
Only rpcs that fail due to an error in the http transport are retried.
If an rpc fails because the server returned an error in the rpc
response, it is not retried.
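The retry policy above could be sketched as follows. This is one reading
of the description (sleep n seconds in connection with the nth retry),
not the shell's actual implementation; TransportError and
call_with_retries are hypothetical names, and the sleep function is
injectable only to make the sketch easy to test.

```python
import time


class TransportError(Exception):
  """Hypothetical marker for failures in the http transport."""


def call_with_retries(rpc, max_retries=3, sleep=time.sleep):
  """Call rpc(), retrying up to max_retries times on TransportError."""
  retries = 0
  while True:
    try:
      return rpc()
    except TransportError:
      if retries >= max_retries:
        raise
      retries += 1
      sleep(retries)  # 1s before the 1st retry, 2s before the 2nd, ...
```

Any other exception, such as an error carried in the rpc response,
propagates immediately without a retry.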
Improved error diagnostics by dumping the stack trace when
ImpalaShell._execute_stmt() gets an 'Unknown Exception'.
Testing:
- Added a custom_cluster test which injects fault into the http transport
and checks expected behavior from the various rpcs. Some of these tests
leave the session in an open state and so these tests are not suitable
for the e2e test framework, which has metric verifiers expecting related
metrics to be 0 at the end of the test.
- Manually tested real world scenarios with impala-shell client
communicating with an impala coordinator via a fault injecting istio mesh.
- Manually tested dropping connections on an nginx ingress gateway by sending
SIGTERM to all worker processes.
Change-Id: I0da9e9e8d34a340eaf763397cc095ff6260d65d5
Reviewed-on: http://gerrit.cloudera.org:8080/15378
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A few built-ins were changed in python 3 -- e.g., xrange became range,
ConfigParser became configparser, etc. We can redefine some of those
things in a single place, and import them from there as needed. Other
items may also be added as we go along.
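A compatibility module of this kind might look roughly like the sketch
below. The exact set of names and how the real module defines them are
not given in this message, so treat this as an illustration only.

```python
# Hypothetical compatibility shims, importable from a single place.
try:
  xrange = xrange  # Python 2: re-export the builtin
except NameError:
  xrange = range  # Python 3: range is already lazy

try:
  import configparser  # Python 3 module name
except ImportError:
  import ConfigParser as configparser  # Python 2 module name
```

Other modules would then import xrange and configparser from this one
module instead of branching on the Python version themselves.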
Change-Id: Ibd3d86df524666a98cbfa463756adac48bd1f8a3
Reviewed-on: http://gerrit.cloudera.org:8080/15514
Reviewed-by: David Knupp <dknupp@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In an effort to keep the work of reviewing the changes more manageable
with regard to making the impala-shell python3 compatible, I'm trying
to break the patches up into smaller chunks.
The first patch is the easiest one -- simply addressing the handful of
syntax issues that aren't python 3 compatible, namely changing the
print statements to function calls, changing the way we catch exceptions,
and adding a few simple branches to work around the removal of such
things as dict.iteritems().
We needed the print function imported from __future__ because it allows
us to pass in a file descriptor, e.g., sys.stderr.
Notably, there's nothing in this patch related to string/bytes/unicode
changes from python 2 to 3.
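The three kinds of syntax change mentioned above can be illustrated with
a short sketch. The function names are made up for the example, and
using dict.items() in place of a version branch is a simplification of
what the patch describes.

```python
import sys

# On Python 2 the module would need to start with:
#   from __future__ import print_function
# so that print accepts the file= keyword used below.


def warn(message, out=sys.stderr):
  """Print to a given file object, which the print statement cannot do
  with function-call syntax."""
  print(message, file=out)


def describe(options):
  """Format options without dict.iteritems(), which Python 3 removed."""
  return ["{0}={1}".format(k, v) for k, v in sorted(options.items())]


def parse_int(text):
  """Use 'except ... as e'; 'except ValueError, e' fails on Python 3."""
  try:
    return int(text)
  except ValueError as e:
    warn("bad integer: {0}".format(e))
    return None
```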
Change-Id: I9a515da01ef03d5936cb1a4d9e4bc6d105386b1d
Reviewed-on: http://gerrit.cloudera.org:8080/15487
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The 'Expect: 100-continue' http header allows http clients to send
only the headers for their request, get a confirmation back from the
server that the headers are valid, and only then send the body of the
request, avoiding the overhead of sending large requests that will
ultimately fail.
This patch adds support for this in the HS2 HTTP server by having
THttpServer look for the header and, if it's present and the request
is validated, return a '100 Continue' response before reading the
body of the request.
It also adds support for using this header on large requests sent by
impala-shell.
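On the client side, the idea can be sketched as below. The size
threshold and the function name are assumptions for illustration; this
message does not say what counts as a large request.

```python
def request_headers(body, min_size_for_expect=1024):
  """Build headers for a request body given as bytes.

  min_size_for_expect is a hypothetical threshold above which the
  'Expect: 100-continue' header is added.
  """
  headers = {"Content-Length": str(len(body))}
  if len(body) > min_size_for_expect:
    headers["Expect"] = "100-continue"
  return headers
```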
Testing:
- This case is covered by the existing test_large_sql, however that
test was previously broken and passing spuriously. This patch fixes
the test.
- Passed all other shell tests.
Change-Id: I4153968551acd58b25c7923c2ebf75ee29a7e76b
Reviewed-on: http://gerrit.cloudera.org:8080/15284
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
This is a preliminary patch that simply copies THttpClient.py from
Thrift master into Impala, changes imports as appropriate, and adjusts
the formatting from 4 spaces to 2 spaces.
This is to allow us to make modifications to THttpClient in future
patches. There are no functional changes in this patch.
Change-Id: I2662f1d4d455120442ef7c0c198685c07207aeed
Reviewed-on: http://gerrit.cloudera.org:8080/15283
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
This enables parallel plans with the join build in a
separate fragment and fixes all of the ensuing fallout.
After this change, mt_dop plans with joins have separate
build fragments. There is still a 1:1 relationship between
join nodes and builders, so the builders are only accessed
by the join node's thread after it is handed off. This lets
us defer the work required to make PhjBuilder and NljBuilder
safe to be shared between nodes.
Planner changes:
* Combined the parallel and distributed planning code paths.
* Misc fixes to generate reasonable thrift structures in the
query exec requests, i.e. containing the right nodes.
* Fixes to resource calculations for the separate build plans.
** Calculate separate join/build resource consumption.
** Simplified the resource estimation by calculating resource
consumption for each fragment separately, and assuming that
all fragments hit their peak resource consumption at the
same time. IMPALA-9255 is the follow-on to make the resource
estimation more accurate.
Scheduler changes:
* Various fixes to handle multiple TPlanExecInfos correctly,
which are generated by the planner for the different cohorts.
* Add logic to colocate build fragments with parent fragments.
Runtime filter changes:
* Build sinks now produce runtime filters, which required
planner and coordinator fixes to handle.
DataSink changes:
* Close the input plan tree before calling FlushFinal() to release
resources. This depends on Send() not holding onto references
to input batches, which was true except for NljBuilder. This
invariant is documented.
Join builder changes:
* Add a common base class for PhjBuilder and NljBuilder with
functions to handle synchronisation with the join node.
* Close plan tree earlier in FragmentInstanceState::Exec()
so that peak resource requirements are lower.
* The NLJ always copies input batches, so that it can close
its input tree.
JoinNode changes:
* Join node blocks waiting for build-side to be ready,
then eventually signals that it's done, allowing the builder
to be cleaned up.
* NLJ and PHJ nodes handle both the integrated builder and
the external builder. There is a 1:1 relationship between
the node and the builder, so we don't deal with thread safety
yet.
* Buffer reservations are transferred between the builder and join
node when running with the separate builder. This is not really
necessary right now, since it is all single-threaded, but will
be important for the shared broadcast.
- The builder transfers memory for probe buffers to the join node
at the end of each build phase.
- At end of each probe phase, reservation needs to be handed back
to builder (or released).
ExecSummary changes:
* The summary logic was modified to handle connecting fragments
via join builds. The logic is an extension of what was used
for exchanges.
Testing:
* Enable --unlock_mt_dop for end-to-end tests
* Migrate some tests to run as part of end-to-end tests instead of
custom cluster.
* Add mt_dop dimension to various end-to-end tests to provide
coverage of join queries, spill-to-disk and cancellation.
* Ran a single node TPC-H and TPC-DS stress test with mt_dop=0
and mt_dop=4.
Perf:
* Ran TPC-H scale factor 30 locally with mt_dop=0. No significant
change.
Change-Id: I4403c8e62d9c13854e7830602ee613f8efc80c58
Reviewed-on: http://gerrit.cloudera.org:8080/14859
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change, Impala Shell did not check HTTP return codes when
using the hs2-http protocol. The shell sends a request message
(e.g. send_CloseOperation) but the HTTP call to send this message may
fail. This will result in a failure when reading the reply (e.g. in
recv_CloseOperation) as there is no reply data to read. This will
typically result in an 'EOFError'.
In code that overrides THttpClient.flush(), check the HTTP code that is
returned after the HTTP call is made. If the code is not 1XX
(informational response) or 2XX (successful) then throw an RPCException.
This change does not contain any attempt to recover from HTTP failures,
but it does allow the failure to be detected and a message to be
printed.
In future it may be possible to retry after certain HTTP errors.
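The check described above amounts to something like the following
sketch. RPCException here is a local stand-in class and check_http_code
is a hypothetical helper; the real check lives in the code that
overrides THttpClient.flush().

```python
class RPCException(Exception):
  """Local stand-in for the shell's RPCException."""


def check_http_code(code, message=""):
  """Raise RPCException unless the code is 1XX or 2XX."""
  if code < 100 or code >= 300:
    raise RPCException("HTTP code {0}: {1}".format(code, message))
```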
Testing:
- Add a new test for impala-shell that tries to connect to an HTTP
server that always returns a 503 error. Check that an appropriate
error message is printed.
Change-Id: I3c105f4b8237b87695324d759ffff81821c08c43
Reviewed-on: http://gerrit.cloudera.org:8080/14924
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When mt_dop > 0, the summary is reporting the number of fragment
instances, instead of the number of hosts as the header would
imply.
This commit fixes the issue so the number of hosts will be shown
under the #Hosts column. The commit also adds an #Inst column
where the number of instances is shown (current behaviour).
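The difference between the two columns can be shown with a small
sketch. The input format, a list with one host name per fragment
instance, is an assumption made for the example and is not how the
summary is actually represented.

```python
def count_hosts_and_instances(instance_hosts):
  """Return (#Hosts, #Inst) for a list with one host name per instance."""
  return len(set(instance_hosts)), len(instance_hosts)
```

With mt_dop=2 on two hosts, for example, an operator would have four
instances but only two distinct hosts.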
Tests:
* Changed profile tests with mt_dop > 0.
* Updated benchmark tests and shell tests accordingly.
Change-Id: I3bdf9a06d9bd842b2397cd16c28294b6bec7af69
Reviewed-on: http://gerrit.cloudera.org:8080/14715
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The patch adds a set of scripts for converting the impala-shell
into a true distributable python package. The package can be
installed using familiar python commands, e.g.:
$ python setup.py (install|develop)
or
$ pip install -e /path/to/dist/dir
The entry point script, make_python_package.sh, will run as a
part of the standard sequence of steps that results from calling
buildall.sh, and will produce a gzipped tarball inside of
Impala/shell/dist as an artifact. Thereafter, make_python_package.sh
can be run manually any time.
The expectation is that an official maintainer would need to manually
upload official releases to the Python Package Index as appropriate.
Change-Id: Ib8c745bddddf6a16f0c039430152745a2f00e044
Reviewed-on: http://gerrit.cloudera.org:8080/14181
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Older python versions shipped ssl libraries that did not implement the
SSLContext class, which THttpClient relies on. This patch:
- Fails the shell gracefully when such a python version is used.
- Skips the http test dimension when running the test suite on a
machine that ships such a python version (centos 6).
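A minimal sketch of such a capability check is shown below. The helper
name and the injectable ssl_module parameter are illustrative
assumptions; how the shell actually reports the failure is not shown
here.

```python
import ssl


def ssl_context_supported(ssl_module=ssl):
  """Return True if the given ssl module provides SSLContext."""
  return hasattr(ssl_module, "SSLContext")
```

The shell could exit with a clear message when this returns False, and a
test suite could use the same check to skip the http dimension.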
Change-Id: I28846bde0b8bb8f787e6330cddf91645dba4160e
Reviewed-on: http://gerrit.cloudera.org:8080/14069
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>