This patch extends the 'summary' command of impala-shell to support
retrieving the summary of the original query attempt. The new syntax is
SUMMARY [ALL | LATEST | ORIGINAL]
If 'ALL' is specified, both the latest and original summaries are
printed. If 'LATEST' is specified, only the summary of the latest query
attempt is printed. If 'ORIGINAL' is specified, only the summary of the
original query attempt is printed. The default option is 'LATEST'.
Support for this has only been added to HS2 given that Beeswax is being
deprecated soon.
Tests:
- Add new tests in test_shell_interactive.py
Change-Id: I8605dd0eb2d3a2f64f154afb6c2fd34251c1fec2
Reviewed-on: http://gerrit.cloudera.org:8080/16502
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently, the impala-shell 'profile' command only returns the profile
for the most recent profile attempt. There is no way to get the original
query profile (the profile of the first query attempt that failed) from
the impala-shell.
This patch modifies TGetRuntimeProfileReq and TGetRuntimeProfileResp to
add support for returning both the original and retried profiles for a
retried query. When a query is retried, TGetRuntimeProfileResp currently
contains the profile for the most recent query attempt.
TGetRuntimeProfileReq has a new field called 'include_query_attempts'
and when it is set to true, the TGetRuntimeProfileResp will include all
failed profiles in a new field called failed_profiles /
failed_thrift_profiles.
impala-shell has been modified so the 'profile' command has a new set of
options. The syntax is now:
PROFILE [ALL | LATEST | ORIGINAL]
If 'ALL' is specified, both the latest and original profiles are
printed. If 'LATEST' is specified, only the latest profile is printed.
If 'ORIGINAL' is printed, only the original profile is printed. The
default behavior is equivalent to specifying 'LATEST' (which is the
current behavior before this patch as well).
Support for this has only been added to HS2 given that Beeswax is being
deprecated soon. The new 'profile' options have no affect when the
Beeswax protocol is used.
Most of the code change is in impala-hs2-server and impala-server; a lot
of the GetRuntimeProfile code has been re-factored.
Testing:
* Added new impala-shell tests
* Ran core tests
Change-Id: I89cee02947b311e7bf9c7274f47dfc7214c1bb65
Reviewed-on: http://gerrit.cloudera.org:8080/16406
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This removes LD_LIBRARY_PATH and LD_PRELOAD from the
developer's shell and cleans it up. With the preceding
change, toolchain utilities like clang can be run without
a special LD_LIBRARY_PATH.
This fixes a bug where libjvm.so was registered as a
static instead of a shared library, which adds it to the
RUNPATH variable in the binary, which provides a default
search location that can be overriden by LD_LIBRARY_PATH.
Impala binaries don't have the rpath baked in for some
libraries, including Impala-lzo, libgcc and libstdc++.
, so we still need to set LD_LIBRARY_PATH when running
those. That is solved with wrapper scripts that sets
the environment variables only when invoking those
binaries, e.g. starting a daemon or running a backend
test. I added three scripts because there were 3 sets
of environment variables. The scripts are:
* run-binary.sh: just sets LD_LIBRARY_PATH
* run-jvm-binary.sh: sets LD_LIBRARY_PATH and CLASSPATH
* start-daemon.sh: sets LD_LIBRARY_PATH and CLASSPATH and
kerberos-related environment variables.
The binaries, in almost all cases, work fine without
those tweaks, because libstdc++ and libgcc are picked
up along with libkuduclient.so from the toolchain (they
are in the same directory). I decided to leave good enough
alone here. run-binary.sh and friends can be used in
any remaining edge cases to run binaries.
An alternative to the 3 scripts would be to have an
uber-script that set all the variables, but I felt
that it was better to be specific about what
each binary needed. Cleaning the LD_LIBRARY_PATH
mess up has given me a distaste for scattershot
setting of environment variables. I am open to
revisiting this.
Testing:
* Ran tests on centos 7
* Manually tested that my dev env with
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu continued
to work (for now). All ubuntu 16.04 and 18.04 dev
envs that were set up with bootstrap_development.sh
will be in this state.
Change-Id: I61c83e6cca6debb87a12135e58ee501244bc9603
Reviewed-on: http://gerrit.cloudera.org:8080/14494
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds impala-shell support to connect to HiveServer2 HTTP endpoint.
Relies on toolchain change at https://gerrit.cloudera.org/#/c/13725/.
Use --protocol='hs2-http' to enable this behavior.
Example usages:
---------------
impala-shell --protocol='hs2-http' (No auth)
impala-shell --protocol='hs2-http' --ldap -u..... (PLAIN auth)
impala-shell --protocol-'hs2-http' --ssl --ca_cert... (TLS)
impala-shell --protocol='hs2-http' --ldap --ssl --ca_cert... (LDAP +
TLS)
Limitations:
-----------
- Does not support Kerberos (-k) due to lack ot SPNEGO support.
Testing:
--------
- Parameterized existing shell tests to support this combination.
- Added shell test coverage for LDAP auth.
Change-Id: I8323950857dfe1c1dfd5377fde79f87bc2ce9534
Reviewed-on: http://gerrit.cloudera.org:8080/13746
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
HS2 is added as an option via --protocol=hs2. The user-visible
differences in behaviour are minimal. Beeswax is still the
default and can be explicitly enabled via --protocol=beeswax
but will be deprecated. The default is unchanged because
changing the default could break certain workflows, e.g.
those that explicitly specify the port with -i or deployments
that hit --fe_service_threads for HS2 and somehow rely on
impala-shell not contributing to that limit. For most
workflows the change is transparent and we should change
the default in a major version change.
This support requires Impala-specific extensions to
the HS2 interface, similar to the existing extensions
to Beeswax. Thus the HS2 shell is only
forwards-compatible with newer Impala versions.
I considered trying to gracefully degrade when the
new extensions weren't present, but it didn't seem to be
worth the ongoing testing effort.
Differences between HS2 and Beeswax are abstracted into
ImpalaClient subclasses.
Here are the changes required to make it work:
* Switch to TBinaryProtocolAccelerated to avoid perf
regression. The HS2 protocol requires decoding
more primitive values (because its not a string-per-row),
which was slow with the pure python implementation of
TBinaryProtocol.
* Added bitarray module to efficiently unpack null indicators
* Minimise invasiveness of changes by transposing and stringifying
the columnar results into rows in impala_client.py. The transposition
needs to happen before display anyway.
* Add PingImpalaHS2Service() to get back version string and webserver
address.
* Add CloseImpalaOperation() extension to return DML row counts. This
possibly addresses IMPALA-1789, although we need to confirm that
this is a sufficient solution.
* Add is_closed member to query handles to avoid shell independently
tracking whether the query handle was closed or not.
* Include query status in HS2 log to match beeswax.
* HS2 GetLog() command now includes query status error message for
consistency with beeswax.
* "set"/"set all" uses the client requests options, not the session
default. This captures the effective value of TIMEZONE, which
was previously missing. This also requires test changes where
the tests set non-default values, e.g. for ABORT_ON_ERROR.
* "set all" on the server side returns REMOVED query options - the
shell needs to know these so it can correctly ignore them.
* Clean up self.orig_cmd/self.last_leading comment argument
passing to avoid implicit parameter passing through multiple
function calls.
* Clean up argument handling in shell tests to consistently pass
around lists of arguments instead of strings that are subject
to shell tokenisation rules.
* Consistently close connections in the shell to avoid leaking
HS2 sessions. This is enforced by making ImpalaShell a context
manager and also eliminating all sys.exit() calls that would
bypass the explicit connection closing.
Testing:
* Shell tests can run with both protocols
* Add tests for formatting of all types and NULL values
* Added testing for floating point output formatting, which does
change as a result of switching to server-side vs client-side
formatting.
* Verified that newly-added tests were actually going through HS2
by disabling hs2 on the minicluster and running tests.
* Add checks to test_verify_metrics.py to ensure that no sessions
are left open at the end of tests.
Performance:
Baseline from beeswax shell for large extract is as follows:
$ time impala-shell.sh -B -q 'select * from tpch_parquet.orders' > /dev/null
real 0m6.708s
user 0m5.132s
sys 0m0.204s
After this change it is somewhat slower, but we generally don't consider
bulk extract performance through the shell to be perf-critical:
real 0m7.625s
user 0m6.436s
sys 0m0.256s
Change-Id: I6d5cc83d545aacc659523f29b1d6feed672e2a12
Reviewed-on: http://gerrit.cloudera.org:8080/12884
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This sets up the tests to be extensible to test shell
in both beeswax and HS2 modes.
Testing:
* Add test dimension containing only beeswax in preparation
for HS2 dimension.
* Factor out hardcoded ports.
* Add tests for formatting of all types and NULL values.
* Merge date shell test into general type tests.
* Added testing for floating point output formatting, which does
change as a result of switching to server-side vs client-side
formatting.
* Use unique_database for tests that create tables.
Change-Id: Ibe5ab7f4817e690b7d3be08d71f8f14364b84412
Reviewed-on: http://gerrit.cloudera.org:8080/13083
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the queuing status, that is, whether the query was
queued and what was the latest queuing reason, to the ExecSummary.
Also added changes to allow impala-shell to expose this status by
pulling it out from the ExecSummary when either live_summary or
live_progress is set to true.
Testing:
Added custom cluster tests.
Change-Id: Ibde447b01559b9f0f3e970d4fa10f6ee4064bd49
Reviewed-on: http://gerrit.cloudera.org:8080/11816
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>