This is the main patch for making the the impala-shell cross-compatible with
python 2 and python 3. The goal is wind up with a version of the shell that will
pass python e2e tests irrepsective of the version of python used to launch the
shell, under the assumption that the test framework itself will continue to run
with python 2.7.x for the time being.
Notable changes for reviewers to consider:
- With regard to validating the patch, my assumption is that simply passing
the existing set of e2e shell tests is sufficient to confirm that the shell
is functioning properly. No new tests were added.
- A new pytest command line option was added in conftest.py to enable a user
to specify a path to an alternate impala-shell executable to test. It's
possible to use this to point to an instance of the impala-shell that was
installed as a standalone python package in a separate virtualenv.
Example usage:
USE_THRIFT11_GEN_PY=true impala-py.test --shell_executable=/<path to virtualenv>/bin/impala-shell -sv shell/test_shell_commandline.py
The target virtualenv may be based on either python3 or python2. However,
this has no effect on the version of python used to run the test framework,
which remains tied to python 2.7.x for the foreseeable future.
- The $IMPALA_HOME/bin/impala-shell.sh now sets up the impala-shell python
environment independenty from bin/set-pythonpath.sh. The default version
of thrift is thrift-0.11.0 (See IMPALA-9489).
- The wording of the header changed a bit to include the python version
used to run the shell.
Starting Impala Shell with no authentication using Python 3.7.5
Opened TCP connection to localhost:21000
...
OR
Starting Impala Shell with LDAP-based authentication using Python 2.7.12
Opened TCP connection to localhost:21000
...
- By far, the biggest hassle has been juggling str versus unicode versus
bytes data types. Python 2.x was fairly loose and inconsistent in
how it dealt with strings. As a quick demo of what I mean:
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> d = 'like a duck'
>>> d == str(d) == bytes(d) == unicode(d) == d.encode('utf-8') == d.decode('utf-8')
True
...and yet there are weird unexpected gotchas.
>>> d.decode('utf-8') == d.encode('utf-8')
True
>>> d.encode('utf-8') == bytearray(d, 'utf-8')
True
>>> d.decode('utf-8') == bytearray(d, 'utf-8') # fails the eq property?
False
As a result, this was inconsistency was reflected in the way we handled
strings in the impala-shell code, but things still just worked.
In python3, there's a much clearer distinction between strings and bytes, and
as such, much tighter type consistency is expected by standard libs like
subprocess, re, sqlparse, prettytable, etc., which are used throughout the
shell. Even simple calls that worked in python 2.x:
>>> import re
>>> re.findall('foo', b'foobar')
['foo']
...can throw exceptions in python 3.x:
>>> import re
>>> re.findall('foo', b'foobar')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data0/systest/venvs/py3/lib/python3.7/re.py", line 223, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
Exceptions like this resulted in a many, if not most shell tests failing
under python 3.
What ultimately seemed like a better approach was to try to weed out as many
existing spurious str.encode() and str.decode() calls as I could, and try to
implement what is has colloquially been called a "unicode sandwich" -- namely,
"bytes on the outside, unicode on the inside, encode/decode at the edges."
The primary spot in the shell where we call decode() now is when sanitising
input...
args = self.sanitise_input(args.decode('utf-8'))
...and also whenever a library like re required it. Similarly, str.encode()
is primarily used where a library like readline or csv requires is.
- PYTHONIOENCODING needs to be set to utf-8 to override the default setting for
python 2. Without this, piping or redirecting stdout results in unicode errors.
- from __future__ import unicode_literals was added throughout
Testing:
To test the changes, I ran the e2e shell tests the way we always do (against
the normal build tarball), and then I set up a python 3 virtual env with the
shell installed as a package, and manually ran the tests against that.
No effort has been made at this point to come up with a way to integrate
testing of the shell in a python3 environment into our automated test
processes.
Change-Id: Idb004d352fe230a890a6b6356496ba76c2fab615
Reviewed-on: http://gerrit.cloudera.org:8080/15524
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A few built-ins were changed in python 3 -- e.g., xrange became range,
ConfigParser became configparser, etc. We can redefine some of those
things in a single place, and import them from there as needed. Other
items may also be added as we go along.
Change-Id: Ibd3d86df524666a98cbfa463756adac48bd1f8a3
Reviewed-on: http://gerrit.cloudera.org:8080/15514
Reviewed-by: David Knupp <dknupp@cloudera.com>
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In an effort to keep the work of reviewing the changes more manageable
with regard to making the impala-shell python3 compatible, I'm trying
to break the patches up into smaller chunks.
The first patch is the easiest one -- simply addressing the handful of
syntax issues that aren't python 3 compatible, namely changing the
print statements to function calls, changing the way we catch exceptions,
and adding a few simple branches to work around the removal of such
things as dict.iteritems().
We needed the print function imported from __future__ because it allows
us to pass in a file descriptor, e.g., sys.stderr.
Notably, there's nothing in this patch related to string/bytes/unicode
changes from python 2 to 3.
Change-Id: I9a515da01ef03d5936cb1a4d9e4bc6d105386b1d
Reviewed-on: http://gerrit.cloudera.org:8080/15487
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Problem:
When assign --output_delimiter to invalid value, the validation of
the argument is done only after the query is running, ValueError is
raised in DelimitedOutputFormatter and caught in _exec_stmt in shell
Solution:
Add --output_delimiter option check before impala-shell initialization
Remove delimiter length check in DelimitedOutputFormatter
Testing:
tests/shell/test_shell_commandline.py passed
Example:
$ impala-shell.sh -B --output_delimiter '||' -q 'select 1,1,1'
Illegal delimiter ||, the delimiter must be a 1-character string.
Change-Id: I7ee2fccd305b104b3aff44c57659b6f14f2f4a05
Reviewed-on: http://gerrit.cloudera.org:8080/13690
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The bug is that PrettyOutputFormatter.format() returned a unicode
object, and Python cannot automatically write unicode objects to
output streams where there is no default encoding.
The fix is to convert to UTF-8 encoded in a regular string, which
can be output to any output device. This makes the output type
consistent with DelimitedOutputFormatter.format().
Based on code by Marcell Szabo.
Testing:
Added a basic test.
Played around in an interactive shell to make sure that unicode
characters still work in interactive mode.
Change-Id: I9de641ecf767a2feef3b9f48b344ef2d55e17a7f
Reviewed-on: http://gerrit.cloudera.org:8080/9928
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch adds a way to allow for dynamic progress reporting in the
shell. There are two new command line flags for the shell
--live_progress - will print the completed vs total # of scan ranges
--live_summary - prints an updated exec summary
In addition to the command line flags, these options can be set from
within the shell using:
set LIVE_SUMMARY=True
set LIVE_PROGRESS=True
The new options will be listed under shell options. Both reports will be
updated at most every second, for longer running queries it will be
adjusted to the time between two RPC calls to get the query status. To
provide this information in the ExecSummary, the Thrift structure for
the ExecSummary was extended to contain a progress indicator. The output
is printed to stderr and only available in interactive mode.
An example video is available here:
https://asciinema.org/a/5wi7ypckx4ol4ha1hlg3e3q1k
Change-Id: I70b2ab5fa74dc2ba5bc3b338ef13ddc6ccf367d2
Reviewed-on: http://gerrit.cloudera.org:8080/508
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>