IMPALA-11325: Fix UnicodeDecodeError for shell file output

When the --output_file command-line option is passed to
impala-shell, the shell fails with a UnicodeDecodeError
if the output contains non-ASCII Unicode characters.

For example, this command:
impala-shell -B -q "select '引'" --output_file=output.txt
fails with:
UnicodeDecodeError : 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
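
The byte 0xe5 in the message is the first byte of the UTF-8 encoding of '引'. The following is a minimal Python 3 sketch, not code from impala-shell, that makes explicit the ASCII decode Python 2 performs implicitly when encode() is called on a byte string. The helper name encode_like_python2 is hypothetical.

```python
# Python 3 sketch: reproduce the error text by making Python 2's
# implicit ASCII decode explicit. Not taken from impala-shell.
utf8_bytes = "引".encode("utf-8")  # the three bytes e5 bc 95


def encode_like_python2(byte_string):
    """Hypothetical helper: mimic Python 2's str.encode('utf-8') on a byte
    string, which first decodes with the default ASCII codec."""
    return byte_string.decode("ascii").encode("utf-8")


try:
    encode_like_python2(utf8_bytes)
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```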

The failure comes from OutputStream::write() calling
encode('utf-8') on a string that is already UTF-8 encoded.
In Python 2, encoding a byte string first decodes it as ASCII,
which fails on any non-ASCII byte.
This change skips the encode('utf-8') call for Python 2.
Python 3 uses a str there and still needs the encode call.
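
The version-dependent behavior described above can be sketched as follows. This is an illustration under assumptions, not the actual patch: the helper name to_file_bytes is hypothetical, and it assumes the formatted output is a UTF-8 byte string on Python 2 and a str on Python 3.

```python
import sys


def to_file_bytes(formatted_output):
    """Hypothetical helper: return bytes suitable for a file opened in
    binary mode, encoding only where the input is still text."""
    if sys.version_info.major == 2:
        # Assumption: on Python 2 the formatted output is already a
        # UTF-8 encoded byte string, so encoding again would fail.
        return formatted_output
    # On Python 3 the formatted output is a str and must be encoded.
    return formatted_output.encode("utf-8")


encoded = to_file_bytes(u"引")
```

A type check such as isinstance(formatted_output, bytes) would be an alternative to the version check; the version check is used here only because it mirrors the wording of the commit message.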

This is a pragmatic fix rather than a complete one. More work
is needed to define clear contracts for the format() methods
and clear points of conversion to/from bytes.

Testing:
 - Ran shell tests with Python 2 and Python 3 on Ubuntu 18
 - Added a shell test that outputs a Unicode character
   to an output file. Without the fix, this test fails.

Change-Id: Ic40be3d530c2694465f7bd2edb0e0586ff0e1fba
Reviewed-on: http://gerrit.cloudera.org:8080/18576
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit is contained in:
Joe McDonnell
2022-05-31 16:14:55 -07:00
parent b3dec99ea1
commit ed0d9341d3
2 changed files with 26 additions and 1 deletions

@@ -1202,6 +1202,23 @@ class TestImpalaShell(ImpalaTestSuite):
      rows_from_file = [line.rstrip() for line in f]
    assert rows_from_stdout == rows_from_file

  def test_output_file_utf8(self, vector, tmp_file):
    """Test that writing UTF-8 output to a file using '--output_file' produces the
    same output as written to stdout."""
    # This is purely about UTF-8 output, so it doesn't need multiple rows.
    query = "select '引'"
    # Run the query normally and keep the stdout
    output = run_impala_shell_cmd(vector, ['-q', query, '-B', '--output_delimiter=;'])
    assert "Fetched 1 row(s)" in output.stderr
    rows_from_stdout = output.stdout.strip().split('\n')
    # Run the query with output sent to a file using '--output_file'.
    result = run_impala_shell_cmd(vector, ['-q', query, '-B', '--output_delimiter=;',
                                           '--output_file=%s' % tmp_file])
    assert "Fetched 1 row(s)" in result.stderr
    with open(tmp_file, "r") as f:
      rows_from_file = [line.rstrip() for line in f]
    assert rows_from_stdout == rows_from_file

  def test_http_socket_timeout(self, vector):
    """Test setting different http_socket_timeout_s values."""
    if (vector.get_value('strict_hs2_protocol') or