impala

jprdonnelly/impala

mirror of https://github.com/apache/impala.git synced 2026-01-31 09:00:19 -05:00

Commit Graph


Author

SHA1

Message

Date

stiga-huang

e7839c4530

IMPALA-10416: Add raw string mode for testfiles to verify non-ascii results

Currently, the result section of the testfile is required to use
escaped strings. Take the following result section as an example:
  ---- RESULTS
  'Alice\nBob'
  'Alice\\nBob'
The first line is a string with a newline character. The second line is
a string with a '\' and an 'n' character. When comparing with the actual
query results, we need to escape the special characters in the actual
results, e.g. replace newline characters with '\n'. This is done by
invoking encode('unicode_escape') on the actual result strings. However,
the input type of this method is unicode instead of str. When calling it
on str vars, Python will implicitly convert the input vars to unicode
type. The default encoding, ascii, is used. This causes
UnicodeDecodeError when the str contains non-ascii bytes. To fix this,
this patch explicitly decodes the input str using 'utf-8' encoding.
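
A minimal Python 3 sketch of the explicit decode described above. The
helper name escape_result is hypothetical and is not taken from the Impala
test framework; the patch itself targeted Python 2, where str is a byte
string, so here the raw result is modeled as bytes.

```python
def escape_result(raw: bytes) -> str:
    """Return the unicode-escaped form of a UTF-8 encoded result value."""
    # Decode explicitly as UTF-8 instead of relying on an implicit ASCII decode.
    text = raw.decode("utf-8")
    # unicode_escape produces ASCII bytes; decode them back to text so the
    # value can be compared with the expected line from the .test file.
    return text.encode("unicode_escape").decode("ascii")


# A real newline becomes the two characters '\' and 'n'.
print(escape_result("Alice\nBob".encode("utf-8")))
# Non-ascii characters become \uXXXX escapes instead of raising an error.
print(escape_result("你好".encode("utf-8")))
```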

After fixing the logic of escaping the actual result strings, the next
problem is that it's painful to write unicode-escaped expected results.
Here is an example:
  ---- QUERY
  select "你好\n你好"
  ---- RESULTS
  '\u4f60\u597d\n\u4f60\u597d'
  ---- TYPES
  STRING
It's painful to manually translate the unicode characters.

This patch adds a new comment, RAW_STRING, for the result section to use
raw strings instead of unicode-escaped strings. Here is an example:
  ---- QUERY
  select "你好"
  ---- RESULTS: RAW_STRING
  '你好'
  ---- TYPES
  STRING
If the result contains special characters, it's recommended to use the
default string mode. If the only special characters are newlines, we can
use RAW_STRING together with the existing MULTI_LINE comment.
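
The difference between the two modes can be sketched as follows, in
Python 3 terms. The function render_actual and its raw_string flag are
assumptions made for illustration; they are not the names used by the
test framework.

```python
def render_actual(value: str, raw_string: bool = False) -> str:
    """Prepare an actual result value for comparison with the expected line."""
    if raw_string:
        # RAW_STRING mode: compare the text as written, with no escaping.
        return value
    # Default mode: compare against the unicode-escaped form.
    return value.encode("unicode_escape").decode("ascii")


# In raw mode the expected line can be written as '你好' directly.
assert render_actual("你好", raw_string=True) == "你好"
# In the default mode the expected line must use \uXXXX escapes.
assert render_actual("你好") == "\\u4f60\\u597d"
```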

This patch also fixes the issue that pytest fails to report assertion
failures if any of the compared str values contain non-ascii bytes
(IMPALA-10419). Because pytest reports them correctly when both compared
values are of unicode type, this patch explicitly converts the actual and
expected str values to unicode.

Test:
 - Add tests in special-strings.test for raw string mode and the escaped
   string mode (default).
 - Run test_exprs.py::TestExprs::test_special_strings locally.

Change-Id: I7cc2ea3e5849bd3d973f0cb91322633bcc0ffa4b
Reviewed-on: http://gerrit.cloudera.org:8080/16919
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2021-01-06 04:39:56 +00:00

Tim Armstrong

85166afa8a

IMPALA-6374: fix handling of commas in .test files

The .test file parser implemented an unconventional method for parsing
single-quoted strings in comma-separated value format. This didn't handle
trailing commas in the string correctly.

This commit switches to using a conventional method for parsing
comma-separated value format:
* Commas enclosed by single quotes are not treated as field separators
* Single quotes can be escaped within a string by doubling them.
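
The two rules above can be sketched as a small splitter. This is an
illustrative version only, not the parser from the commit: the name
split_row is hypothetical, and returning each field with its quotes
intact is an assumption based on the note that the quotes are
semantically important.

```python
def split_row(line: str) -> list:
    """Split a comma-separated row, keeping single quotes in each field."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "'":
            if in_quotes and i + 1 < len(line) and line[i + 1] == "'":
                # A doubled quote inside a quoted string is an escaped quote;
                # keep both characters and stay inside the string.
                current.append("''")
                i += 2
                continue
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == "," and not in_quotes:
            # Only commas outside quotes separate fields.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


# A trailing comma inside the quoted string does not split the field.
assert split_row("'trailing,',2") == ["'trailing,'", "2"]
# A doubled quote does not end the string.
assert split_row("'it''s',3") == ["'it''s'", "3"]
```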

I looked into using Python's csv module for this, but it wouldn't
work without modifying the test file format further because it
automatically discards the quotes during parsing, which are actually
semantically important in .test files. E.g. without the quotes we can't
distinguish between the literal string 'regex:...' and the regex
regex:....
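
The quote-discarding behaviour can be reproduced with the standard
library. The sample row is made up for illustration; only csv.reader and
its quotechar parameter are real API.

```python
import csv

# One quoted field (a literal string) and one unquoted field (a regex).
line = "'regex:a.*',regex:a.*"
row = next(csv.reader([line], quotechar="'"))

# The reader strips the quotes, so the two fields become indistinguishable.
print(row)
assert row[0] == row[1] == "regex:a.*"
```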

Testing:
Ran exhaustive tests and fixed .test files that required modifications.
Will rerun before merging.

Added a couple of tests to exercise edge cases in the test file parser.

Change-Id: I18ddcb0440490ddf8184be66d3681038a1615dd9
Reviewed-on: http://gerrit.cloudera.org:8080/11800
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>

2018-10-30 22:17:49 +00:00

2 Commits