Files
impala/testdata/workloads/functional-query/queries/QueryTest
stiga-huang 4df03a31ec IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
Similar to the previous patch, this patch adds UTF-8 support in instr()
and locate() builtin functions so they can have consistent behaviors
with Hive's. These two string functions both have an optional argument
as position:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-20 13:28:30 +00:00
..
2021-07-06 18:35:30 +00:00
2021-02-15 22:25:41 +00:00