There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously, they only converted English alphabetic
characters. This patch adds support for Unicode characters.
There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has four letter "I"s, while English has only two: a lowercase
dotted "i" and an uppercase dotless "I". Turkish has lowercase and
uppercase forms of both the dotted and the dotless I, so simply
converting "i" to "I" for upper case is wrong in Turkish:
+-------+--------+---------+
| | Dotted | Dotless |
+-------+--------+---------+
| Upper | İ | I |
+-------+--------+---------+
| Lower | i | ı |
+-------+--------+---------+
2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".
3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.
The above cases will be the focus of follow-up JIRAs. This patch adds
the initial implementation of UTF-8 aware case conversion functions.
--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true), these functions convert
the bytes of the string to wide characters using std::mbrtowc(). Each
wide character (wchar_t) is then converted using std::towupper() or
std::towlower() accordingly, and the results are converted back to
multi-byte sequences using std::wcrtomb().
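A minimal sketch of this conversion loop (an illustration only, not
Impala's actual code; Utf8ToUpperSketch is a hypothetical name) could
look as follows, assuming a UTF-8 aware locale has already been set:

  #include <climits>   // MB_LEN_MAX
  #include <cwchar>    // std::mbrtowc, std::wcrtomb
  #include <cwctype>   // std::towupper
  #include <string>

  // Upper-cases a UTF-8 string one code point at a time: decode with
  // std::mbrtowc(), convert with std::towupper(), re-encode with
  // std::wcrtomb(). Behavior depends on the current LC_CTYPE locale.
  std::string Utf8ToUpperSketch(const std::string& in) {
    std::string out;
    std::mbstate_t state{};
    const char* p = in.data();
    size_t remaining = in.size();
    while (remaining > 0) {
      wchar_t wc;
      size_t len = std::mbrtowc(&wc, p, remaining, &state);
      if (len == 0 || len > remaining) {
        // Embedded '\0', malformed or incomplete sequence: copy one byte as-is.
        out.push_back(*p++);
        --remaining;
        state = std::mbstate_t{};
        continue;
      }
      wchar_t upper =
          static_cast<wchar_t>(std::towupper(static_cast<wint_t>(wc)));
      char buf[MB_LEN_MAX];
      std::mbstate_t out_state{};
      size_t out_len = std::wcrtomb(buf, upper, &out_state);
      if (out_len == static_cast<size_t>(-1)) {
        out.append(p, len);  // Not representable: keep the original bytes.
      } else {
        out.append(buf, out_len);
      }
      p += len;
      remaining -= len;
    }
    return out;
  }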
Note that these builtins are locale-aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ASCII characters and will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching Impala daemons
for better debugging. We will support customized locales in IMPALA-11080.
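The locale dependence can be seen with a few lines of standalone C++
(a demonstration only, not Impala code):

  #include <clocale>
  #include <cwctype>
  #include <iostream>

  int main() {
    // Adopt whatever locale the environment provides (LC_ALL, LANG, ...).
    const char* loc = std::setlocale(LC_ALL, "");
    std::cout << "Locale: " << (loc ? loc : "unknown") << "\n";
    // U+00FC is 'ü'. Under "C.UTF-8", std::towupper() maps it to U+00DC
    // ('Ü'); under plain "C" it is returned unchanged.
    std::cout << std::hex << std::towupper(static_cast<wint_t>(0x00FC)) << "\n";
    return 0;
  }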
Test:
- Add BE unit tests and e2e tests.
Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Reviewed-on: http://gerrit.cloudera.org:8080/17785
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), and mask_show_last_n(). Take mask()
as an example: by default, it masks uppercase letters to 'X', lowercase
letters to 'x', and digits to 'n', leaving other characters unmasked.
To mask all characters to '*', we can use
mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte by byte, which gives
results inconsistent with Hive when the string contains Unicode
characters:
mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded as 3 bytes in UTF-8, so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.
This patch provides masking behavior consistent with Hive for strings
in UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode, the unit of
masking is a Unicode code point.
Implementation
- Extends the existing MaskTransform function to deal with Unicode code
points (represented by uint32_t).
- Extends the existing GetFirstChar function to get the code point of
the given masked characters in UTF-8 mode.
- Implements a MaskSubStrUtf8 method as the core functionality (a
simplified sketch follows this list).
- Switches to MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
- For better testing, this patch also adds an overload for all mask
functions that masks only the other chars while keeping the
upper/lower/digit chars unmasked, e.g. mask({col}, -1, -1, -1, 'X').
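As a rough illustration of the per-code-point approach (a hypothetical
sketch, not the actual MaskSubStrUtf8 implementation), the following
function masks a UTF-8 string roughly the way mask(col, 'X', 'x', 'n', '*')
would: one masked character per code point, with malformed bytes handled
one byte at a time:

  #include <cstdint>
  #include <string>

  // Number of bytes in the UTF-8 sequence starting with 'lead', or 1 for
  // bytes that are not valid lead bytes (so they are masked byte by byte).
  static int Utf8CharLen(uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
  }

  // Simplified masking: upper -> 'X', lower -> 'x', digit -> 'n', anything
  // else (including multi-byte code points) -> '*'. The real builtins take
  // these replacement characters as arguments.
  std::string MaskUtf8Sketch(const std::string& in) {
    std::string out;
    size_t i = 0;
    while (i < in.size()) {
      uint8_t b = static_cast<uint8_t>(in[i]);
      size_t len = static_cast<size_t>(Utf8CharLen(b));
      if (i + len > in.size()) len = 1;  // Truncated sequence: one byte.
      if (b >= 'A' && b <= 'Z') {
        out.push_back('X');
      } else if (b >= 'a' && b <= 'z') {
        out.push_back('x');
      } else if (b >= '0' && b <= '9') {
        out.push_back('n');
      } else {
        out.push_back('*');  // e.g. '中国' becomes '**', matching Hive.
      }
      i += len;
    }
    return out;
  }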
Tests
- Add BE tests in expr-test
- Add e2e tests in utf8-string-functions.test
Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Similar to the previous patch, this patch adds UTF-8 support in the
instr() and locate() builtin functions so that their behavior is
consistent with Hive's. Both string functions take an optional position
argument:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are the positions of the matched substring.
In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted in UTF-8 characters instead of bytes.
Error handling:
Malformed UTF-8 bytes are counted as one character per byte. This is
consistent with Hive, since Hive replaces those bytes with U+FFFD
(REPLACEMENT CHARACTER); e.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We could provide more error-handling
behaviors, such as ignoring malformed bytes or reporting errors;
IMPALA-10761 will focus on this.
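For illustration (a hypothetical sketch, not the actual implementation),
a UTF-8 aware instr() without the optional arguments could translate the
byte offset of a match into a character position like this, counting each
byte of a malformed sequence as one character:

  #include <cstdint>
  #include <string>

  // Counts UTF-8 characters in str[0, byte_offset). A valid lead byte
  // advances by the length of its sequence; any other byte counts as a
  // single character.
  static int64_t Utf8CharCount(const std::string& str, size_t byte_offset) {
    int64_t chars = 0;
    size_t i = 0;
    while (i < byte_offset) {
      uint8_t b = static_cast<uint8_t>(str[i]);
      size_t len = 1;
      if ((b & 0xE0) == 0xC0) len = 2;
      else if ((b & 0xF0) == 0xE0) len = 3;
      else if ((b & 0xF8) == 0xF0) len = 4;
      i += len;
      ++chars;
    }
    return chars;
  }

  // UTF-8 aware instr(str, substr): the byte offset found by
  // std::string::find() is converted into a 1-based character position.
  int64_t InstrUtf8Sketch(const std::string& str, const std::string& substr) {
    size_t pos = str.find(substr);
    if (pos == std::string::npos) return 0;  // Not found: instr() returns 0.
    return Utf8CharCount(str, pos) + 1;      // Positions are 1-based.
  }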
Tests:
- Add BE unit tests and e2e tests
- Add random tests to make sure malformed UTF-8 characters won't crash
us.
Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A Unicode character can be encoded as 1-4 bytes in UTF-8. String
functions return undesired results when the input contains Unicode
characters, because we treat a string as a byte array. For instance,
length() returns the length in bytes, not in Unicode characters.
UTF-8 is the dominant Unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support to some string functions so they can have
UTF-8 aware behavior. For compatibility with older versions, a new
query option, UTF8_MODE, is added for turning the UTF-8 aware behavior
on and off. Currently, only length(), substring() and reverse() support
it. Support for other functions will be added in later patches.
String functions check the query option and switch to the desired
implementation, similar to how the decimal_v2 query option is used in
builtin functions.
For easy testing, the UTF-8 aware versions of the string functions are
also exposed as builtin functions (named utf8_*, e.g. utf8_length).
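As an illustration of the UTF-8 aware behavior (a sketch, not the actual
implementation), a utf8_length-style function only needs to count the
bytes that start a character:

  #include <cstdint>
  #include <string>

  // Counts Unicode code points in a UTF-8 string by skipping continuation
  // bytes (10xxxxxx). Malformed lead bytes are simply counted as one
  // character each.
  int64_t Utf8LengthSketch(const std::string& s) {
    int64_t len = 0;
    for (char c : s) {
      if ((static_cast<uint8_t>(c) & 0xC0) != 0x80) ++len;
    }
    return len;
  }

With UTF8_MODE=true, length('中国') returns 2 instead of 6, which is what
the sketch above computes for the corresponding byte sequence.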
Tests:
- Add BE tests for utf8 functions.
- Add e2e tests for the UTF8_MODE query option.
Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>