As agreed in the JIRA discussions, the current PR extends the existing
TRIM functionality with support for the SQL-standard TRIM-FROM syntax:
TRIM({[LEADING / TRAILING / BOTH] | [STRING characters]} FROM expr).
It is implemented on top of the existing LTRIM / RTRIM / BTRIM family of
functions introduced earlier in IMPALA-6059 and extended for UTF-8 in
IMPALA-12718. It is also partly based on the abandoned PR
https://gerrit.cloudera.org/#/c/4474 and on the similar EXTRACT-FROM
functionality from
https://github.com/apache/impala/commit/543fa73f3a846f0e4527514c993cb0985912b06c.
Supported syntaxes:
Syntax #1 TRIM(<where> FROM <string>);
Syntax #2 TRIM(<charset> FROM <string>);
Syntax #3 TRIM(<where> <charset> FROM <string>);
"where": Case-insensitive trim direction. Valid options are "leading",
"trailing", and "both". "leading" means trimming characters from the
start; "trailing" means trimming characters from the end; "both" means
trimming characters from both sides. For Syntax #2, since no "where"
is specified, the option "both" is implied by default.
"charset": Case-sensitive characters to be removed. This argument is
regarded as a character set going to be removed. The occurrence order
of each character doesn't matter and duplicated instances of the same
character will be ignored. NULL argument implies " " (standard space)
by default. Empty argument ("" or '') makes TRIM return the string
untouched. For Syntax #1, since no "charset" is specified, it trims
" " (standard space) by default.
"string": Case-sensitive target string to trim. This argument can be
NULL.
TRIM-FROM honors the UTF8_MODE query option, just like the existing
TRIM().
UTF8_TRIM-FROM can be used to force UTF-8 mode regardless of the query
option.
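For illustration, a minimal sketch of the three syntaxes (the inputs
are made up; the expected results follow the semantics described
above):
TRIM(LEADING FROM '  hi  ');        -- expected: 'hi  '
TRIM('xy' FROM 'xyhixy');           -- expected: 'hi' (both sides, charset {x, y})
TRIM(TRAILING 'xy' FROM 'xyhixy');  -- expected: 'xyhi'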
Design Notes:
1. No BE changes. Since the existing LTRIM / RTRIM / BTRIM functions
fully cover all needed use cases, no backend logic is required. This
differs from the similar EXTRACT-FROM.
2. Syntax wrapper. The TrimFromExpr class was introduced as a syntax
wrapper around FunctionCallExpr, which instantiates one of the regular
LTRIM / RTRIM / BTRIM functions. TrimFromExpr's role is to maintain
the integrity of the "phantom" TRIM-FROM built-in function.
3. No TRIM keyword. Following EXTRACT-FROM, no "TRIM" keyword was
added to the language. Although a keyword would generally allow easier
and cleaner parsing, it would also restrict the token's usage in a
general context. However, LEADING/TRAILING/BOTH, which were previously
reserved words, are now added as keywords so they can be used without
escaping.
Change-Id: I3c4fa6d0d8d0684c4b6d8dac8fd531d205e4f7b4
Reviewed-on: http://gerrit.cloudera.org:8080/21825
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
Currently, the trim functions (BTRIM, LTRIM, RTRIM) cannot correctly
handle strings containing multi-byte UTF-8 characters. Multi-byte UTF-8
characters are interpreted as multiple single-byte characters, leading
to unexpected results.
This patch provides UTF-8 support for the trim functions, enabling them
to correctly handle multi-byte UTF-8 characters (when utf8_mode=true is
set). It also introduces a set of trim functions with the 'utf8_'
prefix, offering the same capability even when utf8_mode is not
enabled.
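For illustration, a minimal hedged sketch of the intended behavior (the
prefixed name is assumed to follow the utf8_* naming pattern, e.g.
utf8_btrim; results assume whole UTF-8 characters are trimmed):
set utf8_mode=true;
select btrim('中国中', '中');       -- expected: '国' (trims characters, not bytes)
select utf8_btrim('中国中', '中');  -- expected: '国' even when utf8_mode is not enabled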
Testing:
- Added new BE test case in ExprTest#Utf8Test
- Added new E2E test case in TestUtf8StringFunctions
Change-Id: I5cfaffd71009f16eae75910af835bd2a34410856
Reviewed-on: http://gerrit.cloudera.org:8080/20926
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In IMPALA-11492, ExprTest.Utf8MaskTest was failing on some
configurations because the en_US.UTF-8 locale was missing. Since the
Docker images don't contain en_US.UTF-8, they are subject
to the same bug. This was confirmed by adding test cases
to the test_utf8_strings.py end-to-end test and running it
in the dockerized tests.
This adds the appropriate language pack to the list of packages
installed for the Docker build.
Testing:
- This adds end-to-end tests to test_utf8_strings.py covering the
same cases that were failing in ExprTest.Utf8MaskTest. They
failed without the added language pack, and now succeed.
Change-Id: I353f257b3cb6d45f7d0a28f7d5319fdb457e6e3d
Reviewed-on: http://gerrit.cloudera.org:8080/19080
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only converted English alphabetic
characters. This patch adds support for dealing with Unicode characters.
There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
+-------+--------+---------+
| | Dotted | Dotless |
+-------+--------+---------+
| Upper | İ | I |
+-------+--------+---------+
| Lower | i | ı |
+-------+--------+---------+
2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".
3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.
The above cases will be the focus of follow-up JIRAs. This patch adds
the initial implementation of UTF-8 aware case conversion functions.
--------
Implementation:
In UTF-8 mode (turned on by SET UTF8_MODE=true), these functions
convert the bytes of a string to wide characters using std::mbrtowc().
Each wide character (wchar_t) is then converted using std::towupper()
or std::towlower() accordingly, and finally converted back to a
multi-byte sequence using std::wcrtomb().
Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ASCII characters and will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching Impala daemons
for better debugging. We will support customized locales in
IMPALA-11080.
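For illustration, a minimal sketch of the expected behavior (results
assume the daemon runs with a UTF-8 aware locale such as C.UTF-8, as
noted above; exact output depends on the locale):
set utf8_mode=true;
select upper('café');   -- expected: 'CAFÉ' (the non-ASCII 'é' is converted too)
select lower('ÁRVÍZ');  -- expected: 'árvíz'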
Test:
- Add BE unit tests and e2e tests.
Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Reviewed-on: http://gerrit.cloudera.org:8080/17785
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example: by default, it masks uppercase letters to 'X', lowercase
letters to 'x', digits to 'n', and leaves other characters unmasked.
To mask all characters to '*', we can use
mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte by byte, which gives
results inconsistent with Hive when the string contains Unicode
characters:
mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded as 3 bytes in UTF-8, so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.
This patch provides masking behavior consistent with Hive for strings
in UTF-8 mode, i.e. when UTF8_MODE=true is set. In UTF-8 mode, the
unit of masking is a Unicode code point.
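For example (the expected result mirrors the Hive behavior described
above):
set utf8_mode=true;
select mask('中国', '*', '*', '*', '*');  -- expected: '**', one mask char per code point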
Implementation
- Extends the existing MaskTransform function to deal with Unicode code
points (represented by uint32_t).
- Extends the existing GetFirstChar function to get the code point of
the given masking characters in UTF-8 mode.
- Implements a MaskSubStrUtf8 method as the core functionality.
- Switches to MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
- For better testing, this patch also adds an overload for all mask
functions that masks only the other chars while keeping the
upper/lower/digit chars unmasked, e.g. mask({col}, -1, -1, -1, 'X').
Tests
- Add BE tests in expr-test
- Add e2e tests in utf8-string-functions.test
Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Similar to the previous patch, this patch adds UTF-8 support to the
instr() and locate() builtin functions so they behave consistently with
Hive. Both string functions take an optional position argument:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are the position of the matched substring.
In UTF-8 mode (turned on by SET UTF8_MODE=true), these positions are
counted in UTF-8 characters instead of bytes.
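For illustration, a minimal sketch with a made-up input (positions are
expected to be counted in characters, as described above):
set utf8_mode=true;
select instr('最快的SQL', 'SQL');   -- expected: 4 (character position, not byte offset 10)
select locate('SQL', '最快的SQL');  -- expected: 4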
Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive, since Hive replaces those bytes with U+FFFD
(REPLACEMENT CHARACTER); e.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We could provide more error handling
behaviors, such as ignoring them or reporting errors. IMPALA-10761 will
focus on this.
Tests:
- Add BE unit tests and e2e tests
- Add random tests to make sure malformed UTF-8 characters won't crash
us.
Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A Unicode character can be encoded as 1-4 bytes in UTF-8. String
functions return undesired results when the input contains Unicode
characters, because we treat a string as a byte array. For instance,
length() returns the length in bytes, not in Unicode characters.
UTF-8 is the dominant Unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support to some string functions so they can have
UTF-8 aware behavior. For compatibility with older versions, a new
query option, UTF8_MODE, is added for turning the UTF-8 aware behavior
on and off. Currently, only length(), substring() and reverse() support
it. Support for other functions will be added in later patches.
String functions check the query option and switch to the desired
implementation, similar to how we use the decimal_v2 query option in
builtin functions.
For easy testing, the UTF-8 aware versions of the string functions are
also exposed as builtin functions (named utf8_*, e.g. utf8_length).
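For example, with the two-character string '中国' (6 bytes in UTF-8):
select length('中国');       -- 6 by default (byte length)
set utf8_mode=true;
select length('中国');       -- expected: 2 (character length)
select utf8_length('中国');  -- expected: 2, regardless of UTF8_MODE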
Tests:
- Add BE tests for utf8 functions.
- Add e2e tests for the UTF8_MODE query option.
Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>