Mirror of https://github.com/apache/impala.git (synced 2025-12-19 18:12:08 -05:00)
IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Taking mask() as
an example: by default, it masks uppercase letters to 'X', lowercase
letters to 'x', and digits to 'n', and leaves other characters unmasked.
To mask all characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte by byte, which gives
results inconsistent with Hive when the string contains Unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded as 3 bytes in UTF-8, so the two
characters become six '*'. The result in Hive is '**', since there are
two Chinese characters.
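The difference between the two behaviors can be illustrated with a minimal Python sketch. This is not Impala code; the helper names mask_all_bytes and mask_all_code_points are hypothetical and exist only for illustration. The first counts UTF-8 bytes, the second counts Unicode code points.

```python
def mask_all_bytes(s: str, ch: str = "*") -> str:
    """Byte-wise masking: emit one mask character per UTF-8 byte."""
    return ch * len(s.encode("utf-8"))


def mask_all_code_points(s: str, ch: str = "*") -> str:
    """Code-point-wise masking: emit one mask character per code point.

    A Python str is a sequence of code points, so len(s) is the count we need.
    """
    return ch * len(s)


print(mask_all_bytes("中国"))        # six '*', one per byte
print(mask_all_code_points("中国"))  # two '*', one per character
```

For pure ASCII input both helpers return the same string, which is why the inconsistency only shows up with multi-byte characters.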
This patch provides masking behavior consistent with Hive for strings
in UTF-8 mode, i.e., when UTF8_MODE=true is set. In UTF-8 mode, the
masked unit of a string is a Unicode code point.
Implementation
 - Extend the existing MaskTransform function to deal with Unicode code
   points (represented by uint32_t).
 - Extend the existing GetFirstChar function to get the code point of
   the given mask characters in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Switch to using MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions that masks only other chars while keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').
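As a rough illustration of the per-code-point transform and the new -1 overload, here is a minimal Python sketch. It is not the C++ implementation: the function name mask_code_points and the KEEP constant are hypothetical, and the character classes are approximated with Python's str.isupper/islower/isdigit, which may classify some non-ASCII characters differently from Impala.

```python
KEEP = -1  # sentinel: leave this character class unmasked (assumed meaning of -1)


def mask_code_points(s, upper="X", lower="x", digit="n", other=KEEP):
    """Mask s one Unicode code point at a time.

    Iterating over a Python str yields code points, so multi-byte
    characters are treated as a single masked unit.
    """
    out = []
    for ch in s:
        if ch.isupper():
            repl = upper
        elif ch.islower():
            repl = lower
        elif ch.isdigit():
            repl = digit
        else:
            repl = other
        # A str never equals the int sentinel, so only -1 keeps the original.
        out.append(ch if repl == KEEP else repl)
    return "".join(out)


print(mask_code_points("Hello-123"))                     # default masks
print(mask_code_points("中国", "*", "*", "*", "*"))       # mask everything
print(mask_code_points("Ab1-中", KEEP, KEEP, KEEP, "X"))  # mask only "other"
```

The last call mirrors the mask({col}, -1, -1, -1, 'X') form: letters and digits pass through, and only the remaining characters are replaced.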
Tests
- Add BE tests in expr-test
- Add e2e tests in utf8-string-functions.test
Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Committed by: Impala Public Jenkins
Parent: 1e21aa6b96
Commit: 3850d49711
@@ -828,6 +828,8 @@ visible_functions = [
   [['mask_show_first_n'], 'STRING', ['STRING', 'INT'], 'impala::MaskFunctions::MaskShowFirstN'],
   [['mask_show_first_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING'],
       'impala::MaskFunctions::MaskShowFirstN'],
+  [['mask_show_first_n'], 'STRING', ['STRING', 'INT', 'INT', 'INT', 'INT', 'STRING'],
+      'impala::MaskFunctions::MaskShowFirstN'],
   [['mask_show_first_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING', 'INT'],
       'impala::MaskFunctions::MaskShowFirstN'],
   [['mask_show_first_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'INT', 'STRING'],
@@ -856,6 +858,8 @@ visible_functions = [
   [['mask_show_last_n'], 'STRING', ['STRING', 'INT'], 'impala::MaskFunctions::MaskShowLastN'],
   [['mask_show_last_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING'],
       'impala::MaskFunctions::MaskShowLastN'],
+  [['mask_show_last_n'], 'STRING', ['STRING', 'INT', 'INT', 'INT', 'INT', 'STRING'],
+      'impala::MaskFunctions::MaskShowLastN'],
   [['mask_show_last_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING', 'INT'],
       'impala::MaskFunctions::MaskShowLastN'],
   [['mask_show_last_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'INT', 'STRING'],
@@ -886,6 +890,8 @@ visible_functions = [
   [['mask_first_n'], 'STRING', ['STRING', 'INT'], 'impala::MaskFunctions::MaskFirstN'],
   [['mask_first_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING'],
       'impala::MaskFunctions::MaskFirstN'],
+  [['mask_first_n'], 'STRING', ['STRING', 'INT', 'INT', 'INT', 'INT', 'STRING'],
+      'impala::MaskFunctions::MaskFirstN'],
   [['mask_first_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING', 'INT'],
       'impala::MaskFunctions::MaskFirstN'],
   [['mask_first_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'INT', 'STRING'],
@@ -916,6 +922,8 @@ visible_functions = [
   [['mask_last_n'], 'STRING', ['STRING', 'INT'], 'impala::MaskFunctions::MaskLastN'],
   [['mask_last_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING'],
       'impala::MaskFunctions::MaskLastN'],
+  [['mask_last_n'], 'STRING', ['STRING', 'INT', 'INT', 'INT', 'INT', 'STRING'],
+      'impala::MaskFunctions::MaskLastN'],
   [['mask_last_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'STRING', 'INT'],
       'impala::MaskFunctions::MaskLastN'],
   [['mask_last_n'], 'STRING', ['STRING', 'INT', 'STRING', 'STRING', 'STRING', 'INT', 'STRING'],
@@ -945,6 +953,8 @@ visible_functions = [
   [['mask'], 'STRING', ['STRING'], 'impala::MaskFunctions::Mask'],
   [['mask'], 'STRING', ['STRING', 'STRING', 'STRING', 'STRING', 'STRING'],
       'impala::MaskFunctions::Mask'],
+  [['mask'], 'STRING', ['STRING', 'INT', 'INT', 'INT', 'STRING'],
+      'impala::MaskFunctions::Mask'],
   [['mask'], 'STRING', ['STRING', 'STRING', 'STRING', 'STRING', 'STRING', 'INT'],
       'impala::MaskFunctions::Mask'],
   [['mask'], 'STRING', ['STRING', 'STRING', 'STRING', 'STRING', 'INT', 'STRING'],