Files
impala/common/function-registry
stiga-huang 0936384271 IMPALA-9010: Add builtin mask functions
There're 6 builtin GenericUDFs for column masking in Hive:
  mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_hash(value)
  mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar,
      dayValue, monthValue, yearValue)

Description of the parameters:
   value      - value to mask. Supported types: TINYINT, SMALLINT, INT,
                BIGINT, STRING, VARCHAR, CHAR, DATE(only for mask()).
   charCount  - number of characters. Default value: 4
   upperChar  - character to replace upper-case characters with. Specify
                -1 to retain original character. Default value: 'X'
   lowerChar  - character to replace lower-case characters with. Specify
                -1 to retain original character. Default value: 'x'
   digitChar  - character to replace digit characters with. Specify -1
                to retain original character. Default value: 'n'
   otherChar  - character to replace all other characters with. Specify
                -1 to retain original character. Default value: -1
   numberChar - character to replace digits in a number with. Valid
                values: 0-9. Default value: '1'
   dayValue   - value to replace day field in a date with.
                Specify -1 to retain original value. Valid values: 1-31.
                Default value: 1
   monthValue - value to replace month field in a date with. Specify -1
                to retain original value. Valid values: 0-11. Default
                value: 0
   yearValue  - value to replace year field in a date with. Specify -1
                to retain original value. Default value: 1

In Hive, these functions accept variable length of arguments in
non-restricted types:
   mask_show_first_n(val)
   mask_show_first_n(val, 8)
   mask_show_first_n(val, 8, 'X', 'x', 'n')
   mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2)
   mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9')
The arguments of upperChar, lowerChar, digitChar, otherChar and
numberChar can be in string or numeric types.

Impala doesn't support Hive GenericUDFs, so we are lack of these mask
functions to support Ranger column masking policies. On the other hand,
we want the masking functions to be evaluated in the C++ builtin logic
rather than calling out to java UDFs for performance. This patch
introduces our builtin implementation of them.

We currently don't have a corresponding framework for GenericUDF
(IMPALA-9271), so we implement these by overloads. However, it may
requires hundreds of overloads to cover all possible combinations. We
just implement some important overloads, including
 - those used by Ranger default masking policies,
 - those with simple arguments and may be useful for users,
 - an overload with all arguments in int type for full functionality.
   Char argument need to be converted to their ASCII value.

Tests:
 - Add BE tests in expr-test

Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664
Reviewed-on: http://gerrit.cloudera.org:8080/14963
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-17 15:34:34 +00:00
..