diff --git a/docs/impala.ditamap b/docs/impala.ditamap index 8e1d4b9e6..6d046e132 100644 --- a/docs/impala.ditamap +++ b/docs/impala.ditamap @@ -256,6 +256,7 @@ under the License. + @@ -292,6 +293,7 @@ under the License. + diff --git a/docs/topics/impala_string.xml b/docs/topics/impala_string.xml index fa3c31601..587666508 100644 --- a/docs/topics/impala_string.xml +++ b/docs/topics/impala_string.xml @@ -147,9 +147,7 @@ under the License.

    -
  • - String manipulation functions. -
  • +
  • CHAR/VARCHAR truncating/padding.
  • Comparison operators. @@ -171,6 +169,8 @@ under the License. those national language characteristics of string data, use logic on the application side.

    +

    If you just need Hive-compatible string function behaviors on UTF-8 encoded strings, turn on + the query option UTF8_MODE. See more in .

    Conversions:

    diff --git a/docs/topics/impala_utf8_mode.xml b/docs/topics/impala_utf8_mode.xml new file mode 100644 index 000000000..80ab949c4 --- /dev/null +++ b/docs/topics/impala_utf8_mode.xml @@ -0,0 +1,55 @@ + + + + + + UTF8_MODE Query Option + UTF8_MODE + + + + + + + + + + + + +

    + UTF8_MODE Query Option UTF-8 support allows string + functions to recognize UTF-8 characters, so that strings are processed in a way that is compatible with other + engines.

    +

    You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The query + option can be set globally or at the session level. Only queries with UTF8_MODE=true have + UTF-8 aware behavior. If the query option UTF8_MODE is turned on globally, existing queries that + depend on the original binary behavior need to explicitly set UTF8_MODE=false.

    + +

    Type:BOOLEAN

    +

    Default:FALSE

    +

    Added in:Impala 4.1

    +

    +

    + , + +

    +
    +
    \ No newline at end of file diff --git a/docs/topics/impala_utf_8.xml b/docs/topics/impala_utf_8.xml new file mode 100644 index 000000000..fac6bce88 --- /dev/null +++ b/docs/topics/impala_utf_8.xml @@ -0,0 +1,103 @@ + + + + + UTF-8 Support + + + + + + + + + + +

    Impala has traditionally offered a single-byte binary character set for the STRING data type, and + the character data is encoded in the ASCII character set. Prior to this release, Impala was + incompatible with Hive in some functions applied to non-ASCII strings. For example, length() in Impala + used to return the length of the string in bytes, while length() in Hive returns the length of + the string in UTF-8 characters. UTF-8 characters (code points) are encoded as a variable number of + bytes (1 to 4 bytes), so the results differ when there are non-ASCII characters in the string. This + release provides a UTF-8 aware behavior for the Impala STRING type, enabled by a query option, to get consistent behavior with + Hive on UTF-8 strings.
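
    The difference between the two length semantics can be illustrated outside Impala. The sketch below is plain Python, not Impala code; the variable names are hypothetical, and it only assumes that the sample string is stored as UTF-8.

```python
# Illustration only: plain Python, not Impala's implementation.
text = "最快的SQL引擎"

# Byte-oriented length: counts the UTF-8 encoded bytes.
byte_length = len(text.encode("utf-8"))

# Character-oriented length: counts code points.
char_length = len(text)

# Width in bytes of each character; ASCII letters take 1 byte,
# the CJK characters in this sample take 3 bytes each.
widths = [len(ch.encode("utf-8")) for ch in text]

print(byte_length, char_length, widths)
```

    For this sample the byte count is 18 and the character count is 8, which is the kind of discrepancy the paragraph above describes.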

    +

    UTF-8 support allows you to read and write UTF-8 from standard formats like Parquet and ORC, + thus improving interoperability with other engines that also support those standard formats.

    +
    + + Turning ON the UTF-8 behavior + +

    You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The + query option can be set globally or at the session level. Only queries with UTF8_MODE=true + have UTF-8 aware behavior.

    +

    + If the query option UTF8_MODE is turned on globally, existing queries that depend on the + original binary behavior need to explicitly set UTF8_MODE=false.

    +
    +
    + + List of STRING Functions + +

    The UTF8_MODE query option turns on the UTF-8 aware behavior of the following string + functions:

    +
      +
    • LENGTH(STRING a)
        +
      • returns the number of UTF-8 characters instead of bytes
      • +
    • +
    • SUBSTR(STRING a, INT start [, INT len])
    • +
    • SUBSTRING(STRING a, INT start [, INT len])
        +
      • the substring start position and length are counted in UTF-8 characters instead of + bytes
      • +
    • +
    • REVERSE(STRING a)
        +
      • the unit of the operation is a UTF-8 character, i.e. it won't reverse the bytes inside a UTF-8 + character.

        + The result of reverse("最快的SQL引擎") used to be "��敼�LQS��竿倜�" and is now + "擎引LQS的快最".

      • +
    • +
    • INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
    • +
    • LOCATE(STRING substr, STRING str[, INT pos])
        +
      • These functions have an optional position argument. The return values are also positions + in the string. In UTF-8 mode, these positions are counted by UTF-8 characters instead of + bytes.
      • +
    • +
    • mask functions
        +
      • The unit of the operation is a UTF-8 character, i.e. they won't mask the string + byte by byte.
      • +
    • +
    • upper/lower/initcap
        +
      • These functions recognize non-ASCII characters and transform them based on the + current locale used by the Impala process.
      • +
    • +
    +
    +
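
    The byte-based and character-based behaviors listed above can be contrasted with a small model. This is a hedged sketch in plain Python, not Impala's implementation: the helper names are hypothetical, only positive 1-based start positions are modeled, and invalid byte sequences are shown with the Unicode replacement character.

```python
# Illustration only: hypothetical helpers that model the two behaviors.

def substr_chars(s, start, length=None):
    """Substring with a 1-based start and length counted in characters."""
    begin = start - 1
    end = None if length is None else begin + length
    return s[begin:end]

def substr_bytes(s, start, length=None):
    """Substring with a 1-based start and length counted in UTF-8 bytes."""
    raw = s.encode("utf-8")
    begin = start - 1
    end = None if length is None else begin + length
    # A byte slice may cut a character apart; show damage as U+FFFD.
    return raw[begin:end].decode("utf-8", errors="replace")

def instr_chars(s, sub):
    """1-based character position of sub in s, or 0 if absent."""
    return s.find(sub) + 1

def instr_bytes(s, sub):
    """1-based byte position of sub in s, or 0 if absent."""
    return s.encode("utf-8").find(sub.encode("utf-8")) + 1

def reverse_chars(s):
    """Reverse character by character, keeping each character intact."""
    return s[::-1]

def reverse_bytes(s):
    """Reverse byte by byte, which scrambles multi-byte characters."""
    return s.encode("utf-8")[::-1].decode("utf-8", errors="replace")

text = "最快的SQL引擎"
print(substr_chars(text, 4, 3), substr_bytes(text, 4, 3))
print(instr_chars(text, "SQL"), instr_bytes(text, "SQL"))
print(reverse_chars(text))
print(reverse_bytes(text))
```

    In this model, counting by characters puts "SQL" at position 4, while counting by bytes puts it at position 10, because each of the three preceding characters occupies 3 bytes. The exact garbled text produced by a byte-wise reverse depends on how the invalid bytes are displayed.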
    + + Limitations + +
      +
    • Use the UTF8_MODE option only when needed. It is an experimental feature, and the performance of the UTF-8 + aware behavior is not optimized yet.
    • +
    • UTF-8 support for the CHAR and VARCHAR types is not implemented yet, so VARCHAR(N) still + returns N bytes instead of N UTF-8 characters.
    • +
    +
    +
    +
    \ No newline at end of file