diff --git a/docs/impala.ditamap b/docs/impala.ditamap
index 8e1d4b9e6..6d046e132 100644
--- a/docs/impala.ditamap
+++ b/docs/impala.ditamap
@@ -256,6 +256,7 @@ under the License.
If you just need Hive-compatible string function behaviors on UTF-8 encoded strings, turn on
+ the query option UTF8_MODE. See more in
Conversions:
diff --git a/docs/topics/impala_utf8_mode.xml b/docs/topics/impala_utf8_mode.xml
new file mode 100644
index 000000000..80ab949c4
--- /dev/null
+++ b/docs/topics/impala_utf8_mode.xml
@@ -0,0 +1,55 @@
+
+
+
+
+
You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The query
+ option can be set globally or per session. Only queries that run with UTF8_MODE=true have
+ UTF-8 aware behavior. If UTF8_MODE is turned on globally, existing queries that
+ depend on the original binary behavior must explicitly set UTF8_MODE=false.
+
+Type: BOOLEAN
+Default: FALSE
+Added in: Impala 4.1
+ +
+
Impala has traditionally offered a single-byte binary character set for the STRING data type,
+ with character data encoded in the ASCII character set. Prior to this release, some Impala
+ functions were incompatible with Hive when applied to non-ASCII strings. For example, length()
+ in Impala used to return the length of the string in bytes, while length() in Hive returns its
+ length in UTF-8 characters. UTF-8 characters (code points) are encoded as variable-length byte
+ sequences (1 to 4 bytes), so the results differ when the string contains non-ASCII characters.
+ This release adds a query option that provides UTF-8 aware behavior for the Impala STRING type,
+ giving results consistent with Hive on UTF-8 strings.
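The difference between the two length() behaviors described above can be illustrated outside Impala. The following is a minimal sketch in plain Python, not Impala code; the helper names byte_length and code_point_length are hypothetical and exist only for this illustration. It compares the size of a string's UTF-8 encoding in bytes with its count of code points.

```python
# Illustration only: plain Python, not Impala or Hive code.
# The helper names below are hypothetical, chosen for this example.

def byte_length(s: str) -> int:
    """Length of the UTF-8 encoding of s in bytes (the binary view)."""
    return len(s.encode("utf-8"))


def code_point_length(s: str) -> int:
    """Number of Unicode code points in s (the UTF-8 aware view)."""
    return len(s)


samples = [
    "abc",            # ASCII only: 1 byte per character
    "caf\u00e9",      # U+00E9 takes 2 bytes in UTF-8
    "\u4f60\u597d",   # two CJK characters, 3 bytes each
    "\U0001F600",     # one emoji, 4 bytes
]

for text in samples:
    print(repr(text), byte_length(text), code_point_length(text))
```

For ASCII-only input the two counts agree, which is why the discrepancy only appears once a string contains non-ASCII characters.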
+UTF-8 support allows you to read and write UTF-8 from standard formats like Parquet and ORC, + thus improving interoperability with other engines that also support those standard formats.
+You can use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The
+ query option can be set globally or per session. Only queries that run with UTF8_MODE=true
+ have UTF-8 aware behavior.
+
+
Setting UTF8_MODE=true turns on the UTF-8 aware behavior of the following string
+ functions:
+
+