mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
added a note about Glibc version and en_US.UTF-8 locale updated the notes in both topics Change-Id: I4d7a21c787c66868219c7bd64aa31f772de2f850 Reviewed-on: http://gerrit.cloudera.org:8080/18897 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
112 lines
5.3 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="utf_8">
  <title>UTF-8 Support</title>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Impala Functions"/>
      <data name="Category" value="utf_8"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>
  <conbody>
    <p>Impala has traditionally treated STRING data as a sequence of single-byte characters, with
      character data encoded in the ASCII character set. Prior to this release, some Impala
      functions were incompatible with Hive when applied to non-ASCII strings. For example,
      length() in Impala returned the number of bytes in the string, while length() in Hive
      returns the number of UTF-8 characters in the string. UTF-8 characters (code points) are
      encoded in a variable number of bytes (1 to 4), so the results differ when the string
      contains non-ASCII characters. This release adds a query option that makes the Impala
      STRING type UTF-8 aware, so that Impala behaves consistently with Hive on UTF-8
      strings.</p>
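    <p>The difference between counting bytes and counting characters can be illustrated outside
      Impala. The following is a minimal sketch in Python, not Impala code. It assumes that the
      length of the UTF-8 encoded bytes stands in for the byte-oriented length(), and that the
      length of the Python string (counted in code points) stands in for the UTF-8 aware
      length(). The sample string and variable names are only illustrative.</p>
    <codeblock>
```python
# Illustration only: this is Python, not Impala SQL.
text = "最快的SQL引擎"  # sample string mixing CJK and ASCII characters

# Byte-oriented length, analogous to a length() that counts bytes.
byte_length = len(text.encode("utf-8"))

# Character-oriented length, analogous to a UTF-8 aware length().
char_length = len(text)

# Number of bytes that each character occupies in UTF-8.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]

print(byte_length)     # 18
print(char_length)     # 8
print(bytes_per_char)  # [3, 3, 3, 1, 1, 1, 3, 3]
```
    </codeblock>
    <p>In this sketch the two counts differ (18 bytes versus 8 characters) because each of the
      five CJK characters occupies three bytes, while each ASCII letter occupies one.</p>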
    <p>UTF-8 support allows you to read and write UTF-8 strings in standard formats such as
      Parquet and ORC, which improves interoperability with other engines that also support
      those formats.</p>
</conbody>
  <concept id="turning_ON">
    <title>Turning ON the UTF-8 behavior</title>
    <conbody>
      <p>Use the new query option, UTF8_MODE, to turn the UTF-8 aware behavior on or off. The
        query option can be set globally or per session. Only queries that run with
        UTF8_MODE=true have UTF-8 aware behavior.</p>
      <p>
        <note>
          <ul id="ul_vs2_qrx_p5b">
            <li>If the query option UTF8_MODE is turned on globally, existing queries that depend
              on the original binary behavior must explicitly set UTF8_MODE=false.</li>
            <li>Deploy Impala daemons on nodes that use the same Glibc version, because different
              Glibc versions support different versions of the Unicode standard. Also ensure that
              the en_US.UTF-8 locale is installed on every node. Using different Glibc versions
              might result in inconsistent UTF-8 behavior when UTF8_MODE is set to true.</li>
          </ul>
        </note></p>
    </conbody>
  </concept>
  <concept id="list_string_functions">
    <title>List of STRING Functions</title>
    <conbody>
      <p>When turned on, the UTF8_MODE query option enables UTF-8 aware behavior in the following
        string functions:</p>
      <ul>
        <li>LENGTH(STRING a)<ul id="ul_jgr_x1l_gtb">
            <li>Returns the number of UTF-8 characters instead of the number of bytes.</li>
          </ul></li>
        <li>SUBSTR(STRING a, INT start [, INT len])</li>
        <li>SUBSTRING(STRING a, INT start [, INT len])<ul id="ul_tkh_x1l_gtb">
            <li>For both SUBSTR and SUBSTRING, the start position and length are counted in UTF-8
              characters instead of bytes.</li>
          </ul></li>
        <li>REVERSE(STRING a)<ul id="ul_o1d_jbl_gtb">
            <li>The unit of the operation is a UTF-8 character, i.e., the function does not
              reverse the bytes inside a UTF-8 character.<p>
                <note>The result of reverse("最快的SQL引擎") used to be "��敼�LQS��竿倜�" and is now
                  "擎引LQS的快最".</note></p></li>
          </ul></li>
        <li>INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])</li>
        <li>LOCATE(STRING substr, STRING str[, INT pos])<ul id="ul_y1p_sbl_gtb">
            <li>INSTR and LOCATE take an optional position argument, and their return values are
              also positions in the string. In UTF-8 mode, these positions are counted in UTF-8
              characters instead of bytes.</li>
          </ul></li>
        <li>Mask functions<ul id="ul_qmg_5bl_gtb">
            <li>The unit of the operation is a UTF-8 character, i.e., these functions do not mask
              the string byte by byte.</li>
          </ul></li>
        <li>UPPER, LOWER, and INITCAP<ul id="ul_x3c_wbl_gtb">
            <li>These functions recognize non-ASCII characters and transform them based on the
              current locale used by the Impala process.</li>
          </ul></li>
      </ul>
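      <p>The byte-based and character-based behaviors listed above can be compared with a short
        sketch. This is Python, not Impala code: slicing, reversing, and searching the UTF-8
        encoded bytes stand in for the byte-oriented behavior, and the same operations on the
        Python string stand in for the UTF-8 aware behavior. The helper show() and all variable
        names are hypothetical and exist only for this illustration.</p>
      <codeblock>
```python
# Illustration only: this is Python, not Impala SQL.
text = "最快的SQL引擎"
raw = text.encode("utf-8")


def show(data):
    """Decode bytes for display, replacing invalid sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")


# REVERSE: reversing bytes breaks multi-byte characters apart,
# while reversing code points keeps every character intact.
reversed_bytes = show(raw[::-1])
reversed_chars = text[::-1]

# SUBSTR(text, 1, 3): the first three bytes versus the first three characters.
substr_bytes = show(raw[0:3])
substr_chars = text[0:3]

# INSTR(text, 'SQL'): 1-based position counted in bytes versus in characters.
instr_bytes = raw.find("SQL".encode("utf-8")) + 1
instr_chars = text.find("SQL") + 1

print(reversed_bytes)
print(reversed_chars)             # 擎引LQS的快最
print(substr_bytes, substr_chars) # 最 最快的
print(instr_bytes, instr_chars)   # 10 4
```
      </codeblock>
      <p>In this sketch, three bytes cover only the first character, and "SQL" starts at byte 10
        but at character 4, which is why byte-based and character-based positions are not
        interchangeable for non-ASCII strings.</p>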
    </conbody>
  </concept>
  <concept id="limitations">
    <title>Limitations</title>
    <conbody>
      <ul id="ul_dhh_dcl_gtb">
        <li>Use the UTF8_MODE option only when needed, because the performance of UTF-8 mode is
          not optimized yet. It is an experimental feature.</li>
        <li>UTF-8 support for the CHAR and VARCHAR types is not implemented yet, so VARCHAR(N)
          still counts N in bytes instead of in UTF-8 characters.</li>
      </ul>
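      <p>The effect of a byte-based length limit can be sketched as follows. This is Python, not
        Impala code, and it rests on an assumption: that a byte-based VARCHAR(N) keeps the first
        N bytes of the value. The value of n and the variable names are hypothetical.</p>
      <codeblock>
```python
# Illustration only: this is Python, not Impala SQL.
text = "引擎SQL"
n = 4  # hypothetical limit, as in VARCHAR(4)

# Byte-based limit (assumed behavior): keeps 4 bytes, which cuts the
# second 3-byte character after its first byte.
by_bytes = text.encode("utf-8")[:n].decode("utf-8", errors="replace")

# Character-based limit: would keep 4 whole characters.
by_chars = text[:n]

print(by_bytes)
print(by_chars)  # 引擎SQ
```
      </codeblock>
      <p>Under that assumption, the byte-based result holds one complete character followed by an
        incomplete one, displayed here as the U+FFFD replacement character.</p>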
    </conbody>
  </concept>
</concept>