Files
impala/docker/daemon_entrypoint.sh
stiga-huang 35375b3287 IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

The above cases will be focus in follow-up JIRAs. This patch addes the
initial implementation of UTF-8 aware case conversion functions.

--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the
bytes in strings are converted to wide characters using std::mbrtowc().
Each wide character (wchar_t) will then be converted using std::towupper
or std::towlower correspondingly. We then convert them back to multi
bytes using std::wcrtomb().

Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ascii characters, which will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching impala daemons
for better debugging. We will support customized locale in IMPALA-11080.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Reviewed-on: http://gerrit.cloudera.org:8080/17785
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-15 18:40:59 +00:00

73 lines
2.6 KiB
Bash
Executable File

#!/bin/bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# Wrapper script that runs the command-line provided as its arguments after
# setting up the environment required for the daemon processes to run.
#
# Supported environment variables:
# JAVA_TOOL_OPTIONS: additional options passed to any embedded JVMs. Can be used, e.g.
# to set a max heap size with JAVA_TOOL_OPTIONS="-Xmx4g".
export IMPALA_HOME=/opt/impala
# Add directories containing dynamic libraries required by the daemons that
# are not on the system library paths.
export LD_LIBRARY_PATH=/opt/impala/lib
LD_LIBRARY_PATH+=:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/
LD_LIBRARY_PATH+=:/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/
# Add directory with optional plugins that can be mounted for the container.
LD_LIBRARY_PATH+=:/opt/impala/lib/plugins
# Configs should be first on classpath
export CLASSPATH=/opt/impala/conf
# Append all of the jars in /opt/impala/lib to the classpath.
for jar in /opt/impala/lib/*.jar
do
CLASSPATH+=:$jar
done
echo "CLASSPATH: $CLASSPATH"
echo "LD_LIBRARY_PATH: $LD_LIBRARY_PATH"
# Default to 2GB heap. Allow overriding by externally-set JAVA_TOOL_OPTIONS.
export JAVA_TOOL_OPTIONS="-Xmx2g $JAVA_TOOL_OPTIONS"
# Various Hadoop libraries depend on having a username. If we're running under
# an unknown username, create an entry in the password file for this user.
if ! whoami ; then
export USER=${HADOOP_USER_NAME:-dummyuser}
echo "${USER}:x:$(id -u):$(id -g):,,,:/opt/impala:/bin/bash" >> /etc/passwd
whoami
cat /etc/passwd
fi
# IMPALA-10006: avoid cryptic failures if log dir isn't writable.
LOG_DIR=$IMPALA_HOME/logs
if [[ ! -w "$LOG_DIR" ]]; then
echo "$LOG_DIR is not writable"
exit 1
fi
# Set ulimit core file size 0.
ulimit -c 0
# Set a UTF-8 locale to enable upper/lower/initcap functions with UTF-8 mode.
export LC_ALL=C.UTF-8
exec "$@"