mirror of
https://github.com/apache/impala.git
synced 2026-02-03 00:00:40 -05:00
A Unicode character can be encoded into 1-4 bytes in UTF-8. String functions return undesired results when the input contains Unicode characters, because a string is treated as a byte array. For instance, length() returns the length in bytes, not in Unicode characters.

UTF-8 is the dominant Unicode encoding used in the Hadoop ecosystem. This patch adds UTF-8 support to some string functions so they can behave in a UTF-8 aware way. For compatibility with older versions, a new query option, UTF8_MODE, is added to turn the UTF-8 aware behavior on or off. Currently, only length(), substring() and reverse() support it. Support for other functions will be added in later patches.

String functions check the query option and switch to the desired implementation. This is similar to how the decimal_v2 query option is used in builtin functions. For easy testing, the UTF-8 aware versions of the string functions are also exposed as builtin functions (named utf8_*, e.g. utf8_length).

Tests:
- Add BE tests for utf8 functions.
- Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
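The byte-versus-character distinction described in the commit message can be illustrated with a small Python 3 sketch. It is not Impala's implementation: the helper names `byte_length` and `utf8_char_count` are hypothetical, and the counting approach (skipping UTF-8 continuation bytes) is only one common way to count characters in a byte array.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; these helpers are hypothetical and are not part of Impala.


def byte_length(text):
  """Return the length of text in bytes once encoded as UTF-8."""
  return len(text.encode("utf-8"))


def utf8_char_count(data):
  """Count characters in UTF-8 bytes by skipping continuation bytes (10xxxxxx)."""
  return sum(1 for b in bytearray(data) if (b & 0xC0) != 0x80)


# One character each of 1, 2, 3 and 4 encoded bytes.
sample = u"a\u00e9\u20ac\U0001F600"
encoded = sample.encode("utf-8")

print(byte_length(sample))       # 10 (1 + 2 + 3 + 4 bytes)
print(utf8_char_count(encoded))  # 4 characters
```

A byte-oriented length() would report 10 for this input, while a UTF-8 aware one would report 4, which is the kind of difference the UTF8_MODE query option is meant to control.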
43 lines
1.8 KiB
Python
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

from tests.common.impala_test_suite import ImpalaTestSuite
from tests.common.test_dimensions import (create_exec_option_dimension,
    create_client_protocol_dimension, hs2_parquet_constraint)


class TestUtf8StringFunctions(ImpalaTestSuite):
  @classmethod
  def get_workload(cls):
    return 'functional-query'

  @classmethod
  def add_test_dimensions(cls):
    super(TestUtf8StringFunctions, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_dimension(
      create_exec_option_dimension(disable_codegen_options=[False, True]))
    cls.ImpalaTestMatrix.add_constraint(lambda v:
      v.get_value('table_format').file_format in ['parquet'] and
      v.get_value('table_format').compression_codec in ['none'])
    # Run these queries through both beeswax and HS2 to get coverage of CHAR/VARCHAR
    # returned via both protocols.
    cls.ImpalaTestMatrix.add_dimension(create_client_protocol_dimension())
    cls.ImpalaTestMatrix.add_constraint(hs2_parquet_constraint)

  def test_string_functions(self, vector):
    self.run_test_case('QueryTest/utf8-string-functions', vector)