IMPALA-10319: Support arbitrary encodings on Text files

As proposed in Jira, this implements decoding and encoding of text
buffers for Impala/Hive text tables. When a table has the
'serialization.encoding' property set, Impala should, like Hive, encode
inserted data into the specified charset before saving it to a text
file. The opposite decoding operation should be performed when reading
data buffers from text files. Both operations use the
boost::locale::conv library.
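
The round trip described above can be illustrated with a minimal Python sketch. It relies on Python's built-in codecs rather than boost::locale::conv, which the patch uses in C++. The helper names `encode_for_write` and `decode_after_read` and the GBK charset are assumptions chosen for illustration, not names taken from the patch.

```python
def encode_for_write(text: str, charset: str) -> bytes:
    """Encode row text into the table's charset before writing (hypothetical helper)."""
    return text.encode(charset)


def decode_after_read(buf: bytes, charset: str) -> str:
    """Decode a buffer read from a text file back into Unicode (hypothetical helper)."""
    return buf.decode(charset)


charset = "gbk"  # assumed value of the 'serialization.encoding' property
row = "你好,world"

stored = encode_for_write(row, charset)        # bytes as they would sit in the file
restored = decode_after_read(stored, charset)  # text as the reader would see it
```

The stored bytes differ from the UTF-8 form of the same text, which is why reading such a file without decoding would produce garbled strings.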

Since Hive doesn't encode line delimiters, charsets that would have
delimiters stored differently from ASCII are not allowed.
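
One way to picture this restriction: a charset is acceptable only if it encodes the delimiter characters to the same bytes as ASCII does. The check below is a hypothetical Python sketch of that idea, not the validation the patch actually performs. The function name `keeps_ascii_delimiters` and the default delimiter set are assumptions.

```python
def keeps_ascii_delimiters(charset: str, delimiters: str = "\n\r") -> bool:
    """Return True if every delimiter encodes to the same bytes as in ASCII."""
    return all(ch.encode(charset) == ch.encode("ascii") for ch in delimiters)


gbk_ok = keeps_ascii_delimiters("gbk")       # GBK is ASCII-compatible for these bytes
utf16_ok = keeps_ascii_delimiters("utf-16")  # UTF-16 uses two bytes per delimiter
```

Under this sketch GBK passes and UTF-16 does not, since a UTF-16 newline is not the single byte 0x0A.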

One difference from Hive is that Impala implements
'serialization.encoding' only as a per-partition serdeproperty, to
avoid the confusion of allowing both serde and tbl properties. (See
related IMPALA-13748)

Note: Because the patch contains pre-created non-UTF-8 files,
'gerrit-code-review-checks' was run locally. (See IMPALA-14100)

Change-Id: I787cd01caa52a19d6645519a6cedabe0a5253a65
Reviewed-on: http://gerrit.cloudera.org:8080/22049
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Mihaly Szjatinya
2025-06-01 15:36:48 +02:00
committed by Impala Public Jenkins
parent f8a1f6046a
commit 4837cedc79
34 changed files with 1063 additions and 20 deletions

@@ -365,6 +365,7 @@ struct THdfsStorageDescriptor {
7: required THdfsFileFormat fileFormat
8: required i32 blockSize
9: optional TJsonBinaryFormat jsonBinaryFormat
+10: optional string encodingValue
}
// Represents an HDFS partition

@@ -500,7 +500,9 @@ error_codes = (
"cache entry ($0 bytes)"),
("TUPLE_CACHE_OUTSTANDING_WRITE_LIMIT_EXCEEDED", 163, "Outstanding tuple cache writes "
-"exceeded the limit ($0 bytes)")
+"exceeded the limit ($0 bytes)"),
+("CHARSET_CONVERSION_ERROR", 164, "Error during buffer conversion: $0")
)
import sys