mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
IMPALA-7291: [DOCS] Note about no codegen support for CHAR
Also, cleaned up confusing examples. Change-Id: Id89dcf44e31f1bc56d888527585b3ec90229981a Reviewed-on: http://gerrit.cloudera.org:8080/11022 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit is contained in:
committed by
Impala Public Jenkins
parent
514dbb79ab
commit
f9e7d93854
@@ -21,7 +21,13 @@ under the License.
|
||||
<concept id="char" rev="2.0.0">
|
||||
|
||||
<title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
|
||||
<titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
|
||||
|
||||
<titlealts audience="PDF">
|
||||
|
||||
<navtitle>CHAR</navtitle>
|
||||
|
||||
</titlealts>
|
||||
|
||||
<prolog>
|
||||
<metadata>
|
||||
<data name="Category" value="Impala"/>
|
||||
@@ -36,9 +42,9 @@ under the License.
|
||||
<conbody>
|
||||
|
||||
<p rev="2.0.0">
|
||||
<indexterm audience="hidden">CHAR data type</indexterm>
|
||||
A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
|
||||
values are longer than the specified length, Impala truncates any trailing characters.
|
||||
A fixed-length character type, padded with trailing spaces if necessary to achieve the
|
||||
specified length. If values are longer than the specified length, Impala truncates any
|
||||
trailing characters.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
||||
@@ -50,7 +56,7 @@ under the License.
|
||||
<codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
|
||||
|
||||
<p>
|
||||
The maximum length you can specify is 255.
|
||||
The maximum <varname>length</varname> you can specify is 255.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
@@ -59,21 +65,54 @@ under the License.
|
||||
|
||||
<ul>
|
||||
<li>
|
||||
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
|
||||
the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
|
||||
the column definition.
|
||||
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a
|
||||
table, queries return the value padded with trailing spaces if necessary; the resulting
|
||||
value has the same length as specified in the column definition.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
|
||||
not stored in the data file. When the value is retrieved by a query, the result could have a different
|
||||
number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
|
||||
specified length of the column.
|
||||
Leading spaces in <codeph>CHAR</codeph> are preserved within the data file.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
|
||||
values are considered identical.
|
||||
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those
|
||||
trailing spaces are not stored in the data file. When the value is retrieved by a query,
|
||||
the result could have a different number of trailing spaces. That is, the value includes
|
||||
however many spaces are needed to pad it to the specified length of the column.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
If you compare two <codeph>CHAR</codeph> values that differ only in the number of
|
||||
trailing spaces, those values are considered identical.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
When comparing or processing <codeph>CHAR</codeph> values:
|
||||
<ul>
|
||||
<li>
|
||||
<codeph>CAST()</codeph> truncates any longer string to fit within
|
||||
the defined length. For example:
|
||||
<codeblock>SELECT CAST('x' AS CHAR(4)) = CAST('x ' AS CHAR(4)); -- Returns TRUE.
|
||||
</codeblock>
|
||||
</li>
|
||||
<li>
|
||||
If a <codeph>CHAR</codeph> value is shorter than the specified
|
||||
length, it is padded on the right with spaces until it matches the
|
||||
specified length.
|
||||
</li>
|
||||
<li>
|
||||
<codeph>CHAR_LENGTH()</codeph> returns the length including any
|
||||
trailing spaces.
|
||||
</li>
|
||||
<li>
|
||||
<codeph>LENGTH()</codeph> returns the length excluding trailing
|
||||
spaces.
|
||||
</li>
|
||||
<li>
|
||||
<codeph>CONCAT()</codeph> returns the length including trailing
|
||||
spaces.
|
||||
</li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
@@ -93,18 +132,19 @@ under the License.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Parquet files generated by Impala and containing this type can be freely interchanged with other components
|
||||
such as Hive and MapReduce.
|
||||
Parquet files generated by Impala and containing this type can be freely interchanged
|
||||
with other components such as Hive and MapReduce.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
|
||||
Any trailing spaces, whether implicitly or explicitly specified, are not written to the
|
||||
Parquet data files.
|
||||
</li>
|
||||
|
||||
<li>
|
||||
Parquet data files might contain values that are longer than allowed by the
|
||||
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
|
||||
it processes those values during a query.
|
||||
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra
|
||||
trailing characters when it processes those values during a query.
|
||||
</li>
|
||||
</ul>
|
||||
|
||||
@@ -112,14 +152,18 @@ under the License.
|
||||
|
||||
<p>
|
||||
Text data files might contain values that are longer than allowed for a particular
|
||||
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
|
||||
processes those values during a query. Text data files can also contain values that are shorter than the
|
||||
defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
|
||||
files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
|
||||
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are
|
||||
ignored when Impala processes those values during a query. Text data files can also
|
||||
contain values that are shorter than the defined length limit, and Impala pads them with
|
||||
trailing spaces up to the specified length. Any text data files produced by Impala
|
||||
<codeph>INSERT</codeph> statements do not include any trailing blanks for
|
||||
<codeph>CHAR</codeph> columns.
|
||||
</p>
|
||||
|
||||
<p><b>Avro considerations:</b></p>
|
||||
<p>
|
||||
<b>Avro considerations:</b>
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
|
||||
@@ -129,7 +173,8 @@ under the License.
|
||||
</p>
|
||||
|
||||
<p>
|
||||
Some other database systems make the length specification optional. For Impala, the length is required.
|
||||
Some other database systems make the length specification optional. For Impala, the length
|
||||
is required.
|
||||
</p>
|
||||
|
||||
<!--
|
||||
@@ -146,142 +191,46 @@ it silently treats the value as length 255.
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
|
||||
|
||||
<!-- Seems like a logical design decision but don't think it's currently implemented like this.
|
||||
<p>
|
||||
Because both the maximum and average length are always known and always the same for
|
||||
any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
|
||||
in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
|
||||
<codeph>COMPUTE STATS</codeph> on the table.
|
||||
</p>
|
||||
-->
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
||||
|
||||
<p>
|
||||
These examples show how trailing spaces are not considered significant when comparing or processing
|
||||
<codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
|
||||
length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
|
||||
spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
|
||||
including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
|
||||
trailing spaces.
|
||||
</p>
|
||||
|
||||
<codeblock>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
|
||||
+--------------------------+
|
||||
| unpadded equal to padded |
|
||||
+--------------------------+
|
||||
| true |
|
||||
+--------------------------+
|
||||
|
||||
create table char_length(c char(3));
|
||||
insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
|
||||
select concat("[",c,"]") as c, length(c) from char_length;
|
||||
+-------+-----------+
|
||||
| c | length(c) |
|
||||
+-------+-----------+
|
||||
| [1 ] | 3 |
|
||||
| [12 ] | 3 |
|
||||
| [123] | 3 |
|
||||
| [123] | 3 |
|
||||
+-------+-----------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
|
||||
is a logical data type to use.
|
||||
<!--
|
||||
Because all the <codeph>CHAR</codeph> values have a constant predictable length,
|
||||
Impala can efficiently analyze how best to use these values in join queries,
|
||||
aggregation queries, and other contexts where column length is significant.
|
||||
-->
|
||||
</p>
|
||||
|
||||
<codeblock>create table addresses
|
||||
(id bigint,
|
||||
street_name string,
|
||||
state_abbreviation char(2),
|
||||
country_abbreviation char(2));
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
The following example shows how values written by Impala do not physically include the trailing spaces. It
|
||||
creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
|
||||
and then prints the resulting data file to show that the delimited values are not separated by spaces. The
|
||||
same behavior applies to binary-format Parquet data files.
|
||||
</p>
|
||||
|
||||
<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
|
||||
row format delimited fields terminated by ',';
|
||||
|
||||
insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
|
||||
|
||||
-- Running this Linux command inside impala-shell using the ! shortcut.
|
||||
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
|
||||
foo,bar,baz
|
||||
hello,goodbye,aloha
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
The following example further illustrates the treatment of spaces. It replaces the contents of the previous
|
||||
table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
|
||||
within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
|
||||
leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
|
||||
</p>
|
||||
|
||||
<codeblock>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
|
||||
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
|
||||
trailing, leading and trailing, leading
|
||||
|
||||
select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
|
||||
+------------------------+----------------------------------+--------------------------------------------+
|
||||
| a | b | c |
|
||||
+------------------------+----------------------------------+--------------------------------------------+
|
||||
| [trailing ] | [ leading and trailing ] | [ leading ] |
|
||||
+------------------------+----------------------------------+--------------------------------------------+
|
||||
</codeblock>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/kudu_blurb"/>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
|
||||
|
||||
<p>
|
||||
<b>Performance consideration:</b>
|
||||
</p>
|
||||
|
||||
<p>
|
||||
The <codeph>CHAR</codeph> type currently does not have the Impala Codegen support, and we
|
||||
recommend using <codeph>VARCHAR</codeph> or <codeph>STRING</codeph> over
|
||||
<codeph>CHAR</codeph> as the performance gain of Codegen outweighs the benefits of fixed
|
||||
width <codeph>CHAR</codeph>.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
|
||||
|
||||
<p>
|
||||
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
|
||||
scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
|
||||
that column.
|
||||
Because the blank-padding behavior requires allocating the maximum length for each value
|
||||
in memory, for scalability reasons, you should avoid declaring <codeph>CHAR</codeph>
|
||||
columns that are much longer than typical values in that column.
|
||||
</p>
|
||||
|
||||
<p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
|
||||
|
||||
<p>
|
||||
When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
|
||||
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
|
||||
first, with trailing spaces preserved.
|
||||
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to
|
||||
<codeph>STRING</codeph> first, with trailing spaces preserved.
|
||||
</p>
|
||||
|
||||
<codeblock>select cast("foo " as char(5)) = 'foo' as "char equal to string";
|
||||
+----------------------+
|
||||
| char equal to string |
|
||||
+----------------------+
|
||||
| false |
|
||||
+----------------------+
|
||||
</codeblock>
|
||||
|
||||
<p>
|
||||
This behavior differs from other popular database systems. To get the expected result of
|
||||
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
|
||||
length:
|
||||
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values
|
||||
of the appropriate length. For example:
|
||||
</p>
|
||||
|
||||
<codeblock>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
|
||||
+----------------------+
|
||||
| char equal to string |
|
||||
+----------------------+
|
||||
| true |
|
||||
+----------------------+
|
||||
</codeblock>
|
||||
<codeblock>SELECT CAST("foo " AS CHAR(5)) = CAST('foo' AS CHAR(3)); -- Returns TRUE.</codeblock>
|
||||
|
||||
<p>
|
||||
This behavior is subject to change in future releases.
|
||||
@@ -294,5 +243,7 @@ select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c fr
|
||||
<xref href="impala_literals.xml#string_literals"/>,
|
||||
<xref href="impala_string_functions.xml#string_functions"/>
|
||||
</p>
|
||||
|
||||
</conbody>
|
||||
|
||||
</concept>
|
||||
|
||||
Reference in New Issue
Block a user