IMPALA-7291: [DOCS] Note about no codegen support for CHAR

Also, cleaned up confusing examples.

Change-Id: Id89dcf44e31f1bc56d888527585b3ec90229981a
Reviewed-on: http://gerrit.cloudera.org:8080/11022
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit is contained in:
Alex Rodoni
2018-07-23 15:22:42 -07:00
committed by Impala Public Jenkins
parent 514dbb79ab
commit f9e7d93854

View File

@@ -21,7 +21,13 @@ under the License.
<concept id="char" rev="2.0.0">
<title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
<titlealts audience="PDF">
<navtitle>CHAR</navtitle>
</titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -36,9 +42,9 @@ under the License.
<conbody>
<p rev="2.0.0">
<indexterm audience="hidden">CHAR data type</indexterm>
A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
values are longer than the specified length, Impala truncates any trailing characters.
A fixed-length character type, padded with trailing spaces if necessary to achieve the
specified length. If values are longer than the specified length, Impala truncates any
trailing characters.
</p>
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
@@ -50,7 +56,7 @@ under the License.
<codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
<p>
The maximum length you can specify is 255.
The maximum <varname>length</varname> you can specify is 255.
</p>
<p>
@@ -59,21 +65,54 @@ under the License.
<ul>
<li>
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
the column definition.
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a
table, queries return the value padded with trailing spaces if necessary; the resulting
value has the same length as specified in the column definition.
</li>
<li>
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
not stored in the data file. When the value is retrieved by a query, the result could have a different
number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
specified length of the column.
Leading spaces in <codeph>CHAR</codeph> are preserved within the data file.
</li>
<li>
If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
values are considered identical.
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those
trailing spaces are not stored in the data file. When the value is retrieved by a query,
the result could have a different number of trailing spaces. That is, the value includes
however many spaces are needed to pad it to the specified length of the column.
</li>
<li>
If you compare two <codeph>CHAR</codeph> values that differ only in the number of
trailing spaces, those values are considered identical.
</li>
<li>
When comparing or processing <codeph>CHAR</codeph> values:
<ul>
<li>
<codeph>CAST()</codeph> truncates any longer string to fit within
the defined length. For example:
<codeblock>SELECT CAST('x' AS CHAR(4)) = CAST('x ' AS CHAR(4)); -- Returns TRUE.
</codeblock>
</li>
<li>
If a <codeph>CHAR</codeph> value is shorter than the specified
length, it is padded on the right with spaces until it matches the
specified length.
</li>
<li>
<codeph>CHAR_LENGTH()</codeph> returns the length including any
trailing spaces.
</li>
<li>
<codeph>LENGTH()</codeph> returns the length excluding trailing
spaces.
</li>
<li>
<codeph>CONCAT()</codeph> returns the length including trailing
spaces.
</li>
</ul>
</li>
</ul>
@@ -93,18 +132,19 @@ under the License.
</li>
<li>
Parquet files generated by Impala and containing this type can be freely interchanged with other components
such as Hive and MapReduce.
Parquet files generated by Impala and containing this type can be freely interchanged
with other components such as Hive and MapReduce.
</li>
<li>
Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
Any trailing spaces, whether implicitly or explicitly specified, are not written to the
Parquet data files.
</li>
<li>
Parquet data files might contain values that are longer than allowed by the
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
it processes those values during a query.
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra
trailing characters when it processes those values during a query.
</li>
</ul>
@@ -112,14 +152,18 @@ under the License.
<p>
Text data files might contain values that are longer than allowed for a particular
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
processes those values during a query. Text data files can also contain values that are shorter than the
defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are
ignored when Impala processes those values during a query. Text data files can also
contain values that are shorter than the defined length limit, and Impala pads them with
trailing spaces up to the specified length. Any text data files produced by Impala
<codeph>INSERT</codeph> statements do not include any trailing blanks for
<codeph>CHAR</codeph> columns.
</p>
<p><b>Avro considerations:</b></p>
<p>
<b>Avro considerations:</b>
</p>
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
@@ -129,7 +173,8 @@ under the License.
</p>
<p>
Some other database systems make the length specification optional. For Impala, the length is required.
Some other database systems make the length specification optional. For Impala, the length
is required.
</p>
<!--
@@ -146,142 +191,46 @@ it silently treats the value as length 255.
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
<!-- Seems like a logical design decision but don't think it's currently implemented like this.
<p>
Because both the maximum and average length are always known and always the same for
any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
<codeph>COMPUTE STATS</codeph> on the table.
</p>
-->
<p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
These examples show how trailing spaces are not considered significant when comparing or processing
<codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
trailing spaces.
</p>
<codeblock>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
+--------------------------+
| unpadded equal to padded |
+--------------------------+
| true |
+--------------------------+
create table char_length(c char(3));
insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
select concat("[",c,"]") as c, length(c) from char_length;
+-------+-----------+
| c | length(c) |
+-------+-----------+
| [1 ] | 3 |
| [12 ] | 3 |
| [123] | 3 |
| [123] | 3 |
+-------+-----------+
</codeblock>
<p>
This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
is a logical data type to use.
<!--
Because all the <codeph>CHAR</codeph> values have a constant predictable length,
Impala can efficiently analyze how best to use these values in join queries,
aggregation queries, and other contexts where column length is significant.
-->
</p>
<codeblock>create table addresses
(id bigint,
street_name string,
state_abbreviation char(2),
country_abbreviation char(2));
</codeblock>
<p>
The following example shows how values written by Impala do not physically include the trailing spaces. It
creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
and then prints the resulting data file to show that the delimited values are not separated by spaces. The
same behavior applies to binary-format Parquet data files.
</p>
<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
row format delimited fields terminated by ',';
insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
-- Running this Linux command inside impala-shell using the ! shortcut.
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
foo,bar,baz
hello,goodbye,aloha
</codeblock>
<p>
The following example further illustrates the treatment of spaces. It replaces the contents of the previous
table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
</p>
<codeblock>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
trailing, leading and trailing, leading
select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
+------------------------+----------------------------------+--------------------------------------------+
| a | b | c |
+------------------------+----------------------------------+--------------------------------------------+
| [trailing ] | [ leading and trailing ] | [ leading ] |
+------------------------+----------------------------------+--------------------------------------------+
</codeblock>
<p conref="../shared/impala_common.xml#common/kudu_blurb"/>
<p conref="../shared/impala_common.xml#common/kudu_unsupported_data_type"/>
<p>
<b>Performance consideration:</b>
</p>
<p>
The <codeph>CHAR</codeph> type currently does not have the Impala Codegen support, and we
recommend using <codeph>VARCHAR</codeph> or <codeph>STRING</codeph> over
<codeph>CHAR</codeph> as the performance gain of Codegen outweighs the benefits of fixed
width <codeph>CHAR</codeph>.
</p>
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
<p>
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
that column.
Because the blank-padding behavior requires allocating the maximum length for each value
in memory, for scalability reasons, you should avoid declaring <codeph>CHAR</codeph>
columns that are much longer than typical values in that column.
</p>
<p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
<p>
When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
first, with trailing spaces preserved.
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to
<codeph>STRING</codeph> first, with trailing spaces preserved.
</p>
<codeblock>select cast("foo " as char(5)) = 'foo' as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| false |
+----------------------+
</codeblock>
<p>
This behavior differs from other popular database systems. To get the expected result of
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
length:
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values
of the appropriate length. For example:
</p>
<codeblock>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
+----------------------+
| char equal to string |
+----------------------+
| true |
+----------------------+
</codeblock>
<codeblock>SELECT CAST("foo " AS CHAR(5)) = CAST('foo' AS CHAR(3)); -- Returns TRUE.</codeblock>
<p>
This behavior is subject to change in future releases.
@@ -294,5 +243,7 @@ select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c fr
<xref href="impala_literals.xml#string_literals"/>,
<xref href="impala_string_functions.xml#string_functions"/>
</p>
</conbody>
</concept>