mirror of
https://github.com/apache/impala.git
synced 2025-12-30 03:01:44 -05:00
For this change to land in master, the audience="hidden" code review needs to be completed first. Otherwise, the doc build would still work but the audience="hidden" content would be visible rather than hidden as desired. Some work happening in parallel might introduce additional instances of audience="Cloudera". I suggest addressing those in a followup CR so this global change can land quickly. Since the changes apply across so many different files, but are so narrow in scope, I suggest that the way to validate (check that no extraneous changes were introduced accidentally) is to diff just the changed lines: git diff -U0 HEAD^ HEAD In patch set 2, I updated other topics marked audience="Cloudera" by CRs that were pushed in the meantime. Change-Id: Ic93d89da77e1f51bbf548a522d98d0c4e2fb31c8 Reviewed-on: http://gerrit.cloudera.org:8080/5613 Reviewed-by: John Russell <jrussell@cloudera.com> Tested-by: Impala Public Jenkins
297 lines
12 KiB
XML
297 lines
12 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="char" rev="2.0.0">
|
|
|
|
<title>CHAR Data Type (<keyword keyref="impala20"/> or higher only)</title>
|
|
<titlealts audience="PDF"><navtitle>CHAR</navtitle></titlealts>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="Impala Data Types"/>
|
|
<data name="Category" value="SQL"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="Schemas"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p rev="2.0.0">
|
|
<indexterm audience="hidden">CHAR data type</indexterm>
|
|
A fixed-length character type, padded with trailing spaces if necessary to achieve the specified length. If
|
|
values are longer than the specified length, Impala truncates any trailing characters.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<p>
|
|
In the column definition of a <codeph>CREATE TABLE</codeph> statement:
|
|
</p>
|
|
|
|
<codeblock><varname>column_name</varname> CHAR(<varname>length</varname>)</codeblock>
|
|
|
|
<p>
|
|
The maximum length you can specify is 255.
|
|
</p>
|
|
|
|
<p>
|
|
<b>Semantics of trailing spaces:</b>
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
When you store a <codeph>CHAR</codeph> value shorter than the specified length in a table, queries return
|
|
the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
|
|
the column definition.
|
|
</li>
|
|
|
|
<li>
|
|
If you store a <codeph>CHAR</codeph> value containing trailing spaces in a table, those trailing spaces are
|
|
not stored in the data file. When the value is retrieved by a query, the result could have a different
|
|
number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
|
|
specified length of the column.
|
|
</li>
|
|
|
|
<li>
|
|
If you compare two <codeph>CHAR</codeph> values that differ only in the number of trailing spaces, those
|
|
values are considered identical.
|
|
</li>
|
|
</ul>
|
|
|
|
<p conref="../shared/impala_common.xml#common/partitioning_bad"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hbase_no"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/parquet_blurb"/>
|
|
|
|
<ul>
|
|
<li>
|
|
This type can be read from and written to Parquet files.
|
|
</li>
|
|
|
|
<li>
|
|
There is no requirement for a particular level of Parquet.
|
|
</li>
|
|
|
|
<li>
|
|
Parquet files generated by Impala and containing this type can be freely interchanged with other components
|
|
such as Hive and MapReduce.
|
|
</li>
|
|
|
|
<li>
|
|
Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files.
|
|
</li>
|
|
|
|
<li>
|
|
Parquet data files might contain values that are longer than allowed by the
|
|
<codeph>CHAR(<varname>n</varname>)</codeph> length limit. Impala ignores any extra trailing characters when
|
|
it processes those values during a query.
|
|
</li>
|
|
</ul>
|
|
|
|
<p conref="../shared/impala_common.xml#common/text_blurb"/>
|
|
|
|
<p>
|
|
Text data files might contain values that are longer than allowed for a particular
|
|
<codeph>CHAR(<varname>n</varname>)</codeph> column. Any extra trailing characters are ignored when Impala
|
|
processes those values during a query. Text data files can also contain values that are shorter than the
|
|
defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data
|
|
files produced by Impala <codeph>INSERT</codeph> statements do not include any trailing blanks for
|
|
<codeph>CHAR</codeph> columns.
|
|
</p>
|
|
|
|
<p><b>Avro considerations:</b></p>
|
|
<p conref="../shared/impala_common.xml#common/avro_2gb_strings"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/compatibility_blurb"/>
|
|
|
|
<p>
|
|
This type is available using Impala 2.0 or higher under CDH 4, or with Impala on CDH 5.2 or higher. There are
|
|
no compatibility issues with other components when exchanging data files or running Impala on CDH 4.
|
|
</p>
|
|
|
|
<p>
|
|
Some other database systems make the length specification optional. For Impala, the length is required.
|
|
</p>
|
|
|
|
<!--
|
|
<p>
|
|
The Impala maximum length is larger than for the <codeph>CHAR</codeph> data type in Hive.
|
|
If a Hive query encounters a <codeph>CHAR</codeph> value longer than 255 during processing,
|
|
it silently treats the value as length 255.
|
|
</p>
|
|
-->
|
|
|
|
<p conref="../shared/impala_common.xml#common/internals_max_bytes"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/column_stats_constant"/>
|
|
|
|
<!-- Seems like a logical design decision but don't think it's currently implemented like this.
|
|
<p>
|
|
Because both the maximum and average length are always known and always the same for
|
|
any given <codeph>CHAR(<varname>n</varname>)</codeph> column, those fields are always filled
|
|
in for <codeph>SHOW COLUMN STATS</codeph> output, even before you run
|
|
<codeph>COMPUTE STATS</codeph> on the table.
|
|
</p>
|
|
-->
|
|
|
|
<p conref="../shared/impala_common.xml#common/udf_blurb_no"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
These examples show how trailing spaces are not considered significant when comparing or processing
|
|
<codeph>CHAR</codeph> values. <codeph>CAST()</codeph> truncates any longer string to fit within the defined
|
|
length. If a <codeph>CHAR</codeph> value is shorter than the specified length, it is padded on the right with
|
|
spaces until it matches the specified length. Therefore, <codeph>LENGTH()</codeph> represents the length
|
|
including any trailing spaces, and <codeph>CONCAT()</codeph> also treats the column value as if it has
|
|
trailing spaces.
|
|
</p>
|
|
|
|
<codeblock>select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded";
|
|
+--------------------------+
|
|
| unpadded equal to padded |
|
|
+--------------------------+
|
|
| true |
|
|
+--------------------------+
|
|
|
|
create table char_length(c char(3));
|
|
insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3)));
|
|
select concat("[",c,"]") as c, length(c) from char_length;
|
|
+-------+-----------+
|
|
| c | length(c) |
|
|
+-------+-----------+
|
|
| [1 ] | 3 |
|
|
| [12 ] | 3 |
|
|
| [123] | 3 |
|
|
| [123] | 3 |
|
|
+-------+-----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
This example shows a case where data values are known to have a specific length, where <codeph>CHAR</codeph>
|
|
is a logical data type to use.
|
|
<!--
|
|
Because all the <codeph>CHAR</codeph> values have a constant predictable length,
|
|
Impala can efficiently analyze how best to use these values in join queries,
|
|
aggregation queries, and other contexts where column length is significant.
|
|
-->
|
|
</p>
|
|
|
|
<codeblock>create table addresses
|
|
(id bigint,
|
|
street_name string,
|
|
state_abbreviation char(2),
|
|
country_abbreviation char(2));
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example shows how values written by Impala do not physically include the trailing spaces. It
|
|
creates a table using text format, with <codeph>CHAR</codeph> values much shorter than the declared length,
|
|
and then prints the resulting data file to show that the delimited values are not separated by spaces. The
|
|
same behavior applies to binary-format Parquet data files.
|
|
</p>
|
|
|
|
<codeblock>create table char_in_text (a char(20), b char(30), c char(40))
|
|
row format delimited fields terminated by ',';
|
|
|
|
insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40)));
|
|
|
|
-- Running this Linux command inside impala-shell using the ! shortcut.
|
|
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
|
|
foo,bar,baz
|
|
hello,goodbye,aloha
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example further illustrates the treatment of spaces. It replaces the contents of the previous
|
|
table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved
|
|
within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the
|
|
leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala.
|
|
</p>
|
|
|
|
<codeblock>insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40)));
|
|
!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*';
|
|
trailing, leading and trailing, leading
|
|
|
|
select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text;
|
|
+------------------------+----------------------------------+--------------------------------------------+
|
|
| a | b | c |
|
|
+------------------------+----------------------------------+--------------------------------------------+
|
|
| [trailing ] | [ leading and trailing ] | [ leading ] |
|
|
+------------------------+----------------------------------+--------------------------------------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
|
|
|
|
<p>
|
|
Because the blank-padding behavior requires allocating the maximum length for each value in memory, for
|
|
scalability reasons avoid declaring <codeph>CHAR</codeph> columns that are much longer than typical values in
|
|
that column.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/blobs_are_strings"/>
|
|
|
|
<p>
|
|
When an expression compares a <codeph>CHAR</codeph> with a <codeph>STRING</codeph> or
|
|
<codeph>VARCHAR</codeph>, the <codeph>CHAR</codeph> value is implicitly converted to <codeph>STRING</codeph>
|
|
first, with trailing spaces preserved.
|
|
</p>
|
|
|
|
<codeblock>select cast("foo " as char(5)) = 'foo' as "char equal to string";
|
|
+----------------------+
|
|
| char equal to string |
|
|
+----------------------+
|
|
| false |
|
|
+----------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
This behavior differs from other popular database systems. To get the expected result of
|
|
<codeph>TRUE</codeph>, cast the expressions on both sides to <codeph>CHAR</codeph> values of the appropriate
|
|
length:
|
|
</p>
|
|
|
|
<codeblock>select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string";
|
|
+----------------------+
|
|
| char equal to string |
|
|
+----------------------+
|
|
| true |
|
|
+----------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
This behavior is subject to change in future releases.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
<xref href="impala_string.xml#string"/>, <xref href="impala_varchar.xml#varchar"/>,
|
|
<xref href="impala_literals.xml#string_literals"/>,
|
|
<xref href="impala_string_functions.xml#string_functions"/>
|
|
</p>
|
|
</conbody>
|
|
</concept>
|