<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="orc">

  <title>Using the ORC File Format with Impala Tables</title>
  <titlealts audience="PDF"><navtitle>ORC Data Files</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <!-- <data name="Category" value="ORC"/> -->
      <data name="Category" value="File Formats"/>
      <data name="Category" value="Tables"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="hidden">ORC support in Impala</indexterm> Impala supports using ORC data
      files. Starting in Impala 3.4.0, ORC reads are enabled by default. To disable them, set the
      <codeph>--enable_orc_scanner</codeph> startup flag to <codeph>false</codeph> when starting
      the cluster.
    </p>
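
    <p>
      For example, if you start the Impala daemons directly rather than through a cluster manager
      (the exact mechanism for passing startup flags depends on your deployment, so treat this
      invocation as a sketch), disabling ORC reads might look like the following:
    </p>

    <codeblock>$ impalad --enable_orc_scanner=false <varname>other_flags</varname></codeblock>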

    <table>
      <title>ORC Format Support in Impala</title>
      <tgroup cols="5">
        <colspec colname="1" colwidth="10*"/>
        <colspec colname="2" colwidth="10*"/>
        <colspec colname="3" colwidth="20*"/>
        <colspec colname="4" colwidth="30*"/>
        <colspec colname="5" colwidth="30*"/>
        <thead>
          <row>
            <entry>File Type</entry>
            <entry>Format</entry>
            <entry>Compression Codecs</entry>
            <entry>Impala Can CREATE?</entry>
            <entry>Impala Can INSERT?</entry>
          </row>
        </thead>
        <tbody>
          <row conref="impala_file_formats.xml#file_formats/orc_support">
            <entry/>
          </row>
        </tbody>
      </tgroup>
    </table>

    <p outputclass="toc inpage"/>
  </conbody>

<concept id="orc_create">
|
|
|
|
<title>Creating ORC Tables and Loading Data</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="ETL"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
If you do not have an existing data file to use, begin by creating one in the appropriate format.
|
|
</p>
|
|
|
|
<p>
|
|
<b>To create an ORC table:</b>
|
|
</p>
|
|
|
|
<p>
|
|
In the <codeph>impala-shell</codeph> interpreter, issue a command similar to:
|
|
</p>
|
|
|
|
      <codeblock>CREATE TABLE orc_table (<varname>column_specs</varname>) STORED AS ORC;</codeblock>
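
      <p>
        For example, with hypothetical column names substituted for
        <varname>column_specs</varname>:
      </p>

      <codeblock>CREATE TABLE orc_demo (id INT, name STRING) STORED AS ORC;</codeblock>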

      <p>
        Because Impala can query some kinds of tables that it cannot currently write to, after
        creating tables of certain file formats, you might use the Hive shell to load the data. See
        <xref href="impala_file_formats.xml#file_formats"/> for details. After loading data into a
        table through Hive or another mechanism outside of Impala, issue a
        <codeph>REFRESH <varname>table_name</varname></codeph> statement the next time you connect
        to the Impala node, before querying the table, so that Impala recognizes the new data.
      </p>

      <p>
        For example, here is how you might create some ORC tables in Impala (by specifying the
        columns explicitly, or cloning the structure of another table), load data through Hive,
        and query them through Impala:
      </p>

      <codeblock>$ impala-shell -i localhost
[localhost:21000] default> CREATE TABLE orc_table (x INT) STORED AS ORC;
[localhost:21000] default> CREATE TABLE orc_clone LIKE some_other_table STORED AS ORC;
[localhost:21000] default> quit;

$ hive
hive> INSERT INTO TABLE orc_table SELECT x FROM some_other_table;
3 Rows loaded to orc_table
Time taken: 4.169 seconds
hive> quit;

$ impala-shell -i localhost
[localhost:21000] default> SELECT * FROM orc_table;
Fetched 0 row(s) in 0.11s
[localhost:21000] default> -- Make Impala recognize the data loaded through Hive;
[localhost:21000] default> REFRESH orc_table;
[localhost:21000] default> SELECT * FROM orc_table;
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
Fetched 3 row(s) in 0.11s</codeblock>

    </conbody>
  </concept>

<concept id="orc_compression">
|
|
|
|
<title>Enabling Compression for ORC Tables</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Snappy"/>
|
|
<data name="Category" value="Compression"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
      <p>
        <indexterm audience="hidden">compression</indexterm>
        By default, ORC tables are compressed with zlib (called Deflate in Impala). You might want
        to use Snappy or LZO compression on existing tables for a different balance between
        compression ratio and decompression speed. As of Hive 1.1.0, the supported compression
        codecs for ORC tables are NONE, ZLIB, SNAPPY, and LZO. For example, to enable Snappy
        compression, you would specify the following additional settings when loading data through
        the Hive shell:
      </p>

      <codeblock>hive> SET hive.exec.compress.output=true;
hive> SET orc.compress=SNAPPY;
hive> INSERT OVERWRITE TABLE <varname>new_table</varname> SELECT * FROM <varname>old_table</varname>;</codeblock>
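
      <p>
        To confirm which codec was actually used, you can inspect one of the resulting data files
        with the ORC file dump utility bundled with Hive; the HDFS path below is a placeholder for
        the table's actual warehouse location and file name:
      </p>

      <codeblock>$ hive --orcfiledump /user/hive/warehouse/<varname>new_table</varname>/000000_0</codeblock>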

      <p>
        If you are converting partitioned tables, you must complete additional steps. In such a
        case, specify additional settings similar to the following:
      </p>

      <codeblock>hive> CREATE TABLE <varname>new_table</varname> (<varname>your_cols</varname>) PARTITIONED BY (<varname>partition_cols</varname>) STORED AS <varname>new_format</varname>;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE <varname>new_table</varname> PARTITION(<varname>comma_separated_partition_cols</varname>) SELECT * FROM <varname>old_table</varname>;</codeblock>

      <p>
        Remember that Hive does not require you to specify a source format for the data. Consider
        the case of converting an existing table named <codeph>tbl</codeph> to a Snappy-compressed
        ORC table. Combining the components outlined previously to complete this table conversion,
        you would specify settings similar to the following:
      </p>

      <codeblock>hive> CREATE TABLE tbl_orc (int_col INT, string_col STRING) STORED AS ORC;
hive> SET hive.exec.compress.output=true;
hive> SET orc.compress=SNAPPY;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_orc SELECT * FROM tbl;</codeblock>

      <p>
        To complete a similar process for a table that includes partitions, you would specify
        settings similar to the following:
      </p>

      <codeblock>hive> CREATE TABLE tbl_orc (int_col INT, string_col STRING) PARTITIONED BY (year INT) STORED AS ORC;
hive> SET hive.exec.compress.output=true;
hive> SET orc.compress=SNAPPY;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.dynamic.partition=true;
hive> INSERT OVERWRITE TABLE tbl_orc PARTITION(year) SELECT * FROM tbl;</codeblock>
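
      <p>
        As described earlier, after loading data through Hive, issue a <codeph>REFRESH</codeph>
        statement in <codeph>impala-shell</codeph> so that Impala recognizes the new data files:
      </p>

      <codeblock>$ impala-shell -i localhost
[localhost:21000] default> REFRESH tbl_orc;</codeblock>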

      <note>
        <p>
          The compression type is specified in the following command:
        </p>
        <codeblock>SET orc.compress=SNAPPY;</codeblock>
        <p>
          You could elect to specify alternative codecs, such as <codeph>NONE</codeph>,
          <codeph>ZLIB</codeph>, or <codeph>LZO</codeph>, here.
        </p>
      </note>
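
      <p>
        For example, to use zlib compression instead, you would substitute the codec name in the
        same sequence of settings (the table names are placeholders):
      </p>

      <codeblock>hive> SET hive.exec.compress.output=true;
hive> SET orc.compress=ZLIB;
hive> INSERT OVERWRITE TABLE <varname>new_table</varname> SELECT * FROM <varname>old_table</varname>;</codeblock>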
    </conbody>
  </concept>

<concept id="rcfile_performance">
|
|
|
|
<title>Query Performance for Impala ORC Tables</title>
|
|
|
|
<conbody>
|
|
|
|
      <p>
        In general, expect query performance with ORC tables to be faster than with tables using
        text data, but slower than with Parquet tables, because Impala includes many optimizations
        specific to Parquet. See <xref href="impala_parquet.xml#parquet"/> for information about
        using the Parquet file format for high-performance analytic queries.
      </p>
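
      <p>
        To see where the time goes when comparing formats, you can run the same query against ORC
        and Parquet versions of a table and compare per-operator timings with the
        <codeph>SUMMARY</codeph> command in <codeph>impala-shell</codeph>; the table names here
        are illustrative:
      </p>

      <codeblock>[localhost:21000] default> SELECT COUNT(*) FROM orc_table;
[localhost:21000] default> SUMMARY;
[localhost:21000] default> SELECT COUNT(*) FROM parquet_table;
[localhost:21000] default> SUMMARY;</codeblock>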

      <p conref="../shared/impala_common.xml#common/s3_block_splitting"/>

    </conbody>
  </concept>

<concept id="orc_data_types">
|
|
|
|
<title>Data Type Considerations for ORC Tables</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The ORC format defines a set of data types whose names differ from the names of the corresponding
|
|
Impala data types. If you are preparing ORC files using other Hadoop components such as Pig or
|
|
MapReduce, you might need to work with the type names defined by ORC. The following figure lists the
|
|
ORC-defined types and the equivalent types in Impala.
|
|
</p>
|
|
|
|
      <p>
        <b>Primitive types:</b>
      </p>

      <codeblock>BINARY -> STRING
BOOLEAN -> BOOLEAN
DOUBLE -> DOUBLE
FLOAT -> FLOAT
TINYINT -> TINYINT
SMALLINT -> SMALLINT
INT -> INT
BIGINT -> BIGINT
TIMESTAMP -> TIMESTAMP
DATE (not supported)</codeblock>
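
      <p>
        Because most of these mappings are one-to-one, a table declared from Hive with these types
        can be queried from Impala using the same column types; the table and column names in this
        sketch are illustrative:
      </p>

      <codeblock>hive> CREATE TABLE orc_types_demo (bool_col BOOLEAN, tiny_col TINYINT, small_col SMALLINT, int_col INT, big_col BIGINT, float_col FLOAT, double_col DOUBLE, ts_col TIMESTAMP) STORED AS ORC;</codeblock>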

      <p>
        <b>Complex types:</b>
      </p>

      <p conref="../shared/impala_common.xml#common/complex_types_short_intro"/>

    </conbody>
  </concept>

</concept>