<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="file_formats">

  <title>How Impala Works with Hadoop File Formats</title>

  <titlealts audience="PDF">

    <navtitle>File Formats</navtitle>

  </titlealts>

  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Concepts"/>
      <data name="Category" value="Hadoop"/>
      <data name="Category" value="File Formats"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Stub Pages"/>
    </metadata>
  </prolog>

  <conbody>
    <p>
      Impala supports several familiar file formats used in Apache Hadoop. Impala
      can load and query data files produced by other Hadoop components such as
      Spark, and data files produced by Impala can also be used by other
      components. The following sections discuss the procedures, limitations, and
      performance considerations for using each file format with Impala.
    </p>

    <p>
      The file format used for an Impala table has significant performance
      consequences. Some file formats include compression support that affects the
      size of data on disk and, consequently, the amount of I/O and CPU resources
      required to deserialize the data. Because querying often begins with moving
      and decompressing data, these I/O and CPU demands can be a limiting factor
      in query performance. Compressing data reduces the total number of bytes
      transferred from disk to memory, which shortens transfer time, at the cost
      of the CPU time spent decompressing the content.
    </p>

    <p>
      For the file formats that Impala cannot write to, create the table from
      within Impala whenever possible and insert data using another component such
      as Hive or Spark. See the table below for specific file formats.
    </p>

    <p>
      The following table lists the file formats that Impala supports.
    </p>
    <table>
      <tgroup cols="5">
        <colspec colname="1" colwidth="10*"/>
        <colspec colname="2" colwidth="10*"/>
        <colspec colname="3" colwidth="20*"/>
        <colspec colname="4" colwidth="30*"/>
        <colspec colname="5" colwidth="30*"/>
        <thead>
          <row>
            <entry>File Type</entry>
            <entry>Format</entry>
            <entry>Compression Codecs</entry>
            <entry>Impala Can CREATE?</entry>
            <entry>Impala Can INSERT?</entry>
          </row>
        </thead>
        <tbody>
          <row id="parquet_support">
            <entry><xref href="impala_parquet.xml#parquet">Parquet</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, zstd, LZ4; currently Snappy by default</entry>
            <entry>Yes.</entry>
            <entry>
              Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>,
              <codeph>LOAD DATA</codeph>, and query.
            </entry>
          </row>
          <row id="orc_support">
            <entry><xref href="impala_orc.xml#orc">ORC</xref></entry>
            <entry>Structured</entry>
            <entry>gzip, Snappy, LZO, LZ4; currently gzip by default</entry>
            <entry>
              Yes, in Impala 2.12.0 and higher.
              <p>By default, ORC reads are enabled in Impala 3.4.0 and higher.</p>
            </entry>
            <entry>
              No. Import data by using <codeph>LOAD DATA</codeph> on data files
              already in the right format, or use <codeph>INSERT</codeph> in Hive
              followed by <codeph>REFRESH <varname>table_name</varname></codeph>
              in Impala.
            </entry>
          </row>
          <row id="txtfile_support">
            <entry><xref href="impala_txtfile.xml#txtfile">Text</xref></entry>
            <entry>Unstructured</entry>
            <entry rev="2.0.0">bzip2, deflate, gzip, LZO, Snappy, zstd</entry>
            <entry>
              Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED
              AS</codeph> clause, the default file format is uncompressed text,
              with values separated by ASCII <codeph>0x01</codeph> characters
              (typically represented as Ctrl-A).
            </entry>
            <entry>
              Yes if uncompressed.
              <p>No if compressed.</p>
              <p>If LZO compression is used, you must create the table and load
              data in Hive.</p>
              <p>If other kinds of compression are used, you must load data
              through <codeph>LOAD DATA</codeph>, Hive, or manually in HDFS.</p>
            </entry>
          </row>
          <row id="avro_support">
            <entry><xref href="impala_avro.xml#avro">Avro</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, deflate</entry>
            <entry rev="1.4.0">
              Yes, in Impala 1.4.0 and higher. In lower versions, create the
              table using Hive.
            </entry>
            <entry>
              No. Import data by using <codeph>LOAD DATA</codeph> on data files
              already in the right format, or use <codeph>INSERT</codeph> in Hive
              followed by <codeph>REFRESH <varname>table_name</varname></codeph>
              in Impala.
            </entry>
          </row>
          <row id="hudi_support">
            <entry><xref href="impala_hudi.xml#hudi">Hudi</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, zstd, LZ4; currently Snappy by default</entry>
            <entry>Yes, support for Read Optimized Queries is experimental.</entry>
            <entry>
              No. Create an external table in Impala and set the table location
              to the Hudi table directory. Alternatively, create the Hudi table
              in Hive.
            </entry>
          </row>
          <row id="rcfile_support">
            <entry><xref href="impala_rcfile.xml#rcfile">RCFile</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, deflate, bzip2</entry>
            <entry>Yes.</entry>
            <entry>
              No. Import data by using <codeph>LOAD DATA</codeph> on data files
              already in the right format, or use <codeph>INSERT</codeph> in Hive
              followed by <codeph>REFRESH <varname>table_name</varname></codeph>
              in Impala.
            </entry>
          </row>
          <row id="sequencefile_support">
            <entry><xref href="impala_seqfile.xml#seqfile">SequenceFile</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, deflate, bzip2</entry>
            <entry>Yes.</entry>
            <entry>
              No. Import data by using <codeph>LOAD DATA</codeph> on data files
              already in the right format, or use <codeph>INSERT</codeph> in Hive
              followed by <codeph>REFRESH <varname>table_name</varname></codeph>
              in Impala.
            </entry>
          </row>
        </tbody>
      </tgroup>
    </table>
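    <p>
      For the formats marked "No" in the <b>Impala Can INSERT?</b> column, a
      typical workflow is to create the table in Impala, populate it from Hive,
      and then refresh the table metadata in Impala. The following sketch uses a
      hypothetical Avro table named <codeph>events_avro</codeph>:
    </p>

<codeblock>-- In impala-shell: Impala can CREATE an Avro table, but not INSERT into it.
CREATE TABLE events_avro (id BIGINT, payload STRING) STORED AS AVRO;

-- In Hive: populate the table, because Impala cannot write this format.
INSERT INTO TABLE events_avro SELECT id, payload FROM events_staging;

-- Back in impala-shell: pick up the data files that Hive wrote.
REFRESH events_avro;</codeblock>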
    <p>
      Impala supports the following compression codecs:
    </p>

    <dl>
      <dlentry rev="2.0.0">
        <dt>Snappy</dt>
        <dd>
          <p>
            Recommended for its effective balance between compression ratio and
            decompression speed. Snappy compression is very fast, but gzip
            provides greater space savings. Supported for text, RC, Sequence, and
            Avro files in Impala 2.0 and higher.
          </p>
        </dd>
      </dlentry>

      <dlentry rev="2.0.0">
        <dt>Gzip</dt>
        <dd>
          <p>
            Recommended when the highest level of compression (and therefore the
            greatest disk-space savings) is desired. Supported for text, RC,
            Sequence, and Avro files in Impala 2.0 and higher.
          </p>
        </dd>
      </dlentry>

      <dlentry rev="2.0.0">
        <dt>Deflate</dt>
        <dd>
          <p>
            Supported for Avro, RC, Sequence, and text files.
          </p>
        </dd>
      </dlentry>

      <dlentry rev="2.0.0">
        <dt>Bzip2</dt>
        <dd>
          <p>
            Supported for text, RC, and Sequence files in Impala 2.0 and higher.
          </p>
        </dd>
      </dlentry>

      <dlentry rev="2.0.0">
        <dt>LZO</dt>
        <dd>
          <p>
            For text files only. Impala can query LZO-compressed text tables, but
            currently cannot create them or insert data into them. Perform these
            operations in Hive.
          </p>
        </dd>
      </dlentry>

      <dlentry>
        <dt>Zstd</dt>
        <dd>For Parquet and text files only.</dd>
      </dlentry>

      <dlentry>
        <dt>LZ4</dt>
        <dd>For Parquet files only.</dd>
      </dlentry>
    </dl>
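    <p>
      When Impala writes Parquet files, the codec can be chosen per session with
      the <codeph>COMPRESSION_CODEC</codeph> query option. The following is a
      minimal sketch with a hypothetical table name; the default codec is Snappy
      when the option is not set:
    </p>

<codeblock>-- In impala-shell: write Parquet data files with zstd instead of Snappy.
SET COMPRESSION_CODEC=zstd;
CREATE TABLE sales_zstd STORED AS PARQUET AS SELECT * FROM sales;</codeblock>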
  </conbody>

  <concept id="file_format_choosing">

    <title>Choosing the File Format for a Table</title>

    <prolog>
      <metadata>
        <data name="Category" value="Planning"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Different file formats and compression codecs work better for different
        data sets. Choosing the proper format for your data can yield performance
        improvements. Use the following considerations to decide which
        combination of file format and compression to use for a particular table:
      </p>

      <ul>
        <li>
          If you are working with existing files that are already in a supported
          file format, use the same format for the Impala table if performance is
          acceptable. If the original format does not yield acceptable query
          performance or resource usage, consider creating a new Impala table
          with a different file format or compression characteristics, and doing
          a one-time conversion by rewriting the data to the new table.
        </li>

        <li>
          Text files are convenient to produce through many different tools and
          are human-readable for ease of verification and debugging. Those
          characteristics are why text is the default format for an Impala
          <codeph>CREATE TABLE</codeph> statement. However, when performance and
          resource usage are the primary considerations, use one of the
          structured file formats that include metadata and built-in compression.
          <p>
            A typical workflow might involve bringing data into an Impala table
            by copying CSV or TSV files into the appropriate data directory, and
            then using the <codeph>INSERT ... SELECT</codeph> syntax to rewrite
            the data into a table using a different, more compact file format.
          </p>
        </li>
      </ul>
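
      <p>
        The workflow in the last bullet can be sketched as follows; the table
        names and HDFS path are hypothetical:
      </p>

<codeblock>-- Text table whose LOCATION points at the directory holding the CSV files.
CREATE EXTERNAL TABLE logs_csv (ts STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/impala/staging/logs';

-- One-time conversion into a more compact, structured format.
CREATE TABLE logs_parquet STORED AS PARQUET AS SELECT * FROM logs_csv;</codeblock>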

    </conbody>

  </concept>

</concept>