<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="file_formats">
  <title>How Impala Works with Hadoop File Formats</title>
  <titlealts audience="PDF"><navtitle>File Formats</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Concepts"/>
      <data name="Category" value="Hadoop"/>
      <data name="Category" value="File Formats"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
      <!-- Like Impala Administration, this page has a fair bit of info already, but it could benefit from wiki-style embedding of intro text from those other pages. -->
      <!-- In this case, that would also enable a good in-page TOC since there is already one lonely subtopic on this same page. -->
      <data name="Category" value="Stub Pages"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="hidden">file formats</indexterm>
      <indexterm audience="hidden">compression</indexterm>
      Impala supports several familiar file formats used in Apache Hadoop. Impala can load and query data files
      produced by other Hadoop components such as Pig or MapReduce, and data files produced by Impala can be used
      by other components as well. The following sections discuss the procedures, limitations, and performance
      considerations for using each file format with Impala.
    </p>

    <p>
      The file format used for an Impala table has significant performance consequences. Some file formats include
      compression support that affects the size of data on the disk and, consequently, the amount of I/O and CPU
      resources required to deserialize the data. Because querying often begins with moving and decompressing
      data, these I/O and CPU requirements can be a limiting factor in query performance. Compressing the data
      reduces the total number of bytes transferred from disk to memory, and therefore the transfer time, at the
      cost of the CPU cycles spent decompressing the content.
    </p>

    <p>
      Impala can query files encoded with most of the popular file formats and compression codecs used in Hadoop.
      Impala can create and insert data into tables that use some file formats but not others; for file formats
      that Impala cannot write to, create the table in Hive, issue the
      <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph> statement in
      <codeph>impala-shell</codeph>, and query the table through Impala. File formats can be structured, in which
      case they may include metadata and built-in compression. Supported formats include:
    </p>

    <table>
      <title>File Format Support in Impala</title>
      <tgroup cols="5">
        <colspec colname="1" colwidth="10*"/>
        <colspec colname="2" colwidth="10*"/>
        <colspec colname="3" colwidth="20*"/>
        <colspec colname="4" colwidth="30*"/>
        <colspec colname="5" colwidth="30*"/>
        <thead>
          <row>
            <entry>File Type</entry>
            <entry>Format</entry>
            <entry>Compression Codecs</entry>
            <entry>Impala Can CREATE?</entry>
            <entry>Impala Can INSERT?</entry>
          </row>
        </thead>
        <tbody>
          <row id="parquet_support">
            <entry><xref href="impala_parquet.xml#parquet">Parquet</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip; currently Snappy by default</entry>
            <entry>Yes.</entry>
            <entry>
              Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
            </entry>
          </row>
          <row id="txtfile_support">
            <entry><xref href="impala_txtfile.xml#txtfile">Text</xref></entry>
            <entry>Unstructured</entry>
            <entry rev="2.0.0">LZO, gzip, bzip2, Snappy</entry>
            <entry>
              Yes. For <codeph>CREATE TABLE</codeph> with no <codeph>STORED AS</codeph> clause, the default file
              format is uncompressed text, with values separated by ASCII <codeph>0x01</codeph> characters
              (typically represented as Ctrl-A).
            </entry>
            <entry>
              Yes: <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, and query.
              If LZO compression is used, you must create the table and load data in Hive. If other kinds of
              compression are used, you must load data through <codeph>LOAD DATA</codeph>, Hive, or manually in
              HDFS.
            </entry>
          </row>
          <row id="avro_support">
            <entry><xref href="impala_avro.xml#avro">Avro</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, deflate, bzip2</entry>
            <entry rev="1.4.0">
              Yes, in Impala 1.4.0 and higher. Before that, create the table using Hive.
            </entry>
            <entry>
              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
            </entry>
          </row>
          <row id="rcfile_support">
            <entry><xref href="impala_rcfile.xml#rcfile">RCFile</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, deflate, bzip2</entry>
            <entry>Yes.</entry>
            <entry>
              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
            </entry>
          </row>
          <row id="sequencefile_support">
            <entry><xref href="impala_seqfile.xml#seqfile">SequenceFile</xref></entry>
            <entry>Structured</entry>
            <entry>Snappy, gzip, deflate, bzip2</entry>
            <entry>Yes.</entry>
            <entry>
              No. Import data by using <codeph>LOAD DATA</codeph> on data files already in the right format, or use
              <codeph>INSERT</codeph> in Hive followed by <codeph>REFRESH <varname>table_name</varname></codeph> in Impala.
            </entry>
          </row>
        </tbody>
      </tgroup>
    </table>
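
    <p>
      As a concrete illustration of the <codeph>CREATE</codeph> and <codeph>INSERT</codeph> columns in the
      preceding table, the following minimal sketch creates one table in each of two formats. The table and
      column names (<codeph>events_text</codeph>, <codeph>events_parquet</codeph>) are invented for the example;
      only the presence or absence of the <codeph>STORED AS</codeph> clause matters here.
    </p>

<codeblock>-- Text is the default when no STORED AS clause is given;
-- fields are separated by the ASCII 0x01 (Ctrl-A) character.
CREATE TABLE events_text (event_id BIGINT, event_name STRING);

-- Parquet must be requested explicitly.
CREATE TABLE events_parquet (event_id BIGINT, event_name STRING)
  STORED AS PARQUET;

-- Impala can both CREATE and INSERT for Parquet tables.
INSERT INTO events_parquet SELECT event_id, event_name FROM events_text;
</codeblock>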

    <p rev="DOCS-1370">
      Impala can only query the file formats listed in the preceding table.
      In particular, Impala does not support the ORC file format.
    </p>
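
    <p>
      For a format that Impala can query but not write, such as Avro, the write side happens in Hive and Impala
      picks up the result afterward, as described earlier. A minimal sketch of that round trip, using invented
      table names (<codeph>logs_avro</codeph>, <codeph>logs_text</codeph>), might look like the following.
    </p>

<codeblock>-- In Hive: create and populate the table
-- (assumes an existing text table named logs_text).
CREATE TABLE logs_avro (msg STRING) STORED AS AVRO;
INSERT INTO TABLE logs_avro SELECT msg FROM logs_text;

-- In impala-shell: make the new table visible, then query it.
INVALIDATE METADATA logs_avro;
SELECT COUNT(*) FROM logs_avro;

-- After later Hive inserts into the existing table, a REFRESH suffices:
REFRESH logs_avro;
</codeblock>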

    <p>
      Impala supports the following compression codecs:
    </p>

    <ul>
      <li rev="2.0.0">
        Snappy. Recommended for its effective balance between compression ratio and decompression speed. Snappy
        compression is very fast, but gzip provides greater space savings. Supported for text files in Impala 2.0
        and higher.
      </li>

      <li rev="2.0.0">
        Gzip. Recommended when you want the highest level of compression (and therefore the greatest disk-space
        savings). Supported for text files in Impala 2.0 and higher.
      </li>

      <li>
        Deflate. Not supported for text files.
      </li>

      <li rev="2.0.0">
        Bzip2. Supported for text files in Impala 2.0 and higher.
      </li>

      <li>
        <p rev="2.0.0">
          LZO, for text files only. Impala can query LZO-compressed text tables, but currently cannot create them
          or insert data into them; perform these operations in Hive.
        </p>
      </li>
    </ul>
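
    <p>
      For Parquet tables written by Impala, the codec is controlled by the <codeph>COMPRESSION_CODEC</codeph>
      query option rather than by the table definition. A brief sketch, assuming the hypothetical
      <codeph>events_parquet</codeph> and <codeph>events_text</codeph> tables from the earlier example:
    </p>

<codeblock>-- In impala-shell: choose gzip for maximum space savings
-- on subsequent Parquet writes in this session.
SET COMPRESSION_CODEC=gzip;
INSERT INTO events_parquet SELECT event_id, event_name FROM events_text;

-- Revert to the default (Snappy) for the usual speed/size balance.
SET COMPRESSION_CODEC=snappy;
</codeblock>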

  </conbody>

  <concept id="file_format_choosing">

    <title>Choosing the File Format for a Table</title>
    <prolog>
      <metadata>
        <data name="Category" value="Planning"/>
      </metadata>
    </prolog>

    <conbody>

      <p>
        Different file formats and compression codecs work better for different data sets. While Impala typically
        provides performance gains regardless of file format, choosing the proper format for your data can yield
        further performance improvements. Use the following considerations to decide which combination of file
        format and compression to use for a particular table:
      </p>

      <ul>
        <li>
          If you are working with existing files that are already in a supported file format, use the same format
          for the Impala table where practical. If the original format does not yield acceptable query performance
          or resource usage, consider creating a new Impala table with different file format or compression
          characteristics, and doing a one-time conversion by copying the data to the new table using the
          <codeph>INSERT</codeph> statement, as shown in the example following this list. Depending on the file
          format, you might run the <codeph>INSERT</codeph> statement in <codeph>impala-shell</codeph> or in Hive.
        </li>

        <li>
          Text files are convenient to produce through many different tools, and are human-readable for ease of
          verification and debugging. Those characteristics are why text is the default format for an Impala
          <codeph>CREATE TABLE</codeph> statement. When performance and resource usage are the primary
          considerations, use one of the other file formats and consider using compression. A typical workflow
          might involve bringing data into an Impala table by copying CSV or TSV files into the appropriate data
          directory, and then using the <codeph>INSERT ... SELECT</codeph> syntax to copy the data into a table
          using a different, more compact file format.
        </li>

        <li>
          If your architecture involves storing data to be queried in memory, do not compress the data. There are
          no I/O savings since the data does not need to be moved from disk, but there is a CPU cost to decompress
          the data.
        </li>
      </ul>
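
      <p>
        The one-time conversion mentioned in the first item above amounts to a single
        <codeph>INSERT ... SELECT</codeph> statement. A minimal sketch, again with invented table names
        (<codeph>sales_staging</codeph>, <codeph>sales</codeph>):
      </p>

<codeblock>-- Raw data arrives as text (for example, from copied CSV files).
CREATE TABLE sales_staging (id BIGINT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Query-optimized copy in a compact format.
CREATE TABLE sales (id BIGINT, amount DOUBLE) STORED AS PARQUET;
INSERT INTO sales SELECT id, amount FROM sales_staging;
</codeblock>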

    </conbody>
  </concept>
</concept>