<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="s3" rev="2.2.0">

  <title>Using Impala with Amazon S3 Object Store</title>
  <titlealts audience="PDF"><navtitle>S3 Tables</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="Amazon"/>
      <data name="Category" value="S3"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Preview Features"/>
    </metadata>
  </prolog>

  <conbody>

    <p rev="2.2.0"> You can use Impala to query data residing on the Amazon S3
      object store. This capability allows convenient access to a storage system
      that is remotely managed, accessible from anywhere, and integrated with
      various cloud-based services. Impala can query files in any supported file
      format from S3. The S3 storage location can be for an entire table, or
      individual partitions in a partitioned table. </p>

    <p outputclass="toc inpage"/>

  </conbody>

<concept id="s3_best_practices" rev="2.6.0 IMPALA-1878">
|
|
<title>Best Practices for Using Impala with S3</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Guidelines"/>
|
|
<data name="Category" value="Best Practices"/>
|
|
</metadata>
|
|
</prolog>
|
|
<conbody>
|
|
<p> The following guidelines summarize the best practices described in the
|
|
rest of this topic: </p>
|
|
<ul>
|
|
<li>
|
|
<p> Any reference to an S3 location must be fully qualified when S3 is
|
|
not designated as the default storage, for example,
|
|
<codeph>s3a:://[s3-bucket-name]</codeph>.</p>
|
|
</li>
|
|
<li>
|
|
<p> Set <codeph>fs.s3a.connection.maximum</codeph> to 1500 for
|
|
<cmdname>impalad</cmdname>. </p>
|
|
</li>
|
|
<li>
|
|
<p> Set <codeph>fs.s3a.block.size</codeph> to 134217728 (128 MB in
|
|
bytes) if most Parquet files queried by Impala were written by Hive
|
|
or ParquetMR jobs. </p>
|
|
<p>Set the block size to 268435456 (256 MB in bytes) if most Parquet
|
|
files queried by Impala were written by Impala. </p>
|
|
<p>Starting in Impala 3.4.0, instead of
|
|
<codeph>fs.s3a.block.size</codeph>, the
|
|
<codeph>PARQUET_OBJECT_STORE_SPLIT_SIZE</codeph> query option
|
|
controls the Parquet-specific split size. The default value is 256
|
|
MB.</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
<codeph>DROP TABLE .. PURGE</codeph> is much faster than the default
|
|
<codeph>DROP TABLE</codeph>. The same applies to <codeph>ALTER
|
|
TABLE ... DROP PARTITION PURGE</codeph> versus the default
|
|
<codeph>DROP PARTITION</codeph> operation. Due to the eventually
|
|
consistent nature of S3, the files for that table or partition could
|
|
remain for some unbounded time when using <codeph>PURGE</codeph>.
|
|
The default <codeph>DROP TABLE/PARTITION</codeph> is slow because
|
|
Impala copies the files to the S3A trash folder, and Impala waits
|
|
until all the data is moved. <codeph>DROP TABLE/PARTITION ..
|
|
PURGE</codeph> is a fast delete operation, and the Impala
|
|
statement finishes quickly even though the change might not have
|
|
propagated fully throughout S3. </p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
<codeph>INSERT</codeph> statements are faster than <codeph>INSERT
|
|
OVERWRITE</codeph> for S3. The query option
|
|
<codeph>S3_SKIP_INSERT_STAGING</codeph>, which is set to
|
|
<codeph>true</codeph> by default, skips the staging step for
|
|
regular <codeph>INSERT</codeph> (but not <codeph>INSERT
|
|
OVERWRITE</codeph>). This makes the operation much faster, but
|
|
consistency is not guaranteed: if a node fails during execution, the
|
|
table could end up with inconsistent data. Set this option to
|
|
<codeph>false</codeph> if stronger consistency is required,
|
|
however, this setting will make the <codeph>INSERT</codeph>
|
|
operations slower. </p>
|
|
<ul>
|
|
<li>
|
|
<p> For Impala-ACID tables, both <codeph>INSERT</codeph> and
|
|
<codeph>INSERT OVERWRITE</codeph> tables for S3 are fast,
|
|
regardless of the setting of
|
|
<codeph>S3_SKIP_INSERT_STAGING</codeph>. Plus, consistency is
|
|
guaranteed with ACID tables.</p>
|
|
</li>
|
|
</ul>
|
|
</li>
|
|
<li>Enable <xref href="impala_data_cache.xml#data_cache">data cache for
|
|
remote reads</xref>.</li>
|
|
<li>Enable <xref
|
|
href="https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/s3guard.html"
|
|
format="html" scope="external">S3Guard</xref> in your cluster for
|
|
data consistency.</li>
|
|
<li>
|
|
<p> Too many files in a table can make metadata load and update slow
|
|
in S3. If too many requests are made to S3, S3 has a back-off
|
|
mechanism and responds slower than usual.</p>
|
|
<ul>
|
|
<li>If you have many small files due to over-granular partitioning,
|
|
configure partitions with many megabytes of data so that even a
|
|
query against a single partition can be parallelized effectively. </li>
|
|
<li>If you have many small files because of many small
|
|
<codeph>INSERT</codeph> queries, use bulk
|
|
<codeph>INSERT</codeph>s so that more data is written to fewer
|
|
files. </li>
|
|
</ul>
|
|
</li>
|
|
</ul>
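      <p> The following <filepath>core-site.xml</filepath> snippet is a minimal
        sketch of the connection and block size settings recommended above, using
        the 128 MB block size suggested for files written by Hive or ParquetMR;
        adjust the values for your own workload. </p>
<codeblock><property>
  <name>fs.s3a.connection.maximum</name>
  <value>1500</value>
</property>
<property>
  <name>fs.s3a.block.size</name>
  <value>134217728</value>
</property>
</codeblock>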
    </conbody>
  </concept>

<concept id="s3_sql">
|
|
<title>How Impala SQL Statements Work with S3</title>
|
|
<conbody>
|
|
<p> Impala SQL statements work with data in S3 as follows: </p>
|
|
<ul>
|
|
<li>
|
|
<p> The <xref href="impala_create_table.xml#create_table">CREATE
|
|
TABLE</xref> or <xref href="impala_alter_table.xml#alter_table"
|
|
>ALTER TABLE</xref> statement can specify that a table resides in
|
|
the S3 object store by encoding an <codeph>s3a://</codeph> prefix
|
|
for the <codeph>LOCATION</codeph> property. <codeph>ALTER
|
|
TABLE</codeph> can also set the <codeph>LOCATION</codeph> property
|
|
for an individual partition so that some data in a table resides in
|
|
S3 and other data in the same table resides on HDFS. </p>
|
|
</li>
|
|
<li>
|
|
<p> Once a table or partition is designated as residing in S3, the
|
|
<xref href="impala_select.xml#select"/> statement transparently
|
|
accesses the data files from the appropriate storage layer. </p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
If the S3 table is an internal table, the <xref
|
|
href="impala_drop_table.xml#drop_table">DROP TABLE</xref> statement
|
|
removes the corresponding data files from S3 when the table is dropped.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p> The <xref href="impala_truncate_table.xml#truncate_table">TRUNCATE
|
|
TABLE</xref> statement always removes the corresponding
|
|
data files from S3 when the table is truncated. </p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
The <xref href="impala_load_data.xml#load_data">LOAD DATA</xref>
|
|
statement can move data files residing in HDFS into
|
|
an S3 table.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>
|
|
The <xref href="impala_insert.xml#insert">INSERT</xref> statement, or the <codeph>CREATE TABLE AS SELECT</codeph>
|
|
form of the <codeph>CREATE TABLE</codeph> statement, can copy data from an HDFS table or another S3
|
|
table into an S3 table. The <xref
|
|
href="impala_s3_skip_insert_staging.xml#s3_skip_insert_staging">S3_SKIP_INSERT_STAGING</xref>
|
|
query option chooses whether or not to use a fast code path for these write operations to S3,
|
|
with the tradeoff of potential inconsistency in the case of a failure during the statement.
|
|
</p>
|
|
</li>
|
|
</ul>
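      <p> The following sketch combines the statements described above, using
        hypothetical table, bucket, and path names. </p>
<codeblock>-- Table whose LOCATION points at an S3 path (hypothetical names).
CREATE TABLE sales_s3 (id BIGINT, amount DOUBLE)
  LOCATION 's3a://impala-demo/sales_s3';

-- Move staged HDFS data files into the S3 table.
LOAD DATA INPATH '/user/etl/staging/sales' INTO TABLE sales_s3;

-- Copy rows from an HDFS table into the S3 table.
INSERT INTO sales_s3 SELECT id, amount FROM sales_hdfs;</codeblock>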
      <p>
        For usage information about Impala SQL statements with S3 tables, see <xref href="impala_s3.xml#s3_ddl"/>
        and <xref href="impala_s3.xml#s3_dml"/>.
      </p>
    </conbody>
  </concept>

<concept id="s3_creds">
|
|
|
|
<title>Specifying Impala Credentials to Access Data in S3</title>
|
|
|
|
<conbody>
|
|
|
|
<p> To allow Impala to access data in S3, specify values for the following
|
|
configuration settings in your <filepath>core-site.xml</filepath> file: </p>
|
|
<codeblock>
|
|
<property>
|
|
<name>fs.s3a.access.key</name>
|
|
<value><varname>your_access_key</varname></value>
|
|
</property>
|
|
<property>
|
|
<name>fs.s3a.secret.key</name>
|
|
<value><varname>your_secret_key</varname></value>
|
|
</property>
|
|
</codeblock>
|
|
|
|
<p> After specifying the credentials, restart both the Impala and Hive
|
|
services. Restarting Hive is required because Impala statements, such as
|
|
<codeph>CREATE TABLE</codeph>, go through the Hive Metastore. </p>
|
|
|
|
<note type="important">
|
|
<p>
|
|
Although you can specify the access key ID and secret key as part of the <codeph>s3a://</codeph> URL in the
|
|
<codeph>LOCATION</codeph> attribute, doing so makes this sensitive information visible in many places, such
|
|
as <codeph>DESCRIBE FORMATTED</codeph> output and Impala log files. Therefore, specify this information
|
|
centrally in the <filepath>core-site.xml</filepath> file, and restrict read access to that file to only
|
|
trusted users.
|
|
</p>
|
|
</note>
|
|
<p>See <xref
|
|
href="https://www.google.com/url?q=https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html%23Authenticating_with_S3&sa=D&ust=1572980027740000&usg=AFQjCNFnzPSfNBMVRgJZRenvhLblezHbdw"
|
|
format="html" scope="external">Authenticating with S3</xref> for
|
|
additional authentication mechanisms to access S3.</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="s3_etl">
|
|
|
|
<title>Loading Data into S3 for Impala Queries</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="ETL"/>
|
|
<data name="Category" value="Ingest"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
If your ETL pipeline involves moving data into S3 and then querying through Impala,
|
|
you can either use Impala DML statements to create, move, or copy the data, or
|
|
use the same data loading techniques as you would for non-Impala data.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
<concept id="s3_dml" rev="2.6.0 IMPALA-1878">
|
|
<title>Using Impala DML Statements for S3 Data</title>
|
|
<conbody>
|
|
<p>The Impala DML statements (<codeph>INSERT</codeph>, <codeph>LOAD
|
|
DATA</codeph>, and <codeph>CREATE TABLE AS SELECT</codeph>) can
|
|
write data into a table or partition that resides in S3. The syntax of
|
|
the DML statements is the same as for any other tables because the S3
|
|
location for tables and partitions is specified by an
|
|
<codeph>s3a://</codeph> prefix in the <codeph>LOCATION</codeph>
|
|
attribute of <codeph>CREATE TABLE</codeph> or <codeph>ALTER
|
|
TABLE</codeph> statements. If you bring data into S3 using the
|
|
normal S3 transfer mechanisms instead of Impala DML statements, issue
|
|
a <codeph>REFRESH</codeph> statement for the table before using Impala
|
|
to query the S3 data.</p>
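        <p> As a minimal sketch with hypothetical table and bucket names, the
          following session creates an S3-backed table with <codeph>CREATE TABLE
            AS SELECT</codeph>, then refreshes it after additional files are
          placed in the same S3 path by an external tool. </p>
<codeblock>-- Create and populate an S3-backed table in one statement.
CREATE TABLE events_s3
  LOCATION 's3a://impala-demo/events'
  AS SELECT * FROM events_staging;

-- After copying more files into s3a://impala-demo/events with
-- S3 transfer tools, make Impala aware of the new data files.
REFRESH events_s3;</codeblock>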
        <p conref="../shared/impala_common.xml#common/s3_dml_performance"/>
      </conbody>
    </concept>

<concept id="s3_manual_etl">
|
|
<title>Manually Loading Data into Impala Tables in S3</title>
|
|
<conbody>
|
|
<p>
|
|
As an alternative, or on earlier Impala releases without DML support for S3,
|
|
you can use the Amazon-provided methods to bring data files into S3 for querying through Impala. See
|
|
<xref href="http://aws.amazon.com/s3/" scope="external" format="html">the Amazon S3 web site</xref> for
|
|
details.
|
|
</p>
|
|
|
|
<note type="important">
|
|
<p conref="../shared/impala_common.xml#common/s3_drop_table_purge"/>
|
|
</note>
|
|
|
|
<p> After you upload data files to a location already mapped to an
|
|
Impala table or partition, or if you delete files in S3 from such a
|
|
location, issue the <codeph>REFRESH</codeph> statement to make Impala
|
|
aware of the new set of data files. </p>
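        <p> For example, with a hypothetical partitioned table
          <codeph>logs_s3</codeph>, the statements might look like the following
          sketch; the partition-level form of <codeph>REFRESH</codeph> limits the
          metadata reload to the partition whose files changed. </p>
<codeblock>-- After uploading or deleting files under the table's S3 location:
REFRESH logs_s3;

-- Or, if only one partition's files changed:
REFRESH logs_s3 PARTITION (year=2018, month=1);</codeblock>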

      </conbody>
    </concept>

  </concept>

<concept id="s3_ddl">
|
|
|
|
<title>Creating Impala Databases, Tables, and Partitions for Data Stored in
|
|
S3</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Databases"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
<p>To create a table that resides in S3, run the <codeph>CREATE
|
|
TABLE</codeph> or <codeph>ALTER TABLE</codeph> statement with the
|
|
<codeph>LOCATION</codeph> clause. </p>
|
|
<p><codeph>ALTER TABLE</codeph> can set the <codeph>LOCATION</codeph>
|
|
property for an individual partition, so that some data in a table
|
|
resides in S3 and other data in the same table resides on HDFS.</p>
|
|
<p>The syntax for the <codeph>LOCATION</codeph> clause is:</p>
|
|
<codeblock>LOCATION 's3a://<varname>bucket_name</varname>/<varname>path</varname>/<varname>to</varname>/<varname>file</varname>'</codeblock>
|
|
<p>The file system prefix is always <codeph>s3a://</codeph>. Impala does
|
|
not support the <codeph>s3://</codeph> or <codeph>s3n://</codeph>
|
|
prefixes. </p>
|
|
<p> For a partitioned table, either specify a separate
|
|
<codeph>LOCATION</codeph> clause for each new partition, or specify a
|
|
base <codeph>LOCATION</codeph> for the table and set up a directory
|
|
structure in S3 to mirror the way Impala partitioned tables are
|
|
structured in S3. </p>
|
|
|
|
<p> You point a nonpartitioned table or an individual partition at S3 by
|
|
specifying a single directory path in S3, which could be any arbitrary
|
|
directory. To replicate the structure of an entire Impala partitioned
|
|
table or database in S3 requires more care, with directories and
|
|
subdirectories nested and named to match the equivalent directory tree
|
|
in HDFS. Consider setting up an empty staging area if necessary in HDFS,
|
|
and recording the complete directory structure so that you can replicate
|
|
it in S3. </p>
|
|
<p> When working with multiple tables with data files stored in S3, you can
|
|
create a database with a <codeph>LOCATION</codeph> attribute pointing to
|
|
an S3 path. Specify a URL of the form
|
|
<codeph>s3a://<varname>bucket</varname>/<varname>root</varname>/<varname>path</varname>/<varname>for</varname>/<varname>database</varname></codeph>
|
|
for the <codeph>LOCATION</codeph> attribute of the database. Any tables
|
|
created inside that database automatically create directories underneath
|
|
the one specified by the database <codeph>LOCATION</codeph> attribute. </p>
|
|
<p>The following example creates a table with one partition for the year
|
|
2017 resides on HDFS and one partition for the year 2018 resides in
|
|
S3.</p>
|
|
|
|
<p>The partition for year 2018 includes a <codeph>LOCATION</codeph>
|
|
attribute with an <codeph>s3a://</codeph> URL, and so refers to data
|
|
residing in S3, under a specific path underneath the bucket
|
|
<codeph>impala-demo</codeph>. </p>
|
|
|
|
<codeblock>CREATE TABLE mostly_on_hdfs (x int) PARTITIONED BY (year INT);
|
|
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2017);
|
|
ALTER TABLE mostly_on_hdfs ADD PARTITION (year=2018)
|
|
LOCATION 's3a://impala-demo/dir1/dir2/dir3/t1';
|
|
</codeblock>
|
|
|
|
      <p> The following session creates a database and a partitioned table
        residing entirely in S3, partitioned by multiple columns. </p>
      <ul>
        <li>Because a <codeph>LOCATION</codeph> attribute with an
          <codeph>s3a://</codeph> URL is specified for the database, the
          tables inside that database are automatically created in S3 underneath
          the database directory. </li>
        <li>To see the names of the associated subdirectories, including the
          partition key values, use an S3 client tool to examine how the
          directory structure is organized in S3. </li>
      </ul>

<codeblock>CREATE DATABASE db_on_s3 LOCATION 's3a://impala-demo/dir1/dir2/dir3';
CREATE TABLE partitioned_multiple_keys (x INT)
  PARTITIONED BY (year SMALLINT, month TINYINT, day TINYINT);

ALTER TABLE partitioned_multiple_keys
  ADD PARTITION (year=2015,month=1,day=1);
ALTER TABLE partitioned_multiple_keys
  ADD PARTITION (year=2015,month=1,day=31);

!hdfs dfs -ls -R s3a://impala-demo/dir1/dir2/dir3
2015-03-17 13:56:34          0 dir1/dir2/dir3/
2015-03-17 16:47:13          0 dir1/dir2/dir3/partitioned_multiple_keys/
2015-03-17 16:47:44          0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/
2015-03-17 16:47:50          0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/</codeblock>

      <p>
        The <codeph>CREATE DATABASE</codeph> and <codeph>CREATE TABLE</codeph> statements create the associated
        directory paths if they do not already exist. You can specify multiple levels of directories, and the
        <codeph>CREATE</codeph> statement creates all appropriate levels, similar to using <codeph>mkdir
          -p</codeph>.
      </p>

      <p> Use the standard S3 file upload methods to put the actual data files
        into the right locations. You can also put the directory paths and data
        files in place before creating the associated Impala databases or
        tables, and Impala automatically uses the data from the appropriate
        location after the associated databases and tables are created. </p>
      <p>Use the <codeph>ALTER TABLE</codeph> statement with the
        <codeph>LOCATION</codeph> clause to switch whether an existing table
        or partition points to data in HDFS or S3. For example, if you have an
        Impala table or partition pointing to data files in HDFS or S3, and you
        later transfer those data files to the other filesystem, use the
        <codeph>ALTER TABLE</codeph> statement to adjust the
        <codeph>LOCATION</codeph> attribute of the corresponding table or
        partition to reflect that change. </p>
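      <p> As a sketch with hypothetical table, bucket, and cluster names,
        switching a table and one of its partitions between S3 and HDFS
        locations looks like the following. </p>
<codeblock>-- Point an existing table at data that was moved to S3.
ALTER TABLE sales SET LOCATION 's3a://impala-demo/warehouse/sales';

-- Point a single partition back at data on HDFS.
ALTER TABLE sales PARTITION (year=2017)
  SET LOCATION 'hdfs://nameservice1/warehouse/sales/year=2017';</codeblock>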

    </conbody>

  </concept>

<concept id="s3_internal_external">
|
|
|
|
<title>Internal and External Tables Located in S3</title>
|
|
|
|
<conbody>
|
|
|
|
<p> Just as with tables located on HDFS storage, you can designate
|
|
S3-based tables as either internal (managed by Impala) or external, by
|
|
using the syntax <codeph>CREATE TABLE</codeph> or <codeph>CREATE
|
|
EXTERNAL TABLE</codeph> respectively. </p>
|
|
<p>When you drop an internal table, the files associated with the table
|
|
are removed, even if they are in S3 storage. When you drop an external
|
|
table, the files associated with the table are left alone, and are still
|
|
available for access by other tools or components.</p>
|
|
|
|
<p> If the data in S3 is intended to be long-lived and accessed by other
|
|
tools in addition to Impala, create any associated S3 tables with the
|
|
<codeph>CREATE EXTERNAL TABLE</codeph> syntax, so that the files are
|
|
not deleted from S3 when the table is dropped. </p>
|
|
|
|
<p> If the data in S3 is only needed for querying by Impala and can be
|
|
safely discarded once the Impala workflow is complete, create the
|
|
associated S3 tables using the <codeph>CREATE TABLE</codeph> syntax, so
|
|
that dropping the table also deletes the corresponding data files in S3. </p>
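      <p> A minimal sketch with hypothetical names: the external table below
        leaves its S3 files in place when dropped, while the internal table's
        files are removed along with the table. </p>
<codeblock>-- Files under this path survive DROP TABLE.
CREATE EXTERNAL TABLE shared_events (id BIGINT, payload STRING)
  LOCATION 's3a://impala-demo/shared/events';

-- Files under this path are deleted when the table is dropped.
CREATE TABLE scratch_events (id BIGINT, payload STRING)
  LOCATION 's3a://impala-demo/scratch/events';
DROP TABLE scratch_events;</codeblock>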

    </conbody>

  </concept>

<concept id="s3_queries">
|
|
|
|
<title>Running and Tuning Impala Queries for Data Stored in S3</title>
|
|
|
|
<conbody>
|
|
<p> Once a table or partition is designated as residing in S3, the
|
|
<codeph>SELECT</codeph> statement transparently accesses the data
|
|
files from the appropriate storage layer. </p>
|
|
|
|
<ul>
|
|
<li>
|
|
Queries against S3 data support all the same file formats as for HDFS data.
|
|
</li>
|
|
|
|
<li>
|
|
Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in S3
|
|
corresponding to the HDFS directories representing partition key values, or use <codeph>ALTER TABLE ...
|
|
ADD PARTITION</codeph> to set up the appropriate paths in S3.
|
|
</li>
|
|
|
|
<li>
|
|
HDFS and HBase tables can be joined to S3 tables, or S3 tables can be joined with each other.
|
|
</li>
|
|
|
|
<li> Authorization to control access to databases, tables, or columns
|
|
works the same whether the data is in HDFS or in S3. </li>
|
|
<li> The Catalog Server (<cmdname>catalogd</cmdname>) daemon caches
|
|
metadata for both HDFS and S3 tables.</li>
|
|
|
|
<li>
|
|
Queries against S3 tables are subject to the same kinds of admission control and resource management as
|
|
HDFS tables.
|
|
</li>
|
|
|
|
<li> Metadata about S3 tables is stored in the same Metastore database
|
|
as for HDFS tables. </li>
|
|
|
|
<li>
|
|
You can set up views referring to S3 tables, the same as for HDFS tables.
|
|
</li>
|
|
|
|
<li> The <codeph>COMPUTE STATS</codeph>, <codeph>SHOW TABLE
|
|
STATS</codeph>, and <codeph>SHOW COLUMN STATS</codeph> statements
|
|
work for S3 tables. </li>
|
|
</ul>
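      <p> For example, assuming a hypothetical S3-backed table
        <codeph>sales_s3</codeph> joined to an HDFS-resident dimension table,
        the statements below gather and display statistics and then run a join;
        the syntax is identical to the HDFS-only case. </p>
<codeblock>COMPUTE STATS sales_s3;
SHOW TABLE STATS sales_s3;
SHOW COLUMN STATS sales_s3;

SELECT c.region, SUM(s.amount)
  FROM sales_s3 s JOIN customers_hdfs c ON (s.customer_id = c.id)
  GROUP BY c.region;</codeblock>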

    </conbody>

<concept id="s3_performance">
|
|
|
|
<title>Understanding and Tuning Impala Query Performance for S3 Data</title>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Performance"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>Here are techniques you can use to interpret explain plans and
|
|
profiles for queries against S3 data, and tips to achieve the best
|
|
performance possible for such queries. </p>
|
|
|
|
<p> All else being equal, performance is expected to be lower for
|
|
queries running against data in S3 rather than HDFS. The actual
|
|
mechanics of the <codeph>SELECT</codeph> statement are somewhat
|
|
different when the data is in S3. Although the work is still
|
|
distributed across the DataNodes of the cluster, Impala might
|
|
parallelize the work for a distributed query differently for data on
|
|
HDFS and S3.</p>
|
|
<p>S3 does not have the same block notion as HDFS, so Impala uses
|
|
heuristics to determine how to split up large S3 files for processing
|
|
in parallel. Because all hosts can access any S3 data file with equal
|
|
efficiency, the distribution of work might be different than for HDFS
|
|
data, where the data blocks are physically read using short-circuit
|
|
local reads by hosts that contain the appropriate block replicas.
|
|
Although the I/O to read the S3 data might be spread evenly across the
|
|
hosts of the cluster, the fact that all data is initially retrieved
|
|
across the network means that the overall query performance is likely
|
|
to be lower for S3 data than for HDFS data. </p>
|
|
<p>Use the <codeph>PARQUET_OBJECT_STORE_SPLIT_SIZE</codeph> query option
|
|
to control the Parquet-specific split size. The default value is 256
|
|
MB.</p>
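        <p> As a sketch, the option can be adjusted per session from
          <cmdname>impala-shell</cmdname>; the value shown assumes the option is
          specified in bytes, so 134217728 corresponds to 128 MB and the
          default of 268435456 to 256 MB. </p>
<codeblock>-- Use 128 MB splits for Parquet files on object stores such as S3.
SET PARQUET_OBJECT_STORE_SPLIT_SIZE=134217728;</codeblock>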

        <p> When optimizing aspects of complex queries, such as the join order,
          Impala treats tables on HDFS and S3 the same way. Therefore, follow
          all the same tuning recommendations for S3 tables as for HDFS ones,
          such as using the <codeph>COMPUTE STATS</codeph> statement to help
          Impala construct accurate estimates of row counts and cardinality. See
          <xref href="impala_performance.xml#performance"/> for details. </p>

        <p> In query profile reports, the numbers for
          <codeph>BytesReadLocal</codeph>,
          <codeph>BytesReadShortCircuit</codeph>,
          <codeph>BytesReadDataNodeCached</codeph>, and
          <codeph>BytesReadRemoteUnexpected</codeph> are blank because those
          metrics come from HDFS. By definition, all the I/O for S3 tables
          involves remote reads. </p>

      </conbody>

    </concept>

  </concept>

<concept id="s3_restrictions">
|
|
|
|
<title>Restrictions on Impala Support for S3</title>
|
|
|
|
<conbody>
|
|
|
|
<p>The following restrictions apply when using Impala with S3:</p>
|
|
<ul>
|
|
<li> Impala does not support the old <codeph>s3://</codeph> block-based
|
|
and <codeph>s3n://</codeph> filesystem schemes, and it only supports
|
|
<codeph>s3a://</codeph>. </li>
|
|
<li>Although S3 is often used to store JSON-formatted data, the current
|
|
Impala support for S3 does not include directly querying JSON data.
|
|
For Impala queries, use data files in one of the file formats listed
|
|
in <xref href="impala_file_formats.xml#file_formats"/>. If you have
|
|
data in JSON format, you can prepare a flattened version of that data
|
|
for querying by Impala as part of your ETL cycle. </li>
|
|
<li>You cannot use the <codeph>ALTER TABLE ... SET CACHED</codeph>
|
|
statement for tables or partitions that are located in S3. </li>
|
|
</ul>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
|
|
</concept>
|