mirror of
https://github.com/apache/impala.git
synced 2025-12-30 03:01:44 -05:00
This now gives a clean RAT check with bin/check-rat-report.py, which is one way for the Impala community to check compliance with ASF rules on intellectual property. Change-Id: I2ad06435f84a65ba126759e42a18fdaf52cd7036 Reviewed-on: http://gerrit.cloudera.org:8080/5232 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins Reviewed-by: John Russell <jrussell@cloudera.com>
343 lines
14 KiB
XML
343 lines
14 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="refresh">
|
|
|
|
<title>REFRESH Statement</title>
|
|
<titlealts audience="PDF"><navtitle>REFRESH</navtitle></titlealts>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="SQL"/>
|
|
<data name="Category" value="DDL"/>
|
|
<data name="Category" value="Tables"/>
|
|
<data name="Category" value="Hive"/>
|
|
<data name="Category" value="Metastore"/>
|
|
<data name="Category" value="ETL"/>
|
|
<data name="Category" value="Ingest"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
<indexterm audience="Cloudera">REFRESH statement</indexterm>
|
|
To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are
|
|
connected through <cmdname>impala-shell</cmdname>, JDBC, or ODBC) must have current metadata about those
|
|
databases and tables that are referenced in Impala queries. If you are not familiar with the way Impala uses
|
|
metadata and how it shares the same metastore database as Hive, see
|
|
<xref href="impala_hadoop.xml#intro_metastore"/> for background information.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock rev="IMPALA-1683 CDH-43732">REFRESH [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>key_col1</varname>=<varname>val1</varname> [, <varname>key_col2</varname>=<varname>val2</varname>...])]</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
Use the <codeph>REFRESH</codeph> statement to load the latest metastore metadata and block location data for
|
|
a particular table in these scenarios:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL
|
|
pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why
|
|
metadata needs to be refreshed.)
|
|
</li>
|
|
|
|
<li>
|
|
After issuing <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, or other
|
|
table-modifying SQL statement in Hive.
|
|
</li>
|
|
</ul>
|
|
|
|
<note rev="2.3.0">
|
|
<p rev="2.3.0">
|
|
In <keyword keyref="impala23_full"/> and higher, the syntax <codeph>ALTER TABLE <varname>table_name</varname> RECOVER PARTITIONS</codeph>
|
|
is a faster alternative to <codeph>REFRESH</codeph> when the only change to the table data is the addition of
|
|
new partition directories through Hive or manual HDFS operations.
|
|
See <xref href="impala_alter_table.xml#alter_table"/> for details.
|
|
</p>
|
|
</note>
|
|
|
|
<p>
|
|
You only need to issue the <codeph>REFRESH</codeph> statement on the node to which you connect to issue
|
|
queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read
|
|
requests for the correct HDFS blocks without relying on the metadata on the other nodes.
|
|
</p>
|
|
|
|
<p>
|
|
<codeph>REFRESH</codeph> reloads the metadata for the table from the metastore database, and does an
|
|
incremental reload of the low-level block location data to account for any new data files added to the HDFS
|
|
data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common
|
|
scenario where new data files are added to HDFS.
|
|
</p>
|
|
|
|
<p>
|
|
Only the metadata for the specified table is flushed. The table must already exist and be known to Impala,
|
|
either because the <codeph>CREATE TABLE</codeph> statement was run in Impala rather than Hive, or because a
|
|
previous <codeph>INVALIDATE METADATA</codeph> statement caused Impala to reload its entire metadata catalog.
|
|
</p>
|
|
|
|
<note>
|
|
<p rev="1.2">
|
|
The catalog service broadcasts any changed metadata as a result of Impala
|
|
<codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph> statements to all
|
|
Impala nodes. Thus, the <codeph>REFRESH</codeph> statement is only required if you load data through Hive
|
|
or by manipulating data files in HDFS directly. See <xref href="impala_components.xml#intro_catalogd"/> for
|
|
more information on the catalog service.
|
|
</p>
|
|
<p rev="1.2.1">
|
|
Another way to avoid inconsistency across nodes is to enable the
|
|
<codeph>SYNC_DDL</codeph> query option before performing a DDL statement or an <codeph>INSERT</codeph> or
|
|
<codeph>LOAD DATA</codeph>.
|
|
</p>
|
|
<p rev="1.1">
|
|
The table name is a required parameter. To flush the metadata for all tables, use the
|
|
<codeph><xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE METADATA</xref></codeph>
|
|
command.
|
|
</p>
|
|
<p conref="../shared/impala_common.xml#common/invalidate_then_refresh"/>
|
|
</note>
|
|
|
|
<p conref="../shared/impala_common.xml#common/refresh_vs_invalidate"/>
|
|
|
|
<p>
|
|
A metadata update for an <codeph>impalad</codeph> instance <b>is</b> required if:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
A metadata change occurs.
|
|
</li>
|
|
|
|
<li>
|
|
<b>and</b> the change is made through Hive.
|
|
</li>
|
|
|
|
<li>
|
|
<b>and</b> the change is made to a metastore database to which clients such as the Impala shell or ODBC directly
|
|
connect.
|
|
</li>
|
|
</ul>
|
|
|
|
<p rev="1.2">
|
|
A metadata update for an Impala node is <b>not</b> required after you run <codeph>ALTER TABLE</codeph>,
|
|
<codeph>INSERT</codeph>, or other table-modifying statement in Impala rather than Hive. Impala handles the
|
|
metadata synchronization automatically through the catalog service.
|
|
</p>
|
|
|
|
<p>
|
|
Database and table metadata is typically modified by:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
Hive - through <codeph>ALTER</codeph>, <codeph>CREATE</codeph>, <codeph>DROP</codeph> or
|
|
<codeph>INSERT</codeph> operations.
|
|
</li>
|
|
|
|
<li>
|
|
Impalad - through <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>INSERT</codeph>
|
|
operations. <ph rev="1.2">Such changes are propagated to all Impala nodes by the
|
|
Impala catalog service.</ph>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
<codeph>REFRESH</codeph> causes the metadata for that table to be immediately reloaded. For a huge table,
|
|
that process could take a noticeable amount of time; but doing the refresh up front avoids an unpredictable
|
|
delay later, for example if the next reference to the table is during a benchmark test.
|
|
</p>
|
|
|
|
<p rev="IMPALA-1683 CDH-43732">
|
|
<b>Refreshing a single partition:</b>
|
|
</p>
|
|
|
|
<p rev="IMPALA-1683 CDH-43732">
|
|
In <keyword keyref="impala27_full"/> and higher, the <codeph>REFRESH</codeph> statement can apply to a single partition at a time,
|
|
rather than the whole table. Include the optional <codeph>PARTITION (<varname>partition_spec</varname>)</codeph>
|
|
clause and specify values for each of the partition key columns.
|
|
</p>
|
|
|
|
<p rev="IMPALA-1683 CDH-43732">
|
|
The following examples show how to make Impala aware of data added to a single partition, after data is loaded into
|
|
a partition's data directory using some mechanism outside Impala, such as Hive or Spark. The partition can be one that
|
|
Impala created and is already aware of, or a new partition created through Hive.
|
|
</p>
|
|
|
|
<codeblock rev="IMPALA-1683 CDH-43732"><![CDATA[
|
|
impala> create table p (x int) partitioned by (y int);
|
|
impala> insert into p (x,y) values (1,2), (2,2), (2,1);
|
|
impala> show partitions p;
|
|
+-------+-------+--------+------+...
|
|
| y | #Rows | #Files | Size |...
|
|
+-------+-------+--------+------+...
|
|
| 1 | -1 | 1 | 2B |...
|
|
| 2 | -1 | 1 | 4B |...
|
|
| Total | -1 | 2 | 6B |...
|
|
+-------+-------+--------+------+...
|
|
|
|
-- ... Data is inserted into one of the partitions by some external mechanism ...
|
|
beeline> insert into p partition (y = 1) values(1000);
|
|
|
|
impala> refresh p partition (y=1);
|
|
impala> select x from p where y=1;
|
|
+------+
|
|
| x |
|
|
+------+
|
|
| 2 | <- Original data created by Impala
|
|
| 1000 | <- Additional data inserted through Beeline
|
|
+------+
|
|
]]>
|
|
</codeblock>
|
|
|
|
<p rev="IMPALA-1683 CDH-43732">
|
|
The same applies for tables with more than one partition key column.
|
|
The <codeph>PARTITION</codeph> clause of the <codeph>REFRESH</codeph>
|
|
statement must include all the partition key columns.
|
|
</p>
|
|
|
|
<codeblock rev="IMPALA-1683 CDH-43732"><![CDATA[
|
|
impala> create table p2 (x int) partitioned by (y int, z int);
|
|
impala> insert into p2 (x,y,z) values (0,0,0), (1,2,3), (2,2,3);
|
|
impala> show partitions p2;
|
|
+-------+---+-------+--------+------+...
|
|
| y | z | #Rows | #Files | Size |...
|
|
+-------+---+-------+--------+------+...
|
|
| 0 | 0 | -1 | 1 | 2B |...
|
|
| 2 | 3 | -1 | 1 | 4B |...
|
|
| Total | | -1 | 2 | 6B |...
|
|
+-------+---+-------+--------+------+...
|
|
|
|
-- ... Data is inserted into one of the partitions by some external mechanism ...
|
|
beeline> insert into p2 partition (y = 2, z = 3) values(1000);
|
|
|
|
impala> refresh p2 partition (y=2, z=3);
|
|
impala> select x from p where y=2 and z = 3;
|
|
+------+
|
|
| x |
|
|
+------+
|
|
| 1 | <- Original data created by Impala
|
|
| 2 | <- Original data created by Impala
|
|
| 1000 | <- Additional data inserted through Beeline
|
|
+------+
|
|
]]>
|
|
</codeblock>
|
|
|
|
<p rev="IMPALA-1683 CDH-43732">
|
|
The following examples show how specifying a nonexistent partition does not cause any error,
|
|
and the order of the partition key columns does not have to match the column order in the table.
|
|
The partition spec must include all the partition key columns; specifying an incomplete set of
|
|
columns does cause an error.
|
|
</p>
|
|
|
|
<codeblock rev="IMPALA-1683 CDH-43732"><![CDATA[
|
|
-- Partition doesn't exist.
|
|
refresh p2 partition (y=0, z=3);
|
|
refresh p2 partition (y=0, z=-1)
|
|
-- Key columns specified in a different order than the table definition.
|
|
refresh p2 partition (z=1, y=0)
|
|
-- Incomplete partition spec causes an error.
|
|
refresh p2 partition (y=0)
|
|
ERROR: AnalysisException: Items in partition spec must exactly match the partition columns in the table definition: default.p2 (1 vs 2)
|
|
]]>
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example shows how you might use the <codeph>REFRESH</codeph> statement after manually adding
|
|
new HDFS data files to the Impala data directory for a table:
|
|
</p>
|
|
|
|
<codeblock>[impalad-host:21000] > refresh t1;
|
|
[impalad-host:21000] > refresh t2;
|
|
[impalad-host:21000] > select * from t1;
|
|
...
|
|
[impalad-host:21000] > select * from t2;
|
|
... </codeblock>
|
|
|
|
<p>
|
|
For more examples of using <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> with a
|
|
combination of Impala and Hive operations, see <xref href="impala_tutorial.xml#tutorial_impala_hive"/>.
|
|
</p>
|
|
|
|
<p>
|
|
<b>Related impala-shell options:</b>
|
|
</p>
|
|
|
|
<p rev="1.1">
|
|
The <cmdname>impala-shell</cmdname> option <codeph>-r</codeph> issues an <codeph>INVALIDATE METADATA</codeph> statement
|
|
when starting up the shell, effectively performing a <codeph>REFRESH</codeph> of all tables.
|
|
Due to the expense of reloading the metadata for all tables, the <cmdname>impala-shell</cmdname> <codeph>-r</codeph>
|
|
option is not recommended for day-to-day use in a production environment. (This option was mainly intended as a workaround
|
|
for synchronization issues in very old Impala versions.)
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
|
|
<p rev="CDH-19187">
|
|
The user ID that the <cmdname>impalad</cmdname> daemon runs under,
|
|
typically the <codeph>impala</codeph> user, must have execute
|
|
permissions for all the relevant directories holding table data.
|
|
(A table could have data spread across multiple directories,
|
|
or in unexpected paths, if it uses partitioning or
|
|
specifies a <codeph>LOCATION</codeph> attribute for
|
|
individual partitions or the entire table.)
|
|
Issues with permissions might not cause an immediate error for this statement,
|
|
but subsequent statements such as <codeph>SELECT</codeph>
|
|
or <codeph>SHOW TABLE STATS</codeph> could fail.
|
|
</p>
|
|
<p rev="IMPALA-1683 CDH-43732">
|
|
All HDFS and Sentry permissions and privileges are the same whether you refresh the entire table
|
|
or a single partition.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hdfs_blurb"/>
|
|
|
|
<p>
|
|
The <codeph>REFRESH</codeph> command checks HDFS permissions of the underlying data files and directories,
|
|
caching this information so that a statement can be cancelled immediately if for example the
|
|
<codeph>impala</codeph> user does not have permission to write to the data directory for the table. Impala
|
|
reports any lack of write permissions as an <codeph>INFO</codeph> message in the log file, in case that
|
|
represents an oversight. If you change HDFS permissions to make data readable or writeable by the Impala
|
|
user, issue another <codeph>REFRESH</codeph> to make Impala aware of the change.
|
|
</p>
|
|
|
|
<note conref="../shared/impala_common.xml#common/compute_stats_next"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
|
|
<p conref="../shared/impala_common.xml#common/s3_metadata"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
<p>
|
|
<xref href="impala_hadoop.xml#intro_metastore"/>,
|
|
<xref href="impala_invalidate_metadata.xml#invalidate_metadata"/>
|
|
</p>
|
|
</conbody>
|
|
</concept>
|