<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="load_data">

  <title>LOAD DATA Statement</title>
  <titlealts audience="PDF"><navtitle>LOAD DATA</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="ETL"/>
      <data name="Category" value="Ingest"/>
      <data name="Category" value="DML"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="HDFS"/>
      <data name="Category" value="Tables"/>
      <data name="Category" value="S3"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      The <codeph>LOAD DATA</codeph> statement streamlines the ETL process for
      an internal Impala table by moving a data file or all the data files in a
      directory from an HDFS location into the Impala data directory for that
      table.
    </p>

    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>LOAD DATA INPATH '<varname>hdfs_file_or_directory_path</varname>' [OVERWRITE] INTO TABLE <varname>tablename</varname>
  [PARTITION (<varname>partcol1</varname>=<varname>val1</varname>, <varname>partcol2</varname>=<varname>val2</varname> ...)]</codeblock>

    <p>
      When the <codeph>LOAD DATA</codeph> statement operates on a partitioned table,
      it always operates on one partition at a time. Specify the <codeph>PARTITION</codeph>
      clause and list all the partition key columns, with a constant value specified for each.
    </p>
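
    <p>
      For example, the following sketch loads one directory of files into a single
      partition of a hypothetical table <codeph>sales</codeph> partitioned by
      <codeph>year</codeph> and <codeph>month</codeph>. The table and path names
      are illustrative only:
    </p>

<codeblock>-- Hypothetical partitioned table, for illustration only.
CREATE TABLE sales (id BIGINT, amount DECIMAL(9,2))
  PARTITIONED BY (year INT, month INT);

-- Name every partition key column, each with a constant value.
LOAD DATA INPATH '/user/doc_demo/sales_2023_06'
  INTO TABLE sales PARTITION (year=2023, month=6);</codeblock>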

    <p conref="../shared/impala_common.xml#common/dml_blurb"/>

    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

    <ul>
      <li>
        The loaded data files are moved, not copied, into the Impala data directory.
      </li>

      <li>
        You can specify the HDFS path of a single file to be moved, or the HDFS path of a
        directory to move all the files inside that directory. You cannot specify any sort
        of wildcard to take only some of the files from a directory. When loading a
        directory full of data files, keep all the data files at the top level, with no
        nested directories underneath.
      </li>

      <li>
        Currently, the Impala <codeph>LOAD DATA</codeph> statement only imports files from
        HDFS, not from the local filesystem. It does not support the <codeph>LOCAL</codeph>
        keyword of the Hive <codeph>LOAD DATA</codeph> statement. You must specify a path,
        not an <codeph>hdfs://</codeph> URI.
      </li>

      <li>
        In the interest of speed, only limited error checking is done. If the loaded files
        have the wrong file format, different columns than the destination table, or some
        other kind of mismatch, Impala does not raise any error for the
        <codeph>LOAD DATA</codeph> statement. Querying the table afterward could produce a
        runtime error or unexpected results. Currently, the only checking the
        <codeph>LOAD DATA</codeph> statement does is to avoid mixing together uncompressed
        and LZO-compressed text files in the same table.
      </li>

      <li>
        When you specify an HDFS directory name as the <codeph>LOAD DATA</codeph> argument,
        any hidden files in that directory (files whose names start with a
        <codeph>.</codeph>) are not moved to the Impala data directory.
      </li>

      <li rev="2.5.0 IMPALA-2867">
        The operation fails if the source directory contains any non-hidden directories.
        Prior to <keyword keyref="impala25_full"/>, if the source directory contained any
        subdirectory, even a hidden one such as <filepath>_impala_insert_staging</filepath>,
        the <codeph>LOAD DATA</codeph> statement would fail. In
        <keyword keyref="impala25_full"/> and higher, <codeph>LOAD DATA</codeph> ignores
        hidden subdirectories in the source directory, and only fails if any of the
        subdirectories are non-hidden.
      </li>

      <li>
        The loaded data files retain their original names in the new location, unless a
        name conflicts with an existing data file, in which case the name of the new file
        is modified slightly to be unique. (The name-mangling is a slight difference from
        the Hive <codeph>LOAD DATA</codeph> statement, which replaces identically named
        files.)
      </li>

      <li>
        By providing an easy way to transport files from known locations in HDFS into the
        Impala data directory structure, the <codeph>LOAD DATA</codeph> statement lets you
        avoid memorizing the locations and layout of the HDFS directory tree containing the
        Impala databases and tables. (For a quick way to check the location of the data
        files for an Impala table, issue the statement
        <codeph>DESCRIBE FORMATTED <varname>table_name</varname></codeph>, as shown in the
        example after this list.)
      </li>

      <li>
        The <codeph>PARTITION</codeph> clause is especially convenient for ingesting new
        data for a partitioned table. As you receive new data for a time period, geographic
        region, or other division that corresponds to one or more partitioning columns, you
        can load that data straight into the appropriate Impala data directory, which might
        be nested several levels down if the table is partitioned by multiple columns. When
        the table is partitioned, you must specify constant values for all the partitioning
        columns.
      </li>
    </ul>
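
    <p>
      As a minimal sketch, checking the data directory for the <codeph>t1</codeph> table
      used in the examples later in this topic (the <codeph>Location:</codeph> row of the
      output shows the HDFS directory that <codeph>LOAD DATA</codeph> moves files into):
    </p>

<codeblock>[localhost:21000] > describe formatted t1;</codeblock>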

    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

    <p rev="2.3.0">
      Because Impala currently cannot create Parquet data files containing complex types
      (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>), the
      <codeph>LOAD DATA</codeph> statement is especially important when working with
      tables containing complex type columns. You create the Parquet data files outside
      Impala, then use either <codeph>LOAD DATA</codeph>, an external table, or HDFS-level
      file operations followed by <codeph>REFRESH</codeph> to associate the data files
      with the corresponding table. See <xref href="impala_complex_types.xml#complex_types"/>
      for details about using complex types.
    </p>
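
    <p>
      A minimal sketch of that workflow, assuming a Parquet file with a matching schema
      was produced by another engine such as Hive or Spark and staged at a made-up HDFS
      path:
    </p>

<codeblock>-- Table with a complex type column; Impala can query it but cannot write it.
CREATE TABLE contacts (id BIGINT, phones ARRAY&lt;STRING&gt;) STORED AS PARQUET;

-- Move the externally created Parquet file under the table's data directory.
LOAD DATA INPATH '/user/doc_demo/staging/contacts.parquet' INTO TABLE contacts;</codeblock>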

    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

    <note conref="../shared/impala_common.xml#common/compute_stats_next"/>

    <p conref="../shared/impala_common.xml#common/example_blurb"/>

    <p>
      First, we use a trivial Python script to write different numbers of strings (one per
      line) into files stored in the <codeph>doc_demo</codeph> HDFS user account.
      (Substitute the path for your own HDFS user account when doing
      <cmdname>hdfs dfs</cmdname> operations like these.)
    </p>

<codeblock>$ random_strings.py 1000 | hdfs dfs -put - /user/doc_demo/thousand_strings.txt
$ random_strings.py 100 | hdfs dfs -put - /user/doc_demo/hundred_strings.txt
$ random_strings.py 10 | hdfs dfs -put - /user/doc_demo/ten_strings.txt</codeblock>

    <p>
      Next, we create a table and load an initial set of data into it. Remember, unless you
      specify a <codeph>STORED AS</codeph> clause, Impala tables default to
      <codeph>TEXTFILE</codeph> format with Ctrl-A (hex 01) as the field delimiter. This
      example uses a single-column table, so the delimiter is not significant. For
      large-scale ETL jobs, you would typically use binary format data files such as
      Parquet or Avro, and load them into Impala tables that use the corresponding file
      format.
    </p>

<codeblock>[localhost:21000] > create table t1 (s string);
[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.61s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0  |
+------+
| 1000 |
+------+
Returned 1 row(s) in 0.67s
[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1;
ERROR: AnalysisException: INPATH location '/user/doc_demo/thousand_strings.txt' does not exist.</codeblock>

    <p>
      As indicated by the message at the end of the previous example, the data file was
      moved from its original location. The following example illustrates how the data
      file was moved into the Impala data directory for the destination table, keeping its
      original filename:
    </p>

<codeblock>$ hdfs dfs -ls /user/hive/warehouse/load_data_testing.db/t1
Found 1 items
-rw-r--r--   1 doc_demo doc_demo      13926 2013-06-26 15:40 /user/hive/warehouse/load_data_testing.db/t1/thousand_strings.txt</codeblock>

    <p>
      The following example demonstrates the difference between the
      <codeph>INTO TABLE</codeph> and <codeph>OVERWRITE INTO TABLE</codeph> clauses. The
      table already contains 1000 rows. After issuing the <codeph>LOAD DATA</codeph>
      statement with the <codeph>INTO TABLE</codeph> clause, the table contains 100 more
      rows, for a total of 1100. After issuing the <codeph>LOAD DATA</codeph> statement
      with the <codeph>OVERWRITE INTO TABLE</codeph> clause, the former contents are gone,
      and now the table only contains the 10 rows from the just-loaded data file.
    </p>

<codeblock>[localhost:21000] > load data inpath '/user/doc_demo/hundred_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.24s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0  |
+------+
| 1100 |
+------+
Returned 1 row(s) in 0.55s
[localhost:21000] > load data inpath '/user/doc_demo/ten_strings.txt' overwrite into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.26s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+-----+
| _c0 |
+-----+
| 10  |
+-----+
Returned 1 row(s) in 0.62s</codeblock>

    <p conref="../shared/impala_common.xml#common/s3_blurb"/>
    <p conref="../shared/impala_common.xml#common/s3_dml"/>
    <p conref="../shared/impala_common.xml#common/s3_dml_performance"/>
    <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p>

    <p conref="../shared/impala_common.xml#common/adls_blurb"/>
    <p conref="../shared/impala_common.xml#common/adls_dml"
       conrefend="../shared/impala_common.xml#common/adls_dml_end"/>
    <p>See <xref href="../topics/impala_adls.xml#adls"/> for details about reading and writing ADLS data with Impala.</p>

    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

    <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
    <p>
      The user ID that the <cmdname>impalad</cmdname> daemon runs under, typically the
      <codeph>impala</codeph> user, must have read and write permissions for the files in
      the source directory, and write permission for the destination directory.
    </p>
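
    <p>
      A quick sketch of how you might verify those permissions from the command line,
      reusing example paths from earlier in this topic:
    </p>

<codeblock>$ hdfs dfs -ls /user/doc_demo/thousand_strings.txt
$ hdfs dfs -ls -d /user/hive/warehouse/load_data_testing.db/t1
# If needed, grant the impala user ownership of the source file, for example:
$ hdfs dfs -chown impala /user/doc_demo/thousand_strings.txt</codeblock>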

    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
    <p conref="../shared/impala_common.xml#common/kudu_no_load_data"/>

    <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
    <p conref="../shared/impala_common.xml#common/hbase_no_load_data"/>

    <p conref="../shared/impala_common.xml#common/iceberg_blurb"/>
    <p conref="../shared/impala_common.xml#common/iceberg_load_data"/>

    <p conref="../shared/impala_common.xml#common/related_info"/>
    <p>
      The <codeph>LOAD DATA</codeph> statement is an alternative to the
      <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> statement. Use
      <codeph>LOAD DATA</codeph> when you have the data files in HDFS but outside of any
      Impala table.
    </p>
    <p>
      The <codeph>LOAD DATA</codeph> statement is also an alternative to the
      <codeph>CREATE EXTERNAL TABLE</codeph> statement. Use <codeph>LOAD DATA</codeph>
      when it is appropriate to move the data files under Impala control rather than
      querying them from their original location. See
      <xref href="impala_tables.xml#external_tables"/> for information about working with
      external tables.
    </p>
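
    <p>
      For contrast, a minimal sketch of the external-table alternative, which leaves the
      data files in place instead of moving them (the path and table name are illustrative):
    </p>

<codeblock>-- Query the files where they already live; LOAD DATA is not needed.
CREATE EXTERNAL TABLE t1_ext (s STRING)
  LOCATION '/user/doc_demo/';</codeblock>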

  </conbody>
</concept>