<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.1" id="load_data">

  <title>LOAD DATA Statement</title>
  <titlealts audience="PDF"><navtitle>LOAD DATA</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="ETL"/>
      <data name="Category" value="Ingest"/>
      <data name="Category" value="DML"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="HDFS"/>
      <data name="Category" value="Tables"/>
      <data name="Category" value="S3"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="hidden">LOAD DATA statement</indexterm>
      The <codeph>LOAD DATA</codeph> statement streamlines the ETL process for an internal Impala table by moving a
      data file or all the data files in a directory from an HDFS location into the Impala data directory for that
      table.
    </p>

    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>LOAD DATA INPATH '<varname>hdfs_file_or_directory_path</varname>' [OVERWRITE] INTO TABLE <varname>tablename</varname>
  [PARTITION (<varname>partcol1</varname>=<varname>val1</varname>, <varname>partcol2</varname>=<varname>val2</varname> ...)]</codeblock>

    <p>
      When the <codeph>LOAD DATA</codeph> statement operates on a partitioned table,
      it always operates on one partition at a time. Specify the <codeph>PARTITION</codeph> clause
      and list all the partition key columns, with a constant value specified for each.
    </p>
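    <p>
      For example, the following is a minimal sketch, with a hypothetical table name and
      HDFS path, of loading one month of new data into a table partitioned by year and month:
    </p>

<codeblock>-- Every partition key column gets a constant value in the PARTITION clause.
LOAD DATA INPATH '/user/doc_demo/sales_2017_03'
  INTO TABLE sales PARTITION (year=2017, month=3);</codeblock>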
    <p conref="../shared/impala_common.xml#common/dml_blurb"/>

    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
    <ul>
      <li>
        The loaded data files are moved, not copied, into the Impala data directory.
      </li>

      <li>
        You can specify the HDFS path of a single file to be moved, or the HDFS path of a directory to move all the
        files inside that directory. You cannot specify any sort of wildcard to take only some of the files from a
        directory. When loading a directory full of data files, keep all the data files at the top level, with no
        nested directories underneath.
      </li>

      <li>
        Currently, the Impala <codeph>LOAD DATA</codeph> statement only imports files from HDFS, not from the local
        filesystem. It does not support the <codeph>LOCAL</codeph> keyword of the Hive <codeph>LOAD DATA</codeph>
        statement. You must specify a path, not an <codeph>hdfs://</codeph> URI.
      </li>

      <li>
        In the interest of speed, only limited error checking is done. If the loaded files have the wrong file
        format, different columns than the destination table, or some other kind of mismatch, Impala does not raise
        any error for the <codeph>LOAD DATA</codeph> statement. Querying the table afterward could produce a runtime
        error or unexpected results. Currently, the only checking the <codeph>LOAD DATA</codeph> statement does is
        to avoid mixing together uncompressed and LZO-compressed text files in the same table.
      </li>

      <li>
        When you specify an HDFS directory name as the <codeph>LOAD DATA</codeph> argument, any hidden files in
        that directory (files whose names start with a <codeph>.</codeph>) are not moved to the Impala data
        directory.
      </li>

      <li rev="2.5.0 IMPALA-2867">
        The operation fails if the source directory contains any non-hidden directories.
        Prior to <keyword keyref="impala25_full"/>, if the source directory contained any subdirectory, even a hidden one such as
        <filepath>_impala_insert_staging</filepath>, the <codeph>LOAD DATA</codeph> statement would fail.
        In <keyword keyref="impala25_full"/> and higher, <codeph>LOAD DATA</codeph> ignores hidden subdirectories in the
        source directory, and only fails if any of the subdirectories are non-hidden.
      </li>

      <li>
        The loaded data files retain their original names in the new location, unless a name conflicts with an
        existing data file, in which case the name of the new file is modified slightly to be unique. (The
        name-mangling is a slight difference from the Hive <codeph>LOAD DATA</codeph> statement, which replaces
        identically named files.)
      </li>

      <li>
        By providing an easy way to transport files from known locations in HDFS into the Impala data directory
        structure, the <codeph>LOAD DATA</codeph> statement lets you avoid memorizing the locations and layout of
        the HDFS directory tree containing the Impala databases and tables. (For a quick way to check the location of
        the data files for an Impala table, issue the statement <codeph>DESCRIBE FORMATTED
        <varname>table_name</varname></codeph>, as in the sketch following this list.)
      </li>

      <li>
        The <codeph>PARTITION</codeph> clause is especially convenient for ingesting new data for a partitioned
        table. As you receive new data for a time period, geographic region, or other division that corresponds to
        one or more partitioning columns, you can load that data straight into the appropriate Impala data
        directory, which might be nested several levels down if the table is partitioned by multiple columns. When
        the table is partitioned, you must specify constant values for all the partitioning columns.
      </li>
    </ul>
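    <p>
      As a minimal sketch of that check, assuming the <codeph>t1</codeph> table used in the
      examples later in this topic:
    </p>

<codeblock>-- The "Location:" field in the output shows the HDFS directory
-- that holds the data files for the table.
DESCRIBE FORMATTED t1;</codeblock>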
    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

    <p rev="2.3.0">
      Because Impala currently cannot create Parquet data files containing complex types
      (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>), the
      <codeph>LOAD DATA</codeph> statement is especially important when working with
      tables containing complex type columns. You create the Parquet data files outside
      Impala, then use either <codeph>LOAD DATA</codeph>, an external table, or HDFS-level
      file operations followed by <codeph>REFRESH</codeph> to associate the data files with
      the corresponding table.
      See <xref href="impala_complex_types.xml#complex_types"/> for details about using complex types.
    </p>
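    <p>
      The following is a minimal sketch of that workflow; the table name, column definitions,
      and HDFS path are hypothetical:
    </p>

<codeblock>CREATE TABLE complex_demo (id BIGINT, tags ARRAY&lt;STRING&gt;) STORED AS PARQUET;

-- Parquet files containing the complex type data are produced outside Impala,
-- for example by Hive or Spark, then moved into the Impala data directory:
LOAD DATA INPATH '/user/doc_demo/complex_parquet' INTO TABLE complex_demo;

-- Alternatively, if the files were placed in the table directory through
-- HDFS-level file operations, make Impala aware of the new files:
REFRESH complex_demo;</codeblock>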
    <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/>

    <note conref="../shared/impala_common.xml#common/compute_stats_next"/>

    <p conref="../shared/impala_common.xml#common/example_blurb"/>

    <p>
      First, we use a trivial Python script to write different numbers of strings (one per line) into files stored
      in the <codeph>doc_demo</codeph> HDFS user account. (Substitute the path for your own HDFS user account when
      doing <cmdname>hdfs dfs</cmdname> operations like these.)
    </p>

<codeblock>$ random_strings.py 1000 | hdfs dfs -put - /user/doc_demo/thousand_strings.txt
$ random_strings.py 100 | hdfs dfs -put - /user/doc_demo/hundred_strings.txt
$ random_strings.py 10 | hdfs dfs -put - /user/doc_demo/ten_strings.txt</codeblock>
    <p>
      Next, we create a table and load an initial set of data into it. Remember, unless you specify a
      <codeph>STORED AS</codeph> clause, Impala tables default to <codeph>TEXTFILE</codeph> format with Ctrl-A (hex
      01) as the field delimiter. This example uses a single-column table, so the delimiter is not significant. For
      large-scale ETL jobs, you would typically use binary format data files such as Parquet or Avro, and load them
      into Impala tables that use the corresponding file format.
    </p>

<codeblock>[localhost:21000] > create table t1 (s string);
[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.61s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0  |
+------+
| 1000 |
+------+
Returned 1 row(s) in 0.67s
[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1;
ERROR: AnalysisException: INPATH location '/user/doc_demo/thousand_strings.txt' does not exist.</codeblock>
    <p>
      As indicated by the message at the end of the previous example, the data file was moved from its original
      location. The following example illustrates how the data file was moved into the Impala data directory for
      the destination table, keeping its original filename:
    </p>

<codeblock>$ hdfs dfs -ls /user/hive/warehouse/load_data_testing.db/t1
Found 1 items
-rw-r--r--   1 doc_demo doc_demo      13926 2013-06-26 15:40 /user/hive/warehouse/load_data_testing.db/t1/thousand_strings.txt</codeblock>
    <p>
      The following example demonstrates the difference between the <codeph>INTO TABLE</codeph> and
      <codeph>OVERWRITE INTO TABLE</codeph> clauses. The table already contains 1000 rows. After issuing the
      <codeph>LOAD DATA</codeph> statement with the <codeph>INTO TABLE</codeph> clause, the table contains 100 more
      rows, for a total of 1100. After issuing the <codeph>LOAD DATA</codeph> statement with the <codeph>OVERWRITE
      INTO TABLE</codeph> clause, the former contents are gone, and now the table only contains the 10 rows from
      the just-loaded data file.
    </p>

<codeblock>[localhost:21000] > load data inpath '/user/doc_demo/hundred_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.24s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0  |
+------+
| 1100 |
+------+
Returned 1 row(s) in 0.55s
[localhost:21000] > load data inpath '/user/doc_demo/ten_strings.txt' overwrite into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary                                                  |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.26s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+-----+
| _c0 |
+-----+
| 10  |
+-----+
Returned 1 row(s) in 0.62s</codeblock>
    <p conref="../shared/impala_common.xml#common/s3_blurb"/>
    <p conref="../shared/impala_common.xml#common/s3_dml"/>
    <p conref="../shared/impala_common.xml#common/s3_dml_performance"/>
    <p>See <xref href="../topics/impala_s3.xml#s3"/> for details about reading and writing S3 data with Impala.</p>

    <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

    <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
    <p>
      The user ID that the <cmdname>impalad</cmdname> daemon runs under,
      typically the <codeph>impala</codeph> user, must have read and write
      permissions for the files in the source directory, and write
      permission for the destination directory.
    </p>

    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
    <p conref="../shared/impala_common.xml#common/kudu_no_load_data"/>

    <p conref="../shared/impala_common.xml#common/hbase_blurb"/>
    <p conref="../shared/impala_common.xml#common/hbase_no_load_data"/>
    <p conref="../shared/impala_common.xml#common/related_info"/>
    <p>
      The <codeph>LOAD DATA</codeph> statement is an alternative to the
      <codeph><xref href="impala_insert.xml#insert">INSERT</xref></codeph> statement.
      Use <codeph>LOAD DATA</codeph>
      when you have the data files in HDFS but outside of any Impala table.
    </p>
    <p>
      The <codeph>LOAD DATA</codeph> statement is also an alternative
      to the <codeph>CREATE EXTERNAL TABLE</codeph> statement. Use
      <codeph>LOAD DATA</codeph> when it is appropriate to move the
      data files under Impala control rather than querying them
      from their original location. See <xref href="impala_tables.xml#external_tables"/>
      for information about working with external tables.
    </p>
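    <p>
      As a minimal sketch of the two alternatives, with hypothetical table names and a
      hypothetical HDFS path:
    </p>

<codeblock>-- Move the data files under Impala control:
LOAD DATA INPATH '/user/doc_demo/staging' INTO TABLE managed_table;

-- Or leave the data files in place and query them from their original location:
CREATE EXTERNAL TABLE external_table (s STRING)
  LOCATION '/user/doc_demo/staging';</codeblock>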
</conbody>
</concept>