<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="copyright" content="(C) Copyright 2025" />
<meta name="DC.rights.owner" content="(C) Copyright 2025" />
<meta name="DC.Type" content="concept" />
<meta name="DC.Title" content="Using Impala with the Azure Data Lake Store (ADLS)" />
<meta name="prodname" content="Impala" />
<meta name="version" content="Impala 3.4.x" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="adls" />
<link rel="stylesheet" type="text/css" href="../commonltr.css" />
<title>Using Impala with the Azure Data Lake Store (ADLS)</title>
</head>

<body id="adls">

<h1 class="title topictitle1" id="ariaid-title1">Using Impala with the Azure Data Lake Store (ADLS)</h1>

<div class="body conbody">

<p class="p">
You can use Impala to query data residing on the Azure Data Lake Store
(ADLS) filesystem. This capability allows convenient access to a storage
system that is remotely managed, accessible from anywhere, and integrated
with various cloud-based services. Impala can query files in any supported
file format from ADLS. The ADLS storage location can be for an entire
table or individual partitions in a partitioned table.
</p>

<p class="p">
The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using
full-table scans. In contrast, queries against ADLS data are less performant, making ADLS suitable for holding
<span class="q">"cold"</span> data that is only queried occasionally, while more frequently accessed <span class="q">"hot"</span> data resides in
HDFS. In a partitioned table, you can set the <code class="ph codeph">LOCATION</code> attribute for individual partitions
to put some partitions on HDFS and others on ADLS, typically depending on the age of the data.
</p>
<p class="p">Starting in <span class="keyword">Impala 3.1</span>, Impala supports ADLS Gen2
|
|
filesystem, Azure Blob File System (ABFS).</p>
|
|
|
|
|
|
<p class="p toc inpage"></p>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested1" aria-labelledby="ariaid-title2" id="prereqs">
|
|
<h2 class="title topictitle2" id="ariaid-title2">Prerequisites</h2>
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
These procedures presume that you have already set up an Azure account,
|
|
configured an ADLS store, and configured your Hadoop cluster with appropriate
|
|
credentials to be able to access ADLS. See the following resources for information:
|
|
</p>
|
|
|
|
<ul class="ul">
|
|
<li class="li">
|
|
<p class="p">
|
|
<a class="xref" href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal" target="_blank">Get started with Azure Data Lake Store using the Azure Portal</a>
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li"><a class="xref" href="https://docs.microsoft.com/en-us/azure/storage/data-lake-storage/quickstart-create-account" target="_blank">Azure Data Lake Storage Gen2</a></li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
<a class="xref" href="https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html" target="_blank">Hadoop Azure Data Lake Support</a>
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
</div>
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested1" aria-labelledby="ariaid-title3" id="sql">
|
|
<h2 class="title topictitle2" id="ariaid-title3">How Impala SQL Statements Work with ADLS</h2>
|
|
|
|
<div class="body conbody">
|
|
<p class="p"> Impala SQL statements work with data on ADLS as follows. </p>
|
|
|
|
<ul class="ul">
|
|
<li class="li"><div class="p"> The <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> or <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> statements can specify
|
|
that a table resides on the ADLS filesystem by using one of the
|
|
following ADLS prefixes in the <code class="ph codeph">LOCATION</code> property.<ul class="ul">
|
|
<li class="li">For ADLS Gen1: <code class="ph codeph">adl://</code></li>
|
|
|
|
<li class="li">For ADLS Gen2: <code class="ph codeph">abfs://</code> or
|
|
<code class="ph codeph">abfss://</code></li>
|
|
|
|
</ul>
|
|
</div>
|
|
<p class="p"><code class="ph codeph">ALTER TABLE</code> can also set the
|
|
<code class="ph codeph">LOCATION</code> property for an individual partition, so
|
|
that some data in a table resides on ADLS and other data in the same
|
|
table resides on HDFS. </p>
|
|
See <a class="xref" href="impala_adls.html#ddl">Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS</a>
|
|
for usage information.</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
Once a table or partition is designated as residing on ADLS, the <a class="xref" href="impala_select.html#select">SELECT Statement</a>
|
|
statement transparently accesses the data files from the appropriate storage layer.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
If the ADLS table is an internal table, the <a class="xref" href="impala_drop_table.html#drop_table">DROP TABLE Statement</a> statement
|
|
removes the corresponding data files from ADLS when the table is dropped.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
The <a class="xref" href="impala_truncate_table.html#truncate_table">TRUNCATE TABLE Statement (Impala 2.3 or higher only)</a> statement always removes the corresponding
|
|
data files from ADLS when the table is truncated.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
The <a class="xref" href="impala_load_data.html#load_data">LOAD DATA Statement</a> can move data files residing in HDFS into
|
|
an ADLS table.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
The <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>, or the <code class="ph codeph">CREATE TABLE AS SELECT</code>
|
|
form of the <code class="ph codeph">CREATE TABLE</code> statement, can copy data from an HDFS table or another ADLS
|
|
table into an ADLS table.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
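
<p class="p">
For instance, a minimal sketch of pointing a table and a partition at ADLS Gen2 through the
<code class="ph codeph">LOCATION</code> property might look like the following. The table names,
container, account, and paths here are hypothetical placeholders.
</p>

<pre class="pre codeblock"><code>-- Hypothetical table on ADLS Gen2; container, account, and path are placeholders.
create table logs_adls (event_id bigint, msg string)
  stored as parquet
  location 'abfs://mycontainer@myaccount.dfs.core.windows.net/logs';

-- An individual partition of a hypothetical partitioned table can also point at ADLS:
alter table sales_part add partition (year=2015)
  location 'abfss://mycontainer@myaccount.dfs.core.windows.net/sales_part/year=2015';
</code></pre>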

<p class="p">For usage information about Impala SQL statements with ADLS tables,
see <a class="xref" href="impala_adls.html#dml">Using Impala DML Statements for ADLS Data</a>.</p>

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title4" id="creds">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title4">Specifying Impala Credentials to Access Data in ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p"> To allow Impala to access data in ADLS, specify values for the
|
|
following configuration settings in your
|
|
<span class="ph filepath">core-site.xml</span> file.</p>
|
|
|
|
<p class="p">For ADLS Gen1:</p>
|
|
|
|
|
|
<pre class="pre codeblock"><code><property>
|
|
<name>dfs.adls.oauth2.access.token.provider.type</name>
|
|
<value>ClientCredential</value>
|
|
</property>
|
|
<property>
|
|
<name>dfs.adls.oauth2.client.id</name>
|
|
<value><var class="keyword varname">your_client_id</var></value>
|
|
</property>
|
|
<property>
|
|
<name>dfs.adls.oauth2.credential</name>
|
|
<value><var class="keyword varname">your_client_secret</var></value>
|
|
</property>
|
|
<property>
|
|
<name>dfs.adls.oauth2.refresh.url</name>
|
|
<value>https://login.windows.net/<var class="keyword varname">your_azure_tenant_id</var>/oauth2/token</value>
|
|
</property>
|
|
|
|
</code></pre>
|
|
<p class="p">For ADLS Gen2:</p>
|
|
|
|
<pre class="pre codeblock"><code> <property>
|
|
<name>fs.azure.account.auth.type</name>
|
|
<value>OAuth</value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth.provider.type</name>
|
|
<value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth2.client.id</name>
|
|
<value><var class="keyword varname">your_client_id</var></value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth2.client.secret</name>
|
|
<value><var class="keyword varname">your_client_secret</var></value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth2.client.endpoint</name>
|
|
<value>https://login.microsoftonline.com/<var class="keyword varname">your_azure_tenant_id</var>/oauth2/token</value>
|
|
</property></code></pre>

<div class="note note"><span class="notetitle">Note:</span>
<p class="p">
Check if your Hadoop distribution or cluster management tool includes support for
filling in and distributing credentials across the cluster in an automated way.
</p>
</div>

<p class="p">After specifying the credentials, restart both the Impala and Hive
services. Restarting Hive is required because certain Impala queries,
such as <code class="ph codeph">CREATE TABLE</code> statements, go through the Hive
metastore.</p>

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title5" id="etl">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title5">Loading Data into ADLS for Impala Queries</h2>
|
|
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
If your ETL pipeline involves moving data into ADLS and then querying through Impala,
|
|
you can either use Impala DML statements to create, move, or copy the data, or
|
|
use the same data loading techniques as you would for non-Impala data.
|
|
</p>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested2" aria-labelledby="ariaid-title6" id="dml">
|
|
<h3 class="title topictitle3" id="ariaid-title6">Using Impala DML Statements for ADLS Data</h3>
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
In <span class="keyword">Impala 2.9</span> and higher, the Impala DML statements
|
|
(<code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and <code class="ph codeph">CREATE TABLE AS
|
|
SELECT</code>) can write data into a table or partition that resides in the Azure Data
|
|
Lake Store (ADLS). ADLS Gen2 is supported in <span class="keyword">Impala 3.1</span> and higher.
|
|
</p>
|
|
<p class="p">
|
|
In the<code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statements, specify
|
|
the ADLS location for tables and partitions with the <code class="ph codeph">adl://</code> prefix for
|
|
ADLS Gen1 and <code class="ph codeph">abfs://</code> or <code class="ph codeph">abfss://</code> for ADLS Gen2 in the
|
|
<code class="ph codeph">LOCATION</code> attribute.
|
|
</p>
|
|
<p class="p">
|
|
If you bring data into ADLS using the normal ADLS transfer mechanisms instead of Impala
|
|
DML statements, issue a <code class="ph codeph">REFRESH</code> statement for the table before using
|
|
Impala to query the ADLS data.
|
|
</p>
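
<p class="p">
For example, a minimal sketch of the DML flow against an ADLS-backed partition. The
<code class="ph codeph">transactions</code> tables and column names here are hypothetical; only the
URL form comes from this document.
</p>

<pre class="pre codeblock"><code>-- Hypothetical partitioned table; the 2015 partition resides on ADLS Gen1.
alter table transactions add partition (year=2015)
  location 'adl://impalademo.azuredatalakestore.net/transactions/year=2015';

-- INSERT writes new data files directly into the ADLS location.
insert into transactions partition (year=2015)
  select id, amount from staging_transactions;

-- CREATE TABLE AS SELECT can populate a new ADLS-backed table in one step.
create table transactions_summary
  location 'adl://impalademo.azuredatalakestore.net/transactions_summary'
  as select year, count(*) as cnt from transactions group by year;
</code></pre>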

</div>
</div>
<div class="topic concept nested2" aria-labelledby="ariaid-title7" id="manual_etl">
|
|
<h3 class="title topictitle3" id="ariaid-title7">Manually Loading Data into Impala Tables on ADLS</h3>
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
As an alternative, you can use the Microsoft-provided methods to bring data files
|
|
into ADLS for querying through Impala. See
|
|
<a class="xref" href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-azure-storage-blob" target="_blank">the Microsoft ADLS documentation</a>
|
|
for details.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
After you upload data files to a location already mapped to an Impala table or partition, or if you delete
|
|
files in ADLS from such a location, issue the <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code>
|
|
statement to make Impala aware of the new set of data files.
|
|
</p>
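
<p class="p">
For example, a sketch of this workflow from within <span class="keyword cmdname">impala-shell</span>,
using the shell escape the other sessions in this page use. The table name, file name, and
path are hypothetical.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > ! hadoop fs -put new_data.parq adl://impalademo.azuredatalakestore.net/my_table_dir/;
[localhost:21000] > refresh my_adls_table;
[localhost:21000] > select count(*) from my_adls_table;
</code></pre>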

</div>
</div>

</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title8" id="ddl">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title8">Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS</h2>
|
|
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<div class="p">
|
|
Impala reads data for a table or partition from ADLS based on the
|
|
<code class="ph codeph">LOCATION</code> attribute for the table or partition.
|
|
Specify the ADLS details in the <code class="ph codeph">LOCATION</code> clause of a
|
|
<code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code>
|
|
statement. The syntax for the <code class="ph codeph">LOCATION</code> clause is:
|
|
<ul class="ul">
|
|
<li class="li">
|
|
For ADLS Gen1:
|
|
<pre class="pre codeblock"><code>adl://<var class="keyword varname">account</var>.azuredatalakestore.net/<var class="keyword varname">path/file</var></code></pre></li>
|
|
|
|
<li class="li">
|
|
For ADLS Gen2:
|
|
<pre class="pre codeblock"><code>abfs://<var class="keyword varname">container</var>@<var class="keyword varname">account</var>.dfs.core.windows.net/<var class="keyword varname">path</var>/<var class="keyword varname">file</var></code></pre>
|
|
<p class="p">
|
|
or
|
|
</p>
|
|
|
|
<pre class="pre codeblock"><code>abfss://<var class="keyword varname">container</var>@<var class="keyword varname">account</var>.dfs.core.windows.net/<var class="keyword varname">path</var>/<var class="keyword varname">file</var></code></pre>
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
</div>

<p class="p">
<code class="ph codeph"><var class="keyword varname">container</var></code> denotes the parent
location that holds the files and folders, which corresponds to a container in
the Azure Storage Blobs service.
</p>

<p class="p">
<code class="ph codeph"><var class="keyword varname">account</var></code> is the name given to your
storage account.
</p>

<div class="note note"><span class="notetitle">Note:</span>
<p class="p">By default, TLS is enabled with both <code class="ph codeph">abfs://</code> and
<code class="ph codeph">abfss://</code>.</p>
<p class="p">
When you set the <code class="ph codeph">fs.azure.always.use.https=false</code>
property, TLS is disabled with <code class="ph codeph">abfs://</code>, but remains
enabled with <code class="ph codeph">abfss://</code>.
</p>
</div>

<p class="p">
For a partitioned table, either specify a separate <code class="ph codeph">LOCATION</code> clause for each new partition,
or specify a base <code class="ph codeph">LOCATION</code> for the table and set up a directory structure in ADLS to mirror
the way Impala partitioned tables are structured in HDFS. Although, strictly speaking, ADLS filenames do not
have directory paths, Impala treats ADLS filenames with <code class="ph codeph">/</code> characters the same as HDFS
pathnames that include directories.
</p>

<p class="p">
To point a nonpartitioned table or an individual partition at ADLS, specify a single directory
path in ADLS, which could be any arbitrary directory. Replicating the structure of an entire Impala
partitioned table or database in ADLS requires more care, with directories and subdirectories nested and
named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if
necessary in HDFS, and recording the complete directory structure so that you can replicate it in ADLS.
</p>

<p class="p">
For example, the following session creates a partitioned table where only a single partition resides on ADLS.
The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a
<code class="ph codeph">LOCATION</code> attribute with an <code class="ph codeph">adl://</code> URL, and so refers to data residing on
ADLS, under a specific path underneath the store <code class="ph codeph">impalademo</code>.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > create database db_on_hdfs;
[localhost:21000] > use db_on_hdfs;
[localhost:21000] > create table mostly_on_hdfs (x int) partitioned by (year int);
[localhost:21000] > alter table mostly_on_hdfs add partition (year=2013);
[localhost:21000] > alter table mostly_on_hdfs add partition (year=2014);
[localhost:21000] > alter table mostly_on_hdfs add partition (year=2015)
                  > location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/t1';
</code></pre>
<p class="p"> For convenience when working with multiple tables with data files
|
|
stored in ADLS, you can create a database with a
|
|
<code class="ph codeph">LOCATION</code> attribute pointing to an ADLS path. Specify
|
|
a URL of the form as shown above. Any tables created inside that
|
|
database automatically create directories underneath the one specified
|
|
by the database <code class="ph codeph">LOCATION</code> attribute. </p>
|
|
|
|
|
|
<p class="p">
|
|
The following session creates a database and two partitioned tables residing entirely on ADLS, one
|
|
partitioned by a single column and the other partitioned by multiple columns. Because a
|
|
<code class="ph codeph">LOCATION</code> attribute with an <code class="ph codeph">adl://</code> URL is specified for the database, the
|
|
tables inside that database are automatically created on ADLS underneath the database directory. To see the
|
|
names of the associated subdirectories, including the partition key values, we use an ADLS client tool to
|
|
examine how the directory structure is organized on ADLS. For example, Impala partition directories such as
|
|
<code class="ph codeph">month=1</code> do not include leading zeroes, which sometimes appear in partition directories created
|
|
through Hive.
|
|
</p>
|
|
|
|
|
|
<pre class="pre codeblock"><code>[localhost:21000] > create database db_on_adls location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3';
|
|
[localhost:21000] > use db_on_adls;
|
|
|
|
[localhost:21000] > create table partitioned_on_adls (x int) partitioned by (year int);
|
|
[localhost:21000] > alter table partitioned_on_adls add partition (year=2013);
|
|
[localhost:21000] > alter table partitioned_on_adls add partition (year=2014);
|
|
[localhost:21000] > alter table partitioned_on_adls add partition (year=2015);
|
|
|
|
[localhost:21000] > ! hadoop fs -ls adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3 --recursive;
|
|
2015-03-17 13:56:34 0 dir1/dir2/dir3/
|
|
2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/
|
|
2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/
|
|
2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/
|
|
2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/
|
|
|
|
[localhost:21000] > create table partitioned_multiple_keys (x int)
|
|
> partitioned by (year smallint, month tinyint, day tinyint);
|
|
[localhost:21000] > alter table partitioned_multiple_keys
|
|
> add partition (year=2015,month=1,day=1);
|
|
[localhost:21000] > alter table partitioned_multiple_keys
|
|
> add partition (year=2015,month=1,day=31);
|
|
[localhost:21000] > alter table partitioned_multiple_keys
|
|
> add partition (year=2015,month=2,day=28);
|
|
|
|
[localhost:21000] > ! hadoop fs -ls adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3 --recursive;
|
|
2015-03-17 13:56:34 0 dir1/dir2/dir3/
|
|
2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/
|
|
2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/
|
|
2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/
|
|
2015-03-17 16:47:57 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=2/day=28/
|
|
2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/
|
|
2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/
|
|
2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/
|
|
2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/
|
|
</code></pre>
|
|
|
|
<p class="p">
|
|
The <code class="ph codeph">CREATE DATABASE</code> and <code class="ph codeph">CREATE TABLE</code> statements create the associated
|
|
directory paths if they do not already exist. You can specify multiple levels of directories, and the
|
|
<code class="ph codeph">CREATE</code> statement creates all appropriate levels, similar to using <code class="ph codeph">mkdir
|
|
-p</code>.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
Use the standard ADLS file upload methods to actually put the data files into the right locations. You can
|
|
also put the directory paths and data files in place before creating the associated Impala databases or
|
|
tables, and Impala automatically uses the data from the appropriate location after the associated databases
|
|
and tables are created.
|
|
</p>
|
|
|
|
|
|
<p class="p"> You can switch whether an existing table or partition points to data
|
|
in HDFS or ADLS. For example, if you have an Impala table or partition
|
|
pointing to data files in HDFS or ADLS, and you later transfer those
|
|
data files to the other filesystem, use an <code class="ph codeph">ALTER TABLE</code>
|
|
statement to adjust the <code class="ph codeph">LOCATION</code> attribute of the
|
|
corresponding table or partition to reflect that change. This
|
|
location-switching technique is not practical for entire databases that
|
|
have a custom <code class="ph codeph">LOCATION</code> attribute. </p>
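
<p class="p">
A minimal sketch of such a switch, reusing the <code class="ph codeph">mostly_on_hdfs</code>
table from the example above; the ADLS target path is hypothetical.
</p>

<pre class="pre codeblock"><code>-- Data for this partition was copied from HDFS to ADLS out of band;
-- repoint the partition, then refresh so Impala sees the files there.
alter table mostly_on_hdfs partition (year=2014)
  set location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/year=2014';
refresh mostly_on_hdfs;
</code></pre>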

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title9" id="internal_external">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title9">Internal and External Tables Located on ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Just as with tables located on HDFS storage, you can designate ADLS-based tables as either internal (managed
|
|
by Impala) or external, by using the syntax <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">CREATE EXTERNAL
|
|
TABLE</code> respectively. When you drop an internal table, the files associated with the table are
|
|
removed, even if they are on ADLS storage. When you drop an external table, the files associated with the
|
|
table are left alone, and are still available for access by other tools or components. See
|
|
<a class="xref" href="impala_tables.html#tables">Overview of Impala Tables</a> for details.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
If the data on ADLS is intended to be long-lived and accessed by other tools in addition to Impala, create
|
|
any associated ADLS tables with the <code class="ph codeph">CREATE EXTERNAL TABLE</code> syntax, so that the files are not
|
|
deleted from ADLS when the table is dropped.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
If the data on ADLS is only needed for querying by Impala and can be safely discarded once the Impala
|
|
workflow is complete, create the associated ADLS tables using the <code class="ph codeph">CREATE TABLE</code> syntax, so
|
|
that dropping the table also deletes the corresponding data files on ADLS.
|
|
</p>
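
<p class="p">
A brief sketch of the two variants; the table names are hypothetical, while the store
name matches the examples in this section.
</p>

<pre class="pre codeblock"><code>-- External: dropping the table leaves the ADLS files in place.
create external table audit_archive (msg string)
  location 'adl://impalademo.azuredatalakestore.net/audit_archive';

-- Internal (managed): dropping the table also deletes the ADLS files.
create table scratch_results (x int)
  location 'adl://impalademo.azuredatalakestore.net/scratch_results';
</code></pre>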

<p class="p">
For example, this session creates a table in ADLS with the same column layout as a table in HDFS, then
examines the ADLS table and queries some data from it. The table in ADLS works the same as a table in HDFS
in terms of the expected file format of the data, table and column statistics, and other table properties. The
only indication that it is not an HDFS table is the <code class="ph codeph">adl://</code> URL in the
<code class="ph codeph">LOCATION</code> property. Many data files can reside in the ADLS directory, and their combined
contents form the table data. Because the data in this example is uploaded after the table is created, a
<code class="ph codeph">REFRESH</code> statement prompts Impala to update its cached information about the data files.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > create table usa_cities_adls like usa_cities location 'adl://impalademo.azuredatalakestore.net/usa_cities';
[localhost:21000] > desc usa_cities_adls;
+-------+----------+---------+
| name  | type     | comment |
+-------+----------+---------+
| id    | smallint |         |
| city  | string   |         |
| state | string   |         |
+-------+----------+---------+

-- Now from a web browser, upload the same data file(s) to ADLS as in the HDFS table,
-- under the relevant store and path. If you already have the data in ADLS, you would
-- point the table LOCATION at an existing path.

[localhost:21000] > refresh usa_cities_adls;
[localhost:21000] > select count(*) from usa_cities_adls;
+----------+
| count(*) |
+----------+
| 289      |
+----------+
[localhost:21000] > select distinct state from usa_cities_adls limit 5;
+----------------------+
| state                |
+----------------------+
| Louisiana            |
| Minnesota            |
| Georgia              |
| Alaska               |
| Ohio                 |
+----------------------+
[localhost:21000] > desc formatted usa_cities_adls;
+------------------------------+----------------------------------------------------+---------+
| name                         | type                                               | comment |
+------------------------------+----------------------------------------------------+---------+
| # col_name                   | data_type                                          | comment |
|                              | NULL                                               | NULL    |
| id                           | smallint                                           | NULL    |
| city                         | string                                             | NULL    |
| state                        | string                                             | NULL    |
|                              | NULL                                               | NULL    |
| # Detailed Table Information | NULL                                               | NULL    |
| Database:                    | adls_testing                                       | NULL    |
| Owner:                       | jrussell                                           | NULL    |
| CreateTime:                  | Mon Mar 16 11:36:25 PDT 2017                       | NULL    |
| LastAccessTime:              | UNKNOWN                                            | NULL    |
| Protect Mode:                | None                                               | NULL    |
| Retention:                   | 0                                                  | NULL    |
| Location:                    | adl://impalademo.azuredatalakestore.net/usa_cities | NULL    |
| Table Type:                  | MANAGED_TABLE                                      | NULL    |
...
+------------------------------+----------------------------------------------------+---------+
</code></pre>

<p class="p">
In this case, we have already uploaded a Parquet file with a million rows of data to the
<code class="ph codeph">sample_data</code> directory underneath the <code class="ph codeph">impalademo</code> store on ADLS. This
session creates a table with matching column settings pointing to the corresponding location in ADLS, then
queries the table. Because the data is already in place on ADLS when the table is created, no
<code class="ph codeph">REFRESH</code> statement is required.
</p>
<pre class="pre codeblock"><code>[localhost:21000] > create table sample_data_adls
|
|
> (id int, id bigint, val int, zerofill string,
|
|
> name string, assertion boolean, city string, state string)
|
|
> stored as parquet location 'adl://impalademo.azuredatalakestore.net/sample_data';
|
|
[localhost:21000] > select count(*) from sample_data_adls;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 1000000 |
|
|
+----------+
|
|
[localhost:21000] > select count(*) howmany, assertion from sample_data_adls group by assertion;
|
|
+---------+-----------+
|
|
| howmany | assertion |
|
|
+---------+-----------+
|
|
| 667149 | true |
|
|
| 332851 | false |
|
|
+---------+-----------+
|
|
</code></pre>

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title10" id="queries">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title10">Running and Tuning Impala Queries for Data Stored on ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Once the appropriate <code class="ph codeph">LOCATION</code> attributes are set up at the table or partition level, you
|
|
query data stored in ADLS exactly the same as data stored on HDFS or in HBase:
|
|
</p>
|
|
|
|
|
|
<ul class="ul">
|
|
<li class="li">
|
|
Queries against ADLS data support all the same file formats as for HDFS data.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in ADLS
|
|
corresponding to the HDFS directories representing partition key values, or use <code class="ph codeph">ALTER TABLE ...
|
|
ADD PARTITION</code> to set up the appropriate paths in ADLS.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
HDFS, Kudu, and HBase tables can be joined to ADLS tables, or ADLS tables can be joined with each other.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Authorization using the Ranger framework to control access to databases, tables, or columns works the
|
|
same whether the data is in HDFS or in ADLS.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
The <span class="keyword cmdname">catalogd</span> daemon caches metadata for both HDFS and ADLS tables. Use
|
|
<code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> for ADLS tables in the same situations
|
|
where you would issue those statements for HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Queries against ADLS tables are subject to the same kinds of admission control and resource management as
|
|
HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Metadata about ADLS tables is stored in the same metastore database as for HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
You can set up views referring to ADLS tables, the same as for HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
The <code class="ph codeph">COMPUTE STATS</code>, <code class="ph codeph">SHOW TABLE STATS</code>, and <code class="ph codeph">SHOW COLUMN
|
|
STATS</code> statements work for ADLS tables also.
|
|
</li>
|
|
|
|
</ul>
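
<p class="p">
A minimal sketch of gathering and inspecting statistics for an ADLS-backed table, using the
<code class="ph codeph">usa_cities_adls</code> table from the earlier example:
</p>

<pre class="pre codeblock"><code>[localhost:21000] > compute stats usa_cities_adls;
[localhost:21000] > show table stats usa_cities_adls;
[localhost:21000] > show column stats usa_cities_adls;
</code></pre>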

</div>
<div class="topic concept nested2" aria-labelledby="ariaid-title11" id="performance">
|
|
|
|
<h3 class="title topictitle3" id="ariaid-title11">Understanding and Tuning Impala Query Performance for ADLS Data</h3>
|
|
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Although Impala queries for data stored in ADLS might be less performant than queries against the
|
|
equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to
|
|
interpret explain plans and profiles for queries against ADLS data, and tips to achieve the best
|
|
performance possible for such queries.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
All else being equal, performance is expected to be lower for queries running against data on ADLS rather
|
|
than HDFS. The actual mechanics of the <code class="ph codeph">SELECT</code> statement are somewhat different when the
|
|
data is in ADLS. Although the work is still distributed across the datanodes of the cluster, Impala might
|
|
parallelize the work for a distributed query differently for data on HDFS and ADLS. ADLS does not have the
|
|
same block notion as HDFS, so Impala uses heuristics to determine how to split up large ADLS files for
|
|
processing in parallel. Because all hosts can access any ADLS data file with equal efficiency, the
|
|
distribution of work might be different than for HDFS data, where the data blocks are physically read
|
|
using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to
|
|
read the ADLS data might be spread evenly across the hosts of the cluster, the fact that all data is
|
|
initially retrieved across the network means that the overall query performance is likely to be lower for
|
|
ADLS data than for HDFS data.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
Because ADLS does not expose the block sizes of data files the way HDFS does, any Impala
|
|
<code class="ph codeph">INSERT</code> or <code class="ph codeph">CREATE TABLE AS SELECT</code> statements use the
|
|
<code class="ph codeph">PARQUET_FILE_SIZE</code> query option setting to define the size of Parquet
|
|
data files. (Using a large block size is more important for Parquet tables than for
|
|
tables that use other file formats.)
|
|
</p>
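
<p class="p">
For example, a sketch of setting that option before writing Parquet files to an ADLS table.
The 256 MB value (expressed in bytes) and the table names are illustrative.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > set PARQUET_FILE_SIZE=268435456;
[localhost:21000] > insert overwrite sales_adls select * from sales_staging;
</code></pre>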

<p class="p">
When optimizing aspects of complex queries such as the join order, Impala treats tables on HDFS and
ADLS the same way. Therefore, follow all the same tuning recommendations for ADLS tables as for HDFS ones,
such as using the <code class="ph codeph">COMPUTE STATS</code> statement to help Impala construct accurate estimates of
row counts and cardinality. See <a class="xref" href="impala_performance.html#performance">Tuning Impala for Performance</a> for details.
</p>

<p class="p">
In query profile reports, the numbers for <code class="ph codeph">BytesReadLocal</code>,
<code class="ph codeph">BytesReadShortCircuit</code>, <code class="ph codeph">BytesReadDataNodeCached</code>, and
<code class="ph codeph">BytesReadRemoteUnexpected</code> are blank because those metrics come from HDFS.
If you do see any indications that a query against an ADLS table performed <span class="q">"remote read"</span>
operations, do not be alarmed. That is expected because, by definition, all the I/O for ADLS tables involves
remote reads.
</p>

</div>
</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title12" id="restrictions">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title12">Restrictions on Impala Support for ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Impala requires that the default filesystem for the cluster be HDFS. You cannot use ADLS as the only
|
|
filesystem in the cluster.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
Although ADLS is often used to store JSON-formatted data, the current Impala support for ADLS does not include
|
|
directly querying JSON data. For Impala queries, use data files in one of the file formats listed in
|
|
<a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a>. If you have data in JSON format, you can prepare a
|
|
flattened version of that data for querying by Impala as part of your ETL cycle.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
You cannot use the <code class="ph codeph">ALTER TABLE ... SET CACHED</code> statement for tables or partitions that are
|
|
located in ADLS.
|
|
</p>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested1" aria-labelledby="ariaid-title13" id="best_practices">
|
|
<h2 class="title topictitle2" id="ariaid-title13">Best Practices for Using Impala with ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
The following guidelines represent best practices derived from testing and real-world experience with Impala on ADLS:
|
|
</p>
|
|
|
|
<ul class="ul">
|
|
<li class="li">
|
|
<p class="p">
|
|
Any reference to an ADLS location must be fully qualified. (This rule applies when
|
|
ADLS is not designated as the default filesystem.)
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
Set any appropriate configuration settings for <span class="keyword cmdname">impalad</span>.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
|
|
</div>
|
|
|
|
</div>
|
|
|
|
|
|
</body>
|
|
</html> |