<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="copyright" content="(C) Copyright 2025" />
<meta name="DC.rights.owner" content="(C) Copyright 2025" />
<meta name="DC.Type" content="concept" />
<meta name="DC.Title" content="Using Impala with the Azure Data Lake Store (ADLS)" />
<meta name="prodname" content="Impala" />
<meta name="version" content="Impala 3.4.x" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="adls" />
<link rel="stylesheet" type="text/css" href="../commonltr.css" />
<title>Using Impala with the Azure Data Lake Store (ADLS)</title>
</head>

<body id="adls">

<h1 class="title topictitle1" id="ariaid-title1">Using Impala with the Azure Data Lake Store (ADLS)</h1>

<div class="body conbody">

<p class="p">
You can use Impala to query data residing on the Azure Data Lake Store
(ADLS) filesystem. This capability allows convenient access to a storage
system that is remotely managed, accessible from anywhere, and integrated
with various cloud-based services. Impala can query files in any supported
file format from ADLS. The ADLS storage location can be for an entire
table or individual partitions in a partitioned table.
</p>

<p class="p">
The default Impala tables use data files stored on HDFS, which are ideal for bulk loads and queries using
full-table scans. In contrast, queries against ADLS data are less performant, making ADLS suitable for holding
<span class="q">"cold"</span> data that is only queried occasionally, while more frequently accessed <span class="q">"hot"</span> data resides in
HDFS. In a partitioned table, you can set the <code class="ph codeph">LOCATION</code> attribute for individual partitions
to put some partitions on HDFS and others on ADLS, typically depending on the age of the data.
</p>
<p class="p">Starting in <span class="keyword">Impala 3.1</span>, Impala supports ADLS Gen2
|
|
filesystem, Azure Blob File System (ABFS).</p>
|
|
|
|
|
|
<p class="p toc inpage"></p>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested1" aria-labelledby="ariaid-title2" id="prereqs">
|
|
<h2 class="title topictitle2" id="ariaid-title2">Prerequisites</h2>
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
These procedures presume that you have already set up an Azure account,
|
|
configured an ADLS store, and configured your Hadoop cluster with appropriate
|
|
credentials to be able to access ADLS. See the following resources for information:
|
|
</p>
|
|
|
|
<ul class="ul">
|
|
<li class="li">
|
|
<p class="p">
|
|
<a class="xref" href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-get-started-portal" target="_blank">Get started with Azure Data Lake Store using the Azure Portal</a>
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li"><a class="xref" href="https://docs.microsoft.com/en-us/azure/storage/data-lake-storage/quickstart-create-account" target="_blank">Azure Data Lake Storage Gen2</a></li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
<a class="xref" href="https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html" target="_blank">Hadoop Azure Data Lake Support</a>
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
</div>
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested1" aria-labelledby="ariaid-title3" id="sql">
|
|
<h2 class="title topictitle2" id="ariaid-title3">How Impala SQL Statements Work with ADLS</h2>
|
|
|
|
<div class="body conbody">
|
|
<p class="p"> Impala SQL statements work with data on ADLS as follows. </p>
|
|
|
|
<ul class="ul">
|
|
<li class="li"><div class="p"> The <a class="xref" href="impala_create_table.html#create_table">CREATE TABLE Statement</a> or <a class="xref" href="impala_alter_table.html#alter_table">ALTER TABLE Statement</a> statements can specify
|
|
that a table resides on the ADLS filesystem by using one of the
|
|
following ADLS prefixes in the <code class="ph codeph">LOCATION</code> property.<ul class="ul">
|
|
<li class="li">For ADLS Gen1: <code class="ph codeph">adl://</code></li>
|
|
|
|
<li class="li">For ADLS Gen2: <code class="ph codeph">abfs://</code> or
|
|
<code class="ph codeph">abfss://</code></li>
|
|
|
|
</ul>
|
|
</div>
|
|
<p class="p"><code class="ph codeph">ALTER TABLE</code> can also set the
|
|
<code class="ph codeph">LOCATION</code> property for an individual partition, so
|
|
that some data in a table resides on ADLS and other data in the same
|
|
table resides on HDFS. </p>
|
|
See <a class="xref" href="impala_adls.html#ddl">Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS</a>
|
|
for usage information.</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
Once a table or partition is designated as residing on ADLS, the <a class="xref" href="impala_select.html#select">SELECT Statement</a>
|
|
statement transparently accesses the data files from the appropriate storage layer.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
If the ADLS table is an internal table, the <a class="xref" href="impala_drop_table.html#drop_table">DROP TABLE Statement</a> statement
|
|
removes the corresponding data files from ADLS when the table is dropped.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
The <a class="xref" href="impala_truncate_table.html#truncate_table">TRUNCATE TABLE Statement (Impala 2.3 or higher only)</a> statement always removes the corresponding
|
|
data files from ADLS when the table is truncated.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
The <a class="xref" href="impala_load_data.html#load_data">LOAD DATA Statement</a> can move data files residing in HDFS into
|
|
an ADLS table.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
The <a class="xref" href="impala_insert.html#insert">INSERT Statement</a>, or the <code class="ph codeph">CREATE TABLE AS SELECT</code>
|
|
form of the <code class="ph codeph">CREATE TABLE</code> statement, can copy data from an HDFS table or another ADLS
|
|
table into an ADLS table.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
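
<p class="p">
For instance, a minimal sketch of pointing a table and a partition at ADLS Gen2 through the
<code class="ph codeph">LOCATION</code> property might look like the following. The table names,
container, account, and paths here are hypothetical placeholders.
</p>

<pre class="pre codeblock"><code>-- Hypothetical table on ADLS Gen2; container, account, and path are placeholders.
create table logs_adls (event_id bigint, msg string)
  stored as parquet
  location 'abfs://mycontainer@myaccount.dfs.core.windows.net/logs';

-- An individual partition of a hypothetical partitioned table can also point at ADLS:
alter table sales_part add partition (year=2015)
  location 'abfss://mycontainer@myaccount.dfs.core.windows.net/sales_part/year=2015';
</code></pre>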

<p class="p">For usage information about Impala SQL statements with ADLS tables,
see <a class="xref" href="impala_adls.html#dml">Using Impala DML Statements for ADLS Data</a>.</p>

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title4" id="creds">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title4">Specifying Impala Credentials to Access Data in ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p"> To allow Impala to access data in ADLS, specify values for the
|
|
following configuration settings in your
|
|
<span class="ph filepath">core-site.xml</span> file.</p>
|
|
|
|
<p class="p">For ADLS Gen1:</p>
|
|
|
|
|
|
<pre class="pre codeblock"><code><property>
|
|
<name>dfs.adls.oauth2.access.token.provider.type</name>
|
|
<value>ClientCredential</value>
|
|
</property>
|
|
<property>
|
|
<name>dfs.adls.oauth2.client.id</name>
|
|
<value><var class="keyword varname">your_client_id</var></value>
|
|
</property>
|
|
<property>
|
|
<name>dfs.adls.oauth2.credential</name>
|
|
<value><var class="keyword varname">your_client_secret</var></value>
|
|
</property>
|
|
<property>
|
|
<name>dfs.adls.oauth2.refresh.url</name>
|
|
<value>https://login.windows.net/<var class="keyword varname">your_azure_tenant_id</var>/oauth2/token</value>
|
|
</property>
|
|
|
|
</code></pre>
|
|
<p class="p">For ADLS Gen2:</p>
|
|
|
|
<pre class="pre codeblock"><code> <property>
|
|
<name>fs.azure.account.auth.type</name>
|
|
<value>OAuth</value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth.provider.type</name>
|
|
<value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth2.client.id</name>
|
|
<value><var class="keyword varname">your_client_id</var></value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth2.client.secret</name>
|
|
<value><var class="keyword varname">your_client_secret</var></value>
|
|
</property>
|
|
|
|
<property>
|
|
<name>fs.azure.account.oauth2.client.endpoint</name>
|
|
<value>https://login.microsoftonline.com/<var class="keyword varname">your_azure_tenant_id</var>/oauth2/token</value>
|
|
</property></code></pre>

<div class="note note"><span class="notetitle">Note:</span>
<p class="p">
Check if your Hadoop distribution or cluster management tool includes support for
filling in and distributing credentials across the cluster in an automated way.
</p>
</div>

<p class="p">After specifying the credentials, restart both the Impala and Hive
services. Restarting Hive is required because certain Impala queries,
such as <code class="ph codeph">CREATE TABLE</code> statements, go through the Hive
metastore.</p>

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title5" id="etl">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title5">Loading Data into ADLS for Impala Queries</h2>
|
|
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
If your ETL pipeline involves moving data into ADLS and then querying through Impala,
|
|
you can either use Impala DML statements to create, move, or copy the data, or
|
|
use the same data loading techniques as you would for non-Impala data.
|
|
</p>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested2" aria-labelledby="ariaid-title6" id="dml">
|
|
<h3 class="title topictitle3" id="ariaid-title6">Using Impala DML Statements for ADLS Data</h3>
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
In <span class="keyword">Impala 2.9</span> and higher, the Impala DML statements
|
|
(<code class="ph codeph">INSERT</code>, <code class="ph codeph">LOAD DATA</code>, and <code class="ph codeph">CREATE TABLE AS
|
|
SELECT</code>) can write data into a table or partition that resides in the Azure Data
|
|
Lake Store (ADLS). ADLS Gen2 is supported in <span class="keyword">Impala 3.1</span> and higher.
|
|
</p>
|
|
<p class="p">
|
|
In the<code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code> statements, specify
|
|
the ADLS location for tables and partitions with the <code class="ph codeph">adl://</code> prefix for
|
|
ADLS Gen1 and <code class="ph codeph">abfs://</code> or <code class="ph codeph">abfss://</code> for ADLS Gen2 in the
|
|
<code class="ph codeph">LOCATION</code> attribute.
|
|
</p>
|
|
<p class="p">
|
|
If you bring data into ADLS using the normal ADLS transfer mechanisms instead of Impala
|
|
DML statements, issue a <code class="ph codeph">REFRESH</code> statement for the table before using
|
|
Impala to query the ADLS data.
|
|
</p>
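
<p class="p">
For example, a minimal sketch of the DML flow against an ADLS-backed partition. The
<code class="ph codeph">transactions</code> tables and column names here are hypothetical; only the
URL form comes from this document.
</p>

<pre class="pre codeblock"><code>-- Hypothetical partitioned table; the 2015 partition resides on ADLS Gen1.
alter table transactions add partition (year=2015)
  location 'adl://impalademo.azuredatalakestore.net/transactions/year=2015';

-- INSERT writes new data files directly into the ADLS location.
insert into transactions partition (year=2015)
  select id, amount from staging_transactions;

-- CREATE TABLE AS SELECT can populate a new ADLS-backed table in one step.
create table transactions_summary
  location 'adl://impalademo.azuredatalakestore.net/transactions_summary'
  as select year, count(*) as cnt from transactions group by year;
</code></pre>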

</div>
</div>
<div class="topic concept nested2" aria-labelledby="ariaid-title7" id="manual_etl">
|
|
<h3 class="title topictitle3" id="ariaid-title7">Manually Loading Data into Impala Tables on ADLS</h3>
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
As an alternative, you can use the Microsoft-provided methods to bring data files
|
|
into ADLS for querying through Impala. See
|
|
<a class="xref" href="https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-copy-data-azure-storage-blob" target="_blank">the Microsoft ADLS documentation</a>
|
|
for details.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
After you upload data files to a location already mapped to an Impala table or partition, or if you delete
|
|
files in ADLS from such a location, issue the <code class="ph codeph">REFRESH <var class="keyword varname">table_name</var></code>
|
|
statement to make Impala aware of the new set of data files.
|
|
</p>
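
<p class="p">
For example, a sketch of this workflow from within <span class="keyword cmdname">impala-shell</span>,
using the shell escape the other sessions in this page use. The table name, file name, and
path are hypothetical.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > ! hadoop fs -put new_data.parq adl://impalademo.azuredatalakestore.net/my_table_dir/;
[localhost:21000] > refresh my_adls_table;
[localhost:21000] > select count(*) from my_adls_table;
</code></pre>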

</div>
</div>

</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title8" id="ddl">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title8">Creating Impala Databases, Tables, and Partitions for Data Stored on ADLS</h2>
|
|
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<div class="p">
|
|
Impala reads data for a table or partition from ADLS based on the
|
|
<code class="ph codeph">LOCATION</code> attribute for the table or partition.
|
|
Specify the ADLS details in the <code class="ph codeph">LOCATION</code> clause of a
|
|
<code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">ALTER TABLE</code>
|
|
statement. The syntax for the <code class="ph codeph">LOCATION</code> clause is:
|
|
<ul class="ul">
|
|
<li class="li">
|
|
For ADLS Gen1:
|
|
<pre class="pre codeblock"><code>adl://<var class="keyword varname">account</var>.azuredatalakestore.net/<var class="keyword varname">path/file</var></code></pre></li>
|
|
|
|
<li class="li">
|
|
For ADLS Gen2:
|
|
<pre class="pre codeblock"><code>abfs://<var class="keyword varname">container</var>@<var class="keyword varname">account</var>.dfs.core.windows.net/<var class="keyword varname">path</var>/<var class="keyword varname">file</var></code></pre>
|
|
<p class="p">
|
|
or
|
|
</p>
|
|
|
|
<pre class="pre codeblock"><code>abfss://<var class="keyword varname">container</var>@<var class="keyword varname">account</var>.dfs.core.windows.net/<var class="keyword varname">path</var>/<var class="keyword varname">file</var></code></pre>
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
</div>

<p class="p">
<code class="ph codeph"><var class="keyword varname">container</var></code> denotes the parent
location that holds the files and folders, which corresponds to a container in
the Azure Storage Blobs service.
</p>

<p class="p">
<code class="ph codeph"><var class="keyword varname">account</var></code> is the name given to your
storage account.
</p>

<div class="note note"><span class="notetitle">Note:</span>
<p class="p">By default, TLS is enabled with both <code class="ph codeph">abfs://</code> and
<code class="ph codeph">abfss://</code>.</p>
<p class="p">
When you set the <code class="ph codeph">fs.azure.always.use.https=false</code>
property, TLS is disabled with <code class="ph codeph">abfs://</code>, but remains
enabled with <code class="ph codeph">abfss://</code>.
</p>
</div>

<p class="p">
For a partitioned table, either specify a separate <code class="ph codeph">LOCATION</code> clause for each new partition,
or specify a base <code class="ph codeph">LOCATION</code> for the table and set up a directory structure in ADLS to mirror
the way Impala partitioned tables are structured in HDFS. Although, strictly speaking, ADLS filenames do not
have directory paths, Impala treats ADLS filenames with <code class="ph codeph">/</code> characters the same as HDFS
pathnames that include directories.
</p>

<p class="p">
To point a nonpartitioned table or an individual partition at ADLS, specify a single directory
path in ADLS, which could be any arbitrary directory. Replicating the structure of an entire Impala
partitioned table or database in ADLS requires more care, with directories and subdirectories nested and
named to match the equivalent directory tree in HDFS. Consider setting up an empty staging area if
necessary in HDFS, and recording the complete directory structure so that you can replicate it in ADLS.
</p>

<p class="p">
For example, the following session creates a partitioned table where only a single partition resides on ADLS.
The partitions for years 2013 and 2014 are located on HDFS. The partition for year 2015 includes a
<code class="ph codeph">LOCATION</code> attribute with an <code class="ph codeph">adl://</code> URL, and so refers to data residing on
ADLS, under a specific path underneath the store <code class="ph codeph">impalademo</code>.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > create database db_on_hdfs;
[localhost:21000] > use db_on_hdfs;
[localhost:21000] > create table mostly_on_hdfs (x int) partitioned by (year int);
[localhost:21000] > alter table mostly_on_hdfs add partition (year=2013);
[localhost:21000] > alter table mostly_on_hdfs add partition (year=2014);
[localhost:21000] > alter table mostly_on_hdfs add partition (year=2015)
                  > location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/t1';
</code></pre>
<p class="p"> For convenience when working with multiple tables with data files
|
|
stored in ADLS, you can create a database with a
|
|
<code class="ph codeph">LOCATION</code> attribute pointing to an ADLS path. Specify
|
|
a URL of the form as shown above. Any tables created inside that
|
|
database automatically create directories underneath the one specified
|
|
by the database <code class="ph codeph">LOCATION</code> attribute. </p>
|
|
|
|
|
|
<p class="p">
|
|
The following session creates a database and two partitioned tables residing entirely on ADLS, one
|
|
partitioned by a single column and the other partitioned by multiple columns. Because a
|
|
<code class="ph codeph">LOCATION</code> attribute with an <code class="ph codeph">adl://</code> URL is specified for the database, the
|
|
tables inside that database are automatically created on ADLS underneath the database directory. To see the
|
|
names of the associated subdirectories, including the partition key values, we use an ADLS client tool to
|
|
examine how the directory structure is organized on ADLS. For example, Impala partition directories such as
|
|
<code class="ph codeph">month=1</code> do not include leading zeroes, which sometimes appear in partition directories created
|
|
through Hive.
|
|
</p>
|
|
|
|
|
|
<pre class="pre codeblock"><code>[localhost:21000] > create database db_on_adls location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3';
|
|
[localhost:21000] > use db_on_adls;
|
|
|
|
[localhost:21000] > create table partitioned_on_adls (x int) partitioned by (year int);
|
|
[localhost:21000] > alter table partitioned_on_adls add partition (year=2013);
|
|
[localhost:21000] > alter table partitioned_on_adls add partition (year=2014);
|
|
[localhost:21000] > alter table partitioned_on_adls add partition (year=2015);
|
|
|
|
[localhost:21000] > ! hadoop fs -ls adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3 --recursive;
|
|
2015-03-17 13:56:34 0 dir1/dir2/dir3/
|
|
2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/
|
|
2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/
|
|
2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/
|
|
2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/
|
|
|
|
[localhost:21000] > create table partitioned_multiple_keys (x int)
|
|
> partitioned by (year smallint, month tinyint, day tinyint);
|
|
[localhost:21000] > alter table partitioned_multiple_keys
|
|
> add partition (year=2015,month=1,day=1);
|
|
[localhost:21000] > alter table partitioned_multiple_keys
|
|
> add partition (year=2015,month=1,day=31);
|
|
[localhost:21000] > alter table partitioned_multiple_keys
|
|
> add partition (year=2015,month=2,day=28);
|
|
|
|
[localhost:21000] > ! hadoop fs -ls adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3 --recursive;
|
|
2015-03-17 13:56:34 0 dir1/dir2/dir3/
|
|
2015-03-17 16:47:13 0 dir1/dir2/dir3/partitioned_multiple_keys/
|
|
2015-03-17 16:47:44 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=1/
|
|
2015-03-17 16:47:50 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=1/day=31/
|
|
2015-03-17 16:47:57 0 dir1/dir2/dir3/partitioned_multiple_keys/year=2015/month=2/day=28/
|
|
2015-03-17 16:43:28 0 dir1/dir2/dir3/partitioned_on_adls/
|
|
2015-03-17 16:43:49 0 dir1/dir2/dir3/partitioned_on_adls/year=2013/
|
|
2015-03-17 16:43:53 0 dir1/dir2/dir3/partitioned_on_adls/year=2014/
|
|
2015-03-17 16:43:58 0 dir1/dir2/dir3/partitioned_on_adls/year=2015/
|
|
</code></pre>
|
|
|
|
<p class="p">
|
|
The <code class="ph codeph">CREATE DATABASE</code> and <code class="ph codeph">CREATE TABLE</code> statements create the associated
|
|
directory paths if they do not already exist. You can specify multiple levels of directories, and the
|
|
<code class="ph codeph">CREATE</code> statement creates all appropriate levels, similar to using <code class="ph codeph">mkdir
|
|
-p</code>.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
Use the standard ADLS file upload methods to actually put the data files into the right locations. You can
|
|
also put the directory paths and data files in place before creating the associated Impala databases or
|
|
tables, and Impala automatically uses the data from the appropriate location after the associated databases
|
|
and tables are created.
|
|
</p>
|
|
|
|
|
|
<p class="p"> You can switch whether an existing table or partition points to data
|
|
in HDFS or ADLS. For example, if you have an Impala table or partition
|
|
pointing to data files in HDFS or ADLS, and you later transfer those
|
|
data files to the other filesystem, use an <code class="ph codeph">ALTER TABLE</code>
|
|
statement to adjust the <code class="ph codeph">LOCATION</code> attribute of the
|
|
corresponding table or partition to reflect that change. This
|
|
location-switching technique is not practical for entire databases that
|
|
have a custom <code class="ph codeph">LOCATION</code> attribute. </p>
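
<p class="p">
A minimal sketch of such a switch, reusing the <code class="ph codeph">mostly_on_hdfs</code>
table from the example above; the ADLS target path is hypothetical.
</p>

<pre class="pre codeblock"><code>-- Data for this partition was copied from HDFS to ADLS out of band;
-- repoint the partition, then refresh so Impala sees the files there.
alter table mostly_on_hdfs partition (year=2014)
  set location 'adl://impalademo.azuredatalakestore.net/dir1/dir2/dir3/year=2014';
refresh mostly_on_hdfs;
</code></pre>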

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title9" id="internal_external">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title9">Internal and External Tables Located on ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Just as with tables located on HDFS storage, you can designate ADLS-based tables as either internal (managed
|
|
by Impala) or external, by using the syntax <code class="ph codeph">CREATE TABLE</code> or <code class="ph codeph">CREATE EXTERNAL
|
|
TABLE</code> respectively. When you drop an internal table, the files associated with the table are
|
|
removed, even if they are on ADLS storage. When you drop an external table, the files associated with the
|
|
table are left alone, and are still available for access by other tools or components. See
|
|
<a class="xref" href="impala_tables.html#tables">Overview of Impala Tables</a> for details.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
If the data on ADLS is intended to be long-lived and accessed by other tools in addition to Impala, create
|
|
any associated ADLS tables with the <code class="ph codeph">CREATE EXTERNAL TABLE</code> syntax, so that the files are not
|
|
deleted from ADLS when the table is dropped.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
If the data on ADLS is only needed for querying by Impala and can be safely discarded once the Impala
|
|
workflow is complete, create the associated ADLS tables using the <code class="ph codeph">CREATE TABLE</code> syntax, so
|
|
that dropping the table also deletes the corresponding data files on ADLS.
|
|
</p>
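
<p class="p">
A brief sketch of the two variants; the table names are hypothetical, while the store
name matches the examples in this section.
</p>

<pre class="pre codeblock"><code>-- External: dropping the table leaves the ADLS files in place.
create external table audit_archive (msg string)
  location 'adl://impalademo.azuredatalakestore.net/audit_archive';

-- Internal (managed): dropping the table also deletes the ADLS files.
create table scratch_results (x int)
  location 'adl://impalademo.azuredatalakestore.net/scratch_results';
</code></pre>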

<p class="p">
For example, this session creates a table in ADLS with the same column layout as a table in HDFS, then
examines the ADLS table and queries some data from it. The table in ADLS works the same as a table in HDFS
in terms of the expected file format of the data, table and column statistics, and other table properties. The
only indication that it is not an HDFS table is the <code class="ph codeph">adl://</code> URL in the
<code class="ph codeph">LOCATION</code> property. Many data files can reside in the ADLS directory, and their combined
contents form the table data. Because the data in this example is uploaded after the table is created, a
<code class="ph codeph">REFRESH</code> statement prompts Impala to update its cached information about the data files.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > create table usa_cities_adls like usa_cities location 'adl://impalademo.azuredatalakestore.net/usa_cities';
[localhost:21000] > desc usa_cities_adls;
+-------+----------+---------+
| name  | type     | comment |
+-------+----------+---------+
| id    | smallint |         |
| city  | string   |         |
| state | string   |         |
+-------+----------+---------+

-- Now from a web browser, upload the same data file(s) to ADLS as in the HDFS table,
-- under the relevant store and path. If you already have the data in ADLS, you would
-- point the table LOCATION at an existing path.

[localhost:21000] > refresh usa_cities_adls;
[localhost:21000] > select count(*) from usa_cities_adls;
+----------+
| count(*) |
+----------+
| 289      |
+----------+
[localhost:21000] > select distinct state from usa_cities_adls limit 5;
+----------------------+
| state                |
+----------------------+
| Louisiana            |
| Minnesota            |
| Georgia              |
| Alaska               |
| Ohio                 |
+----------------------+
[localhost:21000] > desc formatted usa_cities_adls;
+------------------------------+----------------------------------------------------+---------+
| name                         | type                                               | comment |
+------------------------------+----------------------------------------------------+---------+
| # col_name                   | data_type                                          | comment |
|                              | NULL                                               | NULL    |
| id                           | smallint                                           | NULL    |
| city                         | string                                             | NULL    |
| state                        | string                                             | NULL    |
|                              | NULL                                               | NULL    |
| # Detailed Table Information | NULL                                               | NULL    |
| Database:                    | adls_testing                                       | NULL    |
| Owner:                       | jrussell                                           | NULL    |
| CreateTime:                  | Mon Mar 16 11:36:25 PDT 2017                       | NULL    |
| LastAccessTime:              | UNKNOWN                                            | NULL    |
| Protect Mode:                | None                                               | NULL    |
| Retention:                   | 0                                                  | NULL    |
| Location:                    | adl://impalademo.azuredatalakestore.net/usa_cities | NULL    |
| Table Type:                  | MANAGED_TABLE                                      | NULL    |
...
+------------------------------+----------------------------------------------------+---------+
</code></pre>

<p class="p">
In this case, we have already uploaded a Parquet file with a million rows of data to the
<code class="ph codeph">sample_data</code> directory underneath the <code class="ph codeph">impalademo</code> store on ADLS. This
session creates a table with matching column settings pointing to the corresponding location in ADLS, then
queries the table. Because the data is already in place on ADLS when the table is created, no
<code class="ph codeph">REFRESH</code> statement is required.
</p>
<pre class="pre codeblock"><code>[localhost:21000] > create table sample_data_adls
|
|
> (id int, id bigint, val int, zerofill string,
|
|
> name string, assertion boolean, city string, state string)
|
|
> stored as parquet location 'adl://impalademo.azuredatalakestore.net/sample_data';
|
|
[localhost:21000] > select count(*) from sample_data_adls;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 1000000 |
|
|
+----------+
|
|
[localhost:21000] > select count(*) howmany, assertion from sample_data_adls group by assertion;
|
|
+---------+-----------+
|
|
| howmany | assertion |
|
|
+---------+-----------+
|
|
| 667149 | true |
|
|
| 332851 | false |
|
|
+---------+-----------+
|
|
</code></pre>

</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title10" id="queries">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title10">Running and Tuning Impala Queries for Data Stored on ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Once the appropriate <code class="ph codeph">LOCATION</code> attributes are set up at the table or partition level, you
|
|
query data stored in ADLS exactly the same as data stored on HDFS or in HBase:
|
|
</p>
|
|
|
|
|
|
<ul class="ul">
|
|
<li class="li">
|
|
Queries against ADLS data support all the same file formats as for HDFS data.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Tables can be unpartitioned or partitioned. For partitioned tables, either manually construct paths in ADLS
|
|
corresponding to the HDFS directories representing partition key values, or use <code class="ph codeph">ALTER TABLE ...
|
|
ADD PARTITION</code> to set up the appropriate paths in ADLS.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
HDFS, Kudu, and HBase tables can be joined to ADLS tables, or ADLS tables can be joined with each other.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Authorization using the Ranger framework to control access to databases, tables, or columns works the
|
|
same whether the data is in HDFS or in ADLS.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
The <span class="keyword cmdname">catalogd</span> daemon caches metadata for both HDFS and ADLS tables. Use
|
|
<code class="ph codeph">REFRESH</code> and <code class="ph codeph">INVALIDATE METADATA</code> for ADLS tables in the same situations
|
|
where you would issue those statements for HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Queries against ADLS tables are subject to the same kinds of admission control and resource management as
|
|
HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
Metadata about ADLS tables is stored in the same metastore database as for HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
You can set up views referring to ADLS tables, the same as for HDFS tables.
|
|
</li>
|
|
|
|
|
|
<li class="li">
|
|
The <code class="ph codeph">COMPUTE STATS</code>, <code class="ph codeph">SHOW TABLE STATS</code>, and <code class="ph codeph">SHOW COLUMN
|
|
STATS</code> statements work for ADLS tables also.
|
|
</li>
|
|
|
|
</ul>
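
<p class="p">
A minimal sketch of gathering and inspecting statistics for an ADLS-backed table, using the
<code class="ph codeph">usa_cities_adls</code> table from the earlier example:
</p>

<pre class="pre codeblock"><code>[localhost:21000] > compute stats usa_cities_adls;
[localhost:21000] > show table stats usa_cities_adls;
[localhost:21000] > show column stats usa_cities_adls;
</code></pre>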

</div>
<div class="topic concept nested2" aria-labelledby="ariaid-title11" id="performance">
|
|
|
|
<h3 class="title topictitle3" id="ariaid-title11">Understanding and Tuning Impala Query Performance for ADLS Data</h3>
|
|
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Although Impala queries for data stored in ADLS might be less performant than queries against the
|
|
equivalent data stored in HDFS, you can still do some tuning. Here are techniques you can use to
|
|
interpret explain plans and profiles for queries against ADLS data, and tips to achieve the best
|
|
performance possible for such queries.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
All else being equal, performance is expected to be lower for queries running against data on ADLS rather
|
|
than HDFS. The actual mechanics of the <code class="ph codeph">SELECT</code> statement are somewhat different when the
|
|
data is in ADLS. Although the work is still distributed across the datanodes of the cluster, Impala might
|
|
parallelize the work for a distributed query differently for data on HDFS and ADLS. ADLS does not have the
|
|
same block notion as HDFS, so Impala uses heuristics to determine how to split up large ADLS files for
|
|
processing in parallel. Because all hosts can access any ADLS data file with equal efficiency, the
|
|
distribution of work might be different than for HDFS data, where the data blocks are physically read
|
|
using short-circuit local reads by hosts that contain the appropriate block replicas. Although the I/O to
|
|
read the ADLS data might be spread evenly across the hosts of the cluster, the fact that all data is
|
|
initially retrieved across the network means that the overall query performance is likely to be lower for
|
|
ADLS data than for HDFS data.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
Because ADLS does not expose the block sizes of data files the way HDFS does, any Impala
|
|
<code class="ph codeph">INSERT</code> or <code class="ph codeph">CREATE TABLE AS SELECT</code> statements use the
|
|
<code class="ph codeph">PARQUET_FILE_SIZE</code> query option setting to define the size of Parquet
|
|
data files. (Using a large block size is more important for Parquet tables than for
|
|
tables that use other file formats.)
|
|
</p>
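
<p class="p">
For example, a sketch of setting that option before writing Parquet files to an ADLS table.
The 256 MB value (expressed in bytes) and the table names are illustrative.
</p>

<pre class="pre codeblock"><code>[localhost:21000] > set PARQUET_FILE_SIZE=268435456;
[localhost:21000] > insert overwrite sales_adls select * from sales_staging;
</code></pre>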

<p class="p">
When optimizing aspects of complex queries such as the join order, Impala treats tables on HDFS and
ADLS the same way. Therefore, follow all the same tuning recommendations for ADLS tables as for HDFS ones,
such as using the <code class="ph codeph">COMPUTE STATS</code> statement to help Impala construct accurate estimates of
row counts and cardinality. See <a class="xref" href="impala_performance.html#performance">Tuning Impala for Performance</a> for details.
</p>

<p class="p">
In query profile reports, the numbers for <code class="ph codeph">BytesReadLocal</code>,
<code class="ph codeph">BytesReadShortCircuit</code>, <code class="ph codeph">BytesReadDataNodeCached</code>, and
<code class="ph codeph">BytesReadRemoteUnexpected</code> are blank because those metrics come from HDFS.
If you do see any indications that a query against an ADLS table performed <span class="q">"remote read"</span>
operations, do not be alarmed. That is expected because, by definition, all the I/O for ADLS tables involves
remote reads.
</p>

</div>
</div>
</div>
<div class="topic concept nested1" aria-labelledby="ariaid-title12" id="restrictions">
|
|
|
|
<h2 class="title topictitle2" id="ariaid-title12">Restrictions on Impala Support for ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
|
|
<p class="p">
|
|
Impala requires that the default filesystem for the cluster be HDFS. You cannot use ADLS as the only
|
|
filesystem in the cluster.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
Although ADLS is often used to store JSON-formatted data, the current Impala support for ADLS does not include
|
|
directly querying JSON data. For Impala queries, use data files in one of the file formats listed in
|
|
<a class="xref" href="impala_file_formats.html#file_formats">How Impala Works with Hadoop File Formats</a>. If you have data in JSON format, you can prepare a
|
|
flattened version of that data for querying by Impala as part of your ETL cycle.
|
|
</p>
|
|
|
|
|
|
<p class="p">
|
|
You cannot use the <code class="ph codeph">ALTER TABLE ... SET CACHED</code> statement for tables or partitions that are
|
|
located in ADLS.
|
|
</p>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
</div>
|
|
|
|
|
|
<div class="topic concept nested1" aria-labelledby="ariaid-title13" id="best_practices">
|
|
<h2 class="title topictitle2" id="ariaid-title13">Best Practices for Using Impala with ADLS</h2>
|
|
|
|
|
|
<div class="body conbody">
|
|
<p class="p">
|
|
The following guidelines represent best practices derived from testing and real-world experience with Impala on ADLS:
|
|
</p>
|
|
|
|
<ul class="ul">
|
|
<li class="li">
|
|
<p class="p">
|
|
Any reference to an ADLS location must be fully qualified. (This rule applies when
|
|
ADLS is not designated as the default filesystem.)
|
|
</p>
|
|
|
|
</li>
|
|
|
|
<li class="li">
|
|
<p class="p">
|
|
Set any appropriate configuration settings for <span class="keyword cmdname">impalad</span>.
|
|
</p>
|
|
|
|
</li>
|
|
|
|
</ul>
|
|
|
|
|
|
</div>
|
|
|
|
</div>
|
|
|
|
|
|
</body>
|
|
</html> |