<?xml version="1.0" encoding="UTF-8"?><!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="tutorial">

<title id="tutorials">Impala Tutorials</title>
<titlealts audience="PDF"><navtitle>Tutorials</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Tutorials"/>
<data name="Category" value="Getting Started"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Tables"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
This section includes tutorial scenarios that demonstrate how to begin using Impala once the software is
installed. It focuses on techniques for loading data, because once you have some data in tables and can query
that data, you can quickly progress to more advanced Impala features.
</p>

<note>
<p>
Where practical, the tutorials take you from <q>ground zero</q> to having the desired Impala tables and
data. In some cases, you might need to download additional files from outside sources, set up additional
software components, modify commands or scripts to fit your own configuration, or substitute your own
sample data.
</p>
</note>

<p>
Before trying these tutorial lessons, install Impala using one of these procedures:
</p>

<ul>
<li>
If you already have a CDH environment set up and just need to add Impala to it, follow the installation
process described in <xref href="impala_install.xml#install"/>. Make sure to also install the Hive
metastore service if you do not already have Hive configured.
</li>

<li>
To set up Impala and all its prerequisites at once, in a minimal configuration that you can use for
small-scale experiments, set up the Cloudera QuickStart VM, which includes CDH and Impala on CentOS.
Use this single-node VM to try out basic SQL functionality, not anything related to
performance and scalability. For more information, see
<xref href="http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html" scope="external" format="html">the
Cloudera QuickStart VM</xref>.
</li>
</ul>

<p outputclass="toc inpage"/>
</conbody>
|
|
|
|
<concept id="tut_beginner">
|
|
|
|
<title>Tutorials for Getting Started</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
These tutorials demonstrate the basics of using Impala. They are intended for first-time users, and for
|
|
trying out Impala on any new cluster to make sure the major components are working correctly.
|
|
</p>
|
|
|
|
<p outputclass="toc inpage"/>
|
|
</conbody>
|
|
|
|
<concept id="tutorial_explore">
|
|
|
|
<title>Explore a New Impala Instance</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This tutorial demonstrates techniques for finding your way around the tables and databases of an
|
|
unfamiliar (possibly empty) Impala instance.
|
|
</p>
|
|
|
|
<p>
|
|
When you connect to an Impala instance for the first time, you use the <codeph>SHOW DATABASES</codeph>
|
|
and <codeph>SHOW TABLES</codeph> statements to view the most common types of objects. Also, call the
|
|
<codeph>version()</codeph> function to confirm which version of Impala you are running; the version
|
|
number is important when consulting documentation and dealing with support issues.
|
|
</p>
|
|
|
|
<p>
|
|
A completely empty Impala instance contains no tables, but still has two databases:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<codeph>default</codeph>, where new tables are created when you do not specify any other database.
|
|
</li>
|
|
|
|
<li>
|
|
<codeph>_impala_builtins</codeph>, a system database used to hold all the built-in functions. (See the example following this list for one way to look inside it.)
|
|
</li>
|
|
</ul>
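<p>
For example, you can confirm that a particular built-in function exists by listing the contents of
<codeph>_impala_builtins</codeph> with a <codeph>SHOW FUNCTIONS</codeph> statement. This is only a
sketch; the exact set of functions returned depends on the Impala version you are running.
</p>

<codeblock>[localhost:21000] > show functions in _impala_builtins like 'length*';
</codeblock>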
|
|
|
|
<p>
|
|
The following example shows how to see the available databases, and the tables in each. If the list of
|
|
databases or tables is long, you can use wildcard notation to locate specific databases or tables based
|
|
on their names.
|
|
</p>
|
|
|
|
<codeblock>$ impala-shell -i localhost --quiet
|
|
Starting Impala Shell without Kerberos authentication
|
|
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
|
|
|
|
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
|
|
|
|
(Shell build version: Impala Shell v...
|
|
[localhost:21000] > select version();
|
|
+-------------------------------------------
|
|
| version()
|
|
+-------------------------------------------
|
|
| impalad version ...
|
|
| Built on ...
|
|
+-------------------------------------------
|
|
[localhost:21000] > show databases;
|
|
+--------------------------+
|
|
| name |
|
|
+--------------------------+
|
|
| _impala_builtins |
|
|
| ctas |
|
|
| d1 |
|
|
| d2 |
|
|
| d3 |
|
|
| default |
|
|
| explain_plans |
|
|
| external_table |
|
|
| file_formats |
|
|
| tpc |
|
|
+--------------------------+
|
|
[localhost:21000] > select current_database();
|
|
+--------------------+
|
|
| current_database() |
|
|
+--------------------+
|
|
| default |
|
|
+--------------------+
|
|
[localhost:21000] > show tables;
|
|
+-------+
|
|
| name |
|
|
+-------+
|
|
| ex_t |
|
|
| t1 |
|
|
+-------+
|
|
[localhost:21000] > show tables in d3;
|
|
|
|
[localhost:21000] > show tables in tpc;
|
|
+------------------------+
|
|
| name |
|
|
+------------------------+
|
|
| city |
|
|
| customer |
|
|
| customer_address |
|
|
| customer_demographics |
|
|
| household_demographics |
|
|
| item |
|
|
| promotion |
|
|
| store |
|
|
| store2 |
|
|
| store_sales |
|
|
| ticket_view |
|
|
| time_dim |
|
|
| tpc_tables |
|
|
+------------------------+
|
|
[localhost:21000] > show tables in tpc like 'customer*';
|
|
+-----------------------+
|
|
| name |
|
|
+-----------------------+
|
|
| customer |
|
|
| customer_address |
|
|
| customer_demographics |
|
|
+-----------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Once you know what tables and databases are available, you descend into a database with the
|
|
<codeph>USE</codeph> statement. To understand the structure of each table, you use the
|
|
<codeph>DESCRIBE</codeph> command. Once inside a database, you can issue statements such as
|
|
<codeph>INSERT</codeph> and <codeph>SELECT</codeph> that operate on particular tables.
|
|
</p>
|
|
|
|
<p>
|
|
The following example explores a database named <codeph>TPC</codeph> whose name we learned in the
|
|
previous example. It shows how to filter the table names within a database based on a search string,
|
|
examine the columns of a table, and run queries to examine the characteristics of the table data. For
|
|
example, for an unfamiliar table you might want to know the number of rows, the number of different
|
|
values for a column, and other properties such as whether the column contains any <codeph>NULL</codeph>
|
|
values. When sampling the actual data values from a table, use a <codeph>LIMIT</codeph> clause to avoid
|
|
excessive output if the table contains more rows or distinct values than you expect.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > use tpc;
|
|
[localhost:21000] > show tables like '*view*';
|
|
+-------------+
|
|
| name |
|
|
+-------------+
|
|
| ticket_view |
|
|
+-------------+
|
|
[localhost:21000] > describe city;
|
|
+-------------+--------+---------+
|
|
| name | type | comment |
|
|
+-------------+--------+---------+
|
|
| id | int | |
|
|
| name | string | |
|
|
| countrycode | string | |
|
|
| district | string | |
|
|
| population | int | |
|
|
+-------------+--------+---------+
|
|
[localhost:21000] > select count(*) from city;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 0 |
|
|
+----------+
|
|
[localhost:21000] > desc customer;
|
|
+------------------------+--------+---------+
|
|
| name | type | comment |
|
|
+------------------------+--------+---------+
|
|
| c_customer_sk | int | |
|
|
| c_customer_id | string | |
|
|
| c_current_cdemo_sk | int | |
|
|
| c_current_hdemo_sk | int | |
|
|
| c_current_addr_sk | int | |
|
|
| c_first_shipto_date_sk | int | |
|
|
| c_first_sales_date_sk | int | |
|
|
| c_salutation | string | |
|
|
| c_first_name | string | |
|
|
| c_last_name | string | |
|
|
| c_preferred_cust_flag | string | |
|
|
| c_birth_day | int | |
|
|
| c_birth_month | int | |
|
|
| c_birth_year | int | |
|
|
| c_birth_country | string | |
|
|
| c_login | string | |
|
|
| c_email_address | string | |
|
|
| c_last_review_date | string | |
|
|
+------------------------+--------+---------+
|
|
[localhost:21000] > select count(*) from customer;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 100000 |
|
|
+----------+
|
|
[localhost:21000] > select count(distinct c_birth_month) from customer;
|
|
+-------------------------------+
|
|
| count(distinct c_birth_month) |
|
|
+-------------------------------+
|
|
| 12 |
|
|
+-------------------------------+
|
|
[localhost:21000] > select count(*) from customer where c_email_address is null;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 0 |
|
|
+----------+
|
|
[localhost:21000] > select distinct c_salutation from customer limit 10;
|
|
+--------------+
|
|
| c_salutation |
|
|
+--------------+
|
|
| Mr. |
|
|
| Ms. |
|
|
| Dr. |
|
|
| |
|
|
| Miss |
|
|
| Sir |
|
|
| Mrs. |
|
|
+--------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
When you graduate from read-only exploration, you use statements such as <codeph>CREATE DATABASE</codeph>
|
|
and <codeph>CREATE TABLE</codeph> to set up your own database objects.
|
|
</p>
|
|
|
|
<p>
|
|
The following example demonstrates creating a new database holding a new table. Although the last example
|
|
ended inside the <codeph>TPC</codeph> database, the new <codeph>EXPERIMENTS</codeph> database is not
|
|
nested inside <codeph>TPC</codeph>; all databases are arranged in a single top-level list.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create database experiments;
|
|
[localhost:21000] > show databases;
|
|
+--------------------------+
|
|
| name |
|
|
+--------------------------+
|
|
| _impala_builtins |
|
|
| ctas |
|
|
| d1 |
|
|
| d2 |
|
|
| d3 |
|
|
| default |
|
|
| experiments |
|
|
| explain_plans |
|
|
| external_table |
|
|
| file_formats |
|
|
| tpc |
|
|
+--------------------------+
|
|
[localhost:21000] > show databases like 'exp*';
|
|
+---------------+
|
|
| name |
|
|
+---------------+
|
|
| experiments |
|
|
| explain_plans |
|
|
+---------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example creates a new table, <codeph>T1</codeph>. To illustrate a common mistake, it creates this table inside
|
|
the wrong database, the <codeph>TPC</codeph> database where the previous example ended. The <codeph>ALTER
|
|
TABLE</codeph> statement lets you move the table to the intended database, <codeph>EXPERIMENTS</codeph>, as part of a rename operation.
|
|
The <codeph>USE</codeph> statement is always needed to switch to a new database, and the
|
|
<codeph>current_database()</codeph> function confirms which database the session is in, to avoid these
|
|
kinds of mistakes.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table t1 (x int);
|
|
|
|
[localhost:21000] > show tables;
|
|
+------------------------+
|
|
| name |
|
|
+------------------------+
|
|
| city |
|
|
| customer |
|
|
| customer_address |
|
|
| customer_demographics |
|
|
| household_demographics |
|
|
| item |
|
|
| promotion |
|
|
| store |
|
|
| store2 |
|
|
| store_sales |
|
|
| t1 |
|
|
| ticket_view |
|
|
| time_dim |
|
|
| tpc_tables |
|
|
+------------------------+
|
|
[localhost:21000] > select current_database();
|
|
+--------------------+
|
|
| current_database() |
|
|
+--------------------+
|
|
| tpc |
|
|
+--------------------+
|
|
[localhost:21000] > alter table t1 rename to experiments.t1;
|
|
[localhost:21000] > use experiments;
|
|
[localhost:21000] > show tables;
|
|
+------+
|
|
| name |
|
|
+------+
|
|
| t1 |
|
|
+------+
|
|
[localhost:21000] > select current_database();
|
|
+--------------------+
|
|
| current_database() |
|
|
+--------------------+
|
|
| experiments |
|
|
+--------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
For your initial experiments with tables, you can use ones with just a few columns and a few rows, and
|
|
text-format data files.
|
|
</p>
|
|
|
|
<note>
|
|
As you graduate to more realistic scenarios, you will use more elaborate tables with many columns,
|
|
features such as partitioning, and file formats such as Parquet. When dealing with realistic data
|
|
volumes, you will bring in data using <codeph>LOAD DATA</codeph> or <codeph>INSERT ... SELECT</codeph>
|
|
statements to operate on millions or billions of rows at once.
|
|
</note>
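<p>
As a small preview of that style of bulk operation, an <codeph>INSERT ... SELECT</codeph> statement
copies an entire result set into a table in a single statement. The following sketch assumes a
hypothetical staging table named <codeph>staging_t1</codeph> with a compatible <codeph>x</codeph>
column; it is not a table created elsewhere in this tutorial.
</p>

<codeblock>-- Copy (and optionally filter or transform) rows from one table into another.
insert into t1 select x from staging_t1 where x is not null;
</codeblock>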
|
|
|
|
<p>
|
|
The following example sets up a couple of simple tables with a few rows, and performs queries involving
|
|
sorting, aggregate functions and joins.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > insert into t1 values (1), (3), (2), (4);
|
|
[localhost:21000] > select x from t1 order by x desc;
|
|
+---+
|
|
| x |
|
|
+---+
|
|
| 4 |
|
|
| 3 |
|
|
| 2 |
|
|
| 1 |
|
|
+---+
|
|
[localhost:21000] > select min(x), max(x), sum(x), avg(x) from t1;
|
|
+--------+--------+--------+--------+
|
|
| min(x) | max(x) | sum(x) | avg(x) |
|
|
+--------+--------+--------+--------+
|
|
| 1 | 4 | 10 | 2.5 |
|
|
+--------+--------+--------+--------+
|
|
|
|
[localhost:21000] > create table t2 (id int, word string);
|
|
[localhost:21000] > insert into t2 values (1, "one"), (3, "three"), (5, 'five');
|
|
[localhost:21000] > select word from t1 join t2 on (t1.x = t2.id);
|
|
+-------+
|
|
| word |
|
|
+-------+
|
|
| one |
|
|
| three |
|
|
+-------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
After completing this tutorial, you should now know:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
How to tell which version of Impala is running on your system.
|
|
</li>
|
|
|
|
<li>
|
|
How to find the names of databases in an Impala instance, either displaying the full list or
|
|
searching for specific names.
|
|
</li>
|
|
|
|
<li>
|
|
How to find the names of tables in an Impala database, either displaying the full list or
|
|
searching for specific names.
|
|
</li>
|
|
|
|
<li>
|
|
How to switch between databases and check which database you are currently in.
|
|
</li>
|
|
|
|
<li>
|
|
How to learn the column names and types of a table.
|
|
</li>
|
|
|
|
<li>
|
|
How to create databases and tables, insert small amounts of test data, and run simple queries.
|
|
</li>
|
|
</ul>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept audience="Cloudera" id="tutorial_quickstart_vm">
|
|
|
|
<title>Setting Up Impala with the QuickStart VM</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This tutorial scenario shows how to set up the Cloudera QuickStart VM, load some sample data, and begin
|
|
experimenting with Impala inside this virtualized environment.
|
|
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept audience="Cloudera" id="tutorial_hdfs">
|
|
|
|
<title>Exploring the HDFS Directory Structure</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This tutorial scenario illustrates the HDFS directory structures that correspond to various
|
|
Impala databases, tables, and partitions. It also shows how data directories are shared between
|
|
Impala and Hive, because of the shared metastore database.
|
|
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept audience="Cloudera" id="tutorial_external_table">
|
|
|
|
<title>Define an External Table for Existing Data Files</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
In this tutorial scenario, you create a simple text-format data file in HDFS and then define an Impala
|
|
table that refers to the data in its original location.
|
|
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="tutorial_csv_setup">
|
|
|
|
<title>Load CSV Data from Local Files</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This scenario illustrates how to create some very small tables, suitable for first-time users to
|
|
experiment with Impala SQL features. <codeph>TAB1</codeph> and <codeph>TAB2</codeph> are loaded with data
|
|
from files in HDFS. A subset of data is copied from <codeph>TAB1</codeph> into <codeph>TAB3</codeph>.
|
|
</p>
|
|
|
|
<p>
|
|
Populate HDFS with the data you want to query. To begin this process, create one or more new
|
|
subdirectories underneath your user directory in HDFS. The data for each table resides in a separate
|
|
subdirectory. Substitute your own username for <codeph>cloudera</codeph> where appropriate. This example
|
|
uses the <codeph>-p</codeph> option with the <codeph>mkdir</codeph> operation to create any necessary
|
|
parent directories if they do not already exist.
|
|
</p>
|
|
|
|
<codeblock>$ whoami
|
|
cloudera
|
|
$ hdfs dfs -ls /user
|
|
Found 3 items
|
|
drwxr-xr-x - cloudera cloudera 0 2013-04-22 18:54 /user/cloudera
|
|
drwxrwx--- - mapred mapred 0 2013-03-15 20:11 /user/history
|
|
drwxr-xr-x - hue supergroup 0 2013-03-15 20:10 /user/hive
|
|
<!-- $ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 -->
|
|
$ hdfs dfs -mkdir -p /user/cloudera/sample_data/tab1 /user/cloudera/sample_data/tab2</codeblock>
|
|
|
|
<p>
|
|
Here is some sample data, for two tables named <codeph>TAB1</codeph> and <codeph>TAB2</codeph>.
|
|
</p>
|
|
|
|
<p>
|
|
Copy the following content to <codeph>.csv</codeph> files in your local filesystem:
|
|
</p>
|
|
|
|
<p>
|
|
<filepath>tab1.csv</filepath>:
|
|
</p>
|
|
|
|
<codeblock>1,true,123.123,2012-10-24 08:55:00
|
|
2,false,1243.5,2012-10-25 13:40:00
|
|
3,false,24453.325,2008-08-22 09:33:21.123
|
|
4,false,243423.325,2007-05-12 22:32:21.33454
|
|
5,true,243.325,1953-04-22 09:11:33
|
|
</codeblock>
|
|
|
|
<p>
|
|
<filepath>tab2.csv</filepath>:
|
|
</p>
|
|
|
|
<codeblock>1,true,12789.123
|
|
2,false,1243.5
|
|
3,false,24453.325
|
|
4,false,2423.3254
|
|
5,true,243.325
|
|
60,false,243565423.325
|
|
70,true,243.325
|
|
80,false,243423.325
|
|
90,true,243.325
|
|
</codeblock>
|
|
|
|
<p>
|
|
Put each <codeph>.csv</codeph> file into a separate HDFS directory using commands like the following,
|
|
which use paths available in the Impala Demo VM:
|
|
</p>
|
|
|
|
<codeblock><!-- $ hdfs dfs -mkdir /user/cloudera/tab1 -->$ hdfs dfs -put tab1.csv /user/cloudera/sample_data/tab1
|
|
$ hdfs dfs -ls /user/cloudera/sample_data/tab1
|
|
Found 1 items
|
|
-rw-r--r-- 1 cloudera cloudera 192 2013-04-02 20:08 /user/cloudera/sample_data/tab1/tab1.csv
|
|
|
|
<!-- $ hdfs dfs -mkdir /user/cloudera/tab2 -->
|
|
$ hdfs dfs -put tab2.csv /user/cloudera/sample_data/tab2
|
|
$ hdfs dfs -ls /user/cloudera/sample_data/tab2
|
|
Found 1 items
|
|
-rw-r--r-- 1 cloudera cloudera 158 2013-04-02 20:09 /user/cloudera/sample_data/tab2/tab2.csv
|
|
</codeblock>
|
|
|
|
<p>
|
|
The name of each data file is not significant. In fact, when Impala examines the contents of the data
|
|
directory for the first time, it considers all files in the directory to make up the data of the table,
|
|
regardless of how many files there are or what the files are named.
|
|
</p>
|
|
|
|
<p>
|
|
To understand what paths are available within your own HDFS filesystem and what the permissions are for
|
|
the various directories and files, issue <codeph>hdfs dfs -ls /</codeph> and work your way down the tree
|
|
doing <codeph>-ls</codeph> operations for the various directories.
|
|
</p>
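<p>
For example, a quick top-down exploration might look like the following; the directories you see
will differ depending on your cluster and user name.
</p>

<codeblock>$ hdfs dfs -ls /
$ hdfs dfs -ls /user
$ hdfs dfs -ls /user/cloudera
$ hdfs dfs -ls /user/cloudera/sample_data
</codeblock>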
|
|
|
|
<p>
|
|
Use the <codeph>impala-shell</codeph> command to create tables, either interactively or through a SQL
|
|
script.
|
|
</p>
|
|
|
|
<p>
|
|
The following example shows creating three tables. For each table, the example shows creating columns
|
|
with various attributes such as Boolean or integer types. The example also includes commands that provide
|
|
information about how the data is formatted, such as fields being terminated by commas, which makes sense in
|
|
the case of importing data from a <codeph>.csv</codeph> file. Where we already have <codeph>.csv</codeph>
|
|
files containing data in the HDFS directory tree, we specify the location of the directory containing the
|
|
appropriate <codeph>.csv</codeph> file. Impala considers all the data from all the files in that
|
|
directory to represent the data for the table.
|
|
</p>
|
|
|
|
<codeblock>DROP TABLE IF EXISTS tab1;
|
|
-- The EXTERNAL clause means the data is located outside the central location
|
|
-- for Impala data files and is preserved when the associated Impala table is dropped.
|
|
-- We expect the data to already exist in the directory specified by the LOCATION clause.
|
|
CREATE EXTERNAL TABLE tab1
|
|
(
|
|
id INT,
|
|
col_1 BOOLEAN,
|
|
col_2 DOUBLE,
|
|
col_3 TIMESTAMP
|
|
)
|
|
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
|
|
LOCATION '/user/cloudera/sample_data/tab1';
|
|
|
|
DROP TABLE IF EXISTS tab2;
|
|
-- TAB2 is an external table, similar to TAB1.
|
|
CREATE EXTERNAL TABLE tab2
|
|
(
|
|
id INT,
|
|
col_1 BOOLEAN,
|
|
col_2 DOUBLE
|
|
)
|
|
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
|
|
LOCATION '/user/cloudera/sample_data/tab2';
|
|
|
|
DROP TABLE IF EXISTS tab3;
|
|
-- Leaving out the EXTERNAL clause means the data will be managed
|
|
-- in the central Impala data directory tree. Rather than reading
|
|
-- existing data files when the table is created, we load the
|
|
-- data after creating the table.
|
|
CREATE TABLE tab3
|
|
(
|
|
id INT,
|
|
col_1 BOOLEAN,
|
|
col_2 DOUBLE,
|
|
month INT,
|
|
day INT
|
|
)
|
|
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
|
|
</codeblock>
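<p>
If you save these statements in a file, you can run them all at once through
<codeph>impala-shell</codeph> rather than typing them interactively. The file name
<filepath>create_sample_tables.sql</filepath> below is just a placeholder for whatever name you
saved the script under.
</p>

<codeblock>$ impala-shell -i localhost -f create_sample_tables.sql
</codeblock>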
|
|
|
|
<note>
|
|
Getting through these <codeph>CREATE TABLE</codeph> statements successfully is an important validation
|
|
step to confirm everything is configured correctly with the Hive metastore and HDFS permissions. If you
|
|
receive any errors during the <codeph>CREATE TABLE</codeph> statements:
|
|
<ul>
|
|
<li>
|
|
Make sure you followed the installation instructions closely, in
|
|
<xref href="impala_install.xml#install"/>.
|
|
</li>
|
|
|
|
<li>
|
|
Make sure the <codeph>hive.metastore.warehouse.dir</codeph> property points to a directory that
|
|
Impala can write to. The ownership should be <codeph>hive:hive</codeph>, and the
|
|
<codeph>impala</codeph> user should also be a member of the <codeph>hive</codeph> group.
|
|
</li>
|
|
|
|
<li>
|
|
If the value of <codeph>hive.metastore.warehouse.dir</codeph> is different in the Cloudera Manager
|
|
dialogs and in the Hive shell, you might need to
|
|
<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_managing_roles.html" scope="external" format="html">designate
|
|
the hosts running <cmdname>impalad</cmdname> with the <q>gateway</q> role for Hive</xref>, and
|
|
<xref href="http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_client_config.html" scope="external" format="html">deploy
|
|
the client configuration files</xref> to those hosts.
|
|
</li>
|
|
</ul>
|
|
</note>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="tutorial_create_table">
|
|
|
|
<title>Point an Impala Table at Existing Data Files</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
A convenient way to set up data for Impala to access is to use an external table, where the data already
|
|
exists in a set of HDFS files and you just point the Impala table at the directory containing those
|
|
files. For example, you might run in <codeph>impala-shell</codeph> a <codeph>*.sql</codeph> file with
|
|
contents similar to the following, to create an Impala table that accesses an existing data file used by
|
|
Hive.
|
|
</p>
|
|
|
|
<p>
|
|
The following examples set up 2 tables, referencing the paths and sample data supplied with the Cloudera
|
|
QuickStart VM. For historical reasons, the data physically resides in an HDFS directory tree under
|
|
<filepath>/user/hive</filepath>, although this particular data is entirely managed by Impala rather than
|
|
Hive. When we create an external table, we specify the directory containing one or more data files, and
|
|
Impala queries the combined content of all the files inside that directory. Here is how we examine the
|
|
directories and files within the HDFS filesystem:
|
|
</p>
|
|
|
|
<codeblock>$ cd ~/cloudera/datasets
|
|
$ ./tpcds-setup.sh
|
|
... Downloads and unzips the kit, builds the data and loads it into HDFS ...
|
|
$ hdfs dfs -ls /user/hive/tpcds/customer
|
|
Found 1 items
|
|
-rw-r--r-- 1 cloudera supergroup 13209372 2013-03-22 18:09 /user/hive/tpcds/customer/customer.dat
|
|
$ hdfs dfs -cat /user/hive/tpcds/customer/customer.dat | more
|
|
1|AAAAAAAABAAAAAAA|980124|7135|32946|2452238|2452208|Mr.|Javier|Lewis|Y|9|12|1936|CHILE||Javie
|
|
r.Lewis@VFAxlnZEvOx.org|2452508|
|
|
2|AAAAAAAACAAAAAAA|819667|1461|31655|2452318|2452288|Dr.|Amy|Moses|Y|9|4|1966|TOGO||Amy.Moses@
|
|
Ovk9KjHH.com|2452318|
|
|
3|AAAAAAAADAAAAAAA|1473522|6247|48572|2449130|2449100|Miss|Latisha|Hamilton|N|18|9|1979|NIUE||
|
|
Latisha.Hamilton@V.com|2452313|
|
|
4|AAAAAAAAEAAAAAAA|1703214|3986|39558|2450030|2450000|Dr.|Michael|White|N|7|6|1983|MEXICO||Mic
|
|
hael.White@i.org|2452361|
|
|
5|AAAAAAAAFAAAAAAA|953372|4470|36368|2449438|2449408|Sir|Robert|Moran|N|8|5|1956|FIJI||Robert.
|
|
Moran@Hh.edu|2452469|
|
|
...
|
|
</codeblock>
|
|
|
|
<p>
|
|
Here is a SQL script to set up Impala tables pointing to some of these data files in HDFS.
|
|
(The script in the VM sets up tables like this through Hive; ignore those tables
|
|
for purposes of this demonstration.)
|
|
Save the following as <filepath>customer_setup.sql</filepath>:
|
|
</p>
|
|
|
|
<codeblock>--
|
|
-- store_sales fact table and surrounding dimension tables only
|
|
--
|
|
create database tpcds;
|
|
use tpcds;
|
|
|
|
drop table if exists customer;
|
|
create external table customer
|
|
(
|
|
c_customer_sk int,
|
|
c_customer_id string,
|
|
c_current_cdemo_sk int,
|
|
c_current_hdemo_sk int,
|
|
c_current_addr_sk int,
|
|
c_first_shipto_date_sk int,
|
|
c_first_sales_date_sk int,
|
|
c_salutation string,
|
|
c_first_name string,
|
|
c_last_name string,
|
|
c_preferred_cust_flag string,
|
|
c_birth_day int,
|
|
c_birth_month int,
|
|
c_birth_year int,
|
|
c_birth_country string,
|
|
c_login string,
|
|
c_email_address string,
|
|
c_last_review_date string
|
|
)
|
|
row format delimited fields terminated by '|'
|
|
location '/user/hive/tpcds/customer';
|
|
|
|
drop table if exists customer_address;
|
|
create external table customer_address
|
|
(
|
|
ca_address_sk int,
|
|
ca_address_id string,
|
|
ca_street_number string,
|
|
ca_street_name string,
|
|
ca_street_type string,
|
|
ca_suite_number string,
|
|
ca_city string,
|
|
ca_county string,
|
|
ca_state string,
|
|
ca_zip string,
|
|
ca_country string,
|
|
ca_gmt_offset float,
|
|
ca_location_type string
|
|
)
|
|
row format delimited fields terminated by '|'
|
|
location '/user/hive/tpcds/customer_address';
|
|
</codeblock>
|
|
|
|
<p>
|
|
We would run this script with a command such as:
|
|
<codeblock>impala-shell -i localhost -f customer_setup.sql</codeblock>
|
|
</p>
|
|
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="tutorial_describe_impala">
|
|
|
|
<title>Describe the Impala Table</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Now that you have updated the database metadata that Impala caches, you can confirm that the expected
|
|
tables are accessible by Impala and examine the attributes of one of the tables. We created these tables
|
|
in the database named <codeph>default</codeph>. If the tables were in a database other than the default,
|
|
we would issue a command <codeph>use <varname>db_name</varname> </codeph> to switch to that database
|
|
before examining or querying its tables. We could also qualify the name of a table by prepending the
|
|
database name, for example <codeph>default.customer</codeph> and <codeph>default.customer_address</codeph>.
|
|
</p>
|
|
|
|
<codeblock>[impala-host:21000] > show databases
|
|
Query finished, fetching results ...
|
|
default
|
|
Returned 1 row(s) in 0.00s
|
|
[impala-host:21000] > show tables
|
|
Query finished, fetching results ...
|
|
customer
|
|
customer_address
|
|
Returned 2 row(s) in 0.00s
|
|
[impala-host:21000] > describe customer_address
|
|
+------------------+--------+---------+
|
|
| name | type | comment |
|
|
+------------------+--------+---------+
|
|
| ca_address_sk | int | |
|
|
| ca_address_id | string | |
|
|
| ca_street_number | string | |
|
|
| ca_street_name | string | |
|
|
| ca_street_type | string | |
|
|
| ca_suite_number | string | |
|
|
| ca_city | string | |
|
|
| ca_county | string | |
|
|
| ca_state | string | |
|
|
| ca_zip | string | |
|
|
| ca_country | string | |
|
|
| ca_gmt_offset | float | |
|
|
| ca_location_type | string | |
|
|
+------------------+--------+---------+
|
|
Returned 13 row(s) in 0.01s
|
|
</codeblock>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="tutorial_query_impala">
|
|
|
|
<title>Query the Impala Table</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can query data contained in the tables. Impala coordinates the query execution across a single node
|
|
or multiple nodes depending on your configuration, without the overhead of running MapReduce jobs to
|
|
perform the intermediate processing.
|
|
</p>
|
|
|
|
<p>
|
|
There are a variety of ways to execute queries on Impala:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
Using the <codeph>impala-shell</codeph> command in interactive mode:
|
|
<codeblock>$ impala-shell -i impala-host
|
|
Connected to localhost:21000
|
|
[impala-host:21000] > select count(*) from customer_address;
|
|
50000
|
|
Returned 1 row(s) in 0.37s
|
|
</codeblock>
|
|
</li>
|
|
|
|
<li>
|
|
Passing a set of commands contained in a file:
|
|
<codeblock>$ impala-shell -i impala-host -f myquery.sql
|
|
Connected to localhost:21000
|
|
50000
|
|
Returned 1 row(s) in 0.19s</codeblock>
|
|
</li>
|
|
|
|
<li>
|
|
Passing a single command to the <codeph>impala-shell</codeph> command. The query is executed, the
|
|
results are returned, and the shell exits. Make sure to quote the command, preferably with single
|
|
quotation marks to avoid shell expansion of characters such as <codeph>*</codeph>.
|
|
<codeblock>$ impala-shell -i impala-host -q 'select count(*) from customer_address'
|
|
Connected to localhost:21000
|
|
50000
|
|
Returned 1 row(s) in 0.29s</codeblock>
|
|
</li>
|
|
</ul>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="tutorial_etl">
|
|
|
|
<title>Data Loading and Querying Examples</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This section describes how to create some sample tables and load data into them. These tables can then be
|
|
queried using the Impala shell.
|
|
</p>
|
|
</conbody>
|
|
|
|
<concept id="tutorial_loading">
|
|
|
|
<title>Loading Data</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Loading data involves:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
Establishing a data set. The example below uses <codeph>.csv</codeph> files.
|
|
</li>
|
|
|
|
<li>
|
|
Creating tables to which to load data.
|
|
</li>
|
|
|
|
<li>
|
|
Loading the data into the tables you created. (A sketch of this step follows the list.)
|
|
</li>
|
|
</ul>
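<p>
The <codeph>hdfs dfs -put</codeph> commands shown earlier already placed the <codeph>.csv</codeph>
files in the directories that the external tables <codeph>TAB1</codeph> and <codeph>TAB2</codeph>
point to, so those tables are queryable as soon as they are created. If your data files start out
somewhere else in HDFS, a <codeph>LOAD DATA</codeph> statement such as the following sketch moves
them into the directory for a table; the staging path here is a hypothetical placeholder.
</p>

<codeblock>load data inpath '/user/cloudera/staging/tab1_extra.csv' into table tab1;
</codeblock>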
|
|
|
|
<!--
|
|
<section id="tut_hdfs_perms">
|
|
|
|
<title>
|
|
Modifying Directory Permissions
|
|
</title>
|
|
|
|
<p>
|
|
Change permission settings so Hive and Impala are both able read and write to the Hive warehouse directory.
|
|
This process involves changing permissions for the root warehouse directory, then modifying the client
|
|
<codeph>hive-site.xml</codeph> file so newly created files inherit these permissions.
|
|
</p>
|
|
|
|
<p>
|
|
Begin by modifying the file permissions. For example, if the warehouse directory is
|
|
<codeph>/user/hive/warehouse</codeph>, issue the following command:
|
|
</p>
|
|
|
|
<codeblock>hdfs dfs -chmod -R 775 /user/hive/warehouse</codeblock>
|
|
<p>
|
|
Add the following property to the Hive client <codeph>hive-site.xml</codeph> file:
|
|
</p>
|
|
|
|
<codeblock><property>
|
|
<name>hive.warehouse.subdir.inherit.perms</name>
|
|
<value>true</value>
|
|
<description>Set this to true if the table directories should inherit the
|
|
permission of the warehouse or database directory instead of being created
|
|
with the permissions derived from dfs umask</description>
|
|
</property></codeblock>
|
|
|
|
</section>
|
|
-->
|
|
|
|
<!-- To do: This is from old days when Impala didn't support LOAD DATA; re-do to show in impala-shell.
|
|
<section id="tut_loading_csv">
|
|
|
|
<title>
|
|
Loading .csv Data into Tables
|
|
</title>
|
|
|
|
<p>
|
|
In the Hive shell, load data into TAB1 and TAB2:
|
|
</p>
|
|
|
|
|
|
<codeblock>LOAD DATA LOCAL INPATH 'tab1.csv' OVERWRITE INTO TABLE tab1;
|
|
LOAD DATA LOCAL INPATH 'tab2.csv' OVERWRITE INTO TABLE tab2;</codeblock>
|
|
|
|
</section>
|
|
-->
|
|
|
|
<!--
|
|
<section id="tut_updating_metadata">
|
|
|
|
<title>
|
|
Updating Metadata
|
|
</title>
|
|
|
|
<p>
|
|
While you have created new tables and loaded data, Impala does not have information about these new tables. To
|
|
enable Impala to work with the data in these new tables, connect to an Impala host and then refresh the Impala
|
|
metadata. For example, if you had an Impala host named <codeph>impala-host</codeph> using the default
|
|
port, you would issue the following command:
|
|
</p>
|
|
|
|
<codeblock>$ impala-shell
|
|
[Not connected] > connect impala-host
|
|
[impala-host:21000] > refresh</codeblock>
|
|
|
|
</section>
|
|
-->
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="tutorial_queries">
|
|
|
|
<title>Sample Queries</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
To run these sample queries, create a SQL query file <codeph>query.sql</codeph>, copy and paste each
|
|
query into the query file, and then run the query file using the shell. For example, to run
|
|
<codeph>query.sql</codeph> on <codeph>impala-host</codeph>, you might use the command:
|
|
</p>
|
|
|
|
<codeblock>impala-shell -i impala-host -f query.sql</codeblock>
|
|
|
|
<p>
|
|
The examples and results below assume you have loaded the sample data into the tables as described
|
|
above.
|
|
</p>
|
|
|
|
<example>
|
|
|
|
<title>Example: Examining Contents of Tables</title>
|
|
|
|
<p>
|
|
Let's start by verifying that the tables do contain the data we expect. Because Impala often deals
|
|
with tables containing millions or billions of rows, when examining tables of unknown size, include
|
|
the <codeph>LIMIT</codeph> clause to avoid huge amounts of unnecessary output, as in the final query.
|
|
(If your interactive query starts displaying an unexpected volume of data, press
|
|
<codeph>Ctrl-C</codeph> in <codeph>impala-shell</codeph> to cancel the query.)
|
|
</p>
|
|
|
|
<codeblock>SELECT * FROM tab1;
|
|
SELECT * FROM tab2;
|
|
SELECT * FROM tab2 LIMIT 5;</codeblock>
|
|
|
|
<p>
|
|
Results:
|
|
</p>
|
|
|
|
<codeblock>+----+-------+------------+-------------------------------+
|
|
| id | col_1 | col_2 | col_3 |
|
|
+----+-------+------------+-------------------------------+
|
|
| 1 | true | 123.123 | 2012-10-24 08:55:00 |
|
|
| 2 | false | 1243.5 | 2012-10-25 13:40:00 |
|
|
| 3 | false | 24453.325 | 2008-08-22 09:33:21.123000000 |
|
|
| 4 | false | 243423.325 | 2007-05-12 22:32:21.334540000 |
|
|
| 5 | true | 243.325 | 1953-04-22 09:11:33 |
|
|
+----+-------+------------+-------------------------------+
|
|
|
|
+----+-------+---------------+
|
|
| id | col_1 | col_2 |
|
|
+----+-------+---------------+
|
|
| 1 | true | 12789.123 |
|
|
| 2 | false | 1243.5 |
|
|
| 3 | false | 24453.325 |
|
|
| 4 | false | 2423.3254 |
|
|
| 5 | true | 243.325 |
|
|
| 60 | false | 243565423.325 |
|
|
| 70 | true | 243.325 |
|
|
| 80 | false | 243423.325 |
|
|
| 90 | true | 243.325 |
|
|
+----+-------+---------------+
|
|
|
|
+----+-------+-----------+
|
|
| id | col_1 | col_2 |
|
|
+----+-------+-----------+
|
|
| 1 | true | 12789.123 |
|
|
| 2 | false | 1243.5 |
|
|
| 3 | false | 24453.325 |
|
|
| 4 | false | 2423.3254 |
|
|
| 5 | true | 243.325 |
|
|
+----+-------+-----------+</codeblock>
|
|
|
|
</example>
|
|
|
|
<example>
|
|
|
|
<title>Example: Aggregate and Join</title>
|
|
|
|
<codeblock>SELECT tab1.col_1, MAX(tab2.col_2), MIN(tab2.col_2)
|
|
FROM tab2 JOIN tab1 USING (id)
|
|
GROUP BY col_1 ORDER BY 1 LIMIT 5;</codeblock>
|
|
|
|
<p>
|
|
Results:
|
|
</p>
|
|
|
|
<codeblock>+-------+-----------------+-----------------+
|
|
| col_1 | max(tab2.col_2) | min(tab2.col_2) |
|
|
+-------+-----------------+-----------------+
|
|
| false | 24453.325 | 1243.5 |
|
|
| true | 12789.123 | 243.325 |
|
|
+-------+-----------------+-----------------+</codeblock>
|
|
|
|
</example>
|
|
|
|
<example>
|
|
|
|
<title>Example: Subquery, Aggregate and Joins</title>
|
|
|
|
<codeblock>SELECT tab2.*
|
|
FROM tab2,
|
|
(SELECT tab1.col_1, MAX(tab2.col_2) AS max_col2
|
|
FROM tab2, tab1
|
|
WHERE tab1.id = tab2.id
|
|
GROUP BY col_1) subquery1
|
|
WHERE subquery1.max_col2 = tab2.col_2;</codeblock>
|
|
|
|
<p>
|
|
Results:
|
|
</p>
|
|
|
|
<codeblock>+----+-------+-----------+
|
|
| id | col_1 | col_2 |
|
|
+----+-------+-----------+
|
|
| 1 | true | 12789.123 |
|
|
| 3 | false | 24453.325 |
|
|
+----+-------+-----------+</codeblock>
|
|
|
|
</example>
|
|
|
|
<example>
|
|
|
|
<title>Example: INSERT Query</title>
|
|
|
|
<codeblock>INSERT OVERWRITE TABLE tab3
|
|
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
|
|
FROM tab1 WHERE YEAR(col_3) = 2012;</codeblock>
|
|
|
|
<p>
|
|
Query <codeph>TAB3</codeph> to check the result:
|
|
</p>
|
|
|
|
<codeblock>SELECT * FROM tab3;
|
|
</codeblock>
|
|
|
|
<p>
|
|
Results:
|
|
</p>
|
|
|
|
<codeblock>+----+-------+---------+-------+-----+
|
|
| id | col_1 | col_2 | month | day |
|
|
+----+-------+---------+-------+-----+
|
|
| 1 | true | 123.123 | 10 | 24 |
|
|
| 2 | false | 1243.5 | 10 | 25 |
|
|
+----+-------+---------+-------+-----+</codeblock>
|
|
|
|
</example>
|
|
</conbody>
|
|
</concept>
|
|
</concept>
|
|
</concept>
|
|
|
|
<concept id="tut_advanced">
|
|
|
|
<title>Advanced Tutorials</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
These tutorials walk you through advanced scenarios or specialized features.
|
|
</p>
|
|
|
|
<p outputclass="toc inpage"/>
|
|
</conbody>
|
|
|
|
<concept id="tut_external_partition_data">
|
|
|
|
<title>Attaching an External Partitioned Table to an HDFS Directory Structure</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This tutorial shows how you might set up a directory tree in HDFS, put data files into the lowest-level
|
|
subdirectories, and then use an Impala external table to query the data files from their original
|
|
locations.
|
|
</p>
|
|
|
|
<p>
|
|
The tutorial uses a table with web log data, with separate subdirectories for the year, month, day, and
|
|
host. For simplicity, we use a tiny amount of CSV data, loading the same data into each partition.
|
|
</p>
|
|
|
|
<p>
|
|
First, we make an Impala partitioned table for CSV data, and look at the underlying HDFS directory
|
|
structure to understand the directory structure to re-create elsewhere in HDFS. The columns
|
|
<codeph>field1</codeph>, <codeph>field2</codeph>, and <codeph>field3</codeph> correspond to the contents
|
|
of the CSV data files. The <codeph>year</codeph>, <codeph>month</codeph>, <codeph>day</codeph>, and
|
|
<codeph>host</codeph> columns are all represented as subdirectories within the table structure, and are
|
|
not part of the CSV files. We use <codeph>STRING</codeph> for each of these columns so that we can
|
|
produce consistent subdirectory names, with leading zeros for a consistent length.
|
|
</p>
|
|
|
|
<codeblock>create database external_partitions;
|
|
use external_partitions;
|
|
create table logs (field1 string, field2 string, field3 string)
|
|
partitioned by (year string, month string , day string, host string)
|
|
row format delimited fields terminated by ',';
|
|
insert into logs partition (year="2013", month="07", day="28", host="host1") values ("foo","foo","foo");
|
|
insert into logs partition (year="2013", month="07", day="28", host="host2") values ("foo","foo","foo");
|
|
insert into logs partition (year="2013", month="07", day="29", host="host1") values ("foo","foo","foo");
|
|
insert into logs partition (year="2013", month="07", day="29", host="host2") values ("foo","foo","foo");
|
|
insert into logs partition (year="2013", month="08", day="01", host="host1") values ("foo","foo","foo");
|
|
</codeblock>
|
|
|
|
<p>
|
|
Back in the Linux shell, we examine the HDFS directory structure. (Your Impala data directory might be in
|
|
a different location; for historical reasons, it is sometimes under the HDFS path
|
|
<filepath>/user/hive/warehouse</filepath>.) We use the <codeph>hdfs dfs -ls</codeph> command to examine
|
|
the nested subdirectories corresponding to each partitioning column, with separate subdirectories at each
|
|
level (with <codeph>=</codeph> in their names) representing the different values for each partitioning
|
|
column. When we get to the lowest level of subdirectory, we use the <codeph>hdfs dfs -cat</codeph>
|
|
command to examine the data file and see CSV-formatted data produced by the <codeph>INSERT</codeph>
|
|
statement in Impala.
|
|
</p>
|
|
|
|
<codeblock>$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db
|
|
Found 1 items
|
|
drwxrwxrwt - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs
|
|
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs
|
|
Found 1 items
|
|
drwxr-xr-x - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013
|
|
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013
|
|
Found 2 items
|
|
drwxr-xr-x - impala hive 0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
|
|
drwxr-xr-x - impala hive 0 2013-08-07 12:24 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=08
|
|
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07
|
|
Found 2 items
|
|
drwxr-xr-x - impala hive 0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
|
|
drwxr-xr-x - impala hive 0 2013-08-07 12:23 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=29
|
|
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28
|
|
Found 2 items
|
|
drwxr-xr-x - impala hive 0 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
|
|
drwxr-xr-x - impala hive 0 2013-08-07 12:22 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host2
|
|
$ hdfs dfs -ls /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1
|
|
Found 1 items
|
|
-rw-r--r-- 3 impala hive 12 2013-08-07 12:21 /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/host=host1/3981726974111751120--8907184999369517436_822630111_data.0
|
|
$ hdfs dfs -cat /user/impala/warehouse/external_partitions.db/logs/year=2013/month=07/day=28/\
|
|
host=host1/3981726974111751120--8907184999369517436_822630111_data.0
|
|
foo,foo,foo
|
|
</codeblock>
|
|
|
|
<p>
|
|
Still in the Linux shell, we use <codeph>hdfs dfs -mkdir</codeph> to create several data directories
|
|
outside the HDFS directory tree that Impala controls (<filepath>/user/impala/warehouse</filepath> in this
|
|
example, maybe different in your case). Depending on your configuration, you might need to log in as a
|
|
user with permission to write into this HDFS directory tree; for example, the commands shown here were
|
|
run while logged in as the <codeph>hdfs</codeph> user.
|
|
</p>
|
|
|
|
<codeblock>$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1
|
|
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host2
|
|
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=28/host=host1
|
|
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=07/day=29/host=host1
|
|
$ hdfs dfs -mkdir -p /user/impala/data/logs/year=2013/month=08/day=01/host=host1
|
|
</codeblock>
|
|
|
|
<p>
|
|
We make a tiny CSV file, with values different than in the <codeph>INSERT</codeph> statements used
|
|
earlier, and put a copy within each subdirectory that we will use as an Impala partition.
|
|
</p>
|
|
|
|
<codeblock>$ cat >dummy_log_data
bar,baz,bletch
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=28/host=host2
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=07/day=29/host=host1
$ hdfs dfs -put dummy_log_data /user/impala/data/logs/year=2013/month=08/day=01/host=host1
</codeblock>
|
|
|
|
<p>
|
|
Back in the <cmdname>impala-shell</cmdname> interpreter, we move the original Impala-managed table aside,
|
|
and create a new <i>external</i> table with a <codeph>LOCATION</codeph> clause pointing to the directory
|
|
under which we have set up all the partition subdirectories and data files.
|
|
</p>
|
|
|
|
<codeblock>use external_partitions;
|
|
alter table logs rename to logs_original;
|
|
create external table logs (field1 string, field2 string, field3 string)
|
|
partitioned by (year string, month string, day string, host string)
|
|
row format delimited fields terminated by ','
|
|
location '/user/impala/data/logs';
|
|
</codeblock>
|
|
|
|
<p>
|
|
Because partition subdirectories and data files come and go during the data lifecycle, you must identify
|
|
each of the partitions through an <codeph>ALTER TABLE</codeph> statement before Impala recognizes the
|
|
data files they contain.
|
|
</p>
|
|
|
|
<codeblock>alter table logs add partition (year="2013",month="07",day="28",host="host1");
alter table logs add partition (year="2013",month="07",day="28",host="host2");
alter table logs add partition (year="2013",month="07",day="29",host="host1");
alter table logs add partition (year="2013",month="08",day="01",host="host1");
</codeblock>
|
|
|
|
<p>
|
|
We issue a <codeph>REFRESH</codeph> statement for the table, always a safe practice when data files have
|
|
been manually added, removed, or changed. Then the data is ready to be queried. The <codeph>SELECT
|
|
*</codeph> statement illustrates that the data from our trivial CSV file was recognized in each of the
|
|
partitions where we copied it. Although in this case there are only a few rows, we include a
|
|
<codeph>LIMIT</codeph> clause on this test query just in case there is more data than we expect.
|
|
</p>
|
|
|
|
<codeblock>refresh logs;
select * from logs limit 100;
|
|
+--------+--------+--------+------+-------+-----+-------+
|
|
| field1 | field2 | field3 | year | month | day | host |
|
|
+--------+--------+--------+------+-------+-----+-------+
|
|
| bar | baz | bletch | 2013 | 07 | 28 | host1 |
|
|
| bar | baz | bletch | 2013 | 08 | 01 | host1 |
|
|
| bar | baz | bletch | 2013 | 07 | 29 | host1 |
|
|
| bar | baz | bletch | 2013 | 07 | 28 | host2 |
|
|
+--------+--------+--------+------+-------+-----+-------+
|
|
</codeblock>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="tutorial_impala_hive">
|
|
|
|
<title>Switching Back and Forth Between Impala and Hive</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Sometimes, you might find it convenient to switch to the Hive shell to perform some data loading or
|
|
transformation operation, particularly on file formats such as RCFile, SequenceFile, and Avro that Impala
|
|
currently can query but not write to.
|
|
</p>
|
|
|
|
<p>
|
|
Whenever you create, drop, or alter a table or other kind of object through Hive, the next time you
|
|
switch back to the <cmdname>impala-shell</cmdname> interpreter, issue a one-time <codeph>INVALIDATE
|
|
METADATA</codeph> statement so that Impala recognizes the new or changed object.
|
|
</p>
|
|
|
|
<p>
|
|
Whenever you load, insert, or change data in an existing table through Hive (or even through manual HDFS
|
|
operations such as the <cmdname>hdfs</cmdname> command), the next time you switch back to the
|
|
<cmdname>impala-shell</cmdname> interpreter, issue a one-time <codeph>REFRESH
|
|
<varname>table_name</varname></codeph> statement so that Impala recognizes the new or changed data.
|
|
</p>
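<p>
The overall rhythm looks like the following sketch. The table name <codeph>t1</codeph>, the local
file <filepath>/tmp/t1.txt</filepath>, and the Hive statements are placeholders for whatever objects
and data you are actually working with.
</p>

<codeblock>-- In the Hive shell: create a new table and load some data into it.
hive> create table t1 (x int);
hive> load data local inpath '/tmp/t1.txt' into table t1;

-- Back in impala-shell: a newly created object requires a one-time INVALIDATE METADATA.
[localhost:21000] > invalidate metadata;
[localhost:21000] > select count(*) from t1;

-- Later, after loading or changing data in the same table through Hive, refresh just that table.
[localhost:21000] > refresh t1;
</codeblock>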
|
|
|
|
<p>
|
|
For examples showing how this process works for the <codeph>REFRESH</codeph> statement, look at the
|
|
examples of creating RCFile and SequenceFile tables in Impala, loading data through Hive, and then
|
|
querying the data through Impala. See <xref href="impala_rcfile.xml#rcfile"/> and
|
|
<xref href="impala_seqfile.xml#seqfile"/> for those examples.
|
|
</p>
|
|
|
|
<p>
|
|
For examples showing how this process works for the <codeph>INVALIDATE METADATA</codeph> statement, look
|
|
at the example of creating and loading an Avro table in Hive, and then querying the data through Impala.
|
|
See <xref href="impala_avro.xml#avro"/> for that example.
|
|
</p>
|
|
|
|
<note rev="1.2.0">
|
|
<p rev="1.2.0">
|
|
Originally, Impala did not support UDFs, but this feature is available in Impala starting in Impala
|
|
1.2. Some <codeph>INSERT ... SELECT</codeph> transformations that you originally did through Hive can
|
|
now be done through Impala. See <xref href="impala_udf.xml#udfs"/> for details.
|
|
</p>
|
|
|
|
<p rev="1.2.0">
|
|
Prior to Impala 1.2, the <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements
|
|
needed to be issued on each Impala node to which you connected and issued queries. In Impala 1.2 and
|
|
higher, when you issue either of those statements on any Impala node, the results are broadcast to all
|
|
the Impala nodes in the cluster, making it truly a one-step operation after each round of DDL or ETL
|
|
operations in Hive.
|
|
</p>
|
|
</note>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept rev="1.2.2" id="tut_cross_join">
|
|
|
|
<title>Cross Joins and Cartesian Products with the CROSS JOIN Operator</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Originally, Impala restricted join queries so that they had to include at least one equality comparison
|
|
between the columns of the tables on each side of the join operator. With the huge tables typically
|
|
processed by Impala, any miscoded query that produced a full Cartesian product as a result set could
|
|
consume a huge amount of cluster resources.
|
|
</p>
|
|
|
|
<p>
|
|
In Impala 1.2.2 and higher, this restriction is lifted when you use the <codeph>CROSS JOIN</codeph>
|
|
operator in the query. You still cannot remove all <codeph>WHERE</codeph> clauses from a query like
|
|
<codeph>SELECT * FROM t1 JOIN t2</codeph> to produce all combinations of rows from both tables. But you
|
|
can use the <codeph>CROSS JOIN</codeph> operator to explicitly request such a Cartesian product.
|
|
Typically, this operation is applicable for smaller tables, where the result set still fits within the
|
|
memory of a single Impala node.
|
|
</p>
|
|
|
|
<p>
|
|
The following example sets up data for use in a series of comic books where characters battle each other.
|
|
At first, we use an equijoin query, which only allows characters from the same time period and the same
|
|
planet to meet.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create table heroes (name string, era string, planet string);
|
|
[localhost:21000] > create table villains (name string, era string, planet string);
|
|
[localhost:21000] > insert into heroes values
|
|
> ('Tesla','20th century','Earth'),
|
|
> ('Pythagoras','Antiquity','Earth'),
|
|
> ('Zopzar','Far Future','Mars');
|
|
Inserted 3 rows in 2.28s
|
|
[localhost:21000] > insert into villains values
|
|
> ('Caligula','Antiquity','Earth'),
|
|
> ('John Dillinger','20th century','Earth'),
> ('Xibulor','Far Future','Venus');
Inserted 3 rows in 1.93s
[localhost:21000] > select concat(heroes.name,' vs. ',villains.name) as battle
> from heroes join villains
> where heroes.era = villains.era and heroes.planet = villains.planet;
+--------------------------+
| battle |
+--------------------------+
| Tesla vs. John Dillinger |
| Pythagoras vs. Caligula |
+--------------------------+
Returned 2 row(s) in 0.47s</codeblock>

<p>
Readers demanded more action, so we added elements of time travel and space travel so that any hero could
face any villain. Prior to Impala 1.2.2, this type of query was impossible because all joins had to
reference matching values between the two tables:
</p>

<codeblock>[localhost:21000] > -- Cartesian product not possible in Impala 1.1.
> select concat(heroes.name,' vs. ',villains.name) as battle from heroes join villains;
ERROR: NotImplementedException: Join between 'heroes' and 'villains' requires at least one conjunctive equality predicate between the two tables</codeblock>

<p>
With Impala 1.2.2, we rewrite the query slightly to use <codeph>CROSS JOIN</codeph> rather than
<codeph>JOIN</codeph>, and now the result set includes all combinations:
</p>

<codeblock>[localhost:21000] > -- Cartesian product available in Impala 1.2.2 with the CROSS JOIN syntax.
> select concat(heroes.name,' vs. ',villains.name) as battle from heroes cross join villains;
+-------------------------------+
| battle |
+-------------------------------+
| Tesla vs. Caligula |
| Tesla vs. John Dillinger |
| Tesla vs. Xibulor |
| Pythagoras vs. Caligula |
| Pythagoras vs. John Dillinger |
| Pythagoras vs. Xibulor |
| Zopzar vs. Caligula |
| Zopzar vs. John Dillinger |
| Zopzar vs. Xibulor |
+-------------------------------+
Returned 9 row(s) in 0.33s</codeblock>

<p>
The full combination of rows from both tables is known as the Cartesian product. This type of result set
is often used for creating grid data structures. You can also filter the result set by including
<codeph>WHERE</codeph> clauses that do not explicitly compare columns between the two tables. The
following example shows how you might produce a list of combinations of year and quarter for use in a
chart, and then a shorter list with only selected quarters.
</p>

<codeblock>[localhost:21000] > create table x_axis (x int);
[localhost:21000] > create table y_axis (y int);
[localhost:21000] > insert into x_axis values (1),(2),(3),(4);
Inserted 4 rows in 2.14s
[localhost:21000] > insert into y_axis values (2010),(2011),(2012),(2013),(2014);
Inserted 5 rows in 1.32s
[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis;
+------+---------+
| year | quarter |
+------+---------+
| 2010 | 1 |
| 2011 | 1 |
| 2012 | 1 |
| 2013 | 1 |
| 2014 | 1 |
| 2010 | 2 |
| 2011 | 2 |
| 2012 | 2 |
| 2013 | 2 |
| 2014 | 2 |
| 2010 | 3 |
| 2011 | 3 |
| 2012 | 3 |
| 2013 | 3 |
| 2014 | 3 |
| 2010 | 4 |
| 2011 | 4 |
| 2012 | 4 |
| 2013 | 4 |
| 2014 | 4 |
+------+---------+
Returned 20 row(s) in 0.38s
[localhost:21000] > select y as year, x as quarter from x_axis cross join y_axis where x in (1,3);
+------+---------+
| year | quarter |
+------+---------+
| 2010 | 1 |
| 2011 | 1 |
| 2012 | 1 |
| 2013 | 1 |
| 2014 | 1 |
| 2010 | 3 |
| 2011 | 3 |
| 2012 | 3 |
| 2013 | 3 |
| 2014 | 3 |
+------+---------+
Returned 10 row(s) in 0.39s</codeblock>
</conbody>
</concept>

<concept audience="Cloudera" id="tut_connect">

<title>Connecting to Impala through impala-shell</title>

<conbody>

<p>
This tutorial provides some tips for connecting to Impala through the <cmdname>impala-shell</cmdname>
interpreter, with advice for particular scenarios. The example following the list shows several of
these options combined.
</p>

<ul>
<li>
<p>
By default, <cmdname>impala-shell</cmdname> with no arguments connects to the current host, port
21000.
</p>
</li>

<li>
<p>
The <codeph>-i localhost</codeph> option connects to the coordinator on the same machine through the
<codeph>localhost</codeph> interface, which also gives you a shorter command prompt.
</p>
</li>

<li>
<p>
The <codeph>--quiet</codeph> option suppresses some informational messages, such as the
<codeph>Query:</codeph> line that echoes each command entered.
</p>
</li>

<li>
The <codeph>CONNECT <varname>hostname</varname></codeph> command lets you switch to another host (keeping the same
default port 21000).
</li>

<li>
<p>
You could run <cmdname>impala-shell</cmdname> from a completely different host to keep login accounts
off the actual Impala nodes.
</p>
</li>

<li>
<p>
For load balancing, you would connect to different coordinator nodes for concurrent queries.
</p>
</li>

<li>
<p>
For benchmarking, you would skip the pretty-printing of result sets with the <codeph>-B</codeph> option, or even
skip the screen output of results entirely with the <codeph>-o</codeph> option (which writes them to a file)
or shell redirection.
</p>
</li>
</ul>
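
<p>
For example, the following hypothetical session (the host names and table name are placeholders for
illustration, not output we captured) combines several of these options: connecting to a specific
coordinator, suppressing informational messages, switching hosts with <codeph>CONNECT</codeph>, and
finally running a benchmarking-style command with <codeph>-B</codeph> and <codeph>-o</codeph>:
</p>

<codeblock>$ impala-shell -i coordinator1.example.com --quiet
[coordinator1.example.com:21000] > select count(*) from sample_table;
...
[coordinator1.example.com:21000] > connect coordinator2.example.com;
[coordinator2.example.com:21000] > quit;

$ # Unformatted output, written to a file rather than the screen.
$ impala-shell -i coordinator1.example.com -B -o /tmp/count.txt -q 'select count(*) from sample_table'</codeblock>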
</conbody>
</concept>
</concept>

<concept audience="Cloudera" id="tut_mem_limit">

<title>Effects of Memory Limits on Impala Queries</title>
<prolog>
<metadata>
<data name="Category" value="Memory"/>
</metadata>
</prolog>

<conbody>

<p></p>
</conbody>
</concept>

<concept id="tut_parquet_schemaless">

<title>Dealing with Parquet Files with Unknown Schema</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Tutorials"/>
<data name="Category" value="Schemas"/>
<data name="Category" value="Parquet"/>
<data name="Category" value="File Formats"/>
</metadata>
</prolog>

<conbody>

<p>
As data pipelines start to include components such as NoSQL stores or loosely specified schemas, you might encounter
situations where you have data files (particularly in Parquet format) but do not know the precise table definition.
This tutorial shows how you can build an Impala table around data that comes from non-Impala or even non-SQL sources,
where you do not have control of the table layout and might not be familiar with the characteristics of the data.
</p>

<p>
The data used in this tutorial represents airline on-time arrival statistics, from October 1987 through April 2008.
See the details on the <xref href="http://stat-computing.org/dataexpo/2009/" scope="external" format="html">2009 ASA Data Expo web site</xref>.
You can also see the <xref href="http://stat-computing.org/dataexpo/2009/the-data.html" scope="external" format="html">explanations of the columns</xref>;
for purposes of this exercise, wait until after following the tutorial before examining the schema, to better simulate
a real-life situation where you cannot rely on assumptions and assertions about the ranges and representations of
data values.
</p>

<p>
We will download Parquet files containing this data from the Ibis blog.
First, we download and unpack the data files.
There are 8 files totalling 1.4 GB.
Each file is less than 256 MB.
</p>

<codeblock>$ wget -O airlines_parquet.tar.gz https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0
...
Length: 1245204740 (1.2G) [application/octet-stream]
Saving to: “airlines_parquet.tar.gz”

2015-08-12 17:14:24 (23.6 MB/s) - “airlines_parquet.tar.gz” saved [1245204740/1245204740]

$ tar xvzf airlines_parquet.tar.gz
airlines_parquet/
airlines_parquet/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
airlines_parquet/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
$ cd airlines_parquet/
$ du -kch *.parq
253M 93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
65M 93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
156M 93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
240M 93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
253M 93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
16M 93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
177M 93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
213M 93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
1.4G total
</codeblock>

<p>
Next, we put the Parquet data files in HDFS, all together in a single directory,
with permissions on the directory and the files so that the <codeph>impala</codeph>
user will be able to read them.
</p>

<note>
After unpacking, we saw the largest Parquet file was 253 MB.
When copying Parquet files into HDFS for Impala to use,
for maximum query performance, make sure that each file resides in a single HDFS data block.
Therefore, we pick a size larger than any single file and specify that as the block size, using the argument
<codeph>-Ddfs.block.size=256m</codeph> on the <codeph>hdfs dfs -put</codeph> command.
</note>

<codeblock>$ hdfs dfs -mkdir -p hdfs://demo_host.example.com:8020/user/impala/staging/airlines
$ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines
$ hdfs dfs -ls /user/impala/staging
Found 1 items
drwxrwxrwx - hdfs supergroup 0 2015-08-12 13:52 /user/impala/staging/airlines
$ hdfs dfs -ls hdfs://demo_host.example.com:8020/user/impala/staging/airlines
Found 8 items
-rw-r--r-- 3 jrussell supergroup 265107489 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-rw-r--r-- 3 jrussell supergroup 67544715 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-rw-r--r-- 3 jrussell supergroup 162556490 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-rw-r--r-- 3 jrussell supergroup 251603518 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-rw-r--r-- 3 jrussell supergroup 265186603 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-rw-r--r-- 3 jrussell supergroup 16663754 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-rw-r--r-- 3 jrussell supergroup 185511677 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-rw-r--r-- 3 jrussell supergroup 222794621 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
</codeblock>

<p>
With the files in an accessible location in HDFS, we create a database table that uses the data in those files.
The <codeph>CREATE EXTERNAL</codeph> syntax and the <codeph>LOCATION</codeph> attribute point Impala at the appropriate HDFS directory.
The <codeph>LIKE PARQUET '<varname>path_to_any_parquet_file</varname>'</codeph> clause means we skip the list of column names and types;
Impala automatically gets the column names and data types straight from the data files.
(Currently, this technique only works for Parquet files.)
We ignore the warning about lack of <codeph>READ_WRITE</codeph> access to the files in HDFS;
the <codeph>impala</codeph> user can read the files, which will be sufficient for us to experiment with
queries and perform some copy and transform operations into other tables.
</p>

<codeblock>$ impala-shell -i localhost
Starting Impala Shell without Kerberos authentication
Connected to localhost:21000
Server version: impalad version 2.2.0-cdh5 RELEASE (build 2ffd73a4255cefd521362ffe1cfb37463f67f75c)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v2.1.2-cdh5 (92438b7) built on Tue Feb 24 12:36:33 PST 2015)
[localhost:21000] > create database airline_data;
[localhost:21000] > use airline_data;
[localhost:21000] > create external table airlines_external
> like parquet 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq'
> stored as parquet location 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines';
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://demo_host.example.com:8020/user/impala/staging'
</codeblock>

<p>
With the table created, we examine its physical and logical characteristics to confirm that the data is really
there and in a format and shape that we can work with.
The <codeph>SHOW TABLE STATS</codeph> statement gives a very high-level summary of the table,
showing how many files and how much total data it contains.
Also, it confirms that the table is expecting all the associated data files to be in Parquet format.
(The ability to work with all kinds of HDFS data files in different formats means that it is
possible to have a mismatch between the format of the data files, and the format
that the table expects the data files to be in.)
The <codeph>SHOW FILES</codeph> statement confirms that the data in the table has the expected number,
names, and sizes of the original Parquet files.
The <codeph>DESCRIBE</codeph> statement (or its abbreviation <codeph>DESC</codeph>) confirms the names and types
of the columns that Impala automatically created after reading that metadata from the Parquet file.
The <codeph>DESCRIBE FORMATTED</codeph> statement prints out some extra detail along with the column definitions;
the pieces we care about for this exercise are the containing database for the table,
the location of the associated data files in HDFS, the fact that it's an external table so Impala will not
delete the HDFS files when we finish the experiments and drop the table, and the fact that the
table is set up to work exclusively with files in the Parquet format.
</p>

<codeblock>[localhost:21000] > show table stats airlines_external;
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
[localhost:21000] > show files in airlines_external;
+----------------------------------------------------------------------------------------+----------+-----------+
| path | size | partition |
+----------------------------------------------------------------------------------------+----------+-----------+
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq | 252.83MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq | 64.42MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq | 155.03MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq | 239.95MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq | 252.90MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq | 15.89MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq | 176.92MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq | 212.47MB | |
+----------------------------------------------------------------------------------------+----------+-----------+
[localhost:21000] > describe airlines_external;
+---------------------+--------+---------------------------------------------------+
| name | type | comment |
+---------------------+--------+---------------------------------------------------+
| year | int | inferred from: optional int32 year |
| month | int | inferred from: optional int32 month |
| day | int | inferred from: optional int32 day |
| dayofweek | int | inferred from: optional int32 dayofweek |
| dep_time | int | inferred from: optional int32 dep_time |
| crs_dep_time | int | inferred from: optional int32 crs_dep_time |
| arr_time | int | inferred from: optional int32 arr_time |
| crs_arr_time | int | inferred from: optional int32 crs_arr_time |
| carrier | string | inferred from: optional binary carrier |
| flight_num | int | inferred from: optional int32 flight_num |
| tail_num | int | inferred from: optional int32 tail_num |
| actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time |
| crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time |
| airtime | int | inferred from: optional int32 airtime |
| arrdelay | int | inferred from: optional int32 arrdelay |
| depdelay | int | inferred from: optional int32 depdelay |
| origin | string | inferred from: optional binary origin |
| dest | string | inferred from: optional binary dest |
| distance | int | inferred from: optional int32 distance |
| taxi_in | int | inferred from: optional int32 taxi_in |
| taxi_out | int | inferred from: optional int32 taxi_out |
| cancelled | int | inferred from: optional int32 cancelled |
| cancellation_code | string | inferred from: optional binary cancellation_code |
| diverted | int | inferred from: optional int32 diverted |
| carrier_delay | int | inferred from: optional int32 carrier_delay |
| weather_delay | int | inferred from: optional int32 weather_delay |
| nas_delay | int | inferred from: optional int32 nas_delay |
| security_delay | int | inferred from: optional int32 security_delay |
| late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay |
+---------------------+--------+---------------------------------------------------+
[localhost:21000] > desc formatted airlines_external;
+------------------------------+-------------------------------
| name | type
+------------------------------+-------------------------------
...
| # Detailed Table Information | NULL
| Database: | airline_data
| Owner: | jrussell
...
| Location: | /user/impala/staging/airlines
| Table Type: | EXTERNAL_TABLE
...
| # Storage Information | NULL
| SerDe Library: | parquet.hive.serde.ParquetHiveSerDe
| InputFormat: | parquet.hive.DeprecatedParquetInputFormat
| OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat
...
</codeblock>

<p>
Now that we are confident that the connections are solid between the Impala table and the
underlying Parquet files, we run some initial queries to understand the characteristics
of the data: the overall number of rows, and the ranges and how many
different values are in certain columns.
For convenience in understanding the magnitude of the <codeph>COUNT(*)</codeph>
result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows in the table.
</p>

<!-- I think this is the very longest line that could possibly fit without wrapping in the PDF output; 87 characters wide.
[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external;
-->

<codeblock>[localhost:21000] > select count(*) from airlines_external;
+-----------+
| count(*) |
+-----------+
| 123534969 |
+-----------+
Fetched 1 row(s) in 1.32s
[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external;
+------------------+
| millions of rows |
+------------------+
| 123.534969 |
+------------------+
Fetched 1 row(s) in 1.24s
</codeblock>

<p>
The <codeph>NDV()</codeph> function stands for <q>number of distinct values</q>, which for performance
reasons is an estimate when there are lots of different values in the column, but is precise when the
cardinality is less than 16 K. Use <codeph>NDV()</codeph> calls for this kind of exploration rather than
<codeph>COUNT(DISTINCT <varname>colname</varname>)</codeph>, because Impala can evaluate multiple
<codeph>NDV()</codeph> functions in a single query, but only a single instance of
<codeph>COUNT DISTINCT</codeph>. Here we see that there are modest numbers of different airlines,
flight numbers, and origin and destination airports. Two things jump out from this query:
the number of <codeph>tail_num</codeph> values is much smaller than we might have expected,
and there are more destination airports than origin airports. Let's dig further.
What we find is that most <codeph>tail_num</codeph> values are <codeph>NULL</codeph>. It looks
like this was an experimental column that wasn't filled in accurately.
We make a mental note that if we use this data as a starting point, we'll ignore this column.
We also find that certain airports are represented in the <codeph>ORIGIN</codeph> column but not the
<codeph>DEST</codeph> column; now we know that we cannot rely on the assumption that those sets of
airport codes are identical.
</p>

<note>
A slight digression for some performance tuning. Notice how the first
<codeph>SELECT DISTINCT DEST</codeph> query takes almost 40 seconds.
We expect all queries on such a small data set, less than 2 GB, to
take a few seconds at most. The reason is that the expression
<codeph>NOT IN (SELECT origin FROM airlines_external)</codeph>
produces an intermediate result set of 123 million rows, then
runs 123 million comparisons on each data node against the tiny set of destination airports.
The way the <codeph>NOT IN</codeph> operator works internally means that
this intermediate result set with 123 million rows might be transmitted
across the network to each data node in the cluster.
Applying another <codeph>DISTINCT</codeph> inside the <codeph>NOT IN</codeph>
subquery means that the intermediate result set is only 340 items,
resulting in much less network traffic and fewer comparison operations.
The more efficient query with the added <codeph>DISTINCT</codeph> is approximately 7 times as fast.
</note>

<codeblock>[localhost:21000] > select ndv(carrier), ndv(flight_num), ndv(tail_num),
> ndv(origin), ndv(dest) from airlines_external;
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
| 29 | 9086 | 3 | 340 | 347 |
+--------------+-----------------+---------------+-------------+-----------+
[localhost:21000] > select tail_num, count(*) as howmany from airlines_external
> group by tail_num;
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
| NULL | 123122001 |
+----------+-----------+
Fetched 1 row(s) in 5.18s
[localhost:21000] > select distinct dest from airlines_external
> where dest not in (select origin from airlines_external);
+------+
| dest |
+------+
| LBF |
| CBM |
| RCA |
| SKA |
| LAR |
+------+
Fetched 5 row(s) in 39.64s
[localhost:21000] > select distinct dest from airlines_external
> where dest not in (select distinct origin from airlines_external);
+------+
| dest |
+------+
| LBF |
| RCA |
| CBM |
| SKA |
| LAR |
+------+
Fetched 5 row(s) in 5.59s
[localhost:21000] > select distinct origin from airlines_external
> where origin not in (select distinct dest from airlines_external);
Fetched 0 row(s) in 5.37s
</codeblock>

<p>
Next, we try doing a simple calculation, with results broken down by year.
This reveals that some years have no data in the
<codeph>AIRTIME</codeph> column. That means we might be able to use
that column in queries involving certain date ranges, but we cannot
count on it to always be reliable. The question of whether a column
contains any <codeph>NULL</codeph> values, and if so what is their
number, proportion, and distribution, comes up again and again when
doing initial exploration of a data set.
</p>

<codeblock>[localhost:21000] > select year, sum(airtime) from airlines_external
> group by year order by year desc;
+------+--------------+
| year | sum(airtime) |
+------+--------------+
| 2008 | 713050445 |
| 2007 | 748015545 |
| 2006 | 720372850 |
| 2005 | 708204026 |
| 2004 | 714276973 |
| 2003 | 665706940 |
| 2002 | 549761849 |
| 2001 | 590867745 |
| 2000 | 583537683 |
| 1999 | 561219227 |
| 1998 | 538050663 |
| 1997 | 536991229 |
| 1996 | 519440044 |
| 1995 | 513364265 |
| 1994 | NULL |
| 1993 | NULL |
| 1992 | NULL |
| 1991 | NULL |
| 1990 | NULL |
| 1989 | NULL |
| 1988 | NULL |
| 1987 | NULL |
+------+--------------+
</codeblock>

<p>
With the notion of <codeph>NULL</codeph> values in mind, let's come back to the <codeph>TAIL_NUM</codeph>
column that we discovered had a lot of <codeph>NULL</codeph>s.
Let's quantify the <codeph>NULL</codeph> and non-<codeph>NULL</codeph> values in that column for better understanding.
First, we just count the overall number of rows versus the non-<codeph>NULL</codeph> values in that column.
That initial result gives the appearance of relatively few non-<codeph>NULL</codeph> values, but we can break
it down more clearly in a single query.
Once we have the <codeph>COUNT(*)</codeph> and the <codeph>COUNT(<varname>colname</varname>)</codeph> numbers,
we can encode that initial query in a <codeph>WITH</codeph> clause, then run a follow-on query that performs
multiple arithmetic operations on those values.
Seeing that only one-third of one percent of all rows have non-<codeph>NULL</codeph> values for the
<codeph>TAIL_NUM</codeph> column clearly illustrates that this column is not of much use.
</p>

<codeblock>[localhost:21000] > select count(*) as 'rows', count(tail_num) as 'non-null tail numbers'
> from airlines_external;
+-----------+-----------------------+
| rows | non-null tail numbers |
+-----------+-----------------------+
| 123534969 | 412968 |
+-----------+-----------------------+
Fetched 1 row(s) in 1.51s
[localhost:21000] > with t1 as
> (select count(*) as 'rows', count(tail_num) as 'nonnull'
> from airlines_external)
> select `rows`, `nonnull`, `rows` - `nonnull` as 'nulls',
> (`nonnull` / `rows`) * 100 as 'percentage non-null'
> from t1;
+-----------+---------+-----------+---------------------+
| rows | nonnull | nulls | percentage non-null |
+-----------+---------+-----------+---------------------+
| 123534969 | 412968 | 123122001 | 0.3342923897119365 |
+-----------+---------+-----------+---------------------+
</codeblock>

<p>
By examining other columns using these techniques, we can form a mental picture of the way data is distributed
throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly on
the fields likely to hold discrete values, rather than columns such as <codeph>ACTUAL_ELAPSED_TIME</codeph>
whose names suggest they hold measurements. We would dig deeper into those columns once we had a clear picture
of which questions were worthwhile to ask, and what kinds of trends we might look for.
For the final piece of initial exploration, let's look at the <codeph>YEAR</codeph> column.
A simple <codeph>GROUP BY</codeph> query shows that it has a well-defined range, a manageable number of
distinct values, and relatively even distribution of rows across the different years.
</p>

<codeblock>[localhost:21000] > select min(year), max(year), ndv(year) from airlines_external;
+-----------+-----------+-----------+
| min(year) | max(year) | ndv(year) |
+-----------+-----------+-----------+
| 1987 | 2008 | 22 |
+-----------+-----------+-----------+
Fetched 1 row(s) in 2.03s
[localhost:21000] > select year, count(*) howmany from airlines_external
> group by year order by year desc;
+------+---------+
| year | howmany |
+------+---------+
| 2008 | 7009728 |
| 2007 | 7453215 |
| 2006 | 7141922 |
| 2005 | 7140596 |
| 2004 | 7129270 |
| 2003 | 6488540 |
| 2002 | 5271359 |
| 2001 | 5967780 |
| 2000 | 5683047 |
| 1999 | 5527884 |
| 1998 | 5384721 |
| 1997 | 5411843 |
| 1996 | 5351983 |
| 1995 | 5327435 |
| 1994 | 5180048 |
| 1993 | 5070501 |
| 1992 | 5092157 |
| 1991 | 5076925 |
| 1990 | 5270893 |
| 1989 | 5041200 |
| 1988 | 5202096 |
| 1987 | 1311826 |
+------+---------+
Fetched 22 row(s) in 2.13s
</codeblock>

<p>
We could go quite far with the data in this initial raw format, just as we downloaded it from the web.
If the data set proved to be useful and worth persisting in Impala for extensive queries,
we might want to copy it to an internal table, letting Impala manage the data files and perhaps
reorganizing a little for higher efficiency.
In this next stage of the tutorial, we copy the original data into a partitioned table, still in Parquet format.
Partitioning based on the <codeph>YEAR</codeph> column lets us run queries with clauses such as <codeph>WHERE year = 2001</codeph>
or <codeph>WHERE year BETWEEN 1989 AND 1999</codeph>, which can dramatically cut down on I/O by
ignoring all the data from years outside the desired range.
Rather than reading all the data and then deciding which rows are in the matching years, Impala can
zero in on only the data files from specific <codeph>YEAR</codeph> partitions.
To do this, Impala physically reorganizes the data files, putting the rows from each year into
data files in a separate HDFS directory for each <codeph>YEAR</codeph> value.
Along the way, we'll also get rid of the <codeph>TAIL_NUM</codeph> column that proved to be almost entirely <codeph>NULL</codeph>.
</p>

<p>
The first step is to create a new table with a layout very similar to the original <codeph>AIRLINES_EXTERNAL</codeph> table.
We'll do that by reverse-engineering a <codeph>CREATE TABLE</codeph> statement for the first table,
then tweaking it slightly to include a <codeph>PARTITIONED BY</codeph> clause for <codeph>YEAR</codeph>,
and excluding the <codeph>TAIL_NUM</codeph> column.
The <codeph>SHOW CREATE TABLE</codeph> statement gives us the starting point.
</p>

<codeblock>[localhost:21000] > show create table airlines_external;
+-------------------------------------------------------------------------------------
| result
+-------------------------------------------------------------------------------------
| CREATE EXTERNAL TABLE airline_data.airlines_external (
| year INT COMMENT 'inferred from: optional int32 year',
| month INT COMMENT 'inferred from: optional int32 month',
| day INT COMMENT 'inferred from: optional int32 day',
| dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
| dep_time INT COMMENT 'inferred from: optional int32 dep_time',
| crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
| arr_time INT COMMENT 'inferred from: optional int32 arr_time',
| crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
| carrier STRING COMMENT 'inferred from: optional binary carrier',
| flight_num INT COMMENT 'inferred from: optional int32 flight_num',
| tail_num INT COMMENT 'inferred from: optional int32 tail_num',
| actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
| crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
| airtime INT COMMENT 'inferred from: optional int32 airtime',
| arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
| depdelay INT COMMENT 'inferred from: optional int32 depdelay',
| origin STRING COMMENT 'inferred from: optional binary origin',
| dest STRING COMMENT 'inferred from: optional binary dest',
| distance INT COMMENT 'inferred from: optional int32 distance',
| taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
| taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
| cancelled INT COMMENT 'inferred from: optional int32 cancelled',
| cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
| diverted INT COMMENT 'inferred from: optional int32 diverted',
| carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
| weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
| nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
| security_delay INT COMMENT 'inferred from: optional int32 security_delay',
| late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
| )
| STORED AS PARQUET
| LOCATION 'hdfs://a1730.halxg.cloudera.com:8020/user/impala/staging/airlines'
| TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
| 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
| 'rawDataSize'='-1')
+-------------------------------------------------------------------------------------
Fetched 1 row(s) in 0.03s
[localhost:21000] > quit;
</codeblock>

<p>
Although we could edit that output into a new SQL statement, all the ASCII box characters
make such editing inconvenient. To get a more stripped-down <codeph>CREATE TABLE</codeph>
to start with, we restart the <cmdname>impala-shell</cmdname> command with the
<codeph>-B</codeph> option, which turns off the box-drawing behavior.
</p>

<codeblock>[localhost:21000] > quit;
Goodbye jrussell
$ impala-shell -i localhost -B -d airline_data;
Starting Impala Shell without Kerberos authentication
Connected to localhost:21000
Server version: impalad version 2.2.0-cdh5 RELEASE (build 2ffd73a4255cefd521362ffe1cfb37463f67f75c)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.

Copyright (c) 2012 Cloudera, Inc. All rights reserved.

(Shell build version: Impala Shell v2.1.2-cdh5 (92438b7) built on Tue Feb 24 12:36:33 PST 2015)
[localhost:21000] > show create table airlines_external;
"CREATE EXTERNAL TABLE airline_data.airlines_external (
year INT COMMENT 'inferred from: optional int32 year',
month INT COMMENT 'inferred from: optional int32 month',
day INT COMMENT 'inferred from: optional int32 day',
dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
dep_time INT COMMENT 'inferred from: optional int32 dep_time',
crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
arr_time INT COMMENT 'inferred from: optional int32 arr_time',
crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
carrier STRING COMMENT 'inferred from: optional binary carrier',
flight_num INT COMMENT 'inferred from: optional int32 flight_num',
tail_num INT COMMENT 'inferred from: optional int32 tail_num',
actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
airtime INT COMMENT 'inferred from: optional int32 airtime',
arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
depdelay INT COMMENT 'inferred from: optional int32 depdelay',
origin STRING COMMENT 'inferred from: optional binary origin',
dest STRING COMMENT 'inferred from: optional binary dest',
distance INT COMMENT 'inferred from: optional int32 distance',
taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
cancelled INT COMMENT 'inferred from: optional int32 cancelled',
cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
diverted INT COMMENT 'inferred from: optional int32 diverted',
carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
security_delay INT COMMENT 'inferred from: optional int32 security_delay',
late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
)
STORED AS PARQUET
LOCATION 'hdfs://a1730.halxg.cloudera.com:8020/user/impala/staging/airlines'
TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
'rawDataSize'='-1')"
Fetched 1 row(s) in 0.01s
</codeblock>

<p>
After copying and pasting the <codeph>CREATE TABLE</codeph> statement into a text editor for fine-tuning, we quit and restart <cmdname>impala-shell</cmdname>
without the <codeph>-B</codeph> option, to switch back to regular
output.
</p>
<p>
Next we run the <codeph>CREATE TABLE</codeph> statement that we adapted from the <codeph>SHOW CREATE TABLE</codeph> output.
We kept the <codeph>STORED AS PARQUET</codeph> clause because
we want to rearrange the data somewhat but still keep it in the high-performance
Parquet format. The <codeph>LOCATION</codeph> and <codeph>TBLPROPERTIES</codeph>
clauses are not relevant for this new table, so we edit those out.
Because we are going to partition the new table based on the <codeph>YEAR</codeph>
column, we move that column name (and its type) into a new <codeph>PARTITIONED BY</codeph> clause.
</p>

<codeblock>[localhost:21000] > CREATE TABLE airline_data.airlines
> (
> month INT,
> day INT,
> dayofweek INT,
> dep_time INT,
> crs_dep_time INT,
> arr_time INT,
> crs_arr_time INT,
> carrier STRING,
> flight_num INT,
> actual_elapsed_time INT,
> crs_elapsed_time INT,
> airtime INT,
> arrdelay INT,
> depdelay INT,
> origin STRING,
> dest STRING,
> distance INT,
> taxi_in INT,
> taxi_out INT,
> cancelled INT,
> cancellation_code STRING,
> diverted INT,
> carrier_delay INT,
> weather_delay INT,
> nas_delay INT,
> security_delay INT,
> late_aircraft_delay INT
> )
> STORED AS PARQUET
> PARTITIONED BY (year INT);
Fetched 0 row(s) in 0.10s
</codeblock>

<p>
Next, we copy all the rows from the original table into this new one with
an <codeph>INSERT</codeph> statement. (We edited the <codeph>CREATE TABLE</codeph>
statement to make an <codeph>INSERT</codeph> statement with the column names
in the same order.) The only change is to add a <codeph>PARTITION(year)</codeph>
clause, and move the <codeph>YEAR</codeph> column to the very end of the
<codeph>SELECT</codeph> list of the <codeph>INSERT</codeph> statement.
Specifying <codeph>PARTITION(year)</codeph>, rather than a fixed value
such as <codeph>PARTITION(year=2000)</codeph>, means that Impala figures
out the partition value for each row based on the value of the very last column
in the <codeph>SELECT</codeph> list.
This is the first SQL statement that legitimately takes any substantial time,
because the rows from different years are shuffled around the cluster;
the rows that go into each partition are collected on one node, before being
written to one or more new data files.
</p>

<codeblock>[localhost:21000] > INSERT INTO airline_data.airlines
> PARTITION (year)
> SELECT
> month,
> day,
> dayofweek,
> dep_time,
> crs_dep_time,
> arr_time,
> crs_arr_time,
> carrier,
> flight_num,
> actual_elapsed_time,
> crs_elapsed_time,
> airtime,
> arrdelay,
> depdelay,
> origin,
> dest,
> distance,
> taxi_in,
> taxi_out,
> cancelled,
> cancellation_code,
> diverted,
> carrier_delay,
> weather_delay,
> nas_delay,
> security_delay,
> late_aircraft_delay,
> year
> FROM airline_data.airlines_external;
Inserted 123534969 row(s) in 202.70s
</codeblock>

<p>
Once partitioning or join queries come into play, it's important to have statistics
that Impala can use to optimize queries on the corresponding tables.
The <codeph>COMPUTE INCREMENTAL STATS</codeph> statement is the way to collect
statistics for partitioned tables.
Then the <codeph>SHOW TABLE STATS</codeph> statement confirms that the statistics
are in place for each partition, and also illustrates how many files and how much raw data
is in each partition.
</p>

<codeblock>[localhost:21000] > compute incremental stats airlines;
+-------------------------------------------+
| summary |
+-------------------------------------------+
| Updated 22 partition(s) and 27 column(s). |
+-------------------------------------------+
[localhost:21000] > show table stats airlines;
+-------+-----------+--------+----------+--------------+------------+---------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Cache Repl | Format | Incremental stats |
+-------+-----------+--------+----------+--------------+------------+---------+-----
| 1987 | 1311826 | 1 | 9.32MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1988 | 5202096 | 1 | 37.04MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1989 | 5041200 | 1 | 36.25MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1990 | 5270893 | 1 | 38.39MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1991 | 5076925 | 1 | 37.23MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1992 | 5092157 | 1 | 36.85MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1993 | 5070501 | 1 | 37.16MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1994 | 5180048 | 1 | 38.31MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1995 | 5327435 | 1 | 53.14MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1996 | 5351983 | 1 | 53.64MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1997 | 5411843 | 1 | 54.41MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1998 | 5384721 | 1 | 54.01MB | NOT CACHED | NOT CACHED | PARQUET | true
| 1999 | 5527884 | 1 | 56.32MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2000 | 5683047 | 1 | 58.15MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2001 | 5967780 | 1 | 60.65MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2002 | 5271359 | 1 | 57.99MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2003 | 6488540 | 1 | 81.33MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2004 | 7129270 | 1 | 103.19MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2005 | 7140596 | 1 | 102.61MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2006 | 7141922 | 1 | 106.03MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2007 | 7453215 | 1 | 112.15MB | NOT CACHED | NOT CACHED | PARQUET | true
| 2008 | 7009728 | 1 | 105.76MB | NOT CACHED | NOT CACHED | PARQUET | true
| Total | 123534969 | 22 | 1.30GB | 0B | | |
+-------+-----------+--------+----------+--------------+------------+---------+-----
</codeblock>

<p>
At this point, we go through a quick thought process to sanity check the partitioning we did.
All the partitions have exactly one file, which is on the low side. A query that includes a
clause <codeph>WHERE year=2004</codeph> will only read a single data block; that data block
will be read and processed by a single data node; therefore, for a query targeting a single year,
all the other nodes in the cluster sit idle while all the work happens on a single machine.
It's even possible, by chance (depending on the HDFS replication factor and the way data blocks are distributed
across the cluster), that multiple year partitions selected by a filter such as
<codeph>WHERE year BETWEEN 1999 AND 2001</codeph> could all be read and processed by the same data node.
The more data files each partition has, the more parallelism you can get and the lower the probability of <q>hotspots</q> occurring
on particular nodes, and therefore the bigger the potential performance boost from a large CDH cluster.
</p>
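
<p>
If you want to verify this kind of layout on your own cluster, you can reuse the statements shown earlier.
The following sketch (the exact output depends on your data and cluster) uses <codeph>SHOW FILES</codeph>
with a <codeph>PARTITION</codeph> clause to list the data files behind a single year, and
<codeph>EXPLAIN</codeph> to confirm that a query with a <codeph>WHERE year = ...</codeph> clause
only reads that one partition:
</p>

<codeblock>[localhost:21000] > -- List the data files that make up one partition.
[localhost:21000] > show files in airlines partition (year=2004);
[localhost:21000] > -- The query plan should show that only 1 of the 22 partitions is scanned.
[localhost:21000] > explain select sum(airtime) from airlines where year = 2004;</codeblock>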

<p>
However, the more data files, the less data goes in each one. The overhead of dividing the work in a
parallel query might not be worth it if each node is only reading a few megabytes. 50 or 100 megabytes
is a decent size for a Parquet data block; 9 or 37 megabytes is on the small side.
Which is to say, the data distribution we ended up with based on this partitioning scheme is
on the borderline between sensible (reasonably large files) and suboptimal (few files in each partition).
The way to see how well it works in practice is to run the same queries against the original
flat table and the new partitioned table, and compare times.
</p>

<p>
Spoiler: in this case, with my particular 4-node cluster with its specific distribution of data blocks and my
particular exploratory queries, queries against the partitioned table do consistently run faster than the same queries
against the unpartitioned table. But I could not be sure that would be the case without some real measurements.
Here are some queries I ran to draw that conclusion, first against <codeph>AIRLINES_EXTERNAL</codeph> (no partitioning),
then against <codeph>AIRLINES</codeph> (partitioned by year). The <codeph>AIRLINES</codeph> queries are consistently faster.
Changing the volume of data, changing the size of the cluster, running queries that did or didn't refer to the partition key
columns, or other factors could change the results to favor one table layout or the other.
</p>

<note>
If you find the volume of each partition is only in the low tens of megabytes, consider making the
partitioning coarser (less granular). For example, instead of partitioning by year, month, and day,
partition by year and month, or even just by year. The ideal layout for distributing work efficiently
in a parallel query is many tens or even hundreds of megabytes per Parquet file, with the number of
Parquet files in each partition somewhat higher than the number of data nodes.
A hypothetical example of such coarser partitioning follows this note.
</note>
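
<p>
As a sketch of what that change might look like (the table and column names here are invented for
illustration and are not something we run against this data set), a table originally partitioned by
year, month, and day could instead be declared with coarser partitioning:
</p>

<codeblock>-- Hypothetical example: partition only by year and month instead of year, month, and day,
-- so that each partition holds enough rows to produce reasonably large Parquet files.
CREATE TABLE flights_by_month
(
  day INT,
  carrier STRING,
  flight_num INT,
  airtime INT
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;</codeblock>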

<codeblock>[localhost:21000] > select sum(airtime) from airlines_external;
+--------------+
| sum(airtime) |
+--------------+
| 8662859484 |
+--------------+
Fetched 1 row(s) in 2.02s
[localhost:21000] > select sum(airtime) from airlines;
+--------------+
| sum(airtime) |
+--------------+
| 8662859484 |
+--------------+
Fetched 1 row(s) in 1.21s

[localhost:21000] > select sum(airtime) from airlines_external where year = 2005;
+--------------+
| sum(airtime) |
+--------------+
| 708204026 |
+--------------+
Fetched 1 row(s) in 2.61s
[localhost:21000] > select sum(airtime) from airlines where year = 2005;
+--------------+
| sum(airtime) |
+--------------+
| 708204026 |
+--------------+
Fetched 1 row(s) in 1.19s

[localhost:21000] > select sum(airtime) from airlines_external where dayofweek = 1;
+--------------+
| sum(airtime) |
+--------------+
| 1264945051 |
+--------------+
Fetched 1 row(s) in 2.82s
[localhost:21000] > select sum(airtime) from airlines where dayofweek = 1;
+--------------+
| sum(airtime) |
+--------------+
| 1264945051 |
+--------------+
Fetched 1 row(s) in 1.61s
</codeblock>

<p>
Now we can finally do some serious analysis with this data set; remember, a few minutes ago all we had
were some raw data files, and we did not even know what columns they contained.
Let's see whether the <q>air time</q> of a flight tends to be different depending on the
day of the week. We can see that the average is a little higher on day number 6;
perhaps Saturday is a busy flying day and planes have to circle for longer at the
destination airport before landing.
</p>

<codeblock>[localhost:21000] > select dayofweek, avg(airtime) from airlines
> group by dayofweek order by dayofweek;
+-----------+-------------------+
| dayofweek | avg(airtime) |
+-----------+-------------------+
| 1 | 102.1560425016671 |
| 2 | 102.1582931538807 |
| 3 | 102.2170009256653 |
| 4 | 102.37477661846 |
| 5 | 102.2697358763511 |
| 6 | 105.3627448363705 |
| 7 | 103.4144351202054 |
+-----------+-------------------+
Fetched 7 row(s) in 2.25s
</codeblock>

<p>
To see if the apparent trend holds up over time, let's do the same breakdown by day of week, but also
split up by year. Now we can see that day number 6 consistently has a higher average air time in each
year. We can also see that the average air time increased over time across the board. And the presence
of <codeph>NULL</codeph> for this column in years 1987 to 1994 shows that queries involving this column
need to be restricted to a date range of 1995 and higher.
</p>

<codeblock>[localhost:21000] > select year, dayofweek, avg(airtime) from airlines
> group by year, dayofweek order by year desc, dayofweek;
+------+-----------+-------------------+
| year | dayofweek | avg(airtime) |
+------+-----------+-------------------+
| 2008 | 1 | 103.1821651651355 |
| 2008 | 2 | 103.2149301386094 |
| 2008 | 3 | 103.0585076622796 |
| 2008 | 4 | 103.4671383539038 |
| 2008 | 5 | 103.5575385182659 |
| 2008 | 6 | 107.4006306562128 |
| 2008 | 7 | 104.8648851041755 |
| 2007 | 1 | 102.2196114337825 |
| 2007 | 2 | 101.9317791906348 |
| 2007 | 3 | 102.0964767689043 |
| 2007 | 4 | 102.6215927201686 |
| 2007 | 5 | 102.4289399000661 |
| 2007 | 6 | 105.1477448215756 |
| 2007 | 7 | 103.6305945644095 |
...
| 1996 | 1 | 99.33860750862108 |
| 1996 | 2 | 99.54225446396656 |
| 1996 | 3 | 99.41129336113134 |
| 1996 | 4 | 99.5110373340348 |
| 1996 | 5 | 99.22120745027595 |
| 1996 | 6 | 101.1717447111921 |
| 1996 | 7 | 99.95410136133704 |
| 1995 | 1 | 96.93779698300494 |
| 1995 | 2 | 96.93458674589712 |
| 1995 | 3 | 97.00972311337051 |
| 1995 | 4 | 96.90843832024412 |
| 1995 | 5 | 96.78382115425562 |
| 1995 | 6 | 98.70872826057003 |
| 1995 | 7 | 97.85570478374616 |
| 1994 | 1 | NULL |
| 1994 | 2 | NULL |
| 1994 | 3 | NULL |
...
| 1987 | 5 | NULL |
| 1987 | 6 | NULL |
| 1987 | 7 | NULL |
+------+-----------+-------------------+
</codeblock>

</conbody>
</concept>

<concept audience="Cloudera" id="tut_synthesizing">

<title>Synthesizing Data</title>

<conbody>

<!-- Scenarios:
INSERT VALUES 1 row or a few rows, then duplicate the values with slight alterations.
- Use row_number() and rank() to assign IDs to values from narrow tables.
- Use sum() in analytic context, with sliding window, to derive other arbitrary numbers.
Grab database, table, column, etc. names from metastore and play with those.
Maybe take results of a 'find' and play with those.
- Invent fields like 'username' based on /home or whatever subdirectories.
- Perhaps also capture permissions, timestamps, etc. via 'ls -l'.
- Illustrate narrow table with 'extension' field, wide table with 'jpg', 'Office', 'picture', 'XML', etc. fields
based on case() of extension.
Decompose some string values using regexp_extract.
Set up views to get the logical layout right before making new physical copies of data.
-->
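
<p>
As a starting point for the first scenario sketched above, the following example (the table and column
names are invented for illustration) inserts a handful of literal rows and then uses the
<codeph>row_number()</codeph> analytic function to assign synthetic numeric IDs to the values:
</p>

<codeblock>-- Hypothetical example of synthesizing data from a few literal rows.
create table seed_names (name string);
insert into seed_names values ('alpha'), ('beta'), ('gamma');

-- Assign a synthetic numeric ID to each value.
select row_number() over (order by name) as id, name
  from seed_names;</codeblock>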

<p></p>
</conbody>
</concept>
</concept>