This section includes tutorial scenarios that demonstrate how to begin using Impala once the software is installed. It focuses on techniques for loading data, because once you have some data in tables and can query that data, you can quickly progress to more advanced Impala features.
Where practical, the tutorials take you from ground zero
to having the desired Impala tables and
data. In some cases, you might need to download additional files from outside sources, set up additional
software components, modify commands or scripts to fit your own configuration, or substitute your own
sample data.
Before trying these tutorial lessons, install Impala using one of the supported installation procedures for your platform.
These tutorials demonstrate the basics of using Impala. They are intended for first-time users, and for trying out Impala on any new cluster to make sure the major components are working correctly.
This tutorial demonstrates techniques for finding your way around the tables and databases of an unfamiliar (possibly empty) Impala instance.
When you connect to an Impala instance for the first time, you use the SHOW DATABASES and SHOW TABLES statements to view the most common types of objects.
A completely empty Impala instance contains no tables, but still has two databases: default, where new tables are created when you do not specify a database, and _impala_builtins, which holds all the built-in functions.
The following example shows how to see the available databases, and the tables in each. If the list of databases or tables is long, you can use wildcard notation to locate specific databases or tables based on their names.
Once you know what tables and databases are available, you descend into a database with the USE statement.
The following example explores one of the databases found through the SHOW DATABASES statement.
When you graduate from read-only exploration, you use statements such as CREATE DATABASE and CREATE TABLE to set up your own database objects.
The following example demonstrates creating a new database holding a new table. Although the last example ended inside a particular database, the new database is not nested inside it; all databases are arranged in a single top-level list.
The following example creates a new table inside the new database.
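Here is a minimal sketch of those two steps combined; the database and table names are illustrative only, not the ones used elsewhere in this documentation.

  CREATE DATABASE experiments;
  SHOW DATABASES;
  USE experiments;
  CREATE TABLE t1 (x INT);
  SHOW TABLES;
  DESCRIBE t1;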
For your initial experiments with tables, you can use ones with just a few columns and a few rows, and text-format data files.
The following example sets up a couple of simple tables with a few rows, and performs queries involving sorting, aggregate functions and joins.
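A sketch of what such a session might look like; the table names, columns, and rows here are made up for illustration.

  CREATE TABLE t1 (id INT, val INT);
  CREATE TABLE t2 (id INT, word STRING);
  INSERT INTO t1 VALUES (1, 10), (2, 20), (3, 30);
  INSERT INTO t2 VALUES (1, 'one'), (2, 'two'), (4, 'four');
  -- Sorting:
  SELECT * FROM t1 ORDER BY val DESC;
  -- Aggregate functions:
  SELECT count(*) AS num_rows, max(val), avg(val) FROM t1;
  -- Join:
  SELECT t1.id, t2.word, t1.val FROM t1 JOIN t2 ON t1.id = t2.id;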
After completing this tutorial, you should know how to list the databases and tables in an Impala instance, examine the structure of a table, create your own databases and tables, insert small amounts of test data, and run simple queries.
This tutorial scenario illustrates the HDFS directory structures that correspond to various Impala databases, tables, and partitions. It also shows how data directories are shared between Impala and Hive, because of the shared metastore database.
In this tutorial scenario, you create a simple text-format data file in HDFS and then define an Impala table that refers to the data in its original location.
This scenario illustrates how to create some very small tables, suitable for first-time users to
experiment with Impala SQL features.
Populate HDFS with the data you want to query. To begin this process, create one or more new
subdirectories underneath your user directory in HDFS. The data for each table resides in a separate
subdirectory. Substitute your own username in the paths shown in these examples.
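For example, assuming the table directories are named tab1 and tab2 (the names are arbitrary):

  $ hdfs dfs -mkdir -p /user/username/sample_data/tab1 /user/username/sample_data/tab2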
Prepare some sample data for the two tables by copying a handful of comma-separated rows into a small .csv file for each table on your local filesystem. Then put each .csv file into the separate HDFS subdirectory for its table, as in the sketch below.
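For example, if the local files are named tab1.csv and tab2.csv (again, arbitrary names):

  $ hdfs dfs -put tab1.csv /user/username/sample_data/tab1
  $ hdfs dfs -put tab2.csv /user/username/sample_data/tab2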
The name of each data file is not significant. In fact, when Impala examines the contents of the data directory for the first time, it considers all files in the directory to make up the data of the table, regardless of how many files there are or what the files are named.
To understand what paths are available within your own HDFS filesystem and what the permissions are for
the various directories and files, issue hdfs dfs -ls commands, starting at the root and working down through the directory tree.
Use the impala-shell command to create tables, either interactively or through a SQL script.
The following example shows creating three tables. For each table, the example shows creating columns
with various attributes such as Boolean or integer types. The example also includes commands that provide
information about how the data is formatted, such as rows terminating with commas, which makes sense in
the case of importing data from comma-separated .csv files.
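A sketch of one such table definition; the column names, types, and LOCATION path are placeholders that you would adapt to your own data.

  CREATE EXTERNAL TABLE tab1
  (
    id INT,
    col_1 BOOLEAN,
    col_2 DOUBLE,
    col_3 TIMESTAMP
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/username/sample_data/tab1';

The second table can follow the same pattern with its own LOCATION, and the third can be a regular (internal) table with no LOCATION clause, to be populated later by querying the first two.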
A convenient way to set up data for Impala to access is to use an external table, where the data already
exists in a set of HDFS files and you just point the Impala table at the directory containing those
files. For example, you might run in impala-shell a SQL script that creates such an external table over an existing data directory, as in the scenario that follows.
The following examples set up 2 tables, referencing the paths and sample data from the sample TPC-DS kit for Impala.
For historical reasons, the data physically resides in an HDFS directory tree under a path associated with Hive, even though the tables in this example are created and queried through Impala.
Here is a SQL script to set up Impala tables pointing to some of these data files in HDFS.
(The script in the VM sets up tables like this through Hive; ignore those tables
for purposes of this demonstration.)
Save the following SQL statements in a script file (for example, customer_setup.sql):
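A heavily abbreviated sketch of what such a script might contain; the database name, column list, field delimiter, and LOCATION path are assumptions that you would adapt to the actual TPC-DS files.

  CREATE DATABASE IF NOT EXISTS tpcds;
  USE tpcds;
  DROP TABLE IF EXISTS customer;
  CREATE EXTERNAL TABLE customer
  (
    c_customer_sk INT,
    c_customer_id STRING
    -- ... remaining columns omitted ...
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  LOCATION '/user/hive/tpcds/customer';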
We would run this script with a command such as:
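For example, assuming the script was saved as customer_setup.sql and an impalad daemon is running on the local host:

  $ impala-shell -i localhost -f customer_setup.sql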
Now that you have updated the database metadata that Impala caches, you can confirm that the expected
tables are accessible by Impala and examine the attributes of one of the tables. We created these tables
in a database of their own, so switch to that database before examining them.
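A sketch of the kind of checks you might run; the database and table names match the hypothetical script above.

  SHOW DATABASES;
  USE tpcds;
  SHOW TABLES;
  DESCRIBE customer;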
You can query data contained in the tables. Impala coordinates the query execution across a single node or multiple nodes depending on your configuration, without the overhead of running MapReduce jobs to perform the intermediate processing.
There are a variety of ways to execute queries on Impala: interactively in the impala-shell interpreter, by passing a single statement with the impala-shell -q option, by running a script file with the impala-shell -f option, or from applications that connect through the JDBC or ODBC interfaces.
This section describes how to create some sample tables and load data into them. These tables can then be queried using the Impala shell.
Loading data involves establishing a data set (in this example, a set of .csv files), creating tables into which to load the data, and loading the data into the tables you created.
To run these sample queries, create a SQL query file, copy and paste each query into it, and run the file through impala-shell with the -f option.
The examples and results below assume you have loaded the sample data into the tables as described above.
Let's start by verifying that the tables do contain the data we expect. Because Impala often deals with tables containing millions or billions of rows, when examining tables of unknown size, include the LIMIT clause to avoid huge amounts of unnecessary output.
Results:
Results:
Results:
Query
Results:
These tutorials walk you through advanced scenarios or specialized features.
This tutorial shows how you might set up a directory tree in HDFS, put data files into the lowest-level subdirectories, and then use an Impala external table to query the data files from their original locations.
The tutorial uses a table with web log data, with separate subdirectories for the year, month, day, and host. For simplicity, we use a tiny amount of CSV data, loading the same data into each partition.
First, we make an Impala partitioned table for CSV data, and look at the underlying HDFS directory
structure, so that we can re-create the same layout elsewhere in HDFS. The regular columns correspond to fields in the CSV data files, while the year, month, day, and host values are partition key columns, represented as subdirectories rather than stored inside the files.
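A sketch of such a partitioned table; the data column names are placeholders.

  CREATE TABLE logs
  (
    field1 STRING,
    field2 STRING,
    field3 STRING
  )
  PARTITIONED BY (year STRING, month STRING, day STRING, host STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Declaring the partition key columns as STRING keeps the generated subdirectory names predictable, for example preserving leading zeros in month and day values.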
Back in the Linux shell, we examine the HDFS directory structure. (Your Impala data directory might be in
a different location; for historical reasons, it is sometimes under the HDFS path /user/hive/warehouse.)
Still in the Linux shell, we use hdfs dfs -mkdir to create several data directories outside the HDFS directory tree that Impala controls, one subdirectory for each combination of year, month, day, and host.
We make a tiny CSV file, with values different from those already in the original table, and put a copy of it within each subdirectory that we will use as an Impala partition.
Back in the impala-shell interpreter, we create an external table with a LOCATION clause pointing to the directory under which we have set up all the partition subdirectories and data files.
Because partition subdirectories and data files come and go during the data lifecycle, you must identify
each of the partitions through an ALTER TABLE ... ADD PARTITION statement before Impala recognizes the data files they contain.
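For example, assuming an external table named logs laid out as described above, with one statement per partition (the partition values shown are placeholders):

  ALTER TABLE logs ADD PARTITION (year="2013", month="07", day="28", host="host1");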
We issue a REFRESH statement for the table, which is always a safe practice when data files have been added, removed, or changed manually. Then the data is ready to be queried.
Sometimes, you might find it convenient to switch to the Hive shell to perform some data loading or transformation operation, particularly on file formats such as RCFile, SequenceFile, and Avro that Impala currently can query but not write to.
Whenever you create, drop, or alter a table or other kind of object through Hive, the next time you
switch back to the impala-shell interpreter, issue a one-time INVALIDATE METADATA statement so that Impala recognizes the new or changed object.
Whenever you load, insert, or change data in an existing table through Hive (or even through manual HDFS
operations such as the hdfs dfs -put command), the next time you switch back to the impala-shell interpreter, issue a one-time REFRESH table_name statement so that Impala recognizes the new or changed data.
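A sketch of the pattern, with hypothetical table names:

  -- After creating or altering objects through Hive:
  INVALIDATE METADATA new_table_from_hive;
  -- After loading or changing data in an existing table outside Impala:
  REFRESH existing_table;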
For examples showing how this process works for the REFRESH statement and for the INVALIDATE METADATA statement, see the sections describing those statements.
Originally, Impala did not support UDFs, but this feature is available starting in Impala 1.2. Some INSERT ... SELECT transformations that you originally did through Hive can now be done through Impala.
Prior to Impala 1.2, the REFRESH and INVALIDATE METADATA statements needed to be issued on each Impala node to which you connected and issued queries. In Impala 1.2 and higher, issuing either statement on any node broadcasts the metadata change to all the Impala nodes in the cluster, making it a truly one-time operation after each round of DDL or ETL work in Hive.
Originally, Impala restricted join queries so that they had to include at least one equality comparison between the columns of the tables on each side of the join operator. With the huge tables typically processed by Impala, any miscoded query that produced a full Cartesian product as a result set could consume a huge amount of cluster resources.
In Impala 1.2.2 and higher, this restriction is lifted when you use the CROSS JOIN operator in the query.
The following example sets up data for use in a series of comic books where characters battle each other. At first, we use an equijoin query, which only allows characters from the same time period and the same planet to meet.
Readers demanded more action, so we added elements of time travel and space travel so that any hero could face any villain. Prior to Impala 1.2.2, this type of query was impossible because all joins had to reference matching values between the two tables:
With Impala 1.2.2, we rewrite the query slightly to use CROSS JOIN rather than JOIN, and now the result set includes all combinations:
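A sketch using hypothetical table and column names rather than the ones from the original comic-book example:

  SELECT concat(h.name, ' vs. ', v.name) AS battle
    FROM heroes h CROSS JOIN villains v
   ORDER BY battle;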
The full combination of rows from both tables is known as the Cartesian product. This type of result set
is often used for creating grid data structures. You can also filter the result set by including WHERE clauses that do not explicitly compare columns between the two tables.
As data pipelines start to include more aspects such as NoSQL or loosely specified schemas, you might encounter situations where you have data files (particularly in Parquet format) where you do not know the precise table definition. This tutorial shows how you can build an Impala table around data that comes from non-Impala or even non-SQL sources, where you do not have control of the table layout and might not be familiar with the characteristics of the data.
The data used in this tutorial represents airline on-time arrival statistics, from October 1987 through April 2008.
See the details on the web site that hosts the original data set.
First, we download and unpack the data files. There are 8 files totalling 1.4 GB.
Next, we put the Parquet data files in HDFS, all together in a single
directory, with permissions on the directory and the files set so that the impala user can read them.
After unpacking, we saw the largest Parquet file was 253 MB. When
copying Parquet files into HDFS for Impala to use, for maximum query
performance, make sure that each file resides in a single HDFS data
block. Therefore, we pick a size larger than any single file and
specify that as the block size, using the -D option of the hdfs dfs -put command to override the HDFS block size for these files.
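A sketch of the copy step; the staging path and file names are assumptions, and the exact block-size property and value may need adjusting for your Hadoop version and file sizes.

  $ hdfs dfs -mkdir -p /user/impala/staging/airlines
  $ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines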
With the files in an accessible location in HDFS, you create a database table that uses the data in those files:
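One way to do this, assuming the staging directory above, is to let Impala infer the column definitions from one of the Parquet files with CREATE TABLE ... LIKE PARQUET; the database, table, and data file names here are placeholders.

  CREATE DATABASE airlines_data;
  USE airlines_data;
  CREATE EXTERNAL TABLE airlines_external
    LIKE PARQUET '/user/impala/staging/airlines/one_of_the_data_files.parq'
    STORED AS PARQUET
    LOCATION '/user/impala/staging/airlines';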
With the table created, we examine its physical and logical characteristics to confirm that the data is really there and in a format and shape that we can work with.
Now that we are confident that the connections are solid between the Impala table and the underlying Parquet files, we run some initial queries to understand the characteristics of the data: the overall number of rows, and the ranges and how many different values are in certain columns.
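A sketch of such exploratory queries; the column names are guesses about how this data set might be laid out.

  SELECT count(*) FROM airlines_external;
  SELECT NDV(carrier), NDV(flight_num), NDV(tail_num),
         NDV(origin), NDV(dest)
    FROM airlines_external;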
The NDV() function returns an estimate of the number of distinct values in a column, which is a quick way to characterize data of unknown shape.
With the above queries, we see that there are modest numbers of
different airlines, flight numbers, and origin and destination
airports. Two things jump out from this query: some of the columns have far fewer distinct values than we might expect, so we examine those columns more closely next.
Next, we try doing a simple calculation, with results broken down by year.
This reveals that some years have no data in the air time column.
With the notion of NULL values in mind, let's examine how prevalent NULLs are in some of the other columns.
By examining other columns using these techniques, we can form a mental
picture of the way data is distributed throughout the table, and which
columns are most significant for query purposes. For this tutorial, we
focus mostly on the fields likely to hold discrete values, rather than
columns whose names suggest they hold continuous measurements such as elapsed times or delays.
We could go quite far with the data in this initial raw format, just as we
downloaded it from the web. If the data set proved to be useful and
worth persisting in Impala for extensive queries, we might want to
copy it to an internal table, letting Impala manage the data files and
perhaps reorganizing a little for higher efficiency. In this next
stage of the tutorial, we copy the original data into a partitioned
table, still in Parquet format. Partitioning based on the year column lets queries that refer to a single year, or a range of years, read only the relevant portion of the data.
The first step is to create a new table with a layout very similar to the
original table, but with the year column separated out as a partition key. We can base the new definition on the DESCRIBE output for the original table.
Although we could edit that output into a new SQL statement, all the ASCII box characters
make such editing inconvenient. To get a more stripped-down listing to start from, we restart impala-shell with the -B option, which turns off the box-drawing behavior.
After copying and pasting the column list into a text editor for fine-tuning, we quit and restart impala-shell without the -B option, to switch back to regular output.
Next, we run the adapted CREATE TABLE statement, this time declaring year in a PARTITIONED BY clause rather than in the regular column list.
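A heavily abbreviated sketch; the real table has many more columns, and the names are placeholders. The important part is that year moves out of the regular column list and into the PARTITIONED BY clause.

  CREATE TABLE airlines_partitioned
  (
    month INT,
    day_of_week INT,
    carrier STRING,
    origin STRING,
    dest STRING,
    air_time INT
  )
  PARTITIONED BY (year INT)
  STORED AS PARQUET;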
Next, we copy all the rows from the original table into this new one with an INSERT ... SELECT statement, letting Impala route each row to the appropriate year partition.
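With dynamic partitioning, the partition key column goes last in the SELECT list and Impala creates the partitions as needed; the names below match the hypothetical sketches above.

  INSERT INTO airlines_partitioned PARTITION (year)
    SELECT month, day_of_week, carrier, origin, dest, air_time, year
      FROM airlines_external;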
Once partitioning or join queries come into play, it's important to have statistics
that Impala can use to optimize queries on the corresponding tables.
The COMPUTE STATS statement gathers those statistics; for a partitioned table, COMPUTE INCREMENTAL STATS can collect them incrementally as new partitions are added.
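For example, using the hypothetical table name from the sketches above; SHOW TABLE STATS then reports the row count and file count per partition.

  COMPUTE INCREMENTAL STATS airlines_partitioned;
  SHOW TABLE STATS airlines_partitioned;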
At this point, we sanity check the partitioning we did. All the partitions have exactly one file, which is on the low side. A query that includes a clause restricting it to a single year reads only one data file, which is processed by a single node while the other nodes sit idle. The more data files in each partition, the more parallelism you can get and the less probability of hotspots occurring on particular nodes, and therefore a bigger performance boost from having a big cluster.
However, the more data files, the less data goes in each one. The overhead of dividing the work in a parallel query might not be worth it if each node is only reading a few megabytes. 50 or 100 megabytes is a decent size for a Parquet data block; 9 or 37 megabytes is on the small side. Which is to say, the data distribution we ended up with based on this partitioning scheme is on the borderline between sensible (reasonably large files) and suboptimal (few files in each partition). The way to see how well it works in practice is to run the same queries against the original flat table and the new partitioned table, and compare times.
Spoiler: in this case, with my particular 4-node cluster with its specific
distribution of data blocks and my particular exploratory queries,
queries against the partitioned table do consistently run faster than
the same queries against the unpartitioned table. But I could not be
sure that would be the case without some real measurements. Here are
some queries I ran to draw that conclusion, first against the original unpartitioned table and then against the partitioned one.
Now we can finally analyze this data set in earnest; remember, we started with nothing but raw data files and did not even know what columns they contained. Let's see whether the air time of a flight tends to be different depending on the day of the week.
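A sketch of that query, again with assumed column names:

  SELECT day_of_week, avg(air_time) AS avg_air_time
    FROM airlines_partitioned
   GROUP BY day_of_week
   ORDER BY day_of_week;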
To see if the apparent trend holds up over time, let's do the same breakdown by day of week, but also
split up by year. Now we can see that day number 6 consistently has a higher average air time in each
year. We can also see that the average air time increased over time across the board. And the presence of NULL values for this column in the earlier years shows that queries involving it need to be restricted to a range of years where the data is actually populated.