As data pipelines start to include more elements, such as NoSQL systems or loosely specified schemas, you might encounter
situations where you have data files (particularly in Parquet format) for which you do not know the precise table definition.
This tutorial shows how you can build an Impala table around data that comes from non-Impala or even non-SQL sources,
where you do not have control of the table layout and might not be familiar with the characteristics of the data.
The data used in this tutorial represents airline on-time arrival statistics, from October 1987 through April 2008.
See the details on the 2009 ASA Data Expo web site.
You can also see the explanations of the columns;
for purposes of this exercise, wait until you have finished the tutorial before examining the schema, to better simulate
a real-life situation where you cannot rely on assumptions and assertions about the ranges and representations of
data values.
We will download Parquet files containing this data from the Ibis blog.
First, we download and unpack the data files. There are 8 files totaling 1.4 GB, and each file is less than 256 MB.
$ wget -O airlines_parquet.tar.gz https://www.dropbox.com/s/ol9x51tqp6cv4yc/airlines_parquet.tar.gz?dl=0
...
Length: 1245204740 (1.2G) [application/octet-stream]
Saving to: “airlines_parquet.tar.gz”
2015-08-12 17:14:24 (23.6 MB/s) - “airlines_parquet.tar.gz” saved [1245204740/1245204740]
$ tar xvzf airlines_parquet.tar.gz
airlines_parquet/
airlines_parquet/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
airlines_parquet/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
airlines_parquet/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
airlines_parquet/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
$ cd airlines_parquet/
$ du -kch *.parq
253M 93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
65M 93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
156M 93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
240M 93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
253M 93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
16M 93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
177M 93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
213M 93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
1.4G total
Next, we put the Parquet data files in HDFS, all together in a single directory,
with permissions on the directory and the files so that the impala
user will be able to read them.
After unpacking, we saw that the largest Parquet file was 253 MB.
When copying Parquet files into HDFS for Impala to use,
for maximum query performance, make sure that each file resides in a single HDFS data block.
Therefore, we pick a size larger than any single file and specify that as the block size, using the argument
-Ddfs.block.size=256m on the hdfs dfs -put command.
$ hdfs dfs -mkdir -p hdfs://demo_host.example.com:8020/user/impala/staging/airlines
$ hdfs dfs -Ddfs.block.size=256m -put *.parq /user/impala/staging/airlines
$ hdfs dfs -ls /user/impala/staging
Found 1 items
drwxrwxrwx - hdfs supergroup 0 2015-08-12 13:52 /user/impala/staging/airlines
$ hdfs dfs -ls hdfs://demo_host.example.com:8020/user/impala/staging/airlines
Found 8 items
-rw-r--r-- 3 jrussell supergroup 265107489 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq
-rw-r--r-- 3 jrussell supergroup 67544715 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq
-rw-r--r-- 3 jrussell supergroup 162556490 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq
-rw-r--r-- 3 jrussell supergroup 251603518 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq
-rw-r--r-- 3 jrussell supergroup 265186603 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq
-rw-r--r-- 3 jrussell supergroup 16663754 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq
-rw-r--r-- 3 jrussell supergroup 185511677 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq
-rw-r--r-- 3 jrussell supergroup 222794621 2015-08-12 17:18 /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq
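As an optional sanity check (the exact report layout varies across Hadoop versions), the hdfs fsck command can confirm
that each Parquet file really occupies a single HDFS block:
$ hdfs fsck /user/impala/staging/airlines -files -blocks
In the per-file lines of the report, each .parq file should show a single block, confirming that the
-Ddfs.block.size=256m setting took effect and no file was split across blocks.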
With the files in an accessible location in HDFS, we create a database table that uses the data in those files.
The CREATE EXTERNAL TABLE syntax and the LOCATION attribute point Impala at the appropriate HDFS directory.
The LIKE PARQUET 'path_to_any_parquet_file' clause means we skip the list of column names and types;
Impala automatically gets the column names and data types straight from the data files.
(Currently, this technique only works for Parquet files.)
We ignore the warning about lack of READ_WRITE access to the files in HDFS;
the impala user can read the files, which will be sufficient for us to experiment with
queries and perform some copy and transform operations into other tables.
$ impala-shell -i localhost
Starting Impala Shell without Kerberos authentication
Connected to localhost:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
...
[localhost:21000] > create database airline_data;
[localhost:21000] > use airline_data;
[localhost:21000] > create external table airlines_external
> like parquet 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq'
> stored as parquet location 'hdfs://demo_host.example.com:8020/user/impala/staging/airlines';
WARNINGS: Impala does not have READ_WRITE access to path 'hdfs://demo_host.example.com:8020/user/impala/staging'
With the table created, we examine its physical and logical characteristics to confirm that the data is really
there and in a format and shape that we can work with.
The SHOW TABLE STATS statement gives a very high-level summary of the table,
showing how many files and how much total data it contains.
Also, it confirms that the table is expecting all the associated data files to be in Parquet format.
(The ability to work with all kinds of HDFS data files in different formats means that it is
possible to have a mismatch between the format of the data files, and the format
that the table expects the data files to be in.)
The SHOW FILES statement confirms that the data in the table has the expected number,
names, and sizes of the original Parquet files.
The DESCRIBE statement (or its abbreviation DESC) confirms the names and types
of the columns that Impala automatically created after reading the metadata from the Parquet file.
The DESCRIBE FORMATTED statement prints out some extra detail along with the column definitions;
the pieces we care about for this exercise are the containing database for the table,
the location of the associated data files in HDFS, the fact that it's an external table so Impala will not
delete the HDFS files when we finish the experiments and drop the table, and the fact that the
table is set up to work exclusively with files in the Parquet format.
[localhost:21000] > show table stats airlines_external;
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| #Rows | #Files | Size | Bytes Cached | Cache Replication | Format | Incremental stats |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
| -1 | 8 | 1.34GB | NOT CACHED | NOT CACHED | PARQUET | false |
+-------+--------+--------+--------------+-------------------+---------+-------------------+
[localhost:21000] > show files in airlines_external;
+----------------------------------------------------------------------------------------+----------+-----------+
| path | size | partition |
+----------------------------------------------------------------------------------------+----------+-----------+
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.0.parq | 252.83MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa96_2118228804_data.1.parq | 64.42MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa97_574780876_data.0.parq | 155.03MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa98_1194408366_data.0.parq | 239.95MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.0.parq | 252.90MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa99_1555718317_data.1.parq | 15.89MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9a_2073981944_data.0.parq | 176.92MB | |
| /user/impala/staging/airlines/93459d994898a9ba-77674173b331fa9b_1413430552_data.0.parq | 212.47MB | |
+----------------------------------------------------------------------------------------+----------+-----------+
[localhost:21000] > describe airlines_external;
+---------------------+--------+---------------------------------------------------+
| name | type | comment |
+---------------------+--------+---------------------------------------------------+
| year | int | inferred from: optional int32 year |
| month | int | inferred from: optional int32 month |
| day | int | inferred from: optional int32 day |
| dayofweek | int | inferred from: optional int32 dayofweek |
| dep_time | int | inferred from: optional int32 dep_time |
| crs_dep_time | int | inferred from: optional int32 crs_dep_time |
| arr_time | int | inferred from: optional int32 arr_time |
| crs_arr_time | int | inferred from: optional int32 crs_arr_time |
| carrier | string | inferred from: optional binary carrier |
| flight_num | int | inferred from: optional int32 flight_num |
| tail_num | int | inferred from: optional int32 tail_num |
| actual_elapsed_time | int | inferred from: optional int32 actual_elapsed_time |
| crs_elapsed_time | int | inferred from: optional int32 crs_elapsed_time |
| airtime | int | inferred from: optional int32 airtime |
| arrdelay | int | inferred from: optional int32 arrdelay |
| depdelay | int | inferred from: optional int32 depdelay |
| origin | string | inferred from: optional binary origin |
| dest | string | inferred from: optional binary dest |
| distance | int | inferred from: optional int32 distance |
| taxi_in | int | inferred from: optional int32 taxi_in |
| taxi_out | int | inferred from: optional int32 taxi_out |
| cancelled | int | inferred from: optional int32 cancelled |
| cancellation_code | string | inferred from: optional binary cancellation_code |
| diverted | int | inferred from: optional int32 diverted |
| carrier_delay | int | inferred from: optional int32 carrier_delay |
| weather_delay | int | inferred from: optional int32 weather_delay |
| nas_delay | int | inferred from: optional int32 nas_delay |
| security_delay | int | inferred from: optional int32 security_delay |
| late_aircraft_delay | int | inferred from: optional int32 late_aircraft_delay |
+---------------------+--------+---------------------------------------------------+
[localhost:21000] > desc formatted airlines_external;
+------------------------------+-------------------------------
| name | type
+------------------------------+-------------------------------
...
| # Detailed Table Information | NULL
| Database: | airline_data
| Owner: | jrussell
...
| Location: | /user/impala/staging/airlines
| Table Type: | EXTERNAL_TABLE
...
| # Storage Information | NULL
| SerDe Library: | parquet.hive.serde.ParquetHiveSerDe
| InputFormat: | parquet.hive.DeprecatedParquetInputFormat
| OutputFormat: | parquet.hive.DeprecatedParquetOutputFormat
...
Now that we are confident that the connections are solid between the Impala table and the
underlying Parquet files, we run some initial queries to understand the characteristics
of the data: the overall number of rows, and the ranges and how many
different values are in certain columns.
For convenience in understanding the magnitude of the COUNT(*)
result, we run another query dividing the number of rows by 1 million, demonstrating that there are 123 million rows in the table.
[localhost:21000] > select count(*) from airlines_external;
+-----------+
| count(*) |
+-----------+
| 123534969 |
+-----------+
Fetched 1 row(s) in 1.32s
[localhost:21000] > select count(*) / 1e6 as 'millions of rows' from airlines_external;
+------------------+
| millions of rows |
+------------------+
| 123.534969 |
+------------------+
Fetched 1 row(s) in 1.24s
The NDV() function stands for "number of distinct values", which for performance reasons is an estimate when there
are lots of different values in the column, but is precise when the cardinality is less than 16 K.
Use NDV() calls for this kind of exploration rather than COUNT(DISTINCT colname), because Impala can evaluate
multiple NDV() functions in a single query, but only a single instance of COUNT(DISTINCT).
Here we see that there are modest numbers of different airlines, flight numbers, and origin and destination airports.
Two things jump out from this query: the number of tail_num values is much smaller than we might have expected,
and there are more destination airports than origin airports. Let's dig further.
What we find is that most tail_num values are NULL. It looks like this was an experimental column that
wasn't filled in accurately. We make a mental note that if we use this data as a starting point, we'll ignore
this column. We also find that certain airports are represented in the ORIGIN column but not the DEST column;
now we know that we cannot rely on the assumption that those sets of airport codes are identical.
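As a concrete illustration of that limitation, consider the following sketch (hypothetical statements, not part of the
original session). The first query is fine because it uses NDV(); the second is rejected by default, because Impala
requires all COUNT(DISTINCT) expressions in a query to refer to the same column or columns. Enabling the
APPX_COUNT_DISTINCT query option makes Impala silently rewrite COUNT(DISTINCT) into NDV(), trading exactness for the
ability to combine several of them in one query:
[localhost:21000] > select ndv(origin), ndv(dest) from airlines_external;   -- OK: any number of NDV() calls
[localhost:21000] > select count(distinct origin), count(distinct dest)
                  > from airlines_external;                                 -- rejected with the default settings
[localhost:21000] > set APPX_COUNT_DISTINCT=true;                           -- now the rejected query runs, evaluated via NDV()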
A slight digression for some performance tuning. Notice how the first
SELECT DISTINCT DEST query takes almost 40 seconds.
We expect all queries on such a small data set, less than 2 GB, to
take a few seconds at most. The reason is that the expression
NOT IN (SELECT origin FROM airlines_external)
produces an intermediate result set of 123 million rows, then
runs 123 million comparisons on each data node against the tiny set of destination airports.
The way the NOT IN operator works internally means that
this intermediate result set with 123 million rows might be transmitted
across the network to each data node in the cluster.
Applying another DISTINCT inside the NOT IN
subquery means that the intermediate result set is only 340 items,
resulting in much less network traffic and fewer comparison operations.
The more efficient query with the added DISTINCT is approximately 7 times as fast.
[localhost:21000] > select ndv(carrier), ndv(flight_num), ndv(tail_num),
> ndv(origin), ndv(dest) from airlines_external;
+--------------+-----------------+---------------+-------------+-----------+
| ndv(carrier) | ndv(flight_num) | ndv(tail_num) | ndv(origin) | ndv(dest) |
+--------------+-----------------+---------------+-------------+-----------+
| 29 | 9086 | 3 | 340 | 347 |
+--------------+-----------------+---------------+-------------+-----------+
[localhost:21000] > select tail_num, count(*) as howmany from airlines_external
> group by tail_num;
+----------+-----------+
| tail_num | howmany |
+----------+-----------+
| 715 | 1 |
| 0 | 406405 |
| 112 | 6562 |
| NULL | 123122001 |
+----------+-----------+
Fetched 4 row(s) in 5.18s
[localhost:21000] > select distinct dest from airlines_external
> where dest not in (select origin from airlines_external);
+------+
| dest |
+------+
| LBF |
| CBM |
| RCA |
| SKA |
| LAR |
+------+
Fetched 5 row(s) in 39.64s
[localhost:21000] > select distinct dest from airlines_external
> where dest not in (select distinct origin from airlines_external);
+------+
| dest |
+------+
| LBF |
| RCA |
| CBM |
| SKA |
| LAR |
+------+
Fetched 5 row(s) in 5.59s
[localhost:21000] > select distinct origin from airlines_external
> where origin not in (select distinct dest from airlines_external);
Fetched 0 row(s) in 5.37s
Next, we try doing a simple calculation, with results broken down by year.
This reveals that some years have no data in the
AIRTIME column. That means we might be able to use
that column in queries involving certain date ranges, but we cannot
count on it to always be reliable. The question of whether a column
contains any NULL values, and if so what is their
number, proportion, and distribution, comes up again and again when
doing initial exploration of a data set.
[localhost:21000] > select year, sum(airtime) from airlines_external
> group by year order by year desc;
+------+--------------+
| year | sum(airtime) |
+------+--------------+
| 2008 | 713050445 |
| 2007 | 748015545 |
| 2006 | 720372850 |
| 2005 | 708204026 |
| 2004 | 714276973 |
| 2003 | 665706940 |
| 2002 | 549761849 |
| 2001 | 590867745 |
| 2000 | 583537683 |
| 1999 | 561219227 |
| 1998 | 538050663 |
| 1997 | 536991229 |
| 1996 | 519440044 |
| 1995 | 513364265 |
| 1994 | NULL |
| 1993 | NULL |
| 1992 | NULL |
| 1991 | NULL |
| 1990 | NULL |
| 1989 | NULL |
| 1988 | NULL |
| 1987 | NULL |
+------+--------------+
With the notion of NULL values in mind, let's come back to the TAIL_NUM
column that we discovered had a lot of NULLs.
Let's quantify the NULL and non-NULL values in that column for better understanding.
First, we just count the overall number of rows versus the non-NULL values in that column.
That initial result gives the appearance of relatively few non-NULL values, but we can break
it down more clearly in a single query.
Once we have the COUNT(*) and the COUNT(colname) numbers,
we can encode that initial query in a WITH clause, then run a follow-on query that performs
multiple arithmetic operations on those values.
Seeing that only one-third of one percent of all rows have non-NULL values for the
TAIL_NUM column clearly illustrates that the column is not of much use.
[localhost:21000] > select count(*) as 'rows', count(tail_num) as 'non-null tail numbers'
> from airlines_external;
+-----------+-----------------------+
| rows | non-null tail numbers |
+-----------+-----------------------+
| 123534969 | 412968 |
+-----------+-----------------------+
Fetched 1 row(s) in 1.51s
[localhost:21000] > with t1 as
> (select count(*) as 'rows', count(tail_num) as 'nonnull'
> from airlines_external)
> select `rows`, `nonnull`, `rows` - `nonnull` as 'nulls',
> (`nonnull` / `rows`) * 100 as 'percentage non-null'
> from t1;
+-----------+---------+-----------+---------------------+
| rows | nonnull | nulls | percentage non-null |
+-----------+---------+-----------+---------------------+
| 123534969 | 412968 | 123122001 | 0.3342923897119365 |
+-----------+---------+-----------+---------------------+
By examining other columns using these techniques, we can form a mental picture of the way data is distributed
throughout the table, and which columns are most significant for query purposes. For this tutorial, we focus mostly on
the fields likely to hold discrete values, rather than columns such as ACTUAL_ELAPSED_TIME
whose names suggest they hold measurements. We would dig deeper into those columns once we had a clear picture
of which questions were worthwhile to ask, and what kinds of trends we might look for.
For the final piece of initial exploration, let's look at the YEAR column.
A simple GROUP BY query shows that it has a well-defined range, a manageable number of
distinct values, and relatively even distribution of rows across the different years.
[localhost:21000] > select min(year), max(year), ndv(year) from airlines_external;
+-----------+-----------+-----------+
| min(year) | max(year) | ndv(year) |
+-----------+-----------+-----------+
| 1987 | 2008 | 22 |
+-----------+-----------+-----------+
Fetched 1 row(s) in 2.03s
[localhost:21000] > select year, count(*) howmany from airlines_external
> group by year order by year desc;
+------+---------+
| year | howmany |
+------+---------+
| 2008 | 7009728 |
| 2007 | 7453215 |
| 2006 | 7141922 |
| 2005 | 7140596 |
| 2004 | 7129270 |
| 2003 | 6488540 |
| 2002 | 5271359 |
| 2001 | 5967780 |
| 2000 | 5683047 |
| 1999 | 5527884 |
| 1998 | 5384721 |
| 1997 | 5411843 |
| 1996 | 5351983 |
| 1995 | 5327435 |
| 1994 | 5180048 |
| 1993 | 5070501 |
| 1992 | 5092157 |
| 1991 | 5076925 |
| 1990 | 5270893 |
| 1989 | 5041200 |
| 1988 | 5202096 |
| 1987 | 1311826 |
+------+---------+
Fetched 22 row(s) in 2.13s
We could go quite far with the data in this initial raw format, just as we downloaded it from the web.
If the data set proved to be useful and worth persisting in Impala for extensive queries,
we might want to copy it to an internal table, letting Impala manage the data files and perhaps
reorganizing a little for higher efficiency.
In this next stage of the tutorial, we copy the original data into a partitioned table, still in Parquet format.
Partitioning based on the YEAR column lets us run queries with clauses such as WHERE year = 2001
or WHERE year BETWEEN 1989 AND 1999, which can dramatically cut down on I/O by
ignoring all the data from years outside the desired range.
Rather than reading all the data and then deciding which rows are in the matching years, Impala can
zero in on only the data files from specific YEAR partitions.
To do this, Impala physically reorganizes the data files, putting the rows from each year into
data files in a separate HDFS directory for each YEAR value.
Along the way, we'll also get rid of the TAIL_NUM column that proved to be almost entirely NULL.
The first step is to create a new table with a layout very similar to the original AIRLINES_EXTERNAL table.
We'll do that by reverse-engineering a CREATE TABLE statement for the first table,
then tweaking it slightly to include a PARTITIONED BY clause for YEAR,
and excluding the TAIL_NUM column.
The SHOW CREATE TABLE statement gives us the starting point.
[localhost:21000] > show create table airlines_external;
+-------------------------------------------------------------------------------------
| result
+-------------------------------------------------------------------------------------
| CREATE EXTERNAL TABLE airline_data.airlines_external (
| year INT COMMENT 'inferred from: optional int32 year',
| month INT COMMENT 'inferred from: optional int32 month',
| day INT COMMENT 'inferred from: optional int32 day',
| dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
| dep_time INT COMMENT 'inferred from: optional int32 dep_time',
| crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
| arr_time INT COMMENT 'inferred from: optional int32 arr_time',
| crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
| carrier STRING COMMENT 'inferred from: optional binary carrier',
| flight_num INT COMMENT 'inferred from: optional int32 flight_num',
| tail_num INT COMMENT 'inferred from: optional int32 tail_num',
| actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
| crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
| airtime INT COMMENT 'inferred from: optional int32 airtime',
| arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
| depdelay INT COMMENT 'inferred from: optional int32 depdelay',
| origin STRING COMMENT 'inferred from: optional binary origin',
| dest STRING COMMENT 'inferred from: optional binary dest',
| distance INT COMMENT 'inferred from: optional int32 distance',
| taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
| taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
| cancelled INT COMMENT 'inferred from: optional int32 cancelled',
| cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
| diverted INT COMMENT 'inferred from: optional int32 diverted',
| carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
| weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
| nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
| security_delay INT COMMENT 'inferred from: optional int32 security_delay',
| late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
| )
| STORED AS PARQUET
| LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines'
| TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
| 'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
| 'rawDataSize'='-1')
+-------------------------------------------------------------------------------------
Fetched 1 row(s) in 0.03s
Although we could edit that output into a new SQL statement, all the ASCII box characters
make such editing inconvenient. To get a more stripped-down CREATE TABLE
to start with, we restart the impala-shell command with the
-B option, which turns off the box-drawing behavior.
[localhost:21000] > quit;
Goodbye jrussell
$ impala-shell -i localhost -B -d airline_data;
Starting Impala Shell without Kerberos authentication
Connected to localhost:21000
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
...
[localhost:21000] > show create table airlines_external;
"CREATE EXTERNAL TABLE airline_data.airlines_external (
year INT COMMENT 'inferred from: optional int32 year',
month INT COMMENT 'inferred from: optional int32 month',
day INT COMMENT 'inferred from: optional int32 day',
dayofweek INT COMMENT 'inferred from: optional int32 dayofweek',
dep_time INT COMMENT 'inferred from: optional int32 dep_time',
crs_dep_time INT COMMENT 'inferred from: optional int32 crs_dep_time',
arr_time INT COMMENT 'inferred from: optional int32 arr_time',
crs_arr_time INT COMMENT 'inferred from: optional int32 crs_arr_time',
carrier STRING COMMENT 'inferred from: optional binary carrier',
flight_num INT COMMENT 'inferred from: optional int32 flight_num',
tail_num INT COMMENT 'inferred from: optional int32 tail_num',
actual_elapsed_time INT COMMENT 'inferred from: optional int32 actual_elapsed_time',
crs_elapsed_time INT COMMENT 'inferred from: optional int32 crs_elapsed_time',
airtime INT COMMENT 'inferred from: optional int32 airtime',
arrdelay INT COMMENT 'inferred from: optional int32 arrdelay',
depdelay INT COMMENT 'inferred from: optional int32 depdelay',
origin STRING COMMENT 'inferred from: optional binary origin',
dest STRING COMMENT 'inferred from: optional binary dest',
distance INT COMMENT 'inferred from: optional int32 distance',
taxi_in INT COMMENT 'inferred from: optional int32 taxi_in',
taxi_out INT COMMENT 'inferred from: optional int32 taxi_out',
cancelled INT COMMENT 'inferred from: optional int32 cancelled',
cancellation_code STRING COMMENT 'inferred from: optional binary cancellation_code',
diverted INT COMMENT 'inferred from: optional int32 diverted',
carrier_delay INT COMMENT 'inferred from: optional int32 carrier_delay',
weather_delay INT COMMENT 'inferred from: optional int32 weather_delay',
nas_delay INT COMMENT 'inferred from: optional int32 nas_delay',
security_delay INT COMMENT 'inferred from: optional int32 security_delay',
late_aircraft_delay INT COMMENT 'inferred from: optional int32 late_aircraft_delay'
)
STORED AS PARQUET
LOCATION 'hdfs://a1730.example.com:8020/user/impala/staging/airlines'
TBLPROPERTIES ('numFiles'='0', 'COLUMN_STATS_ACCURATE'='false',
'transient_lastDdlTime'='1439425228', 'numRows'='-1', 'totalSize'='0',
'rawDataSize'='-1')"
Fetched 1 row(s) in 0.01s
After copying and pasting the CREATE TABLE statement into a text editor for fine-tuning, we quit and restart impala-shell
without the -B option, to switch back to regular
output.
Next we run the CREATE TABLE statement that we adapted from the SHOW CREATE TABLE output.
We kept the STORED AS PARQUET clause because
we want to rearrange the data somewhat but still keep it in the high-performance
Parquet format. The LOCATION and TBLPROPERTIES
clauses are not relevant for this new table, so we edit those out.
Because we are going to partition the new table based on the YEAR
column, we move that column name (and its type) into a new PARTITIONED BY clause.
[localhost:21000] > CREATE TABLE airline_data.airlines
> (
> month INT,
> day INT,
> dayofweek INT,
> dep_time INT,
> crs_dep_time INT,
> arr_time INT,
> crs_arr_time INT,
> carrier STRING,
> flight_num INT,
> actual_elapsed_time INT,
> crs_elapsed_time INT,
> airtime INT,
> arrdelay INT,
> depdelay INT,
> origin STRING,
> dest STRING,
> distance INT,
> taxi_in INT,
> taxi_out INT,
> cancelled INT,
> cancellation_code STRING,
> diverted INT,
> carrier_delay INT,
> weather_delay INT,
> nas_delay INT,
> security_delay INT,
> late_aircraft_delay INT
> )
> STORED AS PARQUET
> PARTITIONED BY (year INT);
Fetched 0 row(s) in 0.10s
Next, we copy all the rows from the original table into this new one with
an INSERT statement. (We edited the CREATE TABLE
statement to make an INSERT statement with the column names
in the same order.) The only change is to add a PARTITION(year)
clause, and move the YEAR column to the very end of the
SELECT list of the INSERT statement.
Specifying PARTITION(year), rather than a fixed value
such as PARTITION(year=2000), means that Impala figures
out the partition value for each row based on the value of the very last column
in the SELECT list.
This is the first SQL statement that legitimately takes any substantial time,
because the rows from different years are shuffled around the cluster;
the rows that go into each partition are collected on one node, before being
written to one or more new data files.
[localhost:21000] > INSERT INTO airline_data.airlines
> PARTITION (year)
> SELECT
> month,
> day,
> dayofweek,
> dep_time,
> crs_dep_time,
> arr_time,
> crs_arr_time,
> carrier,
> flight_num,
> actual_elapsed_time,
> crs_elapsed_time,
> airtime,
> arrdelay,
> depdelay,
> origin,
> dest,
> distance,
> taxi_in,
> taxi_out,
> cancelled,
> cancellation_code,
> diverted,
> carrier_delay,
> weather_delay,
> nas_delay,
> security_delay,
> late_aircraft_delay,
> year
> FROM airline_data.airlines_external;
Inserted 123534969 row(s) in 202.70s
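As an aside about this kind of statement: Impala accepts [SHUFFLE] and [NOSHUFFLE] hints between the PARTITION clause
and the SELECT keyword of a partitioned INSERT ... SELECT, telling the planner whether to redistribute rows by the
partition key columns before writing. The following is only a sketch (we did not need a hint here); forcing a shuffle
tends to produce fewer, larger files per partition at the cost of an extra network exchange:
[localhost:21000] > INSERT INTO airline_data.airlines PARTITION (year) [SHUFFLE]
                  > SELECT /* same column list as above, with year last */ ...
                  > FROM airline_data.airlines_external;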
Once partitioning or join queries come into play, it's important to have statistics
that Impala can use to optimize queries on the corresponding tables.
The COMPUTE INCREMENTAL STATS statement is the way to collect
statistics for partitioned tables.
Then the SHOW TABLE STATS statement confirms that the statistics
are in place for each partition, and also illustrates how many files and how much raw data
is in each partition.
[localhost:21000] > compute incremental stats airlines;
+-------------------------------------------+
| summary |
+-------------------------------------------+
| Updated 22 partition(s) and 27 column(s). |
+-------------------------------------------+
[localhost:21000] > show table stats airlines;
+-------+-----------+--------+----------+--------------+------------+---------+-------------------+
| year | #Rows | #Files | Size | Bytes Cached | Cache Repl | Format | Incremental stats |
+-------+-----------+--------+----------+--------------+------------+---------+-------------------+
| 1987  | 1311826   | 1      | 9.32MB   | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1988  | 5202096   | 1      | 37.04MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1989  | 5041200   | 1      | 36.25MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1990  | 5270893   | 1      | 38.39MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1991  | 5076925   | 1      | 37.23MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1992  | 5092157   | 1      | 36.85MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1993  | 5070501   | 1      | 37.16MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1994  | 5180048   | 1      | 38.31MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1995  | 5327435   | 1      | 53.14MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1996  | 5351983   | 1      | 53.64MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1997  | 5411843   | 1      | 54.41MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1998  | 5384721   | 1      | 54.01MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 1999  | 5527884   | 1      | 56.32MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2000  | 5683047   | 1      | 58.15MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2001  | 5967780   | 1      | 60.65MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2002  | 5271359   | 1      | 57.99MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2003  | 6488540   | 1      | 81.33MB  | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2004  | 7129270   | 1      | 103.19MB | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2005  | 7140596   | 1      | 102.61MB | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2006  | 7141922   | 1      | 106.03MB | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2007  | 7453215   | 1      | 112.15MB | NOT CACHED   | NOT CACHED | PARQUET | true              |
| 2008  | 7009728   | 1      | 105.76MB | NOT CACHED   | NOT CACHED | PARQUET | true              |
| Total | 123534969 | 22     | 1.30GB   | 0B           |            |         |                   |
+-------+-----------+--------+----------+--------------+------------+---------+-------------------+
At this point, we go through a quick thought process to sanity check the partitioning we did.
All the partitions have exactly one file, which is on the low side. A query that includes a
clause WHERE year=2004 will only read a single data block; that data block
will be read and processed by a single data node; therefore, for a query targeting a single year,
all the other nodes in the cluster will sit idle while all the work happens on a single machine.
It's even possible that, by chance (depending on the HDFS replication factor and the way data blocks are distributed
across the cluster), multiple year partitions selected by a filter such as
WHERE year BETWEEN 1999 AND 2001 could all be read and processed by the same data node.
The more data files each partition has, the more parallelism you can get and the lower the probability of hotspots
on particular nodes, and therefore the bigger the performance boost from having a large cluster.
However, the more data files, the less data goes in each one. The overhead of dividing the work in a
parallel query might not be worth it if each node is only reading a few megabytes. 50 or 100 megabytes
is a decent size for a Parquet data block; 9 or 37 megabytes is on the small side.
Which is to say, the data distribution we ended up with based on this partitioning scheme is
on the borderline between sensible (reasonably large files) and suboptimal (few files in each partition).
The way to see how well it works in practice is to run the same queries against the original
flat table and the new partitioned table, and compare times.
Spoiler: in this case, with my particular 4-node cluster with its specific distribution of data blocks and my
particular exploratory queries, queries against the partitioned table do consistently run faster than the same queries
against the unpartitioned table. But I could not be sure that would be the case without some real measurements.
Here are some queries I ran to draw that conclusion, first against AIRLINES_EXTERNAL (no partitioning),
then against AIRLINES (partitioned by year). The AIRLINES queries are consistently faster.
Changing the volume of data, changing the size of the cluster, running queries that did or didn't refer to the partition key
columns, or other factors could change the results to favor one table layout or the other.
If you find the volume of each partition is only in the low tens of megabytes, consider lowering the granularity
of partitioning. For example, instead of partitioning by year, month, and day, partition by year and month
or even just by year. The ideal layout to distribute work efficiently in a parallel query is many tens or
even hundreds of megabytes per Parquet file, and the number of Parquet files in each partition somewhat
higher than the number of data nodes.
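To make that advice concrete, here is a hypothetical sketch (neither table name appears elsewhere in this tutorial)
showing a layout that would be too fine-grained for this data volume, and a coarser alternative; only the
PARTITIONED BY clause differs from the CREATE TABLE statement we ran earlier:
-- Too fine: 22 years x 12 months x ~30 days is roughly 8,000 tiny partitions.
CREATE TABLE airlines_by_day (/* same non-partition columns as AIRLINES */ ...)
  STORED AS PARQUET PARTITIONED BY (year INT, month INT, day INT);
-- Coarser: tens of megabytes per partition, closer to the AIRLINES table above.
CREATE TABLE airlines_by_month (/* same non-partition columns as AIRLINES */ ...)
  STORED AS PARQUET PARTITIONED BY (year INT, month INT);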
[localhost:21000] > select sum(airtime) from airlines_external;
+--------------+
| sum(airtime) |
+--------------+
| 8662859484 |
+--------------+
Fetched 1 row(s) in 2.02s
[localhost:21000] > select sum(airtime) from airlines;
+--------------+
| sum(airtime) |
+--------------+
| 8662859484 |
+--------------+
Fetched 1 row(s) in 1.21s
[localhost:21000] > select sum(airtime) from airlines_external where year = 2005;
+--------------+
| sum(airtime) |
+--------------+
| 708204026 |
+--------------+
Fetched 1 row(s) in 2.61s
[localhost:21000] > select sum(airtime) from airlines where year = 2005;
+--------------+
| sum(airtime) |
+--------------+
| 708204026 |
+--------------+
Fetched 1 row(s) in 1.19s
[localhost:21000] > select sum(airtime) from airlines_external where dayofweek = 1;
+--------------+
| sum(airtime) |
+--------------+
| 1264945051 |
+--------------+
Fetched 1 row(s) in 2.82s
[localhost:21000] > select sum(airtime) from airlines where dayofweek = 1;
+--------------+
| sum(airtime) |
+--------------+
| 1264945051 |
+--------------+
Fetched 1 row(s) in 1.61s
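Timings aside, you can also ask Impala to show the partition pruning directly with the EXPLAIN statement (a quick
check; the exact plan text varies between Impala versions):
[localhost:21000] > explain select sum(airtime) from airlines where year = 2005;
In the plan output, the HDFS scan line for the partitioned table reports something like partitions=1/22, meaning 21
of the 22 year partitions are never even opened, whereas the same query against AIRLINES_EXTERNAL has to scan all of
the data files.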
Now we can finally do some serious analysis with this data set. Remember, a few minutes ago all we had
were some raw data files, and we didn't even know what columns they contained.
Let's see whether the air time
of a flight tends to be different depending on the
day of the week. We can see that the average is a little higher on day number 6;
perhaps Saturday is a busy flying day and planes have to circle for longer at the
destination airport before landing.
[localhost:21000] > select dayofweek, avg(airtime) from airlines
> group by dayofweek order by dayofweek;
+-----------+-------------------+
| dayofweek | avg(airtime) |
+-----------+-------------------+
| 1 | 102.1560425016671 |
| 2 | 102.1582931538807 |
| 3 | 102.2170009256653 |
| 4 | 102.37477661846 |
| 5 | 102.2697358763511 |
| 6 | 105.3627448363705 |
| 7 | 103.4144351202054 |
+-----------+-------------------+
Fetched 7 row(s) in 2.25s
To see if the apparent trend holds up over time, let's do the same breakdown by day of week, but also
split up by year. Now we can see that day number 6 consistently has a higher average air time in each
year. We can also see that the average air time increased over time across the board. And the presence
of NULL for this column in years 1987 to 1994 shows that queries involving this column
need to be restricted to a date range of 1995 and later.
[localhost:21000] > select year, dayofweek, avg(airtime) from airlines
> group by year, dayofweek order by year desc, dayofweek;
+------+-----------+-------------------+
| year | dayofweek | avg(airtime) |
+------+-----------+-------------------+
| 2008 | 1 | 103.1821651651355 |
| 2008 | 2 | 103.2149301386094 |
| 2008 | 3 | 103.0585076622796 |
| 2008 | 4 | 103.4671383539038 |
| 2008 | 5 | 103.5575385182659 |
| 2008 | 6 | 107.4006306562128 |
| 2008 | 7 | 104.8648851041755 |
| 2007 | 1 | 102.2196114337825 |
| 2007 | 2 | 101.9317791906348 |
| 2007 | 3 | 102.0964767689043 |
| 2007 | 4 | 102.6215927201686 |
| 2007 | 5 | 102.4289399000661 |
| 2007 | 6 | 105.1477448215756 |
| 2007 | 7 | 103.6305945644095 |
...
| 1996 | 1 | 99.33860750862108 |
| 1996 | 2 | 99.54225446396656 |
| 1996 | 3 | 99.41129336113134 |
| 1996 | 4 | 99.5110373340348 |
| 1996 | 5 | 99.22120745027595 |
| 1996 | 6 | 101.1717447111921 |
| 1996 | 7 | 99.95410136133704 |
| 1995 | 1 | 96.93779698300494 |
| 1995 | 2 | 96.93458674589712 |
| 1995 | 3 | 97.00972311337051 |
| 1995 | 4 | 96.90843832024412 |
| 1995 | 5 | 96.78382115425562 |
| 1995 | 6 | 98.70872826057003 |
| 1995 | 7 | 97.85570478374616 |
| 1994 | 1 | NULL |
| 1994 | 2 | NULL |
| 1994 | 3 | NULL |
...
| 1987 | 5 | NULL |
| 1987 | 6 | NULL |
| 1987 | 7 | NULL |
+------+-----------+-------------------+