All the elements in this file with IDs are intended to be conref'ed elsewhere. Practically
all of the conref'ed elements for the Impala docs are in this file, to avoid questions of
when it's safe to remove or move something in any of the 'main' files, and to avoid having to
change any conref references as a result.
This file defines some dummy subheadings as section elements, just for self-documentation.
Using sections instead of nested concepts lets all the conref links point to a very simple
name pattern, '#common/id_within_the_file', rather than a 3-part reference with an
intervening, variable concept ID.
SQL Language Reference Snippets
These reusable chunks were taken from conrefs originally in
ciiu_langref_sql.xml, or are primarily used in new SQL syntax
topics underneath that parent topic.
The TABLESAMPLE clause of the SELECT statement does
not apply to a table reference derived from a view, a subquery, or anything other than a
real base table. This clause only works for tables backed by HDFS or HDFS-like data
files; therefore, it does not apply to Kudu or HBase tables.
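For example, a minimal sketch, assuming a hypothetical HDFS-backed table named sales_parquet:
-- Sample approximately 10 percent of the data from a base table.
-- The same clause on a view or subquery would not be allowed.
select count(*) from sales_parquet tablesample system(10);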
In and higher, you can use the operators IS
[NOT] TRUE and IS [NOT] FALSE as equivalents for the built-in
functions ISTRUE(), ISNOTTRUE(),
ISFALSE(), and ISNOTFALSE().
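For example, the operator form and the function form produce the same results (a brief sketch; output not shown):
select (1 < 2) is true, istrue(1 < 2);
select (1 < 2) is not false, isnotfalse(1 < 2);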
The set of characters that can be generated as output from
BASE64ENCODE(), or specified in the argument string to
BASE64DECODE(), are the ASCII uppercase and lowercase letters (A-Z,
a-z), digits (0-9), and the punctuation characters +,
/, and =.
If the argument string to BASE64DECODE() does not represent a valid
base64-encoded value, subject to the constraints of the Impala implementation such as
the allowed character set, the function returns NULL.
The functions BASE64ENCODE() and BASE64DECODE() are
typically used in combination, to store string data in an Impala table when that data
would be problematic to store or transmit in its original form. For example, you could use these functions to store
string data that uses an encoding other than UTF-8, or to transform the values in
contexts that require ASCII values, such as for partition key columns. Keep in mind that
base64-encoded values produce different results for string functions such as
LENGTH(), MAX(), and MIN() than when
those functions are called with the unencoded string values.
All return values produced by BASE64ENCODE() are a multiple of 4 bytes
in length. All argument values supplied to BASE64DECODE() must also be
a multiple of 4 bytes in length. If a base64-encoded value would otherwise have a
different length, it can be padded with trailing = characters to reach
a length that is a multiple of 4 bytes.
The following examples show how to use BASE64ENCODE() and
BASE64DECODE() together to store and retrieve string values:
-- An arbitrary string can be encoded in base 64.
-- The length of the output is a multiple of 4 bytes,
-- padded with trailing = characters if necessary.
select base64encode('hello world') as encoded,
length(base64encode('hello world')) as length;
+------------------+--------+
| encoded | length |
+------------------+--------+
| aGVsbG8gd29ybGQ= | 16 |
+------------------+--------+
-- Passing an encoded value to base64decode() produces
-- the original value.
select base64decode('aGVsbG8gd29ybGQ=') as decoded;
+-------------+
| decoded |
+-------------+
| hello world |
+-------------+
These examples demonstrate incorrect encoded values that produce NULL
return values when decoded:
-- The input value to base64decode() must be a multiple of 4 bytes.
-- In this case, leaving off the trailing = padding character
-- produces a NULL return value.
select base64decode('aGVsbG8gd29ybGQ') as decoded;
+---------+
| decoded |
+---------+
| NULL |
+---------+
WARNINGS: UDF WARNING: Invalid base64 string; input length is 15,
which is not a multiple of 4.
-- The input to base64decode() can only contain certain characters.
-- The $ character in this case causes a NULL return value.
select base64decode('abc$');
+----------------------+
| base64decode('abc$') |
+----------------------+
| NULL |
+----------------------+
WARNINGS: UDF WARNING: Could not base64 decode input in space 4; actual output length 0
These examples demonstrate round-tripping
of an original string to an encoded
string, and back again. This technique is applicable if the original source is in an
unknown encoding, or if some intermediate processing stage might cause national
characters to be misrepresented:
select 'circumflex accents: â, ê, î, ô, û' as original,
base64encode('circumflex accents: â, ê, î, ô, û') as encoded;
+-----------------------------------+------------------------------------------------------+
| original | encoded |
+-----------------------------------+------------------------------------------------------+
| circumflex accents: â, ê, î, ô, û | Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= |
+-----------------------------------+------------------------------------------------------+
select base64encode('circumflex accents: â, ê, î, ô, û') as encoded,
base64decode(base64encode('circumflex accents: â, ê, î, ô, û')) as decoded;
+------------------------------------------------------+-----------------------------------+
| encoded | decoded |
+------------------------------------------------------+-----------------------------------+
| Y2lyY3VtZmxleCBhY2NlbnRzOiDDoiwgw6osIMOuLCDDtCwgw7s= | circumflex accents: â, ê, î, ô, û |
+------------------------------------------------------+-----------------------------------+
In , only the value 1 enables the option, and the value
true is not recognized. This limitation is tracked by the issue
IMPALA-3334, which shows the releases where the
problem is fixed.
The Avro specification allows string values up to 2**64 bytes in length. Impala queries
for Avro tables use 32-bit integers to hold string lengths. In
and higher, Impala truncates CHAR and
VARCHAR values in Avro tables to (2**31)-1 bytes. If a query encounters
a STRING value longer than (2**31)-1 bytes in an Avro table, the query
fails. In earlier releases, encountering such long values in an Avro table could cause a
crash.
You specify a case-insensitive symbolic name for the kind of statistics:
numDVs, numNulls, avgSize, or
maxSize. The key names and values are both quoted. This operation
applies to an entire table, not a specific partition. For example:
create table t1 (x int, s string);
insert into t1 values (1, 'one'), (2, 'two'), (2, 'deux');
show column stats t1;
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| x | INT | -1 | -1 | 4 | 4 |
| s | STRING | -1 | -1 | -1 | -1 |
+--------+--------+------------------+--------+----------+----------+
alter table t1 set column stats x ('numDVs'='2','numNulls'='0');
alter table t1 set column stats s ('numdvs'='3','maxsize'='4');
show column stats t1;
+--------+--------+------------------+--------+----------+----------+
| Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| x | INT | 2 | 0 | 4 | 4 |
| s | STRING | 3 | -1 | 4 | -1 |
+--------+--------+------------------+--------+----------+----------+
create table analysis_data stored as parquet as select * from raw_data;
Inserted 1000000000 rows in 181.98s
compute stats analysis_data;
insert into analysis_data select * from smaller_table_we_forgot_before;
Inserted 1000000 rows in 15.32s
-- Now there are 1001000000 rows. We can update this single data point in the stats.
alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
-- If the table originally contained 1 million rows, and we add another partition with 30 thousand rows,
-- change the numRows property for the partition and the overall table.
alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
alter table partitioned_data set tblproperties ('numRows'='1030000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
Impala does not return column overflows as NULL, so that users can
distinguish between NULL data and overflow conditions, as they can with
traditional database systems. Impala returns the largest or smallest
value in the range for the type. For example, valid values for a
tinyint range from -128 to 127. In Impala, a tinyint
with a value of -200 returns -128 rather than NULL. A
tinyint with a value of 200 returns 127.
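For example, the overflow behavior described above looks like the following:
select cast(-200 as tinyint) as below_range, cast(200 as tinyint) as above_range;
+-------------+-------------+
| below_range | above_range |
+-------------+-------------+
| -128        | 127         |
+-------------+-------------+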
If you frequently run aggregate functions such as MIN(),
MAX(), and COUNT(DISTINCT) on partition key columns,
consider enabling the OPTIMIZE_PARTITION_KEY_SCANS query option, which
optimizes such queries. This feature is available in
and higher. See for the
kinds of queries that this option applies to, and slight differences in how partitions
are evaluated when this query option is enabled.
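A minimal sketch, reusing the partitioned_data table from the earlier example, with year as a partition key column:
set optimize_partition_key_scans=1;
-- With the option enabled, Impala can optimize these aggregates
-- on the partition key column.
select min(year), max(year), count(distinct year) from partitioned_data;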
The output from this query option is printed to standard
error. The output is displayed only in interactive mode, not when the -q
or -f options are used.
To see how the LIVE_PROGRESS and LIVE_SUMMARY query
options work in real time, see
this
animated demo.
Because the runtime filtering feature is
enabled by default only for local processing, the other filtering-related query options have
the greatest effect when used in combination with the setting
RUNTIME_FILTER_MODE=GLOBAL.
The square bracket style of hint is now deprecated and might be removed in a future
release. For that reason, any newly added hints are not available with the square
bracket syntax.
Because the runtime filtering feature applies mainly to resource-intensive and
long-running queries, only adjust this query option when tuning long-running queries
involving some combination of large partitioned tables and joins involving large tables.
The LIVE_PROGRESS and LIVE_SUMMARY query options
currently do not produce any output during COMPUTE STATS operations.
The LIVE_PROGRESS and LIVE_SUMMARY query options only
apply inside the impala-shell interpreter. You cannot use them with
the SET statement from a JDBC or ODBC application.
Because the LIVE_PROGRESS and LIVE_SUMMARY query
options are available only within the impala-shell interpreter:
-
You cannot change these query options through the SQL SET
statement using the JDBC or ODBC interfaces. The SET command in
impala-shell recognizes these names as shell-only options.
-
Be careful when using impala-shell on a
pre- system to connect to a system running
or higher. The older impala-shell
does not recognize these query option names. Upgrade
impala-shell on the systems where you intend to use these query
options.
-
Likewise, the impala-shell command relies on some information
only available in and higher to prepare live
progress reports and query summaries. The LIVE_PROGRESS and
LIVE_SUMMARY query options have no effect when
impala-shell connects to a cluster running an older version of
Impala.
create database first_db;
use first_db;
create table t1 (x int);
create database second_db;
use second_db;
-- Each database has its own namespace for tables.
-- You can reuse the same table names in each database.
create table t1 (s string);
create database temp;
-- You can either USE a database after creating it,
-- or qualify all references to the table name with the name of the database.
-- Here, tables T2 and T3 are both created in the TEMP database.
create table temp.t2 (x int, y int);
use temp;
create table t3 (s string);
-- You cannot drop a database while it is selected by the USE statement.
drop database temp;
ERROR: AnalysisException: Cannot drop current default database: temp
-- The always-available database 'default' is a convenient one to USE
-- before dropping a database you created.
use default;
-- Before dropping a database, first drop all the tables inside it,
-- or in and higher use the CASCADE clause.
drop database temp;
ERROR: ImpalaRuntimeException: Error making 'dropDatabase' RPC to Hive Metastore:
CAUSED BY: InvalidOperationException: Database temp is not empty
show tables in temp;
+------+
| name |
+------+
| t3 |
+------+
-- and higher:
drop database temp cascade;
-- Earlier releases:
drop table temp.t3;
drop database temp;
This example shows how to use the castto*() functions as an equivalent
to CAST(value AS type)
expressions.
Usage notes: A convenience function to skip the SQL
CAST(value AS type) syntax, for example when
programmatically generating SQL statements where a regular function call might be easier
to construct.
To determine the time zone of the server you are connected to, in
and higher you can call the
timeofday() function, which includes the time zone specifier in its
return value. Remember that with cloud computing, the server you interact with might be
in a different time zone than you are, or different sessions might connect to servers in
different time zones, or a cluster might include servers in more than one time zone.
The way this function deals with time zones when converting to or from
TIMESTAMP values is affected by the
‑‑use_local_tz_for_unix_timestamp_conversions startup flag
for the impalad daemon. See
for details about how
Impala handles time zone considerations for the TIMESTAMP data type.
For best
compatibility with the S3 write support in and higher:
- Use native Hadoop techniques to create data files in S3 for
querying through Impala.
- Use the PURGE clause of DROP
TABLE when dropping internal (managed) tables.
By default, when you drop an internal (managed) table, the data
files are moved to the HDFS trashcan. This operation is expensive for
tables that reside on the Amazon S3 object store. Therefore, for S3
tables, prefer to use DROP TABLE table_name
PURGE rather than the default DROP TABLE
statement. The PURGE clause makes Impala delete the
data files immediately, skipping the HDFS trashcan. For the
PURGE clause to work effectively, you must originally
create the data files on S3 using one of the tools from the Hadoop
ecosystem, such as hadoop fs -cp, or
INSERT in Impala or Hive.
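For example, assuming a hypothetical internal table named s3_sales_staging whose data resides on S3:
-- Remove the table and delete its data files immediately,
-- skipping the HDFS trashcan mechanism.
drop table s3_sales_staging purge;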
This query option affects only Bloom filters, not the min/max filters that are applied
to Kudu tables. Therefore, it does not affect the performance of queries against Kudu
tables.
Because of differences
between S3 and traditional filesystems, DML operations for S3 tables can
take longer than for tables on HDFS. For example, both the LOAD
DATA statement and the final stage of the
INSERT and CREATE TABLE AS SELECT
statements involve moving files from one directory to another. (In the
case of INSERT and CREATE TABLE AS
SELECT, the files are moved from a temporary staging
directory to the final destination directory.) Because S3 does not
support a rename
operation for existing objects, in these cases
Impala actually copies the data files from one location to another and
then removes the original files. In ,
the S3_SKIP_INSERT_STAGING query option provides a way
to speed up INSERT statements for S3 tables and
partitions, with the tradeoff that a problem during statement execution
could leave data in an inconsistent state. It does not apply to
INSERT OVERWRITE or LOAD DATA
statements. See S3_SKIP_INSERT_STAGING Query Option for details.
Because ADLS does not expose the block sizes of data files the way HDFS does, any Impala
INSERT or CREATE TABLE AS SELECT statements use the
PARQUET_FILE_SIZE query option setting to define the size of Parquet
data files. (Using a large block size is more important for Parquet tables than for
tables that use other file formats.)
In and higher, Impala queries are optimized for files
stored in Amazon S3. For Impala tables that use the file formats Parquet, ORC, RCFile,
SequenceFile, Avro, and uncompressed text, the setting
fs.s3a.block.size in the core-site.xml
configuration file determines how Impala divides the I/O work of reading the data files.
This configuration setting is specified in bytes. By default, this value is 33554432 (32
MB), meaning that Impala parallelizes S3 read operations on the files as if they were
made up of 32 MB blocks. For example, if your S3 queries primarily access Parquet files
written by MapReduce or Hive, increase fs.s3a.block.size to 134217728
(128 MB) to match the row group size of those files. If most S3 queries involve Parquet
files written by Impala, increase fs.s3a.block.size to 268435456 (256
MB) to match the row group size produced by Impala.
In and higher, Impala supports both queries
(SELECT) and DML (INSERT, LOAD
DATA, CREATE TABLE AS SELECT) for data residing on Amazon
S3. With the inclusion of write support,
the Impala support for S3 is now considered ready for production use.
Impala query support for Amazon S3 is included in ,
but is not supported or recommended for production use in this version.
In and higher, Impala DDL statements such as
CREATE DATABASE, CREATE TABLE, DROP DATABASE
CASCADE, DROP TABLE, and ALTER TABLE [ADD|DROP]
PARTITION can create or remove folders as needed in the Amazon S3 system. Prior
to , you had to create folders yourself and point
Impala database, tables, or partitions at them, and manually remove folders when no
longer needed. See for details about reading
and writing S3 data with Impala.
In and higher, the Impala DML statements
(INSERT, LOAD DATA, and CREATE TABLE AS
SELECT) can write data into a table or partition that resides in the Azure Data
Lake Store (ADLS). ADLS Gen2 is supported in and higher.
In the CREATE TABLE or ALTER TABLE statements, specify
the ADLS location for tables and partitions with the adl:// prefix for
ADLS Gen1 and abfs:// or abfss:// for ADLS Gen2 in the
LOCATION attribute.
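For example, a minimal sketch with hypothetical account and container names, creating a table whose data resides on ADLS Gen2:
create table adls_sales (id bigint, amount decimal(10,2))
  location 'abfss://impala-demo@exampleaccount.dfs.core.windows.net/adls_sales/';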
If you bring data into ADLS using the normal ADLS transfer mechanisms instead of Impala
DML statements, issue a REFRESH statement for the table before using
Impala to query the ADLS data.
In and higher, the Impala DML statements (INSERT,
LOAD DATA, and CREATE TABLE AS
SELECT) can write data into a table or partition that resides
in S3. The syntax of the DML statements is the same as for any other
tables, because the S3 location for tables and partitions is specified
by an s3a:// prefix in the LOCATION
attribute of CREATE TABLE or ALTER
TABLE statements. If you bring data into S3 using the normal
S3 transfer mechanisms instead of Impala DML statements, issue a
REFRESH statement for the table before using Impala
to query the S3 data.
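For example, the following sketch (with a hypothetical bucket name) creates an S3-backed partitioned table, points a new partition at an S3 path, and refreshes the table after files are loaded through S3 tools rather than Impala DML:
create table sales_s3 (id bigint, amount decimal(10,2))
  partitioned by (year int)
  location 's3a://impala-demo-bucket/sales_s3/';
alter table sales_s3 add partition (year=2024)
  location 's3a://impala-demo-bucket/sales_s3/year=2024/';
-- After copying data files into the partition directory with S3 tools:
refresh sales_s3;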
Impala caches metadata for tables where the data resides in the Amazon Simple Storage
Service (S3), and the REFRESH and INVALIDATE METADATA
statements are supported for the S3 tables. In particular, issue a
REFRESH for a table after adding or removing files in the associated S3
data directory. See for details
about working with S3 tables.
In Impala 2.2.0 and higher, built-in functions that accept or return integers
representing TIMESTAMP values use the BIGINT type for
parameters and return values, rather than INT. This change lets the
date and time functions avoid an overflow error that would otherwise occur on January
19th, 2038 (known as the
Year
2038 problem
or Y2K38 problem
). This change affects the
FROM_UNIXTIME() and UNIX_TIMESTAMP() functions. You
might need to change application code that interacts with these functions, change the
types of columns that store the return values, or add CAST() calls to
SQL statements that call these functions.
Impala automatically converts STRING literals of the correct format
into TIMESTAMP values. Timestamp values are accepted in the format
'yyyy‑MM‑dd HH:mm:ss.SSSSSS', and can consist of just the date, or
just the time, with or without the fractional second portion. For example, you can
specify TIMESTAMP values such as '1966‑07‑30',
'08:30:00', or '1985‑09‑25 17:45:30.005'.
Leading zeroes are not required in the numbers representing the date component, such as
month and date, or the time component, such as hour, minute, and second. For example,
Impala accepts both '2018‑1‑1 01:02:03' and
'2018‑01‑01 1:2:3' as valid.
In STRING to TIMESTAMP conversions, leading and
trailing white spaces, such as a space, a tab, a newline, or a carriage return, are
ignored. For example, Impala treats the following as equivalent:
'1999‑12‑01 01:02:03 ', ' 1999‑12‑01 01:02:03',
'1999‑12‑01 01:02:03\r\n\t'.
When you convert or cast a STRING literal to
TIMESTAMP, you can use the following separators between the date part
and the time part:
-
One or more space characters
Example: CAST('2001-01-09 01:05:01' AS TIMESTAMP)
-
The character “T”
Example: CAST('2001-01-09T01:05:01' AS TIMESTAMP)
Casting an integer or floating-point value
N to TIMESTAMP produces a value that is
N seconds past the start of the epoch date (January 1, 1970). By
default, the result value represents a date and time in the UTC time zone. If the
setting ‑‑use_local_tz_for_unix_timestamp_conversions=true
is in effect, the resulting TIMESTAMP represents a date and time in the
local time zone.
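For example, casting 0 produces the start of the epoch, interpreted as UTC by default:
select cast(0 as timestamp);
+----------------------+
| cast(0 as timestamp) |
+----------------------+
| 1970-01-01 00:00:00  |
+----------------------+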
If these statements in your environment contain sensitive literal values such as credit
card numbers or tax identifiers, Impala can redact this sensitive information when
displaying the statements in log files and other administrative contexts. See
for details.
For a particular table, use either COMPUTE STATS or COMPUTE
INCREMENTAL STATS, but never combine the two or alternate between them. If you
switch from COMPUTE STATS to COMPUTE INCREMENTAL STATS
during the lifetime of a table, or vice versa, drop all statistics by running
DROP STATS before making the switch.
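For example, to switch a table named t1 from COMPUTE STATS to COMPUTE INCREMENTAL STATS:
drop stats t1;
compute incremental stats t1;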
When you run COMPUTE INCREMENTAL STATS on a table for the first time,
the statistics are computed again from scratch regardless of whether the table already
has statistics. Therefore, expect a one-time resource-intensive operation for scanning
the entire table when running COMPUTE INCREMENTAL STATS for the first
time on a given table.
In Impala 3.0 and lower, approximately 400 bytes of metadata per column per partition
are needed for caching. Tables with a large number of partitions and many columns can add
up to significant memory overhead, as the metadata must be cached on the
catalogd host and on every impalad host that is
eligible to be a coordinator. If this metadata for all tables exceeds 2 GB, you might
experience service downtime. In Impala 3.1 and higher, the issue was alleviated with an
improved handling of incremental stats.
The PARTITION clause is only allowed in combination with the
INCREMENTAL clause. It is optional for COMPUTE INCREMENTAL
STATS, and required for DROP INCREMENTAL STATS. Whenever you
specify partitions through the PARTITION
(partition_spec) clause in a COMPUTE INCREMENTAL
STATS or DROP INCREMENTAL STATS statement, you must include
all the partitioning columns in the specification, and specify constant values for all
the partition key columns.
-- Initially the table has no incremental stats, as indicated
-- 'false' under Incremental stats.
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | -1 | 1 | 223.74KB | NOT CACHED | PARQUET | false
| Children | -1 | 1 | 230.05KB | NOT CACHED | PARQUET | false
| Electronics | -1 | 1 | 232.67KB | NOT CACHED | PARQUET | false
| Home | -1 | 1 | 232.56KB | NOT CACHED | PARQUET | false
| Jewelry | -1 | 1 | 223.72KB | NOT CACHED | PARQUET | false
| Men | -1 | 1 | 231.25KB | NOT CACHED | PARQUET | false
| Music | -1 | 1 | 237.90KB | NOT CACHED | PARQUET | false
| Shoes | -1 | 1 | 234.90KB | NOT CACHED | PARQUET | false
| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false
| Women | -1 | 1 | 226.27KB | NOT CACHED | PARQUET | false
| Total | -1 | 10 | 2.25MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
-- After the first COMPUTE INCREMENTAL STATS,
-- all partitions have stats. The first
-- COMPUTE INCREMENTAL STATS scans the whole
-- table, discarding any previous stats from
-- a traditional COMPUTE STATS statement.
compute incremental stats item_partitioned;
+-------------------------------------------+
| summary |
+-------------------------------------------+
| Updated 10 partition(s) and 21 column(s). |
+-------------------------------------------+
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | 1783 | 1 | 227.97KB | NOT CACHED | PARQUET | true
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 10 | 2.25MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
-- Add a new partition...
alter table item_partitioned add partition (i_category='Camping');
-- Add or replace files in HDFS outside of Impala,
-- rendering the stats for a partition obsolete.
!import_data_into_sports_partition.sh
refresh item_partitioned;
drop incremental stats item_partitioned partition (i_category='Sports');
-- Now some partitions have incremental stats
-- and some do not.
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Camping | -1 | 1 | 408.02KB | NOT CACHED | PARQUET | false
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | -1 | 1 | 227.97KB | NOT CACHED | PARQUET | false
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 11 | 2.65MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
-- After another COMPUTE INCREMENTAL STATS,
-- all partitions have incremental stats, and only the 2
-- partitions without incremental stats were scanned.
compute incremental stats item_partitioned;
+------------------------------------------+
| summary |
+------------------------------------------+
| Updated 2 partition(s) and 21 column(s). |
+------------------------------------------+
show table stats item_partitioned;
+-------------+-------+--------+----------+--------------+---------+------------------
| i_category | #Rows | #Files | Size | Bytes Cached | Format | Incremental stats
+-------------+-------+--------+----------+--------------+---------+------------------
| Books | 1733 | 1 | 223.74KB | NOT CACHED | PARQUET | true
| Camping | 5328 | 1 | 408.02KB | NOT CACHED | PARQUET | true
| Children | 1786 | 1 | 230.05KB | NOT CACHED | PARQUET | true
| Electronics | 1812 | 1 | 232.67KB | NOT CACHED | PARQUET | true
| Home | 1807 | 1 | 232.56KB | NOT CACHED | PARQUET | true
| Jewelry | 1740 | 1 | 223.72KB | NOT CACHED | PARQUET | true
| Men | 1811 | 1 | 231.25KB | NOT CACHED | PARQUET | true
| Music | 1860 | 1 | 237.90KB | NOT CACHED | PARQUET | true
| Shoes | 1835 | 1 | 234.90KB | NOT CACHED | PARQUET | true
| Sports | 1783 | 1 | 227.97KB | NOT CACHED | PARQUET | true
| Women | 1790 | 1 | 226.27KB | NOT CACHED | PARQUET | true
| Total | 17957 | 11 | 2.65MB | 0B | |
+-------------+-------+--------+----------+--------------+---------+------------------
In and higher, Impala UDFs and UDAs written in C++ are
persisted in the metastore database. Java UDFs are also persisted, if they were created
with the new CREATE FUNCTION syntax for Java UDFs, where the Java
function argument and return types are omitted. Java-based UDFs created with the old
CREATE FUNCTION syntax do not persist across restarts because they are
held in the memory of the catalogd daemon. Until you re-create such
Java UDFs using the new CREATE FUNCTION syntax, you must reload those
Java-based UDFs by running the original CREATE FUNCTION statements
again each time you restart the catalogd daemon. Prior to
, the requirement to reload functions after a restart
applied to both C++ and Java functions.
In and higher, you can refresh the user-defined
functions (UDFs) that Impala recognizes, at the database level, by running the
REFRESH FUNCTIONS statement with the database name as an argument.
Java-based UDFs can be added to the metastore database through Hive CREATE
FUNCTION statements, and made visible to Impala by subsequently running
REFRESH FUNCTIONS. For example:
CREATE DATABASE shared_udfs;
USE shared_udfs;
...use CREATE FUNCTION statements in Hive to create some Java-based UDFs
that Impala is not initially aware of...
REFRESH FUNCTIONS shared_udfs;
SELECT udf_created_by_hive(c1) FROM ...
The Hive current_user() function cannot be called from a Java UDF
through Impala.
If you are creating a partition for the first time and specifying its location, for
maximum efficiency, use a single ALTER TABLE statement including both
the ADD PARTITION and LOCATION clauses, rather than
separate statements with ADD PARTITION and SET
LOCATION clauses.
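For example, a minimal sketch with hypothetical table and path names:
-- One statement, rather than ADD PARTITION followed by a separate SET LOCATION.
alter table sales_by_year add partition (year=2024)
  location '/user/impala/warehouse/sales_by_year/year=2024';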
The INSERT statement has always left behind a hidden work directory
inside the data directory of the table. Formerly, this hidden work directory was named
.impala_insert_staging. In Impala 2.0.1 and later, this directory
name is changed to _impala_insert_staging. (While HDFS tools are
expected to treat names beginning with either an underscore or a dot as hidden, in practice
names beginning with an underscore are more widely supported.) If you have any scripts,
cleanup jobs, and so on that rely on the name of this work directory, adjust them to use
the new name.
To see whether a table is internal or external, and its associated HDFS location, issue
the statement DESCRIBE FORMATTED table_name. The
Table Type field displays MANAGED_TABLE for internal
tables and EXTERNAL_TABLE for external tables. The
Location field displays the path of the table directory as an HDFS URI.
You can switch a table from
internal to external, or from external to internal, by using the
ALTER TABLE statement:
-- Switch a table from internal to external.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
-- Switch a table from external to internal.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');
If
the Kudu service is integrated with the Hive Metastore, the above
operations are not supported.
-- Find all customers whose first name starts with 'J', followed by 0 or more of any character.
select c_first_name, c_last_name from customer where c_first_name regexp '^J.*';
select c_first_name, c_last_name from customer where c_first_name rlike '^J.*';
-- Find 'Macdonald', where the first 'a' is optional and the 'D' can be upper- or lowercase.
-- The ^...$ are required, to match the start and end of the value.
select c_first_name, c_last_name from customer where c_last_name regexp '^Ma?c[Dd]onald$';
select c_first_name, c_last_name from customer where c_last_name rlike '^Ma?c[Dd]onald$';
-- Match multiple character sequences, either 'Mac' or 'Mc'.
select c_first_name, c_last_name from customer where c_last_name regexp '^(Mac|Mc)donald$';
select c_first_name, c_last_name from customer where c_last_name rlike '^(Mac|Mc)donald$';
-- Find names starting with 'S', then one or more vowels, then 'r', then any other characters.
-- Matches 'Searcy', 'Sorenson', 'Sauer'.
select c_first_name, c_last_name from customer where c_last_name regexp '^S[aeiou]+r.*$';
select c_first_name, c_last_name from customer where c_last_name rlike '^S[aeiou]+r.*$';
-- Find names that end with 2 or more vowels: letters from the set a,e,i,o,u.
select c_first_name, c_last_name from customer where c_last_name regexp '.*[aeiou]{2,}$';
select c_first_name, c_last_name from customer where c_last_name rlike '.*[aeiou]{2,}$';
-- You can use letter ranges in the [] blocks, for example to find names starting with A, B, or C.
select c_first_name, c_last_name from customer where c_last_name regexp '^[A-C].*';
select c_first_name, c_last_name from customer where c_last_name rlike '^[A-C].*';
-- If you are not sure about case, leading/trailing spaces, and so on, you can process the
-- column using string functions first.
select c_first_name, c_last_name from customer where lower(trim(c_last_name)) regexp '^de.*';
select c_first_name, c_last_name from customer where lower(trim(c_last_name)) rlike '^de.*';
In and higher, you can simplify queries that use many
UPPER() and LOWER() calls to do case-insensitive
comparisons, by using the ILIKE or IREGEXP operators
instead. See and
for details.
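For example, a sketch reusing the CUSTOMER table from the examples above:
-- Case-insensitive matching without wrapping the column in LOWER() or UPPER().
select c_first_name, c_last_name from customer where c_last_name ilike 'mac%';
select c_first_name, c_last_name from customer where c_last_name iregexp '^mac';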
When authorization is enabled, the output of the SHOW statement only
shows those objects for which you have the privilege to view. If you believe an object
exists but you cannot see it in the SHOW output, check with the system
administrator about whether you need to be granted a new privilege for that object. See
for how to set up
authorization and add privileges for specific objects.
Infinity and NaN can be specified in text data files as inf and
nan respectively, and Impala interprets them as these special values.
They can also be produced by certain arithmetic expressions; for example,
1/0 returns Infinity and pow(-1, 0.5)
returns NaN. Or you can cast the literal values, such as
CAST('nan' AS DOUBLE) or CAST('inf' AS DOUBLE).
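For example:
-- Arithmetic expressions and casts that produce the special values.
select 1/0 as inf_value, pow(-1, 0.5) as nan_value,
  cast('inf' as double) as cast_inf, cast('nan' as double) as cast_nan;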
In Impala 2.0 and later, user() returns the full Kerberos principal
string, such as user@example.com, in a Kerberized environment.
On a kerberized cluster with high memory utilization, kinit commands
executed after every 'kerberos_reinit_interval' may cause out-of-memory
errors, because executing the command involves a fork of the Impala process. The error
looks similar to the following:
principal_details
Failed to execute shell cmd: 'kinit -k -t keytab_details',
error was: Error(12): Cannot allocate memory
The following command changes the vm.overcommit_memory setting
immediately on a running host. However, this setting is reset when the host is
restarted.
/proc/sys/vm/overcommit_memory
To change the setting in a persistent way, add the following line to the
/etc/sysctl.conf file:
Then run sysctl -p. No reboot is needed.
-
Currently, each Impala GRANT or REVOKE statement can
only grant or revoke a single privilege to or from a single role.
All data in CHAR and VARCHAR columns must be in a
character encoding that is compatible with UTF-8. If you have binary data from another
database system (that is, a BLOB type), use a STRING column to hold it.
The following example creates a series of views and then drops them. These examples
illustrate how views are associated with a particular database, and both the view
definitions and the view names for CREATE VIEW and DROP
VIEW can refer to a view in the current database or a fully qualified view
name.
-- Create and drop a view in the current database.
CREATE VIEW few_rows_from_t1 AS SELECT * FROM t1 LIMIT 10;
DROP VIEW few_rows_from_t1;
-- Create and drop a view referencing a table in a different database.
CREATE VIEW table_from_other_db AS SELECT x FROM db1.foo WHERE x IS NOT NULL;
DROP VIEW table_from_other_db;
USE db1;
-- Create a view in a different database.
CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
-- Switch into the other database and drop the view.
USE db2;
DROP VIEW v1;
USE db1;
-- Create a view in a different database.
CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
-- Drop a view in the other database.
DROP VIEW db2.v1;
For INSERT operations into CHAR or
VARCHAR columns, you must cast all STRING literals or
expressions returning STRING to a CHAR or
VARCHAR type with the appropriate length.
The following example demonstrates how length() and
char_length() sometimes produce the same result, and sometimes produce
different results depending on the type of the argument and the presence of trailing
spaces for CHAR values. The S and C
values are displayed with enclosing quotation marks to show any trailing spaces.
create table length_demo (s string, c char(5));
insert into length_demo values
('a',cast('a' as char(5))),
('abc',cast('abc' as char(5))),
('hello',cast('hello' as char(5)));
select concat('"',s,'"') as s, concat('"',c,'"') as c,
length(s), length(c),
char_length(s), char_length(c)
from length_demo;
+---------+---------+-----------+-----------+----------------+----------------+
| s | c | length(s) | length(c) | char_length(s) | char_length(c) |
+---------+---------+-----------+-----------+----------------+----------------+
| "a" | "a " | 1 | 1 | 1 | 5 |
| "abc" | "abc " | 3 | 3 | 3 | 5 |
| "hello" | "hello" | 5 | 5 | 5 | 5 |
+---------+---------+-----------+-----------+----------------+----------------+
Correlated subqueries used in EXISTS and IN operators
cannot include a LIMIT clause.
Currently, Avro tables cannot contain TIMESTAMP columns. If you need to
store date and time values in Avro tables, as a workaround you can use a
STRING representation of the values, convert the values to
BIGINT with the UNIX_TIMESTAMP() function, or create
separate numeric columns for individual date and time fields using the
EXTRACT() function.
Zero-length strings: For purposes of clauses such as DISTINCT
and GROUP BY, Impala considers zero-length strings
(""), NULL, and space to all be different values.
When the spill-to-disk feature is activated for a join node within a query, Impala does
not produce any runtime filters for that join operation on that host. Other join nodes
within the query are not affected.
CREATE TABLE yy (s STRING) PARTITIONED BY (year INT);
INSERT INTO yy PARTITION (year) VALUES ('1999', 1999), ('2000', 2000),
('2001', 2001), ('2010', 2010), ('2018', 2018);
COMPUTE STATS yy;
CREATE TABLE yy2 (s STRING, year INT);
INSERT INTO yy2 VALUES ('1999', 1999), ('2000', 2000), ('2001', 2001);
COMPUTE STATS yy2;
-- The following query reads an unknown number of partitions, whose key values
-- are only known at run time. The runtime filters line shows the
-- information used in query fragment 02 to decide which partitions to skip.
EXPLAIN SELECT s FROM yy WHERE year IN (SELECT year FROM yy2);
+--------------------------------------------------------------------------+
| PLAN-ROOT SINK |
| | |
| 04:EXCHANGE [UNPARTITIONED] |
| | |
| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST] |
| | hash predicates: year = year |
| | runtime filters: RF000 <- year |
| | |
| |--03:EXCHANGE [BROADCAST] |
| | | |
| | 01:SCAN HDFS [default.yy2] |
| | partitions=1/1 files=1 size=620B |
| | |
| 00:SCAN HDFS [default.yy] |
| partitions=5/5 files=5 size=1.71KB |
| runtime filters: RF000 -> year |
+--------------------------------------------------------------------------+
SELECT s FROM yy WHERE year IN (SELECT year FROM yy2); -- Returns 3 rows from yy
PROFILE;
By default, intermediate files used during
large sort, join, aggregation, or analytic function operations are
stored in the directory /tmp/impala-scratch, and
these intermediate files are removed when the operation finishes. You
can specify a different location by starting the
impalad daemon with the
‑‑scratch_dirs="path_to_directory"
configuration option.
An ORDER BY clause without an additional LIMIT clause
is ignored in any view definition. If you need to sort the entire result set from a
view, use an ORDER BY clause in the SELECT statement
that queries the view. You can still make a simple top 10
report by combining the
ORDER BY and LIMIT clauses in the same view
definition:
[localhost:21000] > create table unsorted (x bigint);
[localhost:21000] > insert into unsorted values (1), (9), (3), (7), (5), (8), (4), (6), (2);
[localhost:21000] > create view sorted_view as select x from unsorted order by x;
[localhost:21000] > select x from sorted_view; -- ORDER BY clause in view has no effect.
+---+
| x |
+---+
| 1 |
| 9 |
| 3 |
| 7 |
| 5 |
| 8 |
| 4 |
| 6 |
| 2 |
+---+
[localhost:21000] > select x from sorted_view order by x; -- View query requires ORDER BY at outermost level.
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+---+
[localhost:21000] > create view top_3_view as select x from unsorted order by x limit 3;
[localhost:21000] > select x from top_3_view; -- ORDER BY and LIMIT together in view definition are preserved.
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+
The following examples demonstrate how to check the precision and scale of numeric
literals or other numeric expressions. Impala represents numeric literals in the
smallest appropriate type. 5 is a TINYINT value, which ranges from -128
to 127; therefore, 3 decimal digits are needed to represent the entire range, and because
it is an integer value there are no fractional digits.
DECIMAL value, with 4 digits total and 3 digits after the decimal
point.
[localhost:21000] > select precision(5), scale(5);
+--------------+----------+
| precision(5) | scale(5) |
+--------------+----------+
| 3 | 0 |
+--------------+----------+
[localhost:21000] > select precision(1.333), scale(1.333);
+------------------+--------------+
| precision(1.333) | scale(1.333) |
+------------------+--------------+
| 4 | 3 |
+------------------+--------------+
[localhost:21000] > with t1 as
( select cast(12.34 as decimal(20,2)) x union select cast(1 as decimal(8,6)) x )
select precision(x), scale(x) from t1 limit 1;
+--------------+----------+
| precision(x) | scale(x) |
+--------------+----------+
| 24 | 6 |
+--------------+----------+
Type: Boolean; recognized values are 1 and 0, or true and
false; any other value is interpreted as false
Type: string
Type: integer
Default:
Default: false
Default: 0
Default: false (shown as 0 in output of SET
statement)
Default: true (shown as 1 in output of SET
statement)
Units: A numeric argument represents a size in bytes; you can also use a suffix
of m or mb for megabytes, or g or
gb for gigabytes. If you specify a value in an unrecognized format,
subsequent queries fail with an error.
Currently, the return value is always a STRING. The return type is
subject to change in future releases. Always use CAST() to convert the
result to whichever data type is appropriate for your computations.
Return type: DOUBLE in Impala 2.0 and higher;
STRING in earlier releases
Usage notes: Primarily for compatibility with code containing industry extensions
to SQL.
Return type: BOOLEAN
Return type: DOUBLE
Return type: Same as the input value
Return type: Same as the input value, except for CHAR and
VARCHAR arguments which produce a STRING result
Impala includes another predefined database, _impala_builtins, that
serves as the location for the
built-in functions. To see
the built-in functions, use a statement like the following:
show functions in _impala_builtins;
show functions in _impala_builtins like '*substring*';
Because arithmetic on FLOAT and DOUBLE columns
uses high-performance hardware instructions, and because distributed queries can perform these
operations in a different order for each query, results can vary slightly for aggregate
function calls such as SUM() and AVG() for
FLOAT and DOUBLE columns, particularly on large data
sets where millions or billions of values are summed or averaged. For perfect
consistency and repeatability, use the DECIMAL data type for such
operations instead of FLOAT or DOUBLE.
The inability to exactly represent certain floating-point values means that
DECIMAL is sometimes a better choice than DOUBLE or
FLOAT when precision is critical, particularly when transferring data
from other database systems that use different representations or file formats.
If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR
COLUMNS, Impala can only use the resulting column statistics if the table is
unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned
table.
UNIX_TIMESTAMP() and FROM_UNIXTIME() are often used in
combination to convert a TIMESTAMP value into a particular string
format. For example:
SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(NOW() + interval 3 days),
'yyyy/MM/dd HH:mm') AS yyyy_mm_dd_hh_mm;
+------------------+
| yyyy_mm_dd_hh_mm |
+------------------+
| 2016/06/03 11:38 |
+------------------+
Sorting considerations: Although you can specify an ORDER BY
clause in an INSERT ... SELECT statement, any ORDER BY
clause is ignored and the results are not necessarily sorted. An INSERT ...
SELECT operation potentially creates many different data files, prepared by
different executor Impala daemons, and therefore the notion of the data being stored in
sorted order is impractical.
Prior to Impala 1.4.0, it was not possible to use the CREATE TABLE LIKE
view_name syntax. In Impala 1.4.0 and higher, you can create
a table with the same column definitions as a view using the CREATE TABLE
LIKE technique. Although CREATE TABLE LIKE normally inherits
the file format of the original table, a view has no underlying file format, so
CREATE TABLE LIKE view_name produces a text table by
default. To specify a different file format, include a STORED AS
file_format clause at the end of the CREATE TABLE
LIKE statement.
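For example, assuming a hypothetical view named v1:
-- Without the STORED AS clause, the new table would use the default text format.
create table t2_parquet like v1 stored as parquet;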
Prior to Impala 1.4.0, COMPUTE STATS counted the number of
NULL values in each column and recorded that figure in the metastore
database. Because Impala does not currently use the NULL count during
query planning, Impala 1.4.0 and higher speeds up the COMPUTE STATS
statement by skipping this NULL counting.
The regular expression must match the entire value, not just occur somewhere inside it.
Use .* at the beginning, the end, or both if you only need to match
characters anywhere in the middle. Thus, the ^ and $
anchors are often redundant, although you might already have them in your expression
strings that you reuse from elsewhere.
In Impala 1.3.1 and higher, the REGEXP and RLIKE
operators now match a regular expression string that occurs anywhere inside the target
string, the same as if the regular expression was enclosed on each side by
.*. See for
examples. Previously, these operators only succeeded when the regular expression matched
the entire target string. This change improves compatibility with the regular expression
support for popular database systems. There is no change to the behavior of the
regexp_extract() and regexp_replace() built-in
functions.
By default, if an INSERT statement creates any new subdirectories
underneath a partitioned table, those subdirectories are assigned default HDFS
permissions for the impala user. To make each subdirectory have the
same permissions as its parent directory in HDFS, specify the
‑‑insert_inherit_permissions startup option for the
impalad daemon.
Prefer UNION ALL over
UNION when you know the data sets are disjoint or duplicate values are
not a problem; UNION ALL is more efficient because it avoids
materializing and sorting the entire result set to eliminate duplicate values.
The CREATE TABLE clauses FIELDS TERMINATED BY,
ESCAPED BY, and LINES TERMINATED BY have special rules
for the string literal used for their argument, because they all require a single
character. You can use a regular character surrounded by single or double quotation
marks, an octal sequence such as '\054' (representing a comma), or an
integer in the range '-127'..'128' (with quotation marks but no backslash), which is
interpreted as a single-byte ASCII character. Negative values are subtracted from 256;
for example, FIELDS TERMINATED BY '-2' sets the field delimiter to
ASCII code 254, the Icelandic Thorn
character used as a delimiter by some data
formats.
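For example, the following sketches (with hypothetical table names) show a comma delimiter specified as an octal sequence, and ASCII code 254 specified as a negative integer:
-- '\054' is the octal sequence for a comma.
create table csv_data (c1 string, c2 string)
  row format delimited fields terminated by '\054';
-- '-2' is interpreted as ASCII code 254 (256 - 2).
create table thorn_data (c1 string, c2 string)
  row format delimited fields terminated by '-2';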
Sqoop considerations:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any
resulting values from DATE, DATETIME, or
TIMESTAMP columns. The underlying values are represented as the Parquet
INT64 type, which is represented as BIGINT in the
Impala table. The Parquet values represent the time in milliseconds, while Impala
interprets BIGINT as the time in seconds. Therefore, if you have a
BIGINT column in a Parquet table that was imported this way from Sqoop,
divide the values by 1000 when interpreting as the TIMESTAMP type.
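For example, a minimal sketch with hypothetical table and column names:
-- The stored values are milliseconds; dividing by 1000 yields seconds,
-- which is how Impala interprets a BIGINT cast to TIMESTAMP.
select cast(event_time_ms div 1000 as timestamp) as event_ts
  from sqoop_imported_parquet;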
Command-line equivalent:
Complex type considerations:
Because complex types are often used in combination, for example an
ARRAY of STRUCT elements, if you are unfamiliar with
the Impala complex types, start with
for background
information and usage examples.
In and higher, Impala supports the complex types
ARRAY, STRUCT, and MAP. In
and higher, Impala also supports these
complex types in ORC. See
for details.
These complex types are currently supported only for the Parquet or ORC file formats.
Because Impala has better performance on Parquet than ORC, if you plan to use complex
types, become familiar with the performance and storage aspects of Parquet first.
-
Columns with this data type can only be used in tables or partitions with the
Parquet or ORC file format.
-
Columns with this data type cannot be used as partition key columns in a partitioned
table.
-
The COMPUTE STATS statement does not produce any statistics for
columns of this data type.
-
The maximum length of the column definition for any complex type, including
declarations for any nested types, is 4000 characters.
-
See for a
full list of limitations and associated guidelines about complex type columns.
Partitioned tables can contain complex type columns. All the partition key columns must
be scalar types.
You can pass a multi-part qualified name to DESCRIBE to specify an
ARRAY, STRUCT, or MAP column and
visualize its structure as if it were a table. For example, if table T1
contains an ARRAY column A1, you could issue the
statement DESCRIBE t1.a1. If table T1 contained a
STRUCT column S1, and a field F1
within the STRUCT was a MAP, you could issue the
statement DESCRIBE t1.s1.f1. An ARRAY is shown as a
two-column table, with ITEM and POS columns. A
STRUCT is shown as a table with each field representing a column in the
table. A MAP is shown as a two-column table, with KEY
and VALUE columns.
Many of the complex type examples refer to tables such as CUSTOMER and
REGION adapted from the tables used in the TPC-H benchmark. See
for the table
definitions.
Complex type considerations: Although you can create tables in this file format
using the complex types (ARRAY, STRUCT, and
MAP) available in and higher,
currently, Impala can query these types only in Parquet tables.
The one exception to the preceding rule is COUNT(*) queries on RCFile
tables that include complex types. Such queries are allowed in
and higher.
You cannot refer to a column with a complex data type (ARRAY,
STRUCT, or MAP) directly in an operator. You can apply
operators only to scalar values that make up a complex type (the fields of a
STRUCT, the items of an ARRAY, or the key or value
portion of a MAP) as part of a join query that refers to the scalar
value using the appropriate dot notation or ITEM, KEY,
or VALUE pseudocolumn names.
Currently, Impala UDFs cannot accept arguments or return values of the Impala complex
types (STRUCT, ARRAY, or MAP).
Impala currently cannot write new data files containing complex type columns. Therefore,
although the SELECT statement works for queries involving complex type
columns, you cannot use a statement form that writes data to complex type columns, such
as CREATE TABLE AS SELECT or INSERT ... SELECT. To
create data files containing complex type data, use the Hive INSERT
statement, or another ETL mechanism such as MapReduce jobs, Spark jobs, Pig, and so on.
For tables containing complex type columns (ARRAY,
STRUCT, or MAP), you typically use join queries to
refer to the complex values. You can use views to hide the join notation, making such
tables seem like traditional denormalized tables, and making those tables queryable by
business intelligence tools that do not have built-in support for those complex types.
See for details.
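For example, the following sketch defines a view over the REGION table used elsewhere in this file, so that the nested ARRAY of STRUCT values can be queried like a flat table:
create view region_nations as
  select r_name, r_nations.item.n_nationkey as n_nationkey,
    r_nations.item.n_name as n_name
  from region, region.r_nations as r_nations;
-- The join notation is hidden inside the view definition.
select r_name, n_name from region_nations where n_nationkey < 5;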
Because you cannot directly issue SELECT col_name
against a column of complex type, you cannot use a view or a WITH
clause to rename
a column by selecting it with a column alias.
To access a column with a complex type (ARRAY, STRUCT,
or MAP) in an aggregation function, you unpack the individual elements
using join notation in the query, and then apply the function to the final scalar item,
field, key, or value at the bottom of any nested type hierarchy in the column. See
for details about using
complex types in Impala.
The following example demonstrates calls to several aggregation functions using values
from a column containing nested complex types (an ARRAY of
STRUCT items). The array is unpacked inside the query using join
notation. The array elements are referenced using the ITEM
pseudocolumn, and the structure fields inside the array elements are referenced using
dot notation. Numeric aggregates such as SUM() and AVG()
are computed using the numeric N_NATIONKEY field, and the
general-purpose MAX() and MIN() values are computed
from the string N_NAME field.
describe region;
+-------------+-------------------------+---------+
| name | type | comment |
+-------------+-------------------------+---------+
| r_regionkey | smallint | |
| r_name | string | |
| r_comment | string | |
| r_nations | array<struct< | |
| | n_nationkey:smallint, | |
| | n_name:string, | |
| | n_comment:string | |
| | >> | |
+-------------+-------------------------+---------+
select r_name, r_nations.item.n_nationkey
from region, region.r_nations as r_nations
order by r_name, r_nations.item.n_nationkey;
+-------------+------------------+
| r_name | item.n_nationkey |
+-------------+------------------+
| AFRICA | 0 |
| AFRICA | 5 |
| AFRICA | 14 |
| AFRICA | 15 |
| AFRICA | 16 |
| AMERICA | 1 |
| AMERICA | 2 |
| AMERICA | 3 |
| AMERICA | 17 |
| AMERICA | 24 |
| ASIA | 8 |
| ASIA | 9 |
| ASIA | 12 |
| ASIA | 18 |
| ASIA | 21 |
| EUROPE | 6 |
| EUROPE | 7 |
| EUROPE | 19 |
| EUROPE | 22 |
| EUROPE | 23 |
| MIDDLE EAST | 4 |
| MIDDLE EAST | 10 |
| MIDDLE EAST | 11 |
| MIDDLE EAST | 13 |
| MIDDLE EAST | 20 |
+-------------+------------------+
select
r_name,
count(r_nations.item.n_nationkey) as count,
sum(r_nations.item.n_nationkey) as sum,
avg(r_nations.item.n_nationkey) as avg,
min(r_nations.item.n_name) as minimum,
max(r_nations.item.n_name) as maximum,
ndv(r_nations.item.n_nationkey) as distinct_vals
from
region, region.r_nations as r_nations
group by r_name
order by r_name;
+-------------+-------+-----+------+-----------+----------------+---------------+
| r_name | count | sum | avg | minimum | maximum | distinct_vals |
+-------------+-------+-----+------+-----------+----------------+---------------+
| AFRICA | 5 | 50 | 10 | ALGERIA | MOZAMBIQUE | 5 |
| AMERICA | 5 | 47 | 9.4 | ARGENTINA | UNITED STATES | 5 |
| ASIA | 5 | 68 | 13.6 | CHINA | VIETNAM | 5 |
| EUROPE | 5 | 77 | 15.4 | FRANCE | UNITED KINGDOM | 5 |
| MIDDLE EAST | 5 | 58 | 11.6 | EGYPT | SAUDI ARABIA | 5 |
+-------------+-------+-----+------+-----------+----------------+---------------+
Hive considerations:
HDFS permissions:
HDFS permissions: This statement does not touch any HDFS files or directories;
therefore, no HDFS permissions are required.
Security considerations:
Performance considerations:
Casting and conversions:
Related information:
Related tasks:
Related startup options:
Restrictions:
Restrictions: In Impala 2.0 and higher, this function can be used as an analytic
function, but with restrictions on any window clause. For MAX() and
MIN(), the window clause is only allowed if the start bound is
UNBOUNDED PRECEDING.
Restrictions: This function cannot be used as an analytic function; it does not
currently support the OVER() clause.
Compatibility:
NULL considerations:
UDF considerations:
UDF considerations: This type cannot be used for the argument or return type of a
user-defined function (UDF) or user-defined aggregate function (UDA).
Considerations for views:
NULL considerations: Casting any non-numeric value to this type produces a
NULL value.
NULL considerations: Casting any unrecognized STRING value to
this type produces a NULL value.
NULL considerations: An expression of this type produces a NULL
value if any argument of the expression is NULL.
Required privileges:
Parquet considerations:
To examine the internal structure and data of Parquet files, you can use the
parquet-tools command. Make sure this command is in your
$PATH. (Typically, it is symlinked from /usr/bin;
sometimes, depending on your installation setup, you might need to locate it under an
alternative bin directory.) The arguments to this command let you
perform operations such as:
-
cat: Print a file's contents to standard output. In
and higher, you can use the -j option to output JSON.
-
head: Print the first few records of a file to standard output.
-
schema: Print the Parquet schema for the file.
-
meta: Print the file footer metadata, including key-value
properties (like Avro schema), compression ratios, encodings, compression used, and
row group information.
-
dump: Print all data and metadata.
Use parquet-tools -h to see usage information for all the arguments.
Here are some examples showing parquet-tools usage:
Parquet considerations: This type is fully compatible with Parquet tables.
This function cannot be used in an analytic context. That is, the
OVER() clause is not allowed at all with this function.
In queries involving both analytic functions and partitioned tables, partition pruning
only occurs for columns named in the PARTITION BY clause of the
analytic function call. For example, if an analytic function query has a clause such as
WHERE year=2016, the way to make the query prune all other
YEAR partitions is to include PARTITION BY year in the
analytic function call; for example, OVER (PARTITION BY year, other_columns
other_analytic_clauses).
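The following sketch (sales_data, year, month, and revenue are hypothetical names)
shows a query where naming year in the PARTITION BY clause
lets the WHERE year = 2016 predicate prune all other YEAR
partitions:
-- Hypothetical partitioned table; "partition by year" in the OVER() clause
-- allows pruning of all partitions other than year=2016.
select year, month, revenue,
  sum(revenue) over (partition by year, month) as monthly_total
from sales_data
where year = 2016;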
Impala can query Parquet files that use the PLAIN,
PLAIN_DICTIONARY, BIT_PACKED, and RLE
encodings. Currently, Impala does not support RLE_DICTIONARY encoding.
When creating files outside of Impala for use by Impala, make sure to use one of the
supported encodings. In particular, for MapReduce jobs that write Parquet files, do not
set parquet.writer.version in the job configuration (especially not to
PARQUET_2_0). Use the default format version, 1.0, whose enhancements remain
compatible with older readers. Data written in the 2.0 format might not be consumable by
Impala, due to its use of the RLE_DICTIONARY encoding.
Currently, Impala always decodes the column data in Parquet files based on the ordinal
position of the columns, not by looking up the position of each column based on its
name. Parquet files produced outside of Impala must write column data in the same
order as the columns are declared in the Impala table. Any optional columns that are
omitted from the data files must be the rightmost columns in the Impala table
definition.
If you created compressed Parquet files through some tool other than Impala, make sure
that any compression codecs are supported in Parquet by Impala. For example, Impala
does not currently support LZO compression in Parquet files. Also, double-check that you
used any recommended compatibility settings in the other tool, such as
spark.sql.parquet.binaryAsString when writing Parquet files through
Spark.
Text table considerations:
Text table considerations: Values of this type are potentially larger in text
tables than in tables using Parquet or other binary formats.
Schema evolution considerations:
Column statistics considerations:
Column statistics considerations: Because this type has a fixed size, the maximum
and average size fields are always filled in for column statistics, even before you run
the COMPUTE STATS statement.
Column statistics considerations: Because the values of this type have variable
size, none of the column statistics fields are filled in until you run the
COMPUTE STATS statement.
Usage notes:
Impala does not evaluate NaN (not a number) as equal to any other numeric value,
including another NaN value. For example, the following sketch (assuming that
CAST('nan' AS DOUBLE) produces a NaN value) evaluates equality between two NaN
values and returns false:
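-- Comparing NaN to NaN; the result is false because NaN never
-- compares as equal, even to another NaN value.
select cast('nan' as double) = cast('nan' as double);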
Examples:
Result set:
JDBC and ODBC considerations:
Cancellation: Cannot be cancelled.
Cancellation: Can be cancelled. To cancel this statement, use Ctrl-C from the
impala-shell interpreter, the Cancel button
from the Watch page in Hue, or Cancel from
the list of in-flight queries (for a particular node) on the
Queries tab in the Impala web UI (port 25000).
Cancellation: Certain multi-stage statements (CREATE TABLE AS
SELECT and COMPUTE STATS) can be cancelled during some stages,
when running INSERT or SELECT operations internally.
To cancel this statement, use Ctrl-C from the impala-shell
interpreter, the Cancel button from the
Watch page in Hue, or Cancel from the list
of in-flight queries (for a particular node) on the Queries tab
in the Impala web UI (port 25000).
Partitioning:
Partitioning: Prefer to use this type for a partition key column. Impala can
process the numeric type more efficiently than a STRING representation
of the value.
Partitioning: This type can be used for partition key columns. Because of the
efficiency advantage of numeric values over character-based values, if the partition key
is a string representation of a number, prefer to use an integer type with sufficient
range (INT, BIGINT, and so on) where practical.
Partitioning: Because this type has so few distinct values, it is typically not a
sensible choice for a partition key column.
Partitioning: Because fractional values of this type are not always represented
precisely, when this type is used for a partition key column, the underlying HDFS
directories might not be named exactly as you expect. Prefer to partition on a
DECIMAL column instead.
Partitioning: Because this type potentially has so many distinct values, it is
often not a sensible choice for a partition key column. For example, events 1
millisecond apart would be stored in different partitions. Consider using the
TRUNC() function to condense the number of distinct values, and
partition on a new column with the truncated values.
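As a sketch, using a hypothetical events table with an event_time column,
TRUNC() can condense the timestamps to one distinct value per day before that
value is used as a partition key:
-- One distinct value per day instead of one per millisecond.
select trunc(event_time, 'DD') as event_day, count(*) as events_that_day
from events
group by trunc(event_time, 'DD');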
HDFS considerations:
File format considerations:
Amazon S3 considerations:
ADLS considerations:
Isilon considerations:
Because the EMC Isilon storage devices use a global value for the block size rather than
a configurable value for each file, the PARQUET_FILE_SIZE query option
has no effect when Impala inserts data into a table or partition residing on Isilon
storage. Use the isi command to set the default block size globally on
the Isilon device. For example, to set the Isilon default block size to 256 MB, the
recommended size for Parquet data files for Impala, issue the following command:
isi hdfs settings modify --default-block-size=256MB
HBase considerations:
The LOAD DATA statement cannot be used with HBase tables.
HBase considerations: This data type is fully compatible with HBase tables.
HBase considerations: This data type cannot be used with HBase tables.
Internal details:
Internal details: Represented in memory as a 1-byte value.
Internal details: Represented in memory as a 2-byte value.
Internal details: Represented in memory as a 4-byte value.
Internal details: Represented in memory as an 8-byte value.
Internal details: Represented in memory as a 16-byte value.
Internal details: Represented in memory as a byte array with the same size as the
length specification. Values that are shorter than the specified length are padded on
the right with trailing spaces.
Internal details: Represented in memory as a byte array with the minimum size
needed to represent each value.
Added in:
Added in:
Added in:
Added in:
Added in:
Added in:
Added in:
Added in:
Added in:
Added in:
Added in:
Added in: Available in earlier Impala releases, but new capabilities were added
in
Added in: Available in all versions of Impala.
Added in: Impala 1.4.0
Added in: Impala 1.3.0
Added in: Impala 1.1
Added in: Impala 1.1.1
Added in:
Added in:
Syntax:
For other tips about managing and reclaiming Impala disk space, see
.
Impala supports a wide variety of JOIN clauses. Left, right, semi,
full, and outer joins are supported in all Impala versions. The CROSS
JOIN operator is available in Impala 1.2.2 and higher. During performance
tuning, you can override the reordering of join clauses that Impala does internally by
including the keyword STRAIGHT_JOIN immediately after the
SELECT and any DISTINCT or ALL
keywords.
The STRAIGHT_JOIN hint affects the join order of table references in
the query block containing the hint. It does not affect the join order of nested
queries, such as views, inline views, or WHERE-clause subqueries. To
use this hint for performance tuning of complex queries, apply the hint to all query
blocks that need a fixed join order.
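For example, in this sketch (t1 and t2 are hypothetical tables), the
STRAIGHT_JOIN keyword makes Impala join the tables in the order they are
listed rather than reordering them:
-- STRAIGHT_JOIN goes immediately after SELECT (and after any
-- DISTINCT or ALL keyword); the tables are joined in the order written.
select straight_join t1.id, t2.descr
from t1 join t2 on t1.id = t2.id;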
In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE
METADATA after the table is created in Hive, allowing you to make individual
tables visible to Impala without doing a full reload of the catalog metadata. Impala
1.2.4 also includes other changes to make the metadata broadcast mechanism faster and
more responsive, especially during Impala startup. See
for details.
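For example, after creating a table through Hive, a single statement such as the
following sketch (new_table_from_hive is a hypothetical name) makes that one table
visible to Impala:
-- Refresh metadata for one newly created table rather than the whole catalog.
invalidate metadata new_table_from_hive;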
Read the EXPLAIN plan from bottom to top:
-
The last part of the plan shows the low-level details such as the expected amount of
data that will be read, where you can judge the effectiveness of your partitioning
strategy and estimate how long it will take to scan a table based on total data size
and the size of the cluster.
-
As you work your way up, next you see the operations that will be parallelized and
performed on each Impala node.
-
At the higher levels, you see how data flows when intermediate result sets are
combined and transmitted from one node to another.
-
See for details
about the EXPLAIN_LEVEL query option, which lets you customize how
much detail to show in the EXPLAIN plan depending on whether you
are doing high-level or low-level tuning, dealing with logical or physical aspects
of the query.
Aggregate functions are a special category with different rules. These functions
calculate a return value across all the items in a result set, so they require a
FROM clause in the query:
select count(product_id) from product_catalog;
select max(height), avg(height) from census_data where age > 20;
Aggregate functions also ignore NULL values rather than returning a
NULL result. For example, if some rows have NULL for a
particular column, those rows are ignored when computing the AVG() for
that column. Likewise, specifying COUNT(col_name) in
a query counts only those rows where col_name contains a
non-NULL value.
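The following sketch (t1 and col1 are hypothetical names) contrasts
COUNT(*), which counts every row, with COUNT() and
AVG() applied to a column that contains some NULL values:
-- count(*) counts all rows; count(col1) and avg(col1) skip rows
-- where col1 is NULL.
select count(*) as all_rows,
  count(col1) as non_null_values,
  avg(col1) as average_of_non_null_values
from t1;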
Aliases follow the same rules as identifiers when it
comes to case insensitivity. Aliases can be longer than identifiers (up to the maximum
length of a Java string) and can include additional characters such as spaces and dashes
when they are quoted using backtick characters.
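For example, this sketch (census is a hypothetical table) uses backtick-quoted
aliases that contain a space and a dash:
-- Quoted aliases can contain characters that are not legal in identifiers.
select c_first_name as `given name`, c_last_name as `family-name`
from census;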
Another way to define different names for the same tables or columns is to create views.
See for details.
When inserting into partitioned tables, especially using the Parquet file format, you
can include a hint in the INSERT statement to fine-tune the overall
performance of the operation and its resource usage:
-
You would only use hints if an INSERT into a partitioned Parquet
table was failing due to capacity limits, or if such an INSERT was
succeeding but with less-than-optimal performance.
-
To use one of these hints, put /* +SHUFFLE */ or
/* +NOSHUFFLE */ (including the comment markers) after the
PARTITION clause, immediately before the
SELECT keyword, as in the sketch following this list.
-
/* +SHUFFLE */ selects an execution plan that reduces the number of
files being written simultaneously to HDFS, and the number of memory buffers holding
data for individual partitions. Thus it reduces overall resource usage for the
INSERT operation, allowing some INSERT operations
to succeed that otherwise would fail. It does involve some data transfer between the
nodes so that the data files for a particular partition are all constructed on the
same node.
-
/* +NOSHUFFLE */ selects an execution plan that might be faster
overall, but might also produce a larger number of small data files or exceed
capacity limits, causing the INSERT operation to fail. Use
/* +SHUFFLE */ in cases where an INSERT statement
fails or runs inefficiently due to all nodes attempting to construct data for all
partitions.
-
Impala automatically uses the /* +SHUFFLE */ method if any
partition key column in the source table, mentioned in the INSERT ...
SELECT query, does not have column statistics. In this case, only the
/* +NOSHUFFLE */ hint would have any effect.
-
If column statistics are available for all partition key columns in the source table
mentioned in the INSERT ... SELECT query, Impala chooses whether to
use the /* +SHUFFLE */ or /* +NOSHUFFLE */
technique based on the estimated number of distinct values in those columns and the
number of nodes involved in the INSERT operation. In this case, you
might need the /* +SHUFFLE */ or the /* +NOSHUFFLE
*/ hint to override the execution plan selected by Impala.
-
In or higher, you can make the
INSERT operation organize (cluster) the data for each
partition to avoid buffering data for multiple partitions and reduce the risk of an
out-of-memory condition. Specify the hint as /* +CLUSTERED */. This
technique is primarily useful for inserts into Parquet tables, where the large block
size requires substantial memory to buffer data for multiple output files at once.
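The following sketch (parquet_sales and raw_sales are hypothetical tables) shows the
hint placement described in this list, with the hint between the
PARTITION clause and the SELECT keyword:
-- The /* +SHUFFLE */ hint reduces the number of files and memory
-- buffers used while writing the partitioned Parquet data.
insert into parquet_sales partition (year, month) /* +SHUFFLE */
  select id, amount, year, month from raw_sales;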
Any INSERT statement for a Parquet table requires enough free space in
the HDFS filesystem to write one block. Because Parquet data files use a block size of 1
GB by default, an INSERT might fail (even for a very small amount of
data) if your HDFS is running low on space.
After adding or replacing data in a table used in performance-critical queries, issue a
COMPUTE STATS statement to make sure all statistics are up-to-date.
Consider updating statistics for a table after any INSERT, LOAD
DATA, or CREATE TABLE AS SELECT statement in Impala, or after
loading data through Hive and doing a REFRESH
table_name in Impala. This technique is especially important
for tables that are very large, used in join queries, or both.
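For example, after loading data through Hive, a sketch of the follow-up statements in
Impala (sales_fact is a hypothetical table) might be:
-- Make the new data visible to Impala, then refresh the statistics.
refresh sales_fact;
compute stats sales_fact;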
Usage notes: concat() and concat_ws() are
appropriate for concatenating the values of multiple columns within the same row, while
group_concat() joins together values from different rows.
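For example, in this sketch (phone_book is a hypothetical table),
concat() builds one value per row while group_concat()
collapses many rows into one value per group:
-- One output value for each input row.
select concat(first_name, ' ', last_name) as full_name from phone_book;
-- One output value for each city, combining values from many rows.
select city, group_concat(last_name) as residents
from phone_book group by city;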
In Impala 1.2.1 and higher, all NULL values come at the end of the
result set for ORDER BY ... ASC queries, and at the beginning of the
result set for ORDER BY ... DESC queries. In effect,
NULL is considered greater than all other values for sorting purposes.
The original Impala behavior always put NULL values at the end, even
for ORDER BY ... DESC queries. The new behavior in Impala 1.2.1 makes
Impala more compatible with other popular database systems. In Impala 1.2.1 and higher,
you can override or specify the sorting behavior for NULL by adding the
clause NULLS FIRST or NULLS LAST at the end of the
ORDER BY clause.
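For example, this sketch (t1 and x are hypothetical names) forces the
NULL values to the end of a descending sort, overriding the default
of putting them first:
-- Without NULLS LAST, a DESC sort would place the NULL values first.
select x from t1 order by x desc nulls last;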
Return type: same as the initial argument value, except that integer values are
promoted to BIGINT and floating-point values are promoted to
DOUBLE; use CAST() when inserting into a smaller
numeric column
Statement type: DDL
Statement type: DML (but still affected by the
SYNC_DDL query option)
Statement type: DML
If you connect to different Impala nodes within an impala-shell
session for load-balancing purposes, you can enable the SYNC_DDL query
option to make each DDL statement wait before returning, until the new or changed
metadata has been received by all the Impala nodes. See
for details.
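A minimal sketch of enabling the option within impala-shell before
issuing DDL (t1 and t2 are hypothetical tables):
-- Subsequent DDL statements wait until all Impala nodes have the new metadata.
set SYNC_DDL=1;
create table t2 like t1;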
The Impala regular expression syntax conforms to the POSIX Extended Regular Expression
syntax used by the Boost library. For details, see
the
Boost documentation. It has most idioms familiar from regular expressions in
Perl, Python, and so on. It does not support .*? for non-greedy
matches.
In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX
Extended Regular Expression syntax used by the Google RE2 library. For details, see
the RE2
documentation. It has most idioms familiar from regular expressions in Perl,
Python, and so on, including .*? for non-greedy matches.
In Impala 2.0 and later, a change in the underlying regular expression library could
cause changes in the way regular expressions are interpreted by this function. Test any
queries that use regular expressions and adjust the expression patterns if necessary.
See
for details.
Because the impala-shell interpreter uses the \
character for escaping, use \\ to represent the regular expression
escape character in any regular expressions that you submit through
impala-shell. You might prefer to use the equivalent character class
names, such as [[:digit:]] instead of \d, which you
would have to escape as \\d.
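For example, both of the following sketch queries match the digits in a string; the
second form needs the doubled backslash when submitted through
impala-shell:
-- Character class name: no extra escaping needed.
select regexp_extract('abc123', '[[:digit:]]+', 0);
-- Perl-style shorthand: the backslash itself must be escaped in impala-shell.
select regexp_extract('abc123', '\\d+', 0);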
The SET statement has no effect until the
impala-shell interpreter is connected to an Impala server. Once you
are connected, any query options you set remain in effect as you issue a subsequent
CONNECT command to connect to a different Impala host.
Prior to Impala 1.4.0, Impala required any query including an
ORDER BY
clause to also use a
LIMIT clause. In
Impala 1.4.0 and higher, the LIMIT clause is optional for ORDER
BY queries. In cases where sorting a huge result set requires enough memory to
exceed the Impala memory limit for a particular executor Impala daemon, Impala
automatically uses a temporary disk work area to perform the sort operation.
In Impala 1.2.1 and higher, you can combine a LIMIT clause with an
OFFSET clause to produce a small result set that is different from a
top-N query, for example, to return items 11 through 20. This technique can be used to
simulate paged
results. Because Impala queries typically involve substantial
amounts of I/O, use this technique only for compatibility in cases where you cannot
rewrite the application logic. For best performance and scalability, wherever practical,
query as many items as you expect to need, cache them on the application side, and
display small groups of results to users using application logic.
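For example, this sketch (items is a hypothetical table) returns the second page of
ten results, ordered by price:
-- Items 11 through 20 in descending price order; OFFSET requires ORDER BY.
select name, price from items
order by price desc
limit 10 offset 10;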
In and higher, the optional WITH
REPLICATION clause for CREATE TABLE and ALTER
TABLE lets you specify a replication factor, the number of hosts
on which to cache the same data blocks. When Impala processes a cached data block, where
the cache replication factor is greater than 1, Impala randomly selects a host that has
a cached copy of that data block. This optimization avoids excessive CPU usage on a
single host when the same cached data block is processed multiple times. Where
practical, specify a value greater than or equal to the HDFS block replication factor.
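A minimal sketch (census and four_gig_pool are hypothetical names) that caches a
table's data blocks on three hosts:
-- Cache each data block on 3 hosts so repeated reads are spread out.
create table census (name string, zip string)
  cached in 'four_gig_pool' with replication = 3;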
If a view applies to a partitioned table, any partition pruning considers the clauses on
both the original query and any additional WHERE predicates in the
query that refers to the view. Prior to Impala 1.4, only the WHERE
clauses on the original query from the CREATE VIEW statement were used
for partition pruning.
To see the definition of a view, issue a DESCRIBE FORMATTED statement,
which shows the query from the original CREATE VIEW statement:
[localhost:21000] > create view v1 as select * from t1;
[localhost:21000] > describe formatted v1;
Query finished, fetching results ...
+------------------------------+------------------------------+------------+
| name | type | comment |
+------------------------------+------------------------------+------------+
| # col_name | data_type | comment |
| | NULL | NULL |
| x | int | None |
| y | int | None |
| s | string | None |
| | NULL | NULL |
| # Detailed Table Information | NULL | NULL |
| Database: | views | NULL |
| Owner: | doc_demo | NULL |
| CreateTime: | Mon Jul 08 15:56:27 EDT 2013 | NULL |
| LastAccessTime: | UNKNOWN | NULL |
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Table Type: | VIRTUAL_VIEW | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1373313387 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
| SerDe Library: | null | NULL |
| InputFormat: | null | NULL |
| OutputFormat: | null | NULL |
| Compressed: | No | NULL |
| Num Buckets: | 0 | NULL |
| Bucket Columns: | [] | NULL |
| Sort Columns: | [] | NULL |
| | NULL | NULL |
| # View Information | NULL | NULL |
| View Original Text: | SELECT * FROM t1 | NULL |
| View Expanded Text: | SELECT * FROM t1 | NULL |
+------------------------------+------------------------------+------------+
The INSERT ... VALUES technique is not suitable for loading large
quantities of data into HDFS-based tables, because the insert operations cannot be
parallelized, and each one produces a separate data file. Use it for setting up small
dimension tables or tiny amounts of data for experimenting with SQL syntax, or with
HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations.
Do not run scripts with thousands of INSERT ... VALUES statements that
insert a single row each time. If you do run INSERT ... VALUES
operations to load data into a staging table as one stage in an ETL pipeline, include
multiple row values if possible within each VALUES clause, and use a
separate database to make cleanup easier if the operation does produce many tiny files.
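For example, this sketch (staging_dims is a hypothetical small dimension table)
combines several rows into a single statement rather than issuing one
INSERT per row:
-- One statement, one data file, several rows.
insert into staging_dims values
  (1, 'north'), (2, 'south'), (3, 'east'), (4, 'west');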