Reusable Text, Paragraphs, List Items, and Other Elements for Impala

All the elements in this file with IDs are intended to be conref'ed elsewhere. Practically all of the conref'ed elements for the Impala docs are in this file, to avoid questions about when it is safe to remove or move something in any of the 'main' files, and to avoid having to change conref references as a result.

This file defines some dummy subheadings as section elements, just for self-documentation. Using sections instead of nested concepts lets all the conref links point to a very simple name pattern, '#common/id_within_the_file', rather than a 3-part reference with an intervening, variable concept ID.

Conceptual Content

Overview and conceptual information for Impala as a whole.

The following are some of the key advantages of Impala:

  • Impala integrates with the existing ecosystem, meaning data can be stored, shared, and accessed using the various solutions included with . This also avoids data silos and minimizes expensive data movement.
  • Impala provides access to data stored in without requiring the Java skills needed for MapReduce jobs. Impala can access data directly from the HDFS file system. Impala also provides a SQL front-end to access data in the HBase database system, or in the Amazon Simple Storage Service (S3).
  • Impala returns results typically within seconds or a few minutes, rather than the many minutes or hours that are often required for Hive queries to complete.
  • Impala is pioneering the use of the Parquet file format, a columnar storage layout that is optimized for large-scale queries typical in data warehouse scenarios.

Impala provides:

  • Familiar SQL interface that data scientists and analysts already know.
  • Ability to query high volumes of data (big data) in Apache Hadoop.
  • Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity hardware.
  • Ability to share data files between different components with no copy or export/import step; for example, to write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling simple data interchange using Impala for analytics on Hive-produced data.
  • Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for analytics.

Sentry-Related Content

Material related to Sentry security, intended to be reused between Hive and Impala. It is complicated by the fact that most of it will probably be multi-paragraph or involve subheads, so it might need to be represented as nested topics at the end of this file.

Valid privilege types and the objects they apply to:

Privilege | Object
INSERT | DB, TABLE
SELECT | DB, TABLE, COLUMN
ALL | SERVER, TABLE, DB, URI
Privilege table for Hive & Impala operations. Each row lists the Operation, the Scope, the Privileges Required, and (where applicable) whether a URI privilege is also involved; notes about an operation appear in parentheses after the operation name.

Operation | Scope | Privileges Required | URI
CREATE DATABASE | SERVER | ALL
DROP DATABASE | DATABASE | ALL
CREATE TABLE | DATABASE | ALL
DROP TABLE | TABLE | ALL
CREATE VIEW (This operation is allowed if you have column-level SELECT access to the columns being used.) | DATABASE; SELECT on TABLE | ALL
ALTER VIEW (This operation is allowed if you have column-level SELECT access to the columns being used.) | VIEW/TABLE | ALL
DROP VIEW | VIEW/TABLE | ALL
ALTER TABLE .. ADD COLUMNS | TABLE | ALL on DATABASE
ALTER TABLE .. REPLACE COLUMNS | TABLE | ALL on DATABASE
ALTER TABLE .. CHANGE column | TABLE | ALL on DATABASE
ALTER TABLE .. RENAME | TABLE | ALL on DATABASE
ALTER TABLE .. SET TBLPROPERTIES | TABLE | ALL on DATABASE
ALTER TABLE .. SET FILEFORMAT | TABLE | ALL on DATABASE
ALTER TABLE .. SET LOCATION | TABLE | ALL on DATABASE | URI
ALTER TABLE .. ADD PARTITION | TABLE | ALL on DATABASE
ALTER TABLE .. ADD PARTITION location | TABLE | ALL on DATABASE | URI
ALTER TABLE .. DROP PARTITION | TABLE | ALL on DATABASE
ALTER TABLE .. PARTITION SET FILEFORMAT | TABLE | ALL on DATABASE
SHOW CREATE TABLE | TABLE | SELECT/INSERT
SHOW PARTITIONS | TABLE | SELECT/INSERT
SHOW TABLES (Output includes all the tables for which the user has table-level privileges and all the tables for which the user has some column-level privileges.) | TABLE | SELECT/INSERT
SHOW GRANT ROLE (Output includes an additional field for any column-level privileges.) | TABLE | SELECT/INSERT
DESCRIBE TABLE (Output shows all columns if the user has table-level privileges or SELECT privilege on at least one table column.) | TABLE | SELECT/INSERT
LOAD DATA | TABLE | INSERT | URI
SELECT (You can grant the SELECT privilege on a view to give users access to specific columns of a table they do not otherwise have access to. See for details on allowed column-level operations.) | VIEW/TABLE; COLUMN | SELECT
INSERT OVERWRITE TABLE | TABLE | INSERT
CREATE TABLE .. AS SELECT (This operation is allowed if you have column-level SELECT access to the columns being used.) | DATABASE; SELECT on TABLE | ALL
USE <dbName> | Any
CREATE FUNCTION | SERVER | ALL
ALTER TABLE .. SET SERDEPROPERTIES | TABLE | ALL on DATABASE
ALTER TABLE .. PARTITION SET SERDEPROPERTIES | TABLE | ALL on DATABASE

Hive-Only Operations
INSERT OVERWRITE DIRECTORY | TABLE | INSERT | URI
Analyze TABLE | TABLE | SELECT + INSERT
IMPORT TABLE | DATABASE | ALL | URI
EXPORT TABLE | TABLE | SELECT | URI
ALTER TABLE TOUCH | TABLE | ALL on DATABASE
ALTER TABLE TOUCH PARTITION | TABLE | ALL on DATABASE
ALTER TABLE .. CLUSTERED BY SORTED BY | TABLE | ALL on DATABASE
ALTER TABLE .. ENABLE/DISABLE | TABLE | ALL on DATABASE
ALTER TABLE .. PARTITION ENABLE/DISABLE | TABLE | ALL on DATABASE
ALTER TABLE .. PARTITION.. RENAME TO PARTITION | TABLE | ALL on DATABASE
MSCK REPAIR TABLE | TABLE | ALL
ALTER DATABASE | DATABASE | ALL
DESCRIBE DATABASE | DATABASE | SELECT/INSERT
SHOW COLUMNS (Output for this operation filters columns to which the user does not have explicit SELECT access.) | TABLE | SELECT/INSERT
CREATE INDEX | TABLE | ALL
DROP INDEX | TABLE | ALL
SHOW INDEXES | TABLE | SELECT/INSERT
GRANT PRIVILEGE | Allowed only for Sentry admin users
REVOKE PRIVILEGE | Allowed only for Sentry admin users
SHOW GRANTS | Allowed only for Sentry admin users
SHOW TBLPROPERTIES | TABLE | SELECT/INSERT
DESCRIBE TABLE .. PARTITION | TABLE | SELECT/INSERT
ADD JAR | Not Allowed
ADD FILE | Not Allowed
DFS | Not Allowed

Impala-Only Operations
EXPLAIN | TABLE; COLUMN | SELECT
INVALIDATE METADATA | SERVER | ALL
INVALIDATE METADATA <table name> | TABLE | SELECT/INSERT
REFRESH <table name> or REFRESH <table name> PARTITION (<partition_spec>) | TABLE | SELECT/INSERT
DROP FUNCTION | SERVER | ALL
COMPUTE STATS | TABLE | ALL

In and higher, Impala recognizes the auth_to_local setting, specified through the HDFS configuration setting hadoop.security.auth_to_local or the Cloudera Manager setting Additional Rules to Map Kerberos Principals to Short Names. This feature is disabled by default, to avoid an unexpected change in security-related behavior. To enable it:

  • For clusters not managed by Cloudera Manager, specify --load_auth_to_local_rules=true in the impalad and catalogd configuration settings.

  • For clusters managed by Cloudera Manager, select the Use HDFS Rules to Map Kerberos Principals to Short Names checkbox to enable the service-wide load_auth_to_local_rules configuration setting. Then restart the Impala service.

See Using Auth-to-Local Rules to Isolate Cluster Users for general information about this feature.

Regardless of the authentication mechanism used, Impala always creates HDFS directories and data files owned by the same user (typically impala). To implement user-level access to different databases, tables, columns, partitions, and so on, use the Sentry authorization feature, as explained in .

Debugging Failed Sentry Authorization Requests

Sentry logs all facts that lead up to authorization decisions at the debug level. If you do not understand why Sentry is denying access, the best way to debug is to temporarily turn on debug logging:

  • In Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the logging settings for your service through the corresponding Logging Safety Valve field for the Impala, Hive Server 2, or Solr Server services.
  • On systems not managed by Cloudera Manager, add log4j.logger.org.apache.sentry=DEBUG to the log4j.properties file on each host in the cluster, in the appropriate configuration directory for each service.
Specifically, look for exceptions and messages such as: FilePermission server..., RequestPermission server..., result [true|false], which indicate each evaluation Sentry makes. The FilePermission is from the policy file, while RequestPermission is the privilege required for the query. A RequestPermission iterates over all appropriate FilePermission settings until a match is found. If no matching privilege is found, Sentry returns false, indicating Access Denied.

Cloudera Manager Terminology

Especially during the transition from CM 4 to CM 5, we'll use some stock phraseology to talk about fields and such. Also, there are some task steps and similar items to conref under the Impala Service page that are easier to keep track of here instead of in cm_common_elements.xml. (Although as part of the Apache work, anything CM-related might naturally move out of this file.)

In Cloudera Manager 4, these fields are labelled Safety Valve; in Cloudera Manager 5, they are called Advanced Configuration Snippet.

  • Go to the Impala service.
  • Restart the Impala service.
Items from the Citibank Escalation Spreadsheet

Paragraphs with IDs are intended to be reused both in the FAQ and the User's Guide. They refer to feature requests or misunderstandings encountered by Citibank, captured in the escalation spreadsheet here: .

With Impala, you use the built-in CONCAT() function to concatenate two, three, or more strings:

select concat('some prefix: ', col1) from t1;
select concat('abc','mno','xyz');

Impala does not currently support operators for string concatenation, such as || as seen in some other database systems.

You can specify column aliases with or without the AS keyword, and with no quotation marks, single quotation marks, or double quotation marks. Some kind of quotation marks are required if the column alias contains any spaces or other problematic characters. The alias text is displayed in the impala-shell output as all-lowercase. For example:

[localhost:21000] > select c1 First_Column from t;
[localhost:21000] > select c1 as First_Column from t;
+--------------+
| first_column |
+--------------+
...
[localhost:21000] > select c1 'First Column' from t;
[localhost:21000] > select c1 as 'First Column' from t;
+--------------+
| first column |
+--------------+
...
[localhost:21000] > select c1 "First Column" from t;
[localhost:21000] > select c1 as "First Column" from t;
+--------------+
| first column |
+--------------+
...

Currently, Impala does not support temporary tables. Some other database systems have a class of lightweight tables that are held only in memory and/or that are only accessible by one connection and disappear when the session ends. In Impala, creating new databases is a relatively lightweight operation, so as an alternative, you could create a database with a unique name and use CREATE TABLE LIKE, CREATE TABLE AS SELECT, and INSERT statements to create a table in that database to hold the result set of a query, to use in subsequent queries. When finished, issue a DROP TABLE statement followed by DROP DATABASE.
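For example, a minimal sketch of this workaround (the SALES table and the SCRATCH_2017_Q1 database name are placeholders):

create database scratch_2017_q1;
create table scratch_2017_q1.top_customers as
  select customer_id, sum(amount) as total from sales group by customer_id;
-- Reuse the intermediate result in later queries.
select * from scratch_2017_q1.top_customers order by total desc limit 10;
-- Clean up when finished.
drop table scratch_2017_q1.top_customers;
drop database scratch_2017_q1;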

Blurbs About Standards Compliance

The following blurbs simplify the process of flagging which SQL standard various features were first introduced in. The wording and the tagging can be modified by editing one central instance of each blurb. Not extensively used yet, just here and there in the SQL Language Reference section.

Standards compliance: Introduced in SQL-1986.

Standards compliance: Introduced in SQL-1989.

Standards compliance: Introduced in SQL-1992.

Standards compliance: Introduced in SQL:1999.

Standards compliance: Introduced in SQL:2003.

Standards compliance: Introduced in SQL:2008.

Standards compliance: Introduced in SQL:2011.

Standards compliance: Extension first introduced in HiveQL.

Standards compliance: Extension first introduced in Impala.

Background Info for REFRESH, INVALIDATE METADATA, and General Metadata Discussion

Because REFRESH table_name only works for tables that the current Impala node is already aware of, when you create a new table in the Hive shell, enter INVALIDATE METADATA new_table before you can see the new table in impala-shell. Once the table is known by Impala, you can issue REFRESH table_name after you add data files for that table.

INVALIDATE METADATA and REFRESH are counterparts: INVALIDATE METADATA waits to reload the metadata when needed for a subsequent query, but reloads all the metadata for the table, which can be an expensive operation, especially for large tables with many partitions. REFRESH reloads the metadata immediately, but only loads the block location data for newly added data files, making it a less expensive operation overall. If data was altered in some more extensive way, such as being reorganized by the HDFS balancer, use INVALIDATE METADATA to avoid a performance penalty from reduced local reads. If you used Impala version 1.0, the INVALIDATE METADATA statement works just like the Impala 1.0 REFRESH statement did, while the Impala 1.1 REFRESH is optimized for the common use case of adding new data files to an existing table, thus the table name argument is now required.
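The following minimal sequence illustrates the division of labor, assuming a table named NEW_TABLE that was created in the Hive shell and later receives new data files outside of Impala:

-- In impala-shell, after creating new_table in the Hive shell:
invalidate metadata new_table;
select count(*) from new_table;
-- Later, after adding data files for the (now known) table outside of Impala:
refresh new_table;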

SQL Language Reference Snippets

These reusable chunks were taken from conrefs originally in ciiu_langref_sql.xml. Or they are primarily used in new SQL syntax topics underneath that parent topic.

In CDH 5.7.0 / Impala 2.5.0, only the value 1 enables the option, and the value true is not recognized. This limitation is tracked by the issue IMPALA-3334, which shows the releases where the problem is fixed.

The Avro specification allows string values up to 2**64 bytes in length. Impala queries for Avro tables use 32-bit integers to hold string lengths. In and higher, Impala truncates CHAR and VARCHAR values in Avro tables to (2**31)-1 bytes. If a query encounters a STRING value longer than (2**31)-1 bytes in an Avro table, the query fails. In earlier releases, encountering such long values in an Avro table could cause a crash.

You specify a case-insensitive symbolic name for the kind of statistics: numDVs, numNulls, avgSize, maxSize. The key names and values are both quoted. This operation applies to an entire table, not a specific partition. For example:

create table t1 (x int, s string);
insert into t1 values (1, 'one'), (2, 'two'), (2, 'deux');
show column stats t1;
+--------+--------+------------------+--------+----------+----------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| x      | INT    | -1               | -1     | 4        | 4        |
| s      | STRING | -1               | -1     | -1       | -1       |
+--------+--------+------------------+--------+----------+----------+
alter table t1 set column stats x ('numDVs'='2','numNulls'='0');
alter table t1 set column stats s ('numdvs'='3','maxsize'='4');
show column stats t1;
+--------+--------+------------------+--------+----------+----------+
| Column | Type   | #Distinct Values | #Nulls | Max Size | Avg Size |
+--------+--------+------------------+--------+----------+----------+
| x      | INT    | 2                | 0      | 4        | 4        |
| s      | STRING | 3                | -1     | 4        | -1       |
+--------+--------+------------------+--------+----------+----------+

create table analysis_data stored as parquet as select * from raw_data;
Inserted 1000000000 rows in 181.98s
compute stats analysis_data;
insert into analysis_data select * from smaller_table_we_forgot_before;
Inserted 1000000 rows in 15.32s
-- Now there are 1001000000 rows. We can update this single data point in the stats.
alter table analysis_data set tblproperties('numRows'='1001000000', 'STATS_GENERATED_VIA_STATS_TASK'='true');

-- If the table originally contained 1 million rows, and we add another partition with 30 thousand rows,
-- change the numRows property for the partition and the overall table.
alter table partitioned_data partition(year=2009, month=4) set tblproperties ('numRows'='30000', 'STATS_GENERATED_VIA_STATS_TASK'='true');
alter table partitioned_data set tblproperties ('numRows'='1030000', 'STATS_GENERATED_VIA_STATS_TASK'='true');

Impala does not return column overflows as NULL, so that customers can distinguish between NULL data and overflow conditions similar to how they do so with traditional database systems. Impala returns the largest or smallest value in the range for the type. For example, valid values for a tinyint range from -128 to 127. In Impala, a tinyint with a value of -200 returns -128 rather than NULL. A tinyint with a value of 200 returns 127.

If you frequently run aggregate functions such as MIN(), MAX(), and COUNT(DISTINCT) on partition key columns, consider enabling the OPTIMIZE_PARTITION_KEY_SCANS query option, which optimizes such queries. This feature is available in and higher. See for the kinds of queries that this option applies to, and slight differences in how partitions are evaluated when this query option is enabled.
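For example, a brief sketch of enabling the option before an aggregation query on partition key columns (the PARTITIONED_SALES table and YEAR column are placeholders):

set optimize_partition_key_scans=1;
select min(year), max(year), count(distinct year) from partitioned_sales;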

The output from this query option is printed to standard error. The output is only displayed in interactive mode, that is, not when the -q or -f options are used.

To see how the LIVE_PROGRESS and LIVE_SUMMARY query options work in real time, see this animated demo.

Because the runtime filtering feature is enabled by default only for local processing, the other filtering-related query options have the greatest effect when used in combination with the setting RUNTIME_FILTER_MODE=GLOBAL.

Because the runtime filtering feature applies mainly to resource-intensive and long-running queries, only adjust this query option when tuning long-running queries involving some combination of large partitioned tables and joins involving large tables.

The LIVE_PROGRESS and LIVE_SUMMARY query options currently do not produce any output during COMPUTE STATS operations.

The LIVE_PROGRESS and LIVE_SUMMARY query options only apply inside the impala-shell interpreter. You cannot use them with the SET statement from a JDBC or ODBC application.
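For example, a short impala-shell sketch (the WEB_LOGS table is a placeholder):

[localhost:21000] > set live_progress=true;
[localhost:21000] > set live_summary=true;
[localhost:21000] > select count(*) from web_logs;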

Because the LIVE_PROGRESS and LIVE_SUMMARY query options are available only within the impala-shell interpreter:

  • You cannot change these query options through the SQL SET statement using the JDBC or ODBC interfaces. The SET command in impala-shell recognizes these names as shell-only options.

  • Be careful when using impala-shell on a pre-CDH 5.5 system to connect to Impala running on a CDH 5.5 or higher system. The older impala-shell does not recognize these query option names. Upgrade impala-shell on the systems where you intend to use these query options.

  • Likewise, the impala-shell command relies on some information only available in and higher to prepare live progress reports and query summaries. The LIVE_PROGRESS and LIVE_SUMMARY query options have no effect when impala-shell connects to a cluster running an older version of Impala.

create database first_db;
use first_db;
create table t1 (x int);

create database second_db;
use second_db;
-- Each database has its own namespace for tables.
-- You can reuse the same table names in each database.
create table t1 (s string);

create database temp;
-- You can either USE a database after creating it,
-- or qualify all references to the table name with the name of the database.
-- Here, tables T2 and T3 are both created in the TEMP database.
create table temp.t2 (x int, y int);
use temp;
create table t3 (s string);

-- You cannot drop a database while it is selected by the USE statement.
drop database temp;
ERROR: AnalysisException: Cannot drop current default database: temp

-- The always-available database 'default' is a convenient one to USE
-- before dropping a database you created.
use default;

-- Before dropping a database, first drop all the tables inside it,
-- or in and higher use the CASCADE clause.
drop database temp;
ERROR: ImpalaRuntimeException: Error making 'dropDatabase' RPC to Hive Metastore:
CAUSED BY: InvalidOperationException: Database temp is not empty
show tables in temp;
+------+
| name |
+------+
| t3   |
+------+

-- and higher:
drop database temp cascade;

-- CDH 5.4 and lower:
drop table temp.t3;
drop database temp;

This example shows how to use the castto*() functions as an equivalent to CAST(value AS type) expressions.

Usage notes: A convenience function to skip the SQL CAST value AS type syntax, for example when programmatically generating SQL statements where a regular function call might be easier to construct.

To determine the time zone of the server you are connected to, in CDH 5.5 / Impala 2.3 and higher you can call the timeofday() function, which includes the time zone specifier in its return value. Remember that with cloud computing, the server you interact with might be in a different time zone than you are, or different sessions might connect to servers in different time zones, or a cluster might include servers in more than one time zone.

The way this function deals with time zones when converting to or from TIMESTAMP values is affected by the -use_local_tz_for_unix_timestamp_conversions startup flag for the impalad daemon. See for details about how Impala handles time zone considerations for the TIMESTAMP data type.

For best compatibility with the S3 write support in CDH 5.8 / Impala 2.6 and higher:

  • Use native Hadoop techniques to create data files in S3 for querying through Impala.
  • Use the PURGE clause of DROP TABLE when dropping internal (managed) tables.
By default, when you drop an internal (managed) table, the data files are moved to the HDFS trashcan. This operation is expensive for tables that reside on the Amazon S3 filesystem. Therefore, for S3 tables, prefer to use DROP TABLE table_name PURGE rather than the default DROP TABLE statement. The PURGE clause makes Impala delete the data files immediately, skipping the HDFS trashcan. For the PURGE clause to work effectively, you must originally create the data files on S3 using one of the tools from the Hadoop ecosystem, such as hadoop fs -cp, or INSERT in Impala or Hive.
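For example, a minimal sketch (the S3_SALES_ARCHIVE table name is a placeholder for an internal table whose data resides on S3):

drop table s3_sales_archive purge;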

Because of differences between S3 and traditional filesystems, DML operations for S3 tables can take longer than for tables on HDFS. For example, both the LOAD DATA statement and the final stage of the INSERT and CREATE TABLE AS SELECT statements involve moving files from one directory to another. (In the case of INSERT and CREATE TABLE AS SELECT, the files are moved from a temporary staging directory to the final destination directory.) Because S3 does not support a rename operation for existing objects, in these cases Impala actually copies the data files from one location to another and then removes the original files. In CDH 5.8 / Impala 2.6, the S3_SKIP_INSERT_STAGING query option provides a way to speed up INSERT statements for S3 tables and partitions, with the tradeoff that a problem during statement execution could leave data in an inconsistent state. It does not apply to INSERT OVERWRITE or LOAD DATA statements. See for details.

In and higher, Impala queries are optimized for files stored in Amazon S3. For Impala tables that use the file formats Parquet, RCFile, SequenceFile, Avro, and uncompressed text, the setting fs.s3a.block.size in the core-site.xml configuration file determines how Impala divides the I/O work of reading the data files. This configuration setting is specified in bytes. By default, this value is 33554432 (32 MB), meaning that Impala parallelizes S3 read operations on the files as if they were made up of 32 MB blocks. For example, if your S3 queries primarily access Parquet files written by MapReduce or Hive, increase fs.s3a.block.size to 134217728 (128 MB) to match the row group size of those files. If most S3 queries involve Parquet files written by Impala, increase fs.s3a.block.size to 268435456 (256 MB) to match the row group size produced by Impala.

In and higher, Impala supports both queries (SELECT) and DML (INSERT, LOAD DATA, CREATE TABLE AS SELECT) for data residing on Amazon S3. With the inclusion of write support, the Impala support for S3 is now considered ready for production use.

Impala query support for Amazon S3 is included in CDH 5.4.0, but is not currently supported or recommended for production use. To try this feature, use it in a test environment until Cloudera resolves currently existing issues and limitations to make it ready for production use.

In and higher, Impala DDL statements such as CREATE DATABASE, CREATE TABLE, DROP DATABASE CASCADE, DROP TABLE, and ALTER TABLE [ADD|DROP] PARTITION can create or remove folders as needed in the Amazon S3 system. Prior to CDH 5.8 / Impala 2.6, you had to create folders yourself and point Impala database, tables, or partitions at them, and manually remove folders when no longer needed. See for details about reading and writing S3 data with Impala.

In and higher, the Impala DML statements (INSERT, LOAD DATA, and CREATE TABLE AS SELECT) can write data into a table or partition that resides in the Amazon Simple Storage Service (S3). The syntax of the DML statements is the same as for any other tables, because the S3 location for tables and partitions is specified by an s3a:// prefix in the LOCATION attribute of CREATE TABLE or ALTER TABLE statements. If you bring data into S3 using the normal S3 transfer mechanisms instead of Impala DML statements, issue a REFRESH statement for the table before using Impala to query the S3 data.
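The following sketch shows the general pattern; the bucket name and table definition are placeholders:

create table s3_events (event_id bigint, payload string)
  location 's3a://impala-demo-bucket/events/';
insert into s3_events values (1, 'first event');
-- If additional files arrive in the bucket through S3 tools rather than Impala DML:
refresh s3_events;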

The REFRESH and INVALIDATE METADATA statements also cache metadata for tables where the data resides in the Amazon Simple Storage Service (S3). In particular, issue a REFRESH for a table after adding or removing files in the associated S3 data directory. See for details about working with S3 tables.

In Impala 2.2.0 and higher, built-in functions that accept or return integers representing TIMESTAMP values use the BIGINT type for parameters and return values, rather than INT. This change lets the date and time functions avoid an overflow error that would otherwise occur on January 19th, 2038 (known as the Year 2038 problem or Y2K38 problem). This change affects the from_unixtime() and unix_timestamp() functions. You might need to change application code that interacts with these functions, change the types of columns that store the return values, or add CAST() calls to SQL statements that call these functions.

Impala automatically converts STRING literals of the correct format into TIMESTAMP values. Timestamp values are accepted in the format "yyyy-MM-dd HH:mm:ss.SSSSSS", and can consist of just the date, or just the time, with or without the fractional second portion. For example, you can specify TIMESTAMP values such as '1966-07-30', '08:30:00', or '1985-09-25 17:45:30.005'.

Casting an integer or floating-point value N to TIMESTAMP produces a value that is N seconds past the start of the epoch date (January 1, 1970). By default, the result value represents a date and time in the UTC time zone. If the setting -use_local_tz_for_unix_timestamp_conversions=true is in effect, the resulting TIMESTAMP represents a date and time in the local time zone.
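For example (the results assume the default UTC interpretation, that is, without the -use_local_tz_for_unix_timestamp_conversions setting):

select cast(0 as timestamp);          -- 1970-01-01 00:00:00
select cast(1000000000 as timestamp); -- 2001-09-09 01:46:40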

If these statements in your environment contain sensitive literal values such as credit card numbers or tax identifiers, Impala can redact this sensitive information when displaying the statements in log files and other administrative contexts. See for details.

The PARTITION clause is only allowed in combination with the INCREMENTAL clause. It is optional for COMPUTE INCREMENTAL STATS, and required for DROP INCREMENTAL STATS. Whenever you specify partitions through the PARTITION (partition_spec) clause in a COMPUTE INCREMENTAL STATS or DROP INCREMENTAL STATS statement, you must include all the partitioning columns in the specification, and specify constant values for all the partition key columns.
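For example, for a table partitioned by YEAR and MONTH (the SALES table name is a placeholder), both partition key columns must appear with constant values:

compute incremental stats sales partition (year=2016, month=12);
drop incremental stats sales partition (year=2016, month=12);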

In and higher, Impala UDFs and UDAs written in C++ are persisted in the metastore database. Java UDFs are also persisted, if they were created with the new CREATE FUNCTION syntax for Java UDFs, where the Java function argument and return types are omitted. Java-based UDFs created with the old CREATE FUNCTION syntax do not persist across restarts because they are held in the memory of the catalogd daemon. Until you re-create such Java UDFs using the new CREATE FUNCTION syntax, you must reload those Java-based UDFs by running the original CREATE FUNCTION statements again each time you restart the catalogd daemon. Prior to , the requirement to reload functions after a restart applied to both C++ and Java functions.

The Hive current_user() function cannot be called from a Java UDF through Impala.

If you are creating a partition for the first time and specifying its location, for maximum efficiency, use a single ALTER TABLE statement including both the ADD PARTITION and LOCATION clauses, rather than separate statements with ADD PARTITION and SET LOCATION clauses.
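For example, a single combined statement (the LOGS table and HDFS path are placeholders):

alter table logs add partition (year=2017, month=1)
  location '/user/impala/data/logs/2017/01';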

The INSERT statement has always left behind a hidden work directory inside the data directory of the table. Formerly, this hidden work directory was named .impala_insert_staging. In Impala 2.0.1 and later, this directory name is changed to _impala_insert_staging. (While HDFS tools are expected to treat names beginning with either an underscore or a dot as hidden, in practice names beginning with an underscore are more widely supported.) If you have any scripts, cleanup jobs, and so on that rely on the name of this work directory, adjust them to use the new name.

To see whether a table is internal or external, and its associated HDFS location, issue the statement DESCRIBE FORMATTED table_name. The Table Type field displays MANAGED_TABLE for internal tables and EXTERNAL_TABLE for external tables. The Location field displays the path of the table directory as an HDFS URI.

You can switch a table from internal to external, or from external to internal, by using the ALTER TABLE statement:

-- Switch a table from internal to external.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');

-- Switch a table from external to internal.
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='FALSE');

-- Find all customers whose first name starts with 'J', followed by 0 or more of any character.
select c_first_name, c_last_name from customer where c_first_name regexp '^J.*';
select c_first_name, c_last_name from customer where c_first_name rlike '^J.*';

-- Find 'Macdonald', where the first 'a' is optional and the 'D' can be upper- or lowercase.
-- The ^...$ are required, to match the start and end of the value.
select c_first_name, c_last_name from customer where c_last_name regexp '^Ma?c[Dd]onald$';
select c_first_name, c_last_name from customer where c_last_name rlike '^Ma?c[Dd]onald$';

-- Match multiple character sequences, either 'Mac' or 'Mc'.
select c_first_name, c_last_name from customer where c_last_name regexp '^(Mac|Mc)donald$';
select c_first_name, c_last_name from customer where c_last_name rlike '^(Mac|Mc)donald$';

-- Find names starting with 'S', then one or more vowels, then 'r', then any other characters.
-- Matches 'Searcy', 'Sorenson', 'Sauer'.
select c_first_name, c_last_name from customer where c_last_name regexp '^S[aeiou]+r.*$';
select c_first_name, c_last_name from customer where c_last_name rlike '^S[aeiou]+r.*$';

-- Find names that end with 2 or more vowels: letters from the set a,e,i,o,u.
select c_first_name, c_last_name from customer where c_last_name regexp '.*[aeiou]{2,}$';
select c_first_name, c_last_name from customer where c_last_name rlike '.*[aeiou]{2,}$';

-- You can use letter ranges in the [] blocks, for example to find names starting with A, B, or C.
select c_first_name, c_last_name from customer where c_last_name regexp '^[A-C].*';
select c_first_name, c_last_name from customer where c_last_name rlike '^[A-C].*';

-- If you are not sure about case, leading/trailing spaces, and so on, you can process the
-- column using string functions first.
select c_first_name, c_last_name from customer where lower(trim(c_last_name)) regexp '^de.*';
select c_first_name, c_last_name from customer where lower(trim(c_last_name)) rlike '^de.*';

In and higher, you can simplify queries that use many UPPER() and LOWER() calls to do case-insensitive comparisons, by using the ILIKE or IREGEXP operators instead. See and for details.

When authorization is enabled, the output of the SHOW statement is limited to those objects for which you have some privilege. There might be other databases, tables, and so on, but their names are concealed. If you believe an object exists but you cannot see it in the SHOW output, check with the system administrator whether you need to be granted a new privilege for that object. See for how to set up authorization and add privileges for specific kinds of objects.

Infinity and NaN can be specified in text data files as inf and nan respectively, and Impala interprets them as these special values. They can also be produced by certain arithmetic expressions; for example, 1/0 returns Infinity and pow(-1, 0.5) returns NaN. Or you can cast the literal values, such as CAST('inf' AS DOUBLE) or CAST('nan' AS DOUBLE).

In Impala 2.0 and later, user() returns the full Kerberos principal string, such as user@example.com, in a Kerberized environment.

  • Currently, each Impala GRANT or REVOKE statement can only grant or revoke a single privilege to or from a single role.

All data in CHAR and VARCHAR columns must be in a character encoding that is compatible with UTF-8. If you have binary data from another database system (that is, a BLOB type), use a STRING column to hold it.

The following example creates a series of views and then drops them. These examples illustrate how views are associated with a particular database, and both the view definitions and the view names for CREATE VIEW and DROP VIEW can refer to a view in the current database or a fully qualified view name.

-- Create and drop a view in the current database.
CREATE VIEW few_rows_from_t1 AS SELECT * FROM t1 LIMIT 10;
DROP VIEW few_rows_from_t1;

-- Create and drop a view referencing a table in a different database.
CREATE VIEW table_from_other_db AS SELECT x FROM db1.foo WHERE x IS NOT NULL;
DROP VIEW table_from_other_db;

USE db1;
-- Create a view in a different database.
CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
-- Switch into the other database and drop the view.
USE db2;
DROP VIEW v1;

USE db1;
-- Create a view in a different database.
CREATE VIEW db2.v1 AS SELECT * FROM db2.foo;
-- Drop a view in the other database.
DROP VIEW db2.v1;

For INSERT operations into CHAR or VARCHAR columns, you must cast all STRING literals or expressions returning STRING to a CHAR or VARCHAR type with the appropriate length.
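For example (the CHAR_DEMO table is a placeholder):

create table char_demo (c char(5), v varchar(20));
insert into char_demo values (cast('abc' as char(5)), cast('hello world' as varchar(20)));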

Correlated subqueries used in EXISTS and IN operators cannot include a LIMIT clause.

Currently, Avro tables cannot contain TIMESTAMP columns. If you need to store date and time values in Avro tables, as a workaround you can use a STRING representation of the values, convert the values to BIGINT with the UNIX_TIMESTAMP() function, or create separate numeric columns for individual date and time fields using the EXTRACT() function.
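For example, assuming an Avro table with the date and time held in a STRING column named EVENT_TIME (the table and column names are placeholders):

-- Convert the STRING representation at query time.
select cast(event_time as timestamp) as ts, detail from avro_events;
-- Or store the value as seconds since the epoch in a BIGINT column.
select unix_timestamp('2017-03-15 08:30:00');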

Zero-length strings: For purposes of clauses such as DISTINCT and GROUP BY, Impala considers zero-length strings (""), NULL, and space to all be different values.

When the spill-to-disk feature is activated for a join node within a query, Impala does not produce any runtime filters for that join operation on that host. Other join nodes within the query are not affected.

create table yy (s string) partitioned by (year int) stored as parquet;
insert into yy partition (year) values ('1999', 1999), ('2000', 2000), ('2001', 2001), ('2010', 2010);
compute stats yy;

create table yy2 (s string) partitioned by (year int) stored as parquet;
insert into yy2 partition (year) values ('1999', 1999), ('2000', 2000), ('2001', 2001);
compute stats yy2;

-- The query reads an unknown number of partitions, whose key values are only
-- known at run time. The 'runtime filters' lines show how the information about
-- the partitions is calculated in query fragment 02, and then used in query
-- fragment 00 to decide which partitions to skip.
explain select s from yy2 where year in (select year from yy where year between 2000 and 2005);
+----------------------------------------------------------+
| Explain String                                            |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=16.00MB VCores=2  |
|                                                           |
| 04:EXCHANGE [UNPARTITIONED]                               |
| |                                                         |
| 02:HASH JOIN [LEFT SEMI JOIN, BROADCAST]                  |
| |  hash predicates: year = year                           |
| |  runtime filters: RF000 <- year                         |
| |                                                         |
| |--03:EXCHANGE [BROADCAST]                                |
| |  |                                                      |
| |  01:SCAN HDFS [dpp.yy]                                  |
| |     partitions=2/4 files=2 size=468B                    |
| |                                                         |
| 00:SCAN HDFS [dpp.yy2]                                    |
|    partitions=2/3 files=2 size=468B                       |
|    runtime filters: RF000 -> year                         |
+----------------------------------------------------------+

By default, intermediate files used during large sort, join, aggregation, or analytic function operations are stored in the directory /tmp/impala-scratch. These files are removed when the operation finishes. (Multiple concurrent queries can perform operations that use the spill-to-disk technique, without any name conflicts for these temporary files.) You can specify a different location by starting the impalad daemon with the --scratch_dirs="path_to_directory" configuration option or the equivalent configuration option in the Cloudera Manager user interface. You can specify a single directory, or a comma-separated list of directories. The scratch directories must be on the local filesystem, not in HDFS. You might specify different directory paths for different hosts, depending on the capacity and speed of the available storage devices. In CDH 5.5 / Impala 2.3 or higher, Impala successfully starts (with a warning written to the log) if it cannot create or read and write files in one of the scratch directories. If there is less than 1 GB free on the filesystem where that directory resides, Impala still runs, but writes a warning message to its log. If Impala encounters an error reading or writing files in a scratch directory during a query, Impala logs the error and the query fails.

An ORDER BY clause without an additional LIMIT clause is ignored in any view definition. If you need to sort the entire result set from a view, use an ORDER BY clause in the SELECT statement that queries the view. You can still make a simple top 10 report by combining the ORDER BY and LIMIT clauses in the same view definition:

[localhost:21000] > create table unsorted (x bigint);
[localhost:21000] > insert into unsorted values (1), (9), (3), (7), (5), (8), (4), (6), (2);
[localhost:21000] > create view sorted_view as select x from unsorted order by x;
[localhost:21000] > select x from sorted_view; -- ORDER BY clause in view has no effect.
+---+
| x |
+---+
| 1 |
| 9 |
| 3 |
| 7 |
| 5 |
| 8 |
| 4 |
| 6 |
| 2 |
+---+
[localhost:21000] > select x from sorted_view order by x; -- View query requires ORDER BY at outermost level.
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
| 8 |
| 9 |
+---+
[localhost:21000] > create view top_3_view as select x from unsorted order by x limit 3;
[localhost:21000] > select x from top_3_view; -- ORDER BY and LIMIT together in view definition are preserved.
+---+
| x |
+---+
| 1 |
| 2 |
| 3 |
+---+

The following examples demonstrate how to check the precision and scale of numeric literals or other numeric expressions. Impala represents numeric literals in the smallest appropriate type. 5 is a TINYINT value, which ranges from -128 to 127, therefore 3 decimal digits are needed to represent the entire range, and because it is an integer value there are no fractional digits. 1.333 is interpreted as a DECIMAL value, with 4 digits total and 3 digits after the decimal point.

[localhost:21000] > select precision(5), scale(5);
+--------------+----------+
| precision(5) | scale(5) |
+--------------+----------+
| 3            | 0        |
+--------------+----------+
[localhost:21000] > select precision(1.333), scale(1.333);
+------------------+--------------+
| precision(1.333) | scale(1.333) |
+------------------+--------------+
| 4                | 3            |
+------------------+--------------+
[localhost:21000] > with t1 as
  ( select cast(12.34 as decimal(20,2)) x union select cast(1 as decimal(8,6)) x )
  select precision(x), scale(x) from t1 limit 1;
+--------------+----------+
| precision(x) | scale(x) |
+--------------+----------+
| 24           | 6        |
+--------------+----------+

Type: Boolean; recognized values are 1 and 0, or true and false; any other value interpreted as false

Type: string

Type: integer

Default: false

Default: false (shown as 0 in output of SET statement)

Default: true (shown as 1 in output of SET statement)

Currently, the return value is always a STRING. The return type is subject to change in future releases. Always use CAST() to convert the result to whichever data type is appropriate for your computations.

Return type: DOUBLE in Impala 2.0 and higher; STRING in earlier releases

Usage notes: Primarily for compatibility with code containing industry extensions to SQL.

Return type: BOOLEAN

Return type: DOUBLE

Return type: Same as the input value

Return type: Same as the input value, except for CHAR and VARCHAR arguments which produce a STRING result

Impala includes another predefined database, _impala_builtins, that serves as the location for the built-in functions. To see the built-in functions, use a statement like the following:

show functions in _impala_builtins;
show functions in _impala_builtins like '*substring*';

Due to the way arithmetic on FLOAT and DOUBLE columns uses high-performance hardware instructions, and distributed queries can perform these operations in different order for each query, results can vary slightly for aggregate function calls such as SUM() and AVG() for FLOAT and DOUBLE columns, particularly on large data sets where millions or billions of values are summed or averaged. For perfect consistency and repeatability, use the DECIMAL data type for such operations instead of FLOAT or DOUBLE.

The inability to exactly represent certain floating-point values means that DECIMAL is sometimes a better choice than DOUBLE or FLOAT when precision is critical, particularly when transferring data from other database systems that use different representations or file formats.

Currently, the COMPUTE STATS statement under CDH 4 does not store any statistics for DECIMAL columns. When Impala runs under CDH 5, which has better support for DECIMAL in the metastore database, COMPUTE STATS does collect statistics for DECIMAL columns and Impala uses the statistics to optimize query performance.

If you run the Hive statement ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS, Impala can only use the resulting column statistics if the table is unpartitioned. Impala cannot use Hive-generated column statistics for a partitioned table.

unix_timestamp() and from_unixtime() are often used in combination to convert a TIMESTAMP value into a particular string format. For example:

select from_unixtime(unix_timestamp(now() + interval 3 days), 'yyyy/MM/dd HH:mm') as yyyy_mm_dd_hh_mm;
+------------------+
| yyyy_mm_dd_hh_mm |
+------------------+
| 2016/06/03 11:38 |
+------------------+

Sorting considerations: Although you can specify an ORDER BY clause in an INSERT ... SELECT statement, any ORDER BY clause is ignored and the results are not necessarily sorted. An INSERT ... SELECT operation potentially creates many different data files, prepared on different data nodes, and therefore the notion of the data being stored in sorted order is impractical.

Prior to Impala 1.4.0, it was not possible to use the CREATE TABLE LIKE view_name syntax. In Impala 1.4.0 and higher, you can create a table with the same column definitions as a view using the CREATE TABLE LIKE technique. Although CREATE TABLE LIKE normally inherits the file format of the original table, a view has no underlying file format, so CREATE TABLE LIKE view_name produces a text table by default. To specify a different file format, include a STORED AS file_format clause at the end of the CREATE TABLE LIKE statement.
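For example, a brief sketch (the T1 table, its columns, and the view and table names are placeholders):

create view v1 as select c1, c2 from t1 where c2 is not null;
-- Produces a text table by default.
create table t1_subset like v1;
-- Specify the file format explicitly.
create table t1_subset_parquet like v1 stored as parquet;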

Prior to Impala 1.4.0, COMPUTE STATS counted the number of NULL values in each column and recorded that figure in the metastore database. Because Impala does not currently use the NULL count during query planning, Impala 1.4.0 and higher speeds up the COMPUTE STATS statement by skipping this NULL counting.

The regular expression must match the entire value, not just occur somewhere inside it. Use .* at the beginning, the end, or both if you only need to match characters anywhere in the middle. Thus, the ^ and $ atoms are often redundant, although you might already have them in your expression strings that you reuse from elsewhere.

In Impala 1.3.1 and higher, the REGEXP and RLIKE operators now match a regular expression string that occurs anywhere inside the target string, the same as if the regular expression was enclosed on each side by .*. See for examples. Previously, these operators only succeeded when the regular expression matched the entire target string. This change improves compatibility with the regular expression support for popular database systems. There is no change to the behavior of the regexp_extract() and regexp_replace() built-in functions.

By default, if an INSERT statement creates any new subdirectories underneath a partitioned table, those subdirectories are assigned default HDFS permissions for the impala user. To make each subdirectory have the same permissions as its parent directory in HDFS, specify the --insert_inherit_permissions startup option for the impalad daemon.

By default, Impala only allows a single COUNT(DISTINCT columns) expression in each query.

If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by specifying NDV(column); a query can contain multiple instances of NDV(column). To make Impala automatically rewrite COUNT(DISTINCT) expressions to NDV(), enable the APPX_COUNT_DISTINCT query option.
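For example (the SALES table and its columns are placeholders):

-- Estimate the number of distinct values directly.
select ndv(customer_id) from sales;
-- Or let Impala rewrite COUNT(DISTINCT) to NDV() automatically,
-- which also allows multiple such expressions in one query.
set appx_count_distinct=1;
select count(distinct customer_id), count(distinct product_id) from sales;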

To produce the same result as multiple COUNT(DISTINCT) expressions, you can use the following technique for queries involving a single table:

select v1.c1 result1, v2.c1 result2
from
  (select count(distinct col1) as c1 from t1) v1
  cross join
  (select count(distinct col2) as c1 from t1) v2;

Because CROSS JOIN is an expensive operation, prefer to use the NDV() technique wherever practical.

Prefer UNION ALL over UNION when you know the data sets are disjoint or duplicate values are not a problem; UNION ALL is more efficient because it avoids materializing and sorting the entire result set to eliminate duplicate values.

The CREATE TABLE clauses FIELDS TERMINATED BY, ESCAPED BY, and LINES TERMINATED BY have special rules for the string literal used for their argument, because they all require a single character. You can use a regular character surrounded by single or double quotation marks, an octal sequence such as '\054' (representing a comma), or an integer in the range '-127'..'128' (with quotation marks but no backslash), which is interpreted as a single-byte ASCII character. Negative values are subtracted from 256; for example, FIELDS TERMINATED BY '-2' sets the field delimiter to ASCII code 254, the Icelandic Thorn character used as a delimiter by some data formats.
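For example (the table names are placeholders):

-- Comma delimiter specified as an octal sequence.
create table csv_data (c1 string, c2 string)
  row format delimited fields terminated by '\054';
-- ASCII code 254 (the Icelandic Thorn character) specified as a negative integer.
create table thorn_data (c1 string, c2 string)
  row format delimited fields terminated by '-2';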

Sqoop considerations:

If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
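For example, assuming a Sqoop-imported Parquet table with a BIGINT column named CREATED_AT_MS holding milliseconds (the table and column names are placeholders):

select cast(created_at_ms / 1000 as timestamp) from sqoop_imported_parquet;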

Command-line equivalent:

Complex type considerations:

Because complex types are often used in combination, for example an ARRAY of STRUCT elements, if you are unfamiliar with the Impala complex types, start with for background information and usage examples.

  • Columns with this data type can only be used in tables or partitions with the Parquet file format.

  • Columns with this data type cannot be used as partition key columns in a partitioned table.

  • The COMPUTE STATS statement does not produce any statistics for columns of this data type.

  • The maximum length of the column definition for any complex type, including declarations for any nested types, is 4000 characters.

  • See for a full list of limitations and associated guidelines about complex type columns.

Partitioned tables can contain complex type columns. All the partition key columns must be scalar types.

You can pass a multi-part qualified name to DESCRIBE to specify an ARRAY, STRUCT, or MAP column and visualize its structure as if it were a table. For example, if table T1 contains an ARRAY column A1, you could issue the statement DESCRIBE t1.a1. If table T1 contained a STRUCT column S1, and a field F1 within the STRUCT was a MAP, you could issue the statement DESCRIBE t1.s1.f1. An ARRAY is shown as a two-column table, with ITEM and POS columns. A STRUCT is shown as a table with each field representing a column in the table. A MAP is shown as a two-column table, with KEY and VALUE columns.

Many of the complex type examples refer to tables such as CUSTOMER and REGION adapted from the tables used in the TPC-H benchmark. See for the table definitions.

Complex type considerations: Although you can create tables in this file format using the complex types (ARRAY, STRUCT, and MAP) available in and higher, currently, Impala can query these types only in Parquet tables. The one exception to the preceding rule is COUNT(*) queries on RCFile tables that include complex types. Such queries are allowed in and higher.

You cannot refer to a column with a complex data type (ARRAY, STRUCT, or MAP) directly in an operator. You can apply operators only to scalar values that make up a complex type (the fields of a STRUCT, the items of an ARRAY, or the key or value portion of a MAP) as part of a join query that refers to the scalar value using the appropriate dot notation or the ITEM, KEY, or VALUE pseudocolumn names.

Currently, Impala UDFs cannot accept arguments or return values of the Impala complex types (STRUCT, ARRAY, or MAP).

Impala currently cannot write new data files containing complex type columns. Therefore, although the SELECT statement works for queries involving complex type columns, you cannot use a statement form that writes data to complex type columns, such as CREATE TABLE AS SELECT or INSERT ... SELECT. To create data files containing complex type data, use the Hive INSERT statement, or another ETL mechanism such as MapReduce jobs, Spark jobs, Pig, and so on.

For tables containing complex type columns (ARRAY, STRUCT, or MAP), you typically use join queries to refer to the complex values. You can use views to hide the join notation, making such tables seem like traditional denormalized tables, and making those tables queryable by business intelligence tools that do not have built-in support for those complex types. See for details.
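For example, a brief sketch based on the REGION table used elsewhere in these topics; the view name is a placeholder:

create view region_nations as
  select r_name, r_nations.item.n_name as nation_name
  from region, region.r_nations as r_nations;
-- The flattened view can now be queried with no special join notation.
select r_name, nation_name from region_nations where r_name = 'EUROPE';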

Because you cannot directly issue SELECT col_name against a column of complex type, you cannot use a view or a WITH clause to rename a column by selecting it with a column alias.

The Impala complex types (STRUCT, ARRAY, or MAP) are available in and higher. To use these types with JDBC requires version 2.5.28 or higher of the Cloudera JDBC Connector for Impala. To use these types with ODBC requires version 2.5.30 or higher of the Cloudera ODBC Connector for Impala. Consider upgrading all JDBC and ODBC drivers at the same time you upgrade to CDH 5.5 or higher.

Although the result sets from queries involving complex types consist of all scalar values, the queries involve join notation and column references that might not be understood by a particular JDBC or ODBC connector. Consider defining a view that represents the flattened version of a table containing complex type columns, and pointing the JDBC or ODBC application at the view. See for details.

To access a column with a complex type (ARRAY, STRUCT, or MAP) in an aggregation function, you unpack the individual elements using join notation in the query, and then apply the function to the final scalar item, field, key, or value at the bottom of any nested type hierarchy in the column. See for details about using complex types in Impala.

The following example demonstrates calls to several aggregation functions using values from a column containing nested complex types (an ARRAY of STRUCT items). The array is unpacked inside the query using join notation. The array elements are referenced using the ITEM pseudocolumn, and the structure fields inside the array elements are referenced using dot notation. Numeric values such as SUM() and AVG() are computed using the numeric R_NATIONKEY field, and the general-purpose MAX() and MIN() values are computed from the string N_NAME field.

describe region;
+-------------+-------------------------+---------+
| name        | type                    | comment |
+-------------+-------------------------+---------+
| r_regionkey | smallint                |         |
| r_name      | string                  |         |
| r_comment   | string                  |         |
| r_nations   | array<struct<           |         |
|             |   n_nationkey:smallint, |         |
|             |   n_name:string,        |         |
|             |   n_comment:string      |         |
|             | >>                      |         |
+-------------+-------------------------+---------+

select r_name, r_nations.item.n_nationkey
from region, region.r_nations as r_nations
order by r_name, r_nations.item.n_nationkey;
+-------------+------------------+
| r_name      | item.n_nationkey |
+-------------+------------------+
| AFRICA      | 0                |
| AFRICA      | 5                |
| AFRICA      | 14               |
| AFRICA      | 15               |
| AFRICA      | 16               |
| AMERICA     | 1                |
| AMERICA     | 2                |
| AMERICA     | 3                |
| AMERICA     | 17               |
| AMERICA     | 24               |
| ASIA        | 8                |
| ASIA        | 9                |
| ASIA        | 12               |
| ASIA        | 18               |
| ASIA        | 21               |
| EUROPE      | 6                |
| EUROPE      | 7                |
| EUROPE      | 19               |
| EUROPE      | 22               |
| EUROPE      | 23               |
| MIDDLE EAST | 4                |
| MIDDLE EAST | 10               |
| MIDDLE EAST | 11               |
| MIDDLE EAST | 13               |
| MIDDLE EAST | 20               |
+-------------+------------------+

select
  r_name,
  count(r_nations.item.n_nationkey) as count,
  sum(r_nations.item.n_nationkey) as sum,
  avg(r_nations.item.n_nationkey) as avg,
  min(r_nations.item.n_name) as minimum,
  max(r_nations.item.n_name) as maximum,
  ndv(r_nations.item.n_nationkey) as distinct_vals
from region, region.r_nations as r_nations
group by r_name
order by r_name;
+-------------+-------+-----+------+-----------+----------------+---------------+
| r_name      | count | sum | avg  | minimum   | maximum        | distinct_vals |
+-------------+-------+-----+------+-----------+----------------+---------------+
| AFRICA      | 5     | 50  | 10   | ALGERIA   | MOZAMBIQUE     | 5             |
| AMERICA     | 5     | 47  | 9.4  | ARGENTINA | UNITED STATES  | 5             |
| ASIA        | 5     | 68  | 13.6 | CHINA     | VIETNAM        | 5             |
| EUROPE      | 5     | 77  | 15.4 | FRANCE    | UNITED KINGDOM | 5             |
| MIDDLE EAST | 5     | 58  | 11.6 | EGYPT     | SAUDI ARABIA   | 5             |
+-------------+-------+-----+------+-----------+----------------+---------------+

Hive considerations:

HDFS permissions:

HDFS permissions: This statement does not touch any HDFS files or directories, therefore no HDFS permissions are required.

Security considerations:

Performance considerations:

Casting and conversions:

Restrictions:

Restrictions: In Impala 2.0 and higher, this function can be used as an analytic function, but with restrictions on any window clause. For MAX() and MIN(), the window clause is only allowed if the start bound is UNBOUNDED PRECEDING.

Restrictions: This function cannot be used as an analytic function; it does not currently support the OVER() clause.

Compatibility:

NULL considerations:

UDF considerations:

UDF considerations: This type cannot be used for the argument or return type of a user-defined function (UDF) or user-defined aggregate function (UDA).

Considerations for views:

NULL considerations: Casting any non-numeric value to this type produces a NULL value.

NULL considerations: Casting any unrecognized STRING value to this type produces a NULL value.

NULL considerations: An expression of this type produces a NULL value if any argument of the expression is NULL.

Required privileges:

Parquet considerations:

To examine the internal structure and data of Parquet files, you can use the parquet-tools command that comes with CDH. Make sure this command is in your $PATH. (Typically, it is symlinked from /usr/bin; sometimes, depending on your installation setup, you might need to locate it under a CDH-specific bin directory.) The arguments to this command let you perform operations such as:

  • cat: Print a file's contents to standard out. In CDH 5.5 and higher, you can use the -j option to output JSON.
  • head: Print the first few records of a file to standard output.
  • schema: Print the Parquet schema for the file.
  • meta: Print the file footer metadata, including key-value properties (like Avro schema), compression ratios, encodings, compression used, and row group information.
  • dump: Print all data and metadata.
Use parquet-tools -h to see usage information for all the arguments. Here are some examples showing parquet-tools usage:
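
For example (a hedged sketch; the HDFS path and file name here are hypothetical placeholders, so substitute a Parquet data file from one of your own tables):

# Copy a hypothetical Parquet data file out of HDFS, then inspect it locally.
hdfs dfs -get /user/hive/warehouse/sample_table/file_0.parq /tmp/file_0.parq
parquet-tools schema /tmp/file_0.parq
parquet-tools head -n 5 /tmp/file_0.parq
parquet-tools meta /tmp/file_0.parq
parquet-tools cat -j /tmp/file_0.parq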

Parquet considerations: This type is fully compatible with Parquet tables.

This function cannot be used in an analytic context. That is, the OVER() clause is not allowed at all with this function.

In queries involving both analytic functions and partitioned tables, partition pruning only occurs for columns named in the PARTITION BY clause of the analytic function call. For example, if an analytic function query has a clause such as WHERE year=2016, the way to make the query prune all other YEAR partitions is to include PARTITION BY year in the analytic function call; for example, OVER (PARTITION BY year, other_columns other_analytic_clauses).
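
For illustration, a minimal sketch assuming a hypothetical sales_data table partitioned by YEAR; because year appears in the PARTITION BY clause, the WHERE year = 2016 predicate can prune all other YEAR partitions:

-- sales_data, customer_id, and amount are hypothetical names.
select customer_id, amount,
       rank() over (partition by year, customer_id order by amount desc) as rnk
from sales_data
where year = 2016;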

Impala can query Parquet files that use the PLAIN, PLAIN_DICTIONARY, BIT_PACKED, and RLE encodings. Currently, Impala does not support RLE_DICTIONARY encoding. When creating files outside of Impala for use by Impala, make sure to use one of the supported encodings. In particular, for MapReduce jobs that write Parquet files, do not define parquet.writer.version in the job configuration (especially not as PARQUET_2_0); use the default version (or format). The default format, 1.0, includes some enhancements that are compatible with older versions. Data using the 2.0 format might not be consumable by Impala, due to use of the RLE_DICTIONARY encoding.

Currently, Impala always decodes the column data in Parquet files based on the ordinal position of the columns, not by looking up the position of each column based on its name. Parquet files produced outside of Impala must write column data in the same order as the columns are declared in the Impala table. Any optional columns that are omitted from the data files must be the rightmost columns in the Impala table definition.

If you created compressed Parquet files through some tool other than Impala, make sure that any compression codecs are supported in Parquet by Impala. For example, Impala does not currently support LZO compression in Parquet files. Also double-check that you used any recommended compatibility settings in the other tool, such as spark.sql.parquet.binaryAsString when writing Parquet files through Spark.

Text table considerations:

Text table considerations: Values of this type are potentially larger in text tables than in tables using Parquet or other binary formats.

Schema evolution considerations:

Column statistics considerations:

Column statistics considerations: Because this type has a fixed size, the maximum and average size fields are always filled in for column statistics, even before you run the COMPUTE STATS statement.

Column statistics considerations: Because the values of this type have variable size, none of the column statistics fields are filled in until you run the COMPUTE STATS statement.

Usage notes:

Examples:

Result set:

JDBC and ODBC considerations:

Cancellation: Cannot be cancelled.

Cancellation: Can be cancelled. To cancel this statement, use Ctrl-C from the impala-shell interpreter, the Cancel button from the Watch page in Hue, or Cancel from the list of in-flight queries (for a particular node) on the Queries tab in the Impala web UI (port 25000).

Cancellation: Certain multi-stage statements (CREATE TABLE AS SELECT and COMPUTE STATS) can be cancelled during some stages, when running INSERT or SELECT operations internally. To cancel this statement, use Ctrl-C from the impala-shell interpreter, the Cancel button from the Watch page in Hue, or Cancel from the list of in-flight queries (for a particular node) on the Queries tab in the Impala web UI (port 25000).

Partitioning:

Partitioning: Prefer to use this type for a partition key column. Impala can process the numeric type more efficiently than a STRING representation of the value.

Partitioning: This type can be used for partition key columns. Because of the efficiency advantage of numeric values over character-based values, if the partition key is a string representation of a number, prefer to use an integer type with sufficient range (INT, BIGINT, and so on) where practical.

Partitioning: Because this type has so few distinct values, it is typically not a sensible choice for a partition key column.

Partitioning: Because fractional values of this type are not always represented precisely, when this type is used for a partition key column, the underlying HDFS directories might not be named exactly as you expect. Prefer to partition on a DECIMAL column instead.

Partitioning: Because this type potentially has so many distinct values, it is often not a sensible choice for a partition key column. For example, events 1 millisecond apart would be stored in different partitions. Consider using the TRUNC() function to condense the number of distinct values, and partition on a new column with the truncated values.
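
As a hedged sketch (all table and column names here are hypothetical), you might derive an hour-level partition key with TRUNC() instead of partitioning on the raw TIMESTAMP column:

-- Partition on the hour, not on the exact event timestamp.
create table events_by_hour (event_id bigint, event_ts timestamp)
  partitioned by (event_hour string) stored as parquet;

insert into events_by_hour partition (event_hour)
  select event_id, event_ts, cast(trunc(event_ts, 'HH') as string) from raw_events;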

HDFS considerations:

File format considerations:

Amazon S3 considerations:

Isilon considerations:

Because the EMC Isilon storage devices use a global value for the block size rather than a configurable value for each file, the PARQUET_FILE_SIZE query option has no effect when Impala inserts data into a table or partition residing on Isilon storage. Use the isi command to set the default block size globally on the Isilon device. For example, to set the Isilon default block size to 256 MB, the recommended size for Parquet data files for Impala, issue the following command:

isi hdfs settings modify --default-block-size=256MB

HBase considerations:

The LOAD DATA statement cannot be used with HBase tables.

HBase considerations: This data type is fully compatible with HBase tables.

HBase considerations: This data type cannot be used with HBase tables.

Internal details:

Internal details: Represented in memory as a 1-byte value.

Internal details: Represented in memory as a 2-byte value.

Internal details: Represented in memory as a 4-byte value.

Internal details: Represented in memory as an 8-byte value.

Internal details: Represented in memory as a 16-byte value.

Internal details: Represented in memory as a byte array with the same size as the length specification. Values that are shorter than the specified length are padded on the right with trailing spaces.

Internal details: Represented in memory as a byte array with the minimum size needed to represent each value.

Added in: CDH 5.9.0 (Impala 2.7.0)

Added in: CDH 5.8.0 (Impala 2.6.0)

Added in: CDH 5.7.0 (Impala 2.5.0)

Added in: CDH 5.5.0 (Impala 2.3.0)

Added in: CDH 5.2.0 (Impala 2.0.0)

Added in: Available in earlier Impala releases, but new capabilities were added in CDH 5.2.0 / Impala 2.0.0

Added in: Available in all versions of Impala.

Added in: Impala 1.4.0

Added in: Impala 1.3.0

Added in: Impala 1.1

Added in: Impala 1.1.1

Added in: CDH 5.3.0 (Impala 2.1.0)

Added in: CDH 5.4.0 (Impala 2.2.0)

Syntax:

For other tips about managing and reclaiming Impala disk space, see .

Impala supports a wide variety of JOIN clauses. Left, right, semi, full, and outer joins are supported in all Impala versions. The CROSS JOIN operator is available in Impala 1.2.2 and higher. During performance tuning, you can override the reordering of join clauses that Impala performs internally by including the keyword STRAIGHT_JOIN immediately after the SELECT keyword.
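
For example, a minimal sketch using hypothetical customers and orders tables, forcing the joins to be performed in the order they are written:

-- STRAIGHT_JOIN goes immediately after SELECT and disables join reordering.
select straight_join c.name, sum(o.total) as total_spent
from customers c join orders o on c.id = o.customer_id
group by c.name;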

In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive, especially during Impala startup. See for details.
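
For example, after creating a table through Hive, you could make just that one table visible to Impala (the database and table names here are hypothetical):

invalidate metadata new_db.web_logs;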

Read the EXPLAIN plan from bottom to top:

  • The last part of the plan shows the low-level details such as the expected amount of data that will be read, where you can judge the effectiveness of your partitioning strategy and estimate how long it will take to scan a table based on total data size and the size of the cluster.
  • As you work your way up, next you see the operations that will be parallelized and performed on each Impala node.
  • At the higher levels, you see how data flows when intermediate result sets are combined and transmitted from one node to another.
  • See for details about the EXPLAIN_LEVEL query option, which lets you customize how much detail to show in the EXPLAIN plan depending on whether you are doing high-level or low-level tuning, dealing with logical or physical aspects of the query.

Aggregate functions are a special category with different rules. These functions calculate a return value across all the items in a result set, so they require a FROM clause in the query:

select count(product_id) from product_catalog;
select max(height), avg(height) from census_data where age > 20;

Aggregate functions also ignore NULL values rather than returning a NULL result. For example, if some rows have NULL for a particular column, those rows are ignored when computing the AVG() for that column. Likewise, specifying COUNT(col_name) in a query counts only those rows where col_name contains a non-NULL value.
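
As a minimal sketch, assuming a hypothetical table t1 in which some values of column c1 are NULL:

-- count(*) counts every row; count(c1) and avg(c1) consider only non-NULL c1 values.
select count(*) as all_rows, count(c1) as non_null_c1, avg(c1) as avg_c1 from t1;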

Aliases follow the same rules as identifiers when it comes to case insensitivity. Aliases can be longer than identifiers (up to the maximum length of a Java string) and can include additional characters such as spaces and dashes when they are quoted using backtick characters.
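
For example (hypothetical table and column names), quoted aliases can contain spaces and dashes:

select c1 as `total value`, c2 as `unit-price` from t1;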

Another way to define different names for the same tables or columns is to create views. See for details.

When inserting into partitioned tables, especially using the Parquet file format, you can include a hint in the INSERT statement to fine-tune the overall performance of the operation and its resource usage:

  • These hints are available in Impala 1.2.2 and higher.
  • You would only use these hints if an INSERT into a partitioned Parquet table was failing due to capacity limits, or if such an INSERT was succeeding but with less-than-optimal performance.
  • To use these hints, put the hint keyword [SHUFFLE] or [NOSHUFFLE] (including the square brackets) after the PARTITION clause, immediately before the SELECT keyword, as shown in the example following this list.
  • [SHUFFLE] selects an execution plan that minimizes the number of files being written simultaneously to HDFS, and the number of memory buffers holding data for individual partitions. Thus it reduces overall resource usage for the INSERT operation, allowing some INSERT operations to succeed that otherwise would fail. It does involve some data transfer between the nodes so that the data files for a particular partition are all constructed on the same node.
  • [NOSHUFFLE] selects an execution plan that might be faster overall, but might also produce a larger number of small data files or exceed capacity limits, causing the INSERT operation to fail. Use [SHUFFLE] in cases where an INSERT statement fails or runs inefficiently due to all nodes attempting to construct data for all partitions.
  • Impala automatically uses the [SHUFFLE] method if any partition key column in the source table, mentioned in the INSERT ... SELECT query, does not have column statistics. In this case, only the [NOSHUFFLE] hint would have any effect.
  • If column statistics are available for all partition key columns in the source table mentioned in the INSERT ... SELECT query, Impala chooses whether to use the [SHUFFLE] or [NOSHUFFLE] technique based on the estimated number of distinct values in those columns and the number of nodes involved in the INSERT operation. In this case, you might need the [SHUFFLE] or the [NOSHUFFLE] hint to override the execution plan selected by Impala.
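
A minimal sketch of the hint placement, assuming a hypothetical partitioned Parquet table populated from a staging table:

-- The hint goes after the PARTITION clause and immediately before SELECT.
insert into sales_parquet partition (year, month) [SHUFFLE]
  select id, amount, year, month from sales_staging;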

Any INSERT statement for a Parquet table requires enough free space in the HDFS filesystem to write one block. Because Parquet data files use a block size of 1 GB by default, an INSERT might fail (even for a very small amount of data) if your HDFS is running low on space.

After adding or replacing data in a table used in performance-critical queries, issue a COMPUTE STATS statement to make sure all statistics are up-to-date. Consider updating statistics for a table after any INSERT, LOAD DATA, or CREATE TABLE AS SELECT statement in Impala, or after loading data through Hive and doing a REFRESH table_name in Impala. This technique is especially important for tables that are very large, used in join queries, or both.
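
For example (hypothetical table name), after a load operation performed through Hive:

refresh sales_fact;
compute stats sales_fact;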

Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.
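
A brief sketch with a hypothetical employees table:

-- Within a single row:
select concat(first_name, ' ', last_name) as full_name from employees;
-- Across rows within each group:
select dept, group_concat(last_name, ', ') as dept_members from employees group by dept;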

In Impala 1.2.1 and higher, all NULL values come at the end of the result set for ORDER BY ... ASC queries, and at the beginning of the result set for ORDER BY ... DESC queries. In effect, NULL is considered greater than all other values for sorting purposes. The original Impala behavior always put NULL values at the end, even for ORDER BY ... DESC queries. The new behavior in Impala 1.2.1 makes Impala more compatible with other popular database systems. In Impala 1.2.1 and higher, you can override or specify the sorting behavior for NULL by adding the clause NULLS FIRST or NULLS LAST at the end of the ORDER BY clause.
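
For example, with a hypothetical table t1, the following query keeps NULL values of c1 at the end even though the sort order is descending:

select c1 from t1 order by c1 desc nulls last;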

Return type: same as the initial argument value, except that integer values are promoted to BIGINT and floating-point values are promoted to DOUBLE; use CAST() when inserting into a smaller numeric column

Statement type: DDL

Statement type: DML (but still affected by SYNC_DDL query option)

If you connect to different Impala nodes within an impala-shell session for load-balancing purposes, you can enable the SYNC_DDL query option to make each DDL statement wait before returning, until the new or changed metadata has been received by all the Impala nodes. See for details.

The Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Boost library. For details, see the Boost documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on. It does not support .*? for non-greedy matches.

In Impala 2.0 and later, the Impala regular expression syntax conforms to the POSIX Extended Regular Expression syntax used by the Google RE2 library. For details, see the RE2 documentation. It has most idioms familiar from regular expressions in Perl, Python, and so on, including .*? for non-greedy matches.

In Impala 2.0 and later, a change in the underlying regular expression library could cause changes in the way regular expressions are interpreted by this function. Test any queries that use regular expressions and adjust the expression patterns if necessary. See for details.

Because the impala-shell interpreter uses the \ character for escaping, use \\ to represent the regular expression escape character in any regular expressions that you submit through impala-shell . You might prefer to use the equivalent character class names, such as [[:digit:]] instead of \d which you would have to escape as \\d.
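
For example, either of the following calls submitted through impala-shell matches the digits in a literal test string; the second doubles the backslash so that \d reaches the regular expression engine intact:

select regexp_extract('abc123xyz', '[[:digit:]]+', 0);
select regexp_extract('abc123xyz', '\\d+', 0);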

The SET statement has no effect until the impala-shell interpreter is connected to an Impala server. Once you are connected, any query options you set remain in effect as you issue a subsequent CONNECT command to connect to a different Impala host.

Prior to Impala 1.4.0, Impala required any query including an ORDER BY clause to also use a LIMIT clause. In Impala 1.4.0 and higher, the LIMIT clause is optional for ORDER BY queries. In cases where sorting a huge result set requires enough memory to exceed the Impala memory limit for a particular node, Impala automatically uses a temporary disk work area to perform the sort operation.

In Impala 1.2.1 and higher, you can combine a LIMIT clause with an OFFSET clause to produce a small result set that is different from a top-N query, for example, to return items 11 through 20. This technique can be used to simulate paged results. Because Impala queries typically involve substantial amounts of I/O, use this technique only for compatibility in cases where you cannot rewrite the application logic. For best performance and scalability, wherever practical, query as many items as you expect to need, cache them on the application side, and display small groups of results to users using application logic.
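
A minimal paging sketch with a hypothetical catalog table, returning items 11 through 20 of a consistently ordered result set:

select id, name from catalog order by id limit 10 offset 10;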

In and higher, the optional WITH REPLICATION clause for CREATE TABLE and ALTER TABLE lets you specify a replication factor, the number of hosts on which to cache the same data blocks. When Impala processes a cached data block, where the cache replication factor is greater than 1, Impala randomly selects a host that has a cached copy of that data block. This optimization avoids excessive CPU usage on a single host when the same cached data block is processed multiple times. Cloudera recommends specifying a value greater than or equal to the HDFS block replication factor.
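
A hedged sketch of the syntax, assuming a hypothetical HDFS cache pool named four_gig_pool and a hypothetical table:

-- Cache the table's data blocks on 3 hosts; later, raise the cache replication factor to 4.
create table lookup_codes (code int, description string)
  cached in 'four_gig_pool' with replication = 3;

alter table lookup_codes set cached in 'four_gig_pool' with replication = 4;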

If a view applies to a partitioned table, any partition pruning considers the clauses on both the original query and any additional WHERE predicates in the query that refers to the view. Prior to Impala 1.4, only the WHERE clauses on the original query from the CREATE VIEW statement were used for partition pruning.

To see the definition of a view, issue a DESCRIBE FORMATTED statement, which shows the query from the original CREATE VIEW statement:

[localhost:21000] > create view v1 as select * from t1;
[localhost:21000] > describe formatted v1;
Query finished, fetching results ...
+------------------------------+------------------------------+------------+
| name                         | type                         | comment    |
+------------------------------+------------------------------+------------+
| # col_name                   | data_type                    | comment    |
|                              | NULL                         | NULL       |
| x                            | int                          | None       |
| y                            | int                          | None       |
| s                            | string                       | None       |
|                              | NULL                         | NULL       |
| # Detailed Table Information | NULL                         | NULL       |
| Database:                    | views                        | NULL       |
| Owner:                       | cloudera                     | NULL       |
| CreateTime:                  | Mon Jul 08 15:56:27 EDT 2013 | NULL       |
| LastAccessTime:              | UNKNOWN                      | NULL       |
| Protect Mode:                | None                         | NULL       |
| Retention:                   | 0                            | NULL       |
| Table Type:                  | VIRTUAL_VIEW                 | NULL       |
| Table Parameters:            | NULL                         | NULL       |
|                              | transient_lastDdlTime        | 1373313387 |
|                              | NULL                         | NULL       |
| # Storage Information        | NULL                         | NULL       |
| SerDe Library:               | null                         | NULL       |
| InputFormat:                 | null                         | NULL       |
| OutputFormat:                | null                         | NULL       |
| Compressed:                  | No                           | NULL       |
| Num Buckets:                 | 0                            | NULL       |
| Bucket Columns:              | []                           | NULL       |
| Sort Columns:                | []                           | NULL       |
|                              | NULL                         | NULL       |
| # View Information           | NULL                         | NULL       |
| View Original Text:          | SELECT * FROM t1             | NULL       |
| View Expanded Text:          | SELECT * FROM t1             | NULL       |
+------------------------------+------------------------------+------------+

The INSERT ... VALUES technique is not suitable for loading large quantities of data into HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not run scripts with thousands of INSERT ... VALUES statements that insert a single row each time. If you do run INSERT ... VALUES operations to load data into a staging table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES clause, and use a separate database to make cleanup easier if the operation does produce many tiny files.
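
For example, when staging a tiny amount of data, combine multiple rows into a single VALUES clause (the table name here is hypothetical):

insert into staging_dim values
  (1, 'north'), (2, 'south'), (3, 'east'), (4, 'west');
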
HBase

HBase-related reusable snippets.

After you create a table in Hive, such as the HBase mapping table in this example, issue an INVALIDATE METADATA table_name statement the next time you connect to Impala, to make Impala aware of the new table. (Prior to Impala 1.2.4, you could not specify the table name if Impala was not aware of the table yet; in Impala 1.2.4 and higher, specifying the table name avoids reloading the metadata for other tables that are not changed.)
Introduction, Concepts, and Architecture

Snippets from conceptual, architecture, benefits, and feature introduction sections. Some of these, particularly around the front matter, were conref'ed in ways that were hard to follow. So now we pull individual paragraphs and lists from here, for clarity.

The Apache Impala (incubating) project provides high-performance, low-latency SQL queries on data stored in popular Apache Hadoop file formats. The fast response for queries enables interactive exploration and fine-tuning of analytic queries, rather than long batch jobs traditionally associated with SQL-on-Hadoop technologies. (You will often see the term interactive applied to these kinds of fast queries with human-scale response times.)

Impala integrates with the Apache Hive metastore database, to share databases and tables between both components. The high level of integration with Hive, and compatibility with the HiveQL syntax, lets you use either Impala or Hive to create tables, issue queries, load data, and so on.

The following graphic illustrates how Impala is positioned in the broader Cloudera environment:

[Architecture diagram showing how Impala relates to other Hadoop components such as HDFS, the Hive metastore database, and client programs such as JDBC and ODBC applications and the Hue web UI.]

The Impala solution is composed of the following components:

  • Clients - Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala. These interfaces are typically used to issue queries or complete administrative tasks such as connecting to Impala.
  • Hive Metastore - Stores information about the data available to Impala. For example, the metastore lets Impala know what databases are available and what the structure of those databases is. As you create, drop, and alter schema objects, load data into tables, and so on through Impala SQL statements, the relevant metadata changes are automatically broadcast to all Impala nodes by the dedicated catalog service introduced in Impala 1.2.
  • Impala - This process, which runs on DataNodes, coordinates and executes queries. Each instance of Impala can receive, plan, and coordinate queries from Impala clients. Queries are distributed among Impala nodes, and these nodes then act as workers, executing parallel query fragments.
  • HBase and HDFS - Storage for data to be queried.

Queries executed using Impala are handled as follows:

  1. User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying interfaces. The user application may connect to any impalad in the cluster. This impalad becomes the coordinator for the query.
  2. Impala parses the query and analyzes it to determine what tasks need to be performed by impalad instances across the cluster. Execution is planned for optimal efficiency.
  3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.
  4. Each impalad returns data to the coordinating impalad, which sends these results to the client.

In and higher, Impala can optionally skip an arbitrary number of header lines from text input files on HDFS based on the skip.header.line.count value in the TBLPROPERTIES field of the table metadata. For example:

create table header_line(first_name string, age int)
  row format delimited fields terminated by ',';

-- Back in the shell, load data into the table with commands such as:
-- cat >data.csv
-- Name,Age
-- Alice,25
-- Bob,19
-- hdfs dfs -put data.csv /user/hive/warehouse/header_line

refresh header_line;

-- Initially, the Name,Age header line is treated as a row of the table.
select * from header_line limit 10;
+------------+------+
| first_name | age  |
+------------+------+
| Name       | NULL |
| Alice      | 25   |
| Bob        | 19   |
+------------+------+

alter table header_line set tblproperties('skip.header.line.count'='1');

-- Once the table property is set, queries skip the specified number of lines
-- at the beginning of each text data file. Therefore, all the files in the table
-- should follow the same convention for header lines.
select * from header_line limit 10;
+------------+-----+
| first_name | age |
+------------+-----+
| Alice      | 25  |
| Bob        | 19  |
+------------+-----+

Impala provides support for:

  • Most common SQL-92 features of Hive Query Language (HiveQL) including SELECT, joins, and aggregate functions.
  • HDFS, HBase, and Amazon Simple Storage System (S3) storage, including:
    • HDFS file formats: delimited text files, Parquet, Avro, SequenceFile, and RCFile.
    • Compression codecs: Snappy, GZIP, Deflate, BZIP.
  • Common data access interfaces including:
    • JDBC driver.
    • ODBC driver.
    • Hue Beeswax and the Impala Query UI.
  • impala-shell command-line interface.
  • Kerberos authentication.

By default, the metadata loading and caching on startup happens asynchronously, so Impala can begin accepting requests promptly. To enable the original behavior, where Impala waited until all metadata was loaded before accepting any requests, set the catalogd configuration option --load_catalog_in_background=false.

  • See , and , for usage information for the catalogd daemon.

  • The REFRESH and INVALIDATE METADATA statements are no longer needed when the CREATE TABLE, INSERT, or other table-changing or data-changing operation is performed through Impala. These statements are still needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the statements only need to be issued on one Impala node rather than on all nodes. See and for the latest usage information for those statements.

  • See for background information on the catalogd service.

Installation

Snippets related to installation, upgrading, prerequisites.

  • The location of core dump files may vary according to your operating system configuration.

  • Other security settings may prevent Impala from writing core dumps even when this option is enabled.

  • On systems managed by Cloudera Manager, the default location for core dumps is on a temporary filesystem, which can lead to out-of-space issues if the core dumps are large, frequent, or not removed promptly. To specify an alternative location for the core dumps, filter the Impala configuration settings to find the core_dump_dir option, which is available in Cloudera Manager 5.4.3 and higher. This option lets you specify a different directory for core dumps for each of the Impala-related daemons.

The prerequisite for CPU architecture has been relaxed in Impala 2.2.0 and higher. From this release onward, Impala works on CPUs that have the SSSE3 instruction set. The SSE4 instruction set is no longer required. This relaxed requirement simplifies the upgrade planning from Impala 1.x releases, which also worked on SSSE3-enabled processors.

Due to a change to the implementation of logging in Impala 1.1.1 and higher, currently you should change the default setting for the logbuflevel property for the Impala service after installing through Cloudera Manager. In Cloudera Manager, on the log settings page for the Impala service, change the setting Impala Daemon Log Buffer Level (logbuflevel) from -1 to 0. You might change this setting to a value higher than 0, if you prefer to reduce the I/O overhead for logging, at the expense of possibly losing some lower-priority log messages in the event of a crash.

On version 5 of Red Hat Enterprise Linux and comparable distributions, some additional setup is needed for the impala-shell interpreter to connect to a Kerberos-enabled Impala cluster:

sudo yum install python-devel openssl-devel python-pip
sudo pip-python install ssl

Prior to CDH 5.5.2 / Impala 2.3.2, you could enable Kerberos authentication between Impala internal components, or SSL encryption between Impala internal components, but not both at the same time. This restriction has now been lifted. See IMPALA-2598 to see the maintenance releases for different levels of CDH where the fix has been published.

Prior to CDH 5.7 / Impala 2.5, the Hive JDBC driver did not support connections that use both Kerberos authentication and SSL encryption. If your cluster is running an older release that has this restriction, to use both of these security features with Impala through a JDBC application, use the Cloudera JDBC Connector as the JDBC driver.

Because Impala 1.2.2 works with CDH 4, while the Impala that comes with the CDH 5 beta is version 1.2.0, upgrading from CDH 4 to the CDH 5 beta actually reverts to an earlier Impala version. The beta release of Impala that comes with the CDH 5 beta includes the resource management feature that relies on the CDH 5 infrastructure, as well as the much-requested user-defined function feature and the catalog service. However, it does not include new features in Impala 1.2.3 such as join order optimizations, COMPUTE STATS statement, CROSS JOIN operator, SHOW CREATE TABLE statement, SHOW TABLE STATS and SHOW COLUMN STATS statements, OFFSET and NULLS FIRST/LAST clauses for queries, and the SYNC_DDL query option.

In a Cloudera Manager environment, the catalog service is not recognized or managed by Cloudera Manager versions prior to 4.8. Cloudera Manager 4.8 and higher require the catalog service to be present for Impala. Therefore, if you upgrade to Cloudera Manager 4.8 or higher, you must also upgrade Impala to 1.2.1 or higher. Likewise, if you upgrade Impala to 1.2.1 or higher, you must also upgrade Cloudera Manager to 4.8 or higher.

For the resource management feature to work (in combination with CDH 5 and the YARN and Llama components), the impala user must be a member of the hdfs group. This setup is performed automatically during a new install, but not when upgrading from earlier Impala releases to Impala 1.2. If you are upgrading a node that already had Impala 1.1 or 1.0 installed, manually add the impala user to the hdfs group.

Prior to CDH 5.5 / Impala 2.3, the impala user was required to be a member of the hdfs group for the resource management feature to work (in combination with CDH 5 and the YARN and Llama components). This requirement has been lifted in and higher. The impala user remains in the hdfs group on upgraded systems if it was already there, but is no longer put into that group during new installs.

  • The Impala 1.3.1 release is available for both CDH 4 and CDH 5. This is the first release in the 1.3.x series for CDH 4.
Performance

Snippets from performance configuration, tuning, and so on.

A good source of tips related to scalability and performance tuning is the Impala Cookbook presentation. These slides are updated periodically as new features come out and new benchmarks are performed.

  • Copy the client core-site.xml and hdfs-site.xml configuration files from the Hadoop configuration directory to the Impala configuration directory. The default Impala configuration location is /etc/impala/conf.
  • After applying these changes, restart all DataNodes.
Currently, a known issue (IMPALA-488) could cause excessive memory usage during a COMPUTE STATS operation on a Parquet table. As a workaround, issue the command SET NUM_SCANNER_THREADS=2 in impala-shell before issuing the COMPUTE STATS statement. Then issue UNSET NUM_SCANNER_THREADS before continuing with queries.
Administration

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Due to a limitation of HDFS, zero-copy reads are not supported with encryption. Cloudera recommends not using HDFS caching for Impala data files in encryption zones. The queries fall back to the normal read path during query execution, which might cause some performance overhead.

This query option is no longer supported, because it affects interaction between Impala and Llama. The use of the Llama component for integrated resource management within YARN is no longer supported with CDH 5.5 / Impala 2.3 and higher.

The use of the Llama component for integrated resource management within YARN is no longer supported with CDH 5.5 / Impala 2.3 and higher.

For clusters running Impala alongside other data management components, you define static service pools to define the resources available to Impala and other components. Then within the area allocated for Impala, you can create dynamic service pools, each with its own settings for the Impala admission control feature.

If you specify Max Memory for an Impala dynamic resource pool, you must also specify the Default Query Memory Limit. Max Memory relies on the Default Query Memory Limit to produce a reliable estimate of overall memory consumption for a query.

For example, consider the following scenario:

  • The cluster is running impalad daemons on five DataNodes.
  • A dynamic resource pool has Max Memory set to 100 GB.
  • The Default Query Memory Limit for the pool is 10 GB. Therefore, any query running in this pool could use up to 50 GB of memory (default query memory limit * number of Impala nodes).
  • The maximum number of queries that Impala executes concurrently within this dynamic resource pool is two, which is the most that could be accommodated within the 100 GB Max Memory cluster-wide limit.
  • There is no memory penalty if queries use less memory than the Default Query Memory Limit per-host setting or the Max Memory cluster-wide limit. These values are only used to estimate how many queries can be run concurrently within the resource constraints for the pool.

When using YARN with Impala, Cloudera recommends using the static partitioning technique (through a static service pool) rather than the combination of YARN and Llama. YARN is a central, synchronous scheduler, which introduces higher latency and variance; this makes it better suited to batch processing than to interactive workloads like Impala, especially at higher concurrency. Currently, YARN allocates memory throughout the query, making it hard to reason about out-of-memory and timeout conditions.

Impala queries ignore files with extensions commonly used for temporary work files by Hadoop tools. Any files with extensions .tmp or .copying are not considered part of the Impala table. The suffix matching is case-insensitive, so for example Impala ignores both .copying and .COPYING suffixes.

If your JDBC or ODBC application connects to Impala through a load balancer such as haproxy, be cautious about reusing the connections. If the load balancer has set up connection timeout values, either check the connection frequently so that it never sits idle longer than the load balancer timeout value, or check the connection validity before using it and create a new one if the connection has been closed.

For a detailed information about configuring a cluster to share resources between Impala queries and MapReduce jobs, see and .

In CDH 5.0.0, the Llama component is in beta. It is intended for evaluation of resource management in test environments, in combination with Impala and YARN. It is currently not recommended for production deployment.
CDH5 Integration

Snippets related to CDH 5 integration, for example phrase tags that are conditionalized in or out of 'integrated' and 'standalone' conditions to provide extra context for links that don't work in certain PDF contexts.

The version of Impala that is included with CDH 5.5.1 is identical to the Impala for CDH 5.5.0. There are no new bug fixes, new features, or incompatible changes.

  • Impala 2.6.x is available as part of CDH 5.8.x.
  • Impala 2.5.x is available as part of CDH 5.7.x and is not available for CDH 4. Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release.
  • Impala 2.4.x is available as part of CDH 5.6.x and is not available for CDH 4. Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release.
  • Impala 2.4.0 is available as part of CDH 5.6.0 and is not available for CDH 4. Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release.
  • Impala 2.3.x is available as part of CDH 5.5.x and is not available for CDH 4. Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release.
  • Impala 2.2.9 is available as part of CDH 5.4.9, not under CDH 4.
  • Impala 2.2.8 is available as part of CDH 5.4.8, not under CDH 4.
  • Impala 2.2.7 is available as part of CDH 5.4.7, not under CDH 4.
  • Impala 2.2.6 is available as part of CDH 5.4.6, not under CDH 4.
  • Impala 2.2.5 is available as part of CDH 5.4.5, not under CDH 4.
  • Impala 2.2.4 is available as part of CDH 5.4.4, not under CDH 4.
  • Impala 2.2.3 is available as part of CDH 5.4.3, not under CDH 4.
  • Impala 2.2.2 is available as part of CDH 5.4.2, not under CDH 4.
  • Impala 2.2.1 is available as part of CDH 5.4.1, not under CDH 4.
  • The Impala 2.2.x maintenance releases now use the CDH 5.4.x numbering system rather than increasing the Impala version numbers. Impala 2.2 and higher are not available under CDH 4.
  • Impala 2.2.0 is available as part of CDH 5.4.0 and is not available for CDH 4. Cloudera does not intend to release future versions of Impala for CDH 4 outside patch and maintenance releases if required. Given the end-of-maintenance status for CDH 4, Cloudera recommends all customers to migrate to a recent CDH 5 release.
  • This Impala maintenance release is only available as part of CDH 5, not under CDH 4.
  • Impala 2.1.3 is available as part of CDH 5.3.3, not under CDH 4.
  • Impala 2.1.2 is available as part of CDH 5.3.2, not under CDH 4.
  • Impala 2.0.5 is available as part of CDH 5.2.6, not under CDH 4.
  • Impala 2.0.4 is available as part of CDH 5.2.5, not under CDH 4.
  • Impala 2.0.3 is available as part of CDH 5.2.4, not under CDH 4.
  • Impala 2.0.2 is available as part of CDH 5.2.3, not under CDH 4.
  • Impala 1.4.4 is available as part of CDH 5.1.5, not under CDH 4.
  • Impala 1.4.3 is available as part of CDH 5.1.4, and under CDH 4.
  • Impala 1.4.2 is only available as part of CDH 5.1.3, not under CDH 4.
  • Impala 1.4.1 is only available as part of CDH 5.1.2, not under CDH 4.
  • Impala 1.3.3 is only available as part of CDH 5.0.5, not under CDH 4.
  • Impala 1.3.2 is only available as part of CDH 5.0.4, not under CDH 4.

Starting in April 2016, future release note updates are being consolidated in a single location to avoid duplication of stale or incomplete information.
You can view online the Impala New Features, Incompatible Changes, Known Issues, and Fixed Issues. You can view or print all of these by downloading the latest Impala PDF.

Because CDH 5.3.5 does not include any code changes for Impala, Impala 2.1.4 is included with both CDH 5.3.4 and 5.3.5. See such-and-such a topic in the CDH 5 Installation Guide. See such-and-such a topic in the CDH 5 Security Guide. See such-and-such a topic in the CDH 5 Release Notes. See such-and-such a topic in the Impala User Guide. See such-and-such a topic in the Impala Release Notes. See such-and-such a topic in the Impala Frequently Asked Questions.

Impala relies on the statistics produced by the COMPUTE STATS statement to estimate memory usage for each query. See for guidelines about how and when to use this statement.
impala-shell

These reusable snippets are for the impala-shell command and related material such as query options.

You might set the NUM_NODES option to 1 briefly, during INSERT or CREATE TABLE AS SELECT statements. Normally, those statements produce one or more data files per data node. If the write operation involves small amounts of data, a Parquet table, and/or a partitioned table, the default behavior could produce many small files when intuitively you might expect only a single output file. SET NUM_NODES=1 turns off the distributed aspect of the write operation, making it more likely to produce only one or a few data files.
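
A brief sketch (hypothetical table names) showing the option being set for a small write and then restored to its default of 0:

set num_nodes=1;
insert into small_parquet_table partition (year) select id, val, year from staging_table;
set num_nodes=0;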

The timeout clock for queries and sessions only starts ticking when the query or session is idle. For queries, this means the query has results ready but is waiting for a client to fetch the data. A query can run for an arbitrary time without triggering a timeout, because the query is computing results rather than sitting idle waiting for the results to be fetched. The timeout period is intended to prevent unclosed queries from consuming resources and taking up slots in the admission count of running queries, potentially preventing other queries from starting.

For sessions, this means that no query has been submitted for some period of time.

Now that the ORDER BY clause no longer requires an accompanying LIMIT clause in Impala 1.4.0 and higher, this query option is deprecated and has no effect.

Release Notes

These are notes associated with a particular JIRA issue. They typically will be conref'ed both in the release notes and someplace in the main body as a limitation or warning or similar.

The initial release of CDH 5.7 / Impala 2.5 sometimes has a higher peak memory usage than in previous releases while reading Parquet files. The following query options might help to reduce memory consumption in the Parquet scanner:

  • Reduce the number of scanner threads, for example: set num_scanner_threads=30
  • Reduce the batch size, for example: set batch_size=512
  • Increase the memory limit, for example: set mem_limit=64g
You can track the status of the fix for this issue at IMPALA-3662.

For schemas with large numbers of tables, partitions, and data files, the catalogd daemon might encounter an out-of-memory error. To increase the memory limit for the catalogd daemon:

  1. Check current memory usage for the catalogd daemon by running the following commands on the host where that daemon runs on your cluster:

    jcmd catalogd_pid VM.flags
    jmap -heap catalogd_pid
  2. Decide on a large enough value for the catalogd heap. You express it as an environment variable value as follows:

    JAVA_TOOL_OPTIONS="-Xmx8g"
  3. On systems managed by Cloudera Manager, include this value in the configuration field Java Heap Size of Catalog Server in Bytes (Cloudera Manager 5.7 and higher), or Impala Catalog Server Environment Advanced Configuration Snippet (Safety Valve) (prior to Cloudera Manager 5.7). Then restart the Impala service.

  4. On systems not managed by Cloudera Manager, put this environment variable setting into the startup script for the catalogd daemon, then restart the catalogd daemon.

  5. Use the same jcmd and jmap commands as earlier to verify that the new settings are in effect.

Kudu

Kudu-related content. This category gets its own special area because there could be considerations around sharing content between the Impala documentation and the Kudu documentation.

Kudu considerations:

The LOAD DATA statement cannot be used with Kudu tables.