When adapting SQL code from a traditional database system to Impala, expect to find a number of differences in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to account for the Impala partitioning scheme and Hadoop file formats.
Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor extensions and features not yet supported in Impala, you might be able to run identical or almost-identical query text on both systems.
Therefore, consider moving the DDL into a separate, Impala-specific setup script. Focus your reuse and ongoing tuning efforts on the code for SQL queries.
Change any VARCHAR, VARCHAR2, and CHAR columns to STRING. Remove any length constraints from the column declarations; for example, change VARCHAR(32) or CHAR(1) to STRING.
For national language character types such as NCHAR, NVARCHAR, or NCLOB, use STRING.
Change any DATE, DATETIME, or TIME columns to TIMESTAMP.
You might also need to adapt date- and time-related literal values and format strings to use the
supported Impala date and time formats. If you have date and time literals with different separators or different numbers of YY, MM, and so on placeholders than Impala expects, use the unix_timestamp() function, which accepts a format string, to convert such values.
Instead of SYSDATE, call the function NOW().
Instead of adding or subtracting directly from a date value to produce a value N days in the future, use an INTERVAL expression, for example NOW() + INTERVAL 30 DAYS.
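For example, date arithmetic in Impala is expressed with INTERVAL units (the orders table and its columns below are hypothetical):

```sql
-- Add or subtract whole units with INTERVAL rather than bare arithmetic.
SELECT now() + INTERVAL 30 DAYS;

-- The same pattern applies to TIMESTAMP columns (hypothetical table).
SELECT order_date - INTERVAL 2 HOURS FROM orders;
```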
Although Impala supports INTERVAL expressions for datetime arithmetic, INTERVAL is not available as a column data type. For any INTERVAL values stored in tables, convert them to numeric values that you can add or subtract using the date and time functions.
For YEAR columns, change to the smallest Impala integer type that has sufficient range.
Change any DECIMAL and NUMBER types. If fixed-point precision is needed, use DECIMAL; if not, use FLOAT or DOUBLE.
Most integer types from other systems have equivalents in Impala, perhaps under different names such as BIGINT instead of INT8.
Remove any UNSIGNED constraints; all Impala numeric types are signed.
For any types holding bitwise values, use an integer type with enough range to hold all the relevant
bits within a positive integer. See the documentation on Impala data types for the ranges of the integer types.
For example, to hold an 8-bit unsigned bitmask, use SMALLINT rather than TINYINT, because the positive range of TINYINT (0-127) covers only 7 bits.
Impala does not support notation such as b'0101' for bit literals.
For BLOB values, use STRING to represent CLOB or TEXT types (character-based large objects); purely binary large object types such as BLOB, RAW, and VARBINARY do not currently have an equivalent in Impala.
For Boolean-like types such as BOOL, use the Impala BOOLEAN type.
Because Impala currently does not support composite or nested types, any spatial data types in other
database systems do not have direct equivalents in Impala. You could represent spatial values in string
format and write UDFs to process them. See the documentation on user-defined functions (UDFs) for details.
Take out any DEFAULT clauses; Impala does not fill in default values when data files are loaded or created by other components.
Take out any constraints from your CREATE TABLE and ALTER TABLE statements, for example PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, UNSIGNED, or CHECK constraints.
Do as much verification as practical before loading data into Impala. After data is loaded into Impala,
you can do further verification using SQL queries to check if values have expected ranges, if values
are NULL or not NULL, and so on.
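As a sketch of such post-load checks (the table and column names below are hypothetical):

```sql
-- Count rows with out-of-range or missing values after loading.
SELECT count(*) AS bad_prices
  FROM sales
 WHERE price IS NULL OR price < 0 OR price > 1000000;

-- Verify that a code column contains only the expected values.
SELECT status, count(*) FROM sales GROUP BY status;
```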
Take out any CREATE INDEX, DROP INDEX, and ALTER INDEX statements.
Calls to built-in functions with out-of-range or otherwise incorrect arguments return NULL.
For any other type not supported in Impala, you could represent their values in string format and write
UDFs to process them. See the documentation on user-defined functions (UDFs) for details.
To detect the presence of unsupported or unconvertable data types in data files, do initial testing
with the ABORT_ON_ERROR=true query option in effect; this option causes queries to fail as soon as they encounter a disallowed value.
Some SQL statements or clauses that you might be familiar with are not currently supported in Impala:
Impala has no DELETE statement. Impala is intended for data warehouse-style operations where you load and query data in bulk rather than modifying individual rows.
Impala has no UPDATE statement. To change values, reload or recreate the relevant data.
Impala has no transactional statements, such as COMMIT or ROLLBACK.
If your database, table, column, or other names conflict with Impala reserved words, use different
names or quote the names with backticks. See the list of Impala reserved words for details.
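For example, LOCATION is an Impala keyword, so a column by that name must be quoted (the table name here is illustrative):

```sql
-- Backticks let a keyword serve as a column name.
CREATE TABLE sites (id BIGINT, `location` STRING);
SELECT `location` FROM sites;
```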
Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or
column alias. For example, in SELECT * FROM t1 NATURAL JOIN t2, Impala does not perform a natural join; it interprets NATURAL as a table alias for t1.
Impala supports subqueries only in the FROM clause of a query, not in the WHERE clause.
Impala supports the UNION and UNION ALL set operators, but not INTERSECT.
Within queries, Impala requires query aliases for any subqueries.
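For example (table, column, and alias names are illustrative):

```sql
-- The subquery in the FROM clause must have an alias (here, sub).
SELECT max(avg_x) FROM (SELECT avg(x) AS avg_x FROM t1 GROUP BY y) sub;
```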
When an alias is declared for an expression in a query, that alias cannot be referenced again within the same query block.
For Impala, either repeat the expression, or abstract the expression into a WITH clause, giving it a name you can reference multiple times.
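A minimal sketch of the WITH-clause workaround (the orders table and its columns are hypothetical):

```sql
-- Define the expression once in a WITH clause, then reference the
-- resulting column name freely in the outer query block.
WITH t AS (SELECT price * quantity AS total FROM orders)
SELECT total, total * 0.1 AS tax
FROM t
WHERE total > 100;
```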
Impala does not support certain rarely used join types that are less appropriate for high-volume tables
used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to
ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support
natural joins or anti-joins, and requires the CROSS JOIN operator for queries that produce a Cartesian product.
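For example, a Cartesian product must be spelled out explicitly (table names are illustrative):

```sql
-- Rejected by Impala: SELECT * FROM t1 JOIN t2;  -- no join condition
-- Accepted: the intent to join every row with every row is explicit.
SELECT * FROM t1 CROSS JOIN t2;
```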
Impala has a limited choice of partitioning types. Partitions are defined based on each distinct
combination of values for one or more partition key columns. Impala does not redistribute or check data
to create evenly distributed partitions; you must choose partition key columns based on your knowledge
of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning
to use the Impala partition syntax in CREATE TABLE and ALTER TABLE statements.
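A sketch of adapting a range- or list-partitioned table to Impala's scheme, where each distinct combination of partition key values becomes its own partition (the table and columns are hypothetical):

```sql
CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year SMALLINT, month TINYINT);

-- Each distinct (year, month) combination is a separate partition.
INSERT INTO sales PARTITION (year=2014, month=1) VALUES (1, 9.99);
```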
For top-N queries, Impala uses the LIMIT clause rather than comparing against a pseudo column such as ROWNUM.
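For example, a top-10 query in Impala (table and column names are illustrative):

```sql
-- Instead of a ROWNUM comparison, combine ORDER BY with LIMIT.
SELECT customer_id, total FROM orders ORDER BY total DESC LIMIT 10;
```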
Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than optimal performance. Also, machine-generated SQL, perhaps issued through JDBC or ODBC applications, might contain inefficiencies or exceed internal Impala limits. As you port SQL code, watch for the following issues and adjust the code where appropriate:
A CREATE TABLE statement with no STORED AS clause creates data files in plain text format, which is convenient for interchange with other tools but not the most efficient for large-scale queries.
A CREATE TABLE statement with no PARTITIONED BY clause stores all the data files in the same physical location, which can limit scalability as the data volume grows.
On the other hand, adapting tables that were already partitioned in a different database system could produce an Impala table with a high number of partitions and not enough data in each one, leading to underutilization of Impala's parallel query features.
See the documentation on partitioning for details.
The INSERT ... VALUES syntax is suitable for setting up toy tables with a few rows for functional testing, but each such statement produces a separate tiny data file, making it unsuitable for loading substantial volumes of data.
If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data
files, or a single giant data file that cannot take advantage of distributed parallel queries or
partitioning. In this case, use an INSERT ... SELECT statement to copy the data into a new table, reorganizing it into a more efficient layout in the same operation.
You can do INSERT ... SELECT into a table with a more efficient file format, or from an unpartitioned table into a partitioned one.
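As an illustrative sketch, copying fragmented text data into a Parquet-backed, partitioned table in one pass (table names, columns, and the choice of Parquet are assumptions):

```sql
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year SMALLINT)
  STORED AS PARQUET;

-- Convert the file format and introduce partitioning in one operation;
-- the partition key column goes last in the SELECT list.
INSERT INTO sales_parquet PARTITION (year)
  SELECT id, amount, year FROM sales_text;
```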
The number of expressions allowed in an Impala query might be smaller than for some other database
systems, causing failures for very complicated queries (typically produced by automated SQL
generators). Where practical, keep the number of expressions in the WHERE clause of a query relatively small.
If practical, rewrite UNION queries to use the UNION ALL operator instead; UNION ALL skips the expensive duplicate-elimination step and produces the same results when the inputs contain no duplicates or when duplicates are acceptable.
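For example (table names are illustrative):

```sql
-- UNION removes duplicates, which requires extra aggregation work.
SELECT id FROM sales_2014 UNION SELECT id FROM sales_2015;

-- When duplicates are acceptable or impossible, UNION ALL avoids
-- that overhead entirely.
SELECT id FROM sales_2014 UNION ALL SELECT id FROM sales_2015;
```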
Throughout this section, some of the decisions you make during the porting process also have a substantial impact on performance. After your SQL code is ported and working correctly, double-check the performance-related aspects of your schema design, physical layout, and queries to make sure that the ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and integration with Hadoop components.
See the documentation on tuning Impala for performance for details.