When adapting SQL code from a traditional database system to Impala, expect to find a number of differences in the DDL statements that you use to set up the schema. Clauses related to physical layout of files, tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to account for the Impala partitioning scheme and Hadoop file formats.
Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor extensions and features not yet supported in Impala, you might be able to run identical or almost-identical query text on both systems.
Therefore, consider moving the DDL into a separate, Impala-specific setup script. Focus your reuse and ongoing tuning efforts on the code for SQL queries.
Change any VARCHAR, VARCHAR2, and CHAR columns to STRING. Remove any length constraints from the column declarations; for example, change VARCHAR(32) or CHAR(1) to STRING.
For national language character types such as NCHAR, NVARCHAR, or NCLOB, use STRING.
Change any DATE, DATETIME, or TIME columns to TIMESTAMP.
You might also need to adapt date- and time-related literal values and format strings to use the
supported Impala date and time formats. If you have date and time literals with different separators or different numbers of YY, MM, and so on placeholders than Impala expects, use the unix_timestamp() function, which accepts a format string, to convert such values.
Instead of SYSDATE, call the function NOW().
Instead of adding or subtracting directly from a date value to produce a value N days in the future, use an INTERVAL expression, for example NOW() + INTERVAL 30 DAYS.
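For example, date arithmetic in Impala is expressed with INTERVAL units (the orders table and its columns below are hypothetical):

```sql
-- Add or subtract whole units with INTERVAL rather than bare arithmetic.
SELECT now() + INTERVAL 30 DAYS;

-- The same pattern applies to TIMESTAMP columns (hypothetical table).
SELECT order_date - INTERVAL 2 HOURS FROM orders;
```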
Although Impala supports INTERVAL expressions for datetime arithmetic, INTERVAL is not available as a column data type. For any INTERVAL values stored in tables, convert them to numeric values that you can add or subtract using the date and time functions.
For YEAR columns, change to the smallest Impala integer type that has sufficient range.
Change any DECIMAL and NUMBER types. If fixed-point precision is needed, use DECIMAL; if not, use FLOAT or DOUBLE.
Most integer types from other systems have equivalents in Impala, perhaps under different names such as BIGINT instead of INT8.
Remove any UNSIGNED constraints; all Impala numeric types are signed.
For any types holding bitwise values, use an integer type with enough range to hold all the relevant
bits within a positive integer. See the documentation on Impala data types for the ranges of the integer types.
For example, to hold an 8-bit unsigned bitmask, use SMALLINT rather than TINYINT, because the positive range of TINYINT (0-127) covers only 7 bits.
Impala does not support notation such as b'0101' for bit literals.
For BLOB values, use STRING to represent CLOB or TEXT types (character-based large objects); purely binary large object types such as BLOB, RAW, and VARBINARY do not currently have an equivalent in Impala.
For Boolean-like types such as BOOL, use the Impala BOOLEAN type.
Because Impala currently does not support composite or nested types, any spatial data types in other
database systems do not have direct equivalents in Impala. You could represent spatial values in string
format and write UDFs to process them. See the documentation on user-defined functions (UDFs) for details.
Take out any DEFAULT clauses; Impala does not fill in default values when data files are loaded or created by other components.
Take out any constraints from your CREATE TABLE and ALTER TABLE statements, for example PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL, UNSIGNED, or CHECK constraints.
Do as much verification as practical before loading data into Impala. After data is loaded into Impala,
you can do further verification using SQL queries to check if values have expected ranges, if values
are NULL or not NULL, and so on.
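As a sketch of such post-load checks (the table and column names below are hypothetical):

```sql
-- Count rows with out-of-range or missing values after loading.
SELECT count(*) AS bad_prices
  FROM sales
 WHERE price IS NULL OR price < 0 OR price > 1000000;

-- Verify that a code column contains only the expected values.
SELECT status, count(*) FROM sales GROUP BY status;
```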
Take out any CREATE INDEX, DROP INDEX, and ALTER INDEX statements.
Calls to built-in functions with out-of-range or otherwise incorrect arguments return NULL.
For any other type not supported in Impala, you could represent their values in string format and write
UDFs to process them. See the documentation on user-defined functions (UDFs) for details.
To detect the presence of unsupported or unconvertable data types in data files, do initial testing
with the ABORT_ON_ERROR=true query option in effect; this option causes queries to fail as soon as they encounter a disallowed value.
Some SQL statements or clauses that you might be familiar with are not currently supported in Impala:
Impala has no DELETE statement. Impala is intended for data warehouse-style operations where you load and query data in bulk rather than modifying individual rows.
Impala has no UPDATE statement. To change values, reload or recreate the relevant data.
Impala has no transactional statements, such as COMMIT or ROLLBACK.
If your database, table, column, or other names conflict with Impala reserved words, use different
names or quote the names with backticks. See the list of Impala reserved words for details.
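For example, LOCATION is an Impala keyword, so a column by that name must be quoted (the table name here is illustrative):

```sql
-- Backticks let a keyword serve as a column name.
CREATE TABLE sites (id BIGINT, `location` STRING);
SELECT `location` FROM sites;
```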
Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or
column alias. For example, in SELECT * FROM t1 NATURAL JOIN t2, Impala does not perform a natural join; it interprets NATURAL as a table alias for t1.
Impala supports subqueries only in the FROM clause of a query, not in the WHERE clause.
Impala supports the UNION and UNION ALL set operators, but not INTERSECT.
Within queries, Impala requires query aliases for any subqueries.
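For example (table, column, and alias names are illustrative):

```sql
-- The subquery in the FROM clause must have an alias (here, sub).
SELECT max(avg_x) FROM (SELECT avg(x) AS avg_x FROM t1 GROUP BY y) sub;
```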
When an alias is declared for an expression in a query, that alias cannot be referenced again within the same query block.
For Impala, either repeat the expression, or abstract the expression into a WITH clause, giving it a name you can reference multiple times.
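A minimal sketch of the WITH-clause workaround (the orders table and its columns are hypothetical):

```sql
-- Define the expression once in a WITH clause, then reference the
-- resulting column name freely in the outer query block.
WITH t AS (SELECT price * quantity AS total FROM orders)
SELECT total, total * 0.1 AS tax
FROM t
WHERE total > 100;
```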
Impala does not support certain rarely used join types that are less appropriate for high-volume tables
used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to
ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support
natural joins or anti-joins, and requires the CROSS JOIN operator for queries that produce a Cartesian product.
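For example, a Cartesian product must be spelled out explicitly (table names are illustrative):

```sql
-- Rejected by Impala: SELECT * FROM t1 JOIN t2;  -- no join condition
-- Accepted: the intent to join every row with every row is explicit.
SELECT * FROM t1 CROSS JOIN t2;
```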
Impala has a limited choice of partitioning types. Partitions are defined based on each distinct
combination of values for one or more partition key columns. Impala does not redistribute or check data
to create evenly distributed partitions; you must choose partition key columns based on your knowledge
of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning
to use the Impala partition syntax in CREATE TABLE and ALTER TABLE statements.
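A sketch of adapting a range- or list-partitioned table to Impala's scheme, where each distinct combination of partition key values becomes its own partition (the table and columns are hypothetical):

```sql
CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year SMALLINT, month TINYINT);

-- Each distinct (year, month) combination is a separate partition.
INSERT INTO sales PARTITION (year=2014, month=1) VALUES (1, 9.99);
```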
For top-N queries, Impala uses the LIMIT clause rather than comparing against a pseudo column such as ROWNUM.
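For example, a top-10 query in Impala (table and column names are illustrative):

```sql
-- Instead of a ROWNUM comparison, combine ORDER BY with LIMIT.
SELECT customer_id, total FROM orders ORDER BY total DESC LIMIT 10;
```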
Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than optimal performance. Also, machine-generated SQL, perhaps issued through JDBC or ODBC applications, might contain inefficiencies or exceed internal Impala limits. As you port SQL code, watch for the following issues and adjust the code where appropriate:
A CREATE TABLE statement with no STORED AS clause creates data files in plain text format, which is convenient for interchange with other tools but not the most efficient for large-scale queries.
A CREATE TABLE statement with no PARTITIONED BY clause stores all the data files in the same physical location, which can limit scalability as the data volume grows.
On the other hand, adapting tables that were already partitioned in a different database system could produce an Impala table with a high number of partitions and not enough data in each one, leading to underutilization of Impala's parallel query features.
See the documentation on partitioning for details.
The INSERT ... VALUES syntax is suitable for setting up toy tables with a few rows for functional testing, but each such statement produces a separate tiny data file, making it unsuitable for loading substantial volumes of data.
If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data
files, or a single giant data file that cannot take advantage of distributed parallel queries or
partitioning. In this case, use an INSERT ... SELECT statement to copy the data into a new table, reorganizing it into a more efficient layout in the same operation.
You can do INSERT ... SELECT into a table with a more efficient file format, or from an unpartitioned table into a partitioned one.
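As an illustrative sketch, copying fragmented text data into a Parquet-backed, partitioned table in one pass (table names, columns, and the choice of Parquet are assumptions):

```sql
CREATE TABLE sales_parquet (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year SMALLINT)
  STORED AS PARQUET;

-- Convert the file format and introduce partitioning in one operation;
-- the partition key column goes last in the SELECT list.
INSERT INTO sales_parquet PARTITION (year)
  SELECT id, amount, year FROM sales_text;
```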
The number of expressions allowed in an Impala query might be smaller than for some other database
systems, causing failures for very complicated queries (typically produced by automated SQL
generators). Where practical, keep the number of expressions in the WHERE clause of a query relatively small.
If practical, rewrite UNION queries to use the UNION ALL operator instead; UNION ALL skips the expensive duplicate-elimination step and produces the same results when the inputs contain no duplicates or when duplicates are acceptable.
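For example (table names are illustrative):

```sql
-- UNION removes duplicates, which requires extra aggregation work.
SELECT id FROM sales_2014 UNION SELECT id FROM sales_2015;

-- When duplicates are acceptable or impossible, UNION ALL avoids
-- that overhead entirely.
SELECT id FROM sales_2014 UNION ALL SELECT id FROM sales_2015;
```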
Throughout this section, some of the decisions you make during the porting process also have a substantial impact on performance. After your SQL code is ported and working correctly, double-check the performance-related aspects of your schema design, physical layout, and queries to make sure that the ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and integration with Hadoop components.
See the documentation on tuning Impala for performance for details.