<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="porting">

<title>Porting SQL from Other Database Systems to Impala</title>
<titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Databases"/>
<data name="Category" value="Hive"/>
<data name="Category" value="Oracle"/>
<data name="Category" value="MySQL"/>
<data name="Category" value="PostgreSQL"/>
<data name="Category" value="Troubleshooting"/>
<data name="Category" value="Porting"/>
<data name="Category" value="Data Analysts"/>
<data name="Category" value="Developers"/>
</metadata>
</prolog>

<conbody>

<p>
<indexterm audience="hidden">porting</indexterm>
Although Impala uses standard SQL for queries, you might need to modify SQL source when bringing applications
to Impala, due to variations in data types, built-in functions, vendor language extensions, and
Hadoop-specific syntax. Even when SQL is working correctly, you might make further minor modifications for
best performance.
</p>

<p outputclass="toc inpage"/>
</conbody>

<concept id="porting_ddl_dml">

<title>Porting DDL and DML Statements</title>

<conbody>

<p>
When adapting SQL code from a traditional database system to Impala, expect to find a number of differences
in the DDL statements that you use to set up the schema. Clauses related to physical layout of files,
tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to
account for the Impala partitioning scheme and Hadoop file formats.
</p>

<p>
Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor
extensions and features not yet supported in Impala, you might be able to run identical or almost-identical
query text on both systems.
</p>

<p>
Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse
and ongoing tuning efforts on the code for SQL queries.
</p>
</conbody>
</concept>

<concept id="porting_data_types">

<title>Porting Data Types from Other Database Systems</title>

<conbody>

<ul>
<li>
<p>
Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to
<codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example,
change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is
very flexible about the length of string values; it does not impose any length constraints
or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns.
(In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>,
with length constraints for both types and blank-padding for <codeph>CHAR</codeph>.
However, for performance reasons, it is still preferable to use <codeph>STRING</codeph>
columns where practical.)
</p>
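<p>
For illustration only (the table and column names below are hypothetical), a declaration ported from
another database system might change like this:
</p>
<codeblock>-- Original declaration from another database system:
-- CREATE TABLE customers (id BIGINT, name VARCHAR2(64), state CHAR(2));

-- Equivalent Impala declaration using STRING columns:
CREATE TABLE customers (id BIGINT, name STRING, state STRING);</codeblock>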
</li>

<li>
<p>
For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or
<codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently
some string manipulation operations only work correctly with ASCII data. See
<xref href="impala_string.xml#string"/> for details.
</p>
</li>

<li>
<p>
Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to
<codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make
sure your application logic or ETL process accounts for the fact that Impala expects all
<codeph>TIMESTAMP</codeph> values to be in
<xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated
Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about
the <codeph>TIMESTAMP</codeph> data type, and
<xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different
date and time formats.
</p>
<p>
You might also need to adapt date- and time-related literal values and format strings to use the
supported Impala date and time formats. If you have date and time literals with different separators or
different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala
expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the
Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the
allowed formats for date and time literals, and
<xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as
<codeph>regexp_replace()</codeph>.
</p>
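<p>
As a rough sketch of that technique (the literal value is chosen only for illustration), a call like the
following rewrites slash separators into the dash-separated form that Impala recognizes before casting:
</p>
<codeblock>-- If the literal's separators do not match a format Impala recognizes,
-- transform the string first, then cast it to TIMESTAMP.
select cast(regexp_replace('2016/01/31 12:30:00', '/', '-') as timestamp);</codeblock>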
<p>
Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>.
</p>
<p>
Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname>
days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() +
INTERVAL 30 DAYS</codeph>.
</p>
</li>

<li>
<p>
Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in
<xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column
data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric
values that you can add or subtract using the functions in
<xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table
<codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could
construct dates N days in the future like so:
</p>
<codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock>
</li>

<li>
<p>
For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient
range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on
for the various numeric data types.
</p>
</li>

<li>
<p>
Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not
required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on
the range of values. For applications that require precise decimal values, such as financial data, you
might need to make more extensive changes to table structure and application logic, such as using
separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs
to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges,
casting, and so on for the various numeric data types.
</p>
</li>

<li>
<p>
<codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in
Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an
alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into
<codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for
details about ranges, casting, and so on for the various numeric data types.
</p>
</li>

<li>
<p>
Most integer types from other systems have equivalents in Impala, perhaps under different names such as
<codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example
<codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range.
Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details
about ranges, casting, and so on for the various numeric data types.
</p>
</li>

<li>
<p>
Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See
<xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the
various numeric data types.
</p>
</li>

<li>
<p>
For any types holding bitwise values, use an integer type with enough range to hold all the relevant
bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about
ranges, casting, and so on for the various numeric data types.
</p>
<p>
For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 255, so to manipulate
8-bit bitfields as positive numbers, switch to the next largest type, <codeph>SMALLINT</codeph>.
</p>
<codeblock>[localhost:21000] > select cast(127*2 as tinyint);
+--------------------------+
| cast(127 * 2 as tinyint) |
+--------------------------+
| -2                       |
+--------------------------+
[localhost:21000] > select cast(128 as tinyint);
+----------------------+
| cast(128 as tinyint) |
+----------------------+
| -128                 |
+----------------------+
[localhost:21000] > select cast(127*2 as smallint);
+---------------------------+
| cast(127 * 2 as smallint) |
+---------------------------+
| 254                       |
+---------------------------+</codeblock>
<p>
Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals.
</p>
</li>

<li>
<p>
For large object types, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or
<codeph>TEXT</codeph> types (character-based large objects) up to 32 KB in size. Binary large objects
such as <codeph>BLOB</codeph>, <codeph>RAW</codeph>, <codeph>BINARY</codeph>, and
<codeph>VARBINARY</codeph> do not currently have an equivalent in Impala.
</p>
</li>

<li>
<p>
For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type.
</p>
</li>

<li>
<p>
Because Impala currently does not support composite or nested types, any spatial data types in other
database systems do not have direct equivalents in Impala. You could represent spatial values in string
format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where
practical, separate spatial types into separate tables so that Impala can still work with the
non-spatial data.
</p>
</li>

<li>
<p>
Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different
sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph>
and external tables mean that Impala is flexible about the format of data files, and Impala does not
necessarily validate or cleanse data before querying it. When copying data through Impala
<codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or
<codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see
<xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
</p>
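<p>
As a sketch of that technique (table and column names are hypothetical), a copy operation can supply a
fallback value during the <codeph>INSERT</codeph> rather than relying on a <codeph>DEFAULT</codeph>
clause:
</p>
<codeblock>-- Replace NULL values with a placeholder while copying between tables.
insert into customers_clean select id, nvl(state, 'unknown') from customers_raw;</codeblock>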
</li>

<li>
<p>
Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>,
<codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or
<codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources,
such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen
earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries
to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph>
statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to
substitute some other value for <codeph>NULL</codeph> fields; see
<xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
</p>
<p>
Do as much verification as practical before loading data into Impala. After data is loaded into Impala,
you can do further verification using SQL queries to check if values have expected ranges, if values
are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to
re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in
Impala to copy the faulty data to a new table and transform or filter out the bad values.
</p>
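<p>
For example, checks along these lines (table and column names are hypothetical) can stand in for the
<codeph>NOT NULL</codeph> and <codeph>CHECK</codeph> constraints enforced by other systems:
</p>
<codeblock>-- Rows that would have violated a NOT NULL constraint.
select count(*) from orders where customer_id is null;
-- Rows outside the range a CHECK constraint would have enforced.
select count(*) from orders where quantity &lt; 0;</codeblock>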
</li>

<li>
<p>
Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER
INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any
<codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from
<codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk
read operations for data warehouse-style queries, and therefore does not support indexes for its
tables.
</p>
</li>

<li>
<p>
Calls to built-in functions with out-of-range or otherwise incorrect arguments return
<codeph>NULL</codeph> in Impala as opposed to raising exceptions. (This rule applies even when the
<codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.) Run small-scale queries using
representative data to doublecheck that calls to built-in functions are returning expected values
rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not
raise an error in Impala:
</p>
<codeblock>select cast('foo' as int);
+--------------------+
| cast('foo' as int) |
+--------------------+
| NULL               |
+--------------------+</codeblock>
</li>

<li>
<p>
For any other type not supported in Impala, you could represent their values in string format and write
UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details.
</p>
</li>

<li>
<p>
To detect the presence of unsupported or unconvertible data types in data files, do initial testing
with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to
fail immediately if they encounter disallowed type conversions. See
<xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example:
</p>
<codeblock>set abort_on_error=true;
select count(*) from (select * from t1);
-- The above query will fail if the data files for T1 contain any
-- values that can't be converted to the expected Impala data types.
-- For example, if T1.C1 is defined as INT but the column contains
-- floating-point values like 1.1, the query will return an error.</codeblock>
</li>
</ul>
</conbody>
</concept>

<concept id="porting_statements">

<title>SQL Statements to Remove or Adapt</title>

<conbody>

<p>
Some SQL statements or clauses that you might be familiar with are not currently supported in Impala:
</p>

<ul>
<li>
<p>
Impala has no <codeph>DELETE</codeph> statement. Impala is intended for data warehouse-style operations
where you do bulk moves and transforms of large quantities of data. Instead of using
<codeph>DELETE</codeph>, use <codeph>INSERT OVERWRITE</codeph> to entirely replace the contents of a
table or partition, or use <codeph>INSERT ... SELECT</codeph> to copy a subset of data (everything but
the rows you intended to delete) from one table to another. See <xref href="impala_dml.xml#dml"/> for
an overview of Impala DML statements.
</p>
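<p>
As a sketch of that approach (table names and the filter condition are hypothetical), keeping
everything except the unwanted rows takes the place of a <codeph>DELETE</codeph>:
</p>
<codeblock>-- Keep only the rows you want, writing them to another table.
insert overwrite orders_current select * from orders where order_date >= '2015-01-01';</codeblock>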
</li>

<li>
<p>
Impala has no <codeph>UPDATE</codeph> statement. Impala is intended for data warehouse-style operations
where you do bulk moves and transforms of large quantities of data. Instead of using
<codeph>UPDATE</codeph>, do all necessary transformations early in the ETL process, such as in the job
that generates the original data, or when copying from one table to another to convert to a particular
file format or partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an overview of Impala DML
statements.
</p>
</li>

<li>
<p>
Impala has no transactional statements, such as <codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>.
Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode in some database systems, where
changes take effect as soon as they are made.
</p>
</li>

<li>
<p>
If your database, table, column, or other names conflict with Impala reserved words, use different
names or quote the names with backticks. See <xref href="impala_reserved_words.xml#reserved_words"/>
for the current list of Impala reserved words.
</p>
<p>
Conversely, if you use a keyword that Impala does not recognize, it might be interpreted as a table or
column alias. For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>, Impala does not
recognize the <codeph>NATURAL</codeph> keyword and interprets it as an alias for the table
<codeph>t1</codeph>. If you experience any unexpected behavior with queries, check the list of reserved
words to make sure all keywords in join and <codeph>WHERE</codeph> clauses are recognized.
</p>
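<p>
For example, a hypothetical column whose name collides with a reserved word can be quoted with
backticks in both the DDL and the queries that reference it:
</p>
<codeblock>create table t1 (`select` string, x int);
select `select`, x from t1;</codeblock>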
</li>

<li>
<p>
Impala supports subqueries only in the <codeph>FROM</codeph> clause of a query, not within
<codeph>WHERE</codeph> clauses. Therefore, you cannot use clauses such as <codeph>WHERE
<varname>column</varname> IN (<varname>subquery</varname>)</codeph>. Also, Impala does not allow
<codeph>EXISTS</codeph> or <codeph>NOT EXISTS</codeph> clauses (although <codeph>EXISTS</codeph> is a
reserved keyword).
</p>
</li>

<li>
<p>
Impala supports <codeph>UNION</codeph> and <codeph>UNION ALL</codeph> set operators, but not
<codeph>INTERSECT</codeph>. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/>
</p>
</li>

<li>
<p>
Within queries, Impala requires query aliases for any subqueries:
</p>
<codeblock>-- Without the alias 'contents_of_t1' at the end, query gives syntax error.
select count(*) from (select * from t1) contents_of_t1;</codeblock>
</li>

<li>
<p>
When an alias is declared for an expression in a query, that alias cannot be referenced again within
the same query block:
</p>
<codeblock>-- Can't reference AVERAGE twice in the SELECT list where it's defined.
select avg(x) as average, average+1 from t1 group by x;
ERROR: AnalysisException: couldn't resolve column reference: 'average'

-- Although it can be referenced again later in the same query.
select avg(x) as average from t1 group by x having average > 3;</codeblock>
<p>
For Impala, either repeat the expression again, or abstract the expression into a <codeph>WITH</codeph>
clause, creating named columns that can be referenced multiple times anywhere in the base query:
</p>
<codeblock>-- The following 2 query forms are equivalent.
select avg(x) as average, avg(x)+1 from t1 group by x;
with avg_t as (select avg(x) average from t1 group by x) select average, average+1 from avg_t;</codeblock>
<!-- An alternative bunch of queries to use in the example above.
[localhost:21000] > select x*x as x_squared from t1;

[localhost:21000] > select x*x as x_squared from t1 where x_squared < 100;
ERROR: AnalysisException: couldn't resolve column reference: 'x_squared'
[localhost:21000] > select x*x as x_squared, x_squared * pi() as pi_x_squared from t1;
ERROR: AnalysisException: couldn't resolve column reference: 'x_squared'
[localhost:21000] > select x*x as x_squared from t1 group by x_squared;

[localhost:21000] > select x*x as x_squared from t1 group by x_squared having x_squared < 100;
-->
</li>

<li>
<p>
Impala does not support certain rarely used join types that are less appropriate for high-volume tables
used for data warehousing. In some cases, Impala supports join types but requires explicit syntax to
ensure you do not do inefficient joins of huge tables by accident. For example, Impala does not support
natural joins or anti-joins, and requires the <codeph>CROSS JOIN</codeph> operator for Cartesian
products. See <xref href="impala_joins.xml#joins"/> for details on the syntax for Impala join clauses.
</p>
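<p>
For instance, a query written as a natural join in another system would be rewritten with an explicit
join condition, while an intentional Cartesian product spells out <codeph>CROSS JOIN</codeph> (table
and column names below are hypothetical):
</p>
<codeblock>-- Explicit join condition instead of NATURAL JOIN.
select t1.id, t2.val from t1 join t2 on t1.id = t2.id;
-- Explicit CROSS JOIN for an intentional Cartesian product.
select t1.id, t2.val from t1 cross join t2;</codeblock>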
</li>

<li>
<p>
Impala has a limited choice of partitioning types. Partitions are defined based on each distinct
combination of values for one or more partition key columns. Impala does not redistribute or check data
to create evenly distributed partitions; you must choose partition key columns based on your knowledge
of the data volume and distribution. Adapt any tables that use range, list, hash, or key partitioning
to use the Impala partition syntax for <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
statements. Impala partitioning is similar to range partitioning where every range has exactly one
value, or key partitioning where the hash function produces a separate bucket for every combination of
key values. See <xref href="impala_partitioning.xml#partitioning"/> for usage details, and
<xref href="impala_create_table.xml#create_table"/> and
<xref href="impala_alter_table.xml#alter_table"/> for syntax.
</p>
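<p>
As a rough sketch of the Impala syntax (schema and column names are hypothetical), a table that used
range partitioning on a date column in another system might become:
</p>
<codeblock>create table sales (id bigint, amount double)
  partitioned by (sale_year int, sale_month int)
  stored as parquet;</codeblock>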
<note>
Because the number of separate partitions is potentially higher than in other database systems, keep a
close eye on the number of partitions and the volume of data in each one; scale back the number of
partition key columns if you end up with too many partitions with a small volume of data in each one.
Remember, to distribute work for a query across a cluster, you need at least one HDFS block per node.
HDFS blocks are typically multiple megabytes, <ph rev="parquet_block_size">especially</ph> for Parquet
files. Therefore, if each partition holds only a few megabytes of data, you are unlikely to see much
parallelism in the query because such a small amount of data is typically processed by a single node.
</note>
</li>

<li>
<p>
For <q>top-N</q> queries, Impala uses the <codeph>LIMIT</codeph> clause rather than comparing against a
pseudocolumn named <codeph>ROWNUM</codeph> or <codeph>ROW_NUM</codeph>. See
<xref href="impala_limit.xml#limit"/> for details.
</p>
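<p>
For example, a <q>top 10</q> query that filtered on <codeph>ROWNUM</codeph> elsewhere might look like
this in Impala (table and column names are hypothetical):
</p>
<codeblock>select customer_id, total_spend from customers order by total_spend desc limit 10;</codeblock>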
</li>
</ul>
</conbody>
</concept>

<concept id="porting_antipatterns">

<title>SQL Constructs to Doublecheck</title>

<conbody>

<p>
Some SQL constructs that are supported have behavior or defaults more oriented towards convenience than
optimal performance. Also, sometimes machine-generated SQL, perhaps issued through JDBC or ODBC
applications, might have inefficiencies or exceed internal Impala limits. As you port SQL code, be alert
and change these things where appropriate:
</p>

<ul>
<li>
<p>
A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files
in plain text format, which is convenient for data interchange but not a good choice for high-volume
data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and
how to use specific file formats for compact data and high-performance queries. Especially see
<xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for
large-scale data warehouse queries.
</p>
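<p>
For example (the table definition is hypothetical), adding the clause is typically a one-line change:
</p>
<codeblock>-- Text format by default; consider naming an explicit file format instead.
create table page_views (ts timestamp, url string) stored as parquet;</codeblock>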
</li>

<li>
<p>
A <codeph>CREATE TABLE</codeph> statement with no <codeph>PARTITIONED BY</codeph> clause stores all the
data files in the same physical location, which can lead to scalability problems when the data volume
becomes large.
</p>
<p>
On the other hand, adapting tables that were already partitioned in a different database system could
produce an Impala table with a high number of partitions and not enough data in each one, leading to
underutilization of Impala's parallel query features.
</p>
<p>
See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and
tuning the performance of queries on partitioned tables.
</p>
</li>

<li>
<p>
The <codeph>INSERT ... VALUES</codeph> syntax is suitable for setting up toy tables with a few rows for
functional testing, but because each such statement creates a separate tiny file in HDFS, it is not a
scalable technique for loading megabytes or gigabytes (let alone petabytes) of data. Consider revising
your data load process to produce raw data files outside of Impala, then setting up Impala external
tables or using the <codeph>LOAD DATA</codeph> statement to use those data files instantly in Impala
tables, with no conversion or indexing stage. See <xref href="impala_tables.xml#external_tables"/> and
<xref href="impala_load_data.xml#load_data"/> for details about the Impala techniques for working with
data files produced outside of Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for examples
of ETL workflow for Impala.
</p>
</li>

<li>
<p>
If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data
files, or a single giant data file that cannot take advantage of distributed parallel queries or
partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a
new table and reorganize into a more efficient layout in the same operation. See
<xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement.
</p>
<p>
You can do <codeph>INSERT ... SELECT</codeph> into a table with a more efficient file format (see
<xref href="impala_file_formats.xml#file_formats"/>) or from an unpartitioned table into a partitioned
one (see <xref href="impala_partitioning.xml#partitioning"/>).
</p>
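<p>
A minimal sketch of that pattern (table names and the partition key column are hypothetical):
</p>
<codeblock>-- Copy from a text-format staging table into a partitioned Parquet table,
-- with the partition key supplied as the last column of the SELECT list.
insert into sales_by_year partition (sale_year)
  select id, amount, sale_year from sales_staging;</codeblock>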
</li>

<li>
<p>
The number of expressions allowed in an Impala query might be smaller than for some other database
systems, causing failures for very complicated queries (typically produced by automated SQL
generators). Where practical, keep the number of expressions in the <codeph>WHERE</codeph> clauses to
approximately 2000 or fewer. As a workaround, set the query option
<codeph>DISABLE_CODEGEN=true</codeph> if queries fail for this reason. See
<xref href="impala_disable_codegen.xml#disable_codegen"/> for details.
</p>
</li>

<li>
<p>
If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator
instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/>
</p>
</li>
</ul>
</conbody>
</concept>

<concept id="porting_next">

<title>Next Porting Steps after Verifying Syntax and Semantics</title>

<conbody>

<p>
Throughout this section, some of the decisions you make during the porting process also have a substantial
impact on performance. After your SQL code is ported and working correctly, doublecheck the
performance-related aspects of your schema design, physical layout, and queries to make sure that the
ported application is taking full advantage of Impala's parallelism, performance-related SQL features, and
integration with Hadoop components.
</p>

<ul>
<li>
Have you run the <codeph>COMPUTE STATS</codeph> statement on each table involved in join queries? Have
you also run <codeph>COMPUTE STATS</codeph> for each table used as the source table in an <codeph>INSERT
... SELECT</codeph> or <codeph>CREATE TABLE AS SELECT</codeph> statement?
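For example, with hypothetical table names, the check is simply whether statements like the following
have been run since the data last changed:
<codeblock>compute stats sales_staging;
compute stats sales;</codeblock>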
</li>

<li>
Are you using the most efficient file format for your data volumes, table structure, and query
characteristics?
</li>

<li>
Are you using partitioning effectively? That is, have you partitioned on columns that are often used for
filtering in <codeph>WHERE</codeph> clauses? Have you partitioned at the right granularity so that there
is enough data in each partition to parallelize the work for each query?
</li>

<li>
Does your ETL process produce a relatively small number of multi-megabyte data files (good) rather than a
huge number of small files (bad)?
</li>
</ul>

<p>
See <xref href="impala_performance.xml#performance"/> for details about the whole performance tuning
process.
</p>
</conbody>
</concept>
</concept>