mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
Change-Id: I3a1edb1ec5076354df9301ce12e7618d7cf607a6 Reviewed-on: http://gerrit.cloudera.org:8080/11154 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
601 lines
30 KiB
XML
601 lines
30 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="porting">
|
|
|
|
<title>Porting SQL from Other Database Systems to Impala</title>
|
|
<titlealts audience="PDF"><navtitle>Porting SQL</navtitle></titlealts>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="SQL"/>
|
|
<data name="Category" value="Databases"/>
|
|
<data name="Category" value="Hive"/>
|
|
<data name="Category" value="Oracle"/>
|
|
<data name="Category" value="MySQL"/>
|
|
<data name="Category" value="PostgreSQL"/>
|
|
<data name="Category" value="Troubleshooting"/>
|
|
<data name="Category" value="Porting"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
<data name="Category" value="Developers"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p> Although Impala uses standard SQL for queries, you might need to modify
|
|
SQL source when bringing applications to Impala, due to variations in data
|
|
types, built-in functions, vendor language extensions, and Hadoop-specific
|
|
syntax. Even when SQL is working correctly, you might make further minor
|
|
modifications for best performance. </p>
|
|
|
|
<p outputclass="toc inpage"/>
|
|
</conbody>
|
|
|
|
<concept id="porting_ddl_dml">
|
|
|
|
<title>Porting DDL and DML Statements</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
When adapting SQL code from a traditional database system to Impala, expect to find a number of differences
|
|
in the DDL statements that you use to set up the schema. Clauses related to physical layout of files,
|
|
tablespaces, and indexes have no equivalent in Impala. You might restructure your schema considerably to
|
|
account for the Impala partitioning scheme and Hadoop file formats.
|
|
</p>
|
|
|
|
<p>
|
|
Expect SQL queries to have a much higher degree of compatibility. With modest rewriting to address vendor
|
|
extensions and features not yet supported in Impala, you might be able to run identical or almost-identical
|
|
query text on both systems.
|
|
</p>
|
|
|
|
<p>
|
|
Therefore, consider separating out the DDL into a separate Impala-specific setup script. Focus your reuse
|
|
and ongoing tuning efforts on the code for SQL queries.
|
|
</p>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="porting_data_types">
|
|
|
|
<title>Porting Data Types from Other Database Systems</title>
|
|
|
|
<conbody>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
Change any <codeph>VARCHAR</codeph>, <codeph>VARCHAR2</codeph>, and <codeph>CHAR</codeph> columns to
|
|
<codeph>STRING</codeph>. Remove any length constraints from the column declarations; for example,
|
|
change <codeph>VARCHAR(32)</codeph> or <codeph>CHAR(1)</codeph> to <codeph>STRING</codeph>. Impala is
|
|
very flexible about the length of string values; it does not impose any length constraints
|
|
or do any special processing (such as blank-padding) for <codeph>STRING</codeph> columns.
|
|
(In Impala 2.0 and higher, there are data types <codeph>VARCHAR</codeph> and <codeph>CHAR</codeph>,
|
|
with length constraints for both types and blank-padding for <codeph>CHAR</codeph>.
|
|
However, for performance reasons, it is still preferable to use <codeph>STRING</codeph>
|
|
columns where practical.)
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
For national language character types such as <codeph>NCHAR</codeph>, <codeph>NVARCHAR</codeph>, or
|
|
<codeph>NCLOB</codeph>, be aware that while Impala can store and query UTF-8 character data, currently
|
|
some string manipulation operations only work correctly with ASCII data. See
|
|
<xref href="impala_string.xml#string"/> for details.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Change any <codeph>DATE</codeph>, <codeph>DATETIME</codeph>, or <codeph>TIME</codeph> columns to
|
|
<codeph>TIMESTAMP</codeph>. Remove any precision constraints. Remove any timezone clauses, and make
|
|
sure your application logic or ETL process accounts for the fact that Impala expects all
|
|
<codeph>TIMESTAMP</codeph> values to be in
|
|
<xref href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" scope="external" format="html">Coordinated
|
|
Universal Time (UTC)</xref>. See <xref href="impala_timestamp.xml#timestamp"/> for information about
|
|
the <codeph>TIMESTAMP</codeph> data type, and
|
|
<xref href="impala_datetime_functions.xml#datetime_functions"/> for conversion functions for different
|
|
date and time formats.
|
|
</p>
|
|
<p>
|
|
You might also need to adapt date- and time-related literal values and format strings to use the
|
|
supported Impala date and time formats. If you have date and time literals with different separators or
|
|
different numbers of <codeph>YY</codeph>, <codeph>MM</codeph>, and so on placeholders than Impala
|
|
expects, consider using calls to <codeph>regexp_replace()</codeph> to transform those values to the
|
|
Impala-compatible format. See <xref href="impala_timestamp.xml#timestamp"/> for information about the
|
|
allowed formats for date and time literals, and
|
|
<xref href="impala_string_functions.xml#string_functions"/> for string conversion functions such as
|
|
<codeph>regexp_replace()</codeph>.
|
|
</p>
|
|
<p>
|
|
Instead of <codeph>SYSDATE</codeph>, call the function <codeph>NOW()</codeph>.
|
|
</p>
|
|
<p>
|
|
Instead of adding or subtracting directly from a date value to produce a value <varname>N</varname>
|
|
days in the past or future, use an <codeph>INTERVAL</codeph> expression, for example <codeph>NOW() +
|
|
INTERVAL 30 DAYS</codeph>.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Although Impala supports <codeph>INTERVAL</codeph> expressions for datetime arithmetic, as shown in
|
|
<xref href="impala_timestamp.xml#timestamp"/>, <codeph>INTERVAL</codeph> is not available as a column
|
|
data type in Impala. For any <codeph>INTERVAL</codeph> values stored in tables, convert them to numeric
|
|
values that you can add or subtract using the functions in
|
|
<xref href="impala_datetime_functions.xml#datetime_functions"/>. For example, if you had a table
|
|
<codeph>DEADLINES</codeph> with an <codeph>INT</codeph> column <codeph>TIME_PERIOD</codeph>, you could
|
|
construct dates N days in the future like so:
|
|
</p>
|
|
<codeblock>SELECT NOW() + INTERVAL time_period DAYS from deadlines;</codeblock>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
For <codeph>YEAR</codeph> columns, change to the smallest Impala integer type that has sufficient
|
|
range. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on
|
|
for the various numeric data types.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Change any <codeph>DECIMAL</codeph> and <codeph>NUMBER</codeph> types. If fixed-point precision is not
|
|
required, you can use <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> on the Impala side depending on
|
|
the range of values. For applications that require precise decimal values, such as financial data, you
|
|
might need to make more extensive changes to table structure and application logic, such as using
|
|
separate integer columns for dollars and cents, or encoding numbers as string values and writing UDFs
|
|
to manipulate them. See <xref href="impala_datatypes.xml#datatypes"/> for details about ranges,
|
|
casting, and so on for the various numeric data types.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
<codeph>FLOAT</codeph>, <codeph>DOUBLE</codeph>, and <codeph>REAL</codeph> types are supported in
|
|
Impala. Remove any precision and scale specifications. (In Impala, <codeph>REAL</codeph> is just an
|
|
alias for <codeph>DOUBLE</codeph>; columns declared as <codeph>REAL</codeph> are turned into
|
|
<codeph>DOUBLE</codeph> behind the scenes.) See <xref href="impala_datatypes.xml#datatypes"/> for
|
|
details about ranges, casting, and so on for the various numeric data types.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Most integer types from other systems have equivalents in Impala, perhaps under different names such as
|
|
<codeph>BIGINT</codeph> instead of <codeph>INT8</codeph>. For any that are unavailable, for example
|
|
<codeph>MEDIUMINT</codeph>, switch to the smallest Impala integer type that has sufficient range.
|
|
Remove any precision specifications. See <xref href="impala_datatypes.xml#datatypes"/> for details
|
|
about ranges, casting, and so on for the various numeric data types.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Remove any <codeph>UNSIGNED</codeph> constraints. All Impala numeric types are signed. See
|
|
<xref href="impala_datatypes.xml#datatypes"/> for details about ranges, casting, and so on for the
|
|
various numeric data types.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
For any types holding bitwise values, use an integer type with enough range to hold all the relevant
|
|
bits within a positive integer. See <xref href="impala_datatypes.xml#datatypes"/> for details about
|
|
ranges, casting, and so on for the various numeric data types.
|
|
</p>
|
|
<p>
|
|
For example, <codeph>TINYINT</codeph> has a maximum positive value of 127, not 256, so to manipulate
|
|
8-bit bitfields as positive numbers switch to the next largest type <codeph>SMALLINT</codeph>.
|
|
</p>
|
|
<codeblock>[localhost:21000] > select cast(127*2 as tinyint);
|
|
+--------------------------+
|
|
| cast(127 * 2 as tinyint) |
|
|
+--------------------------+
|
|
| -2 |
|
|
+--------------------------+
|
|
[localhost:21000] > select cast(128 as tinyint);
|
|
+----------------------+
|
|
| cast(128 as tinyint) |
|
|
+----------------------+
|
|
| -128 |
|
|
+----------------------+
|
|
[localhost:21000] > select cast(127*2 as smallint);
|
|
+---------------------------+
|
|
| cast(127 * 2 as smallint) |
|
|
+---------------------------+
|
|
| 254 |
|
|
+---------------------------+</codeblock>
|
|
<p>
|
|
Impala does not support notation such as <codeph>b'0101'</codeph> for bit literals.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
For BLOB values, use <codeph>STRING</codeph> to represent <codeph>CLOB</codeph> or
|
|
<codeph>TEXT</codeph> types (character based large objects) up to 32 KB in size. Binary large objects
|
|
such as <codeph>BLOB</codeph>, <codeph>RAW</codeph> <codeph>BINARY</codeph>, and
|
|
<codeph>VARBINARY</codeph> do not currently have an equivalent in Impala.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
For Boolean-like types such as <codeph>BOOL</codeph>, use the Impala <codeph>BOOLEAN</codeph> type.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Because Impala currently does not support composite or nested types, any spatial data types in other
|
|
database systems do not have direct equivalents in Impala. You could represent spatial values in string
|
|
format and write UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details. Where
|
|
practical, separate spatial types into separate tables so that Impala can still work with the
|
|
non-spatial data.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Take out any <codeph>DEFAULT</codeph> clauses. Impala can use data files produced from many different
|
|
sources, such as Pig, Hive, or MapReduce jobs. The fast import mechanisms of <codeph>LOAD DATA</codeph>
|
|
and external tables mean that Impala is flexible about the format of data files, and Impala does not
|
|
necessarily validate or cleanse data before querying it. When copying data through Impala
|
|
<codeph>INSERT</codeph> statements, you can use conditional functions such as <codeph>CASE</codeph> or
|
|
<codeph>NVL</codeph> to substitute some other value for <codeph>NULL</codeph> fields; see
|
|
<xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Take out any constraints from your <codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
|
|
statements, for example <codeph>PRIMARY KEY</codeph>, <codeph>FOREIGN KEY</codeph>,
|
|
<codeph>UNIQUE</codeph>, <codeph>NOT NULL</codeph>, <codeph>UNSIGNED</codeph>, or
|
|
<codeph>CHECK</codeph> constraints. Impala can use data files produced from many different sources,
|
|
such as Pig, Hive, or MapReduce jobs. Therefore, Impala expects initial data validation to happen
|
|
earlier during the ETL or ELT cycle. After data is loaded into Impala tables, you can perform queries
|
|
to test for <codeph>NULL</codeph> values. When copying data through Impala <codeph>INSERT</codeph>
|
|
statements, you can use conditional functions such as <codeph>CASE</codeph> or <codeph>NVL</codeph> to
|
|
substitute some other value for <codeph>NULL</codeph> fields; see
|
|
<xref href="impala_conditional_functions.xml#conditional_functions"/> for details.
|
|
</p>
|
|
<p>
|
|
Do as much verification as practical before loading data into Impala. After data is loaded into Impala,
|
|
you can do further verification using SQL queries to check if values have expected ranges, if values
|
|
are <codeph>NULL</codeph> or not, and so on. If there is a problem with the data, you will need to
|
|
re-run earlier stages of the ETL process, or do an <codeph>INSERT ... SELECT</codeph> statement in
|
|
Impala to copy the faulty data to a new table and transform or filter out the bad values.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Take out any <codeph>CREATE INDEX</codeph>, <codeph>DROP INDEX</codeph>, and <codeph>ALTER
|
|
INDEX</codeph> statements, and equivalent <codeph>ALTER TABLE</codeph> statements. Remove any
|
|
<codeph>INDEX</codeph>, <codeph>KEY</codeph>, or <codeph>PRIMARY KEY</codeph> clauses from
|
|
<codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph> statements. Impala is optimized for bulk
|
|
read operations for data warehouse-style queries, and therefore does not support indexes for its
|
|
tables.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
Calls to built-in functions with out-of-range or otherwise incorrect arguments, return
|
|
<codeph>NULL</codeph> in Impala as opposed to raising exceptions. (This rule applies even when the
|
|
<codeph>ABORT_ON_ERROR=true</codeph> query option is in effect.) Run small-scale queries using
|
|
representative data to doublecheck that calls to built-in functions are returning expected values
|
|
rather than <codeph>NULL</codeph>. For example, unsupported <codeph>CAST</codeph> operations do not
|
|
raise an error in Impala:
|
|
</p>
|
|
<codeblock>select cast('foo' as int);
|
|
+--------------------+
|
|
| cast('foo' as int) |
|
|
+--------------------+
|
|
| NULL |
|
|
+--------------------+</codeblock>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
For any other type not supported in Impala, you could represent their values in string format and write
|
|
UDFs to process them. See <xref href="impala_udf.xml#udfs"/> for details.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
To detect the presence of unsupported or unconvertable data types in data files, do initial testing
|
|
with the <codeph>ABORT_ON_ERROR=true</codeph> query option in effect. This option causes queries to
|
|
fail immediately if they encounter disallowed type conversions. See
|
|
<xref href="impala_abort_on_error.xml#abort_on_error"/> for details. For example:
|
|
</p>
|
|
<codeblock>set abort_on_error=true;
|
|
select count(*) from (select * from t1);
|
|
-- The above query will fail if the data files for T1 contain any
|
|
-- values that can't be converted to the expected Impala data types.
|
|
-- For example, if T1.C1 is defined as INT but the column contains
|
|
-- floating-point values like 1.1, the query will return an error.</codeblock>
|
|
</li>
|
|
</ul>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="porting_statements">
|
|
|
|
<title>SQL Statements to Remove or Adapt</title>
|
|
|
|
<conbody>
|
|
|
|
<p> The following SQL statements or clauses are not currently supported or
|
|
supported with limitations in Impala: </p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p> Impala supports the <codeph>DELETE</codeph> statement only for
|
|
Kudu tables. </p>
|
|
<p>Impala is intended for data warehouse-style operations where you do
|
|
bulk moves and transforms of large quantities of data. When not
|
|
using Kudu tables, instead of <codeph>DELETE</codeph>, use
|
|
<codeph>INSERT OVERWRITE</codeph> to entirely replace the contents
|
|
of a table or partition, or use <codeph>INSERT ... SELECT</codeph>
|
|
to copy a subset of data (everything but the rows you intended to
|
|
delete) from one table to another. See <xref
|
|
href="impala_dml.xml#dml"/> for an overview of Impala DML
|
|
statements. </p>
|
|
</li>
|
|
<li>
|
|
<p> Impala supports the <codeph>UPDATE</codeph> statement only for
|
|
Kudu tables.</p>
|
|
<p>When not using Kudu tables, instead of <codeph>UPDATE</codeph>, do
|
|
all necessary transformations early in the ETL process, such as in
|
|
the job that generates the original data, or when copying from one
|
|
table to another to convert to a particular file format or
|
|
partitioning scheme. See <xref href="impala_dml.xml#dml"/> for an
|
|
overview of Impala DML statements. </p>
|
|
</li>
|
|
<li>
|
|
<p> Impala has no transactional statements, such as
|
|
<codeph>COMMIT</codeph> or <codeph>ROLLBACK</codeph>. </p>
|
|
<p>Impala effectively works like the <codeph>AUTOCOMMIT</codeph> mode
|
|
in some database systems, where changes take effect as soon as they
|
|
are made. </p>
|
|
</li>
|
|
<li>
|
|
<p> If your database, table, column, or other names conflict with
|
|
Impala reserved words, use different names or quote the names with
|
|
backticks. </p>
|
|
<p>See <xref href="impala_reserved_words.xml#reserved_words"/> for the
|
|
current list of Impala reserved words. </p>
|
|
<p> Conversely, if you use a keyword that Impala does not recognize,
|
|
it might be interpreted as a table or column alias. </p>
|
|
<p>For example, in <codeph>SELECT * FROM t1 NATURAL JOIN t2</codeph>,
|
|
Impala does not recognize the <codeph>NATURAL</codeph> keyword and
|
|
interprets it as an alias for the table <codeph>t1</codeph>. If you
|
|
experience any unexpected behavior with queries, check the list of
|
|
reserved words to make sure all keywords in join and
|
|
<codeph>WHERE</codeph> clauses are supported keywords in Impala.
|
|
</p>
|
|
</li>
|
|
<li>
|
|
<p>Impala has some restrictions on subquery support. See <keyword
|
|
keyref="subqueries"/> for the current details.</p>
|
|
</li>
|
|
<li>
|
|
<p> Impala supports <codeph>UNION</codeph> and <codeph>UNION
|
|
ALL</codeph> set operators, but not <codeph>INTERSECT</codeph>. </p>
|
|
<p><ph conref="../shared/impala_common.xml#common/union_all_vs_union"
|
|
/>
|
|
</p>
|
|
</li>
|
|
<li><p>Impala requires query aliases for the subqueries used as inline
|
|
views in the <codeph>FROM</codeph> clause. </p><p>For example,
|
|
without the alias <codeph>contents_of_t1</codeph> at the end, the
|
|
following query gives a syntax
|
|
error:<codeblock>SELECT COUNT(*) FROM (SELECT * FROM t1) contents_of_t1;</codeblock></p>Aliases
|
|
are not required for the subqueries used in other parts of queries.
|
|
For
|
|
example:<codeblock>SELECT * FROM functional.alltypes WHERE id = (SELECT MIN(id) FROM functional.alltypes);
|
|
</codeblock></li>
|
|
<li>
|
|
<p> When an alias is declared for an expression in a query, that alias
|
|
cannot be referenced again within the same <codeph>SELECT</codeph>
|
|
list.</p>
|
|
<p>For example, the <codeph>average</codeph> alias cannot be
|
|
referenced twice in the <codeph>SELECT</codeph> list as below. You
|
|
will receive an error:</p>
|
|
<codeblock>SELECT AVG(x) AS average, average+1 FROM t1 GROUP BY x;</codeblock>
|
|
<p>An alias can be referenced again in the same query if not in the
|
|
<codeph>SELECT</codeph> list. For example, the
|
|
<codeph>average</codeph> alias can be referenced twice as shown
|
|
below:<codeblock>SELECT AVG(x) AS average FROM t1 GROUP BY x HAVING average > 3;</codeblock></p>
|
|
</li>
|
|
<li>
|
|
<p> Impala does not support <codeph>NATURAL JOIN</codeph>, and it does
|
|
not support the <codeph>USING</codeph> clause in joins. See <xref
|
|
href="impala_joins.xml#joins"/> for details on the syntax for
|
|
Impala join clauses. </p>
|
|
</li>
|
|
<li>
|
|
<p> Impala supports a limited choice of partitioning types. </p>
|
|
<p>Partitions are defined based on each distinct combination of values
|
|
for one or more partition key columns. Impala does not redistribute
|
|
or check data to create evenly distributed partitions. You must
|
|
choose partition key columns based on your knowledge of the data
|
|
volume and distribution. Adapt any tables that use range, list,
|
|
hash, or key partitioning to use the Impala partition syntax for
|
|
<codeph>CREATE TABLE</codeph> and <codeph>ALTER TABLE</codeph>
|
|
statements. </p>
|
|
<p>Impala partitioning is similar to range partitioning where every
|
|
range has exactly one value, or key partitioning where the hash
|
|
function produces a separate bucket for every combination of key
|
|
values. See <xref href="impala_partitioning.xml#partitioning"/> for
|
|
usage details, and <xref href="impala_create_table.xml#create_table"
|
|
/> and <xref href="impala_alter_table.xml#alter_table"/> for syntax. </p>
|
|
<note> Because the number of separate partitions is potentially higher
|
|
than in other database systems, keep a close eye on the number of
|
|
partitions and the volume of data in each one; scale back the number
|
|
of partition key columns if you end up with too many partitions with
|
|
a small volume of data in each one. <p>To distribute work for a
|
|
query across a cluster, you need at least one HDFS block per node.
|
|
HDFS blocks are typically multiple megabytes, <ph
|
|
rev="parquet_block_size">especially</ph> for Parquet files.
|
|
Therefore, if each partition holds only a few megabytes of data,
|
|
you are unlikely to see much parallelism in the query because such
|
|
a small amount of data is typically processed by a single node.
|
|
</p></note>
|
|
</li>
|
|
<li>
|
|
<p> For the <q>top-N</q> queries, Impala uses the
|
|
<codeph>LIMIT</codeph> clause rather than comparing against a
|
|
pseudo column named <codeph>ROWNUM</codeph> or
|
|
<codeph>ROW_NUM</codeph>. </p>
|
|
<p>See <xref href="impala_limit.xml#limit"/> for details. </p>
|
|
</li>
|
|
</ul>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="porting_antipatterns">
|
|
|
|
<title>SQL Constructs to Double-check</title>
|
|
|
|
<conbody>
|
|
|
|
<p> Some SQL constructs that are supported have behavior or defaults more
|
|
oriented towards convenience than optimal performance. Also, sometimes
|
|
machine-generated SQL, perhaps issued through JDBC or ODBC applications,
|
|
might have inefficiencies or exceed internal Impala limits. As you port
|
|
SQL code, examine and possibly update the following where appropriate: </p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>
|
|
A <codeph>CREATE TABLE</codeph> statement with no <codeph>STORED AS</codeph> clause creates data files
|
|
in plain text format, which is convenient for data interchange but not a good choice for high-volume
|
|
data with high-performance queries. See <xref href="impala_file_formats.xml#file_formats"/> for why and
|
|
how to use specific file formats for compact data and high-performance queries. Especially see
|
|
<xref href="impala_parquet.xml#parquet"/>, for details about the file format most heavily optimized for
|
|
large-scale data warehouse queries.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p> Adapting tables that were already partitioned in a different
|
|
database system could produce an Impala table with a high number of
|
|
partitions and not enough data in each one, leading to
|
|
underutilization of Impala's parallel query features. </p>
|
|
<p>
|
|
See <xref href="impala_partitioning.xml#partitioning"/> for details about setting up partitioning and
|
|
tuning the performance of queries on partitioned tables.
|
|
</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p> The <codeph>INSERT ... VALUES</codeph> syntax is suitable for
|
|
setting up toy tables with a few rows for functional testing when
|
|
used with HDFS. Each such statement creates a separate tiny file in
|
|
HDFS, and it is not a scalable technique for loading megabytes or
|
|
gigabytes (let alone petabytes) of data. </p>
|
|
<p>Consider revising your data load process to produce raw data files
|
|
outside of Impala, then setting up Impala external tables or using
|
|
the <codeph>LOAD DATA</codeph> statement to use those data files
|
|
instantly in Impala tables, with no conversion or indexing stage.
|
|
See <xref href="impala_tables.xml#external_tables"/> and <xref
|
|
href="impala_load_data.xml#load_data"/> for details about the
|
|
Impala techniques for working with data files produced outside of
|
|
Impala; see <xref href="impala_tutorial.xml#tutorial_etl"/> for
|
|
examples of ETL workflow for Impala. </p>
|
|
<p><codeph>INSERT</codeph> works fine for Kudu tables even though not
|
|
particularly fast.</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If your ETL process is not optimized for Hadoop, you might end up with highly fragmented small data
|
|
files, or a single giant data file that cannot take advantage of distributed parallel queries or
|
|
partitioning. In this case, use an <codeph>INSERT ... SELECT</codeph> statement to copy the data into a
|
|
new table and reorganize into a more efficient layout in the same operation. See
|
|
<xref href="impala_insert.xml#insert"/> for details about the <codeph>INSERT</codeph> statement.
|
|
</p>
|
|
<p> You can do <codeph>INSERT ... SELECT</codeph> into a table with a
|
|
more efficient file format (see <xref
|
|
href="impala_file_formats.xml#file_formats"/>) or from an
|
|
unpartitioned table into a partitioned one. See <xref
|
|
href="impala_partitioning.xml#partitioning"/>. </p>
|
|
</li>
|
|
|
|
<li>
|
|
<p> Complex queries may have high codegen time. As a workaround, set
|
|
the query option <codeph>DISABLE_CODEGEN=true</codeph> if queries
|
|
fail for this reason. See <xref
|
|
href="impala_disable_codegen.xml#disable_codegen"/> for details. </p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>
|
|
If practical, rewrite <codeph>UNION</codeph> queries to use the <codeph>UNION ALL</codeph> operator
|
|
instead. <ph conref="../shared/impala_common.xml#common/union_all_vs_union"/>
|
|
</p>
|
|
</li>
|
|
</ul>
|
|
</conbody>
|
|
</concept>
|
|
|
|
<concept id="porting_next">
|
|
|
|
<title>Next Porting Steps after Verifying Syntax and Semantics</title>
|
|
|
|
<conbody>
|
|
|
|
<p> Some of the decisions you make during the porting process can have an
|
|
impact on performance. After your SQL code is ported and working
|
|
correctly, examine the performance-related aspects of your schema
|
|
design, physical layout, and queries to make sure that the ported
|
|
application is taking full advantage of Impala's parallelism,
|
|
performance-related SQL features, and integration with Hadoop
|
|
components. The following are a few of the areas you should examine:</p>
|
|
|
|
<ul>
|
|
<li> For the optimal performance, we recommend that you run
|
|
<codeph>COMPUTE STATS</codeph> on all tables.</li>
|
|
|
|
<li> Use the most efficient file format for your data volumes, table
|
|
structure, and query characteristics.</li>
|
|
|
|
<li> Partition on columns that are often used for filtering in
|
|
<codeph>WHERE</codeph> clauses.</li>
|
|
|
|
<li> Your ETL process should produce a relatively small number of
|
|
multi-megabyte data files rather than a huge number of small
|
|
files.</li>
|
|
</ul>
|
|
|
|
<p> See <xref href="impala_performance.xml#performance"/> for details
|
|
about the performance tuning process. </p>
|
|
</conbody>
|
|
</concept>
|
|
</concept>
|