EXPLAIN statement
Returns the execution plan for a statement, showing the low-level mechanisms that Impala will use to read the
data, divide the work among nodes in the cluster, and transmit intermediate and final results across the
network. Use explain followed by a complete SELECT query. For example:
EXPLAIN { select_query | ctas_stmt | insert_stmt }
The select_query is a SELECT statement, optionally prefixed by a
WITH clause. See for details.
The insert_stmt is an INSERT statement that inserts into or overwrites an
existing table. It can use either the INSERT ... SELECT or INSERT ...
VALUES syntax. See for details.
The ctas_stmt is a CREATE TABLE statement using the AS
SELECT clause, typically abbreviated as a CTAS
operation. See
for details.
You can interpret the output to judge whether the query is performing efficiently, and adjust the query
and/or the schema if not. For example, you might change the tests in the WHERE clause, add
hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add
or change partitioning for a table, collect column statistics and/or table statistics in Hive, or any other
performance tuning steps.
The EXPLAIN output reminds you if table or column statistics are missing from any table
involved in the query. These statistics are important for optimizing queries involving large tables or
multi-table joins. See for how to gather statistics,
and for how to use this information for query tuning.
If you come from a traditional database background and are not familiar with data warehousing, keep in mind
that Impala is optimized for full table scans across very large tables. The structure and distribution of
this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP
environments. Seeing a query scan entirely through a large table is common, not necessarily an indication of
an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for
example by using a query that affects only certain partitions within a partitioned table, then you might be
able to optimize a query so that it executes in seconds rather than minutes.
For more information and examples to help you interpret EXPLAIN output, see
.
Extended EXPLAIN output:
For performance tuning of complex queries, and capacity planning (such as using the admission control and
resource management features), you can enable more detailed and informative output for the
EXPLAIN statement. In the impala-shell interpreter, issue the command
SET EXPLAIN_LEVEL=level, where level is an integer
from 0 to 3 or corresponding mnemonic values minimal, standard,
extended, or verbose.
When extended EXPLAIN output is enabled, EXPLAIN statements print
information about estimated memory requirements, minimum number of virtual cores, and so on.
See for details and examples.
This example shows how the standard EXPLAIN output moves from the lowest (physical) level to
the higher (logical) levels. The query begins by scanning a certain amount of data; each node performs an
aggregation operation (evaluating COUNT(*)) on some subset of data that is local to that
node; the intermediate results are transmitted back to the coordinator node (labelled here as the
EXCHANGE node); lastly, the intermediate results are summed to display the final result.
[impalad-host:21000] > explain select count(*) from customer_address;
+----------------------------------------------------------+
| Explain String |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=42.00MB VCores=1 |
| |
| 03:AGGREGATE [MERGE FINALIZE] |
| | output: sum(count(*)) |
| | |
| 02:EXCHANGE [PARTITION=UNPARTITIONED] |
| | |
| 01:AGGREGATE |
| | output: count(*) |
| | |
| 00:SCAN HDFS [default.customer_address] |
| partitions=1/1 size=5.25MB |
+----------------------------------------------------------+
These examples show how the extended EXPLAIN output becomes more accurate and informative as
statistics are gathered by the COMPUTE STATS statement. Initially, much of the information
about data size and distribution is marked unavailable
. Impala can determine the raw data size, but
not the number of rows or number of distinct values for each column without additional analysis. The
COMPUTE STATS statement performs this analysis, so a subsequent EXPLAIN
statement has additional information to use in deciding how to optimize the distributed query.
[localhost:21000] > set explain_level=extended;
EXPLAIN_LEVEL set to extended
[localhost:21000] > explain select x from t1;
[localhost:21000] > explain select x from t1;
+----------------------------------------------------------+
| Explain String |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=32.00MB VCores=1 |
| |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| | hosts=1 per-host-mem=unavailable |
| | tuple-ids=0 row-size=4B cardinality=unavailable |
| | |
| 00:SCAN HDFS [default.t2, PARTITION=RANDOM] |
| partitions=1/1 size=36B |
| table stats: unavailable |
| column stats: unavailable |
| hosts=1 per-host-mem=32.00MB |
| tuple-ids=0 row-size=4B cardinality=unavailable |
+----------------------------------------------------------+
[localhost:21000] > compute stats t1;
+-----------------------------------------+
| summary |
+-----------------------------------------+
| Updated 1 partition(s) and 1 column(s). |
+-----------------------------------------+
[localhost:21000] > explain select x from t1;
+----------------------------------------------------------+
| Explain String |
+----------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=64.00MB VCores=1 |
| |
| 01:EXCHANGE [PARTITION=UNPARTITIONED] |
| | hosts=1 per-host-mem=unavailable |
| | tuple-ids=0 row-size=4B cardinality=0 |
| | |
| 00:SCAN HDFS [default.t1, PARTITION=RANDOM] |
| partitions=1/1 size=36B |
| table stats: 0 rows total |
| column stats: all |
| hosts=1 per-host-mem=64.00MB |
| tuple-ids=0 row-size=4B cardinality=0 |
+----------------------------------------------------------+
The user ID that the impalad daemon runs under,
typically the impala user, must have read
and execute permissions for all applicable directories in all source tables
for the query that is being explained.
(A SELECT operation could read files from multiple different HDFS directories
if the source table is partitioned.)
The EXPLAIN statement displays equivalent plan
information for queries against Kudu tables as for queries
against HDFS-based tables.
To see which predicates Impala can push down
to Kudu for
efficient evaluation, without transmitting unnecessary rows back
to Impala, look for the kudu predicates item in
the scan phase of the query. The label kudu predicates
indicates a condition that can be evaluated efficiently on the Kudu
side. The label predicates in a SCAN KUDU
node indicates a condition that is evaluated by Impala.
For example, in a table with primary key column X
and non-primary key column Y, you can see that
some operators in the WHERE clause are evaluated
immediately by Kudu and others are evaluated later by Impala:
EXPLAIN SELECT x,y from kudu_table WHERE
x = 1 AND y NOT IN (2,3) AND z = 1
AND a IS NOT NULL AND b > 0 AND length(s) > 5;
+----------------
| Explain String
+----------------
...
| 00:SCAN KUDU [kudu_table]
| predicates: y NOT IN (2, 3), length(s) > 5
| kudu predicates: a IS NOT NULL, b > 0, x = 1, z = 1
Only binary predicates, IS NULL and IS NOT NULL
(in and higher), and IN predicates
containing literal values that exactly match the types in the Kudu table, and do not
require any casting, can be pushed to Kudu.
,
,
,