<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="hints">

  <title>Query Hints in Impala SELECT Statements</title>
  <titlealts audience="PDF"><navtitle>Hints</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Performance"/>
      <data name="Category" value="Troubleshooting"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="hidden">hints</indexterm>
      The Impala SQL dialect supports query hints for fine-tuning the inner workings of queries. Specify hints as
      a temporary workaround for expensive queries where missing statistics or other factors cause poor
      performance.
    </p>

    <p>
      Hints are most often used for the following resource-intensive kinds of Impala queries:
    </p>

    <ul>
      <li>
        Join queries involving large tables, where intermediate result sets are transmitted across the network to
        evaluate the join conditions.
      </li>

      <li>
        Inserting into partitioned Parquet tables, where many memory buffers could be allocated on each host to
        hold intermediate results for each partition.
      </li>
    </ul>

    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

    <p>
      You can also specify hints as keywords enclosed in square brackets (<codeph>[]</codeph>);
      the brackets are part of the text of the SQL statement.
      <note conref="../shared/impala_common.xml#common/square_bracket_hint_caveat"/>
    </p>

<codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
<varname>join_left_hand_table</varname>
  JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }]
<varname>join_right_hand_table</varname>
<varname>remainder_of_query</varname>;

INSERT <varname>insert_clauses</varname>
  [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }]
  [<ph rev="IMPALA-2522 2.8.0">/* +CLUSTERED */</ph>]
  SELECT <varname>remainder_of_query</varname>;
</codeblock>

    <p rev="2.0.0">
      In <keyword keyref="impala20_full"/> and higher, you can also specify the hints inside comments that use
      either the <codeph>/* */</codeph> or <codeph>--</codeph> notation. Specify a <codeph>+</codeph> symbol
      immediately before the first hint name. Recently added hints are only available using the <codeph>/* */</codeph>
      and <codeph>--</codeph> notation.
      For clarity, the <codeph>/* */</codeph> and <codeph>--</codeph> styles
      are used in the syntax and examples throughout this section.
      To specify multiple hints, separate them with commas, for example
      <codeph>/* +clustered,shuffle */</codeph>.
    </p>

<codeblock rev="2.0.0">SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM
<varname>join_left_hand_table</varname>
  JOIN /* +BROADCAST|SHUFFLE */
<varname>join_right_hand_table</varname>
<varname>remainder_of_query</varname>;

SELECT <varname>select_list</varname> FROM
<varname>join_left_hand_table</varname>
  JOIN -- +BROADCAST|SHUFFLE
<varname>join_right_hand_table</varname>
<varname>remainder_of_query</varname>;

INSERT <varname>insert_clauses</varname>
  /* +SHUFFLE|NOSHUFFLE */
  SELECT <varname>remainder_of_query</varname>;

INSERT <varname>insert_clauses</varname>
  -- +SHUFFLE|NOSHUFFLE
  SELECT <varname>remainder_of_query</varname>;

<ph rev="IMPALA-2924">SELECT <varname>select_list</varname> FROM
<varname>table_ref</varname>
  /* +{SCHEDULE_CACHE_LOCAL | SCHEDULE_DISK_LOCAL | SCHEDULE_REMOTE}
    [,RANDOM_REPLICA] */
<varname>remainder_of_query</varname>;</ph>

<ph rev="IMPALA-2522 2.8.0">INSERT <varname>insert_clauses</varname>
  -- +CLUSTERED
  SELECT <varname>remainder_of_query</varname>;

INSERT <varname>insert_clauses</varname>
  /* +CLUSTERED */
  SELECT <varname>remainder_of_query</varname>;</ph>
</codeblock>

    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

    <p>
      With both forms of hint syntax, include the <codeph>STRAIGHT_JOIN</codeph>
      keyword immediately after the <codeph>SELECT</codeph> and any
      <codeph>DISTINCT</codeph> or <codeph>ALL</codeph> keywords to prevent Impala from
      reordering the tables in a way that makes the join-related hints ineffective.
    </p>
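
    <p>
      As a minimal sketch of that keyword placement, the following query uses two hypothetical tables,
      <codeph>orders</codeph> and <codeph>region_lookup</codeph>, which are assumptions for illustration only.
      <codeph>STRAIGHT_JOIN</codeph> follows <codeph>DISTINCT</codeph>, and the hint follows the
      <codeph>JOIN</codeph> keyword:
    </p>

<codeblock>-- Hypothetical tables: orders (large), region_lookup (small).
select distinct straight_join orders.region_id, region_lookup.region_name
  from orders join /* +broadcast */ region_lookup
  on orders.region_id = region_lookup.region_id;</codeblock>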

    <p conref="../shared/impala_common.xml#common/straight_join_nested_queries"/>

    <p>
      To reduce the need to use hints, run the <codeph>COMPUTE STATS</codeph> statement against all tables involved
      in joins, or used as the source tables for <codeph>INSERT ... SELECT</codeph> operations where the
      destination is a partitioned Parquet table. Do this operation after loading data or making substantial
      changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query
      plans without the need for hinting. See <xref href="impala_perf_stats.xml#perf_stats"/> for details and
      examples.
    </p>

    <p>
      To see which join strategy is used for a particular query, examine the <codeph>EXPLAIN</codeph> output for
      that query. See <xref href="impala_explain_plan.xml#perf_explain"/> for details and examples.
    </p>
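
    <p>
      For instance, you might prefix a hinted query with <codeph>EXPLAIN</codeph> to check whether the hint
      took effect. This sketch reuses the <codeph>customer</codeph> and <codeph>state_lookup</codeph> table
      names from the examples in this topic:
    </p>

<codeblock>-- Display the plan without running the query.
explain select straight_join customer.address, state_lookup.state_name
  from customer join /* +broadcast */ state_lookup
  on customer.state_id = state_lookup.state_id;</codeblock>

    <p>
      In the plan text, the join node is typically annotated with its distribution mode, such as
      <codeph>BROADCAST</codeph> or <codeph>PARTITIONED</codeph>. The exact layout of the plan varies by
      release.
    </p>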

    <p>
      <b>Hints for join queries:</b>
    </p>

    <p>
      The <codeph>/* +BROADCAST */</codeph> and <codeph>/* +SHUFFLE */</codeph> hints control the execution strategy for join
      queries. Specify one of the following constructs immediately after the <codeph>JOIN</codeph> keyword in a
      query:
    </p>

    <ul>
      <li>
        <codeph>/* +SHUFFLE */</codeph> - Makes that join operation use the <q>partitioned</q> technique, which divides
        up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other
        nodes for processing. (The keyword <codeph>SHUFFLE</codeph> is used to indicate a <q>partitioned join</q>,
        because that type of join is not related to <q>partitioned tables</q>.) Because the alternative
        <q>broadcast</q> join mechanism is the default when table and column statistics are unavailable, you might
        use this hint for queries where broadcast joins are unsuitable. Typically, partitioned joins are more
        efficient for joins between large tables of similar size.
      </li>

      <li>
        <codeph>/* +BROADCAST */</codeph> - Makes that join operation use the <q>broadcast</q> technique that sends the
        entire contents of the right-hand table to all nodes involved in processing the join. This is the default
        mode of operation when table and column statistics are unavailable, so you would typically only need it if
        stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins
        are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the
        right side of the <codeph>JOIN</codeph> operator.)
      </li>
    </ul>

    <p>
      <b>Hints for INSERT ... SELECT queries:</b>
    </p>

    <p conref="../shared/impala_common.xml#common/insert_hints"/>
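
    <p>
      As a sketch of the placement shown in the syntax above, the hint goes immediately before the
      <codeph>SELECT</codeph> keyword of the <codeph>INSERT ... SELECT</codeph> statement. The table and column
      names here are hypothetical and serve only as an illustration:
    </p>

<codeblock>-- Hypothetical tables: sales_parquet (partitioned Parquet table), sales_staging (source).
insert into sales_parquet partition (year, month) /* +shuffle */
  select id, amount, year, month from sales_staging;</codeblock>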

    <p rev="IMPALA-2924">
      <b>Hints for scheduling of HDFS blocks:</b>
    </p>

    <p rev="IMPALA-2924">
      The hints <codeph>/* +SCHEDULE_CACHE_LOCAL */</codeph>,
      <codeph>/* +SCHEDULE_DISK_LOCAL */</codeph>, and
      <codeph>/* +SCHEDULE_REMOTE */</codeph> have the same effect
      as specifying the <codeph>REPLICA_PREFERENCE</codeph> query
      option with the respective option settings of <codeph>CACHE_LOCAL</codeph>,
      <codeph>DISK_LOCAL</codeph>, or <codeph>REMOTE</codeph>.
      The hint <codeph>/* +RANDOM_REPLICA */</codeph> is the same as
      enabling the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option.
    </p>

    <p rev="IMPALA-2924">
      You can use these hints in combination by separating them with commas,
      for example, <codeph>/* +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>.
      See <xref keyref="replica_preference"/> and
      <xref keyref="schedule_random_replica"/> for information about how
      these settings influence the way Impala processes HDFS data blocks.
    </p>

    <p rev="IMPALA-2924">
      Specifying the replica preference as a query hint always overrides the
      query option setting. Specifying either the <codeph>SCHEDULE_RANDOM_REPLICA</codeph>
      query option or the corresponding <codeph>RANDOM_REPLICA</codeph> query hint
      enables the random tie-breaking behavior when processing data blocks
      during the query.
    </p>
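
    <p rev="IMPALA-2924">
      The following sketch contrasts the two approaches, assuming a hypothetical table named
      <codeph>sales</codeph>. The first query uses hints placed after the table reference, as shown in the
      syntax above; the second relies on the equivalent query options set in an <cmdname>impala-shell</cmdname>
      session:
    </p>

<codeblock rev="IMPALA-2924">-- Hint form, placed after the table reference (hypothetical table: sales).
select count(*) from sales /* +SCHEDULE_DISK_LOCAL,RANDOM_REPLICA */;

-- Query option form, which stays in effect for later queries in the session.
set replica_preference=disk_local;
set schedule_random_replica=true;
select count(*) from sales;</codeblock>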

    <p>
      <b>Suggestions versus directives:</b>
    </p>

    <p>
      In early Impala releases, hints were always obeyed and so acted more like directives. Once Impala gained join
      order optimizations, sometimes join queries were automatically reordered in a way that made a hint
      irrelevant. Therefore, the hints act more like suggestions in Impala 1.2.2 and higher.
    </p>

    <p>
      To force Impala to follow the hinted execution mechanism for a join query, include the
      <codeph>STRAIGHT_JOIN</codeph> keyword in the <codeph>SELECT</codeph> statement. See
      <xref href="impala_perf_joins.xml#straight_join"/> for details. When you use this technique, Impala does not
      reorder the joined tables at all, so you must be careful to arrange the join order to put the largest table
      (or subquery result set) first, then the smallest, second smallest, third smallest, and so on. This ordering lets Impala do the
      most I/O-intensive parts of the query using local reads on the DataNodes, and then reduce the size of the
      intermediate result set as much as possible as each subsequent table or subquery result set is joined.
    </p>
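
    <p>
      As an illustration of that ordering, the following sketch assumes four hypothetical tables whose relative
      sizes are known in advance. The largest table comes first, followed by the remaining tables from smallest
      to largest:
    </p>

<codeblock>-- Hypothetical tables: big_fact is the largest, followed in size by
-- medium_dim, small_dim, and tiny_dim (the smallest).
select straight_join big_fact.id, tiny_dim.label, small_dim.label, medium_dim.label
  from big_fact
    join tiny_dim on big_fact.tiny_id = tiny_dim.id
    join small_dim on big_fact.small_id = small_dim.id
    join medium_dim on big_fact.medium_id = medium_dim.id;</codeblock>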

    <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

    <p>
      Queries that include subqueries in the <codeph>WHERE</codeph> clause can be rewritten internally as join
      queries. Currently, you cannot apply hints to the joins produced by these types of queries.
    </p>

    <p>
      Because hints can prevent queries from taking advantage of new metadata or improvements in query planning,
      use them only when required to work around performance issues, and be prepared to remove them when they are
      no longer required, such as after a new Impala release or bug fix.
    </p>

    <p>
      In particular, the <codeph>/* +BROADCAST */</codeph> and <codeph>/* +SHUFFLE */</codeph> hints are expected to be
      needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature, in
      combination with the <codeph>COMPUTE STATS</codeph> statement, now automatically chooses the join order and
      join mechanism without the need to rewrite the query and add hints. See
      <xref href="impala_perf_joins.xml#perf_joins"/> for details.
    </p>

    <p conref="../shared/impala_common.xml#common/compatibility_blurb"/>

    <p rev="2.0.0">
      Hints embedded within <codeph>--</codeph> comments are compatible with Hive queries. Hints embedded
      within <codeph>/* */</codeph> comments or <codeph>[ ]</codeph> square brackets are not compatible with
      Hive. For example, Hive raises an error for Impala hints within <codeph>/* */</codeph>
      comments because it does not recognize the Impala hint names.
    </p>

    <p conref="../shared/impala_common.xml#common/view_blurb"/>

    <p rev="2.0.0">
      If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala
      internally rewrites all hints in views to use the <codeph>--</codeph> comment notation, so that Hive can
      query such views without errors due to unrecognized hint names.
    </p>
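
    <p rev="2.0.0">
      The following sketch defines a view that contains a hint and then queries it. The view name
      <codeph>customer_states</codeph> is hypothetical; the underlying tables are the ones used in the examples
      in this topic. The <codeph>DESCRIBE FORMATTED</codeph> statement is one way to inspect the stored view
      definition:
    </p>

<codeblock rev="2.0.0">-- Hypothetical view name: customer_states.
create view customer_states as
  select straight_join customer.address, state_lookup.state_name
  from customer join /* +broadcast */ state_lookup
  on customer.state_id = state_lookup.state_id;

-- The hint in the view definition still applies when the view is queried.
select address, state_name from customer_states limit 10;

-- Inspect the view definition that Impala stored.
describe formatted customer_states;</codeblock>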

    <p conref="../shared/impala_common.xml#common/example_blurb"/>

    <p>
      For example, this query joins a large customer table with a small lookup table of fewer than 100 rows. The
      right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the
      <codeph>/* +broadcast */</codeph> hint to force a broadcast join strategy:
    </p>

<codeblock>select straight_join customer.address, state_lookup.state_name
  from customer join <b>/* +broadcast */</b> state_lookup
  on customer.state_id = state_lookup.state_id;</codeblock>

    <p>
      This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of
      hints and find that it is more efficient to transmit portions of each table to other nodes for processing.
      Thus, you would use the <codeph>/* +shuffle */</codeph> hint to force a partitioned join strategy:
    </p>

<codeblock>select straight_join weather.wind_velocity, geospatial.altitude
  from weather join <b>/* +shuffle */</b> geospatial
  on weather.lat = geospatial.lat and weather.long = geospatial.long;</codeblock>

    <p>
      For joins involving three or more tables, the hint applies to the tables on either side of that specific
      <codeph>JOIN</codeph> keyword. The <codeph>STRAIGHT_JOIN</codeph> keyword ensures that joins are processed
      in a predictable order from left to right. For example, this query joins
      <codeph>t1</codeph> and <codeph>t2</codeph> using a partitioned join, then joins that result set to
      <codeph>t3</codeph> using a broadcast join:
    </p>

<codeblock>select straight_join t1.name, t2.id, t3.price
  from t1 join <b>/* +shuffle */</b> t2 on t1.id = t2.id
    join <b>/* +broadcast */</b> t3 on t2.id = t3.id;</codeblock>

    <!-- To do: This is a good place to add more sample output showing before and after EXPLAIN plans. -->

    <p conref="../shared/impala_common.xml#common/related_info"/>

    <p>
      For more background information about join queries, see <xref href="impala_joins.xml#joins"/>. For
      performance considerations, see <xref href="impala_perf_joins.xml#perf_joins"/>.
    </p>
  </conbody>
</concept>