impala/docs/topics/impala_query_results_spooling.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="data_sink">
  <title>Spooling Impala Query Results</title>
  <conbody>
    <p>In Impala, you can control how query results are materialized and
      returned to clients, e.g. impala-shell, Hue, JDBC apps.</p>
    <ul>
      <li>When query result spooling is disabled, Impala relies on clients to
        fetch results to trigger the generation of more result row batches until
        all the result rows have been produced. If a client issues a query
        without fetching all the results, the query fragments continue to
        consume the resources until the query is cancelled and unregistered,
        potentially tying up resources and causing other queries to wait for an
        extended period of time in admission control.<p>Impala would materialize
          rows on-demand where rows are created only when the client requests
          them.</p></li>
      <li>When query result spooling is enabled, result sets of queries are
        eagerly fetched and spooled in the spooling location, either in memory
        or on disk. <p>Once all result rows have been fetched and stored in the
          spooling location, the resources are freed up. Incoming client fetches
          can get the data from the spooled results.</p></li>
    </ul>
    <p>Result spooling is turned off by default, but can be enabled via the
        <codeph>SPOOL_QUERY_RESULTS</codeph> query option.</p>
    <section id="section_av4_hsy_2jb">
      <title>Admission Control and Result Spooling</title>
      <p>Query results spooling collects and stores query results in memory that
        is controlled by admission control. Use the following query options to
        calibrate how much memory to use and when to spill to disk.<dl>
          <dlentry>
            <dt>MAX_RESULT_SPOOLING_MEM</dt>
            <dd>
              <p>The maximum amount of memory used when spooling query results.
                If this value is exceeded when spooling results, all memory will
                most likely be spilled to disk. Set to 100 MB by default. </p>
            </dd>
          </dlentry>
          <dlentry>
            <dt>MAX_SPILLED_RESULT_SPOOLING_MEM</dt>
            <dd>
              <p>The maximum amount of memory that can be spilled to disk when
                spooling query results. Must be greater than or equal to
                  <codeph>MAX_RESULT_SPOOLING_MEM</codeph>. If this value is
                exceeded, the coordinator fragment will block until the client
                has consumed enough rows to free up more memory. Set to 1 GB by
                default.</p>
            </dd>
          </dlentry>
        </dl></p>
    </section>
    <section id="section_oh2_fsy_2jb">
      <title>Fetch Timeout</title>
      <p>Resources for a query are released when the query completes its
        execution. To prevent clients from indefinitely waiting for query
        results, use the <codeph>FETCH_ROWS_TIMEOUT_MS</codeph> query option to
        set the timeout when clients fetch rows. Timeout applies both when query
        result spooling is enabled and disabled:<ul>
          <li>When result spooling is disabled (<codeph>SPOOL_QUERY_RESULTS =
              FALSE</codeph>), the timeout controls how long a client waits for
            a single row batch to be produced by the coordinator. </li>
          <li>When result spooling is enabled ( (<codeph>SPOOL_QUERY_RESULTS =
              TRUE</codeph>), a client can fetch multiple row batches at a time,
            so this timeout controls the total time a client waits for row
            batches to be produced.</li>
        </ul></p>
    </section>
    <section id="section_ahm_bsy_2jb">
      <title>Explain Plans</title>
      <p>Below is the part of the <codeph>EXPLAIN</codeph> plan output for
        result spooling.<codeblock>F01:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
|  Per-Host Resources: mem-estimate=4.02MB mem-reservation=4.00MB thread-reservation=1
PLAN-ROOT SINK
|  mem-estimate=4.00MB mem-reservation=4.00MB spill-buffer=2.00MB thread-reservation=0</codeblock><ul>
          <li>The <codeph>mem-estimate</codeph> for the <codeph>PLAN-ROOT
              SINK</codeph> is an estimate of the amount of memory needed to
            spool all the rows returned by the query.</li>
          <li>The <codeph>mem-reservation</codeph> is the number and size of the
            buffers necessary to spool the query results. By default, the read
            and write buffers are 2 MB in size each, which is why the default is
            4 MB.</li>
        </ul></p>
    </section>
    <section id="section_ovl_ksy_2jb">
      <title>PlanRootSink</title>
      <p dir="ltr">In Impala, the <codeph>PlanRootSink</codeph> class controls
        the passing of batches of rows to the clients and acts as a queue of
        rows to be sent to clients.</p>
      <p>
        <ul>
          <li>
            <p>When result spooling is disabled, a single batch or rows is sent
              to the <codeph>PlanRootSink</codeph>, and then the client must
              consume that batch before another one can be sent.</p>
          </li>
          <li>
            <p>When result spooling is enabled, multiple batches of rows can be
              sent to the <codeph>PlanRootSink</codeph>, and multiple batches
              can be consumed by the client.</p>
          </li>
        </ul>
      </p>
    </section>
    <section>
      <p><b>Related information:</b>
        <xref href="impala_max_result_spooling_mem.xml#MAX_RESULT_SPOOLING_MEM"
        />, <xref
          href="impala_max_spilled_result_spooling_mem.xml#MAX_SPILLED_RESULT_SPOOLING_MEM"
        />, <xref href="impala_spool_query_results.xml#SPOOL_QUERY_RESULTS"
        /></p>
    </section>
  </conbody>
</concept>