IMPALA-7214: [DOCS] Update Impala docs to decouple Impala and DataNodes

- Take 1: Let's review these docs before we go clean up many more. Change-Id: I1c91f7975c09dae9908591eeeac0d55e5355b2d4 Reviewed-on: http://gerrit.cloudera.org:8080/12400 Reviewed-by: Alex Rodoni <arodoni@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-19 09:58:28 -05:00 · 2019-02-07 17:06:47 -08:00
parent 29df1d55f6
commit 697a15b341
9 changed files with 164 additions and 468 deletions
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -2079,7 +2079,7 @@ show functions in _impala_builtins like '*<varname>substring</varname>*';
        <b>Sorting considerations:</b> Although you can specify an <codeph>ORDER BY</codeph> clause in an
        <codeph>INSERT ... SELECT</codeph> statement, any <codeph>ORDER BY</codeph> clause is ignored and the
        results are not necessarily sorted. An <codeph>INSERT ... SELECT</codeph> operation potentially creates
-        many different data files, prepared on different data nodes, and therefore the notion of the data being
+        many different data files, prepared by different executor Impala daemons, and therefore the notion of the data being
        stored in sorted order is impractical.
      </p>

@@ -3185,8 +3185,9 @@ select max(height), avg(height) from census_data where age &gt; 20;
        <codeph><xref href="../topics/impala_order_by.xml#order_by">ORDER BY</xref></codeph> clause to also use a
        <codeph><xref href="../topics/impala_limit.xml#limit">LIMIT</xref></codeph> clause. In Impala 1.4.0 and
        higher, the <codeph>LIMIT</codeph> clause is optional for <codeph>ORDER BY</codeph> queries. In cases where
-        sorting a huge result set requires enough memory to exceed the Impala memory limit for a particular node,
-        Impala automatically uses a temporary disk work area to perform the sort operation.
+        sorting a huge result set requires enough memory to exceed the Impala
+        memory limit for a particular executor Impala daemon, Impala automatically uses a
+        temporary disk work area to perform the sort operation.
      </p>

      <p rev="1.2" id="limit_and_offset">
--- a/docs/topics/impala_components.xml
+++ b/docs/topics/impala_components.xml
@@ -34,10 +34,9 @@ under the License.

  <conbody>

-    <p>
-      The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of
-      different daemon processes that run on specific hosts within your <keyword keyref="distro"/> cluster.
-    </p>
+    <p> The Impala server is a distributed, massively parallel processing (MPP)
+      database engine. It consists of different daemon processes that run on
+      specific hosts within your cluster. </p>

    <p outputclass="toc inpage"/>
  </conbody>
@@ -48,35 +47,35 @@ under the License.

    <conbody>

-      <p>
-        The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented
-        by the <codeph>impalad</codeph> process. It reads and writes to data files; accepts queries transmitted
-        from the <codeph>impala-shell</codeph> command, Hue, JDBC, or ODBC; parallelizes the queries and
-        distributes work across the cluster; and transmits intermediate query results back to the
-        central coordinator node.
-      </p>
+      <p> The core Impala component is the Impala daemon, physically represented
+        by the <codeph>impalad</codeph> process. A few of the key functions that
+        an Impala daemon performs are:<ul>
+          <li>Reads and writes to data files.</li>
+          <li>Accepts queries transmitted from the <codeph>impala-shell</codeph>
+            command, Hue, JDBC, or ODBC.</li>
+          <li>Parallelizes the queries and distributes work across the
+            cluster.</li>
+          <li>Transmits intermediate query results back to the central
+            coordinator. </li>
+        </ul></p>
+      <p>Impala daemons can be deployed in one of the following ways:<ul>
+          <li>HDFS and Impala are co-located, and each Impala daemon runs on the
+            same host as a DataNode.</li>
+          <li>Impala is deployed separately in a compute cluster and reads
+            remotely from HDFS, S3, ADLS, etc.</li>
+        </ul></p>

-      <p>
-        You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the
-        <term>coordinator node</term> for that query. The other nodes transmit partial results back to the
-        coordinator, which constructs the final result set for a query. When running experiments with functionality
-        through the <codeph>impala-shell</codeph> command, you might always connect to the same Impala daemon for
-        convenience. For clusters running production workloads, you might load-balance by
-        submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
-      </p>
+      <p> The Impala daemons are in constant communication with StateStore, to
+        confirm which daemons are healthy and can accept new work. </p>

-      <p>
-        The Impala daemons are in constant communication with the <term>statestore</term>, to confirm which nodes
-        are healthy and can accept new work.
-      </p>
-
-      <p rev="1.2">
-        They also receive broadcast messages from the <cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2)
-        whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an
-        <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statement is processed through Impala. This
-        background communication minimizes the need for <codeph>REFRESH</codeph> or <codeph>INVALIDATE
-        METADATA</codeph> statements that were needed to coordinate metadata across nodes prior to Impala 1.2.
-      </p>
+      <p rev="1.2"> They also receive broadcast messages from the
+          <cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2) whenever
+        any Impala daemon in the cluster creates, alters, or drops any type of
+        object, or when an <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph>
+        statement is processed through Impala. This background communication
+        minimizes the need for <codeph>REFRESH</codeph> or <codeph>INVALIDATE
+          METADATA</codeph> statements that were needed to coordinate metadata
+        across Impala daemons prior to Impala 1.2. </p>

      <p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
        In <keyword keyref="impala29_full"/> and higher, you can control which hosts act as query coordinators
@@ -98,32 +97,28 @@ under the License.

    <conbody>

-      <p>
-        The Impala component known as the <term>statestore</term> checks on the health of Impala daemons on all the
-        DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically
-        represented by a daemon process named <codeph>statestored</codeph>; you only need such a process on one
-        host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue,
-        or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making
-        requests to the unreachable node.
-      </p>
+      <p> The Impala component known as the StateStore checks on the health of
+        all Impala daemons in a cluster, and continuously relays its findings to
+        each of those daemons. It is physically represented by a daemon process
+        named <codeph>statestored</codeph>. You only need such a process on one
+        host in a cluster. If an Impala daemon goes offline due to hardware
+        failure, network error, software issue, or other reason, the StateStore
+        informs all the other Impala daemons so that future queries can avoid
+        making requests to the unreachable Impala daemon. </p>

-      <p>
-        Because the statestore's purpose is to help when things go wrong and
+      <p> Because the StateStore's purpose is to help when things go wrong and
        to broadcast metadata to coordinators, it is not always critical to the
-        normal operation of an Impala cluster. If the statestore is not running
+        normal operation of an Impala cluster. If the StateStore is not running
        or becomes unreachable, the Impala daemons continue running and
        distributing work among themselves as usual when working with the data
        known to Impala. The cluster just becomes less robust if other Impala
        daemons fail, and metadata becomes less consistent as it changes while
-        the statestore is offline. When the statestore comes back online, it
+        the StateStore is offline. When the StateStore comes back online, it
        re-establishes communication with the Impala daemons and resumes its
-        monitoring and broadcasting functions.
-      </p>
+        monitoring and broadcasting functions. </p>

-      <p>
-        If you issue a DDL statement while the statestore is down, the queries
-        that access the new object the DDL created will fail.
-      </p>
+      <p> If you issue a DDL statement while the StateStore is down, the queries
+        that access the new object the DDL created will fail. </p>

      <p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>

@@ -145,35 +140,25 @@ under the License.

    <conbody>

-      <p>
-        The Impala component known as the <term>catalog service</term> relays the metadata changes from Impala SQL
-        statements to all the Impala daemons in a cluster. It is physically represented by a daemon process named
-        <codeph>catalogd</codeph>; you only need such a process on one host in the cluster. Because the requests
-        are passed through the statestore daemon, it makes sense to run the <cmdname>statestored</cmdname> and
-        <cmdname>catalogd</cmdname> services on the same host.
-      </p>
+      <p> The Impala component known as the Catalog Service relays the metadata
+        changes from Impala SQL statements to all the Impala daemons in a
+        cluster. It is physically represented by a daemon process named
+          <codeph>catalogd</codeph>. You only need such a process on one host in
+        a cluster. Because the requests are passed through the StateStore
+        daemon, it makes sense to run the <cmdname>statestored</cmdname> and
+          <cmdname>catalogd</cmdname> services on the same host. </p>

-      <p>
-        The catalog service avoids the need to issue
-        <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements when the metadata changes are
-        performed by statements issued through Impala. When you create a table, load data, and so on through Hive,
-        you do need to issue <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an Impala node
-        before executing a query there.
-      </p>
+      <p> The catalog service avoids the need to issue <codeph>REFRESH</codeph>
+        and <codeph>INVALIDATE METADATA</codeph> statements when the metadata
+        changes are performed by statements issued through Impala. When you
+        create a table, load data, and so on through Hive, you do need to issue
+          <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an
+        Impala daemon before executing a query there. </p>

      <p>
        This feature touches a number of aspects of Impala:
      </p>

-<!-- This was formerly a conref, but since the list of links also included a link
-     to this same topic, materializing the list here and removing that
-     circular link. (The conref is still used in Incompatible Changes.)
-
-      <ul conref="../shared/impala_common.xml#common/catalogd_xrefs">
-        <li/>
-      </ul>
-->
-
      <ul id="catalogd_xrefs">
        <li>
          <p>
@@ -184,16 +169,17 @@ under the License.
        </li>

        <li>
-          <p>
-            The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements are not needed
-            when the <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other table-changing or
-            data-changing operation is performed through Impala. These statements are still needed if such
-            operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the
-            statements only need to be issued on one Impala node rather than on all nodes. See
-            <xref href="impala_refresh.xml#refresh"/> and
-            <xref href="impala_invalidate_metadata.xml#invalidate_metadata"/> for the latest usage information for
-            those statements.
-          </p>
+          <p> The <codeph>REFRESH</codeph> and <codeph>INVALIDATE
+              METADATA</codeph> statements are not needed when the
+              <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other
+            table-changing or data-changing operation is performed through
+            Impala. These statements are still needed if such operations are
+            done through Hive or by manipulating data files directly in HDFS,
+            but in those cases the statements only need to be issued on one
+            Impala daemon rather than on all daemons. See <xref
+              href="impala_refresh.xml#refresh"/> and <xref
+              href="impala_invalidate_metadata.xml#invalidate_metadata"/> for
+            the latest usage information for those statements. </p>
        </li>
      </ul>

--- a/docs/topics/impala_concepts.xml
+++ b/docs/topics/impala_concepts.xml
@@ -43,270 +43,4 @@ under the License.

    <p outputclass="toc"/>
  </conbody>
-
-<!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. -->
-
-  <concept id="intro_data_lifecycle" audience="hidden">
-
-    <title>Overview of the Data Lifecycle for Impala</title>
-
-    <conbody/>
-  </concept>
-
-  <concept id="intro_etl" audience="hidden">
-
-    <title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title>
-  <prolog>
-    <metadata>
-      <data name="Category" value="ETL"/>
-      <data name="Category" value="Ingest"/>
-      <data name="Category" value="Concepts"/>
-    </metadata>
-  </prolog>
-
-    <conbody/>
-  </concept>
-
-  <concept id="intro_hadoop_data" audience="hidden">
-
-    <title>How Impala Works with Hadoop Data Files</title>
-
-    <conbody/>
-  </concept>
-
-  <concept id="intro_web_ui" audience="hidden">
-
-    <title>Overview of the Impala Web Interface</title>
-
-    <conbody/>
-  </concept>
-
-  <concept id="intro_bi" audience="hidden">
-
-    <title>Using Impala with Business Intelligence Tools</title>
-
-    <conbody/>
-  </concept>
-
-  <concept id="intro_ha" audience="hidden">
-
-    <title>Overview of Impala Availability and Fault Tolerance</title>
-
-    <conbody/>
-  </concept>
-
-<!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance",
-     and if it should be split out into a separate file, and then take out the audience= attribute
-     to make it visible.
-->
-
-  <concept id="intro_llvm" audience="hidden">
-
-    <title>Overview of Impala Runtime Code Generation</title>
-
-    <conbody>
-
-<!-- Adapted from the CIDR15 paper written by the Impala team. -->
-
-      <p>
-        Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time
-        (JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation
-        technique improves query execution times by generating native code optimized for the architecture of each
-        host in your particular cluster. Performance gains of 5 times or more are typical for representative
-        workloads.
-      </p>
-
-      <p>
-        Impala uses runtime code generation to produce query-specific versions of functions that are critical to
-        performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those
-        that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the
-        total time the query takes to execute. For example, when Impala scans a data file, it calls a function to
-        parse each record into Impala’s in-memory tuple format. For queries scanning large tables, billions of
-        records could result in billions of function calls. This function must therefore be extremely efficient for
-        good query performance, and removing even a few instructions from each function call can result in large
-        query speedups.
-      </p>
-
-      <p>
-        Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it
-        eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions.
-        Inlining is especially valuable for functions used internally to evaluate expressions, where the function
-        call itself is more expensive than the function body (for example, a function that adds two numbers).
-        Inlining functions also increases instruction-level parallelism, and allows the compiler to make further
-        optimizations such as subexpression elimination across expressions.
-      </p>
-
-      <p>
-        Impala generates runtime query code automatically, so you do not need to do anything special to get this
-        performance benefit. This technique is most effective for complex and long-running queries that process
-        large numbers of rows. If you need to issue a series of short, small queries, you might turn off this
-        feature to avoid the overhead of compilation time for each query. In this case, issue the statement
-        <codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the
-        current session.
-      </p>
-
-<!--
-      <p>
-        Without code generation,
-        functions tend to be suboptimal
-        to handle situations that cannot be predicted in advance.
-        For example,
-        a record-parsing function that
-        only handles integer types will be faster at parsing an integer-only file
-        than a function that handles other data types
-        such as strings and floating-point numbers.
-        However, the schemas of the files to
-        be scanned are unknown at compile time,
-        and so a general-purpose function must be used, even if at runtime
-        it is known that more limited functionality is sufficient.
-      </p>
-
-      <p>
-        A source of large runtime overheads are virtual functions. Virtual function calls incur a large performance
-        penalty, particularly when the called function is very simple, as the calls cannot be inlined.
-        If the type of the object instance is known at runtime, we can use code generation to replace the virtual
-        function call with a call directly to the correct function, which can then be inlined. This is especially
-        valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a
-        tree of individual operators and functions.
-      </p>
-
-      <p>
-        Each type of expression that can appear in a query is implemented internally by overriding a virtual function.
-        Many of these expression functions are quite simple, for example, adding two numbers.
-        The virtual function call can be more expensive than the function body itself. By resolving the virtual
-        function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions
-        directly with no function call overhead. Inlining functions also increases
-        instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression
-        elimination across expressions.
-      </p>
-->
-    </conbody>
-  </concept>
-
-<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->
-
-  <concept audience="hidden" id="intro_io">
-
-    <title>Overview of Impala I/O</title>
-
-    <conbody>
-
-      <p>
-        Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform
-        data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called
-        <term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala
-        can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all
-        available disks. For example, with 12 disks, Impala is typically capable of sustaining I/O at 1.2 GB/sec.
-        Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed,
-        and saves CPU cycles as there is no need to copy or checksum data blocks within memory.
-      </p>
-
-      <p>
-        The I/O manager component interfaces with storage devices to read and write data. I/O manager assigns a
-        fixed number of worker threads per physical disk (currently one thread per rotational disk and eight per
-        SSD), providing an asynchronous interface to clients (<term>scanner threads</term>).
-      </p>
-    </conbody>
-  </concept>
-
-<!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->
-
-<!-- Although good idea to get some answers from Henry first. -->
-
-  <concept audience="hidden" id="intro_state_distribution">
-
-    <title>State distribution</title>
-
-    <conbody>
-
-      <p>
-        As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize
-        its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept
-        and execute queries, and thus each node needs up-to-date versions of the system catalog and a knowledge of
-        which hosts the <cmdname>impalad</cmdname> daemons run on. To avoid the overhead of TCP connections and
-        remote procedure calls to retrieve metadata during query planning, Impala implements a simple
-        publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of
-        subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes).
-      </p>
-
-      <p>
-        The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>,
-        <varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term> where
-        <varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a
-        64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the
-        contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not
-        persisted across service restarts. Processes that receive updates to any topic are called
-        <term>subscribers</term>, and express their interest by registering with the statestore at startup and
-        providing a list of topics. The statestore responds to registration by sending the subscriber an initial
-        topic update for each registered topic, which consists of all the entries currently in that topic.
-      </p>
-
-<!-- Henry: OK, but in practice, what is in these topic messages for Impala? -->
-
-      <p>
-        After registration, the statestore periodically sends two kinds of messages to each subscriber. The first
-        kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries
-        and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a
-        per-topic most-recent-version identifier which allows the statestore to only send the delta between
-        updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its
-        subscribed topics. Those changes are guaranteed to have been applied by the time the next update is
-        received.
-      </p>
-
-      <p>
-        The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called
-        <term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each
-        subscriber, which would otherwise time out its subscription and attempt to re-register.
-      </p>
-
-      <p>
-        Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these
-        messages could be very large in instances with thousands of tables, partitions, data files, and so on,
-        Impala 2.0 and higher divides the types of messages so that the small heartbeat pings can be transmitted
-        and acknowledged quickly, increasing the reliability of the statestore mechanism that detects when Impala
-        nodes become unavailable.
-      </p>
-
-      <p>
-        If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it
-        stops sending updates to that node.
-<!-- Henry: what are examples of these transient topic entries? -->
-        Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are
-        removed.
-      </p>
-
-      <p>
-        Although the asynchronous nature of this mechanism means that metadata updates might take some time to
-        propagate across the entire cluster, that does not affect the consistency of query planning or results.
-        Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of
-        the existence of the relevant tables, data files, and so on, it can distribute the query work to other
-        nodes even if those other nodes have not received the latest metadata updates.
-<!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? -->
-<!--
-        For example, query planning is performed on a single node based on the
-        catalog metadata topic, and once a full plan has been computed, all information required to execute that
-        plan is distributed directly to the executing nodes.
-        There is no requirement that an executing node should
-        know about the same version of the catalog metadata topic.
-->
-      </p>
-
-      <p>
-        We have found that the statestore process with default settings scales well to medium sized clusters, and
-        can serve our largest deployments with some configuration changes.
-<!-- Henry: elaborate on the configuration changes. -->
-      </p>
-
-      <p>
-<!-- Henry: other examples like load information? How is load information used? -->
-        The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by
-        its subscribers (for example, load information). Therefore, should a statestore restart, its state can be
-        recovered during the initial subscriber registration phase. Or if the machine that the statestore is
-        running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it.
-        There is no built-in failover mechanism in Impala, instead deployments commonly use a retargetable DNS
-        entry to force subscribers to automatically move to the new process instance.
-<!-- Henry: translate that last sentence into instructions / guidelines. -->
-      </p>
-    </conbody>
-  </concept>
 </concept>
--- a/docs/topics/impala_connecting.xml
+++ b/docs/topics/impala_connecting.xml
@@ -57,33 +57,30 @@ Lots of nuances to illustrate through sample code.
      See <xref href="impala_shell_options.xml"/> for the command-line and configuration file options you can use.
    </p>

-    <p>
-      You can connect to any DataNode where an instance of <cmdname>impalad</cmdname> is running,
-      and that host coordinates the execution of all queries sent to it.
-    </p>
+    <p> You can connect to any Impala daemon (<cmdname>impalad</cmdname>), and
+      that daemon coordinates the execution of all queries sent to it. </p>

    <p>
      For simplicity during development, you might always connect to the same host, perhaps running <cmdname>impala-shell</cmdname> on
      the same host as <cmdname>impalad</cmdname> and specifying the hostname as <codeph>localhost</codeph>.
    </p>

-    <p>
-      In a production environment, you might enable load balancing, in which you connect to specific host/port combination
-      but queries are forwarded to arbitrary hosts. This technique spreads the overhead of acting as the coordinator
-      node among all the DataNodes in the cluster. See <xref href="impala_proxy.xml"/> for details.
-    </p>
+    <p> In a production environment, you might enable load balancing, in which
+      you connect to specific host/port combination but queries are forwarded to
+      arbitrary hosts. This technique spreads the overhead of acting as the
+      coordinator node among all the Impala daemons in the cluster. See <xref
+        href="impala_proxy.xml"/> for details. </p>

    <p>
      <b>To connect the Impala shell during shell startup:</b>
    </p>

    <ol>
-      <li>
-        Locate the hostname of a DataNode within the cluster that is running an instance of the
-        <cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something
-        other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the
-        port number also.
-      </li>
+      <li> Locate the hostname that is running an instance of the
+          <cmdname>impalad</cmdname> daemon. If that <cmdname>impalad</cmdname>
+        uses a non-default port (something other than port 21000) for
+          <cmdname>impala-shell</cmdname> connections, find out the port number
+        also. </li>

      <li>
        Use the <codeph>-i</codeph> option to the
@@ -126,21 +123,17 @@ $ impala-shell -i <varname>some.other.hostname</varname>:<varname>port_number</v
 [Not connected] &gt; </codeblock>
      </li>

-      <li>
-        Locate the hostname of a DataNode within the cluster that is running an instance of the
-        <cmdname>impalad</cmdname> daemon. If that DataNode uses a non-default port (something
-        other than port 21000) for <cmdname>impala-shell</cmdname> connections, find out the
-        port number also.
-      </li>
+      <li> Locate the hostname that is running the <cmdname>impalad</cmdname>
+        daemon. If that <cmdname>impalad</cmdname> uses a non-default port
+        (something other than port 21000) for <cmdname>impala-shell</cmdname>
+        connections, find out the port number also. </li>

-      <li>
-        Use the <codeph>connect</codeph> command to connect to an Impala instance. Enter a command of the form:
-<codeblock>[Not connected] &gt; connect <varname>impalad-host</varname>
+      <li> Use the <codeph>connect</codeph> command to connect to an Impala
+        instance. Enter a command of the form: <codeblock>[Not connected] &gt; connect <varname>impalad-host</varname>
 [<varname>impalad-host</varname>:21000] &gt;</codeblock>
-        <note>
-          Replace <varname>impalad-host</varname> with the hostname you have configured for any DataNode running
-          Impala in your environment. The changed prompt indicates a successful connection.
-        </note>
+        <note> Replace <varname>impalad-host</varname> with the hostname you
+          have configured to run Impala in your environment. The changed prompt
+          indicates a successful connection. </note>
      </li>
    </ol>

@@ -148,11 +141,9 @@ $ impala-shell -i <varname>some.other.hostname</varname>:<varname>port_number</v
      <b>To start <cmdname>impala-shell</cmdname> in a specific database:</b>
    </p>

-    <p>
-      You can use all the same connection options as in previous examples.
-      For simplicity, these examples assume that you are logged into one of
-      the DataNodes that is running the <cmdname>impalad</cmdname> daemon.
-    </p>
+    <p> You can use all the same connection options as in previous examples. For
+      simplicity, these examples assume that you are logged into one of the
+      Impala daemons. </p>

    <ol>
      <li>
@@ -182,11 +173,9 @@ $ impala-shell -i localhost -d parquet_gzip_compression
      <b>To run one or several statements in non-interactive mode:</b>
    </p>

-    <p>
-      You can use all the same connection options as in previous examples.
-      For simplicity, these examples assume that you are logged into one of
-      the DataNodes that is running the <cmdname>impalad</cmdname> daemon.
-    </p>
+    <p> You can use all the same connection options as in previous examples. For
+      simplicity, these examples assume that you are logged into one of the
+      Impala daemons. </p>

    <ol>
      <li>
--- a/docs/topics/impala_impala_shell.xml
+++ b/docs/topics/impala_impala_shell.xml
@@ -98,10 +98,9 @@ under the License.
      For information on installing the Impala shell, see <xref href="impala_install.xml#install"/>.
    </p>

-    <p>
-      For information about establishing a connection to a DataNode running the <codeph>impalad</codeph> daemon
-      through the <codeph>impala-shell</codeph> command, see <xref href="impala_connecting.xml#connecting"/>.
-    </p>
+    <p> For information about establishing a connection to a coordinator Impala
+      daemon through the <codeph>impala-shell</codeph> command, see <xref
+        href="impala_connecting.xml#connecting"/>. </p>

    <p>
      For a list of the <codeph>impala-shell</codeph> command-line options, see
--- a/docs/topics/impala_kudu.xml
+++ b/docs/topics/impala_kudu.xml
@@ -143,15 +143,13 @@ under the License.
          </li>

          <li>
-            <p>
-              Data is physically divided based on units of storage called
+            <p> Data is physically divided based on units of storage called
                <term>tablets</term>. Tablets are stored by <term>tablet
                servers</term>. Each tablet server can store multiple tablets,
              and each tablet is replicated across multiple tablet servers,
              managed automatically by Kudu. Where practical, co-locate the
-              tablet servers on the same hosts as the DataNodes, although that
-              is not required.
-            </p>
+              tablet servers on the same hosts as the Impala daemons, although
+              that is not required. </p>
          </li>
        </ul>

--- a/docs/topics/impala_order_by.xml
+++ b/docs/topics/impala_order_by.xml
@@ -49,15 +49,6 @@ under the License.
        <codeph>LIMIT</codeph> clause, which reduces network overhead and the
      memory requirement on the coordinator node. </p>

-    <note>
-      <p rev="1.4.0 obwl">
-        In Impala 1.4.0 and higher, the <codeph>LIMIT</codeph> clause is now optional (rather than required) for
-        queries that use the <codeph>ORDER BY</codeph> clause. Impala automatically uses a temporary disk work area
-        to perform the sort if the sort operation would otherwise exceed the Impala memory limit for a particular
-        DataNode.
-      </p>
-    </note>
-
    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

    <p>
@@ -105,8 +96,6 @@ col_ref ::= <varname>column_name</varname> | <varname>integer_literal</varname>
    </p>

    <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_limit"/>
-
-    <!-- Good to show an example of cases where ORDER BY does and doesn't work with complex types. -->
    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

    <p rev="2.3.0">
@@ -131,13 +120,12 @@ ERROR: AnalysisException: ORDER BY expression 'score' with complex type 'ARRAY&l

    <p conref="../shared/impala_common.xml#common/example_blurb"/>

-    <p>
-      The following query retrieves the user ID and score, only for scores greater than one million,
-      with the highest scores for each user listed first.
-      Because the individual array elements are now represented as separate rows in the result set,
-      they can be used in the <codeph>ORDER BY</codeph> clause, referenced using the <codeph>ITEM</codeph>
-      pseudocolumn that represents each array element.
-    </p>
+    <p> The following query retrieves the user ID and score, only for scores
+      greater than one million, with the highest scores for each user listed
+      first. Because the individual array elements are now represented as
+      separate rows in the result set, they can be used in the <codeph>ORDER
+        BY</codeph> clause, referenced using the <codeph>ITEM</codeph>
+      pseudo-column that represents each array element. </p>

 <codeblock>SELECT id, item FROM games, games.score
  WHERE item &gt; 1000000
@@ -168,13 +156,14 @@ ORDER BY id, info.value desc;

    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

-    <p>
-      Although the <codeph>LIMIT</codeph> clause is now optional on <codeph>ORDER BY</codeph> queries, if your
-      query only needs some number of rows that you can predict in advance, use the <codeph>LIMIT</codeph> clause
-      to reduce unnecessary processing. For example, if the query has a clause <codeph>LIMIT 10</codeph>, each data
-      node sorts its portion of the relevant result set and only returns 10 rows to the coordinator node. The
-      coordinator node picks the 10 highest or lowest row values out of this small intermediate result set.
-    </p>
+    <p> Although the <codeph>LIMIT</codeph> clause is now optional on
+        <codeph>ORDER BY</codeph> queries, if your query only needs some number
+      of rows that you can predict in advance, use the <codeph>LIMIT</codeph>
+      clause to reduce unnecessary processing. For example, if the query has a
+      clause <codeph>LIMIT 10</codeph>, each executor Impala daemon sorts its
+      portion of the relevant result set and only returns 10 rows to the
+      coordinator node. The coordinator node picks the 10 highest or lowest row
+      values out of this small intermediate result set. </p>

    <p>
      If an <codeph>ORDER BY</codeph> clause is applied to an early phase of query processing, such as a subquery
@@ -212,45 +201,46 @@ SELECT page_title AS "Page 3 of search results", page_url FROM search_content

    <p conref="../shared/impala_common.xml#common/internals_blurb"/>

-    <p>
-      Impala sorts the intermediate results of an <codeph>ORDER BY</codeph> clause in memory whenever practical. In
-      a cluster of N DataNodes, each node sorts roughly 1/Nth of the result set, the exact proportion varying
-      depending on how the data matching the query is distributed in HDFS.
-    </p>
+    <p> Impala sorts the intermediate results of an <codeph>ORDER BY</codeph>
+      clause in memory whenever practical. In a cluster of N executor Impala
+      daemons, each daemon sorts roughly 1/Nth of the result set, the exact
+      proportion varying depending on how the data matching the query is
+      distributed in HDFS. </p>

-    <p>
-      If the size of the sorted intermediate result set on any DataNode would cause the query to exceed the Impala
-      memory limit, Impala sorts as much as practical in memory, then writes partially sorted data to disk. (This
-      technique is known in industry terminology as <q>external sorting</q> and <q>spilling to disk</q>.) As each
-      8 MB batch of data is written to disk, Impala frees the corresponding memory to sort a new 8 MB batch of
-      data. When all the data has been processed, a final merge sort operation is performed to correctly order the
-      in-memory and on-disk results as the result set is transmitted back to the coordinator node. When external
-      sorting becomes necessary, Impala requires approximately 60 MB of RAM at a minimum for the buffers needed to
-      read, write, and sort the intermediate results. If more RAM is available on the DataNode, Impala will use
-      the additional RAM to minimize the amount of disk I/O for sorting.
-    </p>
+    <p> If the size of the sorted intermediate result set on any executor Impala
+      daemon would cause the query to exceed the Impala memory limit, Impala
+      sorts as much as practical in memory, then writes partially sorted data to
+      disk. (This technique is known in industry terminology as <q>external
+        sorting</q> and <q>spilling to disk</q>.) As each 8 MB batch of data is
+      written to disk, Impala frees the corresponding memory to sort a new 8 MB
+      batch of data. When all the data has been processed, a final merge sort
+      operation is performed to correctly order the in-memory and on-disk
+      results as the result set is transmitted back to the coordinator node.
+      When external sorting becomes necessary, Impala requires approximately 60
+      MB of RAM at a minimum for the buffers needed to read, write, and sort the
+      intermediate results. If more RAM is available on the Impala daemon,
+      Impala will use the additional RAM to minimize the amount of disk I/O for
+      sorting. </p>

-    <p>
-      This external sort technique is used as appropriate on each DataNode (possibly including the coordinator
-      node) to sort the portion of the result set that is processed on that node. When the sorted intermediate
-      results are sent back to the coordinator node to produce the final result set, the coordinator node uses a
-      merge sort technique to produce a final sorted result set without using any extra resources on the
-      coordinator node.
-    </p>
+    <p> This external sort technique is used as appropriate on each Impala
+      daemon (possibly including the coordinator node) to sort the portion of
+      the result set that is processed on that node. When the sorted
+      intermediate results are sent back to the coordinator node to produce the
+      final result set, the coordinator node uses a merge sort technique to
+      produce a final sorted result set without using any extra resources on the
+      coordinator node. </p>

    <p rev="obwl">
      <b>Configuration for disk usage:</b>
    </p>

-    <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>
-
-<!-- Here is actually the more logical place to collect all those examples, move them from SELECT and cross-reference to here. -->
-
-<!--     <p rev="obwl" conref="../shared/impala_common.xml#common/restrictions_blurb"/> -->
+    <p rev="obwl"
+      conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>

    <p rev="obwl" conref="../shared/impala_common.xml#common/insert_sort_blurb"/>

-    <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_view_restriction"/>
+    <p rev="obwl"
+      conref="../shared/impala_common.xml#common/order_by_view_restriction"/>

    <p>
      With the lifting of the requirement to include a <codeph>LIMIT</codeph> clause in every <codeph>ORDER
@@ -259,16 +249,17 @@ SELECT page_title AS "Page 3 of search results", page_url FROM search_content

    <ul>
      <li>
-        <p>
-          Now the use of scratch disk space raises the possibility of an <q>out of disk space</q> error on a
-          particular DataNode, as opposed to the previous possibility of an <q>out of memory</q> error. Make sure
-          to keep at least 1 GB free on the filesystem used for temporary sorting work.
-        </p>
+        <p> Now the use of scratch disk space raises the possibility of an
+            <q>out of disk space</q> error on a particular Impala daemon, as
+          opposed to the previous possibility of an <q>out of memory</q> error.
+          Make sure to keep at least 1 GB free on the filesystem used for
+          temporary sorting work. </p>
      </li>

    </ul>

-    <p rev="obwl" conref="../shared/impala_common.xml#common/null_sorting_change"/>
+    <p rev="obwl"
+      conref="../shared/impala_common.xml#common/null_sorting_change"/>
 <codeblock>[localhost:21000] > create table numbers (x int);
 [localhost:21000] > insert into numbers values (1), (null), (2), (null), (3);
 [localhost:21000] > select x from numbers order by x nulls first;
--- a/docs/topics/impala_runtime_filtering.xml
+++ b/docs/topics/impala_runtime_filtering.xml
@@ -82,13 +82,13 @@ under the License.
      </p>
      <ul>
        <li>
-          <p>
-            What a <term>plan fragment</term> is.
-            Impala decomposes each query into smaller units of work that are distributed across the cluster.
-            Wherever possible, a data block is read, filtered, and aggregated by plan fragments executing
-            on the same host. For some operations, such as joins and combining intermediate results into
-            a final result set, data is transmitted across the network from one DataNode to another.
-          </p>
+          <p> What a <term>plan fragment</term> is. Impala decomposes each query
+            into smaller units of work that are distributed across the cluster.
+            Wherever possible, a data block is read, filtered, and aggregated by
+            plan fragments executing on the same host. For some operations, such
+            as joins and combining intermediate results into a final result set,
+            data is transmitted across the network from one Impala daemon to
+            another. </p>
        </li>
        <li>
          <p>
--- a/docs/topics/impala_scalability.xml
+++ b/docs/topics/impala_scalability.xml
@@ -601,9 +601,7 @@ Memory Usage: Additional Notes
              available to Impala and reduce the amount of memory required on each node.
            </li>

-            <li>
-              Increase the overall memory capacity of each DataNode at the hardware level.
-            </li>
+            <li> Add more memory to the hosts running Impala daemons. </li>

            <li>
              On a cluster with resources shared between Impala and other Hadoop components, use