impala/docs/topics/impala_analytic_functions.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="2.0.0" id="analytic_functions">

  <title>Impala Analytic Functions</title>

  <titlealts audience="PDF">

    <navtitle>Analytic Functions</navtitle>

  </titlealts>

  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="Impala Functions"/>
      <data name="Category" value="Aggregate Functions"/>
      <data name="Category" value="Analytic Functions"/>
      <data name="Category" value="Data Analysts"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Querying"/>
    </metadata>
  </prolog>

  <conbody>

    <p rev="2.0.0">
      <indexterm audience="hidden">analytic functions</indexterm>

      <indexterm audience="hidden">window functions</indexterm>
      Analytic functions (also known as window functions) are a special category of built-in functions. Like
      aggregate functions, they examine the contents of multiple input rows to compute each output value. However,
      rather than being limited to one result value per <codeph>GROUP BY</codeph> group, they operate on
      <term>windows</term> where the input rows are ordered and grouped using flexible conditions expressed through
      an <codeph>OVER()</codeph> clause.
    </p>

    <p conref="../shared/impala_common.xml#common/added_in_20"/>

<!--
    <p>
      Analytic functions produce one output value for each input row, like scalar functions such as
      <codeph>length()</codeph> or
      <codeph>substr()</codeph>.
    </p>
-->

    <p>
      Some functions, such as <codeph>LAG()</codeph> and <codeph>RANK()</codeph>, can only be used in this analytic
      context. Some aggregate functions do double duty: when you call the aggregation functions such as
      <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>AVG()</codeph>, and so on with an
      <codeph>OVER()</codeph> clause, they produce an output value for each row, based on computations across other
      rows in the window.
    </p>

    <p>
      Although analytic functions often compute the same value you would see from an aggregate function in a
      <codeph>GROUP BY</codeph> query, the analytic functions produce a value for each row in the result set rather
      than a single value for each group. This flexibility lets you include additional columns in the
      <codeph>SELECT</codeph> list, offering more opportunities for organizing and filtering the result set.
    </p>

    <p>
      Analytic function calls are only allowed in the <codeph>SELECT</codeph> list and in the outermost
      <codeph>ORDER BY</codeph> clause of the query. During query processing, analytic functions are evaluated
      after other query stages such as joins, <codeph>WHERE</codeph>, and <codeph>GROUP BY</codeph>,
    </p>

<!-- Oracle doesn't show examples until it gets to the actual functions, so let's follow that lead.
    <p>
      The following example shows a very simple call to <codeph>MAX()</codeph> in
      an analytic context, and a similar query using a <codeph>GROUP BY</codeph> clause.
    </p>
-->

<!--
This basic query could be represented either as an analytic
function call or an aggregation function call in a <codeph>GROUP BY</codeph> query.
For more elaborate kinds of computations, the flexibility of the analytic window
makes that the preferred option.
-->

<!-- TK: construct sample data and fill in query results. -->

<!-- Other DB docs don't necessarily include examples up at this level, only for the individual functions.
     So maybe take these placeholders out entirely.

<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year) FROM historical_temps;
SELECT year, month, max(degrees_c) FROM historical_temps GROUP BY year;
</codeblock>
-->

    <p>
      The rows that are part of each partition are analyzed by computations across an ordered or unordered set of
      rows. For example, <codeph>COUNT()</codeph> and <codeph>SUM()</codeph> might be applied to all the rows in
      the partition, in which case the order of analysis does not matter. The <codeph>ORDER BY</codeph> clause
      might be used inside the <codeph>OVER()</codeph> clause to defines the ordering that applies to functions
      such as <codeph>LAG()</codeph> and <codeph>FIRST_VALUE()</codeph>.
    </p>

<!-- TK: output needed here also. -->

<!--
<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year ORDER BY MONTH DESC) FROM historical_temps;
</codeblock>
-->

    <p>
      Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and
      bucketed analysis for large data sets. You might also see the term <q>window functions</q> in database
      literature, referring to the sequence of rows (the <q>window</q>) that the function call applies to,
      particularly when the <codeph>OVER</codeph> clause includes a <codeph>ROWS</codeph> or <codeph>RANGE</codeph>
      keyword.
    </p>

    <p>
      The following sections describe the analytic query clauses and the pure analytic functions provided by
      Impala. For usage information about aggregate functions in an analytic context, see
      <xref href="impala_aggregate_functions.xml#aggregate_functions"/>.
    </p>

    <p outputclass="toc inpage"/>

  </conbody>

  <concept id="over">

    <title>OVER Clause</title>

    <conbody>

      <p>
        The <codeph>OVER</codeph> clause is required for calls to pure analytic functions such as
        <codeph>LEAD()</codeph>, <codeph>RANK()</codeph>, and <codeph>FIRST_VALUE()</codeph>. When you include an
        <codeph>OVER</codeph> clause with calls to aggregate functions such as <codeph>MAX()</codeph>,
        <codeph>COUNT()</codeph>, or <codeph>SUM()</codeph>, they operate as analytic functions.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>function(<varname>args</varname>) OVER([<varname>partition_by_clause</varname>] [<varname>order_by_clause</varname> [<varname>window_clause</varname>]])

partition_by_clause ::= PARTITION BY <varname>expr</varname> [, <varname>expr</varname> ...]
order_by_clause ::= ORDER BY <varname>expr</varname>  [ASC | DESC] [NULLS FIRST | NULLS LAST] [, <varname>expr</varname> [ASC | DESC] [NULLS FIRST | NULLS LAST] ...]
window_clause: See <xref href="#window_clause">Window Clause</xref>
</codeblock>

      <p>
        <b>PARTITION BY clause:</b>
      </p>

      <p>
        The <codeph>PARTITION BY</codeph> clause acts much like the <codeph>GROUP BY</codeph> clause in the
        outermost block of a query. It divides the rows into groups containing identical values in one or more
        columns. These logical groups are known as <term>partitions</term>. Throughout the discussion of analytic
        functions, <q>partitions</q> refers to the groups produced by the <codeph>PARTITION BY</codeph> clause, not
        to partitioned tables. However, note the following limitation that applies specifically to analytic function
        calls involving partitioned tables.
      </p>

      <p conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/>

      <p>
        The sequence of results from an analytic function <q>resets</q> for each new partition in the result set.
        That is, the set of preceding or following rows considered by the analytic function always come from a
        single partition. Any <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>ROW_NUMBER()</codeph>, and so
        on apply to each partition independently. Omit the <codeph>PARTITION BY</codeph> clause to apply the
        analytic operation to all the rows in the table.
      </p>

      <p>
        <b>ORDER BY clause:</b>
      </p>

      <p>
        The <codeph>ORDER BY</codeph> clause works much like the <codeph>ORDER BY</codeph> clause in the outermost
        block of a query. It defines the order in which rows are evaluated for the entire input set, or for each
        group produced by a <codeph>PARTITION BY</codeph> clause. You can order by one or multiple expressions, and
        for each expression optionally choose ascending or descending order and whether nulls come first or last in
        the sort order. Because this <codeph>ORDER BY</codeph> clause only defines the order in which rows are
        evaluated, if you want the results to be output in a specific order, also include an <codeph>ORDER
        BY</codeph> clause in the outer block of the query.
      </p>

      <p>
        When the <codeph>ORDER BY</codeph> clause is omitted, the analytic function applies to all items in the
        group produced by the <codeph>PARTITION BY</codeph> clause. When the <codeph>ORDER BY</codeph> clause is
        included, the analysis can apply to all or a subset of the items in the group, depending on the optional
        window clause.
      </p>

      <p>
        The order in which the rows are analyzed is only defined for those columns specified in <codeph>ORDER
        BY</codeph> clauses.
      </p>

      <p>
        One difference between the analytic and outer uses of the <codeph>ORDER BY</codeph> clause: inside the
        <codeph>OVER</codeph> clause, <codeph>ORDER BY 1</codeph> or other integer value is interpreted as a
        constant sort value (effectively a no-op) rather than referring to column 1.
      </p>

      <p>
        <b>Window clause:</b>
      </p>

      <p>
        The window clause is only allowed in combination with an <codeph>ORDER BY</codeph> clause. If the
        <codeph>ORDER BY</codeph> clause is specified but the window clause is not, the default window is
        <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>. See
        <xref href="impala_analytic_functions.xml#window_clause"/> for full details.
      </p>

      <p conref="../shared/impala_common.xml#common/hbase_blurb"/>

      <p>
        Because HBase tables are optimized for single-row lookups rather than full scans, analytic functions using
        the <codeph>OVER()</codeph> clause are not recommended for HBase tables. Although such queries work, their
        performance is lower than on comparable tables using HDFS data files.
      </p>

      <p conref="../shared/impala_common.xml#common/parquet_blurb"/>

      <p>
        Analytic functions are very efficient for Parquet tables. The data that is examined during evaluation of
        the <codeph>OVER()</codeph> clause comes from a specified set of columns, and the values for each column
        are arranged sequentially within each data file.
      </p>

      <p conref="../shared/impala_common.xml#common/text_blurb"/>

      <p>
        Analytic functions are convenient to use with text tables for exploratory business intelligence. When the
        volume of data is substantial, prefer to use Parquet tables for performance-critical analytic queries.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example shows how to synthesize a numeric sequence corresponding to all the rows in a table.
        The new table has the same columns as the old one, plus an additional column <codeph>ID</codeph> containing
        the integers 1, 2, 3, and so on, corresponding to the order of a <codeph>TIMESTAMP</codeph> column in the
        original table.
      </p>

<!-- TK: synthesize some data and fill in output here. -->

<codeblock>CREATE TABLE events_with_id AS
  SELECT
    row_number() OVER (ORDER BY date_and_time) AS id,
    c1, c2, c3, c4
  FROM events;
</codeblock>

      <p>
        The following example shows how to determine the number of rows containing each value for a column. Unlike
        a corresponding <codeph>GROUP BY</codeph> query, this one can analyze a single column and still return all
        values (not just the distinct ones) from the other columns.
      </p>

<!-- TK: verify the 'unbounded' shortcut syntax. -->

<codeblock>SELECT x, y, z,
  count() OVER (PARTITION BY x) AS how_many_x
FROM t1;
</codeblock>

      <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

      <p>
        You cannot directly combine the <codeph>DISTINCT</codeph> operator with analytic function calls. You can
        put the analytic function call in a <codeph>WITH</codeph> clause or an inline view, and apply the
        <codeph>DISTINCT</codeph> operator to its result set.
      </p>

<codeblock>WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1)
  SELECT DISTINCT x, total FROM t1;
</codeblock>

    </conbody>

  </concept>

  <concept id="window_clause">

    <title>Window Clause</title>

    <conbody>

      <p>
        Certain analytic functions accept an optional <term>window clause</term>, which makes the function analyze
        only certain rows <q>around</q> the current row rather than all rows in the partition. For example, you can
        get a moving average by specifying some number of preceding and following rows, or a running count or
        running total by specifying all rows up to the current position. This clause can result in different
        analytic results for rows within the same partition.
      </p>

      <p>
        The window clause is supported with the <codeph>AVG()</codeph>, <codeph>COUNT()</codeph>,
        <codeph>FIRST_VALUE()</codeph>, <codeph>LAST_VALUE()</codeph>, and <codeph>SUM()</codeph> functions.
<!-- To do: fill in this factoid under MAX and MIN also. -->
        For <codeph>MAX()</codeph> and <codeph>MIN()</codeph>, the window clause only allowed if the start bound is
        <codeph>UNBOUNDED PRECEDING</codeph>
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>ROWS BETWEEN [ { <varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ]
RANGE BETWEEN [ {<varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ]</codeblock>

      <p>
        <codeph>ROWS BETWEEN</codeph> defines the size of the window in terms of the indexes of the rows in the
        result set. The size of the window is predictable based on the clauses the position within the result set.
      </p>

      <p>
        <codeph>RANGE BETWEEN</codeph> does not currently support numeric arguments to define a variable-size
        sliding window.
<!--
Currently, it effectively works the same as the
equivalent <codeph>ROWS BETWEEN</codeph> clause.
-->
      </p>

<!--
<p>
<codeph>RANGE BETWEEN</codeph> defines the size of the window based on arithmetic comparisons
of the values in the result set.
The size of the window varies depending on the order and distribution of values.
</p>
-->

      <p>
        Currently, Impala supports only some combinations of arguments to the <codeph>RANGE</codeph> clause:
      </p>

      <ul>
        <li>
          <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> (the default when <codeph>ORDER
          BY</codeph> is specified and the window clause is omitted)
        </li>

        <li>
          <codeph>RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING</codeph>
        </li>

        <li>
          <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING</codeph>
        </li>
      </ul>

      <p>
        When <codeph>RANGE</codeph> is used, <codeph>CURRENT ROW</codeph> includes not just the current row but all
        rows that are tied with the current row based on the <codeph>ORDER BY</codeph> expressions.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following examples show financial data for a fictional stock symbol <codeph>JDR</codeph>. The closing
        price moves up and down each day.
      </p>

<codeblock>create table stock_ticker (stock_symbol string, closing_price decimal(8,2), closing_date timestamp);
...load some data...
select * from stock_ticker order by stock_symbol, closing_date
+--------------+---------------+---------------------+
| stock_symbol | closing_price | closing_date        |
+--------------+---------------+---------------------+
| JDR          | 12.86         | 2014-10-02 00:00:00 |
| JDR          | 12.89         | 2014-10-03 00:00:00 |
| JDR          | 12.94         | 2014-10-04 00:00:00 |
| JDR          | 12.55         | 2014-10-05 00:00:00 |
| JDR          | 14.03         | 2014-10-06 00:00:00 |
| JDR          | 14.75         | 2014-10-07 00:00:00 |
| JDR          | 13.98         | 2014-10-08 00:00:00 |
+--------------+---------------+---------------------+
</codeblock>

      <p>
        The queries use analytic functions with window clauses to compute moving averages of the closing price. For
        example, <codeph>ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING</codeph> produces an average of the value from a
        3-day span, producing a different value for each row. The first row, which has no preceding row, only gets
        averaged with the row following it. If the table contained more than one stock symbol, the
        <codeph>PARTITION BY</codeph> clause would limit the window for the moving average to only consider the
        prices for a single stock.
      </p>

<codeblock>select stock_symbol, closing_date, closing_price,
  avg(closing_price) over (partition by stock_symbol order by closing_date
    rows between 1 preceding and 1 following) as moving_average
  from stock_ticker;
+--------------+---------------------+---------------+----------------+
| stock_symbol | closing_date        | closing_price | moving_average |
+--------------+---------------------+---------------+----------------+
| JDR          | 2014-10-02 00:00:00 | 12.86         | 12.87          |
| JDR          | 2014-10-03 00:00:00 | 12.89         | 12.89          |
| JDR          | 2014-10-04 00:00:00 | 12.94         | 12.79          |
| JDR          | 2014-10-05 00:00:00 | 12.55         | 13.17          |
| JDR          | 2014-10-06 00:00:00 | 14.03         | 13.77          |
| JDR          | 2014-10-07 00:00:00 | 14.75         | 14.25          |
| JDR          | 2014-10-08 00:00:00 | 13.98         | 14.36          |
+--------------+---------------------+---------------+----------------+
</codeblock>

      <p>
        The clause <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> produces a cumulative moving
        average, from the earliest data up to the value for each day.
      </p>

<codeblock>select stock_symbol, closing_date, closing_price,
  avg(closing_price) over (partition by stock_symbol order by closing_date
    rows between unbounded preceding and current row) as moving_average
  from stock_ticker;
+--------------+---------------------+---------------+----------------+
| stock_symbol | closing_date        | closing_price | moving_average |
+--------------+---------------------+---------------+----------------+
| JDR          | 2014-10-02 00:00:00 | 12.86         | 12.86          |
| JDR          | 2014-10-03 00:00:00 | 12.89         | 12.87          |
| JDR          | 2014-10-04 00:00:00 | 12.94         | 12.89          |
| JDR          | 2014-10-05 00:00:00 | 12.55         | 12.81          |
| JDR          | 2014-10-06 00:00:00 | 14.03         | 13.05          |
| JDR          | 2014-10-07 00:00:00 | 14.75         | 13.33          |
| JDR          | 2014-10-08 00:00:00 | 13.98         | 13.42          |
+--------------+---------------------+---------------+----------------+
</codeblock>

<!-- Matt suggests not always true depending on data. Hiding until I can try myself.
<p>
The clause <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> would produce the same
output as above. Because <codeph>RANGE</codeph> currently does not support numeric offsets while
<codeph>ROWS</codeph> does, currently the <codeph>ROWS</codeph> syntax is more flexible.
</p>
-->

    </conbody>

  </concept>

  <concept id="avg_analytic">

    <title>AVG Function - Analytic Context</title>

    <conbody>

      <p>
        You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
        function. See <xref href="impala_avg.xml#avg"/> for details and examples.
      </p>

    </conbody>

  </concept>

  <concept id="count_analytic">

    <title>COUNT Function - Analytic Context</title>

    <conbody>

      <p>
        You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
        function. See <xref href="impala_count.xml#count"/> for details and examples.
      </p>

    </conbody>

  </concept>

  <concept rev="2.3.0" id="cume_dist">

    <title>CUME_DIST Function (<keyword keyref="impala23"/> or higher only)</title>

    <conbody>

      <p>
        Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0
        and less than or equal to 1.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>CUME_DIST (<varname>expr</varname>)
  OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)
</codeblock>

      <p>
        The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
        window clause is not allowed.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        Within each partition of the result set, the <codeph>CUME_DIST()</codeph> value represents an ascending
        sequence that ends at 1. Each value represents the proportion of rows in the partition whose values are
        less than or equal to the value in the current row.
      </p>

      <p>
        If the sequence of input values contains ties, the <codeph>CUME_DIST()</codeph> results are identical for the
        tied values.
      </p>

      <p>
        Impala only supports the <codeph>CUME_DIST()</codeph> function in an analytic context, not as a regular
        aggregate function.
      </p>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        This example uses a table with 9 rows. The <codeph>CUME_DIST()</codeph>
        function evaluates the entire table because there is no <codeph>PARTITION BY</codeph> clause,
        with the rows ordered by the weight of the animal.
        the sequence of values shows that 1/9 of the values are less than or equal to the lightest
        animal (mouse), 2/9 of the values are less than or equal to the second-lightest animal,
        and so on up to the heaviest animal (elephant), where 9/9 of the rows are less than or
        equal to its weight.
      </p>

<codeblock>create table animals (name string, kind string, kilos decimal(9,3));
insert into animals values
  ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020),
  ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5),
  ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5);

select name, cume_dist() over (order by kilos) from animals;
+------------+-----------------------+
| name       | cume_dist() OVER(...) |
+------------+-----------------------+
| Elephant   | 1                     |
| Giraffe    | 0.8888888888888888    |
| Polar bear | 0.7777777777777778    |
| Horse      | 0.6666666666666666    |
| Ostrich    | 0.5555555555555556    |
| Condor     | 0.4444444444444444    |
| Housecat   | 0.3333333333333333    |
| Owl        | 0.2222222222222222    |
| Mouse      | 0.1111111111111111    |
+------------+-----------------------+
</codeblock>

      <p>
        Using a <codeph>PARTITION BY</codeph> clause produces a separate sequence for each partition
        group, in this case one for mammals and one for birds. Because there are 3 birds and 6 mammals,
        the sequence illustrates how 1/3 of the <q>Bird</q> rows have a <codeph>kilos</codeph> value that is less than or equal to
        the lightest bird, 1/6 of the <q>Mammal</q> rows have a <codeph>kilos</codeph> value that is less than or equal to
        the lightest mammal, and so on until both the heaviest bird and heaviest mammal have a <codeph>CUME_DIST()</codeph>
        value of 1.
      </p>

<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) from animals
+------------+--------+-----------------------+
| name       | kind   | cume_dist() OVER(...) |
+------------+--------+-----------------------+
| Ostrich    | Bird   | 1                     |
| Condor     | Bird   | 0.6666666666666666    |
| Owl        | Bird   | 0.3333333333333333    |
| Elephant   | Mammal | 1                     |
| Giraffe    | Mammal | 0.8333333333333334    |
| Polar bear | Mammal | 0.6666666666666666    |
| Horse      | Mammal | 0.5                   |
| Housecat   | Mammal | 0.3333333333333333    |
| Mouse      | Mammal | 0.1666666666666667    |
+------------+--------+-----------------------+
</codeblock>

      <p>
        We can reverse the ordering within each partition group by using an <codeph>ORDER BY ... DESC</codeph>
        clause within the <codeph>OVER()</codeph> clause. Now the lightest (smallest value of <codeph>kilos</codeph>)
        animal of each kind has a <codeph>CUME_DIST()</codeph> value of 1.
      </p>

<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos desc) from animals
+------------+--------+-----------------------+
| name       | kind   | cume_dist() OVER(...) |
+------------+--------+-----------------------+
| Owl        | Bird   | 1                     |
| Condor     | Bird   | 0.6666666666666666    |
| Ostrich    | Bird   | 0.3333333333333333    |
| Mouse      | Mammal | 1                     |
| Housecat   | Mammal | 0.8333333333333334    |
| Horse      | Mammal | 0.6666666666666666    |
| Polar bear | Mammal | 0.5                   |
| Giraffe    | Mammal | 0.3333333333333333    |
| Elephant   | Mammal | 0.1666666666666667    |
+------------+--------+-----------------------+
</codeblock>

      <p>
        The following example manufactures some rows with identical values in the <codeph>kilos</codeph> column,
        to demonstrate how the results look in case of tie values. For simplicity, it only shows the <codeph>CUME_DIST()</codeph>
        sequence for the <q>Bird</q> rows. Now with 3 rows all with a value of 15, all of those rows have the same
        <codeph>CUME_DIST()</codeph> value. 4/5 of the rows have a value for <codeph>kilos</codeph> that is less than or
        equal to 15.
      </p>

<codeblock>insert into animals values ('California Condor', 'Bird', 15), ('Andean Condor', 'Bird', 15)

select name, kind, cume_dist() over (order by kilos) from animals where kind = 'Bird';
+-------------------+------+-----------------------+
| name              | kind | cume_dist() OVER(...) |
+-------------------+------+-----------------------+
| Ostrich           | Bird | 1                     |
| Condor            | Bird | 0.8                   |
| California Condor | Bird | 0.8                   |
| Andean Condor     | Bird | 0.8                   |
| Owl               | Bird | 0.2                   |
+-------------------+------+-----------------------+
</codeblock>

      <p>
        The following example shows how to use an <codeph>ORDER BY</codeph> clause in the outer block
        to order the result set in case of ties. Here, all the <q>Bird</q> rows are together, then in descending order
        by the result of the <codeph>CUME_DIST()</codeph> function, and all tied <codeph>CUME_DIST()</codeph>
        values are ordered by the animal name.
      </p>

<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) as ordering
  from animals
where
  kind = 'Bird'
order by kind, ordering desc, name;
+-------------------+------+----------+
| name              | kind | ordering |
+-------------------+------+----------+
| Ostrich           | Bird | 1        |
| Andean Condor     | Bird | 0.8      |
| California Condor | Bird | 0.8      |
| Condor            | Bird | 0.8      |
| Owl               | Bird | 0.2      |
+-------------------+------+----------+
</codeblock>

    </conbody>

  </concept>

  <concept rev="2.0.0" id="dense_rank">

    <title>DENSE_RANK Function</title>

    <conbody>

      <p>
        Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers
        for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values
        for the <q>tied</q> input values, the function continues the sequence with the next higher integer.
        Therefore, the sequence contains duplicates but no gaps when the input contains duplicates. Starts the
        sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>DENSE_RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>

      <p>
        The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
        window clause is not allowed.
      </p>

<!-- Can make the text for ROW_NUMBER, RANK, and DENSE_RANK identical
     so it can be conref'ed in all 3 places. -->

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including
        all the items with the 10 highest values, even if several items tied for 1st place.
      </p>

      <p>
        Similar to <codeph>ROW_NUMBER</codeph> and <codeph>RANK</codeph>. These functions differ in how they treat
        duplicate combinations of values.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example demonstrates how the <codeph>DENSE_RANK()</codeph> function identifies where each
        value <q>places</q> in the result set, producing the same result for duplicate values, but with a strict
        sequence from 1 to the number of groups. For example, when results are ordered by the <codeph>X</codeph>
        column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph> values are tied for
        second; and so on.
      </p>

<codeblock>select x, dense_rank() over(order by x) as rank, property from int_t;
+----+------+----------+
| x  | rank | property |
+----+------+----------+
| 1  | 1    | square   |
| 1  | 1    | odd      |
| 2  | 2    | even     |
| 2  | 2    | prime    |
| 3  | 3    | prime    |
| 3  | 3    | odd      |
| 4  | 4    | even     |
| 4  | 4    | square   |
| 5  | 5    | odd      |
| 5  | 5    | prime    |
| 6  | 6    | even     |
| 6  | 6    | perfect  |
| 7  | 7    | lucky    |
| 7  | 7    | lucky    |
| 7  | 7    | lucky    |
| 7  | 7    | odd      |
| 7  | 7    | prime    |
| 8  | 8    | even     |
| 9  | 9    | square   |
| 9  | 9    | odd      |
| 10 | 10   | round    |
| 10 | 10   | even     |
+----+------+----------+
</codeblock>

      <p>
        The following examples show how the <codeph>DENSE_RANK()</codeph> function is affected by the
        <codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause.
      </p>

      <p>
        Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together,
        and <codeph>DENSE_RANK()</codeph> returns the place of each value within the group, producing several
        ascending sequences.
      </p>

<codeblock>select x, dense_rank() over(partition by property order by x) as rank, property from int_t;
+----+------+----------+
| x  | rank | property |
+----+------+----------+
| 2  | 1    | even     |
| 4  | 2    | even     |
| 6  | 3    | even     |
| 8  | 4    | even     |
| 10 | 5    | even     |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 1  | 1    | odd      |
| 3  | 2    | odd      |
| 5  | 3    | odd      |
| 7  | 4    | odd      |
| 9  | 5    | odd      |
| 6  | 1    | perfect  |
| 2  | 1    | prime    |
| 3  | 2    | prime    |
| 5  | 3    | prime    |
| 7  | 4    | prime    |
| 10 | 1    | round    |
| 1  | 1    | square   |
| 4  | 2    | square   |
| 9  | 3    | square   |
+----+------+----------+
</codeblock>

      <p>
        Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the
        place each each value within the group; because each value occurs only 1 or 2 times,
        <codeph>DENSE_RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its
        group.
      </p>

<codeblock>select x, dense_rank() over(partition by x order by property) as rank, property from int_t;
+----+------+----------+
| x  | rank | property |
+----+------+----------+
| 1  | 1    | odd      |
| 1  | 2    | square   |
| 2  | 1    | even     |
| 2  | 2    | prime    |
| 3  | 1    | odd      |
| 3  | 2    | prime    |
| 4  | 1    | even     |
| 4  | 2    | square   |
| 5  | 1    | odd      |
| 5  | 2    | prime    |
| 6  | 1    | even     |
| 6  | 2    | perfect  |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 7  | 2    | odd      |
| 7  | 3    | prime    |
| 8  | 1    | even     |
| 9  | 1    | odd      |
| 9  | 2    | square   |
| 10 | 1    | even     |
| 10 | 2    | round    |
+----+------+----------+
</codeblock>

      <p>
        The following example shows how <codeph>DENSE_RANK()</codeph> produces a continuous sequence while still
        allowing for ties. In this case, Croesus and Midas both have the second largest fortune, while Crassus has
        the third largest. (In <xref href="impala_analytic_functions.xml#rank"/>, you see a similar query with the
        <codeph>RANK()</codeph> function that shows that while Crassus has the third largest fortune, he is the
        fourth richest person.)
      </p>

<codeblock>select dense_rank() over (order by net_worth desc) as placement, name, net_worth from wealth order by placement, name;
+-----------+---------+---------------+
| placement | name    | net_worth     |
+-----------+---------+---------------+
| 1         | Solomon | 2000000000.00 |
| 2         | Croesus | 1000000000.00 |
| 2         | Midas   | 1000000000.00 |
| 3         | Crassus | 500000000.00  |
| 4         | Scrooge | 80000000.00   |
+-----------+---------+---------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/related_info"/>

      <p>
        <xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#row_number"/>
      </p>

    </conbody>

  </concept>

  <concept rev="2.0.0" id="first_value">

    <title>FIRST_VALUE Function</title>

    <conbody>

      <p>
        Returns the expression value from the first row in the window. The return value is <codeph>NULL</codeph> if
        the input expression is <codeph>NULL</codeph>.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>FIRST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock>

      <p>
        The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
        window clause is optional.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result
        of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to
        ensure consistent ordering.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example shows a table with a wide variety of country-appropriate greetings. For consistency,
        we want to standardize on a single greeting for each country. The <codeph>FIRST_VALUE()</codeph> function
        helps to produce a mail merge report where every person from the same country is addressed with the same
        greeting.
      </p>

<codeblock>select name, country, greeting from mail_merge
+---------+---------+--------------+
| name    | country | greeting     |
+---------+---------+--------------+
| Pete    | USA     | Hello        |
| John    | USA     | Hi           |
| Boris   | Germany | Guten tag    |
| Michael | Germany | Guten morgen |
| Bjorn   | Sweden  | Hej          |
| Mats    | Sweden  | Tja          |
+---------+---------+--------------+

select country, name,
  first_value(greeting)
    over (partition by country order by name, greeting) as greeting
  from mail_merge;
+---------+---------+-----------+
| country | name    | greeting  |
+---------+---------+-----------+
| Germany | Boris   | Guten tag |
| Germany | Michael | Guten tag |
| Sweden  | Bjorn   | Hej       |
| Sweden  | Mats    | Hej       |
| USA     | John    | Hi        |
| USA     | Pete    | Hi        |
+---------+---------+-----------+
</codeblock>

      <p>
        Changing the order in which the names are evaluated changes which greeting is applied to each group.
      </p>

<codeblock>select country, name,
  first_value(greeting)
    over (partition by country order by name desc, greeting) as greeting
  from mail_merge;
+---------+---------+--------------+
| country | name    | greeting     |
+---------+---------+--------------+
| Germany | Michael | Guten morgen |
| Germany | Boris   | Guten morgen |
| Sweden  | Mats    | Tja          |
| Sweden  | Bjorn   | Tja          |
| USA     | Pete    | Hello        |
| USA     | John    | Hello        |
+---------+---------+--------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/related_info"/>

      <p>
        <xref href="impala_analytic_functions.xml#last_value"/>
      </p>

    </conbody>

  </concept>

  <concept rev="2.0.0" id="lag">

    <title>LAG Function</title>

    <conbody>

      <p>
        This function returns the value of an expression using column values from a preceding row. You specify an
        integer offset, which designates a row position some number of rows previous to the current row. Any column
        references in the expression argument refer to column values from that prior row. Typically, the table
        contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>LAG (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>])
  OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>

      <p>
        The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
        window clause is not allowed.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        Sometimes used an an alternative to doing a self-join.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example uses the same stock data created in <xref href="#window_clause"/>. For each day, the
        query prints the closing price alongside the previous day's closing price. The first row for each stock
        symbol has no previous row, so that <codeph>LAG()</codeph> value is <codeph>NULL</codeph>.
      </p>

<codeblock>select stock_symbol, closing_date, closing_price,
    lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing"
  from stock_ticker
    order by closing_date;
+--------------+---------------------+---------------+-------------------+
| stock_symbol | closing_date        | closing_price | yesterday closing |
+--------------+---------------------+---------------+-------------------+
| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL              |
| JDR          | 2014-09-14 00:00:00 | 12.89         | 12.86             |
| JDR          | 2014-09-15 00:00:00 | 12.94         | 12.89             |
| JDR          | 2014-09-16 00:00:00 | 12.55         | 12.94             |
| JDR          | 2014-09-17 00:00:00 | 14.03         | 12.55             |
| JDR          | 2014-09-18 00:00:00 | 14.75         | 14.03             |
| JDR          | 2014-09-19 00:00:00 | 13.98         | 14.75             |
+--------------+---------------------+---------------+-------------------+
</codeblock>

      <p>
        The following example does an arithmetic operation between the current row and a value from the previous
        row, to produce a delta value for each day. This example also demonstrates how <codeph>ORDER BY</codeph>
        works independently in the different parts of the query. The <codeph>ORDER BY closing_date</codeph> in the
        <codeph>OVER</codeph> clause makes the query analyze the rows in chronological order. Then the outer query
        block uses <codeph>ORDER BY closing_date DESC</codeph> to present the results with the most recent date
        first.
      </p>

<codeblock>select stock_symbol, closing_date, closing_price,
    cast(
      closing_price - lag(closing_price,1) over
        (partition by stock_symbol order by closing_date)
      as decimal(8,2)
    )
    as "change from yesterday"
  from stock_ticker
    order by closing_date desc;
+--------------+---------------------+---------------+-----------------------+
| stock_symbol | closing_date        | closing_price | change from yesterday |
+--------------+---------------------+---------------+-----------------------+
| JDR          | 2014-09-19 00:00:00 | 13.98         | -0.76                 |
| JDR          | 2014-09-18 00:00:00 | 14.75         | 0.72                  |
| JDR          | 2014-09-17 00:00:00 | 14.03         | 1.47                  |
| JDR          | 2014-09-16 00:00:00 | 12.55         | -0.38                 |
| JDR          | 2014-09-15 00:00:00 | 12.94         | 0.04                  |
| JDR          | 2014-09-14 00:00:00 | 12.89         | 0.03                  |
| JDR          | 2014-09-13 00:00:00 | 12.86         | NULL                  |
+--------------+---------------------+---------------+-----------------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/related_info"/>

      <p>
        This function is the converse of <xref href="impala_analytic_functions.xml#lead"/>.
      </p>

    </conbody>

  </concept>

  <concept rev="2.0.0" id="last_value">

    <title>LAST_VALUE Function</title>

    <conbody>

      <p>
        Returns the expression value from the last row in the window. This same value is repeated for all result
        rows for the group. The return value is <codeph>NULL</codeph> if the input expression is
        <codeph>NULL</codeph>.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>LAST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock>

      <p>
        The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
        window clause is optional.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result
        of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to
        ensure consistent ordering.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example uses the same <codeph>MAIL_MERGE</codeph> table as in the example for
        <xref href="impala_analytic_functions.xml#first_value"/>. Because the default window when <codeph>ORDER
        BY</codeph> is used is <codeph>BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, the query requires the
        <codeph>UNBOUNDED FOLLOWING</codeph> to look ahead to subsequent rows and find the last value for each
        country.
      </p>

<codeblock>select country, name,
  last_value(greeting) over (
    partition by country order by name, greeting
    rows between unbounded preceding and unbounded following
  ) as greeting
  from mail_merge
+---------+---------+--------------+
| country | name    | greeting     |
+---------+---------+--------------+
| Germany | Boris   | Guten morgen |
| Germany | Michael | Guten morgen |
| Sweden  | Bjorn   | Tja          |
| Sweden  | Mats    | Tja          |
| USA     | John    | Hello        |
| USA     | Pete    | Hello        |
+---------+---------+--------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/related_info"/>

      <p>
        <xref href="impala_analytic_functions.xml#first_value"/>
      </p>

    </conbody>

  </concept>

  <concept rev="2.0.0" id="lead">

    <title>LEAD Function</title>

    <conbody>

      <p>
        This function returns the value of an expression using column values from a following row. You specify an
        integer offset, which designates a row position some number of rows after to the current row. Any column
        references in the expression argument refer to column values from that later row. Typically, the table
        contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>LEAD (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>])
  OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>

      <p>
        The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
        window clause is not allowed.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        Sometimes used an an alternative to doing a self-join.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example uses the same stock data created in <xref href="#window_clause"/>. The query analyzes
        the closing price for a stock symbol, and for each day evaluates if the closing price for the following day
        is higher or lower.
      </p>

<codeblock>select stock_symbol, closing_date, closing_price,
  case
    (lead(closing_price,1)
      over (partition by stock_symbol order by closing_date)
        - closing_price) &gt; 0
    when true then "higher"
    when false then "flat or lower"
  end as "trending"
from stock_ticker
  order by closing_date;
+--------------+---------------------+---------------+---------------+
| stock_symbol | closing_date        | closing_price | trending      |
+--------------+---------------------+---------------+---------------+
| JDR          | 2014-09-13 00:00:00 | 12.86         | higher        |
| JDR          | 2014-09-14 00:00:00 | 12.89         | higher        |
| JDR          | 2014-09-15 00:00:00 | 12.94         | flat or lower |
| JDR          | 2014-09-16 00:00:00 | 12.55         | higher        |
| JDR          | 2014-09-17 00:00:00 | 14.03         | higher        |
| JDR          | 2014-09-18 00:00:00 | 14.75         | flat or lower |
| JDR          | 2014-09-19 00:00:00 | 13.98         | NULL          |
+--------------+---------------------+---------------+---------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/related_info"/>

      <p>
        This function is the converse of <xref href="impala_analytic_functions.xml#lag"/>.
      </p>

    </conbody>

  </concept>

  <concept id="max_analytic">

    <title>MAX Function - Analytic Context</title>

    <conbody>

      <p>
        You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
        function. See <xref href="impala_max.xml#max"/> for details and examples.
      </p>

    </conbody>

  </concept>

  <concept id="min_analytic">

    <title>MIN Function - Analytic Context</title>

    <conbody>

      <p>
        You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
        function. See <xref href="impala_min.xml#min"/> for details and examples.
      </p>

    </conbody>

  </concept>

  <concept audience="hidden" rev="2.x.x" id="nth_value">

    <title>NTH_VALUE Function</title>

    <conbody>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

    </conbody>

  </concept>

  <concept rev="2.3.0" id="ntile">

    <title>NTILE Function (<keyword keyref="impala23"/> or higher only)</title>

    <conbody>

      <p>
        Returns the <q>bucket number</q> associated with each row, between 1 and the value of an expression. For
        example, creating 100 buckets puts the lowest 1% of values in the first bucket, while creating 10 buckets
        puts the lowest 10% of values in the first bucket. Each partition can have a different number of buckets.
<!-- What's the syntax or data distribution that would create a different number of buckets per partition? -->
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>NTILE (<varname>expr</varname> [, <varname>offset</varname> ...]
  OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>

      <p>
        The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
        window clause is not allowed.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        The <q>ntile</q> name is derived from the practice of dividing result sets into fourths (quartile), tenths
        (decile), and so on. The <codeph>NTILE()</codeph> function divides the result set based on an arbitrary
        percentile value.
      </p>

      <p>
        The number of buckets must be a positive integer.
      </p>

      <p>
        The number of items in each bucket is identical or almost so, varying by at most 1. If the number of items
        does not divide evenly between the buckets, the remaining N items are divided evenly among the first N
        buckets.
      </p>

      <p>
        If the number of buckets N is greater than the number of input rows in the partition, then the first N
        buckets each contain one item, and the remaining buckets are empty.
      </p>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example shows divides groups of animals into 4 buckets based on their weight. The
        <codeph>ORDER BY ... DESC</codeph> clause in the <codeph>OVER()</codeph> clause means that the heaviest 25%
        are in the first group, and the lightest 25% are in the fourth group. (The <codeph>ORDER BY</codeph> in the
        outermost part of the query shows how you can order the final result set independently from the order in
        which the rows are evaluated by the <codeph>OVER()</codeph> clause.) Because there are 9 rows in the group,
        divided into 4 buckets, the first bucket receives the extra item.
      </p>

<codeblock>create table animals (name string, kind string, kilos decimal(9,3));

insert into animals values
  ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020),
  ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5),
  ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5);

select name, ntile(4) over (order by kilos desc) as quarter
  from animals
order by quarter desc;
+------------+---------+
| name       | quarter |
+------------+---------+
| Owl        | 4       |
| Mouse      | 4       |
| Condor     | 3       |
| Housecat   | 3       |
| Horse      | 2       |
| Ostrich    | 2       |
| Elephant   | 1       |
| Giraffe    | 1       |
| Polar bear | 1       |
+------------+---------+
</codeblock>

      <p>
        The following examples show how the <codeph>PARTITION</codeph> clause works for the
        <codeph>NTILE()</codeph> function. Here, we divide each kind of animal (mammal or bird) into 2 buckets,
        the heavier half and the lighter half.
      </p>

<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half
  from animals
order by kind;
+------------+--------+------+
| name       | kind   | half |
+------------+--------+------+
| Ostrich    | Bird   | 1    |
| Condor     | Bird   | 1    |
| Owl        | Bird   | 2    |
| Elephant   | Mammal | 1    |
| Giraffe    | Mammal | 1    |
| Polar bear | Mammal | 1    |
| Horse      | Mammal | 2    |
| Housecat   | Mammal | 2    |
| Mouse      | Mammal | 2    |
+------------+--------+------+
</codeblock>

      <p>
        Again, the result set can be ordered independently
        from the analytic evaluation. This next example lists all the animals heaviest to lightest,
        showing that elephant and giraffe are in the <q>top half</q> of mammals by weight, while
        housecat and mouse are in the <q>bottom half</q>.
      </p>

<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half
  from animals
order by kilos desc;
+------------+--------+------+
| name       | kind   | half |
+------------+--------+------+
| Elephant   | Mammal | 1    |
| Giraffe    | Mammal | 1    |
| Polar bear | Mammal | 1    |
| Horse      | Mammal | 2    |
| Ostrich    | Bird   | 1    |
| Condor     | Bird   | 1    |
| Housecat   | Mammal | 2    |
| Owl        | Bird   | 2    |
| Mouse      | Mammal | 2    |
+------------+--------+------+
</codeblock>

    </conbody>

  </concept>

  <concept rev="2.3.0" id="percent_rank">

    <title>PERCENT_RANK Function (<keyword keyref="impala23"/> or higher only)</title>

    <conbody>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>PERCENT_RANK (<varname>expr</varname>)
  OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)
</codeblock>

      <p>
      Calculates the rank, expressed as a percentage, of each row within a group of rows.
      If <codeph>rank</codeph> is the value for that same row from the <codeph>RANK()</codeph> function (from 1 to the total number of rows in the partition group),
      then the <codeph>PERCENT_RANK()</codeph> value is calculated as <codeph>(<varname>rank</varname> - 1) / (<varname>rows_in_group</varname> - 1)</codeph> .
      If there is only a single item in the partition group, its <codeph>PERCENT_RANK()</codeph> value is 0.
      </p>

      <p>
        The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
        window clause is not allowed.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        This function is similar to the <codeph>RANK</codeph> and <codeph>CUME_DIST()</codeph> functions: it returns an ascending sequence representing the position of each
        row within the rows of the same partition group. The actual numeric sequence is calculated differently,
        and the handling of duplicate (tied) values is different.
      </p>

      <p>
        The return values range from 0 to 1 inclusive.
        The first row in each partition group always has the value 0.
        A <codeph>NULL</codeph> value is considered the lowest possible value.
        In the case of duplicate input values, all the corresponding rows in the result set
        have an identical value: the lowest <codeph>PERCENT_RANK()</codeph> value of those
        tied rows. (In contrast to <codeph>CUME_DIST()</codeph>, where all tied rows have
        the highest <codeph>CUME_DIST()</codeph> value.)
      </p>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example uses the same <codeph>ANIMALS</codeph> table as the examples for <codeph>CUME_DIST()</codeph>
        and <codeph>NTILE()</codeph>, with a few additional rows to illustrate the results where some values are
        <codeph>NULL</codeph> or there is only a single row in a partition group.
      </p>

<codeblock>insert into animals values ('Komodo dragon', 'Reptile', 70);
insert into animals values ('Unicorn', 'Mythical', NULL);
insert into animals values ('Fire-breathing dragon', 'Mythical', NULL);
</codeblock>

      <p>
        As with <codeph>CUME_DIST()</codeph>, there is an ascending sequence for each kind of animal.
        For example, the <q>Birds</q> and <q>Mammals</q> rows each have a <codeph>PERCENT_RANK()</codeph> sequence
        that ranges from 0 to 1.
        The <q>Reptile</q> row has a <codeph>PERCENT_RANK()</codeph> of 0 because that partition group contains only a single item.
        Both <q>Mythical</q> animals have a <codeph>PERCENT_RANK()</codeph> of 0 because
        a <codeph>NULL</codeph> is considered the lowest value within its partition group.
      </p>

<codeblock>select name, kind, percent_rank() over (partition by kind order by kilos) from animals;
+-----------------------+----------+--------------------------+
| name                  | kind     | percent_rank() OVER(...) |
+-----------------------+----------+--------------------------+
| Mouse                 | Mammal   | 0                        |
| Housecat              | Mammal   | 0.2                      |
| Horse                 | Mammal   | 0.4                      |
| Polar bear            | Mammal   | 0.6                      |
| Giraffe               | Mammal   | 0.8                      |
| Elephant              | Mammal   | 1                        |
| Komodo dragon         | Reptile  | 0                        |
| Owl                   | Bird     | 0                        |
| California Condor     | Bird     | 0.25                     |
| Andean Condor         | Bird     | 0.25                     |
| Condor                | Bird     | 0.25                     |
| Ostrich               | Bird     | 1                        |
| Fire-breathing dragon | Mythical | 0                        |
| Unicorn               | Mythical | 0                        |
+-----------------------+----------+--------------------------+
</codeblock>
    </conbody>

  </concept>

  <concept rev="2.0.0" id="rank">

    <title>RANK Function</title>

    <conbody>

      <p>
        Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers
        for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values
        for the <q>tied</q> input values, the function increments the sequence by the number of tied values.
        Therefore, the sequence contains both duplicates and gaps when the input contains duplicates. Starts the
        sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>

      <p>
        The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
        window clause is not allowed.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

<!-- Make a little tutorial to show these 3 functions side-by-side and illustrate their difference. -->

      <p>
        Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including
        several items that were tied for 10th place.
      </p>

      <p>
        Similar to <codeph>ROW_NUMBER</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they
        treat duplicate combinations of values.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example demonstrates how the <codeph>RANK()</codeph> function identifies where each value
        <q>places</q> in the result set, producing the same result for duplicate values, and skipping values in the
        sequence to account for the number of duplicates. For example, when results are ordered by the
        <codeph>X</codeph> column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph>
        values are tied for third; and so on.
      </p>

<codeblock>select x, rank() over(order by x) as rank, property from int_t;
+----+------+----------+
| x  | rank | property |
+----+------+----------+
| 1  | 1    | square   |
| 1  | 1    | odd      |
| 2  | 3    | even     |
| 2  | 3    | prime    |
| 3  | 5    | prime    |
| 3  | 5    | odd      |
| 4  | 7    | even     |
| 4  | 7    | square   |
| 5  | 9    | odd      |
| 5  | 9    | prime    |
| 6  | 11   | even     |
| 6  | 11   | perfect  |
| 7  | 13   | lucky    |
| 7  | 13   | lucky    |
| 7  | 13   | lucky    |
| 7  | 13   | odd      |
| 7  | 13   | prime    |
| 8  | 18   | even     |
| 9  | 19   | square   |
| 9  | 19   | odd      |
| 10 | 21   | round    |
| 10 | 21   | even     |
+----+------+----------+
</codeblock>

      <p>
        The following examples show how the <codeph>RANK()</codeph> function is affected by the
        <codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause.
      </p>

      <p>
        Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together,
        and <codeph>RANK()</codeph> returns the place of each value within the group, producing several ascending
        sequences.
      </p>

<codeblock>select x, rank() over(partition by property order by x) as rank, property from int_t;
+----+------+----------+
| x  | rank | property |
+----+------+----------+
| 2  | 1    | even     |
| 4  | 2    | even     |
| 6  | 3    | even     |
| 8  | 4    | even     |
| 10 | 5    | even     |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 1  | 1    | odd      |
| 3  | 2    | odd      |
| 5  | 3    | odd      |
| 7  | 4    | odd      |
| 9  | 5    | odd      |
| 6  | 1    | perfect  |
| 2  | 1    | prime    |
| 3  | 2    | prime    |
| 5  | 3    | prime    |
| 7  | 4    | prime    |
| 10 | 1    | round    |
| 1  | 1    | square   |
| 4  | 2    | square   |
| 9  | 3    | square   |
+----+------+----------+
</codeblock>

      <p>
        Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the
        place each each value within the group; because each value occurs only 1 or 2 times,
        <codeph>RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its
        group.
      </p>

<codeblock>select x, rank() over(partition by x order by property) as rank, property from int_t;
+----+------+----------+
| x  | rank | property |
+----+------+----------+
| 1  | 1    | odd      |
| 1  | 2    | square   |
| 2  | 1    | even     |
| 2  | 2    | prime    |
| 3  | 1    | odd      |
| 3  | 2    | prime    |
| 4  | 1    | even     |
| 4  | 2    | square   |
| 5  | 1    | odd      |
| 5  | 2    | prime    |
| 6  | 1    | even     |
| 6  | 2    | perfect  |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 7  | 1    | lucky    |
| 7  | 4    | odd      |
| 7  | 5    | prime    |
| 8  | 1    | even     |
| 9  | 1    | odd      |
| 9  | 2    | square   |
| 10 | 1    | even     |
| 10 | 2    | round    |
+----+------+----------+
</codeblock>

      <p>
        The following example shows how a magazine might prepare a list of history's wealthiest people. Croesus and
        Midas are tied for second, then Crassus is fourth.
      </p>

<codeblock>select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name;
+------+---------+---------------+
| rank | name    | net_worth     |
+------+---------+---------------+
| 1    | Solomon | 2000000000.00 |
| 2    | Croesus | 1000000000.00 |
| 2    | Midas   | 1000000000.00 |
| 4    | Crassus | 500000000.00  |
| 5    | Scrooge | 80000000.00   |
+------+---------+---------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/related_info"/>

      <p>
        <xref href="impala_analytic_functions.xml#dense_rank"/>,
        <xref href="impala_analytic_functions.xml#row_number"/>
      </p>

    </conbody>

  </concept>

  <concept rev="2.0.0" id="row_number">

    <title>ROW_NUMBER Function</title>

    <conbody>

      <p>
        Returns an ascending sequence of integers, starting with 1. Starts the sequence over for each group
        produced by the <codeph>PARTITIONED BY</codeph> clause. The output sequence includes different values for
        duplicate input values. Therefore, the sequence never contains any duplicates or gaps, regardless of
        duplicate input values.
      </p>

      <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>ROW_NUMBER() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>

      <p>
        The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
        window clause is not allowed.
      </p>

      <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

      <p>
        Often used for top-N and bottom-N queries where the input values are known to be unique, or precisely N
        rows are needed regardless of duplicate values.
      </p>

      <p>
        Because its result value is different for each row in the result set (when used without a <codeph>PARTITION
        BY</codeph> clause), <codeph>ROW_NUMBER()</codeph> can be used to synthesize unique numeric ID values, for
        example for result sets involving unique values or tuples.
      </p>

      <p>
        Similar to <codeph>RANK</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they treat
        duplicate combinations of values.
      </p>

      <p conref="../shared/impala_common.xml#common/added_in_20"/>

      <p conref="../shared/impala_common.xml#common/example_blurb"/>

      <p>
        The following example demonstrates how <codeph>ROW_NUMBER()</codeph> produces a continuous numeric
        sequence, even though some values of <codeph>X</codeph> are repeated.
      </p>

<codeblock>select x, row_number() over(order by x, property) as row_number, property from int_t;
+----+------------+----------+
| x  | row_number | property |
+----+------------+----------+
| 1  | 1          | odd      |
| 1  | 2          | square   |
| 2  | 3          | even     |
| 2  | 4          | prime    |
| 3  | 5          | odd      |
| 3  | 6          | prime    |
| 4  | 7          | even     |
| 4  | 8          | square   |
| 5  | 9          | odd      |
| 5  | 10         | prime    |
| 6  | 11         | even     |
| 6  | 12         | perfect  |
| 7  | 13         | lucky    |
| 7  | 14         | lucky    |
| 7  | 15         | lucky    |
| 7  | 16         | odd      |
| 7  | 17         | prime    |
| 8  | 18         | even     |
| 9  | 19         | odd      |
| 9  | 20         | square   |
| 10 | 21         | even     |
| 10 | 22         | round    |
+----+------------+----------+
</codeblock>

      <p>
        The following example shows how a financial institution might assign customer IDs to some of history's
        wealthiest figures. Although two of the people have identical net worth figures, unique IDs are required
        for this purpose. <codeph>ROW_NUMBER()</codeph> produces a sequence of five different values for the five
        input rows.
      </p>

<codeblock>select row_number() over (order by net_worth desc) as account_id, name, net_worth
  from wealth order by account_id, name;
+------------+---------+---------------+
| account_id | name    | net_worth     |
+------------+---------+---------------+
| 1          | Solomon | 2000000000.00 |
| 2          | Croesus | 1000000000.00 |
| 3          | Midas   | 1000000000.00 |
| 4          | Crassus | 500000000.00  |
| 5          | Scrooge | 80000000.00   |
+------------+---------+---------------+
</codeblock>

      <p conref="../shared/impala_common.xml#common/related_info"/>

      <p>
        <xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#dense_rank"/>
      </p>

    </conbody>

  </concept>

  <concept id="sum_analytic">

    <title>SUM Function - Analytic Context</title>

    <conbody>

      <p>
        You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
        function. See <xref href="impala_sum.xml#sum"/> for details and examples.
      </p>

    </conbody>

  </concept>

</concept>