mirror of
https://github.com/apache/impala.git
synced 2025-12-30 03:01:44 -05:00
For this change to land in master, the audience="hidden" code review needs to be completed first. Otherwise, the doc build would still work but the audience="hidden" content would be visible rather than hidden as desired. Some work happening in parallel might introduce additional instances of audience="Cloudera". I suggest addressing those in a followup CR so this global change can land quickly. Since the changes apply across so many different files, but are so narrow in scope, I suggest that the way to validate (check that no extraneous changes were introduced accidentally) is to diff just the changed lines: git diff -U0 HEAD^ HEAD In patch set 2, I updated other topics marked audience="Cloudera" by CRs that were pushed in the meantime. Change-Id: Ic93d89da77e1f51bbf548a522d98d0c4e2fb31c8 Reviewed-on: http://gerrit.cloudera.org:8080/5613 Reviewed-by: John Russell <jrussell@cloudera.com> Tested-by: Impala Public Jenkins
1758 lines
69 KiB
XML
1758 lines
69 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept rev="2.0.0" id="analytic_functions">
|
|
|
|
<title>Impala Analytic Functions</title>
|
|
|
|
<titlealts audience="PDF">
|
|
|
|
<navtitle>Analytic Functions</navtitle>
|
|
|
|
</titlealts>
|
|
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="SQL"/>
|
|
<data name="Category" value="Impala Functions"/>
|
|
<data name="Category" value="Aggregate Functions"/>
|
|
<data name="Category" value="Analytic Functions"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="Querying"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p rev="2.0.0">
|
|
<indexterm audience="hidden">analytic functions</indexterm>
|
|
|
|
<indexterm audience="hidden">window functions</indexterm>
|
|
Analytic functions (also known as window functions) are a special category of built-in functions. Like
|
|
aggregate functions, they examine the contents of multiple input rows to compute each output value. However,
|
|
rather than being limited to one result value per <codeph>GROUP BY</codeph> group, they operate on
|
|
<term>windows</term> where the input rows are ordered and grouped using flexible conditions expressed through
|
|
an <codeph>OVER()</codeph> clause.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<!--
|
|
<p>
|
|
Analytic functions produce one output value for each input row, like scalar functions such as
|
|
<codeph>length()</codeph> or
|
|
<codeph>substr()</codeph>.
|
|
</p>
|
|
-->
|
|
|
|
<p>
|
|
Some functions, such as <codeph>LAG()</codeph> and <codeph>RANK()</codeph>, can only be used in this analytic
|
|
context. Some aggregate functions do double duty: when you call the aggregation functions such as
|
|
<codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>AVG()</codeph>, and so on with an
|
|
<codeph>OVER()</codeph> clause, they produce an output value for each row, based on computations across other
|
|
rows in the window.
|
|
</p>
|
|
|
|
<p>
|
|
Although analytic functions often compute the same value you would see from an aggregate function in a
|
|
<codeph>GROUP BY</codeph> query, the analytic functions produce a value for each row in the result set rather
|
|
than a single value for each group. This flexibility lets you include additional columns in the
|
|
<codeph>SELECT</codeph> list, offering more opportunities for organizing and filtering the result set.
|
|
</p>
|
|
|
|
<p>
|
|
Analytic function calls are only allowed in the <codeph>SELECT</codeph> list and in the outermost
|
|
<codeph>ORDER BY</codeph> clause of the query. During query processing, analytic functions are evaluated
|
|
after other query stages such as joins, <codeph>WHERE</codeph>, and <codeph>GROUP BY</codeph>,
|
|
</p>
|
|
|
|
<!-- Oracle doesn't show examples until it gets to the actual functions, so let's follow that lead.
|
|
<p>
|
|
The following example shows a very simple call to <codeph>MAX()</codeph> in
|
|
an analytic context, and a similar query using a <codeph>GROUP BY</codeph> clause.
|
|
</p>
|
|
-->
|
|
|
|
<!--
|
|
This basic query could be represented either as an analytic
|
|
function call or an aggregation function call in a <codeph>GROUP BY</codeph> query.
|
|
For more elaborate kinds of computations, the flexibility of the analytic window
|
|
makes that the preferred option.
|
|
-->
|
|
|
|
<!-- TK: construct sample data and fill in query results. -->
|
|
|
|
<!-- Other DB docs don't necessarily include examples up at this level, only for the individual functions.
|
|
So maybe take these placeholders out entirely.
|
|
|
|
<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year) FROM historical_temps;
|
|
SELECT year, month, max(degrees_c) FROM historical_temps GROUP BY year;
|
|
</codeblock>
|
|
-->
|
|
|
|
<p>
|
|
The rows that are part of each partition are analyzed by computations across an ordered or unordered set of
|
|
rows. For example, <codeph>COUNT()</codeph> and <codeph>SUM()</codeph> might be applied to all the rows in
|
|
the partition, in which case the order of analysis does not matter. The <codeph>ORDER BY</codeph> clause
|
|
might be used inside the <codeph>OVER()</codeph> clause to defines the ordering that applies to functions
|
|
such as <codeph>LAG()</codeph> and <codeph>FIRST_VALUE()</codeph>.
|
|
</p>
|
|
|
|
<!-- TK: output needed here also. -->
|
|
|
|
<!--
|
|
<codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year ORDER BY MONTH DESC) FROM historical_temps;
|
|
</codeblock>
|
|
-->
|
|
|
|
<p>
|
|
Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and
|
|
bucketed analysis for large data sets. You might also see the term <q>window functions</q> in database
|
|
literature, referring to the sequence of rows (the <q>window</q>) that the function call applies to,
|
|
particularly when the <codeph>OVER</codeph> clause includes a <codeph>ROWS</codeph> or <codeph>RANGE</codeph>
|
|
keyword.
|
|
</p>
|
|
|
|
<p>
|
|
The following sections describe the analytic query clauses and the pure analytic functions provided by
|
|
Impala. For usage information about aggregate functions in an analytic context, see
|
|
<xref href="impala_aggregate_functions.xml#aggregate_functions"/>.
|
|
</p>
|
|
|
|
<p outputclass="toc inpage"/>
|
|
|
|
</conbody>
|
|
|
|
<concept id="over">
|
|
|
|
<title>OVER Clause</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
The <codeph>OVER</codeph> clause is required for calls to pure analytic functions such as
|
|
<codeph>LEAD()</codeph>, <codeph>RANK()</codeph>, and <codeph>FIRST_VALUE()</codeph>. When you include an
|
|
<codeph>OVER</codeph> clause with calls to aggregate functions such as <codeph>MAX()</codeph>,
|
|
<codeph>COUNT()</codeph>, or <codeph>SUM()</codeph>, they operate as analytic functions.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>function(<varname>args</varname>) OVER([<varname>partition_by_clause</varname>] [<varname>order_by_clause</varname> [<varname>window_clause</varname>]])
|
|
|
|
partition_by_clause ::= PARTITION BY <varname>expr</varname> [, <varname>expr</varname> ...]
|
|
order_by_clause ::= ORDER BY <varname>expr</varname> [ASC | DESC] [NULLS FIRST | NULLS LAST] [, <varname>expr</varname> [ASC | DESC] [NULLS FIRST | NULLS LAST] ...]
|
|
window_clause: See <xref href="#window_clause">Window Clause</xref>
|
|
</codeblock>
|
|
|
|
<p>
|
|
<b>PARTITION BY clause:</b>
|
|
</p>
|
|
|
|
<p>
|
|
The <codeph>PARTITION BY</codeph> clause acts much like the <codeph>GROUP BY</codeph> clause in the
|
|
outermost block of a query. It divides the rows into groups containing identical values in one or more
|
|
columns. These logical groups are known as <term>partitions</term>. Throughout the discussion of analytic
|
|
functions, <q>partitions</q> refers to the groups produced by the <codeph>PARTITION BY</codeph> clause, not
|
|
to partitioned tables. However, note the following limitation that applies specifically to analytic function
|
|
calls involving partitioned tables.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/>
|
|
|
|
<p>
|
|
The sequence of results from an analytic function <q>resets</q> for each new partition in the result set.
|
|
That is, the set of preceding or following rows considered by the analytic function always come from a
|
|
single partition. Any <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>ROW_NUMBER()</codeph>, and so
|
|
on apply to each partition independently. Omit the <codeph>PARTITION BY</codeph> clause to apply the
|
|
analytic operation to all the rows in the table.
|
|
</p>
|
|
|
|
<p>
|
|
<b>ORDER BY clause:</b>
|
|
</p>
|
|
|
|
<p>
|
|
The <codeph>ORDER BY</codeph> clause works much like the <codeph>ORDER BY</codeph> clause in the outermost
|
|
block of a query. It defines the order in which rows are evaluated for the entire input set, or for each
|
|
group produced by a <codeph>PARTITION BY</codeph> clause. You can order by one or multiple expressions, and
|
|
for each expression optionally choose ascending or descending order and whether nulls come first or last in
|
|
the sort order. Because this <codeph>ORDER BY</codeph> clause only defines the order in which rows are
|
|
evaluated, if you want the results to be output in a specific order, also include an <codeph>ORDER
|
|
BY</codeph> clause in the outer block of the query.
|
|
</p>
|
|
|
|
<p>
|
|
When the <codeph>ORDER BY</codeph> clause is omitted, the analytic function applies to all items in the
|
|
group produced by the <codeph>PARTITION BY</codeph> clause. When the <codeph>ORDER BY</codeph> clause is
|
|
included, the analysis can apply to all or a subset of the items in the group, depending on the optional
|
|
window clause.
|
|
</p>
|
|
|
|
<p>
|
|
The order in which the rows are analyzed is only defined for those columns specified in <codeph>ORDER
|
|
BY</codeph> clauses.
|
|
</p>
|
|
|
|
<p>
|
|
One difference between the analytic and outer uses of the <codeph>ORDER BY</codeph> clause: inside the
|
|
<codeph>OVER</codeph> clause, <codeph>ORDER BY 1</codeph> or other integer value is interpreted as a
|
|
constant sort value (effectively a no-op) rather than referring to column 1.
|
|
</p>
|
|
|
|
<p>
|
|
<b>Window clause:</b>
|
|
</p>
|
|
|
|
<p>
|
|
The window clause is only allowed in combination with an <codeph>ORDER BY</codeph> clause. If the
|
|
<codeph>ORDER BY</codeph> clause is specified but the window clause is not, the default window is
|
|
<codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>. See
|
|
<xref href="impala_analytic_functions.xml#window_clause"/> for full details.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hbase_blurb"/>
|
|
|
|
<p>
|
|
Because HBase tables are optimized for single-row lookups rather than full scans, analytic functions using
|
|
the <codeph>OVER()</codeph> clause are not recommended for HBase tables. Although such queries work, their
|
|
performance is lower than on comparable tables using HDFS data files.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/parquet_blurb"/>
|
|
|
|
<p>
|
|
Analytic functions are very efficient for Parquet tables. The data that is examined during evaluation of
|
|
the <codeph>OVER()</codeph> clause comes from a specified set of columns, and the values for each column
|
|
are arranged sequentially within each data file.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/text_blurb"/>
|
|
|
|
<p>
|
|
Analytic functions are convenient to use with text tables for exploratory business intelligence. When the
|
|
volume of data is substantial, prefer to use Parquet tables for performance-critical analytic queries.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example shows how to synthesize a numeric sequence corresponding to all the rows in a table.
|
|
The new table has the same columns as the old one, plus an additional column <codeph>ID</codeph> containing
|
|
the integers 1, 2, 3, and so on, corresponding to the order of a <codeph>TIMESTAMP</codeph> column in the
|
|
original table.
|
|
</p>
|
|
|
|
<!-- TK: synthesize some data and fill in output here. -->
|
|
|
|
<codeblock>CREATE TABLE events_with_id AS
|
|
SELECT
|
|
row_number() OVER (ORDER BY date_and_time) AS id,
|
|
c1, c2, c3, c4
|
|
FROM events;
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example shows how to determine the number of rows containing each value for a column. Unlike
|
|
a corresponding <codeph>GROUP BY</codeph> query, this one can analyze a single column and still return all
|
|
values (not just the distinct ones) from the other columns.
|
|
</p>
|
|
|
|
<!-- TK: verify the 'unbounded' shortcut syntax. -->
|
|
|
|
<codeblock>SELECT x, y, z,
|
|
count() OVER (PARTITION BY x) AS how_many_x
|
|
FROM t1;
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
|
|
|
|
<p>
|
|
You cannot directly combine the <codeph>DISTINCT</codeph> operator with analytic function calls. You can
|
|
put the analytic function call in a <codeph>WITH</codeph> clause or an inline view, and apply the
|
|
<codeph>DISTINCT</codeph> operator to its result set.
|
|
</p>
|
|
|
|
<codeblock>WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1)
|
|
SELECT DISTINCT x, total FROM t1;
|
|
</codeblock>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="window_clause">
|
|
|
|
<title>Window Clause</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Certain analytic functions accept an optional <term>window clause</term>, which makes the function analyze
|
|
only certain rows <q>around</q> the current row rather than all rows in the partition. For example, you can
|
|
get a moving average by specifying some number of preceding and following rows, or a running count or
|
|
running total by specifying all rows up to the current position. This clause can result in different
|
|
analytic results for rows within the same partition.
|
|
</p>
|
|
|
|
<p>
|
|
The window clause is supported with the <codeph>AVG()</codeph>, <codeph>COUNT()</codeph>,
|
|
<codeph>FIRST_VALUE()</codeph>, <codeph>LAST_VALUE()</codeph>, and <codeph>SUM()</codeph> functions.
|
|
<!-- To do: fill in this factoid under MAX and MIN also. -->
|
|
For <codeph>MAX()</codeph> and <codeph>MIN()</codeph>, the window clause only allowed if the start bound is
|
|
<codeph>UNBOUNDED PRECEDING</codeph>
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>ROWS BETWEEN [ { <varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ]
|
|
RANGE BETWEEN [ {<varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ]</codeblock>
|
|
|
|
<p>
|
|
<codeph>ROWS BETWEEN</codeph> defines the size of the window in terms of the indexes of the rows in the
|
|
result set. The size of the window is predictable based on the clauses the position within the result set.
|
|
</p>
|
|
|
|
<p>
|
|
<codeph>RANGE BETWEEN</codeph> does not currently support numeric arguments to define a variable-size
|
|
sliding window.
|
|
<!--
|
|
Currently, it effectively works the same as the
|
|
equivalent <codeph>ROWS BETWEEN</codeph> clause.
|
|
-->
|
|
</p>
|
|
|
|
<!--
|
|
<p>
|
|
<codeph>RANGE BETWEEN</codeph> defines the size of the window based on arithmetic comparisons
|
|
of the values in the result set.
|
|
The size of the window varies depending on the order and distribution of values.
|
|
</p>
|
|
-->
|
|
|
|
<p>
|
|
Currently, Impala supports only some combinations of arguments to the <codeph>RANGE</codeph> clause:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> (the default when <codeph>ORDER
|
|
BY</codeph> is specified and the window clause is omitted)
|
|
</li>
|
|
|
|
<li>
|
|
<codeph>RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING</codeph>
|
|
</li>
|
|
|
|
<li>
|
|
<codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING</codeph>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
When <codeph>RANGE</codeph> is used, <codeph>CURRENT ROW</codeph> includes not just the current row but all
|
|
rows that are tied with the current row based on the <codeph>ORDER BY</codeph> expressions.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following examples show financial data for a fictional stock symbol <codeph>JDR</codeph>. The closing
|
|
price moves up and down each day.
|
|
</p>
|
|
|
|
<codeblock>create table stock_ticker (stock_symbol string, closing_price decimal(8,2), closing_date timestamp);
|
|
...load some data...
|
|
select * from stock_ticker order by stock_symbol, closing_date
|
|
+--------------+---------------+---------------------+
|
|
| stock_symbol | closing_price | closing_date |
|
|
+--------------+---------------+---------------------+
|
|
| JDR | 12.86 | 2014-10-02 00:00:00 |
|
|
| JDR | 12.89 | 2014-10-03 00:00:00 |
|
|
| JDR | 12.94 | 2014-10-04 00:00:00 |
|
|
| JDR | 12.55 | 2014-10-05 00:00:00 |
|
|
| JDR | 14.03 | 2014-10-06 00:00:00 |
|
|
| JDR | 14.75 | 2014-10-07 00:00:00 |
|
|
| JDR | 13.98 | 2014-10-08 00:00:00 |
|
|
+--------------+---------------+---------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The queries use analytic functions with window clauses to compute moving averages of the closing price. For
|
|
example, <codeph>ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING</codeph> produces an average of the value from a
|
|
3-day span, producing a different value for each row. The first row, which has no preceding row, only gets
|
|
averaged with the row following it. If the table contained more than one stock symbol, the
|
|
<codeph>PARTITION BY</codeph> clause would limit the window for the moving average to only consider the
|
|
prices for a single stock.
|
|
</p>
|
|
|
|
<codeblock>select stock_symbol, closing_date, closing_price,
|
|
avg(closing_price) over (partition by stock_symbol order by closing_date
|
|
rows between 1 preceding and 1 following) as moving_average
|
|
from stock_ticker;
|
|
+--------------+---------------------+---------------+----------------+
|
|
| stock_symbol | closing_date | closing_price | moving_average |
|
|
+--------------+---------------------+---------------+----------------+
|
|
| JDR | 2014-10-02 00:00:00 | 12.86 | 12.87 |
|
|
| JDR | 2014-10-03 00:00:00 | 12.89 | 12.89 |
|
|
| JDR | 2014-10-04 00:00:00 | 12.94 | 12.79 |
|
|
| JDR | 2014-10-05 00:00:00 | 12.55 | 13.17 |
|
|
| JDR | 2014-10-06 00:00:00 | 14.03 | 13.77 |
|
|
| JDR | 2014-10-07 00:00:00 | 14.75 | 14.25 |
|
|
| JDR | 2014-10-08 00:00:00 | 13.98 | 14.36 |
|
|
+--------------+---------------------+---------------+----------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The clause <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> produces a cumulative moving
|
|
average, from the earliest data up to the value for each day.
|
|
</p>
|
|
|
|
<codeblock>select stock_symbol, closing_date, closing_price,
|
|
avg(closing_price) over (partition by stock_symbol order by closing_date
|
|
rows between unbounded preceding and current row) as moving_average
|
|
from stock_ticker;
|
|
+--------------+---------------------+---------------+----------------+
|
|
| stock_symbol | closing_date | closing_price | moving_average |
|
|
+--------------+---------------------+---------------+----------------+
|
|
| JDR | 2014-10-02 00:00:00 | 12.86 | 12.86 |
|
|
| JDR | 2014-10-03 00:00:00 | 12.89 | 12.87 |
|
|
| JDR | 2014-10-04 00:00:00 | 12.94 | 12.89 |
|
|
| JDR | 2014-10-05 00:00:00 | 12.55 | 12.81 |
|
|
| JDR | 2014-10-06 00:00:00 | 14.03 | 13.05 |
|
|
| JDR | 2014-10-07 00:00:00 | 14.75 | 13.33 |
|
|
| JDR | 2014-10-08 00:00:00 | 13.98 | 13.42 |
|
|
+--------------+---------------------+---------------+----------------+
|
|
</codeblock>
|
|
|
|
<!-- Matt suggests not always true depending on data. Hiding until I can try myself.
|
|
<p>
|
|
The clause <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> would produce the same
|
|
output as above. Because <codeph>RANGE</codeph> currently does not support numeric offsets while
|
|
<codeph>ROWS</codeph> does, currently the <codeph>ROWS</codeph> syntax is more flexible.
|
|
</p>
|
|
-->
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="avg_analytic">
|
|
|
|
<title>AVG Function - Analytic Context</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
|
|
function. See <xref href="impala_avg.xml#avg"/> for details and examples.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="count_analytic">
|
|
|
|
<title>COUNT Function - Analytic Context</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
|
|
function. See <xref href="impala_count.xml#count"/> for details and examples.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.3.0" id="cume_dist">
|
|
|
|
<title>CUME_DIST Function (<keyword keyref="impala23"/> or higher only)</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0
|
|
and less than or equal to 1.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>CUME_DIST (<varname>expr</varname>)
|
|
OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)
|
|
</codeblock>
|
|
|
|
<p>
|
|
The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
Within each partition of the result set, the <codeph>CUME_DIST()</codeph> value represents an ascending
|
|
sequence that ends at 1. Each value represents the proportion of rows in the partition whose values are
|
|
less than or equal to the value in the current row.
|
|
</p>
|
|
|
|
<p>
|
|
If the sequence of input values contains ties, the <codeph>CUME_DIST()</codeph> results are identical for the
|
|
tied values.
|
|
</p>
|
|
|
|
<p>
|
|
Impala only supports the <codeph>CUME_DIST()</codeph> function in an analytic context, not as a regular
|
|
aggregate function.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
This example uses a table with 9 rows. The <codeph>CUME_DIST()</codeph>
|
|
function evaluates the entire table because there is no <codeph>PARTITION BY</codeph> clause,
|
|
with the rows ordered by the weight of the animal.
|
|
the sequence of values shows that 1/9 of the values are less than or equal to the lightest
|
|
animal (mouse), 2/9 of the values are less than or equal to the second-lightest animal,
|
|
and so on up to the heaviest animal (elephant), where 9/9 of the rows are less than or
|
|
equal to its weight.
|
|
</p>
|
|
|
|
<codeblock>create table animals (name string, kind string, kilos decimal(9,3));
|
|
insert into animals values
|
|
('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020),
|
|
('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5),
|
|
('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5);
|
|
|
|
select name, cume_dist() over (order by kilos) from animals;
|
|
+------------+-----------------------+
|
|
| name | cume_dist() OVER(...) |
|
|
+------------+-----------------------+
|
|
| Elephant | 1 |
|
|
| Giraffe | 0.8888888888888888 |
|
|
| Polar bear | 0.7777777777777778 |
|
|
| Horse | 0.6666666666666666 |
|
|
| Ostrich | 0.5555555555555556 |
|
|
| Condor | 0.4444444444444444 |
|
|
| Housecat | 0.3333333333333333 |
|
|
| Owl | 0.2222222222222222 |
|
|
| Mouse | 0.1111111111111111 |
|
|
+------------+-----------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Using a <codeph>PARTITION BY</codeph> clause produces a separate sequence for each partition
|
|
group, in this case one for mammals and one for birds. Because there are 3 birds and 6 mammals,
|
|
the sequence illustrates how 1/3 of the <q>Bird</q> rows have a <codeph>kilos</codeph> value that is less than or equal to
|
|
the lightest bird, 1/6 of the <q>Mammal</q> rows have a <codeph>kilos</codeph> value that is less than or equal to
|
|
the lightest mammal, and so on until both the heaviest bird and heaviest mammal have a <codeph>CUME_DIST()</codeph>
|
|
value of 1.
|
|
</p>
|
|
|
|
<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) from animals
|
|
+------------+--------+-----------------------+
|
|
| name | kind | cume_dist() OVER(...) |
|
|
+------------+--------+-----------------------+
|
|
| Ostrich | Bird | 1 |
|
|
| Condor | Bird | 0.6666666666666666 |
|
|
| Owl | Bird | 0.3333333333333333 |
|
|
| Elephant | Mammal | 1 |
|
|
| Giraffe | Mammal | 0.8333333333333334 |
|
|
| Polar bear | Mammal | 0.6666666666666666 |
|
|
| Horse | Mammal | 0.5 |
|
|
| Housecat | Mammal | 0.3333333333333333 |
|
|
| Mouse | Mammal | 0.1666666666666667 |
|
|
+------------+--------+-----------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
We can reverse the ordering within each partition group by using an <codeph>ORDER BY ... DESC</codeph>
|
|
clause within the <codeph>OVER()</codeph> clause. Now the lightest (smallest value of <codeph>kilos</codeph>)
|
|
animal of each kind has a <codeph>CUME_DIST()</codeph> value of 1.
|
|
</p>
|
|
|
|
<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos desc) from animals
|
|
+------------+--------+-----------------------+
|
|
| name | kind | cume_dist() OVER(...) |
|
|
+------------+--------+-----------------------+
|
|
| Owl | Bird | 1 |
|
|
| Condor | Bird | 0.6666666666666666 |
|
|
| Ostrich | Bird | 0.3333333333333333 |
|
|
| Mouse | Mammal | 1 |
|
|
| Housecat | Mammal | 0.8333333333333334 |
|
|
| Horse | Mammal | 0.6666666666666666 |
|
|
| Polar bear | Mammal | 0.5 |
|
|
| Giraffe | Mammal | 0.3333333333333333 |
|
|
| Elephant | Mammal | 0.1666666666666667 |
|
|
+------------+--------+-----------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example manufactures some rows with identical values in the <codeph>kilos</codeph> column,
|
|
to demonstrate how the results look in case of tie values. For simplicity, it only shows the <codeph>CUME_DIST()</codeph>
|
|
sequence for the <q>Bird</q> rows. Now with 3 rows all with a value of 15, all of those rows have the same
|
|
<codeph>CUME_DIST()</codeph> value. 4/5 of the rows have a value for <codeph>kilos</codeph> that is less than or
|
|
equal to 15.
|
|
</p>
|
|
|
|
<codeblock>insert into animals values ('California Condor', 'Bird', 15), ('Andean Condor', 'Bird', 15)
|
|
|
|
select name, kind, cume_dist() over (order by kilos) from animals where kind = 'Bird';
|
|
+-------------------+------+-----------------------+
|
|
| name | kind | cume_dist() OVER(...) |
|
|
+-------------------+------+-----------------------+
|
|
| Ostrich | Bird | 1 |
|
|
| Condor | Bird | 0.8 |
|
|
| California Condor | Bird | 0.8 |
|
|
| Andean Condor | Bird | 0.8 |
|
|
| Owl | Bird | 0.2 |
|
|
+-------------------+------+-----------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example shows how to use an <codeph>ORDER BY</codeph> clause in the outer block
|
|
to order the result set in case of ties. Here, all the <q>Bird</q> rows are together, then in descending order
|
|
by the result of the <codeph>CUME_DIST()</codeph> function, and all tied <codeph>CUME_DIST()</codeph>
|
|
values are ordered by the animal name.
|
|
</p>
|
|
|
|
<codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) as ordering
|
|
from animals
|
|
where
|
|
kind = 'Bird'
|
|
order by kind, ordering desc, name;
|
|
+-------------------+------+----------+
|
|
| name | kind | ordering |
|
|
+-------------------+------+----------+
|
|
| Ostrich | Bird | 1 |
|
|
| Andean Condor | Bird | 0.8 |
|
|
| California Condor | Bird | 0.8 |
|
|
| Condor | Bird | 0.8 |
|
|
| Owl | Bird | 0.2 |
|
|
+-------------------+------+----------+
|
|
</codeblock>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.0.0" id="dense_rank">
|
|
|
|
<title>DENSE_RANK Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers
|
|
for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values
|
|
for the <q>tied</q> input values, the function continues the sequence with the next higher integer.
|
|
Therefore, the sequence contains duplicates but no gaps when the input contains duplicates. Starts the
|
|
sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>DENSE_RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>
|
|
|
|
<p>
|
|
The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<!-- Can make the text for ROW_NUMBER, RANK, and DENSE_RANK identical
|
|
so it can be conref'ed in all 3 places. -->
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including
|
|
all the items with the 10 highest values, even if several items tied for 1st place.
|
|
</p>
|
|
|
|
<p>
|
|
Similar to <codeph>ROW_NUMBER</codeph> and <codeph>RANK</codeph>. These functions differ in how they treat
|
|
duplicate combinations of values.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example demonstrates how the <codeph>DENSE_RANK()</codeph> function identifies where each
|
|
value <q>places</q> in the result set, producing the same result for duplicate values, but with a strict
|
|
sequence from 1 to the number of groups. For example, when results are ordered by the <codeph>X</codeph>
|
|
column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph> values are tied for
|
|
second; and so on.
|
|
</p>
|
|
|
|
<codeblock>select x, dense_rank() over(order by x) as rank, property from int_t;
|
|
+----+------+----------+
|
|
| x | rank | property |
|
|
+----+------+----------+
|
|
| 1 | 1 | square |
|
|
| 1 | 1 | odd |
|
|
| 2 | 2 | even |
|
|
| 2 | 2 | prime |
|
|
| 3 | 3 | prime |
|
|
| 3 | 3 | odd |
|
|
| 4 | 4 | even |
|
|
| 4 | 4 | square |
|
|
| 5 | 5 | odd |
|
|
| 5 | 5 | prime |
|
|
| 6 | 6 | even |
|
|
| 6 | 6 | perfect |
|
|
| 7 | 7 | lucky |
|
|
| 7 | 7 | lucky |
|
|
| 7 | 7 | lucky |
|
|
| 7 | 7 | odd |
|
|
| 7 | 7 | prime |
|
|
| 8 | 8 | even |
|
|
| 9 | 9 | square |
|
|
| 9 | 9 | odd |
|
|
| 10 | 10 | round |
|
|
| 10 | 10 | even |
|
|
+----+------+----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following examples show how the <codeph>DENSE_RANK()</codeph> function is affected by the
|
|
<codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause.
|
|
</p>
|
|
|
|
<p>
|
|
Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together,
|
|
and <codeph>DENSE_RANK()</codeph> returns the place of each value within the group, producing several
|
|
ascending sequences.
|
|
</p>
|
|
|
|
<codeblock>select x, dense_rank() over(partition by property order by x) as rank, property from int_t;
|
|
+----+------+----------+
|
|
| x | rank | property |
|
|
+----+------+----------+
|
|
| 2 | 1 | even |
|
|
| 4 | 2 | even |
|
|
| 6 | 3 | even |
|
|
| 8 | 4 | even |
|
|
| 10 | 5 | even |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 1 | 1 | odd |
|
|
| 3 | 2 | odd |
|
|
| 5 | 3 | odd |
|
|
| 7 | 4 | odd |
|
|
| 9 | 5 | odd |
|
|
| 6 | 1 | perfect |
|
|
| 2 | 1 | prime |
|
|
| 3 | 2 | prime |
|
|
| 5 | 3 | prime |
|
|
| 7 | 4 | prime |
|
|
| 10 | 1 | round |
|
|
| 1 | 1 | square |
|
|
| 4 | 2 | square |
|
|
| 9 | 3 | square |
|
|
+----+------+----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the
|
|
place each each value within the group; because each value occurs only 1 or 2 times,
|
|
<codeph>DENSE_RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its
|
|
group.
|
|
</p>
|
|
|
|
<codeblock>select x, dense_rank() over(partition by x order by property) as rank, property from int_t;
|
|
+----+------+----------+
|
|
| x | rank | property |
|
|
+----+------+----------+
|
|
| 1 | 1 | odd |
|
|
| 1 | 2 | square |
|
|
| 2 | 1 | even |
|
|
| 2 | 2 | prime |
|
|
| 3 | 1 | odd |
|
|
| 3 | 2 | prime |
|
|
| 4 | 1 | even |
|
|
| 4 | 2 | square |
|
|
| 5 | 1 | odd |
|
|
| 5 | 2 | prime |
|
|
| 6 | 1 | even |
|
|
| 6 | 2 | perfect |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 2 | odd |
|
|
| 7 | 3 | prime |
|
|
| 8 | 1 | even |
|
|
| 9 | 1 | odd |
|
|
| 9 | 2 | square |
|
|
| 10 | 1 | even |
|
|
| 10 | 2 | round |
|
|
+----+------+----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example shows how <codeph>DENSE_RANK()</codeph> produces a continuous sequence while still
|
|
allowing for ties. In this case, Croesus and Midas both have the second largest fortune, while Crassus has
|
|
the third largest. (In <xref href="impala_analytic_functions.xml#rank"/>, you see a similar query with the
|
|
<codeph>RANK()</codeph> function that shows that while Crassus has the third largest fortune, he is the
|
|
fourth richest person.)
|
|
</p>
|
|
|
|
<codeblock>select dense_rank() over (order by net_worth desc) as placement, name, net_worth from wealth order by placement, name;
|
|
+-----------+---------+---------------+
|
|
| placement | name | net_worth |
|
|
+-----------+---------+---------------+
|
|
| 1 | Solomon | 2000000000.00 |
|
|
| 2 | Croesus | 1000000000.00 |
|
|
| 2 | Midas | 1000000000.00 |
|
|
| 3 | Crassus | 500000000.00 |
|
|
| 4 | Scrooge | 80000000.00 |
|
|
+-----------+---------+---------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
<xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#row_number"/>
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.0.0" id="first_value">
|
|
|
|
<title>FIRST_VALUE Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Returns the expression value from the first row in the window. The return value is <codeph>NULL</codeph> if
|
|
the input expression is <codeph>NULL</codeph>.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>FIRST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock>
|
|
|
|
<p>
|
|
The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
|
|
window clause is optional.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result
|
|
of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to
|
|
ensure consistent ordering.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example shows a table with a wide variety of country-appropriate greetings. For consistency,
|
|
we want to standardize on a single greeting for each country. The <codeph>FIRST_VALUE()</codeph> function
|
|
helps to produce a mail merge report where every person from the same country is addressed with the same
|
|
greeting.
|
|
</p>
|
|
|
|
<codeblock>select name, country, greeting from mail_merge
|
|
+---------+---------+--------------+
|
|
| name | country | greeting |
|
|
+---------+---------+--------------+
|
|
| Pete | USA | Hello |
|
|
| John | USA | Hi |
|
|
| Boris | Germany | Guten tag |
|
|
| Michael | Germany | Guten morgen |
|
|
| Bjorn | Sweden | Hej |
|
|
| Mats | Sweden | Tja |
|
|
+---------+---------+--------------+
|
|
|
|
select country, name,
|
|
first_value(greeting)
|
|
over (partition by country order by name, greeting) as greeting
|
|
from mail_merge;
|
|
+---------+---------+-----------+
|
|
| country | name | greeting |
|
|
+---------+---------+-----------+
|
|
| Germany | Boris | Guten tag |
|
|
| Germany | Michael | Guten tag |
|
|
| Sweden | Bjorn | Hej |
|
|
| Sweden | Mats | Hej |
|
|
| USA | John | Hi |
|
|
| USA | Pete | Hi |
|
|
+---------+---------+-----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Changing the order in which the names are evaluated changes which greeting is applied to each group.
|
|
</p>
|
|
|
|
<codeblock>select country, name,
|
|
first_value(greeting)
|
|
over (partition by country order by name desc, greeting) as greeting
|
|
from mail_merge;
|
|
+---------+---------+--------------+
|
|
| country | name | greeting |
|
|
+---------+---------+--------------+
|
|
| Germany | Michael | Guten morgen |
|
|
| Germany | Boris | Guten morgen |
|
|
| Sweden | Mats | Tja |
|
|
| Sweden | Bjorn | Tja |
|
|
| USA | Pete | Hello |
|
|
| USA | John | Hello |
|
|
+---------+---------+--------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
<xref href="impala_analytic_functions.xml#last_value"/>
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.0.0" id="lag">
|
|
|
|
<title>LAG Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This function returns the value of an expression using column values from a preceding row. You specify an
|
|
integer offset, which designates a row position some number of rows previous to the current row. Any column
|
|
references in the expression argument refer to column values from that prior row. Typically, the table
|
|
contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>LAG (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>])
|
|
OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>
|
|
|
|
<p>
|
|
The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
Sometimes used an an alternative to doing a self-join.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example uses the same stock data created in <xref href="#window_clause"/>. For each day, the
|
|
query prints the closing price alongside the previous day's closing price. The first row for each stock
|
|
symbol has no previous row, so that <codeph>LAG()</codeph> value is <codeph>NULL</codeph>.
|
|
</p>
|
|
|
|
<codeblock>select stock_symbol, closing_date, closing_price,
|
|
lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing"
|
|
from stock_ticker
|
|
order by closing_date;
|
|
+--------------+---------------------+---------------+-------------------+
|
|
| stock_symbol | closing_date | closing_price | yesterday closing |
|
|
+--------------+---------------------+---------------+-------------------+
|
|
| JDR | 2014-09-13 00:00:00 | 12.86 | NULL |
|
|
| JDR | 2014-09-14 00:00:00 | 12.89 | 12.86 |
|
|
| JDR | 2014-09-15 00:00:00 | 12.94 | 12.89 |
|
|
| JDR | 2014-09-16 00:00:00 | 12.55 | 12.94 |
|
|
| JDR | 2014-09-17 00:00:00 | 14.03 | 12.55 |
|
|
| JDR | 2014-09-18 00:00:00 | 14.75 | 14.03 |
|
|
| JDR | 2014-09-19 00:00:00 | 13.98 | 14.75 |
|
|
+--------------+---------------------+---------------+-------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example does an arithmetic operation between the current row and a value from the previous
|
|
row, to produce a delta value for each day. This example also demonstrates how <codeph>ORDER BY</codeph>
|
|
works independently in the different parts of the query. The <codeph>ORDER BY closing_date</codeph> in the
|
|
<codeph>OVER</codeph> clause makes the query analyze the rows in chronological order. Then the outer query
|
|
block uses <codeph>ORDER BY closing_date DESC</codeph> to present the results with the most recent date
|
|
first.
|
|
</p>
|
|
|
|
<codeblock>select stock_symbol, closing_date, closing_price,
|
|
cast(
|
|
closing_price - lag(closing_price,1) over
|
|
(partition by stock_symbol order by closing_date)
|
|
as decimal(8,2)
|
|
)
|
|
as "change from yesterday"
|
|
from stock_ticker
|
|
order by closing_date desc;
|
|
+--------------+---------------------+---------------+-----------------------+
|
|
| stock_symbol | closing_date | closing_price | change from yesterday |
|
|
+--------------+---------------------+---------------+-----------------------+
|
|
| JDR | 2014-09-19 00:00:00 | 13.98 | -0.76 |
|
|
| JDR | 2014-09-18 00:00:00 | 14.75 | 0.72 |
|
|
| JDR | 2014-09-17 00:00:00 | 14.03 | 1.47 |
|
|
| JDR | 2014-09-16 00:00:00 | 12.55 | -0.38 |
|
|
| JDR | 2014-09-15 00:00:00 | 12.94 | 0.04 |
|
|
| JDR | 2014-09-14 00:00:00 | 12.89 | 0.03 |
|
|
| JDR | 2014-09-13 00:00:00 | 12.86 | NULL |
|
|
+--------------+---------------------+---------------+-----------------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
This function is the converse of <xref href="impala_analytic_functions.xml#lead"/>.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.0.0" id="last_value">
|
|
|
|
<title>LAST_VALUE Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Returns the expression value from the last row in the window. This same value is repeated for all result
|
|
rows for the group. The return value is <codeph>NULL</codeph> if the input expression is
|
|
<codeph>NULL</codeph>.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>LAST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock>
|
|
|
|
<p>
|
|
The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
|
|
window clause is optional.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result
|
|
of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to
|
|
ensure consistent ordering.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example uses the same <codeph>MAIL_MERGE</codeph> table as in the example for
|
|
<xref href="impala_analytic_functions.xml#first_value"/>. Because the default window when <codeph>ORDER
|
|
BY</codeph> is used is <codeph>BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, the query requires the
|
|
<codeph>UNBOUNDED FOLLOWING</codeph> to look ahead to subsequent rows and find the last value for each
|
|
country.
|
|
</p>
|
|
|
|
<codeblock>select country, name,
|
|
last_value(greeting) over (
|
|
partition by country order by name, greeting
|
|
rows between unbounded preceding and unbounded following
|
|
) as greeting
|
|
from mail_merge
|
|
+---------+---------+--------------+
|
|
| country | name | greeting |
|
|
+---------+---------+--------------+
|
|
| Germany | Boris | Guten morgen |
|
|
| Germany | Michael | Guten morgen |
|
|
| Sweden | Bjorn | Tja |
|
|
| Sweden | Mats | Tja |
|
|
| USA | John | Hello |
|
|
| USA | Pete | Hello |
|
|
+---------+---------+--------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
<xref href="impala_analytic_functions.xml#first_value"/>
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.0.0" id="lead">
|
|
|
|
<title>LEAD Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
This function returns the value of an expression using column values from a following row. You specify an
|
|
integer offset, which designates a row position some number of rows after to the current row. Any column
|
|
references in the expression argument refer to column values from that later row. Typically, the table
|
|
contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>LEAD (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>])
|
|
OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>
|
|
|
|
<p>
|
|
The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
Sometimes used an an alternative to doing a self-join.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example uses the same stock data created in <xref href="#window_clause"/>. The query analyzes
|
|
the closing price for a stock symbol, and for each day evaluates if the closing price for the following day
|
|
is higher or lower.
|
|
</p>
|
|
|
|
<codeblock>select stock_symbol, closing_date, closing_price,
|
|
case
|
|
(lead(closing_price,1)
|
|
over (partition by stock_symbol order by closing_date)
|
|
- closing_price) > 0
|
|
when true then "higher"
|
|
when false then "flat or lower"
|
|
end as "trending"
|
|
from stock_ticker
|
|
order by closing_date;
|
|
+--------------+---------------------+---------------+---------------+
|
|
| stock_symbol | closing_date | closing_price | trending |
|
|
+--------------+---------------------+---------------+---------------+
|
|
| JDR | 2014-09-13 00:00:00 | 12.86 | higher |
|
|
| JDR | 2014-09-14 00:00:00 | 12.89 | higher |
|
|
| JDR | 2014-09-15 00:00:00 | 12.94 | flat or lower |
|
|
| JDR | 2014-09-16 00:00:00 | 12.55 | higher |
|
|
| JDR | 2014-09-17 00:00:00 | 14.03 | higher |
|
|
| JDR | 2014-09-18 00:00:00 | 14.75 | flat or lower |
|
|
| JDR | 2014-09-19 00:00:00 | 13.98 | NULL |
|
|
+--------------+---------------------+---------------+---------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
This function is the converse of <xref href="impala_analytic_functions.xml#lag"/>.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="max_analytic">
|
|
|
|
<title>MAX Function - Analytic Context</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
|
|
function. See <xref href="impala_max.xml#max"/> for details and examples.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="min_analytic">
|
|
|
|
<title>MIN Function - Analytic Context</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
|
|
function. See <xref href="impala_min.xml#min"/> for details and examples.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept audience="hidden" rev="2.x.x" id="nth_value">
|
|
|
|
<title>NTH_VALUE Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.3.0" id="ntile">
|
|
|
|
<title>NTILE Function (<keyword keyref="impala23"/> or higher only)</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Returns the <q>bucket number</q> associated with each row, between 1 and the value of an expression. For
|
|
example, creating 100 buckets puts the lowest 1% of values in the first bucket, while creating 10 buckets
|
|
puts the lowest 10% of values in the first bucket. Each partition can have a different number of buckets.
|
|
<!-- What's the syntax or data distribution that would create a different number of buckets per partition? -->
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>NTILE (<varname>expr</varname> [, <varname>offset</varname> ...]
|
|
OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>
|
|
|
|
<p>
|
|
The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
The <q>ntile</q> name is derived from the practice of dividing result sets into fourths (quartile), tenths
|
|
(decile), and so on. The <codeph>NTILE()</codeph> function divides the result set based on an arbitrary
|
|
percentile value.
|
|
</p>
|
|
|
|
<p>
|
|
The number of buckets must be a positive integer.
|
|
</p>
|
|
|
|
<p>
|
|
The number of items in each bucket is identical or almost so, varying by at most 1. If the number of items
|
|
does not divide evenly between the buckets, the remaining N items are divided evenly among the first N
|
|
buckets.
|
|
</p>
|
|
|
|
<p>
|
|
If the number of buckets N is greater than the number of input rows in the partition, then the first N
|
|
buckets each contain one item, and the remaining buckets are empty.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example shows divides groups of animals into 4 buckets based on their weight. The
|
|
<codeph>ORDER BY ... DESC</codeph> clause in the <codeph>OVER()</codeph> clause means that the heaviest 25%
|
|
are in the first group, and the lightest 25% are in the fourth group. (The <codeph>ORDER BY</codeph> in the
|
|
outermost part of the query shows how you can order the final result set independently from the order in
|
|
which the rows are evaluated by the <codeph>OVER()</codeph> clause.) Because there are 9 rows in the group,
|
|
divided into 4 buckets, the first bucket receives the extra item.
|
|
</p>
|
|
|
|
<codeblock>create table animals (name string, kind string, kilos decimal(9,3));
|
|
|
|
insert into animals values
|
|
('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020),
|
|
('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5),
|
|
('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5);
|
|
|
|
select name, ntile(4) over (order by kilos desc) as quarter
|
|
from animals
|
|
order by quarter desc;
|
|
+------------+---------+
|
|
| name | quarter |
|
|
+------------+---------+
|
|
| Owl | 4 |
|
|
| Mouse | 4 |
|
|
| Condor | 3 |
|
|
| Housecat | 3 |
|
|
| Horse | 2 |
|
|
| Ostrich | 2 |
|
|
| Elephant | 1 |
|
|
| Giraffe | 1 |
|
|
| Polar bear | 1 |
|
|
+------------+---------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following examples show how the <codeph>PARTITION</codeph> clause works for the
|
|
<codeph>NTILE()</codeph> function. Here, we divide each kind of animal (mammal or bird) into 2 buckets,
|
|
the heavier half and the lighter half.
|
|
</p>
|
|
|
|
<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half
|
|
from animals
|
|
order by kind;
|
|
+------------+--------+------+
|
|
| name | kind | half |
|
|
+------------+--------+------+
|
|
| Ostrich | Bird | 1 |
|
|
| Condor | Bird | 1 |
|
|
| Owl | Bird | 2 |
|
|
| Elephant | Mammal | 1 |
|
|
| Giraffe | Mammal | 1 |
|
|
| Polar bear | Mammal | 1 |
|
|
| Horse | Mammal | 2 |
|
|
| Housecat | Mammal | 2 |
|
|
| Mouse | Mammal | 2 |
|
|
+------------+--------+------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Again, the result set can be ordered independently
|
|
from the analytic evaluation. This next example lists all the animals heaviest to lightest,
|
|
showing that elephant and giraffe are in the <q>top half</q> of mammals by weight, while
|
|
housecat and mouse are in the <q>bottom half</q>.
|
|
</p>
|
|
|
|
<codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half
|
|
from animals
|
|
order by kilos desc;
|
|
+------------+--------+------+
|
|
| name | kind | half |
|
|
+------------+--------+------+
|
|
| Elephant | Mammal | 1 |
|
|
| Giraffe | Mammal | 1 |
|
|
| Polar bear | Mammal | 1 |
|
|
| Horse | Mammal | 2 |
|
|
| Ostrich | Bird | 1 |
|
|
| Condor | Bird | 1 |
|
|
| Housecat | Mammal | 2 |
|
|
| Owl | Bird | 2 |
|
|
| Mouse | Mammal | 2 |
|
|
+------------+--------+------+
|
|
</codeblock>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.3.0" id="percent_rank">
|
|
|
|
<title>PERCENT_RANK Function (<keyword keyref="impala23"/> or higher only)</title>
|
|
|
|
<conbody>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>PERCENT_RANK (<varname>expr</varname>)
|
|
OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)
|
|
</codeblock>
|
|
|
|
<p>
|
|
Calculates the rank, expressed as a percentage, of each row within a group of rows.
|
|
If <codeph>rank</codeph> is the value for that same row from the <codeph>RANK()</codeph> function (from 1 to the total number of rows in the partition group),
|
|
then the <codeph>PERCENT_RANK()</codeph> value is calculated as <codeph>(<varname>rank</varname> - 1) / (<varname>rows_in_group</varname> - 1)</codeph> .
|
|
If there is only a single item in the partition group, its <codeph>PERCENT_RANK()</codeph> value is 0.
|
|
</p>
|
|
|
|
<p>
|
|
The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
This function is similar to the <codeph>RANK</codeph> and <codeph>CUME_DIST()</codeph> functions: it returns an ascending sequence representing the position of each
|
|
row within the rows of the same partition group. The actual numeric sequence is calculated differently,
|
|
and the handling of duplicate (tied) values is different.
|
|
</p>
|
|
|
|
<p>
|
|
The return values range from 0 to 1 inclusive.
|
|
The first row in each partition group always has the value 0.
|
|
A <codeph>NULL</codeph> value is considered the lowest possible value.
|
|
In the case of duplicate input values, all the corresponding rows in the result set
|
|
have an identical value: the lowest <codeph>PERCENT_RANK()</codeph> value of those
|
|
tied rows. (In contrast to <codeph>CUME_DIST()</codeph>, where all tied rows have
|
|
the highest <codeph>CUME_DIST()</codeph> value.)
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example uses the same <codeph>ANIMALS</codeph> table as the examples for <codeph>CUME_DIST()</codeph>
|
|
and <codeph>NTILE()</codeph>, with a few additional rows to illustrate the results where some values are
|
|
<codeph>NULL</codeph> or there is only a single row in a partition group.
|
|
</p>
|
|
|
|
<codeblock>insert into animals values ('Komodo dragon', 'Reptile', 70);
|
|
insert into animals values ('Unicorn', 'Mythical', NULL);
|
|
insert into animals values ('Fire-breathing dragon', 'Mythical', NULL);
|
|
</codeblock>
|
|
|
|
<p>
|
|
As with <codeph>CUME_DIST()</codeph>, there is an ascending sequence for each kind of animal.
|
|
For example, the <q>Birds</q> and <q>Mammals</q> rows each have a <codeph>PERCENT_RANK()</codeph> sequence
|
|
that ranges from 0 to 1.
|
|
The <q>Reptile</q> row has a <codeph>PERCENT_RANK()</codeph> of 0 because that partition group contains only a single item.
|
|
Both <q>Mythical</q> animals have a <codeph>PERCENT_RANK()</codeph> of 0 because
|
|
a <codeph>NULL</codeph> is considered the lowest value within its partition group.
|
|
</p>
|
|
|
|
<codeblock>select name, kind, percent_rank() over (partition by kind order by kilos) from animals;
|
|
+-----------------------+----------+--------------------------+
|
|
| name | kind | percent_rank() OVER(...) |
|
|
+-----------------------+----------+--------------------------+
|
|
| Mouse | Mammal | 0 |
|
|
| Housecat | Mammal | 0.2 |
|
|
| Horse | Mammal | 0.4 |
|
|
| Polar bear | Mammal | 0.6 |
|
|
| Giraffe | Mammal | 0.8 |
|
|
| Elephant | Mammal | 1 |
|
|
| Komodo dragon | Reptile | 0 |
|
|
| Owl | Bird | 0 |
|
|
| California Condor | Bird | 0.25 |
|
|
| Andean Condor | Bird | 0.25 |
|
|
| Condor | Bird | 0.25 |
|
|
| Ostrich | Bird | 1 |
|
|
| Fire-breathing dragon | Mythical | 0 |
|
|
| Unicorn | Mythical | 0 |
|
|
+-----------------------+----------+--------------------------+
|
|
</codeblock>
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.0.0" id="rank">
|
|
|
|
<title>RANK Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers
|
|
for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values
|
|
for the <q>tied</q> input values, the function increments the sequence by the number of tied values.
|
|
Therefore, the sequence contains both duplicates and gaps when the input contains duplicates. Starts the
|
|
sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>
|
|
|
|
<p>
|
|
The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<!-- Make a little tutorial to show these 3 functions side-by-side and illustrate their difference. -->
|
|
|
|
<p>
|
|
Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including
|
|
several items that were tied for 10th place.
|
|
</p>
|
|
|
|
<p>
|
|
Similar to <codeph>ROW_NUMBER</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they
|
|
treat duplicate combinations of values.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example demonstrates how the <codeph>RANK()</codeph> function identifies where each value
|
|
<q>places</q> in the result set, producing the same result for duplicate values, and skipping values in the
|
|
sequence to account for the number of duplicates. For example, when results are ordered by the
|
|
<codeph>X</codeph> column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph>
|
|
values are tied for third; and so on.
|
|
</p>
|
|
|
|
<codeblock>select x, rank() over(order by x) as rank, property from int_t;
|
|
+----+------+----------+
|
|
| x | rank | property |
|
|
+----+------+----------+
|
|
| 1 | 1 | square |
|
|
| 1 | 1 | odd |
|
|
| 2 | 3 | even |
|
|
| 2 | 3 | prime |
|
|
| 3 | 5 | prime |
|
|
| 3 | 5 | odd |
|
|
| 4 | 7 | even |
|
|
| 4 | 7 | square |
|
|
| 5 | 9 | odd |
|
|
| 5 | 9 | prime |
|
|
| 6 | 11 | even |
|
|
| 6 | 11 | perfect |
|
|
| 7 | 13 | lucky |
|
|
| 7 | 13 | lucky |
|
|
| 7 | 13 | lucky |
|
|
| 7 | 13 | odd |
|
|
| 7 | 13 | prime |
|
|
| 8 | 18 | even |
|
|
| 9 | 19 | square |
|
|
| 9 | 19 | odd |
|
|
| 10 | 21 | round |
|
|
| 10 | 21 | even |
|
|
+----+------+----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following examples show how the <codeph>RANK()</codeph> function is affected by the
|
|
<codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause.
|
|
</p>
|
|
|
|
<p>
|
|
Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together,
|
|
and <codeph>RANK()</codeph> returns the place of each value within the group, producing several ascending
|
|
sequences.
|
|
</p>
|
|
|
|
<codeblock>select x, rank() over(partition by property order by x) as rank, property from int_t;
|
|
+----+------+----------+
|
|
| x | rank | property |
|
|
+----+------+----------+
|
|
| 2 | 1 | even |
|
|
| 4 | 2 | even |
|
|
| 6 | 3 | even |
|
|
| 8 | 4 | even |
|
|
| 10 | 5 | even |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 1 | 1 | odd |
|
|
| 3 | 2 | odd |
|
|
| 5 | 3 | odd |
|
|
| 7 | 4 | odd |
|
|
| 9 | 5 | odd |
|
|
| 6 | 1 | perfect |
|
|
| 2 | 1 | prime |
|
|
| 3 | 2 | prime |
|
|
| 5 | 3 | prime |
|
|
| 7 | 4 | prime |
|
|
| 10 | 1 | round |
|
|
| 1 | 1 | square |
|
|
| 4 | 2 | square |
|
|
| 9 | 3 | square |
|
|
+----+------+----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the
|
|
place each each value within the group; because each value occurs only 1 or 2 times,
|
|
<codeph>RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its
|
|
group.
|
|
</p>
|
|
|
|
<codeblock>select x, rank() over(partition by x order by property) as rank, property from int_t;
|
|
+----+------+----------+
|
|
| x | rank | property |
|
|
+----+------+----------+
|
|
| 1 | 1 | odd |
|
|
| 1 | 2 | square |
|
|
| 2 | 1 | even |
|
|
| 2 | 2 | prime |
|
|
| 3 | 1 | odd |
|
|
| 3 | 2 | prime |
|
|
| 4 | 1 | even |
|
|
| 4 | 2 | square |
|
|
| 5 | 1 | odd |
|
|
| 5 | 2 | prime |
|
|
| 6 | 1 | even |
|
|
| 6 | 2 | perfect |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 1 | lucky |
|
|
| 7 | 4 | odd |
|
|
| 7 | 5 | prime |
|
|
| 8 | 1 | even |
|
|
| 9 | 1 | odd |
|
|
| 9 | 2 | square |
|
|
| 10 | 1 | even |
|
|
| 10 | 2 | round |
|
|
+----+------+----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example shows how a magazine might prepare a list of history's wealthiest people. Croesus and
|
|
Midas are tied for second, then Crassus is fourth.
|
|
</p>
|
|
|
|
<codeblock>select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name;
|
|
+------+---------+---------------+
|
|
| rank | name | net_worth |
|
|
+------+---------+---------------+
|
|
| 1 | Solomon | 2000000000.00 |
|
|
| 2 | Croesus | 1000000000.00 |
|
|
| 2 | Midas | 1000000000.00 |
|
|
| 4 | Crassus | 500000000.00 |
|
|
| 5 | Scrooge | 80000000.00 |
|
|
+------+---------+---------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
<xref href="impala_analytic_functions.xml#dense_rank"/>,
|
|
<xref href="impala_analytic_functions.xml#row_number"/>
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept rev="2.0.0" id="row_number">
|
|
|
|
<title>ROW_NUMBER Function</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Returns an ascending sequence of integers, starting with 1. Starts the sequence over for each group
|
|
produced by the <codeph>PARTITIONED BY</codeph> clause. The output sequence includes different values for
|
|
duplicate input values. Therefore, the sequence never contains any duplicates or gaps, regardless of
|
|
duplicate input values.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>ROW_NUMBER() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock>
|
|
|
|
<p>
|
|
The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The
|
|
window clause is not allowed.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
Often used for top-N and bottom-N queries where the input values are known to be unique, or precisely N
|
|
rows are needed regardless of duplicate values.
|
|
</p>
|
|
|
|
<p>
|
|
Because its result value is different for each row in the result set (when used without a <codeph>PARTITION
|
|
BY</codeph> clause), <codeph>ROW_NUMBER()</codeph> can be used to synthesize unique numeric ID values, for
|
|
example for result sets involving unique values or tuples.
|
|
</p>
|
|
|
|
<p>
|
|
Similar to <codeph>RANK</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they treat
|
|
duplicate combinations of values.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_20"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following example demonstrates how <codeph>ROW_NUMBER()</codeph> produces a continuous numeric
|
|
sequence, even though some values of <codeph>X</codeph> are repeated.
|
|
</p>
|
|
|
|
<codeblock>select x, row_number() over(order by x, property) as row_number, property from int_t;
|
|
+----+------------+----------+
|
|
| x | row_number | property |
|
|
+----+------------+----------+
|
|
| 1 | 1 | odd |
|
|
| 1 | 2 | square |
|
|
| 2 | 3 | even |
|
|
| 2 | 4 | prime |
|
|
| 3 | 5 | odd |
|
|
| 3 | 6 | prime |
|
|
| 4 | 7 | even |
|
|
| 4 | 8 | square |
|
|
| 5 | 9 | odd |
|
|
| 5 | 10 | prime |
|
|
| 6 | 11 | even |
|
|
| 6 | 12 | perfect |
|
|
| 7 | 13 | lucky |
|
|
| 7 | 14 | lucky |
|
|
| 7 | 15 | lucky |
|
|
| 7 | 16 | odd |
|
|
| 7 | 17 | prime |
|
|
| 8 | 18 | even |
|
|
| 9 | 19 | odd |
|
|
| 9 | 20 | square |
|
|
| 10 | 21 | even |
|
|
| 10 | 22 | round |
|
|
+----+------------+----------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following example shows how a financial institution might assign customer IDs to some of history's
|
|
wealthiest figures. Although two of the people have identical net worth figures, unique IDs are required
|
|
for this purpose. <codeph>ROW_NUMBER()</codeph> produces a sequence of five different values for the five
|
|
input rows.
|
|
</p>
|
|
|
|
<codeblock>select row_number() over (order by net_worth desc) as account_id, name, net_worth
|
|
from wealth order by account_id, name;
|
|
+------------+---------+---------------+
|
|
| account_id | name | net_worth |
|
|
+------------+---------+---------------+
|
|
| 1 | Solomon | 2000000000.00 |
|
|
| 2 | Croesus | 1000000000.00 |
|
|
| 3 | Midas | 1000000000.00 |
|
|
| 4 | Crassus | 500000000.00 |
|
|
| 5 | Scrooge | 80000000.00 |
|
|
+------------+---------+---------------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
<xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#dense_rank"/>
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
<concept id="sum_analytic">
|
|
|
|
<title>SUM Function - Analytic Context</title>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic
|
|
function. See <xref href="impala_sum.xml#sum"/> for details and examples.
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|
|
|
|
</concept>
|