Some functions, such as
Although analytic functions often compute the same value you would see from an aggregate function in a
Analytic function calls are only allowed in the
The rows that are part of each partition are analyzed by computations across an ordered or unordered set of
rows. For example,
Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and
bucketed analysis for large data sets. You might also see the term "window functions" in database
literature, referring to the sequence of rows (the "window") that the function call applies to,
particularly when the
The following sections describe the analytic query clauses and the pure analytic functions provided by
Impala. For usage information about aggregate functions in an analytic context, see
The
PARTITION BY clause:
The partitions
refers to the groups produced by the
The sequence of results from an analytic function resets
for each new partition in the result set.
That is, the set of preceding or following rows considered by the analytic function always comes from a
single partition. Any
ORDER BY clause:
The
When the
The order in which the rows are analyzed is only defined for those columns specified in
One difference between the analytic and outer uses of the
Window clause:
The window clause is only allowed in combination with an
Because HBase tables are optimized for single-row lookups rather than full scans, analytic functions using
the
Analytic functions are very efficient for Parquet tables. The data that is examined during evaluation of
the
Analytic functions are convenient to use with text tables for exploratory business intelligence. When the volume of data is substantial, prefer Parquet tables for performance-critical analytic queries.
The following example shows how to synthesize a numeric sequence corresponding to all the rows in a table.
The new table has the same columns as the old one, plus an additional column
The following example shows how to determine the number of rows containing each value for a column. Unlike
a corresponding
You cannot directly combine the
Certain analytic functions accept an optional window clause, which makes the function analyze only certain rows around
the current row rather than all rows in the partition. For example, you can
get a moving average by specifying some number of preceding and following rows, or a running count or
running total by specifying all rows up to the current position. This clause can result in different
analytic results for rows within the same partition.
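The moving average and running total described above can be sketched as follows. This is a minimal illustration rather than Impala output: it runs the SQL through Python's built-in `sqlite3` module (which needs SQLite 3.25 or later for window functions), and the `prices` table, its columns, and the figures are hypothetical.

```python
import sqlite3

# Hypothetical closing prices, invented for illustration only.
rows = [
    ("2014-10-01", 10.0),
    ("2014-10-02", 20.0),
    ("2014-10-03", 30.0),
    ("2014-10-04", 40.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (day TEXT, closing REAL)")
conn.executemany("INSERT INTO prices VALUES (?, ?)", rows)

query = """
SELECT day,
       closing,
       -- Moving average over the previous, current, and next row.
       AVG(closing) OVER (ORDER BY day
                          ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS moving_avg,
       -- Running total over all rows up to the current position.
       SUM(closing) OVER (ORDER BY day
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM prices
ORDER BY day
"""
window_results = conn.execute(query).fetchall()
for row in window_results:
    print(row)
```

At the first and last rows the moving-average window contains only two rows, which is why rows in the same partition can see different sets of input values.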
The window clause is supported with the
Currently, Impala supports only some combinations of arguments to the
When
The following examples show financial data for a fictional stock symbol
The queries use analytic functions with window clauses to compute moving averages of the closing price. For
example,
The clause
You can include an
You can include an
Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 and less than or equal to 1.
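As a small sketch of the value range described above, assuming the function in question is `CUME_DIST()`: the SQL below runs in SQLite through Python's `sqlite3` module as a runnable stand-in, and the table `t` and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
# Hypothetical input with one pair of tied values.
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (2,), (4,)])

# For each row: (number of rows with x <= current x) / (total rows).
cume_results = conn.execute(
    "SELECT x, CUME_DIST() OVER (ORDER BY x) AS cd FROM t ORDER BY x"
).fetchall()
for row in cume_results:
    print(row)
```

The tied rows share one value, and the last row in the ordering always reaches 1.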
The
Within each partition of the result set, the
If the sequence of input values contains ties, the
Impala only supports the
This example uses a table with 9 rows. The
Using a Bird
rows have a Mammal
rows have a
We can reverse the ordering within each partition group by using an
The following example manufactures some rows with identical values in the Bird
rows. Now with 3 rows all with a value of 15, all of those rows have the same
The following example shows how to use an Bird
rows are together, then in descending order
by the result of the
Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers
for duplicate values of the ordering expressions. After generating duplicate output values for the tied
input values, the function continues the sequence with the next higher integer.
Therefore, the sequence contains duplicates but no gaps when the input contains duplicates. Starts the
sequence over for each group produced by the
The
Often used for top-N and bottom-N queries. For example, it could produce a "top 10" report including
all the items with the 10 highest values, even if several items tied for 1st place.
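A top-N report of this kind might look like the sketch below, assuming the function described is `DENSE_RANK()`. It uses a "top 2" cutoff to stay short, runs in SQLite via Python's `sqlite3` module as a stand-in, and the `scores` table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
# Hypothetical scores; "a" and "b" tie for the highest value.
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 100), ("b", 100), ("c", 90), ("d", 80)],
)

# The analytic call goes in a subquery so the outer WHERE can filter on it.
query = """
SELECT name, score, place
FROM (SELECT name, score,
             DENSE_RANK() OVER (ORDER BY score DESC) AS place
      FROM scores) AS ranked
WHERE place <= 2
ORDER BY place, name
"""
top_two = conn.execute(query).fetchall()
for row in top_two:
    print(row)
```

Because the sequence has no gaps, the tie for first place does not push the next distinct value out of the report: three rows come back for the two highest values.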
Similar to
The following example demonstrates how the places
in the result set, producing the same result for duplicate values, but with a strict
sequence from 1 to the number of groups. For example, when results are ordered by the
The following examples show how the
Partitioning by the
Partitioning by the
The following example shows how
Returns the expression value from the first row in the window. The return value is
The
If any duplicate values occur in the tuples evaluated by the
The following example shows a table with a wide variety of country-appropriate greetings. For consistency,
we want to standardize on a single greeting for each country. The
Changing the order in which the names are evaluated changes which greeting is applied to each group.
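The effect of the evaluation order can be sketched as follows, assuming the function in use is `FIRST_VALUE()`. The SQL runs in SQLite through Python's `sqlite3` module as a stand-in, and the `mail_merge` table, the names, and the greetings are hypothetical.

```python
import sqlite3

# Hypothetical greetings data, invented for illustration only.
greetings = [
    ("Germany", "Ansel", "Guten tag"),
    ("Germany", "Berta", "Hallo"),
    ("USA", "Cal", "Hi"),
    ("USA", "Dee", "Hello"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mail_merge (country TEXT, name TEXT, greeting TEXT)")
conn.executemany("INSERT INTO mail_merge VALUES (?, ?, ?)", greetings)

def standardized(direction):
    # direction is "ASC" or "DESC"; it decides which row comes first
    # within each country partition.
    query = f"""
    SELECT country, name,
           FIRST_VALUE(greeting) OVER (PARTITION BY country
                                       ORDER BY name {direction}) AS std_greeting
    FROM mail_merge
    ORDER BY country, name
    """
    return conn.execute(query).fetchall()

by_name_asc = standardized("ASC")
by_name_desc = standardized("DESC")
print(by_name_asc)
print(by_name_desc)
```

With ascending names every German row gets "Guten tag"; with descending names every German row gets "Hallo".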
This function returns the value of an expression using column values from a preceding row. You specify an integer offset, which designates a row position some number of rows previous to the current row. Any column references in the expression argument refer to column values from that prior row. Typically, the table contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
The
Sometimes used as an alternative to doing a self-join.
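As a sketch of how a preceding-row lookup replaces a self-join, assuming the function described is `LAG()`: the query below fetches the previous day's value and computes a day-over-day delta. It runs in SQLite via Python's `sqlite3` module as a stand-in, and the `prices` table and figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (day TEXT, closing REAL)")
# Hypothetical closing prices, invented for illustration only.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("2014-10-01", 10.0), ("2014-10-02", 12.0), ("2014-10-03", 11.0)],
)

query = """
SELECT day,
       closing,
       LAG(closing, 1) OVER (ORDER BY day) AS prev_closing,
       closing - LAG(closing, 1) OVER (ORDER BY day) AS delta
FROM prices
ORDER BY day
"""
lag_results = conn.execute(query).fetchall()
for row in lag_results:
    print(row)
```

The first row has no preceding row, so both the looked-up value and the delta come back as NULL (`None` in Python).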
The following example uses the same stock data created in
The following example does an arithmetic operation between the current row and a value from the previous
row, to produce a delta value for each day. This example also demonstrates how
This function is the converse of
Returns the expression value from the last row in the window. This same value is repeated for all result
rows for the group. The return value is
The
If any duplicate values occur in the tuples evaluated by the
The following example uses the same
This function returns the value of an expression using column values from a following row. You specify an integer offset, which designates a row position some number of rows after the current row. Any column references in the expression argument refer to column values from that later row. Typically, the table contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows.
The
Sometimes used as an alternative to doing a self-join.
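The following-row lookup can be sketched the same way, assuming the function described is `LEAD()`. The SQL runs in SQLite via Python's `sqlite3` module as a stand-in, and the `prices` table and figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (day TEXT, closing REAL)")
# Hypothetical closing prices, invented for illustration only.
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("2014-10-01", 10.0), ("2014-10-02", 12.0), ("2014-10-03", 11.0)],
)

# Pair each day's closing price with the next day's closing price.
query = """
SELECT day,
       closing,
       LEAD(closing, 1) OVER (ORDER BY day) AS next_closing
FROM prices
ORDER BY day
"""
lead_results = conn.execute(query).fetchall()
for row in lead_results:
    print(row)
```

Here the last row is the one with no match, so its looked-up value is NULL.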
The following example uses the same stock data created in
This function is the converse of
You can include an
You can include an
Returns the "bucket number" associated with each row, between 1 and the value of an expression. For
example, creating 100 buckets puts the lowest 1% of values in the first bucket, while creating 10 buckets
puts the lowest 10% of values in the first bucket. Each partition can have a different number of buckets.
The
The "ntile" name is derived from the practice of dividing result sets into fourths (quartile), tenths
(decile), and so on. The
The number of buckets must be a positive integer.
The number of items in each bucket is identical or almost so, varying by at most 1. If the number of items does not divide evenly between the buckets, the remaining N items are divided evenly among the first N buckets.
If the number of buckets N is greater than the number of input rows in the partition, then each input row is placed in its own bucket, starting from bucket 1, and the remaining buckets are empty.
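Both distribution rules can be checked with a small sketch, assuming the function described is `NTILE()`. The SQL runs in SQLite via Python's `sqlite3` module as a stand-in, and the table `t` and its five rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
# Five hypothetical rows with values 1 through 5.
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1, 6)])

query = """
SELECT x,
       NTILE(2) OVER (ORDER BY x) AS halves,    -- 5 rows into 2 buckets
       NTILE(7) OVER (ORDER BY x) AS sevenths   -- more buckets than rows
FROM t
ORDER BY x
"""
ntile_results = conn.execute(query).fetchall()
for row in ntile_results:
    print(row)
```

With 2 buckets the leftover row goes to the first bucket (sizes 3 and 2). With 7 buckets each row gets its own bucket and buckets 6 and 7 stay empty.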
The following example divides groups of animals into 4 buckets based on their weight. The
The following examples show how the
Again, the result set can be ordered independently
from the analytic evaluation. This next example lists all the animals heaviest to lightest,
showing that elephant and giraffe are in the "top half" of mammals by weight, while
housecat and mouse are in the "bottom half".
Calculates the rank, expressed as a percentage, of each row within a group of rows.
If
The
This function is similar to the
The return values range from 0 to 1 inclusive.
The first row in each partition group always has the value 0.
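The range described above can be illustrated with a short sketch, assuming the function in question is `PERCENT_RANK()` and that it is computed as (rank - 1) / (rows in partition - 1). The SQL runs in SQLite via Python's `sqlite3` module as a stand-in, and the table `t` and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
# Hypothetical input with one pair of tied values.
conn.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (20,), (40,)])

# Ranks are 1, 2, 2, 4, so the results are 0/3, 1/3, 1/3, 3/3.
pct_results = conn.execute(
    "SELECT x, PERCENT_RANK() OVER (ORDER BY x) AS pr FROM t ORDER BY x"
).fetchall()
for row in pct_results:
    print(row)
```

The first row is 0 and the last is 1, while the two tied rows share the value one third.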
A
The following example uses the same
As with Birds
and Mammals
rows each have a Reptile
row has a Mythical
animals have a
Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers
for duplicate values of the ordering expressions. After generating duplicate output values for the tied
input values, the function increments the sequence by the number of tied values.
Therefore, the sequence contains both duplicates and gaps when the input contains duplicates. Starts the
sequence over for each group produced by the
The
Often used for top-N and bottom-N queries. For example, it could produce a "top 10" report including
several items that were tied for 10th place.
Similar to
The following example demonstrates how the places
in the result set, producing the same result for duplicate values, and skipping values in the
sequence to account for the number of duplicates. For example, when results are ordered by the
The following examples show how the
Partitioning by the
Partitioning by the
The following example shows how a magazine might prepare a list of history's wealthiest people. Croesus and Midas are tied for second, then Crassus is fourth.
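A sketch of that ranking, assuming the function in use is `RANK()`, might look like this. The SQL runs in SQLite via Python's `sqlite3` module as a stand-in. The `wealth` table, the first-place entry, and all net worth figures are hypothetical; only the ordering of Croesus, Midas, and Crassus comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wealth (name TEXT, net_worth INTEGER)")
# Hypothetical figures (arbitrary units); Croesus and Midas are tied.
conn.executemany(
    "INSERT INTO wealth VALUES (?, ?)",
    [("Solomon", 2000), ("Croesus", 1000), ("Midas", 1000), ("Crassus", 500)],
)

query = """
SELECT name, net_worth,
       RANK() OVER (ORDER BY net_worth DESC) AS place
FROM wealth
ORDER BY place, name
"""
rank_results = conn.execute(query).fetchall()
for row in rank_results:
    print(row)
```

The two tied rows both receive 2, and the sequence then skips 3, so the next row is ranked 4.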
Returns an ascending sequence of integers, starting with 1. Starts the sequence over for each group
produced by the
The
Often used for top-N and bottom-N queries where the input values are known to be unique, or precisely N rows are needed regardless of duplicate values.
Because its result value is different for each row in the result set (when used without a
Similar to
The following example demonstrates how
The following example shows how a financial institution might assign customer IDs to some of history's
wealthiest figures. Although two of the people have identical net worth figures, unique IDs are required
for this purpose.
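Assigning unique IDs in this way can be sketched as follows, assuming the function in use is `ROW_NUMBER()`. The SQL runs in SQLite via Python's `sqlite3` module as a stand-in. The `wealth` table, the names, and the figures are hypothetical, and the extra `name` sort key is an assumption added here so that the order of the tied rows is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wealth (name TEXT, net_worth INTEGER)")
# Hypothetical figures (arbitrary units); Croesus and Midas are tied.
conn.executemany(
    "INSERT INTO wealth VALUES (?, ?)",
    [("Solomon", 2000), ("Croesus", 1000), ("Midas", 1000), ("Crassus", 500)],
)

# Sorting by name as a tiebreaker makes the IDs reproducible;
# without it, the order of tied rows is not guaranteed.
query = """
SELECT ROW_NUMBER() OVER (ORDER BY net_worth DESC, name) AS customer_id,
       name, net_worth
FROM wealth
ORDER BY customer_id
"""
id_results = conn.execute(query).fetchall()
for row in id_results:
    print(row)
```

Every row gets a distinct integer even though two net worth values are identical.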
You can include an