impala/docs/topics/impala_count.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="count">

  <title>COUNT Function</title>
  <titlealts audience="PDF"><navtitle>COUNT</navtitle></titlealts>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="Impala Functions"/>
      <data name="Category" value="Analytic Functions"/>
      <data name="Category" value="Aggregate Functions"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      <indexterm audience="hidden">count() function</indexterm>
      An aggregate function that returns the number of rows, or the number of non-<codeph>NULL</codeph> rows.
    </p>

    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>COUNT([DISTINCT | ALL] <varname>expression</varname>) [OVER (<varname>analytic_clause</varname>)]</codeblock>

    <p>
      Depending on the argument, <codeph>COUNT()</codeph> considers rows that meet certain conditions:
    </p>

    <ul>
      <li>
        The notation <codeph>COUNT(*)</codeph> includes <codeph>NULL</codeph> values in the total.
      </li>

      <li>
        The notation <codeph>COUNT(<varname>column_name</varname>)</codeph> only considers rows where the column
        contains a non-<codeph>NULL</codeph> value.
      </li>

      <li>
        You can also combine <codeph>COUNT</codeph> with the <codeph>DISTINCT</codeph> operator to eliminate
        duplicates before counting, and to count the combinations of values across multiple columns.
      </li>
    </ul>

    <p>
      When the query contains a <codeph>GROUP BY</codeph> clause, returns one value for each combination of
      grouping values.
    </p>

    <p>
      <b>Return type:</b> <codeph>BIGINT</codeph>
    </p>

    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

    <p conref="../shared/impala_common.xml#common/partition_key_optimization"/>

    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

    <p conref="../shared/impala_common.xml#common/complex_types_aggregation_explanation"/>

    <p conref="../shared/impala_common.xml#common/complex_types_aggregation_example"/>

    <p conref="../shared/impala_common.xml#common/example_blurb"/>

<codeblock>-- How many rows total are in the table, regardless of NULL values?
select count(*) from t1;
-- How many rows are in the table with non-NULL values for a column?
select count(c1) from t1;
-- Count the rows that meet certain conditions.
-- Again, * includes NULLs, so COUNT(*) might be greater than COUNT(col).
select count(*) from t1 where x &gt; 10;
select count(c1) from t1 where x &gt; 10;
-- Can also be used in combination with DISTINCT and/or GROUP BY.
-- Combine COUNT and DISTINCT to find the number of unique values.
-- Must use column names rather than * with COUNT(DISTINCT ...) syntax.
-- Rows with NULL values are not counted.
select count(distinct c1) from t1;
-- Rows with a NULL value in _either_ column are not counted.
select count(distinct c1, c2) from t1;
-- Return more than one result.
select month, year, count(distinct visitor_id) from web_stats group by month, year;
</codeblock>

    <p rev="2.0.0">
      The following examples show how to use <codeph>COUNT()</codeph> in an analytic context. They use a table
      containing integers from 1 to 10. Notice how the <codeph>COUNT()</codeph> is reported for each input value, as
      opposed to the <codeph>GROUP BY</codeph> clause which condenses the result set.
<codeblock>select x, property, count(x) over (partition by property) as count from int_t where property in ('odd','even');
+----+----------+-------+
| x  | property | count |
+----+----------+-------+
| 2  | even     | 5     |
| 4  | even     | 5     |
| 6  | even     | 5     |
| 8  | even     | 5     |
| 10 | even     | 5     |
| 1  | odd      | 5     |
| 3  | odd      | 5     |
| 5  | odd      | 5     |
| 7  | odd      | 5     |
| 9  | odd      | 5     |
+----+----------+-------+
</codeblock>

Adding an <codeph>ORDER BY</codeph> clause lets you experiment with results that are cumulative or apply to a moving
set of rows (the <q>window</q>). The following examples use <codeph>COUNT()</codeph> in an analytic context
(that is, with an <codeph>OVER()</codeph> clause) to produce a running count of all the even values,
then a running count of all the odd values. The basic <codeph>ORDER BY x</codeph> clause implicitly
activates a window clause of <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
which is effectively the same as <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>,
therefore all of these examples produce the same results:
<codeblock>select x, property,
  count(x) over (partition by property <b>order by x</b>) as 'cumulative count'
  from int_t where property in ('odd','even');
+----+----------+------------------+
| x  | property | cumulative count |
+----+----------+------------------+
| 2  | even     | 1                |
| 4  | even     | 2                |
| 6  | even     | 3                |
| 8  | even     | 4                |
| 10 | even     | 5                |
| 1  | odd      | 1                |
| 3  | odd      | 2                |
| 5  | odd      | 3                |
| 7  | odd      | 4                |
| 9  | odd      | 5                |
+----+----------+------------------+

select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>range between unbounded preceding and current row</b>
  ) as 'cumulative total'
from int_t where property in ('odd','even');
+----+----------+------------------+
| x  | property | cumulative count |
+----+----------+------------------+
| 2  | even     | 1                |
| 4  | even     | 2                |
| 6  | even     | 3                |
| 8  | even     | 4                |
| 10 | even     | 5                |
| 1  | odd      | 1                |
| 3  | odd      | 2                |
| 5  | odd      | 3                |
| 7  | odd      | 4                |
| 9  | odd      | 5                |
+----+----------+------------------+

select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>rows between unbounded preceding and current row</b>
  ) as 'cumulative total'
  from int_t where property in ('odd','even');
+----+----------+------------------+
| x  | property | cumulative count |
+----+----------+------------------+
| 2  | even     | 1                |
| 4  | even     | 2                |
| 6  | even     | 3                |
| 8  | even     | 4                |
| 10 | even     | 5                |
| 1  | odd      | 1                |
| 3  | odd      | 2                |
| 5  | odd      | 3                |
| 7  | odd      | 4                |
| 9  | odd      | 5                |
+----+----------+------------------+
</codeblock>

The following examples show how to construct a moving window, with a running count taking into account 1 row before
and 1 row after the current row, within the same partition (all the even values or all the odd values).
Therefore, the count is consistently 3 for rows in the middle of the window, and 2 for
rows near the ends of the window, where there is no preceding or no following row in the partition.
Because of a restriction in the Impala <codeph>RANGE</codeph> syntax, this type of
moving window is possible with the <codeph>ROWS BETWEEN</codeph> clause but not the <codeph>RANGE BETWEEN</codeph>
clause:
<codeblock>select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>rows between 1 preceding and 1 following</b>
  ) as 'moving total'
  from int_t where property in ('odd','even');
+----+----------+--------------+
| x  | property | moving total |
+----+----------+--------------+
| 2  | even     | 2            |
| 4  | even     | 3            |
| 6  | even     | 3            |
| 8  | even     | 3            |
| 10 | even     | 2            |
| 1  | odd      | 2            |
| 3  | odd      | 3            |
| 5  | odd      | 3            |
| 7  | odd      | 3            |
| 9  | odd      | 2            |
+----+----------+--------------+

-- Doesn't work because of syntax restriction on RANGE clause.
select x, property,
  count(x) over
  (
    partition by property
    <b>order by x</b>
    <b>range between 1 preceding and 1 following</b>
  ) as 'moving total'
from int_t where property in ('odd','even');
ERROR: AnalysisException: RANGE is only supported with both the lower and upper bounds UNBOUNDED or one UNBOUNDED and the other CURRENT ROW.
</codeblock>
    </p>

    <p conref="../shared/impala_common.xml#common/related_info"/>

    <p>
      <xref href="impala_analytic_functions.xml#analytic_functions"/>
    </p>

  </conbody>
</concept>