impala/docs/topics/impala_group_by.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="group_by">

  <title>GROUP BY Clause</title>
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Aggregate Functions"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      Specify the <codeph>GROUP BY</codeph> clause in queries that use aggregation functions, such as
      <codeph><xref href="impala_count.xml#count">COUNT()</xref></codeph>,
      <codeph><xref href="impala_sum.xml#sum">SUM()</xref></codeph>,
      <codeph><xref href="impala_avg.xml#avg">AVG()</xref></codeph>,
      <codeph><xref href="impala_min.xml#min">MIN()</xref></codeph>, and
      <codeph><xref href="impala_max.xml#max">MAX()</xref></codeph>. Specify in the
      <codeph><xref href="impala_group_by.xml#group_by">GROUP BY</xref></codeph> clause the names of all the
      columns that do not participate in the aggregation operation.
    </p>

    <!-- Good to show an example of cases where ORDER BY does and doesn't work with complex types. -->
    <p conref="../shared/impala_common.xml#common/complex_types_blurb"/>

    <p rev="2.3.0">
      In <keyword keyref="impala23_full"/> and higher, the complex data types <codeph>STRUCT</codeph>,
      <codeph>ARRAY</codeph>, and <codeph>MAP</codeph> are available. These columns cannot
      be referenced directly in the <codeph>ORDER BY</codeph> clause.
      When you query a complex type column, you use join notation to <q>unpack</q> the elements
      of the complex type, and within the join query you can include an <codeph>ORDER BY</codeph>
      clause to control the order in the result set of the scalar elements from the complex type.
      See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
    </p>

    <p conref="../shared/impala_common.xml#common/zero_length_strings"/>

    <p conref="../shared/impala_common.xml#common/example_blurb"/>

    <p>
      For example, the following query finds the 5 items that sold the highest total quantity (using the
      <codeph>SUM()</codeph> function, and also counts the number of sales transactions for those items (using the
      <codeph>COUNT()</codeph> function). Because the column representing the item IDs is not used in any
      aggregation functions, we specify that column in the <codeph>GROUP BY</codeph> clause.
    </p>

<codeblock>select
  <b>ss_item_sk</b> as Item,
  <b>count</b>(ss_item_sk) as Times_Purchased,
  <b>sum</b>(ss_quantity) as Total_Quantity_Purchased
from store_sales
  <b>group by ss_item_sk</b>
  order by sum(ss_quantity) desc
  limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 9325  | 372             | 19072                    |
| 4279  | 357             | 18501                    |
| 7507  | 371             | 18475                    |
| 5953  | 369             | 18451                    |
| 16753 | 375             | 18446                    |
+-------+-----------------+--------------------------+</codeblock>

    <p>
      The <codeph>HAVING</codeph> clause lets you filter the results of aggregate functions, because you cannot
      refer to those expressions in the <codeph>WHERE</codeph> clause. For example, to find the 5 lowest-selling
      items that were included in at least 100 sales transactions, we could use this query:
    </p>

<codeblock>select
  <b>ss_item_sk</b> as Item,
  <b>count</b>(ss_item_sk) as Times_Purchased,
  <b>sum</b>(ss_quantity) as Total_Quantity_Purchased
from store_sales
  <b>group by ss_item_sk</b>
  <b>having times_purchased &gt;= 100</b>
  order by sum(ss_quantity)
  limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 13943 | 105             | 4087                     |
| 2992  | 101             | 4176                     |
| 4773  | 107             | 4204                     |
| 14350 | 103             | 4260                     |
| 11956 | 102             | 4275                     |
+-------+-----------------+--------------------------+</codeblock>

    <p>
      When performing calculations involving scientific or financial data, remember that columns with type
      <codeph>FLOAT</codeph> or <codeph>DOUBLE</codeph> are stored as true floating-point numbers, which cannot
      precisely represent every possible fractional value. Thus, if you include a <codeph>FLOAT</codeph> or
      <codeph>DOUBLE</codeph> column in a <codeph>GROUP BY</codeph> clause, the results might not precisely match
      literal values in your query or from an original Text data file. Use rounding operations, the
      <codeph>BETWEEN</codeph> operator, or another arithmetic technique to match floating-point values that are
      <q>near</q> literal values you expect. For example, this query on the <codeph>ss_wholesale_cost</codeph>
      column returns cost values that are close but not identical to the original figures that were entered as
      decimal fractions.
    </p>

<codeblock>select ss_wholesale_cost, avg(ss_quantity * ss_sales_price) as avg_revenue_per_sale
  from sales
  group by ss_wholesale_cost
  order by avg_revenue_per_sale desc
  limit 5;
+-------------------+----------------------+
| ss_wholesale_cost | avg_revenue_per_sale |
+-------------------+----------------------+
| 96.94000244140625 | 4454.351539300434    |
| 95.93000030517578 | 4423.119941283189    |
| 98.37999725341797 | 4332.516490316291    |
| 97.97000122070312 | 4330.480601655014    |
| 98.52999877929688 | 4291.316953108634    |
+-------------------+----------------------+</codeblock>

    <p>
      Notice how wholesale cost values originally entered as decimal fractions such as <codeph>96.94</codeph> and
      <codeph>98.38</codeph> are slightly larger or smaller in the result set, due to precision limitations in the
      hardware floating-point types. The imprecise representation of <codeph>FLOAT</codeph> and
      <codeph>DOUBLE</codeph> values is why financial data processing systems often store currency using data types
      that are less space-efficient but avoid these types of rounding errors.
    </p>

    <p conref="../shared/impala_common.xml#common/related_info"/>
    <p>
      <xref href="impala_select.xml#select"/>,
      <xref href="impala_aggregate_functions.xml#aggregate_functions"/>
    </p>

  </conbody>
</concept>