IMPALA-7687: [DOCS] Support for multiple DISTINCT in a query

- Removed notes about the single DISTINCT restriction.
- Rewrote the description for the APPX_COUNT_DISTINCT query option.

Change-Id: I3a6e664b016e9408a3ff809f1811253a91764481
Reviewed-on: http://gerrit.cloudera.org:8080/11823
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
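
For illustration, the kind of query whose restriction notes this change removes can be sketched as follows. This is a minimal sketch, assuming the int_t table with columns x and property from the example deleted from impala_appx_count_distinct.xml, and an Impala version that includes IMPALA-7687. Results are not shown because they depend on the data.

```sql
-- Assumed table: int_t(x, property), borrowed from the removed docs example.
-- Two DISTINCT aggregates over different columns in a single query.
-- Before this change the docs described this as an AnalysisException.
SELECT COUNT(DISTINCT x), COUNT(DISTINCT property) FROM int_t;

-- Optional: trade exactness for speed. With the option enabled, Impala
-- implicitly converts each COUNT(DISTINCT) to NDV(), so counts are approximate.
SET APPX_COUNT_DISTINCT=true;
SELECT COUNT(DISTINCT x), COUNT(DISTINCT property) FROM int_t;
```

The first query returns exact counts. The second form is worth considering only when an approximate count is acceptable.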
2025-12-19 09:58:28 -05:00 · 2018-10-29 17:33:30 -07:00
parent 85166afa8a
commit dcc4024b1d
6 changed files with 33 additions and 111 deletions
--- a/docs/shared/impala_common.xml
+++ b/docs/shared/impala_common.xml
@@ -2117,33 +2117,6 @@ show functions in _impala_builtins like '*<varname>substring</varname>*';
        <codeph>--insert_inherit_permissions</codeph> startup option for the <cmdname>impalad</cmdname> daemon.
      </p>

-      <note id="multiple_count_distinct">
-        <p>
-          By default, Impala only allows a single <codeph>COUNT(DISTINCT <varname>columns</varname>)</codeph>
-          expression in each query.
-        </p>
-        <p>
-          If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by
-          specifying <codeph>NDV(<varname>column</varname>)</codeph>; a query can contain multiple instances of
-          <codeph>NDV(<varname>column</varname>)</codeph>. To make Impala automatically rewrite
-          <codeph>COUNT(DISTINCT)</codeph> expressions to <codeph>NDV()</codeph>, enable the
-          <codeph>APPX_COUNT_DISTINCT</codeph> query option.
-        </p>
-        <p>
-          To produce the same result as multiple <codeph>COUNT(DISTINCT)</codeph> expressions, you can use the
-          following technique for queries involving a single table:
-        </p>
-<codeblock xml:space="preserve">select v1.c1 result1, v2.c1 result2 from
-  (select count(distinct col1) as c1 from t1) v1
-    cross join
-  (select count(distinct col2) as c1 from t1) v2;
-</codeblock>
-        <p>
-          Because <codeph>CROSS JOIN</codeph> is an expensive operation, prefer to use the <codeph>NDV()</codeph>
-          technique wherever practical.
-        </p>
-      </note>
-
      <p>
        <ph id="union_all_vs_union">Prefer <codeph>UNION ALL</codeph> over <codeph>UNION</codeph> when you know the
        data sets are disjoint or duplicate values are not a problem; <codeph>UNION ALL</codeph> is more efficient
--- a/docs/topics/impala_appx_count_distinct.xml
+++ b/docs/topics/impala_appx_count_distinct.xml
@@ -21,7 +21,13 @@ under the License.
 <concept rev="2.0.0" id="appx_count_distinct">

  <title>APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only)</title>
-  <titlealts audience="PDF"><navtitle>APPX_COUNT_DISTINCT</navtitle></titlealts>
+
+  <titlealts audience="PDF">
+
+    <navtitle>APPX_COUNT_DISTINCT</navtitle>
+
+  </titlealts>
+
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
@@ -35,65 +41,26 @@ under the License.
  <conbody>

    <p rev="2.0.0">
-      <indexterm audience="hidden">APPX_COUNT_DISTINCT query option</indexterm>
-      Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting
-      each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is
-      approximate rather than precise.
+      When the <codeph>APPX_COUNT_DISTINCT</codeph> query option is set to
+      <codeph>TRUE</codeph>, Impala implicitly converts <codeph>COUNT(DISTINCT)</codeph>
+      operations to <codeph>NDV()</codeph> function calls. The resulting count is
+      approximate rather than precise. Enable the query option when a small amount of error
+      is acceptable in exchange for faster results than exact
+      <codeph>COUNT(DISTINCT)</codeph> queries provide.
    </p>

    <p conref="../shared/impala_common.xml#common/type_boolean"/>

    <p conref="../shared/impala_common.xml#common/default_false_0"/>

-    <p conref="../shared/impala_common.xml#common/example_blurb"/>
-
-    <p>
-      The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> lets you work around the restriction
-      where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single
-      column. By default, you can count the distinct values of one column or another, but not both in a single
-      query:
-    </p>
-
-<codeblock>[localhost:21000] &gt; select count(distinct x) from int_t;
-+-------------------+
-| count(distinct x) |
-+-------------------+
-| 10                |
-+-------------------+
-[localhost:21000] &gt; select count(distinct property) from int_t;
-+--------------------------+
-| count(distinct property) |
-+--------------------------+
-| 7                        |
-+--------------------------+
-[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
-ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
-as count(DISTINCT x); deviating function: count(DISTINCT property)
-</codeblock>
-
-    <p>
-      When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, now the query with multiple
-      <codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior requires a query option is that each
-      <codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead,
-      which provides an approximate result rather than a precise count.
-    </p>
-
-<codeblock>[localhost:21000] &gt; set APPX_COUNT_DISTINCT=true;
-[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
-+-------------------+--------------------------+
-| count(distinct x) | count(distinct property) |
-+-------------------+--------------------------+
-| 10                | 7                        |
-+-------------------+--------------------------+
-</codeblock>
-
    <p conref="../shared/impala_common.xml#common/related_info"/>

    <p>
      <xref href="impala_count.xml#count"/>,
-      <xref href="impala_distinct.xml#distinct"/>,
-      <xref href="impala_ndv.xml#ndv"/>
+      <xref
+        href="impala_distinct.xml#distinct"/>, <xref href="impala_ndv.xml#ndv"/>
    </p>

  </conbody>
+
 </concept>
--- a/docs/topics/impala_count.xml
+++ b/docs/topics/impala_count.xml
@@ -242,8 +242,6 @@ ERROR: AnalysisException: RANGE is only supported with both the lower and upper
 </codeblock>
    </p>

-    <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-
    <p conref="../shared/impala_common.xml#common/related_info"/>

    <p>
--- a/docs/topics/impala_distinct.xml
+++ b/docs/topics/impala_distinct.xml
@@ -21,6 +21,7 @@ under the License.
 <concept id="distinct">

  <title>DISTINCT Operator</title>
+
  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
@@ -35,45 +36,40 @@ under the License.
  <conbody>

    <p>
-      <indexterm audience="hidden">DISTINCT operator</indexterm>
-      The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
-      remove duplicates:
+      The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the
+      result set to remove duplicates.
    </p>

 <codeblock>-- Returns the unique values from one column.
 -- NULL is included in the set of values if any rows have a NULL in this column.
-select distinct c_birth_country from customer;
+SELECT DISTINCT c_birth_country FROM customer;
+
 -- Returns the unique combinations of values from multiple columns.
-select distinct c_salutation, c_last_name from customer;</codeblock>
+SELECT DISTINCT c_salutation, c_last_name FROM customer;</codeblock>

    <p>
-      You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
-      <codeph>COUNT()</codeph>, to find how many different values a column contains:
+      You can use <codeph>DISTINCT</codeph> in combination with an aggregation function,
+      typically <codeph>COUNT()</codeph>, to find how many different values a column contains.
    </p>

 <codeblock>-- Counts the unique values from one column.
 -- NULL is not included as a distinct value in the count.
-select count(distinct c_birth_country) from customer;
-- Counts the unique combinations of values from multiple columns.
-select count(distinct c_salutation, c_last_name) from customer;</codeblock>
+SELECT COUNT(DISTINCT c_birth_country) FROM customer;

-    <p>
-      One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
-      aggregation function in the same query. For example, you could not have a single query with both
-      <codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
-      <codeph>SELECT</codeph> list.
-    </p>
+-- Counts the unique combinations of values from multiple columns.
+SELECT COUNT(DISTINCT c_salutation, c_last_name) FROM customer;</codeblock>

    <p conref="../shared/impala_common.xml#common/zero_length_strings"/>

-    <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-
    <note>
      <p>
-        In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
-        Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
-        BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
+        In contrast with some database systems that always return <codeph>DISTINCT</codeph>
+        values in sorted order, Impala does not do any ordering of <codeph>DISTINCT</codeph>
+        values. Always include an <codeph>ORDER BY</codeph> clause if you need the values in
+        alphabetical or numeric sorted order.
      </p>
    </note>
+
  </conbody>
+
 </concept>
--- a/docs/topics/impala_langref_unsupported.xml
+++ b/docs/topics/impala_langref_unsupported.xml
@@ -105,12 +105,6 @@ under the License.
          rather than the <codeph>EXPLODE()</codeph> keyword.
          See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
        </li>
-
-        <li>
-          Multiple <codeph>DISTINCT</codeph> clauses per query, although Impala includes some workarounds for this
-          limitation.
-          <note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
-        </li>
      </ul>

      <p>
--- a/docs/topics/impala_select.xml
+++ b/docs/topics/impala_select.xml
@@ -107,12 +107,6 @@ table_reference := { <varname>table_name</varname> | (<varname>subquery</varname
        that are referenced multiple times in the same query.
      </li>

-      <li>
-        By default, one <codeph>DISTINCT</codeph> clause per query. See <xref href="impala_distinct.xml#distinct"/>
-        for details. See <xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for a query option to
-        allow multiple <codeph>COUNT(DISTINCT)</codeph> impressions in the same query.
-      </li>
-
      <li>
        Subqueries in a <codeph>FROM</codeph> clause. In <keyword keyref="impala20_full"/> and higher,
        subqueries can also go in the <codeph>WHERE</codeph> clause, for example with the