IMPALA-7687: [DOCS] Support for multiple DISTINCT in a query

- Removed notes about the single DISTINCT restriction.
- Rewrote the description for the APPX_COUNT_DISTINCT query option.

Change-Id: I3a6e664b016e9408a3ff809f1811253a91764481
Reviewed-on: http://gerrit.cloudera.org:8080/11823
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Alex Rodoni
2018-10-29 17:33:30 -07:00
parent 85166afa8a
commit dcc4024b1d
6 changed files with 33 additions and 111 deletions
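
As a rough illustration of what this change documents, the sketch below runs a query with two COUNT(DISTINCT) expressions next to the CROSS JOIN workaround described in the removed note, and checks that both return the same counts. It makes several assumptions: SQLite (through Python's standard sqlite3 module) stands in for Impala only so the example is runnable, and the table t1, its columns col1 and col2, and the sample rows are hypothetical. It does not exercise NDV() or the APPX_COUNT_DISTINCT query option, which are specific to Impala.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in engine; this is not Impala.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 INTEGER, col2 TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(1, "a"), (1, "b"), (2, "b"), (3, "c"), (3, "c"), (None, "a")],
)

# Several DISTINCT aggregates over different columns in one SELECT list,
# the form the updated documentation no longer describes as restricted.
direct = conn.execute(
    "SELECT COUNT(DISTINCT col1), COUNT(DISTINCT col2) FROM t1"
).fetchone()

# Workaround from the removed note: one subquery per aggregate,
# combined into a single row with CROSS JOIN.
workaround = conn.execute(
    """
    SELECT v1.c1 AS result1, v2.c1 AS result2 FROM
      (SELECT COUNT(DISTINCT col1) AS c1 FROM t1) v1
      CROSS JOIN
      (SELECT COUNT(DISTINCT col2) AS c1 FROM t1) v2
    """
).fetchone()

print(direct, workaround)
conn.close()
```

Both forms count NULL-free distinct values per column, so the two result rows match; the single-query form avoids the extra subqueries and the join.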

@@ -2117,33 +2117,6 @@ show functions in _impala_builtins like '*<varname>substring</varname>*';
<codeph>--insert_inherit_permissions</codeph> startup option for the <cmdname>impalad</cmdname> daemon.
</p>
<note id="multiple_count_distinct">
<p>
By default, Impala only allows a single <codeph>COUNT(DISTINCT <varname>columns</varname>)</codeph>
expression in each query.
</p>
<p>
If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by
specifying <codeph>NDV(<varname>column</varname>)</codeph>; a query can contain multiple instances of
<codeph>NDV(<varname>column</varname>)</codeph>. To make Impala automatically rewrite
<codeph>COUNT(DISTINCT)</codeph> expressions to <codeph>NDV()</codeph>, enable the
<codeph>APPX_COUNT_DISTINCT</codeph> query option.
</p>
<p>
To produce the same result as multiple <codeph>COUNT(DISTINCT)</codeph> expressions, you can use the
following technique for queries involving a single table:
</p>
<codeblock xml:space="preserve">select v1.c1 result1, v2.c1 result2 from
(select count(distinct col1) as c1 from t1) v1
cross join
(select count(distinct col2) as c1 from t1) v2;
</codeblock>
<p>
Because <codeph>CROSS JOIN</codeph> is an expensive operation, prefer to use the <codeph>NDV()</codeph>
technique wherever practical.
</p>
</note>
<p>
<ph id="union_all_vs_union">Prefer <codeph>UNION ALL</codeph> over <codeph>UNION</codeph> when you know the
data sets are disjoint or duplicate values are not a problem; <codeph>UNION ALL</codeph> is more efficient

@@ -21,7 +21,13 @@ under the License.
<concept rev="2.0.0" id="appx_count_distinct">
<title>APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>APPX_COUNT_DISTINCT</navtitle></titlealts>
<titlealts audience="PDF">
<navtitle>APPX_COUNT_DISTINCT</navtitle>
</titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -35,65 +41,26 @@ under the License.
<conbody>
<p rev="2.0.0">
<indexterm audience="hidden">APPX_COUNT_DISTINCT query option</indexterm>
Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting
each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is
approximate rather than precise.
When the <codeph>APPX_COUNT_DISTINCT</codeph> query option is set to
<codeph>TRUE</codeph>, Impala implicitly converts <codeph>COUNT(DISTINCT)</codeph>
operations to <codeph>NDV()</codeph> function calls. The resulting count is
approximate rather than precise. Enable the query option when a small amount of error
is acceptable in exchange for faster results than exact
<codeph>COUNT(DISTINCT)</codeph> queries provide.
</p>
<p conref="../shared/impala_common.xml#common/type_boolean"/>
<p conref="../shared/impala_common.xml#common/default_false_0"/>
<p conref="../shared/impala_common.xml#common/example_blurb"/>
<p>
The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> query option lets you work around the restriction
where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single
column. By default, you can count the distinct values of one column or another, but not both in a single
query:
</p>
<codeblock>[localhost:21000] &gt; select count(distinct x) from int_t;
+-------------------+
| count(distinct x) |
+-------------------+
| 10 |
+-------------------+
[localhost:21000] &gt; select count(distinct property) from int_t;
+--------------------------+
| count(distinct property) |
+--------------------------+
| 7 |
+--------------------------+
[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
as count(DISTINCT x); deviating function: count(DISTINCT property)
</codeblock>
<p>
When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, now the query with multiple
<codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior requires a query option is that each
<codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead,
which provides an approximate result rather than a precise count.
</p>
<codeblock>[localhost:21000] &gt; set APPX_COUNT_DISTINCT=true;
[localhost:21000] &gt; select count(distinct x), count(distinct property) from int_t;
+-------------------+--------------------------+
| count(distinct x) | count(distinct property) |
+-------------------+--------------------------+
| 10 | 7 |
+-------------------+--------------------------+
</codeblock>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_count.xml#count"/>,
<xref href="impala_distinct.xml#distinct"/>,
<xref href="impala_ndv.xml#ndv"/>
<xref
href="impala_distinct.xml#distinct"/>, <xref href="impala_ndv.xml#ndv"/>
</p>
</conbody>
</concept>

@@ -242,8 +242,6 @@ ERROR: AnalysisException: RANGE is only supported with both the lower and upper
</codeblock>
</p>
<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
<p conref="../shared/impala_common.xml#common/related_info"/>
<p>

@@ -21,6 +21,7 @@ under the License.
<concept id="distinct">
<title>DISTINCT Operator</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -35,45 +36,40 @@ under the License.
<conbody>
<p>
<indexterm audience="hidden">DISTINCT operator</indexterm>
The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
remove duplicates:
The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the
result set to remove duplicates.
</p>
<codeblock>-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from customer;
SELECT DISTINCT c_birth_country FROM customer;
-- Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from customer;</codeblock>
SELECT DISTINCT c_salutation, c_last_name FROM customer;</codeblock>
<p>
You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
<codeph>COUNT()</codeph>, to find how many different values a column contains:
You can use <codeph>DISTINCT</codeph> in combination with an aggregation function,
typically <codeph>COUNT()</codeph>, to find how many different values a column contains.
</p>
<codeblock>-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
select count(distinct c_birth_country) from customer;
-- Counts the unique combinations of values from multiple columns.
select count(distinct c_salutation, c_last_name) from customer;</codeblock>
SELECT COUNT(DISTINCT c_birth_country) FROM customer;
<p>
One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
aggregation function in the same query. For example, you could not have a single query with both
<codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
<codeph>SELECT</codeph> list.
</p>
-- Counts the unique combinations of values from multiple columns.
SELECT COUNT(DISTINCT c_salutation, c_last_name) FROM customer;</codeblock>
<p conref="../shared/impala_common.xml#common/zero_length_strings"/>
<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
<note>
<p>
In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
In contrast with some database systems that always return <codeph>DISTINCT</codeph>
values in sorted order, Impala does not do any ordering of <codeph>DISTINCT</codeph>
values. Always include an <codeph>ORDER BY</codeph> clause if you need the values in
alphabetical or numeric sorted order.
</p>
</note>
</conbody>
</concept>

@@ -105,12 +105,6 @@ under the License.
rather than the <codeph>EXPLODE()</codeph> keyword.
See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
</li>
<li>
Multiple <codeph>DISTINCT</codeph> clauses per query, although Impala includes some workarounds for this
limitation.
<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
</li>
</ul>
<p>

@@ -107,12 +107,6 @@ table_reference := { <varname>table_name</varname> | (<varname>subquery</varname
that are referenced multiple times in the same query.
</li>
<li>
By default, one <codeph>DISTINCT</codeph> clause per query. See <xref href="impala_distinct.xml#distinct"/>
for details. See <xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for a query option to
allow multiple <codeph>COUNT(DISTINCT)</codeph> expressions in the same query.
</li>
<li>
Subqueries in a <codeph>FROM</codeph> clause. In <keyword keyref="impala20_full"/> and higher,
subqueries can also go in the <codeph>WHERE</codeph> clause, for example with the