mirror of https://github.com/apache/impala.git (synced 2025-12-19 09:58:28 -05:00)
IMPALA-7687: [DOCS] Support for multiple DISTINCT in a query
- Removed notes about the single DISTINCT restriction.
- Rewrote the description for the APPX_COUNT_DISTINCT query option.

Change-Id: I3a6e664b016e9408a3ff809f1811253a91764481
Reviewed-on: http://gerrit.cloudera.org:8080/11823
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
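
For illustration only, the behavior these documentation changes describe might be exercised as in the sketch below. It assumes an Impala release that includes the multiple-DISTINCT support named in the commit title, and it reuses the int_t table with columns x and property from the examples removed in this commit. No output is shown because none was captured.

```sql
-- Assumption: int_t(x, property) exists, as in the removed doc examples.

-- Exact counts: more than one DISTINCT aggregate in a single query.
SELECT COUNT(DISTINCT x), COUNT(DISTINCT property) FROM int_t;

-- Approximate counts: with the option enabled, Impala converts each
-- COUNT(DISTINCT) to an NDV() call, trading precision for speed.
SET APPX_COUNT_DISTINCT=true;
SELECT COUNT(DISTINCT x), COUNT(DISTINCT property) FROM int_t;

-- The same approximation requested explicitly, without the query option.
SET APPX_COUNT_DISTINCT=false;
SELECT NDV(x), NDV(property) FROM int_t;
```

The explicit NDV() form keeps the approximation visible in the query text, while the query option applies it to every COUNT(DISTINCT) in the session.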
@@ -2117,33 +2117,6 @@ show functions in _impala_builtins like '*<varname>substring</varname>*';
<codeph>--insert_inherit_permissions</codeph> startup option for the <cmdname>impalad</cmdname> daemon.
</p>

<note id="multiple_count_distinct">
<p>
By default, Impala only allows a single <codeph>COUNT(DISTINCT <varname>columns</varname>)</codeph>
expression in each query.
</p>
<p>
If you do not need precise accuracy, you can produce an estimate of the distinct values for a column by
specifying <codeph>NDV(<varname>column</varname>)</codeph>; a query can contain multiple instances of
<codeph>NDV(<varname>column</varname>)</codeph>. To make Impala automatically rewrite
<codeph>COUNT(DISTINCT)</codeph> expressions to <codeph>NDV()</codeph>, enable the
<codeph>APPX_COUNT_DISTINCT</codeph> query option.
</p>
<p>
To produce the same result as multiple <codeph>COUNT(DISTINCT)</codeph> expressions, you can use the
following technique for queries involving a single table:
</p>
<codeblock xml:space="preserve">select v1.c1 result1, v2.c1 result2 from
(select count(distinct col1) as c1 from t1) v1
cross join
(select count(distinct col2) as c1 from t1) v2;
</codeblock>
<p>
Because <codeph>CROSS JOIN</codeph> is an expensive operation, prefer to use the <codeph>NDV()</codeph>
technique wherever practical.
</p>
</note>

<p>
<ph id="union_all_vs_union">Prefer <codeph>UNION ALL</codeph> over <codeph>UNION</codeph> when you know the
data sets are disjoint or duplicate values are not a problem; <codeph>UNION ALL</codeph> is more efficient

@@ -21,7 +21,13 @@ under the License.
<concept rev="2.0.0" id="appx_count_distinct">

<title>APPX_COUNT_DISTINCT Query Option (<keyword keyref="impala20"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>APPX_COUNT_DISTINCT</navtitle></titlealts>

<titlealts audience="PDF">

<navtitle>APPX_COUNT_DISTINCT</navtitle>

</titlealts>

<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -35,65 +41,26 @@ under the License.
<conbody>

<p rev="2.0.0">
<indexterm audience="hidden">APPX_COUNT_DISTINCT query option</indexterm>
Allows multiple <codeph>COUNT(DISTINCT)</codeph> operations within a single query, by internally rewriting
each <codeph>COUNT(DISTINCT)</codeph> to use the <codeph>NDV()</codeph> function. The resulting count is
approximate rather than precise.
When the <codeph>APPX_COUNT_DISTINCT</codeph> query option is set to
<codeph>TRUE</codeph>, Impala implicitly converts <codeph>COUNT(DISTINCT)</codeph>
operations to <codeph>NDV()</codeph> function calls. The resulting count is
approximate rather than precise. Enable the query option when a small amount of error
is acceptable in exchange for faster query results than with
<codeph>COUNT(DISTINCT)</codeph> queries.
</p>

<p conref="../shared/impala_common.xml#common/type_boolean"/>

<p conref="../shared/impala_common.xml#common/default_false_0"/>

<p conref="../shared/impala_common.xml#common/example_blurb"/>

<p>
The following examples show how the <codeph>APPX_COUNT_DISTINCT</codeph> lets you work around the restriction
where a query can only evaluate <codeph>COUNT(DISTINCT <varname>col_name</varname>)</codeph> for a single
column. By default, you can count the distinct values of one column or another, but not both in a single
query:
</p>

<codeblock>[localhost:21000] > select count(distinct x) from int_t;
+-------------------+
| count(distinct x) |
+-------------------+
| 10                |
+-------------------+
[localhost:21000] > select count(distinct property) from int_t;
+--------------------------+
| count(distinct property) |
+--------------------------+
| 7                        |
+--------------------------+
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters
as count(DISTINCT x); deviating function: count(DISTINCT property)
</codeblock>

<p>
When you enable the <codeph>APPX_COUNT_DISTINCT</codeph> query option, now the query with multiple
<codeph>COUNT(DISTINCT)</codeph> works. The reason this behavior requires a query option is that each
<codeph>COUNT(DISTINCT)</codeph> is rewritten internally to use the <codeph>NDV()</codeph> function instead,
which provides an approximate result rather than a precise count.
</p>

<codeblock>[localhost:21000] > set APPX_COUNT_DISTINCT=true;
[localhost:21000] > select count(distinct x), count(distinct property) from int_t;
+-------------------+--------------------------+
| count(distinct x) | count(distinct property) |
+-------------------+--------------------------+
| 10                | 7                        |
+-------------------+--------------------------+
</codeblock>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>
<xref href="impala_count.xml#count"/>,
<xref href="impala_distinct.xml#distinct"/>,
<xref href="impala_ndv.xml#ndv"/>
<xref
href="impala_distinct.xml#distinct"/>, <xref href="impala_ndv.xml#ndv"/>
</p>

</conbody>

</concept>

@@ -242,8 +242,6 @@ ERROR: AnalysisException: RANGE is only supported with both the lower and upper
</codeblock>
</p>

<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>

<p conref="../shared/impala_common.xml#common/related_info"/>

<p>

@@ -21,6 +21,7 @@ under the License.
<concept id="distinct">

<title>DISTINCT Operator</title>

<prolog>
<metadata>
<data name="Category" value="Impala"/>
@@ -35,45 +36,40 @@ under the License.
<conbody>

<p>
<indexterm audience="hidden">DISTINCT operator</indexterm>
The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the result set to
remove duplicates:
The <codeph>DISTINCT</codeph> operator in a <codeph>SELECT</codeph> statement filters the
result set to remove duplicates.
</p>

<codeblock>-- Returns the unique values from one column.
-- NULL is included in the set of values if any rows have a NULL in this column.
select distinct c_birth_country from customer;
SELECT DISTINCT c_birth_country FROM customer;

-- Returns the unique combinations of values from multiple columns.
select distinct c_salutation, c_last_name from customer;</codeblock>
SELECT DISTINCT c_salutation, c_last_name FROM customer;</codeblock>

<p>
You can use <codeph>DISTINCT</codeph> in combination with an aggregation function, typically
<codeph>COUNT()</codeph>, to find how many different values a column contains:
You can use <codeph>DISTINCT</codeph> in combination with an aggregation function,
typically <codeph>COUNT()</codeph>, to find how many different values a column contains.
</p>

<codeblock>-- Counts the unique values from one column.
-- NULL is not included as a distinct value in the count.
select count(distinct c_birth_country) from customer;
-- Counts the unique combinations of values from multiple columns.
select count(distinct c_salutation, c_last_name) from customer;</codeblock>
SELECT COUNT(DISTINCT c_birth_country) FROM customer;

<p>
One construct that Impala SQL does <i>not</i> support is using <codeph>DISTINCT</codeph> in more than one
aggregation function in the same query. For example, you could not have a single query with both
<codeph>COUNT(DISTINCT c_first_name)</codeph> and <codeph>COUNT(DISTINCT c_last_name)</codeph> in the
<codeph>SELECT</codeph> list.
</p>
-- Counts the unique combinations of values from multiple columns.
SELECT COUNT(DISTINCT c_salutation, c_last_name) FROM customer;</codeblock>

<p conref="../shared/impala_common.xml#common/zero_length_strings"/>

<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>

<note>
<p>
In contrast with some database systems that always return <codeph>DISTINCT</codeph> values in sorted order,
Impala does not do any ordering of <codeph>DISTINCT</codeph> values. Always include an <codeph>ORDER
BY</codeph> clause if you need the values in alphabetical or numeric sorted order.
In contrast with some database systems that always return <codeph>DISTINCT</codeph>
values in sorted order, Impala does not do any ordering of <codeph>DISTINCT</codeph>
values. Always include an <codeph>ORDER BY</codeph> clause if you need the values in
alphabetical or numeric sorted order.
</p>
</note>

</conbody>

</concept>

@@ -105,12 +105,6 @@ under the License.
rather than the <codeph>EXPLODE()</codeph> keyword.
See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
</li>

<li>
Multiple <codeph>DISTINCT</codeph> clauses per query, although Impala includes some workarounds for this
limitation.
<note conref="../shared/impala_common.xml#common/multiple_count_distinct"/>
</li>
</ul>

<p>

@@ -107,12 +107,6 @@ table_reference := { <varname>table_name</varname> | (<varname>subquery</varname
that are referenced multiple times in the same query.
</li>

<li>
By default, one <codeph>DISTINCT</codeph> clause per query. See <xref href="impala_distinct.xml#distinct"/>
for details. See <xref href="impala_appx_count_distinct.xml#appx_count_distinct"/> for a query option to
allow multiple <codeph>COUNT(DISTINCT)</codeph> expressions in the same query.
</li>

<li>
Subqueries in a <codeph>FROM</codeph> clause. In <keyword keyref="impala20_full"/> and higher,
subqueries can also go in the <codeph>WHERE</codeph> clause, for example with the