mirror of
https://github.com/apache/impala.git
synced 2026-01-07 09:02:19 -05:00
For this change to land in master, the audience="hidden" code review needs to be completed first. Otherwise, the doc build would still work but the audience="hidden" content would be visible rather than hidden as desired. Some work happening in parallel might introduce additional instances of audience="Cloudera". I suggest addressing those in a followup CR so this global change can land quickly. Since the changes apply across so many different files, but are so narrow in scope, I suggest that the way to validate (check that no extraneous changes were introduced accidentally) is to diff just the changed lines: git diff -U0 HEAD^ HEAD In patch set 2, I updated other topics marked audience="Cloudera" by CRs that were pushed in the meantime. Change-Id: Ic93d89da77e1f51bbf548a522d98d0c4e2fb31c8 Reviewed-on: http://gerrit.cloudera.org:8080/5613 Reviewed-by: John Russell <jrussell@cloudera.com> Tested-by: Impala Public Jenkins
553 lines
26 KiB
XML
553 lines
26 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="joins">
|
|
|
|
<title>Joins in Impala SELECT Statements</title>
|
|
<titlealts audience="PDF"><navtitle>Joins</navtitle></titlealts>
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="SQL"/>
|
|
<data name="Category" value="Querying"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
<indexterm audience="hidden">joins</indexterm>
|
|
A join query is a <codeph>SELECT</codeph> statement that combines data from two or more tables,
|
|
and returns a result set containing items from some or all of those tables. It is a way to
|
|
cross-reference and correlate related data that is organized into multiple tables, typically
|
|
using identifiers that are repeated in each of the joined tables.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<p conref="../shared/impala_common.xml#common/join_types"/>
|
|
|
|
<codeblock>SELECT <varname>select_list</varname> FROM
|
|
<varname>table_or_subquery1</varname> [INNER] JOIN <varname>table_or_subquery2</varname> |
|
|
<varname>table_or_subquery1</varname> {LEFT [OUTER] | RIGHT [OUTER] | FULL [OUTER]} JOIN <varname>table_or_subquery2</varname> |
|
|
<varname>table_or_subquery1</varname> {LEFT | RIGHT} SEMI JOIN <varname>table_or_subquery2</varname> |
|
|
<ph rev="2.0.0"><varname>table_or_subquery1</varname> {LEFT | RIGHT} ANTI JOIN <varname>table_or_subquery2</varname> |</ph>
|
|
[ ON <varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...] |
|
|
USING (<varname>col1</varname> [, <varname>col2</varname> ...]) ]
|
|
[<varname>other_join_clause</varname> ...]
|
|
[ WHERE <varname>where_clauses</varname> ]
|
|
|
|
SELECT <varname>select_list</varname> FROM
|
|
<varname>table_or_subquery1</varname>, <varname>table_or_subquery2</varname> [, <varname>table_or_subquery3</varname> ...]
|
|
[<varname>other_join_clause</varname> ...]
|
|
WHERE
|
|
<varname>col1</varname> = <varname>col2</varname> [AND <varname>col3</varname> = <varname>col4</varname> ...]
|
|
|
|
SELECT <varname>select_list</varname> FROM
|
|
<varname>table_or_subquery1</varname> CROSS JOIN <varname>table_or_subquery2</varname>
|
|
[<varname>other_join_clause</varname> ...]
|
|
[ WHERE <varname>where_clauses</varname> ]</codeblock>
|
|
|
|
<p>
|
|
<b>SQL-92 and SQL-89 Joins:</b>
|
|
</p>
|
|
|
|
<p>
|
|
Queries with the explicit <codeph>JOIN</codeph> keywords are known as SQL-92 style joins, referring to the
|
|
level of the SQL standard where they were introduced. The corresponding <codeph>ON</codeph> or
|
|
<codeph>USING</codeph> clauses clearly show which columns are used as the join keys in each case:
|
|
</p>
|
|
|
|
<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b>
|
|
<b>ON t1.id = t2.id and t1.type_flag = t2.type_flag</b>
|
|
WHERE t1.c1 > 100;
|
|
|
|
SELECT t1.c1, t2.c2 FROM <b>t1 JOIN t2</b>
|
|
<b>USING (id, type_flag)</b>
|
|
WHERE t1.c1 > 100;</codeblock>
|
|
|
|
<p>
|
|
The <codeph>ON</codeph> clause is a general way to compare columns across the two tables, even if the column
|
|
names are different. The <codeph>USING</codeph> clause is a shorthand notation for specifying the join
|
|
columns, when the column names are the same in both tables. You can code equivalent <codeph>WHERE</codeph>
|
|
clauses that compare the columns, instead of <codeph>ON</codeph> or <codeph>USING</codeph> clauses, but that
|
|
practice is not recommended because mixing the join comparisons with other filtering clauses is typically
|
|
less readable and harder to maintain.
|
|
</p>
|
|
|
|
<p>
|
|
Queries with a comma-separated list of tables and subqueries are known as SQL-89 style joins. In these
|
|
queries, the equality comparisons between columns of the joined tables go in the <codeph>WHERE</codeph>
|
|
clause alongside other kinds of comparisons. This syntax is easy to learn, but it is also easy to
|
|
accidentally remove a <codeph>WHERE</codeph> clause needed for the join to work correctly.
|
|
</p>
|
|
|
|
<codeblock>SELECT t1.c1, t2.c2 FROM <b>t1, t2</b>
|
|
WHERE
|
|
<b>t1.id = t2.id AND t1.type_flag = t2.type_flag</b>
|
|
AND t1.c1 > 100;</codeblock>
|
|
|
|
<p>
|
|
<b>Self-joins:</b>
|
|
</p>
|
|
|
|
<p>
|
|
Impala can do self-joins, for example to join on two different columns in the same table to represent
|
|
parent-child relationships or other tree-structured data. There is no explicit syntax for this; just use the
|
|
same table name for both the left-hand and right-hand table, and assign different table aliases to use when
|
|
referring to the fully qualified column names:
|
|
</p>
|
|
|
|
<codeblock>-- Combine fields from both parent and child rows.
|
|
SELECT lhs.id, rhs.parent, lhs.c1, rhs.c2 FROM tree_data lhs, tree_data rhs WHERE lhs.id = rhs.parent;</codeblock>
|
|
|
|
<p>
|
|
<b>Cartesian joins:</b>
|
|
</p>
|
|
|
|
<p>
|
|
To avoid producing huge result sets by mistake, Impala does not allow Cartesian joins of the form:
|
|
<codeblock>SELECT ... FROM t1 JOIN t2;
|
|
SELECT ... FROM t1, t2;</codeblock>
|
|
If you intend to join the tables based on common values, add <codeph>ON</codeph> or <codeph>WHERE</codeph>
|
|
clauses to compare columns across the tables. If you truly intend to do a Cartesian join, use the
|
|
<codeph>CROSS JOIN</codeph> keyword as the join operator. The <codeph>CROSS JOIN</codeph> form does not use
|
|
any <codeph>ON</codeph> clause, because it produces a result set with all combinations of rows from the
|
|
left-hand and right-hand tables. The result set can still be filtered by subsequent <codeph>WHERE</codeph>
|
|
clauses. For example:
|
|
</p>
|
|
|
|
<codeblock>SELECT ... FROM t1 CROSS JOIN t2;
|
|
SELECT ... FROM t1 CROSS JOIN t2 WHERE <varname>tests_on_non_join_columns</varname>;</codeblock>
|
|
|
|
<p>
|
|
<b>Inner and outer joins:</b>
|
|
</p>
|
|
|
|
<p>
|
|
An inner join is the most common and familiar type: rows in the result set contain the requested columns from
|
|
the appropriate tables, for all combinations of rows where the join columns of the tables have identical
|
|
values. If a column with the same name occurs in both tables, use a fully qualified name or a column alias to
|
|
refer to the column in the select list or other clauses. Impala performs inner joins by default for both
|
|
SQL-89 and SQL-92 join syntax:
|
|
</p>
|
|
|
|
<codeblock>-- The following 3 forms are all equivalent.
|
|
SELECT t1.id, c1, c2 FROM t1, t2 WHERE t1.id = t2.id;
|
|
SELECT t1.id, c1, c2 FROM t1 JOIN t2 ON t1.id = t2.id;
|
|
SELECT t1.id, c1, c2 FROM t1 INNER JOIN t2 ON t1.id = t2.id;</codeblock>
|
|
|
|
<p>
|
|
An outer join retrieves all rows from the left-hand table, or the right-hand table, or both; wherever there
|
|
is no matching data in the table on the other side of the join, the corresponding columns in the result set
|
|
are set to <codeph>NULL</codeph>. To perform an outer join, include the <codeph>OUTER</codeph> keyword in the
|
|
join operator, along with either <codeph>LEFT</codeph>, <codeph>RIGHT</codeph>, or <codeph>FULL</codeph>:
|
|
</p>
|
|
|
|
<codeblock>SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
|
|
SELECT * FROM t1 RIGHT OUTER JOIN t2 ON t1.id = t2.id;
|
|
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.id = t2.id;</codeblock>
|
|
|
|
<p>
|
|
For outer joins, Impala requires SQL-92 syntax; that is, the <codeph>JOIN</codeph> keyword instead of
|
|
comma-separated table names. Impala does not support vendor extensions such as <codeph>(+)</codeph> or
|
|
<codeph>*=</codeph> notation for doing outer joins with SQL-89 query syntax.
|
|
</p>
|
|
|
|
<p>
|
|
<b>Equijoins and Non-Equijoins:</b>
|
|
</p>
|
|
|
|
<p>
|
|
By default, Impala requires an equality comparison between the left-hand and right-hand tables, either
|
|
through <codeph>ON</codeph>, <codeph>USING</codeph>, or <codeph>WHERE</codeph> clauses. These types of
|
|
queries are classified broadly as equijoins. Inner, outer, full, and semi joins can all be equijoins based on
|
|
the presence of equality tests between columns in the left-hand and right-hand tables.
|
|
</p>
|
|
|
|
<p>
|
|
In Impala 1.2.2 and higher, non-equijoin queries are also possible, with comparisons such as
|
|
<codeph>!=</codeph> or <codeph><</codeph> between the join columns. These kinds of queries require care to
|
|
avoid producing huge result sets that could exceed resource limits. Once you have planned a non-equijoin
|
|
query that produces a result set of acceptable size, you can code the query using the <codeph>CROSS
|
|
JOIN</codeph> operator, and add the extra comparisons in the <codeph>WHERE</codeph> clause:
|
|
</p>
|
|
|
|
<codeblock>SELECT * FROM t1 CROSS JOIN t2 WHERE t1.total > t2.maximum_price;</codeblock>
|
|
|
|
<p rev="2.3.0">
|
|
In <keyword keyref="impala23_full"/> and higher, additional non-equijoin queries are possible due to the addition
|
|
of nested loop joins. These queries typically involve <codeph>SEMI JOIN</codeph>,
|
|
<codeph>ANTI JOIN</codeph>, or <codeph>FULL OUTER JOIN</codeph> clauses.
|
|
Impala sometimes also uses nested loop joins internally when evaluating <codeph>OUTER JOIN</codeph>
|
|
queries involving complex type columns.
|
|
Query phases involving nested loop joins do not use the spill-to-disk mechanism if they
|
|
exceed the memory limit. Impala decides internally when to use each join mechanism; you cannot
|
|
specify any query hint to choose between the nested loop join or the original hash join algorithm.
|
|
</p>
|
|
|
|
<codeblock rev="2.3.0">SELECT * FROM t1 LEFT OUTER JOIN t2 ON t1.int_col < t2.int_col;</codeblock>
|
|
|
|
<p>
|
|
<b>Semi-joins:</b>
|
|
</p>
|
|
|
|
<p>
|
|
Semi-joins are a relatively rarely used variation. With the left semi-join, only data from the left-hand
|
|
table is returned, for rows where there is matching data in the right-hand table, based on comparisons
|
|
between join columns in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses. Only one instance of each row
|
|
from the left-hand table is returned, regardless of how many matching rows exist in the right-hand table.
|
|
<ph rev="2.0.0">A right semi-join (available in Impala 2.0 and higher) reverses the comparison and returns
|
|
data from the right-hand table.</ph>
|
|
</p>
|
|
|
|
<codeblock>SELECT t1.c1, t1.c2, t1.c2 FROM t1 LEFT SEMI JOIN t2 ON t1.id = t2.id;</codeblock>
|
|
|
|
<p>
|
|
<b>Natural joins (not supported):</b>
|
|
</p>
|
|
|
|
<p>
|
|
Impala does not support the <codeph>NATURAL JOIN</codeph> operator, again to avoid inconsistent or huge
|
|
result sets. Natural joins do away with the <codeph>ON</codeph> and <codeph>USING</codeph> clauses, and
|
|
instead automatically join on all columns with the same names in the left-hand and right-hand tables. This
|
|
kind of query is not recommended for rapidly evolving data structures such as are typically used in Hadoop.
|
|
Thus, Impala does not support the <codeph>NATURAL JOIN</codeph> syntax, which can produce different query
|
|
results as columns are added to or removed from tables.
|
|
</p>
|
|
|
|
<p>
|
|
If you do have any queries that use <codeph>NATURAL JOIN</codeph>, make sure to rewrite them with explicit
|
|
<codeph>USING</codeph> clauses, because Impala could interpret the <codeph>NATURAL</codeph> keyword as a
|
|
table alias:
|
|
</p>
|
|
|
|
<codeblock>-- 'NATURAL' is interpreted as an alias for 't1' and Impala attempts an inner join,
|
|
-- resulting in an error because inner joins require explicit comparisons between columns.
|
|
SELECT t1.c1, t2.c2 FROM t1 NATURAL JOIN t2;
|
|
ERROR: NotImplementedException: Join with 't2' requires at least one conjunctive equality predicate.
|
|
To perform a Cartesian product between two tables, use a CROSS JOIN.
|
|
|
|
-- If you expect the tables to have identically named columns with matching values,
|
|
-- list the corresponding column names in a USING clause.
|
|
SELECT t1.c1, t2.c2 FROM t1 JOIN t2 USING (id, type_flag, name, address);</codeblock>
|
|
|
|
<p rev="2.0.0">
|
|
<b>Anti-joins (<keyword keyref="impala20_full"/> and higher only):</b>
|
|
</p>
|
|
|
|
<p rev="2.0.0">
|
|
Impala supports the <codeph>LEFT ANTI JOIN</codeph> and <codeph>RIGHT ANTI JOIN</codeph> clauses in
|
|
<keyword keyref="impala20"/> and higher. The <codeph>LEFT</codeph> or <codeph>RIGHT</codeph>
|
|
keyword is required for this kind of join. For <codeph>LEFT ANTI JOIN</codeph>, this clause returns those
|
|
values from the left-hand table that have no matching value in the right-hand table. <codeph>RIGHT ANTI
|
|
JOIN</codeph> reverses the comparison and returns values from the right-hand table. You can express this
|
|
negative relationship either through the <codeph>ANTI JOIN</codeph> clause or through a <codeph>NOT
|
|
EXISTS</codeph> operator with a subquery.
|
|
</p>
|
|
|
|
<!-- Restriction lifted in Impala 2.0.
|
|
<p>
|
|
Impala does not support <codeph>WHERE</codeph> clauses
|
|
such as <codeph>IN (<varname>subquery</varname>)</codeph>,
|
|
<codeph>NOT IN (<varname>subquery</varname>)</codeph>,
|
|
<codeph>EXISTS (<varname>subquery</varname>)</codeph>,
|
|
and <codeph>NOT EXISTS (<varname>subquery</varname>)</codeph>.
|
|
Therefore from a practical standpoint, you cannot
|
|
express an anti-join condition, where values from one table
|
|
are returned only if no matching values are present in another table.
|
|
</p>
|
|
-->
|
|
|
|
<p conref="../shared/impala_common.xml#common/complex_types_blurb"/>
|
|
|
|
<!-- To do: reuse some complex types examples with joins here or under Examples farther down. -->
|
|
|
|
<p rev="2.3.0">
|
|
When referring to a column with a complex type (<codeph>STRUCT</codeph>, <codeph>ARRAY</codeph>, or <codeph>MAP</codeph>)
|
|
in a query, you use join notation to <q>unpack</q> the scalar fields of the struct, the elements of the array, or
|
|
the key-value pairs of the map. (The join notation is not required for aggregation operations, such as
|
|
<codeph>COUNT()</codeph> or <codeph>SUM()</codeph> for array elements.) Because Impala recognizes which complex type elements are associated with which row
|
|
of the result set, you use the same syntax as for a cross or cartesian join, without an explicit join condition.
|
|
See <xref href="impala_complex_types.xml#complex_types"/> for details about Impala support for complex types.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
You typically use join queries in situations like these:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
When related data arrives from different sources, with each data set physically residing in a separate
|
|
table. For example, you might have address data from business records that you cross-check against phone
|
|
listings or census data.
|
|
<note>
|
|
Impala can join tables of different file formats, including Impala-managed tables and HBase tables. For
|
|
example, you might keep small dimension tables in HBase, for convenience of single-row lookups and
|
|
updates, and for the larger fact tables use Parquet or other binary file format optimized for scan
|
|
operations. Then, you can issue a join query to cross-reference the fact tables with the dimension
|
|
tables.
|
|
</note>
|
|
</li>
|
|
|
|
<li>
|
|
When data is normalized, a technique for reducing data duplication by dividing it across multiple tables.
|
|
This kind of organization is often found in data that comes from traditional relational database systems.
|
|
For example, instead of repeating some long string such as a customer name in multiple tables, each table
|
|
might contain a numeric customer ID. Queries that need to display the customer name could <q>join</q> the
|
|
table that specifies which customer ID corresponds to which name.
|
|
</li>
|
|
|
|
<li>
|
|
When certain columns are rarely needed for queries, so they are moved into separate tables to reduce
|
|
overhead for common queries. For example, a <codeph>biography</codeph> field might be rarely needed in
|
|
queries on employee data. Putting that field in a separate table reduces the amount of I/O for common
|
|
queries on employee addresses or phone numbers. Queries that do need the <codeph>biography</codeph> column
|
|
can retrieve it by performing a join with that separate table.
|
|
</li>
|
|
|
|
<li>
|
|
In <keyword keyref="impala23_full"/> or higher, when referring to complex type columns in queries.
|
|
See <xref href="impala_complex_types.xml#complex_types"/> for details.
|
|
</li>
|
|
</ul>
|
|
|
|
<p>
|
|
When comparing columns with the same names in <codeph>ON</codeph> or <codeph>WHERE</codeph> clauses, use the
|
|
fully qualified names such as <codeph><varname>db_name</varname>.<varname>table_name</varname></codeph>, or
|
|
assign table aliases, column aliases, or both to make the code more compact and understandable:
|
|
</p>
|
|
|
|
<codeblock>select t1.c1 as first_id, t2.c2 as second_id from
|
|
t1 join t2 on first_id = second_id;
|
|
|
|
select fact.custno, dimension.custno from
|
|
customer_data as fact join customer_address as dimension
|
|
using (custno)</codeblock>
|
|
|
|
<note>
|
|
<p>
|
|
Performance for join queries is a crucial aspect for Impala, because complex join queries are
|
|
resource-intensive operations. An efficient join query produces much less network traffic and CPU overhead
|
|
than an inefficient one. For best results:
|
|
</p>
|
|
<ul>
|
|
<li rev="1.2">
|
|
Make sure that both <xref href="impala_perf_stats.xml#perf_stats">table and column statistics</xref> are
|
|
available for all the tables involved in a join query, and especially for the columns referenced in any
|
|
join conditions. Impala uses the statistics to automatically deduce an efficient join order.
|
|
Use <xref href="impala_show.xml#show"><codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and
|
|
<codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph></xref> to check if statistics are
|
|
already present. Issue the <codeph>COMPUTE STATS <varname>table_name</varname></codeph> for a nonpartitioned table,
|
|
or (in Impala 2.1.0 and higher) <codeph>COMPUTE INCREMENTAL STATS <varname>table_name</varname></codeph>
|
|
for a partitioned table, to collect the initial statistics at both the table and column levels, and to keep the
|
|
statistics up to date after any substantial <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> operations.
|
|
</li>
|
|
|
|
<li rev="1.2">
|
|
If table or column statistics are not available, join the largest table first. You can check the
|
|
existence of statistics with the <codeph>SHOW TABLE STATS <varname>table_name</varname></codeph> and
|
|
<codeph>SHOW COLUMN STATS <varname>table_name</varname></codeph> statements.
|
|
</li>
|
|
|
|
<li rev="1.2.2">
|
|
If table or column statistics are not available, join subsequent tables according to which table has the
|
|
most selective filter, based on overall size and <codeph>WHERE</codeph> clauses. Joining the table with
|
|
the most selective filter results in the fewest number of rows being returned.
|
|
</li>
|
|
</ul>
|
|
<p>
|
|
For more information and examples of performance for join queries, see
|
|
<xref href="impala_perf_joins.xml#perf_joins"/>.
|
|
</p>
|
|
</note>
|
|
|
|
<p>
|
|
To control the result set from a join query, include the names of corresponding column names in both tables
|
|
in an <codeph>ON</codeph> or <codeph>USING</codeph> clause, or by coding equality comparisons for those
|
|
columns in the <codeph>WHERE</codeph> clause.
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > select c_last_name, ca_city from customer join customer_address where c_customer_sk = ca_address_sk;
|
|
+-------------+-----------------+
|
|
| c_last_name | ca_city |
|
|
+-------------+-----------------+
|
|
| Lewis | Fairfield |
|
|
| Moses | Fairview |
|
|
| Hamilton | Pleasant Valley |
|
|
| White | Oak Ridge |
|
|
| Moran | Glendale |
|
|
...
|
|
| Richards | Lakewood |
|
|
| Day | Lebanon |
|
|
| Painter | Oak Hill |
|
|
| Bentley | Greenfield |
|
|
| Jones | Stringtown |
|
|
+-------------+------------------+
|
|
Returned 50000 row(s) in 9.82s</codeblock>
|
|
|
|
<p>
|
|
One potential downside of joins is the possibility of excess resource usage in poorly constructed queries.
|
|
Impala imposes restrictions on join queries to guard against such issues. To minimize the chance of runaway
|
|
queries on large data sets, Impala requires every join query to contain at least one equality predicate
|
|
between the columns of the various tables. For example, if <codeph>T1</codeph> contains 1000 rows and
|
|
<codeph>T2</codeph> contains 1,000,000 rows, a query <codeph>SELECT <varname>columns</varname> FROM t1 JOIN
|
|
t2</codeph> could return up to 1 billion rows (1000 * 1,000,000); Impala requires that the query include a
|
|
clause such as <codeph>ON t1.c1 = t2.c2</codeph> or <codeph>WHERE t1.c1 = t2.c2</codeph>.
|
|
</p>
|
|
|
|
<p>
|
|
Because even with equality clauses, the result set can still be large, as we saw in the previous example, you
|
|
might use a <codeph>LIMIT</codeph> clause to return a subset of the results:
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > select c_last_name, ca_city from customer, customer_address where c_customer_sk = ca_address_sk limit 10;
|
|
+-------------+-----------------+
|
|
| c_last_name | ca_city |
|
|
+-------------+-----------------+
|
|
| Lewis | Fairfield |
|
|
| Moses | Fairview |
|
|
| Hamilton | Pleasant Valley |
|
|
| White | Oak Ridge |
|
|
| Moran | Glendale |
|
|
| Sharp | Lakeview |
|
|
| Wiles | Farmington |
|
|
| Shipman | Union |
|
|
| Gilbert | New Hope |
|
|
| Brunson | Martinsville |
|
|
+-------------+-----------------+
|
|
Returned 10 row(s) in 0.63s</codeblock>
|
|
|
|
<p>
|
|
Or you might use additional comparison operators or aggregation functions to condense a large result set into
|
|
a smaller set of values:
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > -- Find the names of customers who live in one particular town.
|
|
[localhost:21000] > select distinct c_last_name from customer, customer_address where
|
|
c_customer_sk = ca_address_sk
|
|
and ca_city = "Green Acres";
|
|
+---------------+
|
|
| c_last_name |
|
|
+---------------+
|
|
| Hensley |
|
|
| Pearson |
|
|
| Mayer |
|
|
| Montgomery |
|
|
| Ricks |
|
|
...
|
|
| Barrett |
|
|
| Price |
|
|
| Hill |
|
|
| Hansen |
|
|
| Meeks |
|
|
+---------------+
|
|
Returned 332 row(s) in 0.97s
|
|
|
|
[localhost:21000] > -- See how many different customers in this town have names starting with "A".
|
|
[localhost:21000] > select count(distinct c_last_name) from customer, customer_address where
|
|
c_customer_sk = ca_address_sk
|
|
and ca_city = "Green Acres"
|
|
and substr(c_last_name,1,1) = "A";
|
|
+-----------------------------+
|
|
| count(distinct c_last_name) |
|
|
+-----------------------------+
|
|
| 12 |
|
|
+-----------------------------+
|
|
Returned 1 row(s) in 1.00s</codeblock>
|
|
|
|
<p>
|
|
Because a join query can involve reading large amounts of data from disk, sending large amounts of data
|
|
across the network, and loading large amounts of data into memory to do the comparisons and filtering, you
|
|
might do benchmarking, performance analysis, and query tuning to find the most efficient join queries for
|
|
your data set, hardware capacity, network configuration, and cluster workload.
|
|
</p>
|
|
|
|
<p>
|
|
The two categories of joins in Impala are known as <b>partitioned joins</b> and <b>broadcast joins</b>. If
|
|
inaccurate table or column statistics, or some quirk of the data distribution, causes Impala to choose the
|
|
wrong mechanism for a particular join, consider using query hints as a temporary workaround. For details, see
|
|
<xref href="impala_hints.xml#hints"/>.
|
|
</p>
|
|
|
|
<p rev="2.5.0">
|
|
<b>Handling NULLs in Join Columns:</b>
|
|
</p>
|
|
|
|
<p rev="2.5.0">
|
|
By default, join key columns do not match if either one contains a <codeph>NULL</codeph> value.
|
|
To treat such columns as equal if both contain <codeph>NULL</codeph>, you can use an expression
|
|
such as <codeph>A = B OR (A IS NULL AND B IS NULL)</codeph>.
|
|
In <keyword keyref="impala25_full"/> and higher, the <codeph><=></codeph> operator (shorthand for
|
|
<codeph>IS NOT DISTINCT FROM</codeph>) performs the same comparison in a concise and efficient form.
|
|
The <codeph><=></codeph> operator is more efficient in for comparing join keys in a <codeph>NULL</codeph>-safe
|
|
manner, because the operator can use a hash join while the <codeph>OR</codeph> expression cannot.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following examples refer to these simple tables containing small sets of integers:
|
|
<codeblock>[localhost:21000] > create table t1 (x int);
|
|
[localhost:21000] > insert into t1 values (1), (2), (3), (4), (5), (6);
|
|
|
|
[localhost:21000] > create table t2 (y int);
|
|
[localhost:21000] > insert into t2 values (2), (4), (6);
|
|
|
|
[localhost:21000] > create table t3 (z int);
|
|
[localhost:21000] > insert into t3 values (1), (3), (5);
|
|
</codeblock>
|
|
</p>
|
|
|
|
<!-- To do: fill in examples for other join types. -->
|
|
|
|
<p>
|
|
The following example demonstrates an anti-join, returning the values from <codeph>T1</codeph> that do not
|
|
exist in <codeph>T2</codeph> (in this case, the odd numbers 1, 3, and 5):
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > select x from t1 left anti join t2 on (t1.x = t2.y);
|
|
+---+
|
|
| x |
|
|
+---+
|
|
| 1 |
|
|
| 3 |
|
|
| 5 |
|
|
+---+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
See these tutorials for examples of different kinds of joins:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<xref href="impala_tutorial.xml#tut_cross_join"/>
|
|
</li>
|
|
</ul>
|
|
</conbody>
|
|
</concept>
|