mirror of
https://github.com/apache/impala.git
synced 2025-12-25 02:03:09 -05:00
Change-Id: I2304e422636d03d7d1109d3f0a551d6f04ce1a63 Reviewed-on: http://gerrit.cloudera.org:8080/12392 Reviewed-by: Alex Rodoni <arodoni@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
531 lines
20 KiB
XML
531 lines
20 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!--
|
|
Licensed to the Apache Software Foundation (ASF) under one
|
|
or more contributor license agreements. See the NOTICE file
|
|
distributed with this work for additional information
|
|
regarding copyright ownership. The ASF licenses this file
|
|
to you under the Apache License, Version 2.0 (the
|
|
"License"); you may not use this file except in compliance
|
|
with the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing,
|
|
software distributed under the License is distributed on an
|
|
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
|
|
KIND, either express or implied. See the License for the
|
|
specific language governing permissions and limitations
|
|
under the License.
|
|
-->
|
|
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
|
|
<concept id="tablesample" rev="IMPALA-5309">
|
|
|
|
<title>TABLESAMPLE Clause</title>
|
|
|
|
<prolog>
|
|
<metadata>
|
|
<data name="Category" value="Impala"/>
|
|
<data name="Category" value="SQL"/>
|
|
<data name="Category" value="Querying"/>
|
|
<data name="Category" value="Developers"/>
|
|
<data name="Category" value="Data Analysts"/>
|
|
</metadata>
|
|
</prolog>
|
|
|
|
<conbody>
|
|
|
|
<p>
|
|
Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need to explore the
|
|
data distribution within the table, the table is very large, and it is impractical or
|
|
unnecessary to process all the data from the table or selected partitions.
|
|
</p>
|
|
|
|
<p>
|
|
The clause makes the query process a randomized set of data files from the table, so that
|
|
the total volume of data is greater than or equal to the specified percentage of data
|
|
bytes within that table. (Or the data bytes within the set of partitions that remain after
|
|
partition pruning is performed.)
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/syntax_blurb"/>
|
|
|
|
<codeblock>
|
|
<ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
|
|
</codeblock>
|
|
|
|
<p>
|
|
The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table
|
|
alias.
|
|
</p>
|
|
|
|
<p>
|
|
The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently, Impala only
|
|
supports a single sampling method named <codeph>SYSTEM</codeph>.
|
|
</p>
|
|
|
|
<p>
|
|
The <varname>percentage</varname> argument is an integer literal from 0 to 100. A
|
|
percentage of 0 produces an empty result set for a particular table reference, while a
|
|
percentage of 100 uses the entire contents. Because the sampling works by selecting a
|
|
random set of data files, the proportion of sampled data from the table may be greater
|
|
than the specified percentage, based on the number and sizes of the underlying data files.
|
|
See the usage notes for details.
|
|
</p>
|
|
|
|
<p>
|
|
The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary positive
|
|
integer seed value that ensures that when the query is run again, the sampling selects the
|
|
same set of data files each time. <codeph>REPEATABLE</codeph> does not have a default
|
|
value. If you omit the <codeph>REPEATABLE</codeph> keyword, the random seed is derived
|
|
from the current time.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/added_in_290"/>
|
|
|
|
<p rev="2.12.0 IMPALA-5310">
|
|
See <keyword keyref="compute_stats"/> for the <codeph>TABLESAMPLE</codeph> clause used in
|
|
the <codeph>COMPUTE STATS</codeph> statement.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>
|
|
|
|
<p>
|
|
You might use this clause with aggregation queries, such as finding the approximate
|
|
average, minimum, or maximum where exact precision is not required. You can use these
|
|
findings to plan the most effective strategy for constructing queries against the full
|
|
table or designing a partitioning strategy for the data.
|
|
</p>
|
|
|
|
<p>
|
|
Some other database systems have a <codeph>TABLESAMPLE</codeph> clause. The Impala syntax
|
|
for this clause is modeled on the syntax for popular relational databases, not the Hive
|
|
<codeph>TABLESAMPLE</codeph> clause. For example, there is no <codeph>BUCKETS</codeph>
|
|
keyword as in HiveQL.
|
|
</p>
|
|
|
|
<p>
|
|
The precision of the <varname>percentage</varname> threshold depends on the number and
|
|
sizes of the underlying data files. Impala brings in additional data files, one at a time,
|
|
until the number of bytes exceeds the specified percentage based on the total number of
|
|
bytes for the entire set of table data. The precision of the percentage threshold is
|
|
higher when the table contains many data files with consistent sizes. See the code
|
|
listings later in this section for examples.
|
|
</p>
|
|
|
|
<p>
|
|
When you estimate characteristics of the data distribution based on sampling a percentage
|
|
of the table data, be aware that the data might be unevenly distributed between different
|
|
files. Do not assume that the percentage figure reflects the percentage of rows in the
|
|
table. For example, one file might contain all blank values for a <codeph>STRING</codeph>
|
|
column, while another file contains long strings in that column; therefore, one file could
|
|
contain many more rows than another. Likewise, a table created with the <codeph>SORT
|
|
BY</codeph> clause might contain narrow ranges of values for the sort columns, making it
|
|
impractical to extrapolate the number of distinct values for those columns based on
|
|
sampling only some of the data files.
|
|
</p>
|
|
|
|
<p>
|
|
Because a sample of the table data might not contain all values for a particular column,
|
|
if the <codeph>TABLESAMPLE</codeph> is used in a join query, the key relationships between
|
|
the tables might produce incomplete result sets compared to joins using all the table
|
|
data. For example, if you join 50% of table A with 50% of table B, some values in the join
|
|
columns might not match between the two tables, even though overall there is a 1:1
|
|
relationship between the tables.
|
|
</p>
|
|
|
|
<p>
|
|
The <codeph>REPEATABLE</codeph> keyword makes identical queries use a consistent set of
|
|
data files when the query is repeated. You specify an arbitrary integer key that acts as a
|
|
seed value when Impala randomly selects the set of data files to use in the query. This
|
|
technique lets you verify correctness, examine performance, and so on for queries using
|
|
the <codeph>TABLESAMPLE</codeph> clause without the sampled data being different each
|
|
time. The repeatable aspect is reset (that is, the set of selected data files may change)
|
|
any time the contents of the table change. The statements or operations that can make
|
|
sampling results non-repeatable are:
|
|
</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<codeph>INSERT</codeph>.
|
|
</li>
|
|
|
|
<li>
|
|
<codeph>TRUNCATE TABLE</codeph>.
|
|
</li>
|
|
|
|
<li>
|
|
<codeph>LOAD DATA</codeph>.
|
|
</li>
|
|
|
|
<li>
|
|
<codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> after files are added
|
|
or removed by a non-Impala mechanism.
|
|
</li>
|
|
|
|
<li></li>
|
|
</ul>
|
|
|
|
<p>
|
|
This clause is similar in some ways to the <codeph>LIMIT</codeph> clause, because both
|
|
serve to limit the size of the intermediate data and final result set. <codeph>LIMIT
|
|
0</codeph> is more efficient than <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying
|
|
that a query can execute without producing any results. <codeph>TABLESAMPLE
|
|
SYSTEM(<varname>n</varname>)</codeph> often makes query processing more efficient than
|
|
using a <codeph>LIMIT</codeph> clause by itself, because all phases of query execution use
|
|
less data overall. If the intent is to retrieve some representative values from the table
|
|
in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>, <codeph>ORDER
|
|
BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/partitioning_blurb"/>
|
|
|
|
<p>
|
|
When you query a partitioned table, any partition pruning happens before Impala selects
|
|
the data files to sample. For example, in a table partitioned by year, a query with
|
|
<codeph>WHERE year = 2017</codeph> and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause
|
|
would sample data files representing at least 10% of the bytes present in the 2017
|
|
partition.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/s3_blurb"/>
|
|
|
|
<p>
|
|
This clause applies to S3 tables the same way as tables with data files stored on HDFS.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/adls_blurb"/>
|
|
|
|
<p>
|
|
This clause applies to ADLS tables the same way as tables with data files stored on HDFS.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/kudu_blurb"/>
|
|
|
|
<p>
|
|
This clause does not apply to Kudu tables.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/hbase_blurb"/>
|
|
|
|
<p>
|
|
This clause does not apply to HBase tables.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/performance_blurb"/>
|
|
|
|
<p>
|
|
From a performance perspective, the <codeph>TABLESAMPLE</codeph> clause is especially
|
|
valuable for exploratory queries on text, Avro, or other file formats other than Parquet.
|
|
Text-based or row-oriented file formats must process substantial amounts of redundant data
|
|
for queries that derive aggregate results such as <codeph>MAX()</codeph>,
|
|
<codeph>MIN()</codeph>, or <codeph>AVG()</codeph> for a single column. Therefore, you
|
|
might use <codeph>TABLESAMPLE</codeph> early in the ETL pipeline, when data is still in
|
|
raw text format and has not been converted to Parquet or moved into a partitioned table.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/restrictions_blurb"/>
|
|
|
|
<p>
|
|
This clause applies only to tables that use a storage layer with underlying raw data
|
|
files, such as HDFS, Amazon S3, or Microsoft ADLS.
|
|
</p>
|
|
|
|
<p>
|
|
This clause does not apply to table references that represent views. A query that applies
|
|
the <codeph>TABLESAMPLE</codeph> clause to a view or a subquery fails with a semantic
|
|
error.
|
|
</p>
|
|
|
|
<p>
|
|
Because the sampling works at the level of entire data files, it is by nature
|
|
coarse-grained. It is possible to specify a small sample percentage but still process a
|
|
substantial portion of the table data if the table contains relatively few data files, if
|
|
each data file is very large, or if the data files vary substantially in size. Be sure
|
|
that you understand the data distribution and physical file layout so that you can verify
|
|
if the results are suitable for extrapolation. For example, if the table contains only a
|
|
single data file, the <q>sample</q> will consist of all the table data regardless of the
|
|
percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when
|
|
you specify a sampling percentage of 50 you would either process slightly more than 50% of
|
|
the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which
|
|
data files were selected for sampling.
|
|
</p>
|
|
|
|
<p>
|
|
If data files are added by a non-Impala mechanism, and the table metadata is not updated
|
|
by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the
|
|
<codeph>TABLESAMPLE</codeph> clause does not consider those new files when computing the
|
|
number of bytes in the table or selecting which files to sample.
|
|
</p>
|
|
|
|
<p>
|
|
If data files are removed by a non-Impala mechanism, and the table metadata is not updated
|
|
by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the query
|
|
fails if the <codeph>TABLESAMPLE</codeph> clause attempts to reference any of the missing
|
|
files.
|
|
</p>
|
|
|
|
<p conref="../shared/impala_common.xml#common/example_blurb"/>
|
|
|
|
<p>
|
|
The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause. These examples
|
|
intentionally use very small data sets to illustrate how the number of files, size of each
|
|
file, and overall size of data in the table interact with the percentage specified in the
|
|
clause.
|
|
</p>
|
|
|
|
<p>
|
|
These examples use an unpartitioned table, containing several files of roughly the same
|
|
size:
|
|
</p>
|
|
|
|
<codeblock>create table sample_demo (x int, s string);
|
|
|
|
insert into sample_demo values (1, 'one');
|
|
insert into sample_demo values (2, 'two');
|
|
insert into sample_demo values (3, 'three');
|
|
insert into sample_demo values (4, 'four');
|
|
insert into sample_demo values (5, 'five');
|
|
|
|
show files in sample_demo;
|
|
+---------------------+------+-----------+
|
|
| Path | Size | Partition |
|
|
+---------------------+------+-----------+
|
|
| 991213608_data.0. | 7B | |
|
|
| 982196806_data.0. | 6B | |
|
|
| _2122096884_data.0. | 8B | |
|
|
| _586325431_data.0. | 6B | |
|
|
| 1894746258_data.0. | 7B | |
|
|
+---------------------+------+-----------+
|
|
|
|
show table stats sample_demo;
|
|
+-------+--------+------+--------+-------------------------+
|
|
| #Rows | #Files | Size | Format | Location |
|
|
+-------+--------+------+--------+-------------------------+
|
|
| -1 | 5 | 34B | TEXT | /tsample.db/sample_demo |
|
|
+-------+--------+------+--------+-------------------------+</codeblock>
|
|
|
|
<p>
|
|
A query that samples 50% of the table must process at least 17 bytes of data. Based on the
|
|
sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1
|
|
or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the
|
|
query adds more files until it passes the 50% threshold:
|
|
</p>
|
|
|
|
<codeblock>select distinct x from sample_demo tablesample system(50);
|
|
+---+
|
|
| x |
|
|
+---+
|
|
| 4 |
|
|
| 1 |
|
|
| 5 |
|
|
+---+
|
|
|
|
select distinct x from sample_demo tablesample system(50);
|
|
+---+
|
|
| x |
|
|
+---+
|
|
| 5 |
|
|
| 4 |
|
|
| 2 |
|
|
+---+
|
|
|
|
select distinct x from sample_demo tablesample system(50);
|
|
+---+
|
|
| x |
|
|
+---+
|
|
| 5 |
|
|
| 3 |
|
|
| 2 |
|
|
+---+</codeblock>
|
|
|
|
<p>
|
|
To help run reproducible experiments, the <codeph>REPEATABLE</codeph> clause causes Impala
|
|
to choose the same set of files for each query. Although the data set being considered is
|
|
deterministic, the order of results varies (in the absence of an <codeph>ORDER BY</codeph>
|
|
clause) because of the way distributed queries are processed:
|
|
</p>
|
|
|
|
<codeblock>select distinct x from sample_demo
|
|
tablesample system(50) repeatable (12345);
|
|
+---+
|
|
| x |
|
|
+---+
|
|
| 3 |
|
|
| 2 |
|
|
| 1 |
|
|
+---+
|
|
|
|
select distinct x from sample_demo
|
|
tablesample system(50) repeatable (12345);
|
|
+---+
|
|
| x |
|
|
+---+
|
|
| 2 |
|
|
| 1 |
|
|
| 3 |
|
|
+---+
|
|
</codeblock>
|
|
|
|
<p>
|
|
The following examples show how uneven data distribution affects which data is sampled.
|
|
Adding another data file containing a long string value changes the threshold for 50% of
|
|
the total data in the table:
|
|
</p>
|
|
|
|
<codeblock>insert into sample_demo values
|
|
(1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');
|
|
|
|
show files in sample_demo;
|
|
+---------------------+------+-----------+
|
|
| Path | Size | Partition |
|
|
+---------------------+------+-----------+
|
|
| 991213608_data.0. | 7B | |
|
|
| 982196806_data.0. | 6B | |
|
|
| _253317650_data.0. | 196B | |
|
|
| _2122096884_data.0. | 8B | |
|
|
| _586325431_data.0. | 6B | |
|
|
| 1894746258_data.0. | 7B | |
|
|
+---------------------+------+-----------+
|
|
|
|
show table stats sample_demo;
|
|
+-------+--------+------+--------+-------------------------+
|
|
| #Rows | #Files | Size | Format | Location |
|
|
+-------+--------+------+--------+-------------------------+
|
|
| -1 | 6 | 230B | TEXT | /tsample.db/sample_demo |
|
|
+-------+--------+------+--------+-------------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
Even though the queries do not refer to the <codeph>S</codeph> column containing the long
|
|
value, all the sampling queries include the data file containing the column value
|
|
<codeph>X=1000</codeph>, because the query cannot reach the 50% threshold (115 bytes)
|
|
without including that file. The large file might be considered first, in which case it is
|
|
the only file processed by the query. Or an arbitrary set of other files might be
|
|
considered first.
|
|
</p>
|
|
|
|
<codeblock>select distinct x from sample_demo tablesample system(50);
|
|
+------+
|
|
| x |
|
|
+------+
|
|
| 1000 |
|
|
| 3 |
|
|
| 1 |
|
|
+------+
|
|
|
|
select distinct x from sample_demo tablesample system(50);
|
|
+------+
|
|
| x |
|
|
+------+
|
|
| 1000 |
|
|
+------+
|
|
|
|
select distinct x from sample_demo tablesample system(50);
|
|
+------+
|
|
| x |
|
|
+------+
|
|
| 1000 |
|
|
| 4 |
|
|
| 2 |
|
|
| 1 |
|
|
+------+</codeblock>
|
|
|
|
<p>
|
|
The following examples demonstrate how the <codeph>TABLESAMPLE</codeph> clause interacts
|
|
with other table aspects, such as partitioning and file format:
|
|
</p>
|
|
|
|
<codeblock>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;
|
|
|
|
insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
|
|
insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
|
|
insert into sample_demo_partitions partition (n = 3) select * from sample_demo;
|
|
|
|
show files in sample_demo_partitions;
|
|
+--------------------------------+--------+-----------+
|
|
| Path | Size | Partition |
|
|
+--------------------------------+--------+-----------+
|
|
| 000000_364262785_data.0.parq | 1.24KB | n=1 |
|
|
| 000001_973526736_data.0.parq | 566B | n=1 |
|
|
| 0000000_1300598134_data.0.parq | 1.24KB | n=2 |
|
|
| 0000001_689099063_data.0.parq | 568B | n=2 |
|
|
| 0000000_1861371709_data.0.parq | 1.24KB | n=3 |
|
|
| 0000001_1065507912_data.0.parq | 566B | n=3 |
|
|
+--------------------------------+--------+-----------+
|
|
|
|
show table stats tablesample_demo_partitioned;
|
|
+-------+-------+--------+--------+---------+----------------------------------------------+
|
|
| n | #Rows | #Files | Size | Format | Location |
|
|
+-------+-------+--------+--------+---------+----------------------------------------------+
|
|
| 1 | -1 | 2 | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=1 |
|
|
| 2 | -1 | 2 | 1.80KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=2 |
|
|
| 3 | -1 | 2 | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=3 |
|
|
| Total | -1 | 6 | 5.39KB | | |
|
|
+-------+-------+--------+--------+---------+----------------------------------------------+
|
|
</codeblock>
|
|
|
|
<p>
|
|
If the query does not involve any partition pruning, the sampling applies to the data
|
|
volume of the entire table:
|
|
</p>
|
|
|
|
<codeblock>-- 18 rows total.
|
|
select count(*) from sample_demo_partitions;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 18 |
|
|
+----------+
|
|
|
|
-- The number of rows per data file is not
|
|
-- perfectly balanced, therefore the count
|
|
-- is different depending on which set of files
|
|
-- is considered.
|
|
select count(*) from sample_demo_partitions
|
|
tablesample system(75);
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 14 |
|
|
+----------+
|
|
|
|
select count(*) from sample_demo_partitions
|
|
tablesample system(75);
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 16 |
|
|
+----------+</codeblock>
|
|
|
|
<p>
|
|
If the query only processes certain partitions, the query computes the sampling threshold
|
|
based on the data size and set of files only from the relevant partitions:
|
|
</p>
|
|
|
|
<codeblock>select count(*) from sample_demo_partitions
|
|
tablesample system(50) where n = 1;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 6 |
|
|
+----------+
|
|
|
|
select count(*) from sample_demo_partitions
|
|
tablesample system(50) where n = 1;
|
|
+----------+
|
|
| count(*) |
|
|
+----------+
|
|
| 2 |
|
|
+----------+
|
|
</codeblock>
|
|
|
|
<p conref="../shared/impala_common.xml#common/related_info"/>
|
|
|
|
<p>
|
|
<xref href="impala_select.xml#select"/>
|
|
</p>
|
|
|
|
</conbody>
|
|
|
|
</concept>
|