impala/docs/topics/impala_tablesample.xml

<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="tablesample" rev="IMPALA-5309">

  <title>TABLESAMPLE Clause</title>

  <prolog>
    <metadata>
      <data name="Category" value="Impala"/>
      <data name="Category" value="SQL"/>
      <data name="Category" value="Querying"/>
      <data name="Category" value="Developers"/>
      <data name="Category" value="Data Analysts"/>
    </metadata>
  </prolog>

  <conbody>

    <p>
      Specify the <codeph>TABLESAMPLE</codeph> clause in cases where you need to explore the
      data distribution within the table, the table is very large, and it is impractical or
      unnecessary to process all the data from the table or selected partitions.
    </p>

    <p>
      The clause makes the query process a randomized set of data files from the table, so that
      the total volume of data is greater than or equal to the specified percentage of data
      bytes within that table. (Or the data bytes within the set of partitions that remain after
      partition pruning is performed.)
    </p>

    <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

<codeblock>
<ph rev="IMPALA-5309">TABLESAMPLE SYSTEM(<varname>percentage</varname>) [REPEATABLE(<varname>seed</varname>)]</ph>
</codeblock>

    <p>
      The <codeph>TABLESAMPLE</codeph> clause comes immediately after a table name or table
      alias.
    </p>

    <p>
      The <codeph>SYSTEM</codeph> keyword represents the sampling method. Currently, Impala only
      supports a single sampling method named <codeph>SYSTEM</codeph>.
    </p>

    <p>
      The <varname>percentage</varname> argument is an integer literal from 0 to 100. A
      percentage of 0 produces an empty result set for a particular table reference, while a
      percentage of 100 uses the entire contents. Because the sampling works by selecting a
      random set of data files, the proportion of sampled data from the table may be greater
      than the specified percentage, based on the number and sizes of the underlying data files.
      See the usage notes for details.
    </p>

    <p>
      The optional <codeph>REPEATABLE</codeph> keyword lets you specify an arbitrary positive
      integer seed value that ensures that when the query is run again, the sampling selects the
      same set of data files each time. <codeph>REPEATABLE</codeph> does not have a default
      value. If you omit the <codeph>REPEATABLE</codeph> keyword, the random seed is derived
      from the current time.
    </p>

    <p conref="../shared/impala_common.xml#common/added_in_290"/>

    <p rev="2.12.0 IMPALA-5310">
      See <keyword keyref="compute_stats"/> for the <codeph>TABLESAMPLE</codeph> clause used in
      the <codeph>COMPUTE STATS</codeph> statement.
    </p>

    <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

    <p>
      You might use this clause with aggregation queries, such as finding the approximate
      average, minimum, or maximum where exact precision is not required. You can use these
      findings to plan the most effective strategy for constructing queries against the full
      table or designing a partitioning strategy for the data.
    </p>

    <p>
      Some other database systems have a <codeph>TABLESAMPLE</codeph> clause. The Impala syntax
      for this clause is modeled on the syntax for popular relational databases, not the Hive
      <codeph>TABLESAMPLE</codeph> clause. For example, there is no <codeph>BUCKETS</codeph>
      keyword as in HiveQL.
    </p>

    <p>
      The precision of the <varname>percentage</varname> threshold depends on the number and
      sizes of the underlying data files. Impala brings in additional data files, one at a time,
      until the number of bytes exceeds the specified percentage based on the total number of
      bytes for the entire set of table data. The precision of the percentage threshold is
      higher when the table contains many data files with consistent sizes. See the code
      listings later in this section for examples.
    </p>

    <p>
      When you estimate characteristics of the data distribution based on sampling a percentage
      of the table data, be aware that the data might be unevenly distributed between different
      files. Do not assume that the percentage figure reflects the percentage of rows in the
      table. For example, one file might contain all blank values for a <codeph>STRING</codeph>
      column, while another file contains long strings in that column; therefore, one file could
      contain many more rows than another. Likewise, a table created with the <codeph>SORT
      BY</codeph> clause might contain narrow ranges of values for the sort columns, making it
      impractical to extrapolate the number of distinct values for those columns based on
      sampling only some of the data files.
    </p>

    <p>
      Because a sample of the table data might not contain all values for a particular column,
      if the <codeph>TABLESAMPLE</codeph> is used in a join query, the key relationships between
      the tables might produce incomplete result sets compared to joins using all the table
      data. For example, if you join 50% of table A with 50% of table B, some values in the join
      columns might not match between the two tables, even though overall there is a 1:1
      relationship between the tables.
    </p>

    <p>
      The <codeph>REPEATABLE</codeph> keyword makes identical queries use a consistent set of
      data files when the query is repeated. You specify an arbitrary integer key that acts as a
      seed value when Impala randomly selects the set of data files to use in the query. This
      technique lets you verify correctness, examine performance, and so on for queries using
      the <codeph>TABLESAMPLE</codeph> clause without the sampled data being different each
      time. The repeatable aspect is reset (that is, the set of selected data files may change)
      any time the contents of the table change. The statements or operations that can make
      sampling results non-repeatable are:
    </p>

    <ul>
      <li>
        <codeph>INSERT</codeph>.
      </li>

      <li>
        <codeph>TRUNCATE TABLE</codeph>.
      </li>

      <li>
        <codeph>LOAD DATA</codeph>.
      </li>

      <li>
        <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> after files are added
        or removed by a non-Impala mechanism.
      </li>

      <li></li>
    </ul>

    <p>
      This clause is similar in some ways to the <codeph>LIMIT</codeph> clause, because both
      serve to limit the size of the intermediate data and final result set. <codeph>LIMIT
      0</codeph> is more efficient than <codeph>TABLESAMPLE SYSTEM(0)</codeph> for verifying
      that a query can execute without producing any results. <codeph>TABLESAMPLE
      SYSTEM(<varname>n</varname>)</codeph> often makes query processing more efficient than
      using a <codeph>LIMIT</codeph> clause by itself, because all phases of query execution use
      less data overall. If the intent is to retrieve some representative values from the table
      in an efficient way, you might combine <codeph>TABLESAMPLE</codeph>, <codeph>ORDER
      BY</codeph>, and <codeph>LIMIT</codeph> clauses within a single query.
    </p>

    <p conref="../shared/impala_common.xml#common/partitioning_blurb"/>

    <p>
      When you query a partitioned table, any partition pruning happens before Impala selects
      the data files to sample. For example, in a table partitioned by year, a query with
      <codeph>WHERE year = 2017</codeph> and a <codeph>TABLESAMPLE SYSTEM(10)</codeph> clause
      would sample data files representing at least 10% of the bytes present in the 2017
      partition.
    </p>

    <p conref="../shared/impala_common.xml#common/s3_blurb"/>

    <p>
      This clause applies to S3 tables the same way as tables with data files stored on HDFS.
    </p>

    <p conref="../shared/impala_common.xml#common/adls_blurb"/>

    <p>
      This clause applies to ADLS tables the same way as tables with data files stored on HDFS.
    </p>

    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>

    <p>
      This clause does not apply to Kudu tables.
    </p>

    <p conref="../shared/impala_common.xml#common/hbase_blurb"/>

    <p>
      This clause does not apply to HBase tables.
    </p>

    <p conref="../shared/impala_common.xml#common/performance_blurb"/>

    <p>
      From a performance perspective, the <codeph>TABLESAMPLE</codeph> clause is especially
      valuable for exploratory queries on text, Avro, or other file formats other than Parquet.
      Text-based or row-oriented file formats must process substantial amounts of redundant data
      for queries that derive aggregate results such as <codeph>MAX()</codeph>,
      <codeph>MIN()</codeph>, or <codeph>AVG()</codeph> for a single column. Therefore, you
      might use <codeph>TABLESAMPLE</codeph> early in the ETL pipeline, when data is still in
      raw text format and has not been converted to Parquet or moved into a partitioned table.
    </p>

    <p conref="../shared/impala_common.xml#common/restrictions_blurb"/>

    <p>
      This clause applies only to tables that use a storage layer with underlying raw data
      files, such as HDFS, Amazon S3, or Microsoft ADLS.
    </p>

    <p>
      This clause does not apply to table references that represent views. A query that applies
      the <codeph>TABLESAMPLE</codeph> clause to a view or a subquery fails with a semantic
      error.
    </p>

    <p>
      Because the sampling works at the level of entire data files, it is by nature
      coarse-grained. It is possible to specify a small sample percentage but still process a
      substantial portion of the table data if the table contains relatively few data files, if
      each data file is very large, or if the data files vary substantially in size. Be sure
      that you understand the data distribution and physical file layout so that you can verify
      if the results are suitable for extrapolation. For example, if the table contains only a
      single data file, the <q>sample</q> will consist of all the table data regardless of the
      percentage you specify. If the table contains data files of 1 GiB, 1 GiB, and 1 KiB, when
      you specify a sampling percentage of 50 you would either process slightly more than 50% of
      the table (1 GiB + 1 KiB) or almost the entire table (1 GiB + 1 GiB), depending on which
      data files were selected for sampling.
    </p>

    <p>
      If data files are added by a non-Impala mechanism, and the table metadata is not updated
      by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the
      <codeph>TABLESAMPLE</codeph> clause does not consider those new files when computing the
      number of bytes in the table or selecting which files to sample.
    </p>

    <p>
      If data files are removed by a non-Impala mechanism, and the table metadata is not updated
      by a <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> statement, the query
      fails if the <codeph>TABLESAMPLE</codeph> clause attempts to reference any of the missing
      files.
    </p>

    <p conref="../shared/impala_common.xml#common/example_blurb"/>

    <p>
      The following examples demonstrate the <codeph>TABLESAMPLE</codeph> clause. These examples
      intentionally use very small data sets to illustrate how the number of files, size of each
      file, and overall size of data in the table interact with the percentage specified in the
      clause.
    </p>

    <p>
      These examples use an unpartitioned table, containing several files of roughly the same
      size:
    </p>

<codeblock>create table sample_demo (x int, s string);

insert into sample_demo values (1, 'one');
insert into sample_demo values (2, 'two');
insert into sample_demo values (3, 'three');
insert into sample_demo values (4, 'four');
insert into sample_demo values (5, 'five');

show files in sample_demo;
+---------------------+------+-----------+
| Path                | Size | Partition |
+---------------------+------+-----------+
| 991213608_data.0.   | 7B   |           |
| 982196806_data.0.   | 6B   |           |
| _2122096884_data.0. | 8B   |           |
| _586325431_data.0.  | 6B   |           |
| 1894746258_data.0.  | 7B   |           |
+---------------------+------+-----------+

show table stats sample_demo;
+-------+--------+------+--------+-------------------------+
| #Rows | #Files | Size | Format | Location                |
+-------+--------+------+--------+-------------------------+
| -1    | 5      | 34B  | TEXT   | /tsample.db/sample_demo |
+-------+--------+------+--------+-------------------------+</codeblock>

    <p>
      A query that samples 50% of the table must process at least 17 bytes of data. Based on the
      sizes of the data files, we can predict that each such query uses 3 arbitrary files. Any 1
      or 2 files are not enough to reach 50% of the total data in the table (34 bytes), so the
      query adds more files until it passes the 50% threshold:
    </p>

<codeblock>select distinct x from sample_demo tablesample system(50);
+---+
| x |
+---+
| 4 |
| 1 |
| 5 |
+---+

select distinct x from sample_demo tablesample system(50);
+---+
| x |
+---+
| 5 |
| 4 |
| 2 |
+---+

select distinct x from sample_demo tablesample system(50);
+---+
| x |
+---+
| 5 |
| 3 |
| 2 |
+---+</codeblock>

    <p>
      To help run reproducible experiments, the <codeph>REPEATABLE</codeph> clause causes Impala
      to choose the same set of files for each query. Although the data set being considered is
      deterministic, the order of results varies (in the absence of an <codeph>ORDER BY</codeph>
      clause) because of the way distributed queries are processed:
    </p>

<codeblock>select distinct x from sample_demo
  tablesample system(50) repeatable (12345);
+---+
| x |
+---+
| 3 |
| 2 |
| 1 |
+---+

select distinct x from sample_demo
  tablesample system(50) repeatable (12345);
+---+
| x |
+---+
| 2 |
| 1 |
| 3 |
+---+
</codeblock>

    <p>
      The following examples show how uneven data distribution affects which data is sampled.
      Adding another data file containing a long string value changes the threshold for 50% of
      the total data in the table:
    </p>

<codeblock>insert into sample_demo values
(1000, 'Boyhood is the longest time in life for a boy. The last term of the school-year is made of decades, not of weeks, and living through them is like waiting for the millennium. Booth Tarkington');

show files in sample_demo;
+---------------------+------+-----------+
| Path                | Size | Partition |
+---------------------+------+-----------+
| 991213608_data.0.   | 7B   |           |
| 982196806_data.0.   | 6B   |           |
| _253317650_data.0.  | 196B |           |
| _2122096884_data.0. | 8B   |           |
| _586325431_data.0.  | 6B   |           |
| 1894746258_data.0.  | 7B   |           |
+---------------------+------+-----------+

show table stats sample_demo;
+-------+--------+------+--------+-------------------------+
| #Rows | #Files | Size | Format | Location                |
+-------+--------+------+--------+-------------------------+
| -1    | 6      | 230B | TEXT   | /tsample.db/sample_demo |
+-------+--------+------+--------+-------------------------+
</codeblock>

    <p>
      Even though the queries do not refer to the <codeph>S</codeph> column containing the long
      value, all the sampling queries include the data file containing the column value
      <codeph>X=1000</codeph>, because the query cannot reach the 50% threshold (115 bytes)
      without including that file. The large file might be considered first, in which case it is
      the only file processed by the query. Or an arbitrary set of other files might be
      considered first.
    </p>

<codeblock>select distinct x from sample_demo tablesample system(50);
+------+
| x    |
+------+
| 1000 |
| 3    |
| 1    |
+------+

select distinct x from sample_demo tablesample system(50);
+------+
| x    |
+------+
| 1000 |
+------+

select distinct x from sample_demo tablesample system(50);
+------+
| x    |
+------+
| 1000 |
| 4    |
| 2    |
| 1    |
+------+</codeblock>

    <p>
      The following examples demonstrate how the <codeph>TABLESAMPLE</codeph> clause interacts
      with other table aspects, such as partitioning and file format:
    </p>

<codeblock>create table sample_demo_partitions (x int, s string) partitioned by (n int) stored as parquet;

insert into sample_demo_partitions partition (n = 1) select * from sample_demo;
insert into sample_demo_partitions partition (n = 2) select * from sample_demo;
insert into sample_demo_partitions partition (n = 3) select * from sample_demo;

show files in sample_demo_partitions;
+--------------------------------+--------+-----------+
| Path                           | Size   | Partition |
+--------------------------------+--------+-----------+
| 000000_364262785_data.0.parq   | 1.24KB | n=1       |
| 000001_973526736_data.0.parq   | 566B   | n=1       |
| 0000000_1300598134_data.0.parq | 1.24KB | n=2       |
| 0000001_689099063_data.0.parq  | 568B   | n=2       |
| 0000000_1861371709_data.0.parq | 1.24KB | n=3       |
| 0000001_1065507912_data.0.parq | 566B   | n=3       |
+--------------------------------+--------+-----------+

show table stats tablesample_demo_partitioned;
+-------+-------+--------+--------+---------+----------------------------------------------+
| n     | #Rows | #Files | Size   | Format  | Location                                     |
+-------+-------+--------+--------+---------+----------------------------------------------+
| 1     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=1 |
| 2     | -1    | 2      | 1.80KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=2 |
| 3     | -1    | 2      | 1.79KB | PARQUET | /tsample.db/tablesample_demo_partitioned/n=3 |
| Total | -1    | 6      | 5.39KB |         |                                              |
+-------+-------+--------+--------+---------+----------------------------------------------+
</codeblock>

    <p>
      If the query does not involve any partition pruning, the sampling applies to the data
      volume of the entire table:
    </p>

<codeblock>-- 18 rows total.
select count(*) from sample_demo_partitions;
+----------+
| count(*) |
+----------+
| 18       |
+----------+

-- The number of rows per data file is not
-- perfectly balanced, therefore the count
-- is different depending on which set of files
-- is considered.
select count(*) from sample_demo_partitions
  tablesample system(75);
+----------+
| count(*) |
+----------+
| 14       |
+----------+

select count(*) from sample_demo_partitions
  tablesample system(75);
+----------+
| count(*) |
+----------+
| 16       |
+----------+</codeblock>

    <p>
      If the query only processes certain partitions, the query computes the sampling threshold
      based on the data size and set of files only from the relevant partitions:
    </p>

<codeblock>select count(*) from sample_demo_partitions
  tablesample system(50) where n = 1;
+----------+
| count(*) |
+----------+
| 6        |
+----------+

select count(*) from sample_demo_partitions
  tablesample system(50) where n = 1;
+----------+
| count(*) |
+----------+
| 2        |
+----------+
</codeblock>

    <p conref="../shared/impala_common.xml#common/related_info"/>

    <p>
      <xref href="impala_select.xml#select"/>
    </p>

  </conbody>

</concept>