impala/testdata/workloads/functional-query/queries/QueryTest/compute-stats-column-minmax.test
Qifan Chen 1231208da7 IMPALA-10494: Making use of the min/max column stats to improve min/max filters
This patch adds the ability to compute the minimum and maximum values
of integer, float/double, date, and decimal columns in Parquet tables,
and uses the new stats to discard min/max filters, in both hash join
builders and Parquet scanners, when their coverage is too close to the
actual range defined by the column min and max.
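
The discard rule described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not Impala's actual
implementation: the names overlap_ratio and should_discard and the 0.9
threshold are hypothetical and chosen only for this example.

```python
def overlap_ratio(filter_min, filter_max, col_min, col_max):
    """Fraction of the column range [col_min, col_max] covered by the filter.

    Hypothetical helper: the name and exact formula are assumptions.
    """
    if col_max < col_min:
        raise ValueError("col_min must not exceed col_max")
    # Clamp the filter range to the column range before measuring it.
    lo = max(filter_min, col_min)
    hi = min(filter_max, col_max)
    if hi < lo:
        return 0.0  # the filter range is disjoint from the column range
    if col_max == col_min:
        return 1.0  # a single-valued column is fully covered
    return (hi - lo) / (col_max - col_min)


def should_discard(filter_min, filter_max, col_min, col_max, threshold=0.9):
    """Discard a filter that covers (almost) the whole column range.

    The 0.9 threshold is an assumed value used only for illustration.
    """
    return overlap_ratio(filter_min, filter_max, col_min, col_max) >= threshold
```

For a column whose stats are min=1 and max=10, a filter covering [1, 10]
rejects nothing and would be discarded under this sketch, while a filter
covering [2, 3] would be kept.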

The computation and display of the new column min/max stats are
controlled by two new Boolean query options (both default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage example:

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the
coordinator.

The min/max filters, in both C++ class and protobuf form, are augmented
to handle the always-true state better. Once always-true is set, the
actual min and max values in the filter are no longer populated.
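
The always-true behavior can be sketched as follows. This is an
illustrative model only: the class MinMaxFilterSketch, its fields, and
its methods are hypothetical and do not mirror the actual C++ class or
protobuf message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinMaxFilterSketch:
    """Hypothetical stand-in for a min/max filter with an always-true flag."""
    lo: Optional[int] = None
    hi: Optional[int] = None
    always_true: bool = False

    def set_always_true(self):
        # Once always-true is set, min and max are no longer populated.
        self.always_true = True
        self.lo = None
        self.hi = None

    def insert(self, value):
        if self.always_true:
            return  # an always-true filter ignores further values
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def or_with(self, other):
        """Merge another filter into this one (union of the two ranges)."""
        if self.always_true:
            return
        if other.always_true:
            self.set_always_true()
            return
        if other.lo is None:
            return  # the other filter is empty
        self.insert(other.lo)
        self.insert(other.hi)

    def to_message(self):
        """Serialize; min and max are omitted when always-true is set."""
        if self.always_true:
            return {"always_true": True}
        return {"always_true": False, "min": self.lo, "max": self.hi}
```

In this sketch, merging any filter with an always-true filter yields an
always-true filter, and the serialized form then carries only the flag.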

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate that Or(),
   ToProtobuf() and the constructor handle the always-true flag
   correctly;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful;
 - Ran core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-02 21:50:17 +00:00


====
---- QUERY
##################################
# Create a new alltypestiny table.
##################################
drop table if exists alltypestiny;
CREATE TABLE alltypestiny
STORED AS PARQUET
as select * from functional_parquet.alltypestiny;
====
---- QUERY
# Compute stats including the min/max for integers and floats.
set compute_column_minmax_stats = true;
compute stats alltypestiny;
====
---- QUERY
# show column stats including the min/max.
set show_column_minmax_stats = true;
show column stats alltypestiny;
---- LABELS
COLUMN, TYPE, #DISTINCT VALUES, #NULLS, MAX SIZE, AVG SIZE, #TRUES, #FALSES, MIN, MAX
---- RESULTS
'id','INT',8,0,4,4.0,-1,-1,'0','7'
'bool_col','BOOLEAN',2,0,1,1.0,4,4,'-1','-1'
'tinyint_col','TINYINT',2,0,1,1.0,-1,-1,'0','1'
'smallint_col','SMALLINT',2,0,2,2.0,-1,-1,'0','1'
'int_col','INT',2,0,4,4.0,-1,-1,'0','1'
'bigint_col','BIGINT',2,0,8,8.0,-1,-1,'0','10'
'float_col','FLOAT',2,0,4,4.0,-1,-1,'0.0','1.100000023841858'
'double_col','DOUBLE',2,0,8,8.0,-1,-1,'0.0','10.1'
'date_string_col','STRING',4,0,8,8.0,-1,-1,'-1','-1'
'string_col','STRING',2,0,1,1.0,-1,-1,'-1','-1'
'timestamp_col','TIMESTAMP',8,0,16,16.0,-1,-1,'-1','-1'
'year','INT',1,0,4,4.0,-1,-1,'2009','2009'
'month','INT',4,0,4,4.0,-1,-1,'1','4'
---- TYPES
STRING, STRING, BIGINT, BIGINT, BIGINT, DOUBLE, BIGINT, BIGINT, STRING, STRING
====
---- QUERY
##############################
# Create a new date_tbl table.
##############################
drop table if exists date_tbl;
CREATE TABLE date_tbl
STORED AS PARQUET
as select * from functional_parquet.date_tbl;
====
---- QUERY
# Compute stats including the min/max for date types.
set compute_column_minmax_stats = true;
compute stats date_tbl;
====
---- QUERY
# show column stats including the min/max.
set show_column_minmax_stats = true;
show column stats date_tbl;
---- LABELS
COLUMN, TYPE, #DISTINCT VALUES, #NULLS, MAX SIZE, AVG SIZE, #TRUES, #FALSES, MIN, MAX
---- RESULTS
'id_col','INT',22,0,4,4,-1,-1,'0','31'
'date_col','DATE',16,2,4,4,-1,-1,'0001-01-01','9999-12-31'
'date_part','DATE',4,0,4,4,-1,-1,'0001-01-01','9999-12-31'
---- TYPES
STRING, STRING, BIGINT, BIGINT, BIGINT, DOUBLE, BIGINT, BIGINT, STRING, STRING
====
---- QUERY
#################################
# Create a new decimal_tbl table.
#################################
drop table if exists decimal_tbl;
CREATE TABLE decimal_tbl
STORED AS PARQUET
as select * from functional_parquet.decimal_tbl;
====
---- QUERY
# Compute stats including the min/max for decimal types.
set compute_column_minmax_stats = true;
compute stats decimal_tbl;
====
---- QUERY
# show column stats including the min/max.
set show_column_minmax_stats = true;
show column stats decimal_tbl;
---- LABELS
COLUMN, TYPE, #DISTINCT VALUES, #NULLS, MAX SIZE, AVG SIZE, #TRUES, #FALSES, MIN, MAX
---- RESULTS
'd1','DECIMAL(9,0)',4,0,4,4,-1,-1,'1234','132842'
'd2','DECIMAL(10,0)',3,0,8,8,-1,-1,'111','2222'
'd3','DECIMAL(20,10)',5,0,16,16,-1,-1,'1.2345678900','12345.6789000000'
'd4','DECIMAL(38,38)',1,0,16,16,-1,-1,'0.12345678900000000000000000000000000000','0.12345678900000000000000000000000000000'
'd5','DECIMAL(10,5)',5,0,8,8,-1,-1,'0.10000','12345.78900'
'd6','DECIMAL(9,0)',1,0,4,4,-1,-1,'1','1'
---- TYPES
STRING, STRING, BIGINT, BIGINT, BIGINT, DOUBLE, BIGINT, BIGINT, STRING, STRING
====