impala

jprdonnelly/impala

mirror of https://github.com/apache/impala.git synced 2026-02-02 06:00:36 -05:00

Commit Graph

Author

SHA1

Message

Date

stiga-huang

331ff4647d

IMPALA-11137: Enable proleptic Gregorian Calendar for Hive

Even after HIVE-22589, Hive still uses the Julian calendar when writing
dates before 1582-10-15, whereas Impala uses the proleptic Gregorian
calendar. This affects the results Impala gets when querying tables
written by Hive. Currently, the Avro and ORC formats of date_tbl suffer
from this issue.

This patch enables the proleptic Gregorian calendar for Hive by default.
It also reverts the two commits of IMPALA-9555 that modified the tests
to accept the inconsistent results.
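
To make the calendar mismatch concrete, here is a minimal sketch that converts a date label in the Julian calendar to the label the same day carries in the proleptic Gregorian calendar. It is only an illustration of the offset, not code from Hive or Impala; the function name is hypothetical, and it relies on Python's `datetime.date`, which uses the proleptic Gregorian calendar.

```python
from datetime import date

# Julian Day Number of proleptic Gregorian 0001-01-01 is 1721426,
# and date.toordinal() returns 1 for that day.
JDN_TO_ORDINAL_OFFSET = 1721425


def julian_to_proleptic_gregorian(year, month, day):
    """Map a Julian-calendar date label to the proleptic Gregorian label
    of the same physical day (hypothetical helper, for illustration)."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    # Standard Julian-calendar-to-Julian-Day-Number formula.
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return date.fromordinal(jdn - JDN_TO_ORDINAL_OFFSET)


# The last Julian day before the 1582 switch-over and its Gregorian label.
print(julian_to_proleptic_gregorian(1582, 10, 4))   # 1582-10-14
# Further back the two calendars drift by fewer days.
print(julian_to_proleptic_gregorian(1400, 1, 1))    # 1400-01-09
```

A writer and a reader that disagree on the calendar will therefore attach different labels to the same stored day for dates before 1582-10-15.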

Tests:
 - Ran CORE tests

Change-Id: I6be9c9720dd352d6821cdaa6c64d35ba20473bc0
Reviewed-on: http://gerrit.cloudera.org:8080/18262
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2022-02-23 20:09:10 +00:00

stiga-huang

c127b6b1a7

IMPALA-10873: Push down EQUALS, IS NULL and IN-list predicate to ORC reader

This patch pushes down more kinds of predicates into the ORC reader:
EQUALS, IN-list, and IS-NULL predicates. This brings further
improvements:
 - EQUALS and IN-list predicates can be evaluated inside the ORC reader
   using the bloom filters in the ORC files.
 - Unlike the Parquet scanner, which converts an IN-list predicate
   into two binary predicates (i.e. LE and GE), the ORC reader can
   use IN-list predicates directly to skip ORC RowGroups. E.g. a
   RowGroup with int column 'x' in range [1, 100] will be skipped if we
   push down the predicate "x in (0, 101)".
 - IS-NULL predicates (including IS-NOT-NULL) can also be used in the
   ORC reader to skip RowGroups.
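
The RowGroup example above can be sketched as follows. This is a simplified model of statistics-based skipping, not the ORC library's SearchArgument implementation; `RowGroupStats` and the `may_match_*` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Per-RowGroup column statistics (hypothetical, simplified)."""
    min_value: int
    max_value: int
    has_null: bool


def may_match_range(stats, low, high):
    # IN-list rewritten as "x >= low AND x <= high": only a range overlap test.
    return low <= stats.max_value and high >= stats.min_value


def may_match_in_list(stats, values):
    # IN-list kept as a list: each value is tested against [min, max].
    return any(stats.min_value <= v <= stats.max_value for v in values)


def may_match_is_null(stats):
    # IS NULL can only match if the RowGroup contains at least one NULL.
    return stats.has_null


stats = RowGroupStats(min_value=1, max_value=100, has_null=False)
in_list = (0, 101)
print(may_match_range(stats, min(in_list), max(in_list)))  # True: must be read
print(may_match_in_list(stats, in_list))                   # False: can be skipped
print(may_match_is_null(stats))                            # False: can be skipped
```

The range form cannot rule the RowGroup out because [0, 101] overlaps [1, 100], while the list form sees that neither 0 nor 101 falls inside it.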

Implementation:
The FE collects these kinds of predicates into 'min_max_conjuncts' of
THdfsScanNode. To better reflect the meaning, 'min_max_conjuncts' is
renamed to 'stats_conjuncts'. Related variable names are renamed
accordingly.

The Parquet scanner only picks binary min-max conjuncts (i.e. LT, GT,
LE, and GE) to keep the existing behavior. The ORC scanner builds a
SearchArgument based on all these conjuncts.
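
The split between the two scanners can be sketched as a simple filter over the shared conjunct list. The `Conjunct` type, the operator names, and both functions below are assumptions made for illustration; they do not mirror the actual Impala classes.

```python
from dataclasses import dataclass

# Operators the Parquet scanner keeps, per the description above.
BINARY_MIN_MAX_OPS = {"LT", "GT", "LE", "GE"}


@dataclass(frozen=True)
class Conjunct:
    """Hypothetical stand-in for one entry of 'stats_conjuncts'."""
    column: str
    op: str            # e.g. "LT", "GE", "EQ", "IN", "IS_NULL"
    operands: tuple = ()


def parquet_conjuncts(stats_conjuncts):
    # Existing behavior: only binary min-max comparisons are used.
    return [c for c in stats_conjuncts if c.op in BINARY_MIN_MAX_OPS]


def orc_conjuncts(stats_conjuncts):
    # The ORC scanner builds its search argument from every conjunct.
    return list(stats_conjuncts)


stats_conjuncts = [
    Conjunct("x", "LE", (100,)),
    Conjunct("x", "IN", (0, 101)),
    Conjunct("y", "IS_NULL"),
]
print([c.op for c in parquet_conjuncts(stats_conjuncts)])  # ['LE']
print([c.op for c in orc_conjuncts(stats_conjuncts)])      # ['LE', 'IN', 'IS_NULL']
```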

Tests:
 * Add a new test table 'alltypessmall_bool_sorted' which has files
   containing sorted bool values.
 * Add test in orc-stats.test

Change-Id: Iaa89f080fe2e87d94fc8ea7f1be83e087fa34225
Reviewed-on: http://gerrit.cloudera.org:8080/17815
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>

2021-10-21 15:45:39 +00:00

norbert.luksa

35b21083b1

IMPALA-6505: Min-Max predicate push down in ORC scanner

In the planning phase, the planner collects and generates min-max
predicates that can be evaluated on parquet file statistics. We can
easily extend this to ORC tables.

This commit implements min/max predicate pushdown for the ORC scanner
by leveraging the external ORC library's search arguments. We build
the search arguments when we open the scanner, as we do not need to
modify them later.

Also added a new query option orc_read_statistics, similar to
parquet_read_statistics. If the option is set to true (the default),
predicate pushdown takes effect; otherwise it is skipped. The
predicates are evaluated at the ORC row group level, i.e. by default
for every 10,000 rows.
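
As a rough sketch of what row-group-level evaluation means, the snippet below scans an in-memory list in groups of 10,000 values and skips a group when its minimum already rules out a `value < upper_bound` predicate. It is a toy model: a real ORC file stores the statistics in its row index instead of computing them during the scan, and the `read_statistics` parameter only mimics the effect of the query option.

```python
ROW_GROUP_SIZE = 10_000  # default row group size mentioned above


def count_less_than(values, upper_bound, read_statistics=True):
    """Count values < upper_bound; return (count, row_groups_read)."""
    count = 0
    groups_read = 0
    for start in range(0, len(values), ROW_GROUP_SIZE):
        group = values[start:start + ROW_GROUP_SIZE]
        # With statistics enabled, a group whose minimum is not below the
        # bound cannot contain a match and is skipped without being read.
        if read_statistics and min(group) >= upper_bound:
            continue
        groups_read += 1
        count += sum(1 for v in group if v < upper_bound)
    return count, groups_read


values = list(range(35_000))  # sorted data, so statistics are selective
print(count_less_than(values, 5_000))                         # (5000, 1)
print(count_less_than(values, 5_000, read_statistics=False))  # (5000, 4)
```

The result is identical either way; only the number of row groups that have to be read changes.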

Limitations:
 - Min-max predicates on CHAR/VARCHAR types are not pushed down due to
   inconsistent behaviors on padding/truncating between Hive and Impala.
   (IMPALA-10882)
 - Min-max predicates on TIMESTAMP are not pushed down (IMPALA-10915).
 - Min-max predicates having different arg types are not pushed down
   (IMPALA-10916).
 - Min-max predicates with non-literal const exprs are not pushed down
   since SearchArgument interfaces only accept literals. This only
   happens when expr rewrites are disabled, which also disables
   constant folding.
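
The limitations above amount to an eligibility check on each predicate. The sketch below restates them as one function; the name, the string type labels, and the parameters are hypothetical and do not correspond to the actual scanner code.

```python
# Column types excluded by IMPALA-10882 (CHAR/VARCHAR) and IMPALA-10915.
UNSUPPORTED_TYPES = {"CHAR", "VARCHAR", "TIMESTAMP"}


def can_push_down(column_type, operand_type, operand_is_literal):
    """Return True if a min-max predicate is eligible for pushdown
    under the limitations listed above (hypothetical helper)."""
    if column_type in UNSUPPORTED_TYPES:
        return False
    if column_type != operand_type:
        # Differing argument types are not pushed down (IMPALA-10916).
        return False
    # SearchArgument interfaces only accept literals.
    return operand_is_literal


print(can_push_down("INT", "INT", True))              # True
print(can_push_down("TIMESTAMP", "TIMESTAMP", True))  # False
print(can_push_down("INT", "BIGINT", True))           # False
print(can_push_down("INT", "INT", False))             # False
```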

Tests:
 - Add e2e tests similar to test_parquet_stats to verify that
   predicates are pushed down.
 - Run CORE tests
 - Run the TPCH benchmark: there is no overall improvement or
   regression. However, certain selective queries gained a significant
   speed-up, e.g. select count(*) from lineitem where l_orderkey = 1.

Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Reviewed-on: http://gerrit.cloudera.org:8080/15403
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2021-09-17 00:44:15 +00:00

3 Commits