3 Commits

Author SHA1 Message Date
stiga-huang
331ff4647d IMPALA-11137: Enable proleptic Gregorian Calendar for Hive
Since HIVE-22589, Hive still uses Julian Calendar for writing dates
before 1582-10-15, whereas Impala uses proleptic Gregorian Calendar.
This affects the results Impala gets when querying tables written by
Hive. Currently, the Avro and ORC formats of date_tbl are affected by
this issue.

This patch enables proleptic Gregorian Calendar for Hive by default.
It also reverts the two commits of IMPALA-9555, which modified the tests
to accommodate the inconsistent results.
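
To illustrate the mismatch described above, here is a minimal Python
sketch. It assumes the date is stored as a count of days since
1970-01-01 (as the Avro and ORC date types do) and uses the standard
Julian Day Number formula for the Julian calendar. The function names
are hypothetical and this is not the Hive or Impala implementation.

```python
from datetime import date

# Julian Day Number of 1970-01-01 in the Gregorian calendar.
UNIX_EPOCH_JDN = 2440588


def julian_calendar_to_jdn(year, month, day):
    """Julian Day Number of a date expressed in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def epoch_days_written_with_julian(year, month, day):
    """Days since epoch a writer would store if it interprets the date
    label in the Julian calendar (hypothetical helper)."""
    return julian_calendar_to_jdn(year, month, day) - UNIX_EPOCH_JDN


def read_as_proleptic_gregorian(epoch_days):
    """Decode days since epoch with the proleptic Gregorian calendar,
    which is what Python's datetime.date uses."""
    return date.fromordinal(date(1970, 1, 1).toordinal() + epoch_days)


stored = epoch_days_written_with_julian(1500, 1, 1)
print(read_as_proleptic_gregorian(stored))  # 1500-01-10
```

A date written as 1500-01-01 under the Julian calendar is read back
as 1500-01-10 by a proleptic Gregorian reader, which is the kind of
inconsistency the reverted test changes had worked around.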

Tests:
 - Ran CORE tests

Change-Id: I6be9c9720dd352d6821cdaa6c64d35ba20473bc0
Reviewed-on: http://gerrit.cloudera.org:8080/18262
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-23 20:09:10 +00:00
stiga-huang
c127b6b1a7 IMPALA-10873: Push down EQUALS, IS NULL and IN-list predicate to ORC reader
This patch pushes down more kinds of predicates into the ORC reader,
namely EQUALS, IN-list, and IS-NULL predicates, which brings further
improvements:
 - EQUALS and IN-list predicates can be evaluated inside the ORC reader
   with bloom filters in the ORC files.
 - Compared to the Parquet scanner, which converts an IN-list predicate
   into two binary predicates (i.e. LE and GE), the ORC reader can
   leverage IN-list predicates to skip ORC RowGroups. E.g. a RowGroup
   with int column 'x' in range [1, 100] will be skipped if we push down
   the predicate "x in (0, 101)".
 - IS-NULL predicates (including IS-NOT-NULL) can also be used in the
   ORC reader to skip RowGroups.
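
The RowGroup example above can be sketched in Python as follows. This
is only an illustration of the two pruning strategies against min/max
statistics. The function names are hypothetical and the logic is not
taken from the ORC library or from Impala.

```python
def can_skip_with_range(rg_min, rg_max, in_list):
    """IN-list approximated by two binary predicates:
    x >= min(in_list) AND x <= max(in_list)."""
    lo, hi = min(in_list), max(in_list)
    # Skip only if the [lo, hi] range does not overlap the RowGroup range.
    return hi < rg_min or lo > rg_max


def can_skip_with_in_list(rg_min, rg_max, in_list):
    """IN-list checked value by value against the RowGroup range."""
    # Skip if every listed value falls outside [rg_min, rg_max].
    return all(v < rg_min or v > rg_max for v in in_list)


# RowGroup with int column 'x' in range [1, 100], predicate x in (0, 101).
print(can_skip_with_range(1, 100, [0, 101]))    # False
print(can_skip_with_in_list(1, 100, [0, 101]))  # True
```

The range form [0, 101] overlaps [1, 100], so it cannot prune the
RowGroup, while the value-by-value check can.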

Implementation:
FE will collect these kinds of predicates into 'min_max_conjuncts' of
THdfsScanNode. To better reflect the meaning, 'min_max_conjuncts' is
renamed to 'stats_conjuncts'. Same for other related variable names.

Parquet scanner will only pick binary min-max conjuncts (i.e. LT, GT,
LE, and GE) to keep the existing behavior. ORC scanner will build
SearchArgument based on all these conjuncts.
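
As a rough sketch of the split described above, the following Python
models how each scanner might select from the shared conjunct list.
The Op enum, the tuple representation, and the function names are
hypothetical and only stand in for the real FE/BE structures.

```python
from enum import Enum


class Op(Enum):
    LT = "<"
    GT = ">"
    LE = "<="
    GE = ">="
    EQ = "="
    IN = "IN"
    IS_NULL = "IS NULL"


# Operators the Parquet scanner keeps, per the commit message.
BINARY_MIN_MAX_OPS = {Op.LT, Op.GT, Op.LE, Op.GE}


def conjuncts_for_parquet(stats_conjuncts):
    """Keep only binary min-max conjuncts (LT, GT, LE, GE)."""
    return [c for c in stats_conjuncts if c[1] in BINARY_MIN_MAX_OPS]


def conjuncts_for_orc(stats_conjuncts):
    """Use every collected conjunct to build the search argument."""
    return list(stats_conjuncts)


# Each conjunct is modelled as a (column, operator, operand) tuple.
stats_conjuncts = [
    ("x", Op.GE, 1),
    ("x", Op.IN, (0, 101)),
    ("y", Op.IS_NULL, None),
    ("z", Op.EQ, 5),
]
print(len(conjuncts_for_parquet(stats_conjuncts)))  # 1
print(len(conjuncts_for_orc(stats_conjuncts)))      # 4
```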

Tests:
 * Add a new test table 'alltypessmall_bool_sorted' which has files
   containing sorted bool values.
 * Add test in orc-stats.test

Change-Id: Iaa89f080fe2e87d94fc8ea7f1be83e087fa34225
Reviewed-on: http://gerrit.cloudera.org:8080/17815
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
2021-10-21 15:45:39 +00:00
norbert.luksa
35b21083b1 IMPALA-6505: Min-Max predicate push down in ORC scanner
In the planning phase, the planner collects and generates min-max
predicates that can be evaluated on Parquet file statistics. We can
easily extend this to ORC tables.

This commit implements min/max predicate pushdown for the ORC scanner,
leveraging the external ORC library's search arguments. We build
the search arguments when we open the scanner, since we do not need to
modify them later.

Also added a new query option orc_read_statistics, similar to
parquet_read_statistics. If the option is set to true (the default),
predicate pushdown will take effect; otherwise it will be skipped. The
predicates will be evaluated at ORC row group level, i.e. by default for
every 10,000 rows.
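
A minimal Python sketch of row-group-level evaluation, assuming
per-group min/max statistics over groups of 10,000 rows. The
read_statistics flag only mimics the effect of orc_read_statistics,
and the helper names are hypothetical, not Impala or ORC APIs.

```python
ROW_GROUP_SIZE = 10_000


def row_group_stats(values, group_size=ROW_GROUP_SIZE):
    """Compute (min, max) for each consecutive group of rows."""
    stats = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        stats.append((min(chunk), max(chunk)))
    return stats


def groups_to_read(stats, lower, upper, read_statistics=True):
    """Indices of row groups that may hold a value in [lower, upper].
    With read_statistics=False nothing is pruned."""
    if not read_statistics:
        return list(range(len(stats)))
    return [i for i, (mn, mx) in enumerate(stats)
            if mx >= lower and mn <= upper]


# 50,000 sorted keys form 5 row groups; a predicate like key = 1
# only needs the first one.
keys = list(range(1, 50_001))
stats = row_group_stats(keys)
print(groups_to_read(stats, 1, 1))                              # [0]
print(len(groups_to_read(stats, 1, 1, read_statistics=False)))  # 5
```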

Limitations:
 - Min-max predicates on CHAR/VARCHAR types are not pushed down due to
   inconsistent behaviors on padding/truncating between Hive and Impala.
   (IMPALA-10882)
 - Min-max predicates on TIMESTAMP are not pushed down (IMPALA-10915).
 - Min-max predicates having different arg types are not pushed down
   (IMPALA-10916).
 - Min-max predicates with non-literal const exprs are not pushed down
   since SearchArgument interfaces only accept literals. This only
   happens when expr rewrites are disabled thus constant folding is
   disabled.

Tests:
 - Add e2e tests similar to test_parquet_stats to verify that
   predicates are pushed down.
 - Run CORE tests
 - Run the TPCH benchmark: there is neither improvement nor regression.
   On the other hand, certain selective queries gained significant
   speed-up, e.g. select count(*) from lineitem where l_orderkey = 1.

Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Reviewed-on: http://gerrit.cloudera.org:8080/15403
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-17 00:44:15 +00:00