Since HIVE-22589, Hive still uses the Julian calendar for writing dates
before 1582-10-15, whereas Impala uses the proleptic Gregorian calendar.
This affects the results Impala gets when querying tables written by
Hive. Currently, the Avro and ORC formats of date_tbl suffer from this
issue.
This patch enables the proleptic Gregorian calendar for Hive by default.
It also reverts the two commits of IMPALA-9555, which modified the tests
to accommodate the inconsistent results.
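To illustrate the calendar mismatch described above, here is a minimal
Python sketch. It is not Hive or Impala code. It uses the standard Julian
Day Number formula for the Julian calendar, and the helper names
(julian_calendar_to_jdn, julian_to_proleptic_gregorian,
reader_view_of_hybrid_date) are hypothetical. It assumes the writer stores a
day count computed with the Julian calendar for dates before 1582-10-15 and
the reader interprets that count as proleptic Gregorian.

```python
from datetime import date

# Python's date ordinal 1 is proleptic Gregorian 0001-01-01, whose Julian Day
# Number (JDN) is 1721426, so ordinal = JDN - 1721425.
JDN_TO_ORDINAL_OFFSET = 1721425


def julian_calendar_to_jdn(year, month, day):
    """Julian Day Number of a date expressed in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def julian_to_proleptic_gregorian(year, month, day):
    """Same physical day, relabeled in the proleptic Gregorian calendar."""
    jdn = julian_calendar_to_jdn(year, month, day)
    return date.fromordinal(jdn - JDN_TO_ORDINAL_OFFSET)


def reader_view_of_hybrid_date(year, month, day):
    """Hypothetical helper: what a proleptic Gregorian reader would see for
    a date written with the hybrid Julian/Gregorian calendar (assumption:
    the cutover is 1582-10-15, as stated in the commit message)."""
    if (year, month, day) < (1582, 10, 15):
        return julian_to_proleptic_gregorian(year, month, day)
    return date(year, month, day)


if __name__ == "__main__":
    for ymd in [(1400, 1, 1), (1582, 10, 4), (2000, 1, 1)]:
        print(ymd, "->", reader_view_of_hybrid_date(*ymd).isoformat())
```

Under these assumptions, dates on or after the cutover are read back
unchanged, while earlier dates shift by the Julian/Gregorian offset of that
century.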
Tests:
- Ran CORE tests
Change-Id: I6be9c9720dd352d6821cdaa6c64d35ba20473bc0
Reviewed-on: http://gerrit.cloudera.org:8080/18262
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch pushes down more kinds of predicates into the ORC reader,
including EQUALS, IN-list, and IS-NULL predicates, which brings the
following improvements:
- EQUALS and IN-list predicates can be evaluated inside the ORC reader
with bloom filters in the ORC files.
- Compared to the Parquet scanner, which converts an IN-list predicate
into two binary predicates (i.e. LE and GE), the ORC reader can
leverage IN-list predicates to skip ORC RowGroups. E.g. a RowGroup
with int column 'x' in range [1, 100] will be skipped if we push down
predicate "x in (0, 101)".
- IS-NULL predicates (including IS-NOT-NULL) can also be used in the
ORC reader to skip RowGroups.
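The RowGroup skipping described in the list above can be sketched in
Python as follows. This is only an illustration of the reasoning, not the
ORC library's logic; RowGroupStats and the can_skip_* functions are
hypothetical names, and the statistics are assumed to hold just a min, a
max and null information.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-RowGroup column statistics."""
    min_val: int
    max_val: int
    has_null: bool
    all_null: bool = False


def can_skip_in_list(stats, values):
    """IN-list pushed down as-is: skip if no listed value lies in [min, max]."""
    return all(v < stats.min_val or v > stats.max_val for v in values)


def can_skip_range_rewrite(stats, values):
    """IN-list rewritten as x >= min(values) AND x <= max(values):
    skip only if that whole range misses [min, max]."""
    lo, hi = min(values), max(values)
    return hi < stats.min_val or lo > stats.max_val


def can_skip_is_null(stats):
    """IS NULL: skip a RowGroup that contains no NULLs."""
    return not stats.has_null


def can_skip_is_not_null(stats):
    """IS NOT NULL: skip a RowGroup that contains only NULLs."""
    return stats.all_null


if __name__ == "__main__":
    # The example from the text: x in range [1, 100], predicate x IN (0, 101).
    rg = RowGroupStats(min_val=1, max_val=100, has_null=False)
    print("IN-list pushdown skips:", can_skip_in_list(rg, [0, 101]))
    print("LE/GE rewrite skips:  ", can_skip_range_rewrite(rg, [0, 101]))
```

With the example from the text, the IN-list check skips the RowGroup,
whereas the LE/GE rewrite cannot, because the range [0, 101] overlaps
[1, 100].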
Implementation:
FE will collect these kinds of predicates into 'min_max_conjuncts' of
THdfsScanNode. To better reflect the meaning, 'min_max_conjuncts' is
renamed to 'stats_conjuncts'. Same for other related variable names.
Parquet scanner will only pick binary min-max conjuncts (i.e. LT, GT,
LE, and GE) to keep the existing behavior. ORC scanner will build
SearchArgument based on all these conjuncts.
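The split between the two scanners can be pictured with the short Python
sketch below. It is a simplified model, not the actual FE or BE code:
conjuncts are represented as hypothetical (operator, column, operand)
tuples, and the operator names are assumptions made for the example.

```python
# Operators the Parquet scanner keeps, per the description above.
BINARY_MIN_MAX_OPS = {"LT", "GT", "LE", "GE"}


def parquet_conjuncts(stats_conjuncts):
    """Keep only binary min-max conjuncts, preserving existing behavior."""
    return [c for c in stats_conjuncts if c[0] in BINARY_MIN_MAX_OPS]


def orc_conjuncts(stats_conjuncts):
    """The ORC scanner considers all collected conjuncts."""
    return list(stats_conjuncts)


if __name__ == "__main__":
    # Hypothetical contents of 'stats_conjuncts' for one scan node.
    stats_conjuncts = [
        ("GE", "x", 10),
        ("EQ", "y", 5),
        ("IN", "z", (1, 2, 3)),
        ("IS_NULL", "w", None),
    ]
    print("Parquet:", parquet_conjuncts(stats_conjuncts))
    print("ORC:    ", orc_conjuncts(stats_conjuncts))
```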
Tests:
* Add a new test table 'alltypessmall_bool_sorted' which has files
containing sorted bool values.
* Add test in orc-stats.test
Change-Id: Iaa89f080fe2e87d94fc8ea7f1be83e087fa34225
Reviewed-on: http://gerrit.cloudera.org:8080/17815
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
In the planning phase, the planner collects and generates min-max
predicates that can be evaluated on Parquet file statistics. We can
easily extend this to ORC tables.
This commit implements min/max predicate pushdown for the ORC scanner
by leveraging the external ORC library's search arguments. We build
the search arguments when we open the scanner, since we do not need to
modify them later.
Also added a new query option, orc_read_statistics, similar to
parquet_read_statistics. If the option is set to true (the default),
predicate pushdown takes effect; otherwise it is skipped. The
predicates are evaluated at the ORC row group level, i.e. by default for
every 10,000 rows.
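As a rough illustration of row-group-level evaluation, the Python sketch
below splits a column into groups of 10,000 rows, keeps a (min, max) pair
per group, and evaluates an equality predicate against those pairs. It only
models the idea. The function names and the read_statistics flag are
hypothetical stand-ins, not the scanner's API.

```python
ROWS_PER_ROW_GROUP = 10_000  # default granularity mentioned above


def row_group_min_max(values, rows_per_group=ROWS_PER_ROW_GROUP):
    """Compute a (min, max) pair for each consecutive group of rows."""
    stats = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        stats.append((min(chunk), max(chunk)))
    return stats


def groups_to_read_for_equals(stats, literal, read_statistics=True):
    """Indexes of row groups that may contain 'literal'.
    With read_statistics=False nothing is skipped (models the option
    being turned off)."""
    if not read_statistics:
        return list(range(len(stats)))
    return [i for i, (lo, hi) in enumerate(stats) if lo <= literal <= hi]


if __name__ == "__main__":
    # Hypothetical sorted key column with 35,000 rows -> 4 row groups.
    keys = list(range(1, 35_001))
    stats = row_group_min_max(keys)
    print("row groups:", stats)
    print("read with stats:   ", groups_to_read_for_equals(stats, 1))
    print("read without stats:", groups_to_read_for_equals(stats, 1, False))
```

On sorted or clustered data, a selective predicate touches few row groups.
On unsorted data, the min/max ranges overlap and little can be skipped.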
Limitations:
- Min-max predicates on CHAR/VARCHAR types are not pushed down due to
inconsistent behaviors on padding/truncating between Hive and Impala.
(IMPALA-10882)
- Min-max predicates on TIMESTAMP are not pushed down (IMPALA-10915).
- Min-max predicates having different arg types are not pushed down
(IMPALA-10916).
- Min-max predicates with non-literal const exprs are not pushed down
since SearchArgument interfaces only accept literals. This only
happens when expr rewrites are disabled, and thus constant folding is
disabled.
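The limitations above amount to an eligibility check along the lines of
this Python sketch. It is a simplified model of the rules as listed, not
the scanner's code. The function name, the type names as strings, and the
reduction of "different arg types" to a plain equality test are assumptions
made for illustration.

```python
# Column types for which min-max predicates are not pushed down, per the
# limitations listed above.
UNSUPPORTED_TYPES = {"CHAR", "VARCHAR", "TIMESTAMP"}


def is_pushable(column_type, operand_type, operand_is_literal):
    """Hypothetical check: may a min-max predicate be pushed to ORC?"""
    if column_type in UNSUPPORTED_TYPES:
        return False  # IMPALA-10882, IMPALA-10915
    if column_type != operand_type:
        return False  # different arg types (IMPALA-10916)
    # SearchArgument interfaces only accept literals.
    return operand_is_literal


if __name__ == "__main__":
    print(is_pushable("INT", "INT", True))
    print(is_pushable("TIMESTAMP", "TIMESTAMP", True))
    print(is_pushable("INT", "BIGINT", True))
    print(is_pushable("INT", "INT", False))
```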
Tests:
- Add e2e tests similar to test_parquet_stats to verify that
predicates are pushed down.
- Run CORE tests
- Run TPCH benchmark: there is neither improvement nor regression.
However, certain selective queries gain a significant speed-up,
e.g. select count(*) from lineitem where l_orderkey = 1.
Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Reviewed-on: http://gerrit.cloudera.org:8080/15403
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>