mirror of
https://github.com/apache/impala.git
synced 2025-12-25 02:03:09 -05:00
This patch enables late materialization for collections to avoid the cost of materializing collections that will never be accessed by the query.

For a collection column, late materialization takes effect only when the collection column is not used in any predicate, including the `!empty()` predicate added by the planner. Otherwise we need to read every row to evaluate the predicate and cannot skip any. Therefore, this patch skips registering the `!empty()` predicates if the query contains zipping unnests. This can affect performance if the table contains many empty collections, but the effect should be noticeable only in very extreme cases.

The late materialization threshold is set to 1 in HdfsParquetScanner when there is any collection that can be skipped.

This patch also adds the detail of `HdfsScanner::parse_status_` to the error message returned by the HdfsParquetScanner to help figure out the root cause.

Performance:
- Tests with queries involving collection columns in table `tpch_nested_parquet.customer` show that when the selectivity is low, the single-threaded (1 impalad and MT_DOP=1) scanning time can be reduced by about 50%, while when the selectivity is high, the scanning time is almost unchanged.
- For queries not involving collections, performance A/B testing shows no regression on TPC-H.

Testing:
- Added a runtime profile counter NumTopLevelValuesSkipped to record the total number of top-level values skipped for all columns. The counter only counts values that are skipped individually, not values skipped as part of a whole page.
- Added e2e test cases in test_parquet_late_materialization.py to ensure that late materialization works using the new counter.

Change-Id: Ia21bdfa6811408d66d74367e0a9520e20951105f
Reviewed-on: http://gerrit.cloudera.org:8080/22662
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
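The skipping behavior described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not Impala's C++ scanner: the function name `scan_with_late_materialization`, its parameters, and the row-by-row loop are hypothetical, and the `threshold` argument merely mimics the idea of a late materialization threshold (a run of filtered-out rows is skipped only if it is at least that long).

```python
def scan_with_late_materialization(filter_values, read_collection, predicate,
                                   threshold=1):
  """Toy model of late materialization (hypothetical, not Impala code).

  filter_values: values of the predicate column, read eagerly for every row.
  read_collection: callable(row_index) that materializes the collection column
      for one row; assumed to be the expensive step.
  predicate: callable(value) -> bool, evaluated on the filter column only.
  threshold: minimum run of consecutive filtered-out rows worth skipping.
  Returns (rows, num_skipped).
  """
  rows = []
  num_skipped = 0
  i = 0
  n = len(filter_values)
  while i < n:
    if predicate(filter_values[i]):
      # The row survives the filter, so the collection must be materialized.
      rows.append((filter_values[i], read_collection(i)))
      i += 1
      continue
    # Find the end of the run of consecutive rows that fail the predicate.
    j = i
    while j < n and not predicate(filter_values[j]):
      j += 1
    run_length = j - i
    if run_length >= threshold:
      # Long enough run: skip the collection values without materializing them.
      num_skipped += run_length
    else:
      # Run too short to be worth skipping: read the values and discard them.
      for k in range(i, j):
        read_collection(k)
    i = j
  return rows, num_skipped
```

With `threshold=1`, every filtered-out row is skipped, which in this toy model corresponds to the setting the commit message describes for scans that contain a skippable collection. The returned `num_skipped` plays a role loosely analogous to a skipped-values counter, although the real NumTopLevelValuesSkipped counter is maintained inside the scanner.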
42 lines
1.8 KiB
Python
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

from __future__ import absolute_import, division, print_function
from tests.common.impala_test_suite import ImpalaTestSuite
from tests.common.file_utils import create_table_from_parquet


class TestParquetLateMaterialization(ImpalaTestSuite):
  """
  This suite tests late materialization optimization for parquet.
  """

  @classmethod
  def add_test_dimensions(cls):
    super(TestParquetLateMaterialization, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_constraint(
      lambda v: v.get_value('table_format').file_format == 'parquet')

  def test_parquet_late_materialization(self, vector):
    self.run_test_case('QueryTest/parquet-late-materialization', vector)

  def test_parquet_late_materialization_unique_db(self, vector, unique_database):
    create_table_from_parquet(self.client, unique_database, 'decimals_1_10')
    create_table_from_parquet(self.client, unique_database, 'nested_decimals')
    self.run_test_case('QueryTest/parquet-late-materialization-unique-db', vector,
        unique_database)