mirror of
https://github.com/apache/impala.git
synced 2026-01-07 09:02:19 -05:00
This is similar to the single-node execution optimisation, but applies to slightly larger queries that should run in a distributed manner but won't benefit from codegen.

This adds a new query option disable_codegen_rows_threshold that defaults to 50,000. If fewer than this number of rows are processed by a plan node per impalad, the cost of codegen almost certainly outweighs the benefit.

Using rows processed as a threshold is justified by a simple model that assumes the cost of codegen and the per-row cost of execution for the same operation are both proportional to the operation's complexity. E.g. if x is the complexity of the operation, n is the number of rows processed, C is a constant factor giving the cost of codegen, and Ec and Ei are constant factors giving the per-row cost of codegen'd and interpreted execution respectively, then the cost of the codegen'd operator is C * x + Ec * x * n and the cost of the interpreted operator is Ei * x * n. Rearranging shows that interpretation is cheaper if n < C / (Ei - Ec), i.e. (at least with the simplified model) it makes sense to choose interpretation or codegen based on a constant threshold. The model also implies that it is somewhat safer to choose codegen, because the additional cost of codegen is O(1) but the additional cost of interpretation is O(n).

I ran some experiments with TPC-H Q1, varying the input table size, to determine the cut-over point at which codegen became beneficial. The cut-over was around 150k rows per node for both text and parquet. At 50k rows per node, disabling codegen was very beneficial: around 0.12s versus 0.24s. To be somewhat conservative I set the default threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the cut-over tends to be higher because there are plan nodes that process many fewer than the max rows.
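The cost model described in this commit message can be written out as a small sketch. The function names and the constants used in the usage lines are hypothetical and chosen purely for illustration (they are not measured values and not Impala code); only the formulas C * x + Ec * x * n, Ei * x * n and n < C / (Ei - Ec), and the 50,000 default, come from the text.

```python
def codegen_cost(x, n, C, Ec):
    """Cost of the codegen'd operator under the simplified model."""
    return C * x + Ec * x * n


def interpreted_cost(x, n, Ei):
    """Cost of the interpreted operator under the simplified model."""
    return Ei * x * n


def break_even_rows(C, Ei, Ec):
    """Row count below which interpretation is cheaper: n < C / (Ei - Ec).

    The complexity x cancels out, which is why a constant threshold works.
    """
    if Ei <= Ec:
        raise ValueError("model assumes interpreted execution costs more per row")
    return C / (Ei - Ec)


def should_disable_codegen(max_rows_per_node, threshold=50000):
    """Hypothetical decision rule: disable codegen below the threshold.

    A threshold of 0 never disables codegen, since no row count is below 0.
    """
    return max_rows_per_node < threshold


# Illustrative constants only (assumed, not measured).
C, Ei, Ec = 150000.0, 2.0, 1.0
cutover = break_even_rows(C, Ei, Ec)
```

With these assumed constants the break-even point is 150,000 rows regardless of x, and at 50,000 rows the interpreted cost is lower than the codegen'd cost for any positive x.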
Fix a couple of minor issues in the frontend: the numNodes_ calculation could return 0 for Kudu, and the single-node optimisation didn't correctly handle a scan node with conjuncts, a limit and missing stats (it considered the estimate still valid).

Testing:
Updated e2e tests that set disable_codegen to also set disable_codegen_rows_threshold to 0, so that those tests still run both with and without codegen. Added an e2e test to make sure that the optimisation is applied in the backend. Added planner tests for various cases where codegen should and shouldn't be disabled.

Perf:
Added a targeted perf test for a join+agg over a small input, which benefits from this change.

Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e
Reviewed-on: http://gerrit.cloudera.org:8080/7153
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
84 lines
3.2 KiB
Python
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# Targeted tests for decimal type.

from copy import copy

from tests.common.impala_test_suite import ImpalaTestSuite
from tests.common.test_dimensions import create_exec_option_dimension_from_dict
from tests.common.test_vector import ImpalaTestDimension

class TestDecimalQueries(ImpalaTestSuite):
  @classmethod
  def get_workload(cls):
    return 'functional-query'

  @classmethod
  def add_test_dimensions(cls):
    super(TestDecimalQueries, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_dimension(
      create_exec_option_dimension_from_dict({
        'decimal_v2' : ['false', 'true'],
        'batch_size' : [0, 1],
        'disable_codegen' : ['false', 'true'],
        'disable_codegen_rows_threshold' : [0]}))
    # Hive < 0.11 does not support decimal so we can't run these tests against the other
    # file formats.
    # TODO: Enable them on Hive >= 0.11.
    cls.ImpalaTestMatrix.add_constraint(lambda v:\
        (v.get_value('table_format').file_format == 'text' and
         v.get_value('table_format').compression_codec == 'none') or
         v.get_value('table_format').file_format == 'parquet')

  def test_queries(self, vector):
    self.run_test_case('QueryTest/decimal', vector)

# Tests involving DECIMAL typed expressions. The results depend on whether DECIMAL
# version 1 or version 2 is enabled, so the .test file itself toggles the DECIMAL_V2
# query option.
class TestDecimalExprs(ImpalaTestSuite):
  @classmethod
  def get_workload(cls):
    return 'functional-query'

  @classmethod
  def add_test_dimensions(cls):
    super(TestDecimalExprs, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_constraint(lambda v:
        (v.get_value('table_format').file_format == 'parquet'))

  def test_exprs(self, vector):
    self.run_test_case('QueryTest/decimal-exprs', vector)

# TODO: when we have a good way to produce Avro decimal data (e.g. upgrade Hive), we can
# run Avro through the same tests as above instead of using avro_decimal_tbl.
class TestAvroDecimalQueries(ImpalaTestSuite):
  @classmethod
  def get_workload(cls):
    return 'functional-query'

  @classmethod
  def add_test_dimensions(cls):
    super(TestAvroDecimalQueries, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_constraint(lambda v:
        (v.get_value('table_format').file_format == 'avro' and
         v.get_value('table_format').compression_codec == 'snap'))

  def test_avro_queries(self, vector):
    self.run_test_case('QueryTest/decimal_avro', vector)
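As a rough illustration of what the test matrix in TestDecimalQueries covers, the sketch below expands the same option dictionary into every combination and re-states the table-format constraint as a plain function. It assumes that create_exec_option_dimension_from_dict takes the cartesian product of the listed values; that is an assumption about the helper, not something shown in this file. The names expand_exec_options and table_format_allowed are hypothetical stand-ins and do not exist in the Impala test framework.

```python
from itertools import product


def expand_exec_options(options):
    """Return one dict per combination of option values (cartesian product).

    Hypothetical stand-in for the framework helper; assumed behaviour only.
    """
    keys = sorted(options)
    value_lists = [options[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


def table_format_allowed(file_format, compression_codec):
    """Mirror of the lambda constraint in TestDecimalQueries."""
    return ((file_format == 'text' and compression_codec == 'none')
            or file_format == 'parquet')


# Same values as in TestDecimalQueries.add_test_dimensions().
EXEC_OPTIONS = {
    'decimal_v2': ['false', 'true'],
    'batch_size': [0, 1],
    'disable_codegen': ['false', 'true'],
    'disable_codegen_rows_threshold': [0],
}

combos = expand_exec_options(EXEC_OPTIONS)
for combo in combos:
    print(combo)
```

Under that assumption there are 2 * 2 * 2 * 1 = 8 combinations, each with disable_codegen_rows_threshold fixed at 0, so the row-count optimisation never overrides the explicit disable_codegen setting and both the codegen'd and interpreted paths are exercised.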