impala/tests/failure/test_failpoints.py
Tim Armstrong 08acccf9eb IMPALA-9156: share broadcast join builds
The scheduler will only create one join build finstance per
backend in cases where this is supported.

The builder is aware of the number of finstances executing the
probe and hands off the build data structures to those probe
finstances.
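
As a rough illustration of the handoff (a Python sketch with hypothetical names,
not the actual C++ classes): the builder publishes a read-only build structure to
all probe finstances on the backend and releases it only after the last probe
finstance closes.

import threading

class SharedJoinBuild(object):
  def __init__(self, num_probe_finstances):
    self.num_probe_finstances = num_probe_finstances
    self._ready = threading.Event()
    self._lock = threading.Lock()
    self._open_probes = 0
    self.build_structures = None   # read-only once hand_off() has run

  def hand_off(self, build_structures):
    # Called once by the build finstance when the build side is complete.
    self.build_structures = build_structures
    with self._lock:
      self._open_probes = self.num_probe_finstances
    self._ready.set()

  def wait_for_build(self):
    # Called by each probe finstance before it starts probing.
    self._ready.wait()
    return self.build_structures

  def close_from_probe(self):
    # Called by each probe finstance; the last one releases the build memory.
    with self._lock:
      self._open_probes -= 1
      if self._open_probes == 0:
        self.build_structures = None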

Nested loop join requires minimal modifications because the
build data structures are read-only after initial construction.
The only significant change is that memory can't be transferred
to the multiple consumers, so MarkNeedsDeepCopy() needs to be
used instead.

Hash join requires additional synchronisation because the
spilling algorithm mutates build-side data structures. This
patch adds synchronisation so that rebuilding spilled
partitions is done in a thread-safe manner, using a single
thread. This uses the CyclicBarrier added in an earlier patch.

Threads blocked on CyclicBarrier need to be cancellable,
which is handled by cancelling the barrier when cancelling
fragments on the backend.
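
A minimal sketch of this synchronisation and cancellation pattern, using Python's
threading.Barrier in place of the C++ CyclicBarrier (names are illustrative, not
the real interfaces): every probe thread arrives at the barrier, exactly one of
them runs the rebuild via the barrier action, and cancellation aborts the barrier
so blocked threads wake up with an error.

import threading

class SpilledPartitionRebuilder(object):
  def __init__(self, num_probe_threads, rebuild_fn):
    # The barrier action runs in exactly one of the waiting threads, so the
    # rebuild of the spilled partition is single-threaded.
    self._barrier = threading.Barrier(num_probe_threads, action=rebuild_fn)

  def rebuild_serially(self):
    # Called by every probe thread; returns once one thread has done the rebuild.
    try:
      self._barrier.wait()
    except threading.BrokenBarrierError:
      raise RuntimeError('cancelled while waiting for spilled partition rebuild')

  def cancel(self):
    # Called on fragment cancellation to unblock any thread stuck on the barrier.
    self._barrier.abort()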

BufferPool now correctly handles multiple threads calling
CleanPages() concurrently, which makes other methods thread-safe.

Update planner to cost broadcast join and estimate memory
consumption based on a single instance per node.

Planner estimates of number of instances are improved. Instead of
assuming mt_dop instances per node, use the total number of input
splits (also called scan ranges in places) as an upper bound on
the number of instances generated by scans. These instance
estimates from the scan nodes are then propagated up the
plan tree in the same way as the numNodes estimates. The instance
estimate for the join build fragment is fixed to be based on
the destination fragment.
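
As a back-of-the-envelope sketch of the new scan estimate (a hypothetical helper,
not the actual Java planner code): the instance count is capped by the number of
scan ranges instead of always assuming mt_dop instances per node.

def estimate_scan_instances(num_nodes, mt_dop, num_scan_ranges):
  # At most mt_dop instances on each node (one per node when mt_dop is 0), but
  # never more instances than there are input splits (scan ranges) to hand out.
  return min(num_nodes * max(mt_dop, 1), num_scan_ranges)

# Example: 3 backends with mt_dop=4 but only 5 scan ranges -> 5 instances, not 12.
assert estimate_scan_instances(num_nodes=3, mt_dop=4, num_scan_ranges=5) == 5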

The profile now correctly accounts for time waiting for the
builder, counting it in inactive time and showing it in the
node timeline. Additional improvements/cleanup to the time
accounting are deferred until IMPALA-9422.

Testing:
* Updated planner tests.
* Ran a single-node stress test with TPC-H and TPC-DS.
* Added a targeted test for spilling broadcast joins, both repartitioning
  and not repartitioning.
* Added a targeted test for a spilling broadcast join with an empty probe.
* Added a targeted test for a spilling broadcast join with empty build
  partitions.
* Added a broadcast join to test_cancellation and test_failpoints.

Perf:

I did a single node run on my desktop:
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 6.26    | -15.70%    | 4.63       | -16.16%        |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval    |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+
| TPCH(30) | TPCH-Q21 | parquet / none / none | 24.97  | 23.25       | R +7.38%   |   0.51%   |   0.22%        | 5     | R +6.95%       | 2.31    | 27.93   |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 2.83   | 2.79        |   +1.31%   |   1.86%   |   0.36%        | 5     |   +1.88%       | 1.15    | 1.53    |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.28   | 1.28        |   -0.01%   |   1.64%   |   1.63%        | 5     |   -0.11%       | -0.58   | -0.01   |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 2.65   | 2.68        |   -0.94%   |   0.84%   |   1.46%        | 5     |   -0.21%       | -0.87   | -1.25   |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 4.69   | 4.72        |   -0.56%   |   1.29%   |   0.52%        | 5     |   -1.04%       | -1.15   | -0.89   |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 10.64  | 10.80       |   -1.48%   |   0.61%   |   0.60%        | 5     |   -1.39%       | -1.73   | -3.91   |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 4.11   | 4.32        |   -4.92%   |   0.05%   |   0.40%        | 5     |   -4.93%       | -2.31   | -27.46  |
| TPCH(30) | TPCH-Q20 | parquet / none / none | 3.47   | 3.67        | I -5.41%   |   0.81%   |   0.03%        | 5     | I -5.70%       | -2.31   | -15.75  |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 7.58   | 8.14        | I -6.93%   |   3.13%   |   2.62%        | 5     | I -9.31%       | -2.02   | -3.96   |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 15.59  | 17.02       | I -8.38%   |   0.95%   |   0.43%        | 5     | I -8.92%       | -2.31   | -19.37  |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.90   | 3.25        | I -10.93%  |   1.42%   |   4.41%        | 5     | I -10.28%      | -2.31   | -5.33   |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.69   | 3.13        | I -14.31%  |   4.50%   |   1.40%        | 5     | I -17.79%      | -2.31   | -7.80   |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.50   | 3.03        | I -17.54%  |   0.10%   |   0.79%        | 5     | I -20.55%      | -2.31   | -49.31  |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 4.76   | 5.92        | I -19.52%  |   0.78%   |   0.33%        | 5     | I -24.31%      | -2.31   | -61.63  |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 2.56   | 3.33        | I -23.18%  |   2.13%   |   0.85%        | 5     | I -30.39%      | -2.31   | -28.14  |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 12.59  | 16.41       | I -23.26%  |   1.73%   |   0.90%        | 5     | I -30.43%      | -2.31   | -32.36  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.83   | 2.41        | I -24.04%  |   1.83%   |   2.22%        | 5     | I -30.48%      | -2.31   | -20.54  |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 4.43   | 5.94        | I -25.33%  |   0.96%   |   0.54%        | 5     | I -34.54%      | -2.31   | -63.01  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 3.81   | 5.37        | I -29.08%  |   1.43%   |   0.69%        | 5     | I -41.47%      | -2.31   | -53.11  |
| TPCH(30) | TPCH-Q7  | parquet / none / none | 13.34  | 21.49       | I -37.92%  |   0.46%   |   0.30%        | 5     | I -60.69%      | -2.31   | -203.08 |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 4.73   | 7.73        | I -38.81%  |   4.90%   |   1.35%        | 5     | I -61.68%      | -2.31   | -26.40  |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.71   | 6.61        | I -43.83%  |   1.63%   |   0.09%        | 5     | I -77.12%      | -2.31   | -106.61 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+---------+

Change-Id: I4c67e4b2c87ed0fba648f1e1710addb885d66dc7
Reviewed-on: http://gerrit.cloudera.org:8080/15096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-17 23:29:45 +00:00


# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# Injects failures at specific locations in each of the plan nodes. Currently supports
# two types of failures - cancellation of the query and a failure test hook.
#
import pytest
import os
import re
from collections import defaultdict
from time import sleep
from tests.beeswax.impala_beeswax import ImpalaBeeswaxException
from tests.common.impala_test_suite import ImpalaTestSuite, LOG
from tests.common.skip import SkipIf, SkipIfS3, SkipIfABFS, SkipIfADLS, SkipIfIsilon, \
    SkipIfLocal
from tests.common.test_dimensions import create_exec_option_dimension
from tests.common.test_vector import ImpalaTestDimension

FAILPOINT_ACTIONS = ['FAIL', 'CANCEL', 'MEM_LIMIT_EXCEEDED']
# Not included:
# - SCANNER_ERROR, because it only fires if the query already hit an error.
FAILPOINT_LOCATIONS = ['PREPARE', 'PREPARE_SCANNER', 'OPEN', 'GETNEXT', 'GETNEXT_SCANNER',
    'CLOSE']
# Map debug actions to their corresponding query options' values.
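# For example, action 'CANCEL' at location 'GETNEXT' on plan node 2 yields the debug
# action '2:GETNEXT:WAIT'; test_failpoints() also appends a COORD_BEFORE_EXEC_RPC jitter
# action before setting the DEBUG_ACTION query option.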
FAILPOINT_ACTION_MAP = {'FAIL': 'FAIL', 'CANCEL': 'WAIT',
                        'MEM_LIMIT_EXCEEDED': 'MEM_LIMIT_EXCEEDED'}
MT_DOP_VALUES = [0, 4]
# Queries should cover all exec nodes.
QUERIES = [
  "select * from alltypes",
  "select count(*) from alltypessmall",
  "select count(int_col) from alltypessmall group by id",
  "select 1 from alltypessmall a join alltypessmall b on a.id = b.id",
  "select 1 from alltypessmall a join alltypessmall b on a.id != b.id",
  "select 1 from alltypessmall order by id",
  "select 1 from alltypessmall order by id limit 100",
  "select * from alltypessmall union all select * from alltypessmall",
  "select row_number() over (partition by int_col order by id) from alltypessmall",
  "select c from (select id c from alltypessmall order by id limit 10) v where c = 1",
  """SELECT STRAIGHT_JOIN *
     FROM alltypes t1
     JOIN /*+broadcast*/ alltypesagg t2 ON t1.id = t2.id
     WHERE t2.int_col < 1000"""
]


@SkipIf.skip_hbase # -skip_hbase argument specified
@SkipIfS3.hbase # S3: missing coverage: failures
@SkipIfABFS.hbase
@SkipIfADLS.hbase
@SkipIfIsilon.hbase # ISILON: missing coverage: failures.
@SkipIfLocal.hbase
class TestFailpoints(ImpalaTestSuite):
  @classmethod
  def get_workload(cls):
    return 'functional-query'

  @classmethod
  def add_test_dimensions(cls):
    super(TestFailpoints, cls).add_test_dimensions()
    cls.ImpalaTestMatrix.add_dimension(
        ImpalaTestDimension('query', *QUERIES))
    cls.ImpalaTestMatrix.add_dimension(
        ImpalaTestDimension('action', *FAILPOINT_ACTIONS))
    cls.ImpalaTestMatrix.add_dimension(
        ImpalaTestDimension('location', *FAILPOINT_LOCATIONS))
    cls.ImpalaTestMatrix.add_dimension(
        ImpalaTestDimension('mt_dop', *MT_DOP_VALUES))
    cls.ImpalaTestMatrix.add_dimension(
        create_exec_option_dimension([0], [False], [0]))

    # Don't create PREPARE:WAIT debug actions because cancellation is not checked until
    # after the prepare phase once execution is started.
    cls.ImpalaTestMatrix.add_constraint(
        lambda v: not (v.get_value('action') == 'CANCEL'
                       and v.get_value('location') == 'PREPARE'))
    # Don't create CLOSE:WAIT debug actions to avoid leaking plan fragments (there's no
    # way to cancel a plan fragment once Close() has been called).
    cls.ImpalaTestMatrix.add_constraint(
        lambda v: not (v.get_value('action') == 'CANCEL'
                       and v.get_value('location') == 'CLOSE'))

  # Run serially because this can create enough memory pressure to invoke the Linux OOM
  # killer on machines with 30GB RAM. This makes the test run in 4 minutes instead of
  # 1-2.
  @pytest.mark.execute_serially
  def test_failpoints(self, vector):
    query = vector.get_value('query')
    action = vector.get_value('action')
    location = vector.get_value('location')
    vector.get_value('exec_option')['mt_dop'] = vector.get_value('mt_dop')

    plan_node_ids = self.__parse_plan_nodes_from_explain(query, vector)
    for node_id in plan_node_ids:
      debug_action = '%d:%s:%s' % (node_id, location, FAILPOINT_ACTION_MAP[action])
      # IMPALA-7046: add jitter to backend startup to exercise various failure paths.
      debug_action += '|COORD_BEFORE_EXEC_RPC:JITTER@100@0.3'
      LOG.info('Current debug action: SET DEBUG_ACTION=%s' % debug_action)
      vector.get_value('exec_option')['debug_action'] = debug_action

      if action == 'CANCEL':
        self.__execute_cancel_action(query, vector)
      elif action == 'FAIL' or action == 'MEM_LIMIT_EXCEEDED':
        self.__execute_fail_action(query, vector)
      else:
        assert 0, 'Unknown action: %s' % action

    # We should be able to execute the same query successfully when no failures are
    # injected.
    del vector.get_value('exec_option')['debug_action']
    self.execute_query(query, vector.get_value('exec_option'))

  def __parse_plan_nodes_from_explain(self, query, vector):
    """Parses the EXPLAIN <query> output and returns a list of node ids.
    Expects format of <ID>:<NAME>"""
    explain_result = \
        self.execute_query("explain " + query, vector.get_value('exec_option'),
                           table_format=vector.get_value('table_format'))
    node_ids = []
    for row in explain_result.data:
      match = re.search(r'\s*(?P<node_id>\d+)\:(?P<node_type>\S+\s*\S+)', row)
      if match is not None:
        node_ids.append(int(match.group('node_id')))
    return node_ids

  def test_lifecycle_failures(self):
    """Test that targeted failure injections in the query lifecycle do not cause crashes
    or hangs"""
    query = "select * from tpch.lineitem limit 10000"

    # Test that the admission controller handles scheduler errors correctly.
    debug_action = "SCHEDULER_SCHEDULE:FAIL"
    result = self.execute_query_expect_failure(self.client, query,
        query_options={'debug_action': debug_action})
    assert "Error during scheduling" in str(result)

    # Fail Exec() in the coordinator.
    debug_action = 'EXEC_SERIALIZE_FRAGMENT_INFO:FAIL@1.0'
    self.execute_query_expect_failure(self.client, query,
        query_options={'debug_action': debug_action})

    # Fail the Prepare() phase of all fragment instances.
    debug_action = 'FIS_IN_PREPARE:FAIL@1.0'
    self.execute_query_expect_failure(self.client, query,
        query_options={'debug_action': debug_action})

    # Fail the Open() phase of all fragment instances.
    debug_action = 'FIS_IN_OPEN:FAIL@1.0'
    self.execute_query_expect_failure(self.client, query,
        query_options={'debug_action': debug_action})

    # Fail the ExecInternal() phase of all fragment instances.
    debug_action = 'FIS_IN_EXEC_INTERNAL:FAIL@1.0'
    self.execute_query_expect_failure(self.client, query,
        query_options={'debug_action': debug_action})

    # Fail the fragment instance thread creation with a 0.5 probability.
    debug_action = 'FIS_FAIL_THREAD_CREATION:FAIL@0.5'
    # We want to test the behavior when only some fragment instance threads fail to be
    # created, so we set the probability of fragment instance thread creation failure to
    # 0.5. Since there's only a 50% chance of fragment instance thread creation failure,
    # we attempt to catch a query failure up to a very conservative maximum of 50 tries.
    for i in range(50):
      try:
        self.execute_query(query,
            query_options={'debug_action': debug_action})
      except ImpalaBeeswaxException as e:
        assert 'Debug Action: FIS_FAIL_THREAD_CREATION:FAIL@0.5' \
            in str(e), str(e)
        break

  def __execute_fail_action(self, query, vector):
    try:
      self.execute_query(query, vector.get_value('exec_option'),
                         table_format=vector.get_value('table_format'))
      assert 'Expected Failure'
    except ImpalaBeeswaxException as e:
      LOG.debug(e)
      # IMPALA-5197: None of the debug actions should trigger corrupted file message
      assert 'Corrupt Parquet file' not in str(e)

  def __execute_cancel_action(self, query, vector):
    LOG.info('Starting async query execution')
    handle = self.execute_query_async(query, vector.get_value('exec_option'),
                                      table_format=vector.get_value('table_format'))
    LOG.info('Sleeping')
    sleep(3)
    cancel_result = self.client.cancel(handle)
    self.client.close_query(handle)
    assert cancel_result.status_code == 0, \
        'Unexpected status code from cancel request: %s' % cancel_result