Commit Graph

10 Commits

Author SHA1 Message Date
stiga-huang
e0a6e942b2 IMPALA-9955,IMPALA-9957: Fix not enough reservation for large pages in GroupingAggregator
The minimum requirement for a spillable operator is ((min_buffers - 2) *
default_buffer_size) + 2 * max_row_size. In the min reservation, we only
reserve space for two large pages: one for reading and one for writing.
However, to make the non-streaming GroupingAggregator work correctly, we
have to manage these extra reservations carefully so that it doesn't run
out of the min reservation when it actually needs to spill a large page
or read a large page.
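
As a rough illustration of that formula, here is a minimal computation
with made-up values (these are not the actual Impala defaults or the
aggregator's real minimum buffer count):

  #include <cstdint>
  #include <iostream>

  int main() {
    // Example values only: pretend the operator needs 16 buffers at
    // minimum, the default spillable buffer is 2 MB and max_row_size
    // is 512 KB.
    const int64_t min_buffers = 16;
    const int64_t default_buffer_size = 2LL << 20;  // 2 MB
    const int64_t max_row_size = 512LL << 10;       // 512 KB

    // (min_buffers - 2) buffers at the default size, plus two large
    // pages (one for reading, one for writing) sized for a maximum row.
    const int64_t min_reservation =
        (min_buffers - 2) * default_buffer_size + 2 * max_row_size;
    std::cout << min_reservation << " bytes\n";  // 30408704 (= 29 MB)
    return 0;
  }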

Specifically, the large write page reservation is managed differently
depending on whether needs_serialize is true or false:
- If the aggregator needs to serialize the intermediate results when
  spilling a partition, we have to save a large page worth of
  reservation for the serialize stream, in case it needs to write large
  rows. This space can be restored once all the partitions are spilled,
  since the serialize stream is then not needed until we build/repartition
  a spilled partition and thus have pinned partitions again. If the large
  write page reservation is used, we save it back whenever possible after
  we spill or close a partition.
- If the aggregator doesn't need the serialize stream at all, we can
  restore the large write page reservation whenever we fail to add a
  large row, before spilling any partitions. We reclaim it whenever
  possible after we spill or close a partition.
A special case is when we are processing a large row and it's the last
row in building/repartitioning a spilled partition: the large write page
reservation can be restored for it regardless of whether we need the
serialize stream, because the partitions will be read out after this and
so there is no further need for spilling.
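
A self-contained toy model of the save/restore bookkeeping described
above, with hypothetical names rather than the real GroupingAggregator
members:

  #include <cassert>
  #include <cstdint>

  struct ReservationTracker {
    int64_t usable = 0;  // reservation the aggregator may spend freely
    int64_t saved = 0;   // set aside for one large write page

    void SaveLargeWritePage(int64_t max_row_size) {
      // Set aside max_row_size bytes, if we haven't already.
      if (saved == 0 && usable >= max_row_size) {
        usable -= max_row_size;
        saved = max_row_size;
      }
    }
    void RestoreLargeWritePage() {
      // Give the saved bytes back so a large page can be written now.
      usable += saved;
      saved = 0;
    }
  };

  int main() {
    const int64_t max_row_size = 512 << 10;
    ReservationTracker r{/*usable=*/2 * max_row_size};

    // needs_serialize == true: keep one large page's worth saved for
    // the serialize stream while there are pinned partitions.
    r.SaveLargeWritePage(max_row_size);
    assert(r.saved == max_row_size);

    // Once all partitions are spilled, or we hit the last row of a
    // build/repartition, the saved bytes can be spent on a large row.
    r.RestoreLargeWritePage();
    assert(r.usable == 2 * max_row_size && r.saved == 0);
    return 0;
  }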

For the large read page reservation, it's transferred to the spilled
BufferedTupleStream that we are reading in building/repartitioning a
spilled partition. The stream will restore some of it when reading a
large page, and reclaim it when the output row batch is reset. Note that
because the stream is read in attach_on_read mode, the large page will
be attached to the row batch's buffers and only gets freed when the row
batch is reset.
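
A similarly hypothetical sketch of the read side: the reservation lent
to the stream is spent when a large page is read, and comes back once
the batch that the page was attached to is reset:

  #include <cassert>
  #include <cstdint>

  struct LargeReadPageReservation {
    int64_t bytes = 0;   // reservation lent by the aggregator
    int64_t in_use = 0;  // spent on the large page currently attached

    void RestoreForLargePage(int64_t page_len) {  // called when reading
      assert(page_len <= bytes);
      in_use = page_len;
    }
    void ReclaimOnBatchReset() {  // called when the row batch is reset
      in_use = 0;  // attach_on_read: the page was freed with the batch
    }
  };

  int main() {
    LargeReadPageReservation res{/*bytes=*/512 << 10};
    res.RestoreForLargePage(300 << 10);  // stream hits a large page
    res.ReclaimOnBatchReset();           // output batch is reset
    assert(res.in_use == 0);
    return 0;
  }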

Tests:
- Add tests in test_spilling_large_rows (test_spilling.py) with
  different row sizes to reproduce the issue.
- One test in test_spilling_no_debug_action becomes flaky after this
  patch. Revise the query to make the udf allocate larger strings so it
  can consistently pass.
- Run CORE tests.

Change-Id: I3d9c3a2e7f0da60071b920dec979729e86459775
Reviewed-on: http://gerrit.cloudera.org:8080/16240
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-08-25 16:11:20 +00:00
Tim Armstrong
b2d9901fb8 IMPALA-9176: shared null-aware anti-join build
This switches null-aware anti-join (NAAJ) to use shared
join builds with mt_dop > 0. To support this, we
make all access to the join build data structures
from the probe side read-only. NAAJ requires iterating
over rows from build partitions at various steps
in the algorithm, and before this patch this was not
thread-safe. Previously we avoided that problem by having
a separate builder for each join node and duplicating
the data.

The main challenge was iteration over
null_aware_partition()->build_rows() from the probe
side, because it uses an embedded iterator in the
stream and so was not thread-safe (each thread
would be trying to use the same iterator).

The solution is to extend BufferedTupleStream to
allow multiple read iterators into a pinned,
read-only, stream. Each probe thread can then
iterate over the stream independently with no
thread safety issues.
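
A minimal sketch of the idea, using stand-in types rather than the real
BufferedTupleStream API: each probe thread gets its own cursor over the
shared, pinned, read-only stream, so no shared iterator state is
mutated:

  #include <cstddef>
  #include <thread>
  #include <vector>

  struct PinnedStream {
    std::vector<int> rows;  // stands in for the pinned build rows

    struct ReadIterator {   // per-reader cursor, no shared state
      const PinnedStream* stream;
      std::size_t pos = 0;
      bool GetNext(int* row) {
        if (pos >= stream->rows.size()) return false;
        *row = stream->rows[pos++];
        return true;
      }
    };
    ReadIterator NewReadIterator() const { return ReadIterator{this}; }
  };

  int main() {
    PinnedStream build{{1, 2, 3, 4}};
    std::vector<std::thread> probes;
    for (int t = 0; t < 4; ++t) {
      probes.emplace_back([&build] {
        auto it = build.NewReadIterator();  // independent cursor per thread
        int row;
        while (it.GetNext(&row)) {
          (void)row;  // evaluate the NAAJ condition against 'row' here
        }
      });
    }
    for (auto& p : probes) p.join();
    return 0;
  }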

With the BufferedTupleStream changes, I partially abstracted
ReadIterator from the rest of BufferedTupleStream,
but decided not to refactor completely so that this patchset
didn't cause excessive churn, i.e. much BufferedTupleStream
code still accesses internal fields of ReadIterator.

Fix a pre-existing bug in grouping-aggregator where
Spill() hit a DCHECK because the hash table was
destroyed unnecessarily when it hit an OOM. This was
flushed out by the parameter change in test_spilling.

Testing:
Add test to buffered-tuple-stream-test for multiple readers
to BTS.

Tweaked test_spilling_naaj_no_deny_reservation to have
a smaller minimum reservation, required to keep the
test passing with the new, lower memory requirement.

Updated a TPC-H planner test where resource requirements
slightly decreased for the NAAJ.

Ran the naaj tests in test_spilling.py with TSAN enabled,
confirmed no data races.

Ran exhaustive tests, which passed after fixing IMPALA-9611.

Ran core tests with ASAN.

Ran backend tests with TSAN.

Perf:
I ran this query that exercises EvaluateNullProbe() heavily.

  select l_orderkey, l_partkey, l_suppkey, l_linenumber
  from tpch30_parquet.lineitem
  where l_suppkey = 4162 and l_shipmode = 'AIR'
        and l_returnflag = 'A' and l_shipdate > '1993-01-01'
        and if(l_orderkey > 5500000, NULL, l_orderkey) not in (
          select if(o_orderkey % 2 = 0, NULL, o_orderkey + 1)
          from orders
          where l_orderkey = o_orderkey)
  order by 1,2,3,4;

It went from ~13s to ~11s running on a single impalad with
this change, because of the inlining of CreateOutputRow() and
EvalConjuncts().

I also ran TPC-H SF 30 on Parquet with mt_dop=4, and there was
no change in performance.

Change-Id: I95ead761430b0aa59a4fb2e7848e47d1bf73c1c9
Reviewed-on: http://gerrit.cloudera.org:8080/15612
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-24 20:56:58 +00:00
Tim Armstrong
7b280e5841 IMPALA-9349: free output_unmatched_batch_ buffers promptly in PHJ
This fixes a subtle memory management issue where freeing of a
buffer is delayed longer than it should be. This means that
the full buffer pool reservation is not available for
repartitioning, which can lead to crashes or hangs for
very specific queries.

The fix is to transfer resources from output_unmatched_batch_
as soon as the last row from the batch is appended to the
output batch.
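
A toy model of that fix, with hypothetical names rather than the actual
PartitionedHashJoinNode code: buffers owned by the cached batch are
handed to the output batch as soon as its last row has been copied,
instead of being held until a later call:

  #include <cassert>
  #include <cstddef>
  #include <vector>

  struct Batch {
    std::vector<int> rows;
    std::vector<int> buffers;  // stands in for buffer pool buffers

    void TransferBuffersTo(Batch* dst) {
      for (int b : buffers) dst->buffers.push_back(b);
      buffers.clear();
    }
  };

  // Copy rows from 'unmatched' into 'out'; transfer buffers promptly
  // once the last row has been appended.
  void OutputUnmatched(Batch* unmatched, Batch* out, std::size_t* next_row) {
    while (*next_row < unmatched->rows.size()) {
      out->rows.push_back(unmatched->rows[(*next_row)++]);
    }
    if (*next_row == unmatched->rows.size()) {
      unmatched->TransferBuffersTo(out);  // reservation is available again
    }
  }

  int main() {
    Batch unmatched{{7, 8, 9}, {1}};
    Batch out;
    std::size_t next_row = 0;
    OutputUnmatched(&unmatched, &out, &next_row);
    assert(unmatched.buffers.empty());  // freed before repartitioning
    return 0;
  }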

This bug would only be triggered by join modes that output
unmatched rows from the right side (RIGHT OUTER JOIN,
FULL OUTER JOIN, RIGHT ANTI JOIN) *and* have an empty
probe side (otherwise unmatched rows are output by
iterating over the hash table).

Testing:
Added DCHECKs to check that all resources are available
before repartitioning.

Added a regression test that triggered the bug.

Change-Id: Ie13b51d4d909afb0fe2e7b7dc00b085c51058fed
Reviewed-on: http://gerrit.cloudera.org:8080/15142
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-01 02:25:13 +00:00
poojanilangekar
2a4835cfba IMPALA-7367: Pack StringValue and CollectionValue slots
This change packs StringValue and CollectionValue slots so that
they occupy 12 bytes instead of 16 bytes. This reduces
memory requirements and improves performance. Since Kudu
tuples are populated using a memcpy, 4 bytes of padding was
added to StringSlots in Kudu tables.
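
For illustration only (these are not the actual Impala definitions),
packing removes the 4 bytes of tail padding that pointer alignment
would otherwise add on a typical 64-bit platform:

  #include <cstdint>

  struct UnpackedStringValue {
    char* ptr;    // 8 bytes
    int32_t len;  // 4 bytes, plus 4 bytes of tail padding for alignment
  };

  #pragma pack(push, 1)
  struct PackedStringValue {
    char* ptr;
    int32_t len;
  };
  #pragma pack(pop)

  static_assert(sizeof(UnpackedStringValue) == 16, "padded layout");
  static_assert(sizeof(PackedStringValue) == 12, "packed layout");

  int main() { return 0; }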

Testing:
Ran core tests.
Added static asserts to ensure the value sizes are as expected.
Performance tests on TPCH-40 produced a 3.96% improvement.

Change-Id: I32f3b06622c087e4aa288e8db1bf4581b10d386a
Reviewed-on: http://gerrit.cloudera.org:8080/11599
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-11-19 17:27:13 +00:00
stakiar
98d923243f IMPALA-5004: Switch to sorting node for large TopN queries
Adds a new query option 'topn_bytes_limit' that places a limit on the
number of estimated bytes that a TopN operator can process. If the
Impala planner estimates that a TopN operator will process more bytes
than this limit, it will replace the TopN operator with a sort operator.

Since the TopN operator cannot spill to disk, it has to buffer everything
in memory. This can cause frequent OOM issues when running with a large
limit + offset. Switching to a sort operator allows Impala to spill to
disk. We prefer to use the TopN operator when possible as it has better
performance than the sort operator for 'order by limit [offset]' queries.
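
A rough sketch of that decision, with hypothetical names and a
simplified size estimate (the real logic lives in the Java planner and
differs in detail):

  #include <cstdint>
  #include <iostream>

  enum class SortImpl { kTopN, kTotalSort };

  SortImpl ChooseSortImpl(int64_t limit, int64_t offset,
                          int64_t avg_row_bytes, int64_t topn_bytes_limit) {
    // A TopN node must hold limit + offset rows in memory and cannot
    // spill, so fall back to a spillable sort past the byte limit.
    const int64_t estimated_bytes = (limit + offset) * avg_row_bytes;
    return estimated_bytes > topn_bytes_limit ? SortImpl::kTotalSort
                                              : SortImpl::kTopN;
  }

  int main() {
    const int64_t topn_bytes_limit = 512LL << 20;  // 512 MB default
    // A huge limit pushes the estimate past the cap, so a sort operator
    // (with the limit applied afterwards) is chosen instead of TopN.
    SortImpl impl = ChooseSortImpl(/*limit=*/100000000, /*offset=*/0,
                                   /*avg_row_bytes=*/64, topn_bytes_limit);
    std::cout << (impl == SortImpl::kTotalSort ? "sort" : "topn") << "\n";
    return 0;
  }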

The default limit is set to 512MB and is based on micro-benchmarking the
topn vs. sort operator for various limits (see the JIRA for full details).
The default is set to an intentionally high value in order to avoid
performance regressions.

Testing:

* Added a new planner test to functional-planner/ to validate that
'topn_bytes_limit' properly switches between topn and sort operators.

Change-Id: I34c9db33c9302b55e9978f53f9c7061f2806c8a9
Reviewed-on: http://gerrit.cloudera.org:8080/11698
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-11-07 01:17:34 +00:00
Bikramjeet Vig
77c56a805a IMPALA-7699: Fix spilling test run with hdfs erasure coding turned on
A spilling test, when run on a test build with HDFS erasure coding
turned on, hits an out-of-memory error on the HDFS scan node. This
happens because the test is tuned for a regular 3-node minicluster
without HDFS erasure coding. The fix is to simply increase the memory
limit on the test to accommodate this difference, yet keep it small
enough to achieve the desired spilling on the hash join node.

Testing:
Ran it on an EC-enabled minicluster to make sure it works.

Change-Id: I207569822ba7388e78936d25e2311fa09c7a1b9a
Reviewed-on: http://gerrit.cloudera.org:8080/11740
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-20 03:32:07 +00:00
Tim Armstrong
c80c62f22d IMPALA-7648 part 1: add expected out-of-memory tests
This adds test coverage for some cases where queries are currently
expected to fail with out-of-memory:
* memory limit exceeded in exchange node
* aggregation with large var-len intermediate values
* top N with large limit
* hash join with many duplicates on right side
* analytic with a large window that needs to be buffered in-memory

Note that it's not always totally deterministic where the query hits
'memory limit exceeded', so we don't include the node ID or name in the
expected error message.

Testing:
* ran exhaustive tests
* looped modified tests locally overnight

Change-Id: Icd1a7eb97837b742a967c260cafb5a7f4f45412e
Reviewed-on: http://gerrit.cloudera.org:8080/11564
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-05 01:36:17 +00:00
Tim Armstrong
fb5dc9eb48 IMPALA-4835: switch I/O buffers to buffer pool
This re-applies the following squashed patches that were reverted.

I will fix the known issues with some follow-on patches.

======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation

In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.

* Remove the free buffer cache (which will be replaced by buffer pool's
  own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
  of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
  enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
  function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
  connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
  interfaces - previously callers of ReturnBuffer() were fudging
  the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
  'is_cancelled_'.
* Clarify the logic around calling Close() for the last
  BufferDescriptor.
  -> There appeared to be an implicit assumption that buffers would be
     freed in the order they were returned from the scan range, so that
     the "eos" buffer was returned last. Instead just count the number
     of outstanding buffers to detect the last one.
  -> Touching the is_cancelled_ field without holding a lock was hard to
     reason about - it violated locking rules and it was unclear whether
     it was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
  inline at the callsites.

This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.

Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight

======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront

This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.

One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.
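
A toy model of that blocking behaviour, with hypothetical names: each
range recycles a fixed set of buffers, blocking when none are free and
unblocking when the client returns one:

  #include <cassert>

  struct ScanRangeBuffers {
    int free_buffers;
    bool blocked = false;

    bool TryGetBufferForRead() {
      if (free_buffers == 0) { blocked = true; return false; }
      --free_buffers;
      return true;
    }
    void ReturnBuffer() {  // called by the client when done with a buffer
      ++free_buffers;
      blocked = false;     // range can be scheduled for I/O again
    }
  };

  int main() {
    ScanRangeBuffers range{/*free_buffers=*/1};
    assert(range.TryGetBufferForRead());
    assert(!range.TryGetBufferForRead() && range.blocked);  // no buffers left
    range.ReturnBuffer();                                   // unblocks the range
    assert(range.TryGetBufferForRead());
    return 0;
  }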

There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.

One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.

I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.

Testing:
* Ran core and exhaustive tests.

======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool

This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
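
An illustrative sketch of that divvying step, with a hypothetical
helper (the real scanner works in terms of whole I/O buffers and
additional per-column constraints, which this ignores):

  #include <cstdint>
  #include <iostream>
  #include <vector>

  std::vector<int64_t> DivvyReservation(
      int64_t total_reservation, const std::vector<int64_t>& col_bytes_on_disk) {
    int64_t total_bytes = 0;
    for (int64_t b : col_bytes_on_disk) total_bytes += b;
    std::vector<int64_t> per_col;
    for (int64_t b : col_bytes_on_disk) {
      // Larger columns get a proportionally larger share of the buffers.
      per_col.push_back(
          total_bytes == 0 ? 0 : total_reservation * b / total_bytes);
    }
    return per_col;
  }

  int main() {
    // E.g. 24 MB of reservation split across columns of 2, 6 and 16 MB.
    for (int64_t r : DivvyReservation(24 << 20, {2 << 20, 6 << 20, 16 << 20})) {
      std::cout << (r >> 20) << " MB\n";  // prints 2 MB, 6 MB, 16 MB
    }
    return 0;
  }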

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
161cbe30ff Revert IMPALA-4835 and dependent changes
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"

This reverts commit 25bcb258df.

Revert "IMPALA-6588: don't add empty list of ranges in text scan"

This reverts commit d57fbec6f6.

Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"

This reverts commit 24b4ed0b29.

Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"

This reverts commit 5699b59d0c.

Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"

This reverts commit 65680dc421.

Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-03 04:22:12 +00:00
Tim Armstrong
24b4ed0b29 IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-23 04:17:41 +00:00