mirror of https://github.com/apache/impala.git
synced 2026-02-02 06:00:36 -05:00
13bbff4e4e5fc5d459cc6f7a5512f84ceba897cd
6 Commits

e0a6e942b2
IMPALA-9955,IMPALA-9957: Fix not enough reservation for large pages in GroupingAggregator
The minimum requirement for a spillable operator is ((min_buffers - 2) * default_buffer_size) + 2 * max_row_size (see the sketch following this commit message). In the min reservation, we only reserve space for two large pages: one for reading and one for writing. However, to make the non-streaming GroupingAggregator work correctly, we have to manage these extra reservations carefully so that it does not run out of the min reservation when it actually needs to spill a large page or read a large page.

To be specific, how the large write page reservation is managed depends on whether needs_serialize is true or false:
- If the aggregator needs to serialize the intermediate results when spilling a partition, we have to save a large page worth of reservation for the serialize stream, in case it needs to write large rows. This space can be restored once all the partitions are spilled, since the serialize stream is not needed again until we build/repartition a spilled partition and thus have pinned partitions again. If the large write page reservation is used, we save it back whenever possible after we spill or close a partition.
- If the aggregator doesn't need the serialize stream at all, we can restore the large write page reservation whenever we fail to add a large row, before spilling any partitions, and reclaim it whenever possible after we spill or close a partition.

A special case is when we are processing a large row and it's the last row in building/repartitioning a spilled partition: the large write page reservation can be restored for it regardless of whether we need the serialize stream, because the partitions will be read out after this, so there is no further need to spill.

The large read page reservation is transferred to the spilled BufferedTupleStream that we are reading when building/repartitioning a spilled partition. The stream restores some of it when reading a large page and reclaims it when the output row batch is reset. Note that the stream is read in attach_on_read mode, so the large page is attached to the row batch's buffers and only gets freed when the row batch is reset.

Tests:
- Add tests in test_spilling_large_rows (test_spilling.py) with different row sizes to reproduce the issue.
- One test in test_spilling_no_debug_action becomes flaky after this patch. Revise the query to make the UDF allocate larger strings so it consistently passes.
- Run CORE tests.

Change-Id: I3d9c3a2e7f0da60071b920dec979729e86459775
Reviewed-on: http://gerrit.cloudera.org:8080/16240
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
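
A minimal sketch of the minimum-reservation formula quoted at the top of this message. This is not Impala source; the buffer count and sizes passed in main() are arbitrary illustrative values.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper, not the Impala implementation: computes the minimum
// reservation for a spillable operator as described above.
int64_t MinReservationBytes(int64_t min_buffers, int64_t default_buffer_size,
                            int64_t max_row_size) {
  // (min_buffers - 2) default-sized pages, plus two max-sized pages:
  // one large page for reading and one for writing.
  return (min_buffers - 2) * default_buffer_size + 2 * max_row_size;
}

int main() {
  // Illustrative values: 16 buffers of 64 KB with a 512 KB max row size.
  std::cout << MinReservationBytes(16, 64 * 1024, 512 * 1024) << " bytes\n";
  return 0;
}
```

In practice the real inputs would come from the min_spillable_buffer_size, default_spillable_buffer_size and max_row_size query options rather than hard-coded constants.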

0bb056e525
IMPALA-4224: execute separate join builds fragments
This enables parallel plans with the join build in a
separate fragment and fixes all of the ensuing fallout.
After this change, mt_dop plans with joins have separate
build fragments. There is still a 1:1 relationship between
join nodes and builders, so the builders are only accessed
by the join node's thread after it is handed off. This lets
us defer the work required to make PhjBuilder and NljBuilder
safe to be shared between nodes.
Planner changes:
* Combined the parallel and distributed planning code paths.
* Misc fixes to generate reasonable thrift structures in the
query exec requests, i.e. containing the right nodes.
* Fixes to resource calculations for the separate build plans.
** Calculate separate join/build resource consumption.
** Simplified the resource estimation by calculating resource
consumption for each fragment separately, and assuming that
all fragments hit their peak resource consumption at the
same time. IMPALA-9255 is the follow-on to make the resource
estimation more accurate.
Scheduler changes:
* Various fixes to handle multiple TPlanExecInfos correctly,
which are generated by the planner for the different cohorts.
* Add logic to colocate build fragments with parent fragments.
Runtime filter changes:
* Build sinks now produce runtime filters, which required
planner and coordinator fixes to handle.
DataSink changes:
* Close the input plan tree before calling FlushFinal() to release
resources. This depends on Send() not holding onto references
to input batches, which was true except for NljBuilder. This
invariant is documented.
Join builder changes:
* Add a common base class for PhjBuilder and NljBuilder with
functions to handle synchronisation with the join node.
* Close plan tree earlier in FragmentInstanceState::Exec()
so that peak resource requirements are lower.
* The NLJ always copies input batches, so that it can close
its input tree.
JoinNode changes:
* Join node blocks waiting for the build side to be ready,
then eventually signals that it's done, allowing the builder
to be cleaned up (see the sketch after this commit message).
* NLJ and PHJ nodes handle both the integrated builder and
the external builder. There is a 1:1 relationship between
the node and the builder, so we don't deal with thread safety
yet.
* Buffer reservations are transferred between the builder and join
node when running with the separate builder. This is not really
necessary right now, since it is all single-threaded, but will
be important for the shared broadcast.
- The builder transfers memory for probe buffers to the join node
at the end of each build phase.
- At the end of each probe phase, reservation needs to be handed
back to the builder (or released).
ExecSummary changes:
* The summary logic was modified to handle connecting fragments
via join builds. The logic is an extension of what was used
for exchanges.
Testing:
* Enable --unlock_mt_dop for end-to-end tests
* Migrate some tests to run as part of end-to-end tests instead of
custom cluster.
* Add mt_dop dimension to various end-to-end tests to provide
coverage of join queries, spill-to-disk and cancellation.
* Ran a single node TPC-H and TPC-DS stress test with mt_dop=0
and mt_dop=4.
Perf:
* Ran TPC-H scale factor 30 locally with mt_dop=0. No significant
change.
Change-Id: I4403c8e62d9c13854e7830602ee613f8efc80c58
Reviewed-on: http://gerrit.cloudera.org:8080/14859
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
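
A minimal sketch of the build/probe handoff described in the JoinNode changes above. The class and method names are illustrative assumptions, not the actual base-class interface shared by PhjBuilder and NljBuilder:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical handoff helper: the join node blocks until the build side is
// ready, and later signals that probing is done so the builder can clean up.
class BuildHandoff {
 public:
  // Called from the builder's fragment once the build side is complete.
  void MarkBuildReady() {
    std::lock_guard<std::mutex> l(mu_);
    build_ready_ = true;
    cv_.notify_all();
  }

  // Called by the join node; blocks until the build side is ready.
  void WaitForBuild() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return build_ready_; });
  }

  // Called by the join node when probing is finished, allowing the builder
  // to release its resources.
  void MarkProbeDone() {
    std::lock_guard<std::mutex> l(mu_);
    probe_done_ = true;
    cv_.notify_all();
  }

  // Called from the builder's fragment before it tears down the build side.
  void WaitForProbeDone() {
    std::unique_lock<std::mutex> l(mu_);
    cv_.wait(l, [this] { return probe_done_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool build_ready_ = false;
  bool probe_done_ = false;
};
```

Since the patch keeps a 1:1 relationship between join node and builder, this handoff is effectively single-threaded; the structure only becomes important for the shared broadcast mentioned above.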

548106f5e1
IMPALA-8451,IMPALA-8905: enable admission control for dockerised tests
This gives us some additional coverage for using admission control in a simple but realistic configuration.

What are the implications of this change for test stability and flakiness? On one hand, we are adding some more unpredictability to tests, because they may be queued for an arbitrary amount of time. On the other, we can prevent queries from contending over memory. Currently we rely on luck to prevent concurrent queries from forcing each other out of memory. I think the unpredictability from the queueing is preferable, because we can generally work around it by fixing tests that are sensitive to being queued, whereas contention over memory requires us to use crude workarounds like forcing tests to execute serially.

Added observability for the configured queue wait time for each pool. I noticed that I did not have a direct way to observe the effective value when I set the configs. This is IMPALA-8905.

I had to tweak tests in a few ways:
* Tests with large strings needed higher memory limits.
* Hardcoded instances of default-pool had to handle root.default as well.
* test_query_mem_limit needed to run without a mem_limit. I created a special pool, root.no-limits, with no memory limits to allow that.

Testing: Ran the dockerised build 5-6 times to flush out flaky tests.

Change-Id: I7517673f9e348780fcf7cd6ce1f12c9c5a55373a
Reviewed-on: http://gerrit.cloudera.org:8080/13942
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

d037ac8304
IMPALA-8818: Replace deque with spillable queue in BufferedPRS
Replaces DequeRowBatchQueue with SpillableRowBatchQueue in BufferedPlanRootSink. A few changes to BufferedPlanRootSink were necessary for it to work with the spillable queue; however, all the synchronization logic is the same.

SpillableRowBatchQueue is a wrapper around a BufferedTupleStream and a ReservationManager. It takes in a TBackendResourceProfile that specifies the max/min memory reservation the BufferedTupleStream can use to buffer rows. The 'max_unpinned_bytes' parameter limits the max number of bytes that can be unpinned in the BufferedTupleStream. The limit is a 'soft' limit because calls to AddBatch may push the amount of unpinned memory over the limit. The queue is non-blocking and not thread safe. It provides AddBatch and GetBatch methods. Calls to AddBatch spill if the BufferedTupleStream does not have enough reservation to fit the entire RowBatch.

Adds two new query options, 'MAX_PINNED_RESULT_SPOOLING_MEMORY' and 'MAX_UNPINNED_RESULT_SPOOLING_MEMORY', which bound the amount of pinned and unpinned memory that a query can use for spooling, respectively. MAX_PINNED_RESULT_SPOOLING_MEMORY must be <= MAX_UNPINNED_RESULT_SPOOLING_MEMORY in order to allow all the pinned data in the BufferedTupleStream to be unpinned. This is enforced in a new method in QueryOptions called 'ValidateQueryOptions'.

Planner Changes:
PlanRootSink.java now computes a full ResourceProfile if result spooling is enabled. The min mem reservation is bounded by the size of the read and write pages used by the BufferedTupleStream. The max mem reservation is bounded by 'MAX_PINNED_RESULT_SPOOLING_MEMORY'. The mem estimate is computed by estimating the size of the result set using stats.

BufferedTupleStream Re-Factoring:
For the most part, using a BufferedTupleStream outside an ExecNode works properly. However, some changes were necessary:
* The message for the MAX_ROW_SIZE error is ExecNode specific. In order to fix this, this patch introduces the concept of an ExecNode 'label', which is a more generic version of an ExecNode 'id'.
* The definition of TBackendResourceProfile lived in PlanNodes.thrift; it was moved to its own file so it can be used by DataSinks.thrift.
* Modified BufferedTupleStream so it internally tracks how many bytes are unpinned (necessary for 'MAX_UNPINNED_RESULT_SPOOLING_MEMORY').

Metrics:
* Added a few of the metrics mentioned in IMPALA-8825 to BufferedPlanRootSink. Specifically, added timers to track how much time is spent waiting in the BufferedPlanRootSink 'Send' and 'GetNext' methods.
* The BufferedTupleStream in the SpillableRowBatchQueue exposes several BufferPool metrics, such as the number of reserved and unpinned bytes.

Bug Fixes:
* Fixed a bug in BufferedPlanRootSink where the MemPool used by the expression evaluators was not being cleared incrementally.
* Fixed a bug where the inactive timer was not being properly updated in BufferedPlanRootSink.
* Fixed a bug where RowBatch memory was not freed if BufferedPlanRootSink::GetNext terminated early because it could not handle requests where num_results < BATCH_SIZE.

Testing:
* Added new tests to test_result_spooling.py.
* Updated errors thrown in spilling-large-rows.test.
* Ran exhaustive tests.

Change-Id: I10f9e72374cdf9501c0e5e2c5b39c13688ae65a9
Reviewed-on: http://gerrit.cloudera.org:8080/14039
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
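
To make the AddBatch/GetBatch behaviour described above concrete, here is a hedged sketch of a non-blocking queue that spills (unpins) when a batch does not fit in the pinned reservation and treats the unpinned cap as a soft limit. All types and names are simplified stand-ins, not the actual SpillableRowBatchQueue or BufferedTupleStream API:

```cpp
#include <cstdint>
#include <deque>
#include <stdexcept>

// Simplified stand-in for a RowBatch; only its serialized size matters here.
struct FakeRowBatch { int64_t byte_size = 0; };

class FakeSpillableQueue {
 public:
  FakeSpillableQueue(int64_t max_pinned_bytes, int64_t max_unpinned_bytes)
      : max_pinned_bytes_(max_pinned_bytes),
        max_unpinned_bytes_(max_unpinned_bytes) {
    // Mirrors the constraint enforced by ValidateQueryOptions: the pinned
    // limit must not exceed the unpinned limit, so all pinned data can be
    // unpinned.
    if (max_pinned_bytes_ > max_unpinned_bytes_) {
      throw std::invalid_argument("pinned limit must be <= unpinned limit");
    }
  }

  // Non-blocking add: spill (unpin) if the batch does not fit in the pinned
  // reservation. The unpinned cap is a soft limit checked up front, so a
  // single AddBatch may push the unpinned bytes past it.
  void AddBatch(const FakeRowBatch& batch) {
    if (unpinned_bytes_ > max_unpinned_bytes_) {
      throw std::runtime_error("unpinned result spooling limit exceeded");
    }
    if (pinned_bytes_ + batch.byte_size > max_pinned_bytes_) {
      unpinned_bytes_ += pinned_bytes_;  // spill everything currently pinned
      pinned_bytes_ = 0;
    }
    pinned_bytes_ += batch.byte_size;
    queue_.push_back(batch);
  }

  // Non-blocking get: returns false when the queue is empty. Accounting for
  // batches read back out is omitted for brevity.
  bool GetBatch(FakeRowBatch* out) {
    if (queue_.empty()) return false;
    *out = queue_.front();
    queue_.pop_front();
    return true;
  }

 private:
  std::deque<FakeRowBatch> queue_;
  int64_t max_pinned_bytes_;
  int64_t max_unpinned_bytes_;
  int64_t pinned_bytes_ = 0;
  int64_t unpinned_bytes_ = 0;
};
```

In the actual patch this accounting lives in the BufferedTupleStream and ReservationManager that the queue wraps, rather than in raw counters.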

f0b3d9d122
IMPALA-3916: Reserve SQL:2016 reserved words
This patch reserves SQL:2016 reserved words, excluding:
1. Impala builtin function names.
2. Time unit words (year, month, etc.).
3. An exception list based on a discussion.

Some test cases are modified to avoid these words. An impalad and catalogd startup option, reserved_words_version, is added. The words are reserved if the option is set to "3.0.0".

Change-Id: If1b295e6a77e840cf1b794c2eb73e1b9d2b8ddd6
Reviewed-on: http://gerrit.cloudera.org:8080/9096
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
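
A small hypothetical sketch of gating an extra reserved-word set behind a startup option such as reserved_words_version; this is not the Impala parser, and the word lists below are placeholders rather than Impala's actual sets:

```cpp
#include <string>
#include <unordered_set>

// Hypothetical check: extra SQL:2016 words are treated as reserved only when
// the startup option asks for the "3.0.0" behaviour.
bool IsReservedWord(const std::string& word,
                    const std::string& reserved_words_version) {
  static const std::unordered_set<std::string> kAlwaysReserved = {
      "select", "from", "where"};  // placeholder legacy keywords
  static const std::unordered_set<std::string> kSql2016Extra = {
      "current_path", "percentile_cont", "rollup"};  // placeholder additions
  if (kAlwaysReserved.count(word) > 0) return true;
  return reserved_words_version == "3.0.0" && kSql2016Extra.count(word) > 0;
}
```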

ed87c40600
IMPALA-3208: max_row_size option
Adds support for a "max_row_size" query option that instructs Impala to reserve enough memory to process rows of the specified size. For spilling operators, the planner reserves enough memory to process rows of this size. The advantage of this, compared to simply specifying larger values for min_spillable_buffer_size and default_spillable_buffer_size, is that operators may be able to handle larger rows without increasing the size of all their buffers.

The default value is 512KB. I picked that number because it doesn't increase minimum reservations *too* much even with smaller buffers like 64KB, but should be large enough for almost all reasonable workloads.

This is implemented in the aggs and joins using the variable page size support added to BufferedTupleStream in an earlier commit. The synopsis is that each stream requires reservation for one default-sized page per read and write iterator, and temporarily requires reservation for a max-sized page when reading or writing larger pages. The max-sized write reservation is released immediately after the row is appended, and the max-sized read reservation is released after advancing to the next row. The sorter and analytic simply use max-sized buffers for all pages in the stream.

Testing:
Updated existing planner tests to reflect the default max_row_size. Added new planner tests to test the effect of the query option. Added a "set" test to check validation of the query option. Added end-to-end tests exercising spilling operators with large rows, with and without spilling induced by SET_DENY_RESERVATION_PROBABILITY.

Change-Id: Ic70f6dddbcef124bb4b329ffa2e42a74a1826570
Reviewed-on: http://gerrit.cloudera.org:8080/7629
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
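
A hedged sketch of the reservation pattern described above: a max-sized write reservation is taken only while appending a row larger than the default page, and released immediately afterwards. The reservation tracker and constants are illustrative stand-ins, not the BufferedTupleStream API:

```cpp
#include <cstdint>
#include <stdexcept>

// Illustrative stand-in for a buffer pool reservation tracker.
class FakeReservation {
 public:
  explicit FakeReservation(int64_t limit) : limit_(limit) {}
  bool TryIncrease(int64_t bytes) {
    if (used_ + bytes > limit_) return false;
    used_ += bytes;
    return true;
  }
  void Decrease(int64_t bytes) { used_ -= bytes; }

 private:
  int64_t limit_;
  int64_t used_ = 0;
};

constexpr int64_t kDefaultPageSize = 64 * 1024;  // default_spillable_buffer_size (illustrative)
constexpr int64_t kMaxRowSize = 512 * 1024;      // max_row_size default described above

// Appending a row that does not fit in a default-sized page: temporarily take
// a max-sized write reservation, append, then release it right away.
void AppendLargeRow(FakeReservation* reservation, int64_t row_size) {
  if (row_size <= kDefaultPageSize) return;  // fits in the default write page
  if (row_size > kMaxRowSize) {
    throw std::runtime_error("row exceeds max_row_size");
  }
  if (!reservation->TryIncrease(kMaxRowSize)) {
    throw std::runtime_error("not enough reservation for a max-sized page");
  }
  // ... append the row to a max-sized page here ...
  reservation->Decrease(kMaxRowSize);  // released immediately after the append
}
```

As noted above, the sorter and analytic sidestep this temporary grab/release pattern by simply using max-sized buffers for every page in the stream.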