The Sorter's memory management logic failed to correctly manage buffers
when spilling. It would try to make use of all buffers in the system,
neglecting to account for other operators' buffer usage.
This patch adjusts the logic so that it handles contention for buffers
so long as it can get enough buffers to make progress. Instead of
precalculating the number of buffers it thinks it should be able to
pin, it just makes a best-effort attempt to pin the initial buffers
of as many runs as possible, up to a limit. As long as it can pin three
runs, it can make progress.
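In rough C++ form, the new approach looks like the sketch below; the
types and TryPinInitialBuffers() are illustrative stand-ins, not the
actual Sorter API:

    #include <vector>

    // Illustrative stand-in; the real sorter pins BufferedBlockMgr blocks.
    struct Run {
      bool TryPinInitialBuffers() { return true; }  // Best-effort; may fail.
    };

    // Merging can make progress with only a small minimum of pinned runs.
    constexpr int MIN_PINNED_RUNS = 3;

    // Pin as many runs as possible up to 'max_runs', stopping at the first
    // failure instead of precalculating how many buffers are available.
    bool PinRunsBestEffort(std::vector<Run>& runs, int max_runs, int* num_pinned) {
      *num_pinned = 0;
      for (Run& run : runs) {
        if (*num_pinned >= max_runs || !run.TryPinInitialBuffers()) break;
        ++*num_pinned;
      }
      return *num_pinned >= MIN_PINNED_RUNS;  // Enough to merge?
    }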
Testing:
Added a test that failed with an out-of-memory error before the patch.
An analytic function test that was meant to fail started succeeding,
so the memory limit there had to be adjusted as well.
Change-Id: Idfe55cc13c7f2b54cba1d05ade44cbcf6bb573c0
Reviewed-on: http://gerrit.cloudera.org:8080/2908
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Clarify relationships between classes and clean up the previous mess
where every class was friends with every other, so that there is an
actual distinction between public and private members. TupleIterator
is no longer tied to TupleSorter, only to Run.
Document and enforce invariants in many cases.
Factor out some functions from large functions.
Simplify and document iterator logic.
Make buffer management when iterating over the output stream more
explicitly correct: either use MarkNeedToReturn() or attach the block
to the batch, as appropriate. The SortedRunMerger didn't handle
resource transfer correctly except when all the memory came from
the batch's MemPool. This patch fixes the case where resources
are attached to the batches, but not the 'need_to_return' case.
Document that SortedRunMerger requires 'deep_copy_input' to be true
if batches can have the 'need_to_return' flag set.
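The two hand-off paths, sketched with simplified stand-ins (only
MarkNeedToReturn() is taken from the text above; Block, RowBatch and
HandOffBlock() here are illustrative):

    #include <vector>

    struct Block {};  // Stand-in for a block mgr block holding row data.

    struct RowBatch {
      // Transfer block ownership: its buffer is freed when the batch is reset.
      void AddBlock(Block* b) { attached.push_back(b); }
      // Caller must return this batch up the tree before we touch the block again.
      void MarkNeedToReturn() { need_to_return = true; }
      bool need_to_return = false;
      std::vector<Block*> attached;
    };

    void HandOffBlock(Block* block, RowBatch* batch, bool done_with_block) {
      if (done_with_block) {
        batch->AddBlock(block);     // Rows fully consumed: attach the block.
      } else {
        batch->MarkNeedToReturn();  // Rows still point into the block.
      }
    }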
Also use the atomic block exchange operation when moving between
blocks in unpinned runs to prevent pin failures at that point.
I have explicitly avoided changing the hairy block management logic
for allocating buffers when merging; that will need to be addressed
in a follow-up patch.
Add a SpilledRuns counter so that it's more explicit that spilling
occurred.
Testing:
Added some tests for corner cases with empty and NULL strings.
Fixed a test that previously failed with OOM but now succeeds.
Performance:
Benchmarking against the old code initially revealed some regressions
from changes in inlining. Force-inlining TupleComparator::operator()
and the iterator Next()/Prev() functions helped, and performance is
similar or slightly better on the targeted order-by benchmarks.
Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a
Reviewed-on: http://gerrit.cloudera.org:8080/2826
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
The Sorter can have runaway memory consumption because it never
frees local allocations made in comparator_.Less(). In addition,
it doesn't check for errors generated during expression evaluation,
so it may keep sorting even after failures have occurred.
This change fixes the problem by freeing local allocations after
every n invocations of comparator_.Less(), where n is the row
batch size specified in the query options. Various error checks
are also added to return early if any error is encountered.
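A minimal sketch of the cleanup cadence, assuming simplified types
(the real code frees allocations through the expression contexts):

    #include <cstdint>

    struct ExprContext {
      void FreeLocalAllocations() {}  // Stub: frees "local" expr memory.
    };

    struct TupleComparator {
      ExprContext* ctx;
      int64_t batch_size;  // Row batch size from the query options.
      int64_t num_invocations = 0;

      bool Less(const void* lhs, const void* rhs) {
        // ... evaluate ordering exprs; this may make local allocations ...
        if (++num_invocations % batch_size == 0) {
          ctx->FreeLocalAllocations();  // Bound growth during long sorts.
        }
        return false;  // Placeholder comparison result.
      }
    };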
Change-Id: I941729b4836e5dbb827d4313a0b45bc5df2fa8e1
Reviewed-on: http://gerrit.cloudera.org:8080/3116
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Add missing coverage for sorting by CHAR and VARCHAR.
Add more coverage for spilling sorts.
Fix spilling tests: ensure that they actually reliably spill (many of
them had memory limits high enough that they could run entirely in
memory).
I ran this in a loop for a while to flush out flaky tests. The tests
should be fairly predictable given that they're not run concurrently
with other tests and we allocate enough block manager memory so that
each operator can obtain its reservation.
Change-Id: Ia2d2627a2c327dcdf269ea3216385b1af9dfa305
Reviewed-on: http://gerrit.cloudera.org:8080/2877
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
The test was recently reenabled in commit
71a0a7d998702781ae44270f8c742b10c34c0efc.
Continue running the test but loosen the memory limit and don't check
the runtime profile. The memory limits for this set of tests need
revisiting in any case.
Change-Id: I195e8ad3b67c8ff85d5d15c2646a13f5feb57553
Reviewed-on: http://gerrit.cloudera.org:8080/2183
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 51632f39a45ba9deac9b86bbdb14ff10cbee35ac)
A couple of tests were disabled because of IMPALA-1305. Now that the fix
is in, those tests can be reenabled. I ran them in a loop to make sure
that they weren't flaky.
Also fix the spelling mistake in the file name.
Change-Id: I1bfcc619911a92d93b871be3a14852aa11f78da9
Reviewed-on: http://gerrit.cloudera.org:8080/2150
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
A failed test case inside a test file leaves the rest of
the test cases in the file unexecuted. Some test cases
modify query options such as the memory limit and rely on
subsequent test cases in the same file to restore them.
If such a test case fails, the query options are left
modified, causing cascading failures in other test cases
which aren't expected to run with the modified query
options (e.g. a lowered memory limit). This problem has led
to broken builds, recorded in IMPALA-2724 and IMPALA-2824.
This change fixes the problem by checking whether a test
case modified any query options and, if so, restoring the
modified query options to their default values. This change
assumes that a test should not modify an option specified
in its test vector, so it is safe to restore the modified
query options to their default values.
Change-Id: Ib88d1dcb6a65183e1afc8eef0c764179a9f6a8ce
Reviewed-on: http://gerrit.cloudera.org:8080/1774
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
When building hash tables for the build side in a partitioned
hash join or aggregation, we evaluate the build- or probe-
side expressions to compute the hash values for each TupleRow.
Evaluation of certain expressions (e.g. CastToChar) requires
"local" memory allocation. "Local" memory allocations are
supposed to be freed after processing each row batch.
However, the calls to free local allocations are missing in
PartitionedHashJoinNode::BuildHashTableInternal() and
PartitionedAggregationNode::ProcessStream(). This causes all
"local" memory allocations to accumulate, potentially for the
entire duration of the query or until GetNext() is called.
This may lead to unnecessary memory allocation failures as the
memory limit is exceeded.
This patch calls ExecNode::FreeLocalAllocations() at least once
per row batch when building hash tables. It also adds the missing
checks for the query status in the hash table build loop.
Note that QueryMaintenance() isn't called because of the
overhead of its memory limit checks.
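A simplified sketch of the build loop after this patch; Status,
NextBatch() and InsertBatch() are stand-ins for the real APIs:

    struct Status {
      bool ok = true;
    };
    struct RowBatch {};

    struct BuildSide {
      Status query_status;
      void FreeLocalAllocations() {}  // Frees per-batch "local" expr memory.
      Status NextBatch(RowBatch* batch, bool* eos) { *eos = true; return {}; }
      Status InsertBatch(const RowBatch& batch) { return {}; }  // Hashes rows.

      Status BuildHashTable() {
        bool eos = false;
        while (!eos) {
          RowBatch batch;
          Status s = NextBatch(&batch, &eos);
          if (!s.ok) return s;
          s = InsertBatch(batch);  // Evaluates build exprs per row.
          if (!s.ok) return s;
          FreeLocalAllocations();  // At least once per row batch.
          if (!query_status.ok) return query_status;  // Previously missing.
        }
        return {};
      }
    };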
Change-Id: Idbeab043a45b0aaf6b6a8c560882bd1474a1216d
Reviewed-on: http://gerrit.cloudera.org:8080/1448
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This fixes a regression introduced with:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce
min mem needed for PAGG/PHJ
Prior to that change, as soon as any partition's stream overflowed
its small buffers, all partitions' streams would be switched
immediately to IO-buffers, which would be satisfied by the initial
buffer "reservation".
After that change, individual streams are switched to IO-buffers on
demand as they overflow their small buffers. However, that change
also made it so that Partition::Spill() would eagerly switch that
partition's streams to IO-buffers, and fail the query if the buffer
is not available. The buffer may not be available because the
reserved buffers may be in use by other partitions' streams.
We don't need to fail the query if the switch to IO-buffers in
Partition::Spill() fails. Instead, we should just let the streams
switch on demand as they fill up the small buffers. When that
happens, if the IO buffer is not available, then we already have a
mechanism to pick partitions to spill until we can get the IO-buffer
(in the worst case it means working our way back down to the initial
reservation). See AppendRowStreamFull() and BuildHashTables().
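A rough sketch of that on-demand path, with stub types standing in
for the real stream and spilling code:

    struct Status {
      bool ok = true;
    };

    struct Stream {
      bool using_small_buffers = true;
      bool SwitchToIoBuffers() { return true; }  // Stub: may fail under pressure.
    };

    bool SpillAnotherPartition() { return true; }  // Stub: frees an IO buffer.

    // Called when a stream overflows its small buffers. Rather than failing
    // the query, spill partitions until an IO buffer can be obtained (in the
    // worst case working back down to the initial reservation).
    Status AppendRowStreamFull(Stream* stream) {
      while (!stream->SwitchToIoBuffers()) {
        if (!SpillAnotherPartition()) return Status{false};  // Truly OOM.
      }
      stream->using_small_buffers = false;
      return Status{};
    }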
The symptom of this regression was that some queries would fail at a
lower memory limit than before.
Also revert the max_block_mgr_memory values back to their originals.
Additional testing: loop custom_cluster/spilling.py. We should also
remeasure minimum memory required by queries after this change.
Change-Id: I11add15540606d42cd64f2af99f4e96140ae8bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1228
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
The sorter was dropping the returned Status of the PrepareRead()
calls on the floor. PrepareRead() tries to Pin() blocks. In some
queries with large sorts, those Pin() calls could fail with OOM,
but because the sorter was ignoring the returned Status it would
happily put the unpinned block in the vector of blocks and eventually
seg fault, because the buffer_desc_ of that block was NULL.
This patch fixes the problem (see the sketch below) and adds a test
that we may eventually want to move to the exhaustive build because
it takes quite some time. It also changes the comments of the sorter
class to the doxygen style.
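The fix follows the usual pattern of propagating the Status;
RETURN_IF_ERROR is Impala's standard macro, while the types here
are stubs:

    struct Status {
      bool ok = true;
      static Status OK() { return Status{}; }
    };

    #define RETURN_IF_ERROR(stmt)  \
      do {                         \
        Status _s = (stmt);        \
        if (!_s.ok) return _s;     \
      } while (false)

    struct Run {
      Status PrepareRead() { return Status::OK(); }  // Pins blocks; can OOM.
    };

    Status StartRead(Run* run) {
      // Before: run->PrepareRead();  // Status dropped; an unpinned block with
      //                              // a NULL buffer_desc_ later seg faulted.
      RETURN_IF_ERROR(run->PrepareRead());  // After: fail the query cleanly.
      return Status::OK();
    }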
Change-Id: Icad48bcfbb97a68f2d51b015a37a7345ebb5e479
Reviewed-on: http://gerrit.cloudera.org:8080/1156
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
The commit:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ
recently lowered a couple of limits from 100m to 40m, which appears to
have been too aggressive based on occasional GVM failures. Let's bump
them back up.
Change-Id: I2c3cc24841cf3a305785890329d77e4e9e74f6e5
Reviewed-on: http://gerrit.cloudera.org:8080/1125
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
PAGG and PHJ were using an all-or-nothing approach wrt spilling. In
particular, they were trying to switch to IO-sized buffers for both
streams (aggregated and unaggregated in PAGG; build and probe in PHJ)
of every partition (currently 16 partitions, for a total of 32
streams), even if some of those streams had very few rows, were
empty, or simply would never spill, so there was no need to allocate
IO-buffers for them. That was increasing the min mem needed by those
operators in many queries.
This patch decouples the decision to switch to IO-buffers for each
stream of each partition. Streams will switch to IO-sized buffers
whenever the rows they contain do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both of its streams to IO-buffers.
With this change many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the min mem requirement. For example,
below is the min mem needed (in MBs) for some of the TPC-H queries.
Some now need half or less of the memory they needed before:
TPC-H Q3: 645 -> 240
TPC-H Q5: 375 -> 245
TPC-H Q7: 685 -> 265
TPC-H Q8: 740 -> 250
TPC-H Q9: 650 -> 400
TPC-H Q18: 1100 -> 425
TPC-H Q20: 420 -> 250
TPC-H Q21: 975 -> 620
To make this small-buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call of
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. Previously
we would treat that as an indication that there was no memory left,
start spilling a partition, and switch all streams to IO-buffers.
Now we make a best effort, first trying SwitchToIoBuffers() and, if
that is successful, re-attempting the AllocateRow() call. See
IMPALA-2352 for more details.
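A sketch of that best-effort retry, with stub types in place of the
real stream:

    struct Stream {
      bool using_small_buffers = true;
      char* AllocateRow(int size) { return nullptr; }  // Stub: may fail.
      bool SwitchToIoBuffers() { return true; }        // Stub: may also fail.
    };

    // Returns the allocated row, or nullptr if we are genuinely out of memory
    // and the caller should start spilling a partition.
    char* AllocateRowBestEffort(Stream* stream, int size) {
      char* row = stream->AllocateRow(size);
      if (row == nullptr && stream->using_small_buffers) {
        // The small buffers may merely be exhausted; try upgrading the stream
        // before concluding that memory has run out.
        if (stream->SwitchToIoBuffers()) row = stream->AllocateRow(size);
      }
      return row;
    }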
Another change is that SwitchToIoBuffers() will now reset the flag
using_small_buffers_ back to false in case we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.
With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271, which are essentially stream::using_small_buffers_-related
DCHECKs.
This patch adds all 22 TPC-H queries to the test_mem_usage_scaling
test and updates the per-query min mem limits in it. Additionally, it
adds a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some dead
test code.
Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
Conflicts:
tests/query_test/test_aggregation.py
The change of the PARTITION_FANOUT from 32 to 16 exposed a pathological case due to
the lack of coordination across concurrently executing spilling nodes of the same query.
In particular, when we repartition a partition we try to initialize hash tables for the
new partitions. But each hash table needs a block (for its nodes). If there were not
any IO-sized blocks available, because they had been consumed by other nodes, we would get
into a loop of repartitioning those smaller partitions that couldn't initialize their
hash tables. Each additional repartitioning would, among other things, need additional
blocks for the new streams. These partitions would end up being very small, yet we would
still fail the query upon reaching the MAX_PARTITION_DEPTH limit, which was fixed at 4.
This patch fixes the problem by initializing the hash tables during repartitions with
small pages. That is, the hash tables always first use a 64KB and a 512KB block for their
nodes before switching to IO-sized blocks. This helps the partitioning algorithm to
finish when we end up with partitions that can fit in those small pages. The performance
may not be optimal, but memory consumption is lower and the algorithm finishes. For
example, without this patch and with PARTITION_FANOUT == 16 in order to run TPC-H Q18 and
Q20 we needed 3.4GB and 3.1GB respectively. With this patch TPC-H Q18 needs ~1GB and Q20
975MB.
This patch also removes the restriction of stopping repartitioning when we reach
4 levels of repartitioning. Instead, whenever we repartition we compare the size of
the input partition to the size of the largest new partition. If there is no reduction
in size we stop the algorithm. Otherwise, we keep on repartitioning. That should
help in cases of skew (e.g. due to bad hashing). There is a new MAX_PARTITION_DEPTH limit
of 16. It is very unlikely we will ever hit this limit.
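The new stopping rule, sketched with illustrative names and types:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    constexpr int MAX_PARTITION_DEPTH = 16;  // Safety net; unlikely to be hit.

    // After repartitioning 'input_size' bytes, continue only if the largest
    // new partition is actually smaller than its input (i.e. we made progress
    // despite possible skew or bad hashing).
    bool ShouldContinueRepartitioning(int64_t input_size,
                                      const std::vector<int64_t>& child_sizes,
                                      int depth) {
      if (depth >= MAX_PARTITION_DEPTH || child_sizes.empty()) return false;
      int64_t largest =
          *std::max_element(child_sizes.begin(), child_sizes.end());
      return largest < input_size;  // No reduction in size: stop.
    }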
Change-Id: Ib33fece10585448bc2d07bb39d0535d78b168ccc
Reviewed-on: http://gerrit.cloudera.org:8080/119
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
Our .test file parser did not abort tests when there was a malformed
test/section. This patch changes that behavior to report an error and
treat the test as failed.
Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.
Arguably, the test file parser should be more flexible about where it
accepts comments, but this patch does not address that problem.
Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
In cases where we had to spill the probe side of PHJs, we were not only appending
the probe row to the tuple stream to be spilled, but we were also entering the
regular processing loop with the iterator set to End(). In the case of left anti
and left outer joins, the result was to incorrectly output this row, since it did
not have a match.
This bug had a small perf impact for all spilling joins because we were doing an
unnecessary loop for each probe row we had to spill.
This patch solves the problem by immediately moving on to the next probe row if the
current row is spilled, as sketched below. Additionally, it fixes a bug in the block
mgr where there was a code path on which we were not counting the number of pinned
buffers correctly. It also adds tpch-q21 to the set of queries run in the spilling
test.
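The shape of the fixed probe loop, sketched with stand-in types:

    #include <vector>

    struct TupleRow {};
    struct Stream {
      bool AddRow(TupleRow* row) { return true; }  // Stub: spill stream.
    };

    void ProcessProbeBatch(std::vector<TupleRow*>& rows, Stream* spill_stream,
                           bool partition_is_spilled) {
      for (TupleRow* row : rows) {
        if (partition_is_spilled) {
          spill_stream->AddRow(row);
          continue;  // Fix: skip straight to the next probe row. Previously we
                     // fell into the match loop with the iterator at End(),
                     // which left anti/outer joins treated as "no match" and
                     // incorrectly output the row.
        }
        // ... regular hash table probe for 'row' ...
      }
    }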
Change-Id: I762f5c41fe468e4485a4b31dabe2e53f6b49ae24
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5313
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5334
Small buffers introduced an issue that is exacerbated by the large fanout. A stream is
only guaranteed to accept appends indefinitely once it has grabbed its initial IO-sized
buffer. With small buffers, we don't grab that buffer at the beginning anymore and,
before this patch, it was grabbed when the stream first needed it. This means that when
one stream needs it, another stream could have already grabbed it (meaning that stream
is pinned with multiple buffers).
This patch has all the streams grab an IO buffer as soon as the first stream needs an
IO buffer. This guarantees that all streams get one before any gets two.
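A sketch of the "all get one before any gets two" rule, with stub
types:

    #include <vector>

    struct Stream {
      bool has_io_buffer = false;
      bool SwitchToIoBuffers() { has_io_buffer = true; return true; }  // Stub.
    };

    // When the first stream outgrows its small buffers, upgrade every stream,
    // so each holds one IO buffer before any stream can hold two.
    bool GrabIoBuffersForAllStreams(std::vector<Stream*>& streams) {
      for (Stream* stream : streams) {
        if (!stream->has_io_buffer && !stream->SwitchToIoBuffers()) return false;
      }
      return true;
    }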
Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
Update PA/PHJ to use small (< IO-sized) buffers initially. Without this we would
not be able to run at the QPS that we need just due to the buffering requirements
of these operators.
Change-Id: Ic8a777d147893567c9590fbab17f561eadb6ee19
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4623
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
This was always a TODO. We want memory to come from the block mgr and trigger spilling.
Change-Id: I07f1f79fbbb33068fb2df64510a80a9b008ef73d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4466
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
The previous code did not handle well the case where spilling happens while
building the hash tables (i.e. the partitioned build rows fit in memory). This caused the
probe partition to be starved, causing queries that should have been able to run
to fail with a "not enough buffers" error.
Change-Id: I3a9a84e8800a72ed3ce6f5ab7ff03bc2d6eb7ad8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4403
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
This patch fixes two issues:
- Add an API to the buffered block mgr to allow an atomic Unpin and GetNewBlock. This
has the semantics of unpinning a block and giving its buffer to the new block. This
is necessary for the tuple stream, to make sure another thread does not grab the
unpinned block's buffer in between.
- Fix buffer management when reading an unpinned stream. Before moving on to a new
block (and unpinning the current one), we need to make sure all the tuples returned
from the current block have been returned up the operator tree.
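A sketch of the atomic-exchange semantics of the first fix; this
BlockMgr is a stub, not the real BufferedBlockMgr interface:

    #include <mutex>

    struct Buffer {};
    struct Block {
      Buffer* buffer = nullptr;
      bool pinned = false;
    };

    struct BlockMgr {
      std::mutex lock;

      // Atomically unpin 'old_block' and hand its buffer to a fresh block, so
      // no other thread can grab the freed buffer between the two steps.
      Block* UnpinAndGetNewBlock(Block* old_block) {
        std::lock_guard<std::mutex> guard(lock);
        Block* new_block = new Block();
        new_block->buffer = old_block->buffer;  // Buffer changes hands here.
        new_block->pinned = true;
        old_block->buffer = nullptr;
        old_block->pinned = false;
        return new_block;
      }
    };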
Change-Id: I95ee58d1019dd971f6a7dc19ecafdfa54cdbf942
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4333
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>