Commit Graph

28 Commits

Author SHA1 Message Date
Tim Armstrong
ee53ddb389 IMPALA-1346/1590/2344: fix sorter buffer mgmt when spilling
The Sorter's memory management logic failed to correctly manage buffers
when spilling. It would try to make use of all buffers in the system,
neglecting to account for other operators' buffer usage.

This patch adjusts the logic so that it handles contention for buffers
so long as it can get enough buffers to make progress. Instead of
precalculating the number of buffers it thinks it should be able to
pin, it just makes a best-effort attempt to pin the initial buffers of
as many runs as possible, up to a limit. As long as it can pin three
runs, it can make progress.
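
The best-effort pinning described above can be sketched roughly as follows. This is a minimal Python illustration, not the Sorter's actual C++ code: `try_pin` is a hypothetical callback standing in for the block manager's pin operation, and `pin_runs_best_effort`, `MIN_RUNS`, and `max_runs` are names invented for this example. It also assumes at least three runs are being merged.

```python
MIN_RUNS = 3  # per the commit message, three pinned runs are enough to make progress


def pin_runs_best_effort(runs, try_pin, max_runs):
    """Pin the initial buffer of as many runs as possible, up to max_runs.

    try_pin(run) is a hypothetical callback that returns True if a buffer
    was obtained for the run. No buffer count is precomputed: pinning simply
    stops at the first failure.
    """
    pinned = []
    for run in runs[:max_runs]:
        if not try_pin(run):
            break  # other operators may be holding the remaining buffers
        pinned.append(run)
    if len(pinned) < MIN_RUNS:
        raise MemoryError("could not pin enough runs to make progress")
    return pinned
```

The point of the design is that contention from other operators only reduces the merge width; it does not fail the sort unless fewer than three runs can be pinned.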

Testing:
Added an additional test that failed before the patch without OOM.
An analytic function test that was meant to fail also started succeeding
so I had to adjust the limit there too.

Change-Id: Idfe55cc13c7f2b54cba1d05ade44cbcf6bb573c0
Reviewed-on: http://gerrit.cloudera.org:8080/2908
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-06 17:34:07 -07:00
Tim Armstrong
37ec25396f IMPALA-3344: Simplify sorter and document/enforce invariants.
Clarify relationships between classes, clean up the previous mess
where every class was friends with the other so there's an actual
distinction between public and private members. TupleIterator
is now no longer tied to TupleSorter, just Run.

Document and enforce invariants in many cases.

Factor out some functions from large functions.

Simplify and document iterator logic.

Make management of buffers when iterating over output stream more
explicitly correct: either use MarkNeedToReturn() or attach block
to the batch as appropriate. The SortedRunMerger didn't handle
resource transfer correctly, except if all the memory came from
the batch's MemPool. This patch fixes the cases when resources
are attached to the batches, but not the 'need_to_return' case.
Document that SortedRunMerger requires 'deep_copy_input' to be true
if batches can have the 'need_to_return' flag set.

Also use the atomic block exchange operation when moving between
blocks in unpinned runs to prevent pin failures at that point.
I explicitly have avoided changing the hairy block management logic
when allocating buffers for merging, that will need addressing in
a follow-up patch.

Add a SpilledRuns counter so that it's more explicit that spilling
occurred.

Testing:
Added some tests for corner cases with empty and NULL strings.
Fixed a test that previously failed with OOM but now succeeds.

Performance:
Benchmarking against old code initially revealed some regressions from
changes in inlining. Force inlining the TupleComparator::operator() and
iterator Next()/Prev() functions helped and performance seems similar or
slightly better on the targeted orderby benchmarks.

Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a
Reviewed-on: http://gerrit.cloudera.org:8080/2826
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-02 21:33:08 -07:00
Michael Ho
f7501d2ec1 IMPALA-3332: Free local allocations in sorter.
Sorter can have runaway memory consumption as it never frees
local allocations made in comparator_.Less(). In addition, it
doesn't check for errors generated during expression evaluation
so it may keep sorting even after failures have occurred.

This change fixes the problem by freeing local allocations for
every n invocations of comparator_.Less() where n is the row
batch size specified in the query options. Various error checks
are also added to return early if any error is encountered.
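
A rough Python sketch of the "free every n comparisons" idea follows. The class name `CountingComparator` and the `free_local_allocations` hook are hypothetical stand-ins for illustration; the real change lives in the C++ sorter and is not reproduced here.

```python
class CountingComparator:
    """Wraps a comparison function and frees scratch memory every n calls."""

    def __init__(self, less_fn, free_local_allocations, batch_size):
        self._less_fn = less_fn
        self._free = free_local_allocations  # hypothetical cleanup hook
        self._batch_size = batch_size        # n: the row batch size query option
        self._calls = 0

    def less(self, lhs, rhs):
        result = self._less_fn(lhs, rhs)
        self._calls += 1
        # Free accumulated local allocations once per batch_size invocations
        # so that memory use stays bounded during a long sort.
        if self._calls % self._batch_size == 0:
            self._free()
        return result
```

Freeing on every call would also bound memory, but amortizing the cleanup over a batch keeps the per-comparison overhead low.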

Change-Id: I941729b4836e5dbb827d4313a0b45bc5df2fa8e1
Reviewed-on: http://gerrit.cloudera.org:8080/3116
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:18 -07:00
Tim Armstrong
34c95c9590 IMPALA-2345,2991: test coverage for spilling and sorts
Add missing coverage for sorting by CHAR and VARCHAR.

Add more coverage for spilling sorts.

Fix spilling tests: ensure that they actually reliably spill (many of
them had memory limits high enough that they could run entirely in
memory).

I ran this in a loop for a while to flush out flaky tests. The tests
should be fairly predictable given that they're not run concurrently
with other tests and we allocate enough block manager memory so that
each operator can obtain its reservation.

Change-Id: Ia2d2627a2c327dcdf269ea3216385b1af9dfa305
Reviewed-on: http://gerrit.cloudera.org:8080/2877
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:55 -07:00
Tim Armstrong
212bea529f IMPALA-2994: Temporary workaround for flaky spilling test
The test was recently reenabled in commit
71a0a7d998702781ae44270f8c742b10c34c0efc.

Continue running the test but loosen the memory limit and don't check
the runtime profile. The memory limits for this set of tests need
revisiting in any case.

Change-Id: I195e8ad3b67c8ff85d5d15c2646a13f5feb57553
Reviewed-on: http://gerrit.cloudera.org:8080/2183
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 51632f39a45ba9deac9b86bbdb14ff10cbee35ac)
2016-02-17 20:21:57 -08:00
Tim Armstrong
1c102d9d8e Reenable tests that were disabled for IMPALA-1305
A couple of tests were disabled because of IMPALA-1305. Now that the fix
is in, those tests can be reenabled. I ran them in a loop to make sure
that they weren't flaky.

Also fix the spelling mistake in the file name.

Change-Id: I1bfcc619911a92d93b871be3a14852aa11f78da9
Reviewed-on: http://gerrit.cloudera.org:8080/2150
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-13 10:08:13 +00:00
Michael Ho
968c61c940 IMPALA-2824: Restore query options after each test.
A failed test case inside a test file will leave the rest of
the test cases in the file unexecuted. Some test cases may
modify some query options such as memory limit and then
restore them in the subsequent test cases in the same file.
The failure of those test cases will leave the query options
modified, causing cascading failures to other test cases
which aren't expected to be run with the modified query
options (e.g. lowered memory limit). This problem may lead
to broken builds which are recorded in IMPALA-2724 and
IMPALA-2824.

This change fixes the problem above by checking if a test
case modifies any query option and, if so, restoring those
modified query options to their default values. This change
makes the assumption that a test should not modify an option
specified in its test vector so it's safe to restore the
modified query options to their default values.
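
The restore step can be sketched as below. This is a simplified Python illustration under stated assumptions: the test-case dictionary layout (`"query"`, `"set_options"`), the `execute` callback, and both function names are hypothetical and do not come from the test framework itself.

```python
def restore_modified_options(current_options, default_options, modified_names):
    """Reset each option a test case changed back to its default value."""
    for name in modified_names:
        if name in default_options:
            current_options[name] = default_options[name]
        else:
            current_options.pop(name, None)  # option had no default: unset it
    return current_options


def run_test_case(test_case, current_options, default_options, execute):
    """Run one test case and always restore the options it modified."""
    modified = test_case.get("set_options", {})
    current_options.update(modified)
    try:
        return execute(test_case["query"], current_options)
    finally:
        # Runs even if the test case fails, so later test cases never
        # inherit a lowered memory limit or other leftover setting.
        restore_modified_options(current_options, default_options, modified)
```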

Change-Id: Ib88d1dcb6a65183e1afc8eef0c764179a9f6a8ce
Reviewed-on: http://gerrit.cloudera.org:8080/1774
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 03:13:05 +00:00
Michael Ho
ba0bd1d0da IMPALA-2612: Free local allocations once for every row batch when building hash tables.
When building hash tables for the build side in partitioned
hash join or aggregation, we will evaluate the build or probe
side expressions to compute the hash values for each TupleRow.
Evaluation of certain expressions (e.g. CastToChar) requires
"local" memory allocation. "Local" memory allocation is supposed
to be freed after processing each row batch.

However, the calls to free local allocations are missing in
PartitionedHashJoinNode::BuildHashTableInternal() and
PartitionedAggregationNode::ProcessStream(). This causes all
"local" memory allocation to accumulate potentially for the
entire duration of the query or until GetNext() is called.
This may lead to unnecessary memory allocation failure as
memory limit is exceeded.

This patch calls ExecNode::FreeLocalAllocations() at least once
per row-batch when building hash tables. It also adds the missing
checks for the query status in the loop building hash tables.
Please note that QueryMaintenance() isn't called due to its
overhead in memory limit checks.

Change-Id: Idbeab043a45b0aaf6b6a8c560882bd1474a1216d
Reviewed-on: http://gerrit.cloudera.org:8080/1448
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2015-11-26 03:21:46 +00:00
Dan Hecht
84c4c2ce86 IMPALA-2480, IMPALA-2519: Don't force IO-buffer on probe side when spilling PHJ
This fixes a regression introduced with:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce
min mem needed for PAGG/PHJ

Prior to that change, as soon as any partition's stream overflowed
its small buffers, all partitions' streams would be switched
immediately to IO-buffers, which would be satisfied by the initial
buffer "reservation".

After that change, individual streams are switched to IO-buffers on
demand as they overflow their small buffers.  However, that change
also made it so that Partition::Spill() would eagerly switch that
partition's streams to IO-buffers, and fail the query if the buffer
is not available.  The buffer may not be available because the
reserved buffers may be in use by other partition's streams.

We don't need to fail the query if the switch to IO-buffers in
Partition::Spill() fails.  Instead, we should just let the streams
switch on demand as they fill up the small buffers.  When that
happens, if the IO buffer is not available, then we already have a
mechanism to pick partitions to spill until we can get the IO-buffer
(in the worst case it means working our way back down to the initial
reservation).  See AppendRowStreamFull() and BuildHashTables().
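
The on-demand fallback described above (spill partitions until an IO buffer becomes available) might look roughly like this in outline. The function and its three parameters are hypothetical names for this Python sketch; it does not reproduce AppendRowStreamFull() or BuildHashTables(), and the choice of which partition to spill is arbitrary here.

```python
def acquire_io_buffer_with_spilling(get_io_buffer, spill_candidates, spill):
    """Retry acquiring an IO buffer, spilling one partition per failed attempt.

    get_io_buffer() returns True when a buffer was obtained; spill(partition)
    releases that partition's buffers. Returns False only when nothing is
    left to spill and the buffer is still unavailable.
    """
    while not get_io_buffer():
        if not spill_candidates:
            return False
        spill(spill_candidates.pop())
    return True
```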

The symptom of this regression was that some queries would fail at a
lower memory limit than before.

Also revert the max_block_mgr_memory values back to their originals.

Additional testing: loop custom_cluster/spilling.py.  We should also
remeasure minimum memory required by queries after this change.

Change-Id: I11add15540606d42cd64f2af99f4e96140ae8bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1228
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-10-12 14:41:08 -07:00
Ippokratis Pandis
49b588a714 IMPALA-2265: Sorter was not checking the returned Status of PrepareRead
The sorter was dropping on the floor the returned Status of the
PrepareRead() calls. PrepareRead() tries to Pin() blocks. In some
queries with large sorts, those Pin() calls could fail with OOM,
but because the sorter was ignoring the returned Status it would
happily put the unpinned block in the vector of blocks and eventually
seg fault, because the buffer_desc_ of that block was NULL.

This patch fixes this problem and adds a test that eventually we may
want to move to the exhaustive build because it takes quite some time.
It also changes the comments of the sorter class to the doxygen style.

Change-Id: Icad48bcfbb97a68f2d51b015a37a7345ebb5e479
Reviewed-on: http://gerrit.cloudera.org:8080/1156
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:42:03 -07:00
Dan Hecht
df5271e7c1 Fix flaky test_spilling.py test case
The commit:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ

recently lowered a couple of limits from 100m to 40m, which appears to
be too aggressive based on occasional gvm failures.  Let's bump
it back up.

Change-Id: I2c3cc24841cf3a305785890329d77e4e9e74f6e5
Reviewed-on: http://gerrit.cloudera.org:8080/1125
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:34 -07:00
Ippokratis Pandis
48699de6e3 IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ
PAGG and PHJ were using an all-or-nothing approach wrt spilling. In
particular, they were trying to switch to IO-sized buffers for both
streams (aggregated and unaggregated in PAGG; build and probe in PHJ)
of every partition (currently 16 partitions for a total of 32
streams), even if some of the streams had very few rows, were empty,
or simply would not spill, so there was no need to allocate
IO-buffers for them. That was increasing the min mem needed by those
operators in many queries.

This patch decouples the decision to switch to IO-buffers for each
stream of each partition. Streams will switch to IO-sized buffers
whenever the rows they contain do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both streams to IO buffers.
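
To make the per-stream buffer progression concrete, here is a small Python sketch. The two small buffer sizes come from the text above; `IO_BUFFER_SIZE` is an assumed value chosen only for illustration, and both function names are hypothetical. The overflow check ignores row-packing effects, so it is an approximation.

```python
SMALL_BUFFER_SIZES = [64 * 1024, 512 * 1024]  # sizes stated in the commit message
IO_BUFFER_SIZE = 8 * 1024 * 1024  # assumed value, for illustration only


def next_buffer_size(num_buffers_allocated):
    """Return the size of the next buffer a single stream would allocate."""
    if num_buffers_allocated < len(SMALL_BUFFER_SIZES):
        return SMALL_BUFFER_SIZES[num_buffers_allocated]
    return IO_BUFFER_SIZE


def needs_io_buffers(stream_bytes):
    """A stream switches once its rows no longer fit in both small buffers."""
    return stream_bytes > sum(SMALL_BUFFER_SIZES)
```

Because the decision is made per stream, a partition whose streams stay under 576KB never pays for an IO-sized buffer unless it is spilled.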

With this change many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the min mem requirement. For example,
below is the min mem needed (in MBs) for some of the TPC-H queries.
Some need half or less of the memory they needed before:

  TPC-H Q3: 645 -> 240
  TPC-H Q5: 375 -> 245
  TPC-H Q7: 685 -> 265
  TPC-H Q8: 740 -> 250
  TPC-H Q9: 650 -> 400
  TPC-H Q18: 1100 -> 425
  TPC-H Q20: 420 -> 250
  TPC-H Q21: 975 -> 620

To make this small buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call of
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. In that case,
previously we would treat it as an indication that there is no memory
left, start spilling a partition and switching all streams to
IO-buffers. Now we make a best effort, trying to first
SwitchToIoBuffers() and if that is successful, we re-attempt the
AllocateRow() call. See IMPALA-2352 for more details.

Another change is that now SwitchToIoBuffers() will reset the flag
using_small_buffers_ back to false, in case we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.

With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271 that are essentially stream::using_small_buffers_-related
DCHECKs.

This patch adds all 22 TPC-H queries in test_mem_usage_scaling test
and updates the per-query min mem limits in it. Additionally, it adds
a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some
dead test code.

Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:

	tests/query_test/test_aggregation.py
2015-09-23 11:07:42 -07:00
aacalfa
57dd4d1502 IMPALA-1309: Add support for distinct in group_concat function.
Change-Id: I2790f1d2a7bfd0ecc7ef66cc5d91dafe3414e111
Reviewed-on: http://gerrit.cloudera.org:8080/892
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-09-23 09:42:17 +00:00
Ippokratis Pandis
d58aedff42 IMPALA-1820: Start with small pages for hash tables during repartitioning
The change of the PARTITION_FANOUT from 32 to 16 exposed a pathological case due to
the lack of coordination across concurrently executing spilling nodes of the same query.
In particular, when we repartition a partition we try to initialize hash tables for the
new partitions. But each hash table needs a block (for the nodes). In case there were not
any IO-sized blocks available, because they had been consumed by other nodes, we would get
into a loop trying to repartition those smaller partitions that couldn't initialize their
hash table. Additional repartitions that, among others, would need additional blocks for
the new streams. These partitions would end up being very small, still we would fail the
query when we were reaching the MAX_PARTITION_DEPTH limit, which was fixed to 4.

This patch fixes the problem by initializing the hash tables during repartitions with
small pages. That is, the hash tables always first use a 64KB and a 512KB block for their
nodes before switching to IO-sized blocks. This helps the partitioning algorithm to
finish when we end up with partitions that can fit in those small pages. The performance
may not be optimal, still the memory consumption is lower and the algorithm finishes. For
example, without this patch and with PARTITION_FANOUT == 16 in order to run TPC-H Q18 and
Q20 we needed 3.4GB and 3.1GB respectively. With this patch TPC-H Q18 needs ~1GB and Q20
975MB.

This patch also removes the restriction of stopping repartitioning when we are reaching
4 levels of repartitioning. Instead, whenever we repartition we compare the size of
the input partition to the size of the largest new partition. If there is no reduction
on the size we stop the algorithm. Otherwise, we keep on repartitioning. That should
help in cases of skew (e.g. due to bad hashing). There is a new MAX_PARTITION_DEPTH limit
of 16. It is very unlikely we will ever hit this limit.
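
The new stopping rule can be summarized in a few lines of Python. This is a sketch of the rule as described in the paragraph above, not the actual implementation: the function name is hypothetical, and whether the depth limit is inclusive is an assumption.

```python
MAX_PARTITION_DEPTH = 16  # limit stated in the commit message


def should_keep_repartitioning(input_size, new_partition_sizes, depth):
    """Continue only if repartitioning actually reduced the partition size.

    new_partition_sizes must be non-empty. If the largest new partition is
    as big as the input (for example, all rows hash to the same partition),
    further repartitioning cannot help, so the algorithm stops.
    """
    if depth >= MAX_PARTITION_DEPTH:
        return False
    return max(new_partition_sizes) < input_size
```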

Change-Id: Ib33fece10585448bc2d07bb39d0535d78b168ccc
Reviewed-on: http://gerrit.cloudera.org:8080/119
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-02-28 00:42:04 +00:00
Alex Behm
f696861c5c Throw error on unrecognized test sections.
Our .test file parser used to not abort tests when there
is a malformed test/section. This patch changes that behavior
to report an error and treat the test as failed.
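
As a rough illustration of the stricter behavior, the Python sketch below rejects unknown section names instead of silently skipping them. The `---- NAME` header syntax, the set of valid section names, and the function name are all assumptions made for this example rather than a description of the real parser.

```python
VALID_SECTIONS = {"QUERY", "RESULTS", "TYPES"}  # hypothetical subset for illustration


def parse_sections(test_text, valid_sections=VALID_SECTIONS):
    """Split one test case into sections, rejecting unknown section names."""
    sections = {}
    current = None
    for line in test_text.splitlines():
        if line.startswith("---- "):
            current = line[5:].strip()
            if current not in valid_sections:
                # Previously such a section would have been ignored; now the
                # whole test is reported as failed.
                raise ValueError("unrecognized test section: %s" % current)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}
```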

Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.

Arguably, the test file parser should be more flexible in which places
to accept comments, but this patch does not address that problem.

Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-12-02 18:08:09 -08:00
Ippokratis Pandis
87502f829c IMPALA-1471: Bug in spilling of PHJ that was affecting left anti and outer joins.
In cases where we had to spill the probe side of PHJs, we were not only appending
the probe row to the tuple stream to be spilled, but we were also getting into the
regular processing loop with the iterator set to End(). In the case of left anti
and left outer joins, the result was to incorrectly output this row, since it did not
have a match.

This bug had a small perf impact for all spilling joins because we were doing an
unnecessary loop for each probe row we had to spill.

This patch solves the problem by immediately going to the next probe row if the
current row is spilled. Additionally, it fixes a bug in the block mgr where
a code path was not correctly counting the number of pinned buffers.
It also adds tpch-q21 in the set of queries to run in the spilling test.
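
The probe-side fix can be illustrated with a simplified left anti join in Python. All names here are hypothetical, hash tables are modeled as plain sets, and a spilled partition is represented by `None`; the real logic is in the C++ partitioned hash join node.

```python
def left_anti_join_probe(probe_rows, hash_tables, partition_of, spilled_streams):
    """Return probe rows with no build-side match, deferring spilled rows."""
    output = []
    for row in probe_rows:
        part = partition_of(row)
        if hash_tables[part] is None:
            # The partition is spilled: save the row for later processing and
            # go straight to the next probe row. Falling through to the match
            # check here would wrongly emit the row as "unmatched".
            spilled_streams[part].append(row)
            continue
        if row not in hash_tables[part]:
            output.append(row)
    return output
```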

Change-Id: I762f5c41fe468e4485a4b31dabe2e53f6b49ae24
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5313
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5334
2014-11-20 02:21:14 -08:00
ishaan
23964c19af [CDH5] Fix bad merge in spilling.test
Change-Id: Ia6e30cf5916c737088d8cb969e0167b9d69a599e
2014-10-08 23:19:02 -07:00
Nong Li
de31fa8e21 Disable spilling tests that are too flaky.
Change-Id: I4ac877c3fa8297d873c67f219bb0c75f0001562d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4731
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-10-06 15:18:56 -07:00
Nong Li
e08ffde009 PA/PHJ: Increase fanout to 32 and fix interaction with small buffers.
Small buffers introduced an issue that is exacerbated by the large fanout. A stream can
only be appended to forever once it has grabbed the initial io sized buffer. With small
buffers, we don't grab that at the beginning anymore and, before this patch, it is
grabbed when the stream first needs it. This means when one stream needs it, another
stream could have already grabbed it (meaning this stream is pinned with multiple
buffers).

This patch has all the streams grab an IO buffer as soon as the first stream needs an
io buffer. This guarantees that all streams get 1 before any get 2.

Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
2014-10-06 15:16:16 -07:00
Nong Li
3e632ef6ad Reduce min PA/PHJ mem requirement.
Update PA/PHJ to use small (< io sized buffers) initially. Without this we would
not be able to run at the QPS that we need just due to the buffering requirements
of these operators.

Change-Id: Ic8a777d147893567c9590fbab17f561eadb6ee19
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4623
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-10-06 15:14:10 -07:00
ishaan
010cc22a2f [CDH5] Fix test spilling.
tpch in cdh5 does not have double columns. Also, remove round calls to test that we get
consistent results.

Change-Id: Ia45ef08644ed78b05a08c47422733ab38a26b508
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4595
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-09-26 22:57:02 -07:00
Nong Li
d5c948c351 Increase the mem limit for one of the spilling queries.
Change-Id: I9b52582b2ded82821ecc446762f07d7702dedabf
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4555
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-09-26 12:27:29 -07:00
Nong Li
f03b05ed50 Fix hash table buckets to allocate memory from the BlockMgr.
This was always a TODO. We want memory to come from the block mgr and trigger spilling.

Change-Id: I07f1f79fbbb33068fb2df64510a80a9b008ef73d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4466
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-26 12:26:09 -07:00
Matthew Jacobs
da5198e615 Add spilling test for an analytic fn
Change-Id: Ia93c71c9c2a01f7f04a81593d51f5ca565286b7d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4447
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-09-23 07:26:09 -07:00
Nong Li
8a661d0787 [CDH5] cherry pick conflicts.
Change-Id: Ic11237b7ead4a810b523d6b6095781efbc5bb66b
2014-09-20 19:41:42 -07:00
Nong Li
6b73eec02d PHJ: Fix block management when spilling.
The previous code did not handle well the case where the spilling happens when
building the hash table (i.e. partitioning the build rows fit). This caused the
probe partition to be starved causing queries that should be able to run to fail
with a not enough buffers error.

Change-Id: I3a9a84e8800a72ed3ce6f5ab7ff03bc2d6eb7ad8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4403
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-09-20 16:12:21 -07:00
Skye Wanderman-Milne
2a449651da Use CRC hash for 0th partition level.
Change-Id: Ie845e0edb684f13421eea41327b1571b368db21a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4370
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-09-20 16:11:40 -07:00
ishaan
c4b4e010ff Buffered Tuple Stream fixes.
This patch fixes two issues:
  - Add API to buffered block mgr to allow an atomic Unpin and GetNewBlock. This has
    the semantics of unpinning a block and giving the buffer to the new block. This
    is necessary for the tuple stream to make sure another thread does not grab the
    unpinned block in between.
  - Buffer management reading an unpinned stream. Before moving onto a new block (and
    unpinning the current), we need to make sure all the tuples returned from the
    current block are returned up the operator tree.
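
The atomic unpin-and-allocate operation from the first bullet can be sketched as follows. This toy Python `BlockMgr` is an assumption-laden model (blocks are dictionaries, buffers are just a counter, and the method names are invented); it only shows why doing both steps under one lock prevents another thread from taking the freed buffer.

```python
import threading


class BlockMgr:
    """Toy block manager with a fixed number of buffers."""

    def __init__(self, num_buffers):
        self._lock = threading.Lock()
        self._free_buffers = num_buffers

    def get_new_block(self):
        with self._lock:
            if self._free_buffers == 0:
                return None  # no buffer available
            self._free_buffers -= 1
            return {"pinned": True}

    def unpin(self, block):
        with self._lock:
            block["pinned"] = False
            self._free_buffers += 1

    def unpin_and_get_new_block(self, old_block):
        """Transfer old_block's buffer directly to a new block.

        Both steps happen under a single lock acquisition, so the buffer
        never appears in the free pool where another thread could grab it.
        """
        with self._lock:
            old_block["pinned"] = False
            return {"pinned": True}
```

Calling `unpin()` followed by `get_new_block()` would release the lock between the two steps, which is exactly the window the atomic variant closes.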

Change-Id: I95ee58d1019dd971f6a7dc19ecafdfa54cdbf942
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4333
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-20 16:05:11 -07:00