When an in-memory blocking aggregation or join is in the GetNext()
phase where it is outputting accumulated rows then we expect
memory consumption to monotonically decrease because no more
rows will be accumulated in memory.
This change adds support to release unused reservation and makes
use of it for in-memory aggregations and sorts.
We don't release memory for operators with spilled data, since they
may need the reservation to bring it back into memory. We also
don't release memory in subplans, since it will probably be used
in a later iteration of the subplan.
Testing:
Updated spilling test that now requires less memory.
Ran stress test binary search on tpch_parquet. No changes, except
Q18 now requires 325MB instead of 450MB to execute without spilling.
Ran query with two sorts in the same pipeline and watched /memz to
confirm that the first node in the pipeline was incrementally releasing
memory. Added a regression test based on this experiment.
Added a backend test to directly test reservation decreasing.
Change-Id: I6f4d0ad127d5fcd14b9821a7c127eec11d98692f
Reviewed-on: http://gerrit.cloudera.org:8080/7619
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.
Port ExecNodes to use BufferPool:
* Each ExecNode has to claim its reservation during Open()
* Port Sorter to use BufferPool.
* Switch from BufferedTupleStream to BufferedTupleStreamV2
* Port HashTable to use BufferPool via a Suballocator.
This also makes PAGG memory consumption more efficient (avoid wasting buffers)
and improve the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)
TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.
Testing:
* Updated tests to reflect new memory requirements
Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This was a poorly written test that relies on assumptions about
the behavior of 'rand' and the order that rows get processed in
a table that Impala doesn't actually guarantee.
The new version is still sensitive to the precise behavior of
'rand()', but shouldn't be flaky unless that behavior is changed.
Change-Id: If1ba8154c2b6a8d508916d85391b95885ef915a9
Reviewed-on: http://gerrit.cloudera.org:8080/6775
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, exprs used in sorts were evaluated lazily. This can
potentially be bad for performance if the exprs are expensive to
evaluate, and it can lead to crashes if the exprs are
non-deterministic, as this violates assumptions of our sorting
algorithm.
This patch addresses these issues by materializing ordering exprs.
It does so when the expr is non-deterministic (including when it
contains a UDF, which we cannot currently know if they are
non-deterministic), or when its cost exceeds a threshold (or the
cost is unknown).
Testing:
- Added e2e tests in test_sort.py.
- Updated planner tests.
Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Reviewed-on: http://gerrit.cloudera.org:8080/6322
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prepend
the class names with Impala-
git grep -l 'TestDimension' | xargs \
sed -i 's/TestDimension/ImpalaTestDimension/g'
git grep -l 'TestMatrix' | xargs \
sed -i 's/TestMatrix/ImpalaTestMatrix/g'
git grep -l 'TestVector' | xargs \
sed -i 's/TestVector/ImpalaTestVector/g'
The tests all passed in an exhaustive run on the upstream jenkins
server:
http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/
Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
Also make test_scratch_disk.py more deterministic, by using
max_block_mgr_memory, which doesn't include scanner memory.
The fixed test_scratch_disk.py exercises the other sorter bugs
that occurs when scratch cannot be written.
Testing:
Added a test that does a sort with various memory limits and consumes
the whole output of the sorter (we have many tests of sorts with limits
but limited coverage of sorts without limits). Ran an exhaustive test
run before posting for review.
This added test reproduced one of the sorter bugs, where var-len blocks
were not always attached to the output batch. The other test was
reproduced by the test change in IMPALA-3669: test_scratch_disk fix.
Change-Id: Ia1a0ddffa0a5b157ab86a376b7b7360a923698d6
Reviewed-on: http://gerrit.cloudera.org:8080/3315
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Clarify relationships between classes, clean up the previous mess
where every class was friends with the other so there's an actual
distinction between public and private members. TupleIterator
is now no longer tied to TupleSorter, just Run.
Document and enforce invariants in many cases.
Factor out some functions from large functions.
Simplify and document iterator logic.
Make management of buffers when iterating over output stream more
explicitly correct: either use MarkNeedToReturn() or attach block
to the batch as appropriate. The SortedRunMerger didn't handle
resource transfer correctly, except if all the memory came from
the batch's MemPool. This patch fixes the cases when resources
are attached to the batches, but not the 'need_to_return' case.
Document that SortedRunMerger requires 'deep_copy_input' to be true
if batches can have the 'need_to_return' flag set.
Also use the atomic block exchange operation when moving between
blocks in unpinned runs to prevent pin failures at that point.
I explicitly have avoided changing the hairy block management logic
when allocating buffers for merging, that will need addressing in
a follow-up patch.
Add a SpilledRuns counter so that it's more explicit that spilling
occurred.
Testing:
Added some tests for corner cases with empty and NULL strings.
Fixed a test that previously failed with OOM but now succeeds.
Performance:
Benchmarking against old code initial revealed some regressions from
changes in inlining. Force inlining the TupleComparator::operator() and
iterator Next()/Prev() functions helped and performance seems similar or
slightly better on the targeted orderby benchmarks.
Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a
Reviewed-on: http://gerrit.cloudera.org:8080/2826
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Switch to a median of three random tuples that should be very robust to
a range of inputs. It may be slightly worse than the existing pivot
selection on some inputs where the original algorithm is close to
optimal (e.g. already sorted inputs), but should be typically
better overall.
Always recurse on the smaller partition: this prevent the stack
overflow even with bad pivot selection.
The overhead is minimal - in profiles for small sorts I'm seeing pivot
selection take at most 0.5% of CPU time.
The improved pivot selections gives modest improvements of 2-5% on the
targeted perf order by benchmarks on a single node run with TPC-H
scale factor 20.
Change-Id: Iae50112b6deca3d6268e18b6f4daae1af279b452
Reviewed-on: http://gerrit.cloudera.org:8080/2824
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set though
they were not intended to be run a standalone script. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Small buffers introduced an issue that is exacerbated by the large fanout. A stream can
only be appended to forever once it has grabbed the initial io sized buffer. With small
buffers, we don't grab that at the beginning anymore and, before this patch, it is
grabbed when the stream first needs it. This means when one stream needs it, another
stream could have already grabbed it (meaning this stream is pinned with multiple
buffers).
This patch has all the streams grab an IO buffer as soon as the first stream needs an
io buffer. This guarantees that all streams get 1 before any get 2.
Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
Similar to some of our other resource management objects, the buffered block mgr
will be shared by all fragments within a query.
The memory given to the block mgr is based on the query limit (e.g. 80% of query limit).
We can't have each fragment having a block mgr that uses 80% of the query limit and
we probably don't want to impose per fragment limits.
Change-Id: Idcd89f302534b37ed236cdd42784ae8d717ec29e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3965
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4179
This patch does a few things:
1. Moves the buffer block mgr from the sorter to the runtime state. This is now
one that is shared across the query fragment. The partitioned hash join and agg
will use this as well.
2. Adds a Client interface to the block mgr. Each exec node is a different client
and can reserve a minimum number of buffers. This avoid starvation.
3. Updated the BufferedBlockMgr interface's for getting pinned blocks to collapse
two existing APIs.
Change-Id: Ibb31fbe480f3726048457f26e24a9e33f7201d86
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3504
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3574
This patch does two things in preparation for external joins. The
hash table used to contain a directory structure (buckets and nodes)
both of which were contiguous. The nodes contained the tuple ptrs
within it.
This patch changes it so the nodes are not stored contiguously but
allocated in pages. (this structure is dense and does not require
random lookups by index). The bucket structure is still contiguous
since we rely on the doubling property and random lookup by index.
The second change is that the node's no longer store the tuple ptrs
within them. This makes it easier to build the hash table ontop of
existing data.
Here's a quick benchmark doing a self join on tpch lineitem. Both
build and probe times decreased a bit.
Before:
HASH_JOIN_NODE (id=2):(Total: 1s139ms, non-child: 985.939ms, % non-child: 86.50%)
- BuildBuckets: 2.10M (2097152)
- BuildRows: 6.00M (6001215)
- BuildTime: 527.991ms
- LeftChildRows: 6.00M (6001215)
- LeftChildTime: 451.964ms
- LoadFactor: 0.50
- RowsReturned: 30.01M (30012985)
- RowsReturnedRate: 26.33 M/sec
After:
HASH_JOIN_NODE (id=2):(Total: 1s019ms, non-child: 835.350ms, % non-child: 81.97%)
- BuildBuckets: 2.10M (2097152)
- BuildRows: 6.00M (6001215)
- BuildTime: 423.175ms
- LeftChildRows: 6.00M (6001215)
- LeftChildTime: 406.67ms
- LoadFactor: 0.50
- RowsReturned: 30.01M (30012985)
- RowsReturnedRate: 29.45 M/sec
Change-Id: I79e209a24c24fb4f2f99574bcf187746fddadc06
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3245
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Re-order union operands descending by their estimated per-host memory,
s.t. parent nodes can gauge the peak memory consumption of a MergeNode after
opening it during execution (a MergeNode opens its first operand in Open()).
Scan nodes are always ordered last because they can dynamically scale down their
memory usage, whereas many other nodes cannot (e.g., joins, aggregations).
One goal is to decrease the likelihood of a SortNode parent claiming too much
memory in its Open(), possibly causing the mem limit to be hit when subsequent
union operands are executed.
Change-Id: Ia51caaffd55305ea3dbd2146cd55acc7da67f382
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3146
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3213
Tested-by: jenkins
- Added static order by tests to test_queries.py and QueryTest/sort.test
- test_order_by.py also contains tests with static queries that are run with
multiple memory limits.
- Added stress, scratch disk and failpoints tests
- Incorporated Srinath's change that copied all order by with limit tests into
the top-n.test file
Extra time required:
Serial:
scratch disk: 42 seconds
test queries sort : 77 seconds
test sort: 56 seconds
sort stress: 142 seconds
TOTAL: 5 min 17 seconds
Parallel(8 threads):
scratch disk: 40 seconds
test queries sort: 42 seconds
test sort: 49 seconds
sort stress: 93 seconds
TOTAL: 3 min 44 sec
Change-Id: Ic5716bcfabb5bb3053c6b9cebc9bfbbb9dc64a7c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2820
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3205