Switch to a median of three random tuples, which should be very robust to
a wide range of inputs. It may be slightly worse than the existing pivot
selection on some inputs where the original algorithm is close to
optimal (e.g. already-sorted inputs), but should typically be
better overall.
Always recurse on the smaller partition: this prevents stack
overflow even with bad pivot selection.
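A minimal sketch of the two ideas on a plain array of ints, assuming a
Hoare-style partition (the function names are illustrative, not the Sorter's
actual code): the pivot is the median of three randomly sampled elements, and
each iteration recurses only on the smaller partition while looping on the
larger one, bounding stack depth to O(log n) even with bad pivots.

  #include <algorithm>
  #include <cstdlib>
  #include <vector>

  // Median of three randomly sampled elements in v[lo..hi].
  static int MedianOfThreePivot(const std::vector<int>& v, int lo, int hi) {
    int a = v[lo + rand() % (hi - lo + 1)];
    int b = v[lo + rand() % (hi - lo + 1)];
    int c = v[lo + rand() % (hi - lo + 1)];
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
  }

  static void Quicksort(std::vector<int>& v, int lo, int hi) {
    while (lo < hi) {
      int pivot = MedianOfThreePivot(v, lo, hi);
      // Hoare-style partition around the pivot value.
      int i = lo, j = hi;
      while (i <= j) {
        while (v[i] < pivot) ++i;
        while (v[j] > pivot) --j;
        if (i <= j) std::swap(v[i++], v[j--]);
      }
      // Recurse on the smaller partition, loop on the larger one: recursion
      // depth stays O(log n) regardless of pivot quality.
      if (j - lo < hi - i) {
        Quicksort(v, lo, j);
        lo = i;
      } else {
        Quicksort(v, i, hi);
        hi = j;
      }
    }
  }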
The overhead is minimal - in profiles for small sorts I'm seeing pivot
selection take at most 0.5% of CPU time.
The improved pivot selection gives modest improvements of 2-5% on the
targeted-perf ORDER BY benchmarks on a single-node run with TPC-H
scale factor 20.
Change-Id: Iae50112b6deca3d6268e18b6f4daae1af279b452
Reviewed-on: http://gerrit.cloudera.org:8080/2824
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Small buffers introduced an issue that is exacerbated by the large fanout. A stream can
only be appended to indefinitely once it has grabbed its initial I/O-sized buffer. With small
buffers, we no longer grab that buffer at the beginning; before this patch, it was
grabbed when the stream first needed it. This means that by the time one stream needs it,
another stream could have already grabbed it (leaving that other stream pinned with
multiple buffers).
This patch has all the streams grab an I/O buffer as soon as the first stream needs
one. This guarantees that every stream gets one I/O buffer before any stream gets two.
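A hypothetical sketch of that invariant (Stream and PinIoBuffer() are
illustrative stand-ins, not the actual BufferedTupleStream API): when the first
stream needs an I/O-sized buffer, the caller walks every stream and has each one
pin a single I/O buffer before any stream is allowed a second.

  #include <vector>

  struct Stream {
    bool has_io_buffer = false;
    bool PinIoBuffer() {   // Stand-in for pinning one I/O-sized buffer.
      has_io_buffer = true;
      return true;         // Would return false if no buffer could be acquired.
    }
  };

  // Called when any one stream first needs an I/O-sized buffer.
  bool SwitchAllStreamsToIoBuffers(std::vector<Stream*>& streams) {
    for (Stream* stream : streams) {
      // Every stream pins exactly one I/O buffer here, so no stream ends up
      // with two buffers while another still has none.
      if (!stream->has_io_buffer && !stream->PinIoBuffer()) return false;
    }
    return true;
  }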
Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
Similar to some of our other resource management objects, the buffered block mgr
will be shared by all fragments within a query.
The memory given to the block mgr is based on the query limit (e.g. 80% of the query limit).
We can't have each fragment's block mgr use 80% of the query limit, and
we probably don't want to impose per-fragment limits.
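A minimal sketch of that sharing and sizing rule, assuming illustrative names
and ignoring synchronization (this is not the actual API): the first fragment to
ask creates a single block mgr sized from the query limit, and every later
fragment reuses it.

  #include <cstdint>
  #include <memory>

  struct BlockMgr {
    explicit BlockMgr(int64_t mem_limit) : mem_limit(mem_limit) {}
    int64_t mem_limit;  // Budget for the whole query, not per fragment.
  };

  // Each fragment calls this; only the first call per query creates the mgr.
  std::shared_ptr<BlockMgr> GetOrCreateBlockMgr(
      std::shared_ptr<BlockMgr>& query_block_mgr, int64_t query_mem_limit) {
    if (query_block_mgr == nullptr) {
      // e.g. 80% of the query limit, as described above.
      query_block_mgr = std::make_shared<BlockMgr>(query_mem_limit * 8 / 10);
    }
    return query_block_mgr;
  }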
Change-Id: Idcd89f302534b37ed236cdd42784ae8d717ec29e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3965
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4179
This patch does a few things:
1. Moves the buffered block mgr from the sorter to the runtime state. It is now
shared across the query fragment. The partitioned hash join and agg
will use this as well.
2. Adds a Client interface to the block mgr. Each exec node is a different client
and can reserve a minimum number of buffers. This avoids starvation (see the
sketch after this list).
3. Updates the BufferedBlockMgr interface for getting pinned blocks, collapsing
two existing APIs into one.
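A hypothetical sketch of the per-client reservation idea (the class and method
names are illustrative, not the real BufferedBlockMgr interface): each exec node
registers as a client with a minimum number of reserved buffers, and
registration fails if the combined reservations cannot be satisfied, so no node
can starve another of its reserved share.

  #include <deque>

  class BlockMgr {
   public:
    struct Client {
      int min_reserved_buffers = 0;
      int buffers_in_use = 0;
    };

    explicit BlockMgr(int total_buffers) : total_buffers_(total_buffers) {}

    // Each exec node registers once; fails if the combined minimum
    // reservations would exceed the total number of buffers.
    Client* RegisterClient(int min_reserved_buffers) {
      if (reserved_ + min_reserved_buffers > total_buffers_) return nullptr;
      reserved_ += min_reserved_buffers;
      Client client;
      client.min_reserved_buffers = min_reserved_buffers;
      clients_.push_back(client);
      return &clients_.back();  // deque keeps the pointer stable.
    }

   private:
    int total_buffers_;
    int reserved_ = 0;
    std::deque<Client> clients_;
  };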
Change-Id: Ibb31fbe480f3726048457f26e24a9e33f7201d86
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3504
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3574
This patch does two things in preparation for external joins. The
hash table used to contain a directory structure (buckets and nodes),
both of which were contiguous. The nodes contained the tuple ptrs
within them.
This patch changes it so the nodes are not stored contiguously but are
allocated in pages. (This structure is dense and does not require
random lookups by index.) The bucket structure is still contiguous
since we rely on the doubling property and random lookup by index.
The second change is that the nodes no longer store the tuple ptrs
within them. This makes it easier to build the hash table on top of
existing data.
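A hypothetical sketch of that layout (the names are illustrative, not the actual
HashTable code): the bucket directory stays a single contiguous array that can
double and be indexed randomly, while nodes are handed out densely from
fixed-size pages and only point at tuple data owned elsewhere.

  #include <memory>
  #include <vector>

  struct Node {
    Node* next;        // Next node in this bucket's chain.
    const void* data;  // Points at tuples owned elsewhere (e.g. a row batch)
                       // instead of storing the tuple ptrs inside the node.
  };

  struct Bucket {
    Node* node = nullptr;  // Head of the chain for this bucket.
  };

  class HashTable {
   public:
    explicit HashTable(int num_buckets) : buckets_(num_buckets) {}

    // Nodes come from the current page; pages never move or get resized, so
    // growing the bucket directory does not invalidate existing nodes.
    Node* AllocNode() {
      if (pages_.empty() || page_used_ == kPageSize) {
        pages_.emplace_back(new Node[kPageSize]);
        page_used_ = 0;
      }
      return &pages_.back()[page_used_++];
    }

   private:
    static const int kPageSize = 1024;
    std::vector<Bucket> buckets_;                 // Contiguous; doubles on resize.
    std::vector<std::unique_ptr<Node[]>> pages_;  // Dense pages of nodes.
    int page_used_ = 0;
  };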
Here's a quick benchmark doing a self join on TPC-H lineitem. Both
build and probe times decreased a bit.
Before:
HASH_JOIN_NODE (id=2):(Total: 1s139ms, non-child: 985.939ms, % non-child: 86.50%)
- BuildBuckets: 2.10M (2097152)
- BuildRows: 6.00M (6001215)
- BuildTime: 527.991ms
- LeftChildRows: 6.00M (6001215)
- LeftChildTime: 451.964ms
- LoadFactor: 0.50
- RowsReturned: 30.01M (30012985)
- RowsReturnedRate: 26.33 M/sec
After:
HASH_JOIN_NODE (id=2):(Total: 1s019ms, non-child: 835.350ms, % non-child: 81.97%)
- BuildBuckets: 2.10M (2097152)
- BuildRows: 6.00M (6001215)
- BuildTime: 423.175ms
- LeftChildRows: 6.00M (6001215)
- LeftChildTime: 406.67ms
- LoadFactor: 0.50
- RowsReturned: 30.01M (30012985)
- RowsReturnedRate: 29.45 M/sec
Change-Id: I79e209a24c24fb4f2f99574bcf187746fddadc06
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3245
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Re-order union operands descending by their estimated per-host memory,
such that parent nodes can gauge the peak memory consumption of a MergeNode after
opening it during execution (a MergeNode opens its first operand in Open()).
Scan nodes are always ordered last because they can dynamically scale down their
memory usage, whereas many other nodes cannot (e.g., joins, aggregations).
One goal is to decrease the likelihood of a SortNode parent claiming too much
memory in its Open(), possibly causing the mem limit to be hit when subsequent
union operands are executed.
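A minimal sketch of the ordering rule only (PlanNodeInfo and its fields are
illustrative, not the actual planner code, which lives in the Java frontend):
non-scan operands are sorted descending by estimated per-host memory, and scan
nodes are always placed last since they can scale down their memory usage.

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct PlanNodeInfo {
    bool is_scan_node;
    int64_t per_host_mem_estimate;
  };

  void ReorderUnionOperands(std::vector<PlanNodeInfo>& operands) {
    std::stable_sort(operands.begin(), operands.end(),
        [](const PlanNodeInfo& a, const PlanNodeInfo& b) {
          // Non-scan nodes come before scan nodes.
          if (a.is_scan_node != b.is_scan_node) return !a.is_scan_node;
          // Within each group, larger memory estimates come first.
          return a.per_host_mem_estimate > b.per_host_mem_estimate;
        });
  }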
Change-Id: Ia51caaffd55305ea3dbd2146cd55acc7da67f382
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3146
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3213
Tested-by: jenkins
- Added static ORDER BY tests to test_queries.py and QueryTest/sort.test
- test_order_by.py also contains tests with static queries that are run with
multiple memory limits.
- Added stress, scratch disk and failpoint tests
- Incorporated Srinath's change that copied all ORDER BY with LIMIT tests into
the top-n.test file
Extra time required:
Serial:
scratch disk: 42 seconds
test queries sort: 77 seconds
test sort: 56 seconds
sort stress: 142 seconds
TOTAL: 5 min 17 seconds
Parallel(8 threads):
scratch disk: 40 seconds
test queries sort: 42 seconds
test sort: 49 seconds
sort stress: 93 seconds
TOTAL: 3 min 44 sec
Change-Id: Ic5716bcfabb5bb3053c6b9cebc9bfbbb9dc64a7c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2820
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3205