PlanNode does not consider some factors when estimating memory,
which causes a large error rate
AggregationNode
1. MemoryEstimate = Ndv * (AvgRowSize + SizeOfBucket)
2. When estimating the Ndv of a merge aggregation, the Ndv should
only be divided once.
3. If there are no grouping exprs, MemoryEstimate =
MIN_PLAIN_AGG_MEM
SortNode
1. MemoryEstimate = Cardinality * AvgRowSize. This is the memory used
when there is enough memory to sort without spilling
HashJoinNode
1. MemoryEstimate = DataRows + Buckets + DuplicateNodes, where
DataRows = RightTableCardinality * AvgRowSize,
Buckets = roundUpToPowerOf2(RightTableCardinality) *
SizeOfBucket,
DuplicateNodes = (RightTableCardinality - RightNdv) *
SizeOfDuplicateNode
KuduScanNode
1. MemoryEstimate = Columns * BytesPerColumn * MaxScannerThreads,
where Columns counts only the columns scanned by the query, not all
the columns of the table
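As a rough illustration, the estimates above boil down to the sketch
below (Java; the constants and names are hypothetical placeholders,
not the actual planner code):

  class MemoryEstimateSketch {
    static final long SIZE_OF_BUCKET = 16;          // assumed bucket size in bytes
    static final long SIZE_OF_DUPLICATE_NODE = 24;  // assumed duplicate-node size in bytes
    static final long MIN_PLAIN_AGG_MEM = 16L * 1024 * 1024;  // assumed floor, non-grouping agg

    // AggregationNode: one hash-table entry per distinct group.
    static long aggregation(long ndv, double avgRowSize, boolean hasGroupingExprs) {
      if (!hasGroupingExprs) return MIN_PLAIN_AGG_MEM;
      return (long) (ndv * (avgRowSize + SIZE_OF_BUCKET));
    }

    // SortNode: memory used when the whole input fits in memory.
    static long sort(long cardinality, double avgRowSize) {
      return (long) (cardinality * avgRowSize);
    }

    // HashJoinNode: row data + hash buckets + duplicate nodes for the build (right) side.
    static long hashJoin(long rightCardinality, long rightNdv, double avgRowSize) {
      long dataRows = (long) (rightCardinality * avgRowSize);
      long buckets = roundUpToPowerOf2(rightCardinality) * SIZE_OF_BUCKET;
      long duplicateNodes = (rightCardinality - rightNdv) * SIZE_OF_DUPLICATE_NODE;
      return dataRows + buckets + duplicateNodes;
    }

    // KuduScanNode: only the columns referenced by the query are counted.
    static long kuduScan(int scannedColumns, long bytesPerColumn, int maxScannerThreads) {
      return (long) scannedColumns * bytesPerColumn * maxScannerThreads;
    }

    static long roundUpToPowerOf2(long v) {
      return v <= 1 ? 1 : Long.highestOneBit(v - 1) << 1;
    }
  }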
UnitTest
1. CardinalityTest adds test cases to test memory estimation.
Modified existing test cases related to memory estimation.
Change-Id: Ic01db168ff2c6d6de33ee553a8175599f035d7a1
Reviewed-on: http://gerrit.cloudera.org:8080/16842
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds a new version of the pre-existing partition
key scan optimization that always returns correct
results, even when files have zero rows. This new
version is enabled by default. The old optimization,
which does a metadata-only query, is still available
behind the OPTIMIZE_PARTITION_KEY_SCANS query option.
The new version of the optimization must scan the files
to see if they are non-empty. Instead of using metadata
only, the planner instructs the backend to short-circuit HDFS
scans after a single row has been returned from each
file. This gives results equivalent to returning all
the rows from each file, because all rows in the file
belong to the same partition and therefore have identical
values for the partition key columns.
Planner cardinality estimates are adjusted accordingly
to enable potentially better plans and other optimisations
like disabling codegen.
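A minimal sketch of that cardinality adjustment, assuming the estimate
is simply capped at the number of scanned files (hypothetical helper,
not the actual planner code):

  class PartitionKeyScanSketch {
    // Each file contributes at most one row, so cap the estimate at the number
    // of scanned files; a negative input cardinality is treated as "unknown".
    static long estimateCardinality(long inputCardinality, long numScannedFiles) {
      if (inputCardinality < 0) return numScannedFiles;
      return Math.min(inputCardinality, numScannedFiles);
    }
  }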
We make some effort to avoid generating extra scan ranges
for remote scans by only generating one range per remote
file.
The backend optimisation is implemented by constructing a
row batch with capacity for a single row only and then
terminating each scan range once a single row has been
produced. Both Parquet and ORC have optimized code paths
for zero-slot table scans, which means this will only result
in a footer read. (Other file formats still need to read
some portion of the file, but can terminate early once
one row has been produced.)
This should be quite efficient in practice with file handle
caching and data caching enabled, because it then only
requires reading the footer from the cache for each file.
The partition key scan optimization is also slightly
generalised to apply to scans of unpartitioned tables
where no slots are materialized.
Also fixed a limitation where the optimization did not apply
to multiple grouping classes.
Limitations:
* This still scans every file in the partition. I.e. there is
no short-circuiting if a row has already been found in the
partition by the current scan node.
* Resource reservations and estimates for the scan node do
not all take this optimisation into account, so they are
conservative - they assume the whole file is scanned.
Testing:
* Added end-to-end tests that execute the query on all
HDFS file formats and verify that the correct number of rows
flow through the plan.
* Added planner test based on the existing partition key
scan test.
* Added test to make sure single node optimisation kicks in
when expected.
* Added test for cardinality estimates with and without stats.
* Added test for unpartitioned tables.
* Added planner test that checks that optimisation is enabled
for multiple aggregation classes.
* Added a targeted perf test.
Change-Id: I26c87525a4f75ffeb654267b89948653b2e1ff8c
Reviewed-on: http://gerrit.cloudera.org:8080/13993
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If all primary key columns of the Kudu table are in equivalence
predicates pushed down to Kudu, Kudu will return at most one row.
In this case, we can adjust the cardinality estimation to speed
up point lookup.
This patch sets the input and output cardinality to 1 if the
number of primary key columns in equivalence predicates pushed
down to Kudu equals the total number of primary key columns of
the Kudu table, hence enabling the small query optimization.
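A minimal sketch of the rule (hypothetical names, not the actual
planner code):

  class KuduPointLookupSketch {
    // If equality predicates pushed down to Kudu cover every primary key column,
    // Kudu returns at most one row, so the scan cardinality is set to 1.
    static long scanCardinality(long defaultEstimate,
        int pkColumnsInPushedEqualityPredicates, int totalPkColumns) {
      if (totalPkColumns > 0 && pkColumnsInPushedEqualityPredicates == totalPkColumns) {
        return 1;
      }
      return defaultEstimate;
    }
  }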
Testing:
- Added test cases in the following PlannerTests: small-query-opt.test,
disable-codegen.test and kudu.test.
- Passed all FE tests, including new test cases.
Change-Id: I4631cd4d1a528a1152b5cdcb268426f2ba1a0c08
Reviewed-on: http://gerrit.cloudera.org:8080/15250
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-5036 added an optimisation for count(star) in Parquet scans
that avoids materialising dummy rows. This change provides a similar
optimization for Kudu tables.
Instead of materializing empty rows when computing count star, we use
the NumRows field from the Kudu API. The Kudu scanner tuple is
modified to have one slot into which we will write the
num rows statistic. The aggregate function is changed from count to a
special sum function that gets initialized to 0.
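Conceptually the aggregation then reduces to summing the per-scan-token
NumRows values, with the sum starting at zero. The sketch below is
illustrative only, not the actual implementation:

  class KuduCountStarSketch {
    // Each scan token writes its NumRows statistic into a single BIGINT slot;
    // the aggregate sums those values, starting from 0 rather than NULL so
    // that an empty table still yields a count of 0.
    static long countStar(long[] numRowsPerScanToken) {
      long total = 0;  // "sum initialized to zero" behaviour
      for (long numRows : numRowsPerScanToken) total += numRows;
      return total;
    }
  }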
Tests:
* Added end-to-end tests
* Added planner tests
* Ran performance tests on the tpch.lineitem Kudu table at scale
factor 25, on 1 node, with mt_dop set to 1, just to measure
the speedup gained when scanning. Counting the rows took around
400ms before the optimization and around 170ms after.
Change-Id: Ic99e0f954d0ca65779bd531ca79ace1fcb066fb9
Reviewed-on: http://gerrit.cloudera.org:8080/14347
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Cardinality is vital to understanding why a plan has the form it does,
yet the planner normally emits cardinality information only for the
detailed levels. Unfortunately, most query profiles we see are at the
standard level without this information (except in the summary table),
making it hard to understand what happened.
This patch adds cardinality to the standard EXPLAIN output. It also
changes the displayed cardinality value to be in abbreviated "metric"
form: 1.23K instead of 1234, etc.
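For illustration, the abbreviated form can be produced by a formatter
along these lines (a sketch; the exact rounding and unit suffixes used
by the planner may differ):

  class MetricFormatSketch {
    // Print a cardinality in abbreviated "metric" form, e.g. 1234 -> "1.23K".
    static String printMetric(long value) {
      String[] units = {"", "K", "M", "B", "T"};
      double v = value;
      int unit = 0;
      while (Math.abs(v) >= 1000 && unit < units.length - 1) {
        v /= 1000;
        unit++;
      }
      return unit == 0 ? Long.toString(value) : String.format("%.2f%s", v, units[unit]);
    }
  }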
Changing the EXPLAIN output has a huge impact on PlannerTest: all the
"golden" test files must change. To avoid doing this twice, this patch
also includes:
IMPALA-7919: Add predicates line in plan output for partition key
predicates
This is also the time to include:
IMPALA-8022: Add cardinality checks to PlannerTest
The comparison code was changed to allow a set of validators, one of
which compares cardinality to ensure it is within 5% of the expected
value. This should ensure we don't change estimates unintentionally.
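A sketch of such a validator (simplified; the real test harness check
may differ in details):

  class CardinalityValidatorSketch {
    // Accept the planned cardinality if it is within 5% of the expected value.
    static boolean matches(long expected, long actual) {
      if (expected == actual) return true;  // also handles expected == 0
      return Math.abs(actual - expected) <= Math.abs(expected) * 0.05;
    }
  }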
While many planner tests are concerned with cardinality, many others are
not. Testing showed that the cardinality is actually unstable in some
tests. For such tests, filters were added to ignore cardinality. The filter
is enabled by default (for backward compatibility) but disabled (to
allow cardinality verification) for the critical tests.
Rebasing the tests was complicated by a bug in the error-matching code,
so this patch also fixes:
IMPALA-8023: Fix PlannerTest to handle error lines consistently
Now, the error output written to the output "save results" file matches
that expected in the "golden" file -- no more handling these specially.
Testing:
* Added cardinality verification.
* Reran all FE tests.
* Rebased all PlannerTest .test files.
* Adjusted the metadata/test_explain.py test to handle the changed
EXPLAIN output.
Change-Id: Ie9aa2d715b04cbb279aaffec8c5692686562d986
Reviewed-on: http://gerrit.cloudera.org:8080/12136
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds the resource estimates for key benchmark workloads:
TPC-H, TPC-DS, TPC-H Nested and TPC-H Kudu to the planner
test so that we can track changes in resource requirements
and estimates for these queries.
Also don't show decimal places for MB and KB estimates. The
estimates are not accurate to that level and displaying
extra precision has some disadvantages:
* It communicates to readers that the estimates have a high level of
precision.
* It increases the odds of small variations in file sizes, etc.,
causing test failures.
Also fixed a regex in the stress test that didn't escape the decimal
point correctly.
Testing:
Ran core tests.
Change-Id: I6a9f836699200ea87fb03bf36abad0e23949ac26
Reviewed-on: http://gerrit.cloudera.org:8080/11087
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This only factors in fragment execution threads. E.g. this does *not*
try to account for the number of threads on the old Thrift RPC
code path if that is enabled.
This is loosely related to the old VCores estimate, but is different in
that it:
* Directly ties into the notion of required threads in
ThreadResourceMgr.
* Is a strict upper bound on the number of such threads, rather than
an estimate.
Does not include "optional" threads. ThreadResourceMgr in the backend
bounds the number of "optional" threads per impalad, so the number of
execution threads on a backend is limited by
sum(required threads per query) +
CpuInfo::num_cores() * FLAGS_num_threads_per_core
DCHECKS in the backend enforce that the calculation is correct. They
were actually hit in KuduScanNode because of some races in thread
management leading to multiple "required" threads running. Now the
first thread in the multithreaded scans never exits, which means
that it's always safe for any of the other threads to exit early,
which simplifies the logic a lot.
Testing:
Updated planner tests.
Ran core tests.
Change-Id: I982837ef883457fa4d2adc3bdbdc727353469140
Reviewed-on: http://gerrit.cloudera.org:8080/10256
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This has two related changes.
IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.
For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoids the need to
estimate the ideal reservation in the planner.
We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.
IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that < 8MB columns are generally common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
can increase the reservation so we can do "efficient" 8MB I/Os for
large columns.
* The ideal reservation is not available - query performance is affected
because we can't overlap I/O and compute as much and may do smaller
I/Os (probably 4MB). However, we should avoid pathological behaviour
like tiny I/Os.
When stats are not available, we just default to reserving 4MB per
column, which typically is more memory than required. When stats are
available, the memory requirement can be reduced below 4MB when the
heuristics tell us with high confidence that the column data for most
or all files is smaller than 4MB.
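A hedged sketch of that per-column logic (the helper and the exact
clamping are assumptions, not the actual code):

  class ColumnReservationSketch {
    static final long DEFAULT_COLUMN_RESERVATION = 4L * 1024 * 1024;  // 4MB default

    // Without stats, fall back to the 4MB default; with stats, reserve the
    // (clamped) estimated on-disk size when it is confidently below 4MB.
    static long perColumnReservation(boolean statsAvailable, long estimatedColumnBytes,
        long minBufferSize) {
      if (!statsAvailable) return DEFAULT_COLUMN_RESERVATION;
      long estimate = Math.max(estimatedColumnBytes, minBufferSize);
      return Math.min(estimate, DEFAULT_COLUMN_RESERVATION);
    }
  }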
The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).
Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.
Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Looped the test for a while to flush out
flakiness.
Added targeted planner and query tests for reservation calculations and
increases.
Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the following squashed patches that were reverted.
I will fix the known issues with some follow-on patches.
======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation
In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.
* Remove the free buffer cache (which will be replaced by buffer pool's
own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
interfaces - previously callers of ReturnBuffer() were fudging
the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
'is_cancelled_'.
* Clarify the logic around calling Close() for the last
BufferDescriptor.
-> There appeared to be an implicit assumption that buffers would be
freed in the order they were returned from the scan range, so that
the "eos" buffer was returned last. Instead just count the number
of outstanding buffers to detect the last one.
-> Touching the is_cancelled_ field without holding a lock was hard to
reason about - it violated the locking rules and it was unclear
whether it was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
inline at the callsites.
This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.
Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight
======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront
This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.
One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.
There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.
One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.
I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.
Testing:
* Ran core and exhaustive tests.
======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
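One plausible shape of that frontend heuristic, purely as a sketch -
the clamping and constants are assumptions, not the actual formula:

  class ParquetReservationSketch {
    // Split the largest file's size evenly across the scanned columns, clamp
    // each share to a sane buffer range, and sum to get the scan reservation.
    static long scanReservation(long largestFileBytes, int numScannedColumns,
        long minBufferSize, long maxBufferSize) {
      if (numScannedColumns == 0) return minBufferSize;  // e.g. zero-slot scans
      long perColumn = largestFileBytes / numScannedColumns;
      perColumn = Math.max(minBufferSize, Math.min(perColumn, maxBufferSize));
      return perColumn * (long) numScannedColumns;
    }
  }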
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"
This reverts commit 25bcb258df.
Revert "IMPALA-6588: don't add empty list of ranges in text scan"
This reverts commit d57fbec6f6.
Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"
This reverts commit 24b4ed0b29.
Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"
This reverts commit 5699b59d0c.
Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"
This reverts commit 65680dc421.
Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds changes to the planner to account for memory used by
bloom filters at the fragment instance level. Also adds changes to
allocate memory for those bloom filters from the buffer pool.
Testing:
- Modified planner tests and end-to-end tests to account for memory
reservation for the runtime filters.
- Modified backend tests and benchmarks to use the buffer pool for
bloom filter allocation.
- Added an end-to-end test.
- Ran rest of the core tests.
Change-Id: Iea2759665fb2e8bef9433014a8d42a7ebf99ce1f
Reviewed-on: http://gerrit.cloudera.org:8080/8971
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
The metadata-only scan doesn't allocate I/O buffers, contrary to
an assumption of the memory estimation code in the planner.
This fix also sets a floor on the memory estimate, to avoid
estimating 0 bytes. 1MB seems like a reasonable approximation:
I ran metadata-only scans on a few different data sizes and
saw numbers from 128KB to 1MB.
The estimate is now much closer to actual consumption
(it was 80MB before):
[localhost:21000] > select count(*) from tpch_parquet.lineitem; summary;
Query: select count(*) from tpch_parquet.lineitem
Query submitted at: 2017-08-23 11:58:29 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=cb4b8d41fc838c9a:c5496ff300000000
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
Fetched 1 row(s) in 0.13s
+--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
| 03:AGGREGATE | 1 | 168.49us | 168.49us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 30.11ms | 30.11ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 2.05us | 6.14us | 3 | 1 | 20.00 KB | 10.00 MB | |
| 00:SCAN HDFS | 3 | 4.58ms | 4.72ms | 3 | 6.00M | 128.00 KB | 1.00 MB | tpch_parquet.lineitem |
+--------------+--------+----------+----------+-------+------------+-----------+---------------+-----------------------+
Testing:
Updated affected planner tests.
Change-Id: Iaf5c2316bef2afae54a94245c715534ed294f286
Reviewed-on: http://gerrit.cloudera.org:8080/7783
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Adds support for a "max_row_size" query option that instructs Impala
to reserve enough memory to process rows of the specified size. For
spilling operators, the planner reserves enough memory to process
rows of this size. The advantage of this compared to simply
specifying larger values for min_spillable_buffer_size and
default_spillable_buffer_size is that operators may be able to
handle larger rows without increasing the size of all their
buffers.
The default value is 512KB. I picked that number because it doesn't
increase minimum reservations *too* much even with smaller buffers
like 64KB, but should be large enough for almost all reasonable
workloads.
This is implemented in the aggs and joins using the variable page size
support added to BufferedTupleStream in an earlier commit. The synopsis
is that each stream requires reservation for one default-sized page
per read and write iterator, and temporarily requires reservation
for a max-sized page when reading or writing larger pages. The
max-sized write reservation is released immediately after the row
is appended and the max-size read reservation is released after
advancing to the next row.
The sorter and analytic simply use max-sized buffers for all pages
in the stream.
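A sketch of that reservation arithmetic (hypothetical helper,
simplified from the description above):

  class StreamReservationSketch {
    // One default-sized page per read and per write iterator, plus a transient
    // max-sized page while a larger-than-default row is being read or written.
    // The sorter and analytic instead use max-sized pages for every page.
    static long minReservation(int readIterators, int writeIterators,
        long defaultPageSize, long maxPageSize, boolean handlingLargeRow) {
      long reservation = (long) (readIterators + writeIterators) * defaultPageSize;
      if (handlingLargeRow) reservation += maxPageSize;  // released when the row is done
      return reservation;
    }
  }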
Testing:
Updated existing planner tests to reflect default max_row_size. Added
new planner tests to test the effect of the query option.
Added "set" test to check validation of query option.
Added end-to-end tests exercising spilling operators with large rows
with and without spilling induced by SET_DENY_RESERVATION_PROBABILITY.
Change-Id: Ic70f6dddbcef124bb4b329ffa2e42a74a1826570
Reviewed-on: http://gerrit.cloudera.org:8080/7629
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This is implemented in the ResourceProfileBuilder to avoid duplicating
the logic in every plan node.
Testing:
Updated planner tests.
Change-Id: I1e2853300371e31b13d81a763dbafb21709b16c4
Reviewed-on: http://gerrit.cloudera.org:8080/7703
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Before this change, the per-host reservation size was computed
by the Planner. However, scheduling happens after planning,
so the Planner must assume that all fragments run on all
hosts, and the reservation size is likely much larger than
it needs to be.
This moves the computation of the per-host reservation size
to the BE where it can be computed more precisely. This also
includes a number of plan/profile changes.
Change-Id: Idbcd1e9b1be14edc4017b4907e83f9d56059fbac
Reviewed-on: http://gerrit.cloudera.org:8080/7630
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.
Port ExecNodes to use BufferPool:
* Each ExecNode has to claim its reservation during Open()
* Port Sorter to use BufferPool.
* Switch from BufferedTupleStream to BufferedTupleStreamV2
* Port HashTable to use BufferPool via a Suballocator.
This also makes PAGG memory consumption more efficient (avoids wasting
buffers) and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)
TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.
Testing:
* Updated tests to reflect new memory requirements
Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This moves away from the PipelinedPlanNodeSet approach of enumerating
sets of concurrently-executing nodes because unions would force
creating many overlapping sets of nodes. The new approach computes
the peak resources during Open() and the peak resources between Open()
and Close() (i.e. while calling GetNext()) bottom-up for each plan node
in a fragment. The fragment resources are then combined to produce the
query resources.
The basic assumptions for the new resource estimates are:
* resources are acquired during or after the first call to Open()
and released in Close().
* Blocking nodes call Open() on their child before acquiring
their own resources (this required some backend changes).
* Blocking nodes call Close() on their children before returning
from Open().
* The peak resource consumption of the query is the sum of the
independent fragments (except for the parallel join build plans
where we can assume there will be synchronisation). This is
conservative but we don't synchronise fragment Open() and Close()
across exchanges so can't make stronger assumptions in general.
Also compute the sum of minimum reservations. This will be useful
in the backend to determine exactly when all of the initial
reservations have been claimed from a shared pool of initial reservations.
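A much-simplified sketch of the bottom-up combination under the
assumptions above, for a node with at most one child (illustrative
only, not the actual algorithm):

  class ResourceProfileSketch {
    static final class Profile {
      final long peakDuringOpen;  // peak while Open() is running
      final long peakAfterOpen;   // peak between Open() returning and Close()
      Profile(long open, long afterOpen) {
        peakDuringOpen = open;
        peakAfterOpen = afterOpen;
      }
    }

    // 'own' is the node's own peak consumption once its resources are acquired;
    // 'blocking' means the node fully consumes and closes its child inside Open().
    static Profile combine(long own, Profile child, boolean blocking) {
      if (child == null) return new Profile(own, own);
      long open = Math.max(child.peakDuringOpen, child.peakAfterOpen + own);
      return new Profile(open, blocking ? own : child.peakAfterOpen + own);
    }
  }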
Testing:
* Updated planner tests to reflect behavioural changes.
* Added extra resource requirement planner tests for unions, subplans,
pipelines of blocking operators, and bushy join plans.
* Added single-node plans to resource-requirements tests. These have
more complex plan trees inside a single fragment, which is useful
for testing the peak resource requirement logic.
Change-Id: I492cf5052bb27e4e335395e2a8f8a3b07248ec9d
Reviewed-on: http://gerrit.cloudera.org:8080/7223
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Instead of materializing empty rows when computing count star, we use
the data stored in the Parquet RowGroup.num_rows field. The Parquet
scanner tuple is modified to have one slot into which we will write the
num rows statistic. The aggregate function is changed from count to a
special sum function that gets initialized to 0. We also add a rewrite
rule so that count(<literal>) is rewritten to count(*) in order to make
sure that this optimization is applied in all cases.
Testing:
- Added functional and planner tests
Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Reviewed-on: http://gerrit.cloudera.org:8080/6812
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
This is similar to the single-node execution optimisation, but applies
to slightly larger queries that should run in a distributed manner but
won't benefit from codegen.
This adds a new query option disable_codegen_rows_threshold that
defaults to 50,000. If fewer than this number of rows are processed
by a plan node per impalad, the cost of codegen almost certainly
outweighs the benefit.
Using rows processed as a threshold is justified by a simple
model that assumes the cost of codegen and execution per row for
the same operation are proportional. E.g. if x is the complexity
of the operation, n is the number of rows processed, C is a
constant factor giving the cost of codegen and Ec/Ei are constant
factor giving the cost of codegen'd and interpreted execution and
d, then the cost of the codegen'd operator is C * x + Ec * x * n
and the cost of the interpreted operator is Ei * x * n. Rearranging
means that interpretation is cheaper if n < C / (Ei - Ec), i.e. that
(at least with the simplified model) it makes sense to choose
interpretation or codegen based on a constant threshold. The
model also implies that it is somewhat safer to choose codegen
because the additional cost of codegen is O(1) but the additional
cost of interpretation is O(n).
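Written out as code, the model and its break-even point look like this
(hypothetical names, just restating the arithmetic above):

  class CodegenCostModelSketch {
    // Codegen'd cost = C*x + Ec*x*n; interpreted cost = Ei*x*n. The break-even
    // point n = C / (Ei - Ec) does not depend on the complexity x, so a constant
    // row-count threshold is a reasonable switch.
    static boolean codegenIsCheaper(double c, double ec, double ei, double x, long n) {
      double codegendCost = c * x + ec * x * n;
      double interpretedCost = ei * x * n;
      return codegendCost < interpretedCost;
    }
  }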
I ran some experiments with TPC-H Q1, varying the input table size, to
determine what the cut-over point where codegen was beneficial was.
The cutover was around 150k rows per node for both text and parquet.
At 50k rows per node disabling codegen was very beneficial - around
0.12s versus 0.24s. To be somewhat conservative I set the default
threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the
cutover tends to be higher because there are plan nodes that process
many fewer than the max rows.
Fixed a couple of minor issues in the frontend - the numNodes_
calculation could return 0 for Kudu, and the single node optimization
didn't correctly handle the case of a scan node with conjuncts, a limit
and missing stats (it considered the estimate still valid).
Testing:
Updated e2e tests that set disable_codegen to also set
disable_codegen_rows_threshold to 0, so that those tests still run both
with and without codegen.
Added an e2e test to make sure that the optimisation is applied in
the backend.
Added planner tests for various cases where codegen should and shouldn't
be disabled.
Perf:
Added a targeted perf test for a join+agg over a small input, which
benefits from this change.
Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e
Reviewed-on: http://gerrit.cloudera.org:8080/7153
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins