This only factors in fragment execution threads. E.g. this does *not*
try to account for the number of threads on the old Thrift RPC
code path if that is enabled.
This is loosely related to the old VCores estimate, but is different in
that it:
* Directly ties into the notion of required threads in
ThreadResourceMgr.
* Is a strict upper bound on the number of such threads, rather than
an estimate.
This does not include "optional" threads. ThreadResourceMgr in the backend
bounds the number of "optional" threads per impalad, so the number of
execution threads on a backend is limited by
sum(required threads per query) +
CpuInfo::num_cores() * FLAGS_num_threads_per_core
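As a rough illustration of that bound (a sketch only, not the actual
ThreadResourceMgr code; the constants below are made-up stand-ins for
CpuInfo::num_cores() and FLAGS_num_threads_per_core):

    #include <cstdint>
    #include <vector>

    // Made-up stand-ins for CpuInfo::num_cores() and
    // FLAGS_num_threads_per_core; the real values come from the backend.
    constexpr int64_t kNumCores = 16;
    constexpr int64_t kNumThreadsPerCore = 3;

    // Upper bound on fragment execution threads on one impalad: required
    // threads are always admitted, while "optional" threads are capped at
    // num_cores * num_threads_per_core across all queries.
    int64_t MaxExecutionThreads(const std::vector<int64_t>& required_per_query) {
      int64_t total_required = 0;
      for (int64_t n : required_per_query) total_required += n;
      return total_required + kNumCores * kNumThreadsPerCore;
    }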
DCHECKs in the backend enforce that the calculation is correct. They
were actually hit in KuduScanNode because of some races in thread
management that led to multiple "required" threads running. Now the
first thread in the multithreaded scans never exits, which means
that it's always safe for any of the other threads to exit early;
this simplifies the logic a lot.
Testing:
Updated planner tests.
Ran core tests.
Change-Id: I982837ef883457fa4d2adc3bdbdc727353469140
Reviewed-on: http://gerrit.cloudera.org:8080/10256
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This consists of the following squashed patches, which were previously reverted.
I will fix the known issues with some follow-on patches.
======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation
In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.
* Remove the free buffer cache (which will be replaced by buffer pool's
own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
interfaces - previously callers of ReturnBuffer() were fudging
the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
'is_cancelled_'.
* Clarify the logic around calling Close() for the last
BufferDescriptor.
-> There appeared to be an implicit assumption that buffers would be
freed in the order they were returned from the scan range, so that
the "eos" buffer was returned last. Instead just count the number
of outstanding buffers to detect the last one (see the sketch after
this list).
-> Touching the is_cancelled_ field without holding a lock was hard to
reason about - it violated the locking rules and it was unclear that
it was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
inline at the callsites.
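To make the last-buffer logic above concrete, here is a minimal sketch of the
counting approach (illustrative only; the member and method names are invented
and the real ScanRange does considerably more under its lock):

    #include <mutex>

    // Sketch: detect the final buffer by counting outstanding buffers
    // rather than assuming the "eos" buffer is returned last.
    class ScanRangeSketch {
     public:
      // A buffer was handed out to the reader.
      void BufferEnqueued() {
        std::lock_guard<std::mutex> l(lock_);
        ++num_buffers_outstanding_;
      }

      // The "end of scan range" buffer has been produced.
      void SetEosEnqueued() {
        std::lock_guard<std::mutex> l(lock_);
        eos_enqueued_ = true;
      }

      // A buffer came back. Returns true if it was the last outstanding
      // buffer and the range can therefore be cleaned up / closed.
      bool BufferReturned() {
        std::lock_guard<std::mutex> l(lock_);
        --num_buffers_outstanding_;
        return num_buffers_outstanding_ == 0 && eos_enqueued_;
      }

     private:
      std::mutex lock_;
      int num_buffers_outstanding_ = 0;
      bool eos_enqueued_ = false;
    };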
This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.
Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight
======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront
This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.
One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.
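A minimal sketch of that blocking behaviour (names invented; the real DiskIoMgr
tracks this per disk queue, under its own locks, and guards against duplicate
enqueues):

    #include <deque>
    #include <mutex>
    #include <set>

    // Sketch: all in-flight ranges live in 'active_'; only ranges that
    // currently have a free buffer to fill sit in 'unblocked_'. A range
    // that runs out of buffers simply isn't re-enqueued; returning a
    // buffer puts it back so a disk thread can pick it up again.
    class DiskQueueSketch {
     public:
      void AddRange(int range_id) {
        std::lock_guard<std::mutex> l(lock_);
        active_.insert(range_id);
        unblocked_.push_back(range_id);
      }

      // Disk thread takes the next range that has a buffer available.
      bool GetNextUnblockedRange(int* range_id) {
        std::lock_guard<std::mutex> l(lock_);
        if (unblocked_.empty()) return false;
        *range_id = unblocked_.front();
        unblocked_.pop_front();
        return true;
      }

      // Client returned a buffer: if the range is still active, it can be
      // scheduled again.
      void ReturnBuffer(int range_id) {
        std::lock_guard<std::mutex> l(lock_);
        if (active_.count(range_id) > 0) unblocked_.push_back(range_id);
      }

      void RemoveRange(int range_id) {
        std::lock_guard<std::mutex> l(lock_);
        active_.erase(range_id);
      }

     private:
      std::mutex lock_;
      std::set<int> active_;
      std::deque<int> unblocked_;
    };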
There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.
One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.
I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.
Testing:
* Ran core and exhaustive tests.
======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
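As an illustration of that division step (a sketch only - the proportional
split and rounding below are simplifications, not the actual scanner logic),
the reservation might be divided across columns like this:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch: split 'total_reservation' across columns in proportion to
    // their on-disk sizes, giving every column at least one minimum-sized
    // buffer. The real scanner also rounds to I/O buffer sizes and handles
    // reservations that are too small to cover all columns.
    std::vector<int64_t> DivideReservation(const std::vector<int64_t>& col_sizes_on_disk,
                                           int64_t total_reservation,
                                           int64_t min_buffer_size) {
      int64_t total_size = 0;
      for (int64_t s : col_sizes_on_disk) total_size += s;
      std::vector<int64_t> per_column;
      for (int64_t s : col_sizes_on_disk) {
        int64_t share =
            total_size == 0 ? min_buffer_size : total_reservation * s / total_size;
        // Round down to a multiple of the minimum buffer size, but never
        // below one minimum buffer per column.
        share = std::max(min_buffer_size, share - share % min_buffer_size);
        per_column.push_back(share);
      }
      return per_column;
    }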
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"
This reverts commit 25bcb258df.
Revert "IMPALA-6588: don't add empty list of ranges in text scan"
This reverts commit d57fbec6f6.
Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"
This reverts commit 24b4ed0b29.
Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"
This reverts commit 5699b59d0c.
Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"
This reverts commit 65680dc421.
Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds changes to the planner to account for memory used by
bloom filters at the fragment instance level. Also adds changes to
allocate memory for those bloom filters from the buffer pool.
Testing:
- Modified planner tests and end-to-end tests to account for memory
reservation for the runtime filters.
- Modified backend tests and benchmarks to use the buffer pool for
bloom filter allocation.
- Added an end-to-end test.
- Ran the rest of the core tests.
Change-Id: Iea2759665fb2e8bef9433014a8d42a7ebf99ce1f
Reviewed-on: http://gerrit.cloudera.org:8080/8971
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
Before this change, the per-host reservation size was computed
by the Planner. However, scheduling happens after planning,
so the Planner must assume that all fragments run on all
hosts, and the reservation size is likely much larger than
it needs to be.
This moves the computation of the per-host reservation size
to the BE where it can be computed more precisely. This also
includes a number of plan/profile changes.
Change-Id: Idbcd1e9b1be14edc4017b4907e83f9d56059fbac
Reviewed-on: http://gerrit.cloudera.org:8080/7630
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.
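The commit message doesn't spell out how the planner combines these two
options; purely as an illustration (the clamping and power-of-two scaling
below are assumptions, not the actual selection logic, which is in the Java
frontend), the interaction could look something like:

    #include <algorithm>
    #include <cstdint>

    // Illustrative only: pick a spill buffer size between the minimum and
    // default spillable buffer sizes, scaling up in powers of two towards
    // an operator's estimated data size.
    int64_t ChooseSpillBufferSize(int64_t estimated_bytes_per_partition,
                                  int64_t min_spillable_buffer_size,
                                  int64_t default_spillable_buffer_size) {
      int64_t size = min_spillable_buffer_size;
      while (size < estimated_bytes_per_partition &&
             size < default_spillable_buffer_size) {
        size *= 2;
      }
      return std::min(size, default_spillable_buffer_size);
    }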
Port ExecNodes to use BufferPool:
* Each ExecNode has to claim its reservation during Open()
* Port Sorter to use BufferPool.
* Switch from BufferedTupleStream to BufferedTupleStreamV2
* Port HashTable to use BufferPool via a Suballocator.
This also makes PAGG memory consumption more efficient (avoids wasting buffers)
and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
cannot be allocated, they will pass through rows (see the sketch after this list).
* Halve the buffer requirement for spilling aggs - avoid allocating
buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)
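A tiny sketch of the zero-reservation preagg behaviour referenced above (names
invented; the real streaming pre-aggregation keeps much more state):

    #include <vector>

    struct RowBatchSketch { std::vector<int> rows; };

    // Sketch: a streaming preagg that degrades to pass-through when it
    // could not get memory for its hash table / streams.
    class PreaggSketch {
     public:
      // 'got_reservation' stands in for whether the buffer pool
      // allocation of streams and hash tables succeeded.
      explicit PreaggSketch(bool got_reservation)
          : pass_through_(!got_reservation) {}

      // Either aggregates the batch or, with zero reservation, forwards
      // the input unchanged to the finalizing (merge) aggregation.
      RowBatchSketch AddBatch(const RowBatchSketch& input) {
        if (pass_through_) return input;
        return Aggregate(input);
      }

     private:
      RowBatchSketch Aggregate(const RowBatchSketch& input) {
        // The real code would update the hash table and emit
        // partially-aggregated rows; omitted here.
        return input;
      }

      bool pass_through_;
    };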
TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.
Testing:
* Updated tests to reflect new memory requirements
Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This moves away from the PipelinedPlanNodeSet approach of enumerating
sets of concurrently-executing nodes because unions would force
creating many overlapping sets of nodes. The new approach computes
the peak resources during Open() and the peak resources between Open()
and Close() (i.e. while calling GetNext()) bottom-up for each plan node
in a fragment. The fragment resources are then combined to produce the
query resources.
The basic assumptions for the new resource estimates are:
* resources are acquired during or after the first call to Open()
and released in Close().
* Blocking nodes call Open() on their child before acquiring
their own resources (this required some backend changes).
* Blocking nodes call Close() on their children before returning
from Open().
* The peak resource consumption of the query is the sum of the
independent fragments (except for the parallel join build plans
where we can assume there will be synchronisation). This is
conservative but we don't synchronise fragment Open() and Close()
across exchanges, so we can't make stronger assumptions in general.
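A deliberately simplified sketch of the bottom-up combination per plan node,
following the assumptions above (illustrative only - the real computation
lives in the Java planner, also tracks minimum reservations, and treats the
phases more precisely):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct PlanNodeSketch {
      int64_t own_peak_bytes = 0;           // this node's own peak consumption
      bool blocking = false;                // closes children before leaving Open()
      std::vector<PlanNodeSketch> children;
    };

    struct PeakEstimate {
      int64_t during_open = 0;     // peak while Open() is running
      int64_t during_getnext = 0;  // peak between Open() and Close()
    };

    PeakEstimate ComputePeak(const PlanNodeSketch& node) {
      // Conservatively sum the children's peaks, since streaming children
      // within a fragment can be active at the same time.
      int64_t child_peak = 0;
      for (const PlanNodeSketch& child : node.children) {
        PeakEstimate c = ComputePeak(child);
        child_peak += std::max(c.during_open, c.during_getnext);
      }
      PeakEstimate result;
      if (node.blocking) {
        // Children are consumed and closed inside Open(); only this
        // node's own resources remain while GetNext() runs.
        result.during_open = child_peak + node.own_peak_bytes;
        result.during_getnext = node.own_peak_bytes;
      } else {
        // Streaming: the children stay open, so their consumption
        // overlaps with this node's while GetNext() runs.
        result.during_open = child_peak;
        result.during_getnext = child_peak + node.own_peak_bytes;
      }
      return result;
    }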
Also compute the sum of minimum reservations. This will be useful
in the backend to determine exactly when all of the initial
reservations have been claimed from a shared pool of initial reservations.
Testing:
* Updated planner tests to reflect behavioural changes.
* Added extra resource requirement planner tests for unions, subplans,
pipelines of blocking operators, and bushy join plans.
* Added single-node plans to resource-requirements tests. These have
more complex plan trees inside a single fragment, which is useful
for testing the peak resource requirement logic.
Change-Id: I492cf5052bb27e4e335395e2a8f8a3b07248ec9d
Reviewed-on: http://gerrit.cloudera.org:8080/7223
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Compute the minimum buffer requirement for spilling nodes and
per-host estimates for the entire plan tree.
This builds on top of the existing resource estimation code, which
computes the sets of plan nodes that can execute concurrently. This is
cleaned up so that the process of producing resource requirements is
clearer. It also removes the unused VCore estimates.
Fixes various bugs and other issues:
* computeCosts() was not called for unpartitioned fragments, so
the per-operator memory estimate was not visible.
* Nested loop join was not treated as a blocking join.
* The TODO comment about union was misleading.
* Fix the computation for mt_dop > 1 by distinguishing per-instance and
per-host estimates.
* Always generate an estimate instead of unpredictably returning
-1/"unavailable" in many circumstances - there was little rhyme or
reason to when this happened.
* Remove the special "trivial plan" estimates. With the rest of the
cleanup we generate estimates <= 10MB for those trivial plans through
the normal code path.
I left one bug (IMPALA-4862) unfixed because it is subtle, will affect
estimates for many plans and will be easier to review once we have the
test infra in place.
Testing:
Added basic planner tests for resource requirements in both the MT and
non-MT cases.
Re-enabled the explain_level tests, which appears to be the only
coverage for many of these estimates. Removed the complex and
brittle test cases and replaced with a couple of much simpler
end-to-end tests.
Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
Reviewed-on: http://gerrit.cloudera.org:8080/5847
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The runtime profile as we present it is not very useful and I think its structure
makes it hard to consume. This patch adds a new client-facing, schematized set of
counters that are collected from the runtime profiles. For example, with this structure
it would be easy to have the shell get the stats of a running query and print a useful
progress report, or to check the most relevant metrics when diagnosing issues.
Here's an example of the output for one of the tpch queries:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE 1 79.738us 79.738us 5 5 0 -1.00 B UNPARTITIONED
05:TOP-N 3 84.693us 88.810us 5 5 12.00 KB 120.00 B
04:AGGREGATE 3 5.263ms 6.432ms 5 5 44.00 KB 10.00 MB MERGE FINALIZE
08:AGGREGATE 3 16.659ms 27.444ms 52.52K 600.12K 3.20 MB 15.11 MB MERGE
07:EXCHANGE 3 2.644ms 5.1ms 52.52K 600.12K 0 0 HASH(o_orderpriority)
03:AGGREGATE 3 342.913ms 966.291ms 52.52K 600.12K 10.80 MB 15.11 MB
02:HASH JOIN 3 2s165ms 2s171ms 144.87K 600.12K 13.63 MB 941.01 KB INNER JOIN, BROADCAST
|--06:EXCHANGE 3 8.296ms 8.692ms 57.22K 15.00K 0 0 BROADCAST
| 01:SCAN HDFS 2 1s412ms 1s978ms 57.22K 15.00K 24.21 MB 176.00 MB tpch.orders o
00:SCAN HDFS 3 8s032ms 8s558ms 3.79M 600.12K 32.29 MB 264.00 MB tpch.lineitem l
Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Our new build machines (e.g., beefy) have more cores than our other machines,
so scan nodes may have a different memory estimate, causing the explain tests
to fail. This patch fixes num_scanner_threads to 1 for explain tests
to ensure consistent estimates.
Change-Id: Ie6194f3c3b17d04aa141d04fcddb7ac948e92fcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1735
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1753
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.
The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).
This removes the opcode registry and all of the functionality is subsumed by
the catalog, most of which was already duplicated there anyway.
This also introduces the concept of a system database: a database that the
user cannot modify and that is populated automatically on startup.
Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
The overall goal of this change is to allow table metadata to be loaded in the
background while also allowing loading to be prioritized on an as-needed basis. As
part of analysis, any tables that are not loaded are tracked; if analysis fails, the
Impalad will make an RPC to the CatalogServer to request that the metadata loading
of these tables be prioritized, and analysis will be restarted.
To support this, the CatalogServer now has a deque of the tables to load. For
background loading, tables to load are added to the tail of the deque. However, a new
CatalogServer RPC was added that can prioritize the loading of one or more tables in
which case they will get added to the head of the deque. The next table to load is
always taken from the head. This helps prioritize loading but is admittedly not the most
fair approach.
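A minimal sketch of that deque behaviour (illustrative; the real CatalogServer
code is Java and handles more details):

    #include <deque>
    #include <mutex>
    #include <string>

    // Sketch: background-discovered tables go to the tail, prioritized
    // (RPC-requested) tables go to the head, and the loading threads
    // always take the next table from the head.
    class TableLoadQueueSketch {
     public:
      void AddBackgroundLoad(const std::string& table_name) {
        std::lock_guard<std::mutex> l(lock_);
        queue_.push_back(table_name);
      }

      void PrioritizeLoad(const std::string& table_name) {
        std::lock_guard<std::mutex> l(lock_);
        queue_.push_front(table_name);
      }

      bool GetNextTableToLoad(std::string* table_name) {
        std::lock_guard<std::mutex> l(lock_);
        if (queue_.empty()) return false;
        *table_name = queue_.front();
        queue_.pop_front();
        return true;
      }

     private:
      std::mutex lock_;
      std::deque<std::string> queue_;
    };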
To support the prioritized loading, some changes had to be made on the Impalad side
during analysis:
- During analysis, any tables that are missing metadata are tracked.
- Analysis now runs in a loop. If it fails due to an AnalysisException AND at least one
table/view was missing metadata, the missing tables are requested to be loaded by
calling the CatalogServer.
- The impalad will wait until the required tables are received (it is notified each time
there is a call to updateCatalog()) and will not re-run analysis until all tables are
available. Once the tables are available, analysis will restart.
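A rough sketch of that retry loop (the helpers are stubs standing in for the
real frontend/catalog calls, which are in Java):

    #include <set>
    #include <string>

    struct AnalysisResult {
      bool ok = false;
      std::set<std::string> missing_tables;  // tables whose metadata isn't loaded yet
    };

    // Stubs standing in for the real frontend/catalog interactions.
    AnalysisResult RunAnalysis(const std::string&) { return {}; }
    void PrioritizeLoadRpc(const std::set<std::string>&) {}
    void WaitForCatalogUpdate(const std::set<std::string>&) {}

    // Sketch: analyze in a loop; when analysis fails because some tables
    // have no metadata yet, ask the CatalogServer to prioritize loading
    // them, wait for the catalog update, and try again.
    AnalysisResult AnalyzeWithMissingTableRetry(const std::string& stmt) {
      while (true) {
        AnalysisResult result = RunAnalysis(stmt);
        if (result.ok || result.missing_tables.empty()) return result;
        PrioritizeLoadRpc(result.missing_tables);
        WaitForCatalogUpdate(result.missing_tables);
      }
    }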
This change also introduces two new flags:
--load_catalog_in_background (bool). When this is true (the default) the catalog server
will run a periodic background thread to queue all unloaded tables for loading. This is
generally the desired behavior, but there may be some cases (very large metastores) where
this may need to be disabled.
--num_metadata_loading_threads (int32). The number of threads to use when loading catalog
metadata (degree of parallelism). The default is 16, but it can be increased to improve
performance at the cost of stressing the Hive metastore/HDFS.
Change-Id: Ib94dbbf66ffcffea8c490f50f5c04d19fb2078ad
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1476
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1538
There are now 4 explain levels summarized as follows:
- Level 0: MINIMAL
Non-fragmented parallel plan only showing plan nodes with minimal attributes
- Level 1: STANDARD
Non-fragmented parallel plan with some details in plan nodes
- Level 2: EXTENDED
Non-fragmented parallel plan with full details in plan nodes including
the table/column stats, row size, #hosts, cardinality,
and estimated per-host memory requirement
- Level 3: VERBOSE
Fragmented parallel plan with full details (like level 2)
This patch also includes several bugfixes related to plan costing and/or
testing of explain plans.
Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins