Commit Graph

44 Commits

Tim Armstrong
418c705787 IMPALA-6679,IMPALA-6678: reduce scan reservation
This has two related changes.

IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.

For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoids the need to
estimate the ideal reservation in the planner.
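
As a rough sketch of the idea (illustrative names only, not the actual
code), the ideal reservation can be derived directly from the columns
being scanned, capping each column at the largest "efficient" I/O size:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Sketch: sum the per-column needs, never reserving more for a
  // column than one 8MB "efficient" I/O buffer.
  int64_t ComputeIdealReservation(const std::vector<int64_t>& col_bytes) {
    const int64_t MAX_IO_BUFFER_SIZE = 8LL * 1024 * 1024;
    int64_t ideal = 0;
    for (int64_t bytes : col_bytes) {
      ideal += std::min(bytes, MAX_IO_BUFFER_SIZE);
    }
    return ideal;
  }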

We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.

IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces the reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that columns smaller than 8MB are common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
  can increase the reservation so we can do "efficient" 8MB I/Os for
  large columns.
* The ideal reservation is not available - query performance is affected
  because we can't overlap I/O and compute as much and may do smaller
  (probably 4MB) I/Os. However, we should avoid pathological behaviour
  like tiny I/Os.

When stats are not available, we just default to reserving 4MB per
column, which typically is more memory than required. When stats are
available, the memory required can be reduced below that when heuristics
tell us with high confidence that the column data for most or all files
is smaller than 4MB.
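
A hedged sketch of that reduction (names are illustrative; the real
heuristic lives in the planner):

  #include <algorithm>
  #include <cstdint>

  // The size estimate is biased towards over-estimating, so an estimate
  // below the 4MB default is high-confidence evidence that the column
  // data for most files is small.
  int64_t PerColumnReservation(int64_t est_col_bytes, bool have_stats) {
    const int64_t DEFAULT_COLUMN_RESERVATION = 4LL * 1024 * 1024;
    if (!have_stats) return DEFAULT_COLUMN_RESERVATION;
    return std::min(est_col_bytes, DEFAULT_COLUMN_RESERVATION);
  }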

The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).

Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.

Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Looped the test for a while to flush out
flakiness.

Added targeted planner and query tests for reservation calculations and
increases.

Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
fb5dc9eb48 IMPALA-4835: switch I/O buffers to buffer pool
This reapplies the following squashed patches, which were reverted.

I will fix the known issues with some follow-on patches.

======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation

In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.

* Remove the free buffer cache (which will be replaced by buffer pool's
  own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
  of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
  enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
  function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
  connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
  interfaces - previously callers of ReturnBuffer() were fudging
  the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
  'is_cancelled_'.
* Clarify the logic around calling Close() for the last
  BufferDescriptor.
  -> There appeared to be an implicit assumption that buffers would be
     freed in the order they were returned from the scan range, so that
     the "eos" buffer was returned last. Instead just count the number
     of outstanding buffers to detect the last one.
  -> Touching the is_cancelled_ field without holding a lock was hard to
     reason about - it violated locking rules and it was unclear that it
     was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
  inline at the callsites.

This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.

Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight

======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront

This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.

One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.

There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.

One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.

I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.

Testing:
* Ran core and exhaustive tests.

======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool

This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
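
A simplified sketch of that division (illustrative names; the real
algorithm also rounds to valid buffer sizes and enforces per-column
minimums and maximums):

  #include <cstdint>
  #include <vector>

  // Give each column a share of the scanner's reservation proportional
  // to its on-disk size, so big columns get bigger I/O buffers
  // (overflow handling omitted in this sketch).
  std::vector<int64_t> DivideReservationBetweenColumns(
      const std::vector<int64_t>& col_bytes, int64_t total_reservation) {
    int64_t total_bytes = 0;
    for (int64_t b : col_bytes) total_bytes += b;
    std::vector<int64_t> shares;
    for (int64_t b : col_bytes) {
      shares.push_back(
          total_bytes == 0 ? 0 : total_reservation * b / total_bytes);
    }
    return shares;
  }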

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
161cbe30ff Revert IMPALA-4835 and dependent changes
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"

This reverts commit 25bcb258df.

Revert "IMPALA-6588: don't add empty list of ranges in text scan"

This reverts commit d57fbec6f6.

Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"

This reverts commit 24b4ed0b29.

Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"

This reverts commit 5699b59d0c.

Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"

This reverts commit 65680dc421.

Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-03 04:22:12 +00:00
Tim Armstrong
24b4ed0b29 IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-23 04:17:41 +00:00
Bikramjeet Vig
61d941fad3 IMPALA-6526: Fix spilling test for running on local FS
One of the spilling tests was failing because its minimum buffer pool
memory requirement was higher when run on the local FS than on HDFS.
The fix is to increase the buffer pool limit to a value just above the
minimum so that it still forces spilling to disk on both filesystems.

Testing:
Ran core tests with local FS as target file system. Made sure the
failing test passed.

Change-Id: I50648d7936007a26891cf64d6343c47d9d646596
Reviewed-on: http://gerrit.cloudera.org:8080/9354
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-17 01:40:08 +00:00
Bikramjeet Vig
8fc1eccce4 IMPALA-5519: Allocate fragment's runtime filter memory from Buffer pool
This patch adds changes to the planner to account for memory used by
bloom filters at the fragment instance level. Also adds changes to
allocate memory for those bloom filters from the buffer pool.

Testing:
- Modified Planner Tests and end to end tests to account for memory
  reservation for the runtime filters.
- Modified backend tests and benchmarks to use the bufferpool for
  bloom filter allocation.
- Added an end-to-end test.
- Ran the rest of the core tests.

Change-Id: Iea2759665fb2e8bef9433014a8d42a7ebf99ce1f
Reviewed-on: http://gerrit.cloudera.org:8080/8971
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-13 08:29:03 +00:00
Tim Armstrong
51c7fcd5dc IMPALA-5827: add test for failure to repartition in hash join
Testing:
Ran the test locally.

Change-Id: If6d601f9d15bed4667b50576f07f6216c34ed9c4
Reviewed-on: http://gerrit.cloudera.org:8080/7811
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-25 01:17:46 +00:00
Bikramjeet Vig
b6c02972d6 IMPALA-5788: Fix agg node crash when grouping by nondeterministic exprs
Fixed a bug where Impala crashed during execution of an aggregation
query using nondeterministic grouping expressions. This happens when
it tries to rebuild a spilled partition that can fit in memory and rows
get re-hashed to a partition other than the spilled one due to the use
of nondeterministic expressions.

Testing:
Added a query test to verify successful execution.

Change-Id: Ibdb09239577b3f0a19d710b0d148e882b0b73e23
Reviewed-on: http://gerrit.cloudera.org:8080/7714
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-23 03:59:02 +00:00
Tim Armstrong
852e1bb728 IMPALA-3931: arbitrary fixed-size uda intermediate types
Make many builtin aggregate functions use fixed-length intermediate
types:
* avg()
* ndv()
* stddev(), variance(), etc
* distinctpc(), distinctpcsa()

sample(), appx_median(), histogram() and group_concat() actually
allocate var-len data, so they aren't changed.

This has some major benefits:
* Spill-to-disk works properly with these aggregations.
* Aggregations are more efficient because there is one less pointer
  indirection.
* Aggregations use less memory, because we don't need an extra 12-byte
  StringValue for the indirection.

Adds a special-purpose internal type FIXED_UDA_INTERMEDIATE. The type
is represented in the same way as CHAR - a fixed-size array of bytes,
stored inline in tuples. However, it is not user-visible and does
not support CHAR semantics, i.e. users can't declare tables, functions,
etc. with the type. The pointer and length are passed into aggregate
functions wrapped in a StringVal.
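
For illustration, a hedged sketch of what an update function looks like
under this scheme, assuming the Impala UDF types (FunctionContext,
DoubleVal, StringVal); the real builtins follow this pattern but may
differ in detail:

  // The avg() intermediate is a fixed 16-byte slot stored inline in the
  // tuple; 'dst' is a StringVal whose ptr/len describe that slot.
  struct AvgState {
    double sum;
    int64_t count;
  };

  void AvgUpdate(FunctionContext* ctx, const DoubleVal& src, StringVal* dst) {
    if (src.is_null) return;
    AvgState* avg = reinterpret_cast<AvgState*>(dst->ptr);
    avg->sum += src.val;
    ++avg->count;
  }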

Updates some internal codegen functions to work better with the new
type. E.g. store values directly into the result tuple instead of
via an intermediate stack allocation.

Testing:
This change only affects builtin aggregate functions, for which we
have test coverage already. If we were to allow wider use of this type,
it would need further testing.

Added an analyzer test to ensure we can't use the type for UDAs.

Added a regression test for spilling avg().

Added a regression test for UDA with CHAR intermediate hitting DCHECK.

Perf:
Ran TPC-H locally. TPC-H Q17, which has a high-cardinality AVG(),
improved dramatically.

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(60) | parquet / none / none | 18.44   | -17.54%    | 11.92      | -5.34%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| TPCH(60) | TPCH-Q12 | parquet / none / none | 18.40  | 17.64       | +4.32%     |   0.77%   |   1.09%        | 1           | 5     |
| TPCH(60) | TPCH-Q22 | parquet / none / none | 7.07   | 6.90        | +2.36%     |   0.28%   |   0.30%        | 1           | 5     |
| TPCH(60) | TPCH-Q3  | parquet / none / none | 12.37  | 12.11       | +2.10%     |   0.18%   |   0.15%        | 1           | 5     |
| TPCH(60) | TPCH-Q7  | parquet / none / none | 42.48  | 42.09       | +0.93%     |   2.45%   |   0.80%        | 1           | 5     |
| TPCH(60) | TPCH-Q6  | parquet / none / none | 3.18   | 3.15        | +0.89%     |   0.67%   |   0.76%        | 1           | 5     |
| TPCH(60) | TPCH-Q19 | parquet / none / none | 7.24   | 7.20        | +0.50%     |   0.95%   |   0.67%        | 1           | 5     |
| TPCH(60) | TPCH-Q10 | parquet / none / none | 13.37  | 13.30       | +0.50%     |   0.48%   |   1.39%        | 1           | 5     |
| TPCH(60) | TPCH-Q5  | parquet / none / none | 7.47   | 7.44        | +0.36%     |   0.58%   |   0.54%        | 1           | 5     |
| TPCH(60) | TPCH-Q11 | parquet / none / none | 2.03   | 2.02        | +0.06%     |   0.26%   |   1.95%        | 1           | 5     |
| TPCH(60) | TPCH-Q4  | parquet / none / none | 5.48   | 5.50        | -0.27%     |   0.62%   |   1.12%        | 1           | 5     |
| TPCH(60) | TPCH-Q13 | parquet / none / none | 22.11  | 22.18       | -0.31%     |   0.18%   |   0.55%        | 1           | 5     |
| TPCH(60) | TPCH-Q15 | parquet / none / none | 8.45   | 8.48        | -0.32%     |   0.40%   |   0.47%        | 1           | 5     |
| TPCH(60) | TPCH-Q9  | parquet / none / none | 33.39  | 33.66       | -0.81%     |   0.75%   |   0.59%        | 1           | 5     |
| TPCH(60) | TPCH-Q21 | parquet / none / none | 71.34  | 72.07       | -1.01%     |   1.84%   |   1.79%        | 1           | 5     |
| TPCH(60) | TPCH-Q14 | parquet / none / none | 5.93   | 6.00        | -1.07%     |   0.15%   |   0.69%        | 1           | 5     |
| TPCH(60) | TPCH-Q20 | parquet / none / none | 5.72   | 5.79        | -1.09%     |   0.59%   |   0.51%        | 1           | 5     |
| TPCH(60) | TPCH-Q18 | parquet / none / none | 45.42  | 45.93       | -1.10%     |   1.42%   |   0.50%        | 1           | 5     |
| TPCH(60) | TPCH-Q2  | parquet / none / none | 4.81   | 4.89        | -1.52%     |   1.68%   |   1.01%        | 1           | 5     |
| TPCH(60) | TPCH-Q16 | parquet / none / none | 5.41   | 5.52        | -1.98%     |   0.66%   |   0.73%        | 1           | 5     |
| TPCH(60) | TPCH-Q1  | parquet / none / none | 27.58  | 29.13       | -5.34%     |   0.24%   |   1.51%        | 1           | 5     |
| TPCH(60) | TPCH-Q8  | parquet / none / none | 12.61  | 14.30       | -11.78%    |   6.20%   | * 15.28% *     | 1           | 5     |
| TPCH(60) | TPCH-Q17 | parquet / none / none | 43.74  | 126.58      | I -65.44%  |   1.34%   |   9.60%        | 1           | 5     |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+

Change-Id: Ife90cf27989f98ffb5ef5c39f1e09ce92e8cb87c
Reviewed-on: http://gerrit.cloudera.org:8080/7526
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2017-08-17 03:12:48 +00:00
Tim Armstrong
c4f903033c IMPALA-3200: more buffer pool end-to-end tests
This adds most of the end-to-end tests described in the test plan.
See http://goo.gl/v3Strz.

* End-to-end test for disk spill encryption.
* Admission control test for the case when acquiring initial
  reservation fails.
* Initial reservation acquire failure test
* scratch_limit tests for Join, Agg, Sort, Analytic
* Memory usage scaling tests for Join, Agg, Sort, Analytic

Also splits out the slow sort queries in test_spilling and moves them
to exhaustive so the individual tests run faster and have better
parallelism.

Testing:
Ran all the core tests. Will do a full exhaustive run before
committing.

Change-Id: I554aa5ddfef4f8e75295596e720a14eee1afa17f
Reviewed-on: http://gerrit.cloudera.org:8080/7552
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-07 00:57:46 +00:00
Tim Armstrong
a98b90bd38 IMPALA-4674: Part 2: port backend exec to BufferPool
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
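
A hedged sketch of the claim path (approximate names):

  // During Open(), each node claims its planner-computed minimum
  // reservation from the query-wide InitialReservations object, which
  // was funded centrally when the query started.
  Status ExecNode::ClaimBufferReservation(RuntimeState* state) {
    state->initial_reservations()->Claim(
        &buffer_pool_client_, resource_profile_.min_reservation);
    return Status::OK();
  }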

min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.

Port ExecNodes to use BufferPool:
  * Each ExecNode has to claim its reservation during Open()
  * Port Sorter to use BufferPool.
  * Switch from BufferedTupleStream to BufferedTupleStreamV2
  * Port HashTable to use BufferPool via a Suballocator.

This also makes PAGG memory consumption more efficient (avoids wasting
buffers) and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
  cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
  buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)

TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.

Testing:
* Updated tests to reflect new memory requirements

Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-05 01:03:02 +00:00
Tim Armstrong
7843b472f2 IMPALA-5560: always store CHAR(N) inline in tuple
This is done to simplify the CHAR(N) logic. I believe this is overall an
improvement - any benefits of the out-of-line storage that motivated
this optimisation originally were outweighed by the added complexity.

This also avoids IMPALA-5559 (fe/be have different notions of var-len),
which will unblock IMPALA-3200.

Pros:
* Reduce the number of code paths and improve test coverage.
  (e.g. avoids IMPALA-5559: fe/be have different notions of var-len)
* Reduced memory to store non-NULL data (saves 12-byte StringValue)
* Fewer branches in code -> save CPU cycles.
* If CHAR(N) performance is important, reduced complexity makes it
  easier to implement codegen.

Cons:
* Requires N bytes to store a NULL value.
* May hurt cache locality (although this is speculative in my mind).

The change is mostly mechanical - I removed MAX_CHAR_INLINE_LENGTH
and then removed branches that depended on that.
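
To illustrate the two layouts (the 12-byte figure is from the pros list
above; the types here are schematic, not the actual code):

  // Old: a long CHAR(N) lived out of line, behind a 12-byte StringValue
  // slot in the tuple.
  struct StringValue {
    char* ptr;  // 8 bytes, points at the character data
    int len;    // 4 bytes
  };

  // New: a CHAR(10) slot is simply 10 bytes inside the tuple itself,
  // NULL or not - hence the first con above.
  char char_slot[10];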

Testing:
Ran exhaustive build.

Change-Id: I9c0b823ccff6b0c37f5267c548d096c29b8caac3
Reviewed-on: http://gerrit.cloudera.org:8080/7303
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-30 22:49:40 +00:00
Tim Armstrong
fae36fc77d IMPALA-5497: spilling hash joins that output build rows hit OOM
The bug is that the join tried to bring the next spilled partition into
memory while still holding onto memory from the current partition.
The fix is to return earlier if the output batch is at capacity so
that resources are flushed.

Also reduce some of the redundancy in the loop that drives the spilling
logic and catch some dropped statuses.
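
A hedged, simplified sketch of the fixed control flow (approximate
names):

  // Return as soon as the output batch fills, so the caller consumes it
  // and releases the attached resources *before* we pin the buffers of
  // the next spilled partition.
  if (out_batch->AtCapacity()) return Status::OK();
  RETURN_IF_ERROR(PrepareNextPartition(state));  // only after flushing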

Testing:
The failure was originally reproduced by my IMPALA-4703 patch. I was
able to cause a query failure with the current code by reducing the
memory limit for an existing query. Before the fix, it failed with memory
limits up to 12MB; now it succeeds with 8MB or less.

Ran exhaustive build.

Change-Id: I075388d348499c5692d044ac1bc38dd8dd0b10c7
Reviewed-on: http://gerrit.cloudera.org:8080/7180
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-21 20:56:00 +00:00
Tim Armstrong
96316e3b34 IMPALA-5173: crash with hash join feeding directly into nlj
The background for this bug is that we can't transfer ownership
of BufferedBlockMgr::Blocks that are attached to RowBatches.

The NestedLoopJoinNode accumulates row batches on its right
side and tries to take ownership of the memory, which doesn't
work as expected in this case.

The fix is to copy the data when we encounter one of these
(likely very rare) cases.

Testing:
Added a regression test that produces a crash before the fix and
succeeds after the fix.

Change-Id: I0c04952e591d17e5ff7e994884be4c4c899ae192
Reviewed-on: http://gerrit.cloudera.org:8080/6568
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-04-18 20:11:23 +00:00
Thomas Tauber-Marshall
6a9df54096 IMPALA-3524: Don't process spilled partitions with 0 probe rows
In the partitioned hash join node, if a spilled partition has no probe
rows, building the hash table is unnecessary.

For some build types (right outer, right anti, and full outer), we still
need to process the build side to output unmatched rows (in this case, all
rows since there were no probe rows to match).

Testing: Added some cases to spilling.test. Manually tested these cases
for performance, and they all show around a 6% improvement.

Change-Id: I175b32dd9031e51218b38c37693ac3e31dfab47b
Reviewed-on: http://gerrit.cloudera.org:8080/5389
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-02-06 20:22:33 +00:00
Thomas Tauber-Marshall
d72353d0c9 IMPALA-2932: Extend DistributedPlanner to account for hash table build cost
When deciding between a broadcast or repartition join, Impala calculates
the cost of each join as the total amount of data that is sent over the
network. This ignores some relevant costs, and can lead to bad plans.

One such relevant cost is the work to create the hash table used in the
join. This patch accounts for this by adding the amount of data inserted
into the hash table (the size of the right side of the join) to the
previous cost.

This generally increases the estimated cost of broadcast joins relative
to repartitioning joins, as the broadcast join must build the hash table
on each node the data was broadcast to, so its effect will be to make
repartitioning joins more likely to be chosen, especially in large
clusters.

This patch has not yet been performance tested.

Change-Id: I03a0f56f69c8deae68d48dfdb9dc95b71aec11f1
Reviewed-on: http://gerrit.cloudera.org:8080/4098
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2016-08-29 16:44:22 +00:00
Tim Armstrong
ee53ddb389 IMPALA-1346/1590/2344: fix sorter buffer mgmt when spilling
The Sorter's memory management logic failed to correctly manage buffers
when spilling. It would try to make use of all buffers in the system,
neglecting to account for other operators' buffer usage.

This patch adjusts the logic so that it handles contention for buffers
so long as it can get enough buffers to make progress. Instead of
precalculating the number of buffers it thinks it should be able to
pin, it just makes a best-effort attempt to pin the initial buffers
of as many runs as possible, up to a limit. As long as it can pin three
runs, it can make progress.

Testing:
Added an additional test that failed with OOM before the patch.
An analytic function test that was meant to fail also started succeeding
so I had to adjust the limit there too.

Change-Id: Idfe55cc13c7f2b54cba1d05ade44cbcf6bb573c0
Reviewed-on: http://gerrit.cloudera.org:8080/2908
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-06 17:34:07 -07:00
Tim Armstrong
37ec25396f IMPALA-3344: Simplify sorter and document/enforce invariants.
Clarify relationships between classes, clean up the previous mess
where every class was friends with the others, so there's an actual
distinction between public and private members. TupleIterator
is now no longer tied to TupleSorter, just Run.

Document and enforce invariants in many cases.

Factor out some functions from large functions.

Simplify and document iterator logic.

Make management of buffers when iterating over output stream more
explicitly correct: either use MarkNeedToReturn() or attach block
to the batch as appropriate. The SortedRunMerger didn't handle
resource transfer correctly, except if all the memory came from
the batch's MemPool. This patch fixes the cases when resources
are attached to the batches, but not the 'need_to_return' case.
Document that SortedRunMerger requires 'deep_copy_input' to be true
if batches can have the 'need_to_return' flag set.

Also use the atomic block exchange operation when moving between
blocks in unpinned runs to prevent pin failures at that point.
I explicitly have avoided changing the hairy block management logic
when allocating buffers for merging, that will need addressing in
a follow-up patch.

Add a SpilledRuns counter so that it's more explicit that spilling
occurred.

Testing:
Added some tests for corner cases with empty and NULL strings.
Fixed a test that previously failed with OOM but now succeeds.

Performance:
Benchmarking against old code initially revealed some regressions from
changes in inlining. Force inlining the TupleComparator::operator() and
iterator Next()/Prev() functions helped and performance seems similar or
slightly better on the targeted orderby benchmarks.

Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a
Reviewed-on: http://gerrit.cloudera.org:8080/2826
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-02 21:33:08 -07:00
Michael Ho
f7501d2ec1 IMPALA-3332: Free local allocations in sorter.
Sorter can have runaway memory consumption as it never frees
local allocations made in comparator_.Less(). In addition, it
doesn't check for errors generated during expression evaluation
so it may keep sorting even after failures have occurred.

This change fixes the problem by freeing local allocations for
every n invocations of comparator_.Less() where n is the row
batch size specified in the query options. Various error checks
are also added to return early if any error is encountered.
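
A hedged sketch of the pattern (approximate names):

  // Free expr-local allocations every batch-size comparisons so memory
  // allocated inside comparator_.Less() cannot grow without bound.
  bool TupleSorter::Less(TupleRow* lhs, TupleRow* rhs) {
    if (++num_comparisons_ >= state_->batch_size()) {
      // Hypothetical cleanup hook: frees the comparator's expr-local
      // allocations; the real change also checks for errors here.
      FreeLocalAllocations();
      num_comparisons_ = 0;
    }
    return comparator_.Less(lhs, rhs);
  }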

Change-Id: I941729b4836e5dbb827d4313a0b45bc5df2fa8e1
Reviewed-on: http://gerrit.cloudera.org:8080/3116
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:18 -07:00
Tim Armstrong
34c95c9590 IMPALA-2345,2991: test coverage for spilling and sorts
Add missing coverage for sorting by CHAR and VARCHAR.

Add more coverage for spilling sorts.

Fix spilling tests: ensure that they actually reliably spill (many of
them had memory limits high enough that they could run entirely in
memory).

I ran this in a loop for a while to flush out flaky tests. The tests
should be fairly predictable given that they're not run concurrently
with other tests and we allocate enough block manager memory so that
each operator can obtain its reservation.

Change-Id: Ia2d2627a2c327dcdf269ea3216385b1af9dfa305
Reviewed-on: http://gerrit.cloudera.org:8080/2877
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:55 -07:00
Tim Armstrong
212bea529f IMPALA-2994: Temporary workaround for flaky spilling test
The test was recently reenabled in commit
71a0a7d998702781ae44270f8c742b10c34c0efc.

Continue running the test but loosen the memory limit and don't check
the runtime profile. The memory limits for this set of tests needs
revisiting in any case.

Change-Id: I195e8ad3b67c8ff85d5d15c2646a13f5feb57553
Reviewed-on: http://gerrit.cloudera.org:8080/2183
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 51632f39a45ba9deac9b86bbdb14ff10cbee35ac)
2016-02-17 20:21:57 -08:00
Tim Armstrong
1c102d9d8e Reenable tests that were disabled for IMPALA-1305
A couple of tests were disabled because of IMPALA-1305. Now that the fix
is in, those tests can be reenabled. I ran them in a loop to make sure
that they weren't flaky.

Also fix the spelling mistake in the file name.

Change-Id: I1bfcc619911a92d93b871be3a14852aa11f78da9
Reviewed-on: http://gerrit.cloudera.org:8080/2150
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-13 10:08:13 +00:00
Michael Ho
968c61c940 IMPALA-2824: Restore query options after each test.
A failed test case inside a test file will leave the rest of
the test cases in the file unexecuted. Some test cases may
modify some query options such as memory limit and then
restore them in the subsequent test cases in the same file.
The failure of those test cases will leave the query options
modified, causing cascading failures to other test cases
which aren't expected to be run with the modified query
options (e.g. lowered memory limit). This problem may lead
to broken builds which are recorded in IMPALA-2724 and
IMPALA-2824.

This change fixes the problem above by checking if a test
case modifies any query option and if so, restore those
modified query options to their default values. This change
makes the assumption that a test should not modify an option
specified in its test vector so it's safe to restore the
modified query options to their default values.

Change-Id: Ib88d1dcb6a65183e1afc8eef0c764179a9f6a8ce
Reviewed-on: http://gerrit.cloudera.org:8080/1774
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 03:13:05 +00:00
Michael Ho
ba0bd1d0da IMPALA-2612: Free local allocations once for every row batch when building hash tables.
When building hash tables for the build side in partitioned
hash join or aggregation, we will evaluate the build or probe
side expressions to compute the hash values for each TupleRow.
Evaluation of certain expressions (e.g. CastToChar) requires
"local" memory allocation. "Local" memory allocation is supposed
to be freed after processing each row batch.

However, the calls to free local allocations are missing in
PartitionedHashJoinNode::BuildHashTableInternal() and
PartitionedAggregationNode::ProcessStream(). This causes all
"local" memory allocation to accumulate potentially for the
entire duration of the query or until GetNext() is called.
This may lead to unnecessary memory allocation failure as
memory limit is exceeded.

This patch calls ExecNode::FreeLocalAllocations() at least once
per row-batch when building hash tables. It also adds the missing
checks for the query status in the loop building hash tables.
Please note that QueryMaintenance() isn't called due to its
overhead in memory limit checks.

Change-Id: Idbeab043a45b0aaf6b6a8c560882bd1474a1216d
Reviewed-on: http://gerrit.cloudera.org:8080/1448
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2015-11-26 03:21:46 +00:00
Dan Hecht
84c4c2ce86 IMPALA-2480, IMPALA-2519: Don't force IO-buffer on probe side when spilling PHJ
This fixes a regression introduced with:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce
min mem needed for PAGG/PHJ

Prior to that change, as soon as any partition's stream overflowed
its small buffers, all partitions' streams would be switched
immediately to IO-buffers, which would be satisfied by the initial
buffer "reservation".

After that change, individual streams are switched to IO-buffers on
demand as they overflow their small buffers.  However, that change
also made it so that Partition::Spill() would eagerly switch that
partition's streams to IO-buffers, and fail the query if the buffer
is not available.  The buffer may not be available because the
reserved buffers may be in use by other partition's streams.

We don't need to fail the query if the switch to IO-buffers in
Partition::Spill() fails.  Instead, we should just let the streams
switch on demand as they fill up the small buffers.  When that
happens, if the IO buffer is not available, then we already have a
mechanism to pick partitions to spill until we can get the IO-buffer
(in the worst case it means working our way back down to the initial
reservation).  See AppendRowStreamFull() and BuildHashTables().

The symptom of this regression was that some queries would fail at a
lower memory limit than before.

Also revert the max_block_mgr_memory values back to their originals.

Additional testing: loop custom_cluster/spilling.py.  We should also
remeasure minimum memory required by queries after this change.

Change-Id: I11add15540606d42cd64f2af99f4e96140ae8bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1228
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-10-12 14:41:08 -07:00
Ippokratis Pandis
49b588a714 IMPALA-2265: Sorter was not checking the returned Status of PrepareRead
The sorter was dropping the returned Status of the PrepareRead()
calls on the floor. PrepareRead() tries to Pin() blocks. In some
queries with large sorts, those Pin() calls could fail with OOM,
but because the sorter was ignoring the returned Status it would
happily put the unpinned block in the vector of blocks and eventually
seg fault, because the buffer_desc_ of that block was NULL.

This patch fixes this problem and adds a test that eventually we may
want to move to the exhaustive build because it takes quite some time.
It also changes the comments of the sorter class to the doxygen style.
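
The essence of the fix, sketched with approximate names:

  // Before: the Status was dropped, so an OOM'd Pin() left an unpinned
  // block (buffer_desc_ == NULL) in the vector of blocks.
  //   run->PrepareRead();
  // After: propagate the error instead of crashing later.
  RETURN_IF_ERROR(run->PrepareRead());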

Change-Id: Icad48bcfbb97a68f2d51b015a37a7345ebb5e479
Reviewed-on: http://gerrit.cloudera.org:8080/1156
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:42:03 -07:00
Dan Hecht
df5271e7c1 Fix flaky test_spilling.py test case
The commit:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ

recently lowered a couple of limits from 100m to 40m, which appears to
be too aggressive based on occasional GVM failures. Let's bump
it back up.

Change-Id: I2c3cc24841cf3a305785890329d77e4e9e74f6e5
Reviewed-on: http://gerrit.cloudera.org:8080/1125
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:34 -07:00
Ippokratis Pandis
48699de6e3 IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ
PAGG and PHJ were using an all-or-nothing approach wrt spilling. In
particular, they were trying to switch to IO-sized buffers for both
streams (aggregated and unaggregated in PAGG; build and probe in PHJ)
of every partition (currently 16 partitions for a total of 32
streams), even if some of the streams had very few rows, they were
empty or simply they would not spill so there was no need to allocate
IO-buffers for them. That was increasing the min mem needed by those
operators in many queries.

This patch decouples the decision to switch to IO-buffers for each
stream of each partition. Streams will switch to IO-sized buffers
whenever the rows they contain do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both of its streams to IO buffers.

With this change, many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the min mem requirement. For example,
below is the min mem needed (in MBs) for some of the TPC-H queries.
Some need half or less of the memory they needed before:

  TPC-H Q3: 645 -> 240
  TPC-H Q5: 375 -> 245
  TPC-H Q7: 685 -> 265
  TPC-H Q8: 740 -> 250
  TPC-H Q9: 650 -> 400
  TPC-H Q18: 1100 -> 425
  TPC-H Q20: 420 -> 250
  TPC-H Q21: 975 -> 620

To make this small-buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call of
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. In that case,
previously we would treat it as an indication that there is no memory
left, start spilling a partition and switching all streams to
IO-buffers. Now we make a best effort, trying first to call
SwitchToIoBuffers() and, if that is successful, re-attempting the
AllocateRow() call. See IMPALA-2352 for more details.
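
Sketched with approximate names and signatures, the retry is:

  Tuple* tuple = stream->AllocateRow(tuple_size);
  if (tuple == NULL && stream->using_small_buffers()) {
    // Running out of small buffers is not the same as running out of
    // memory: try upgrading the stream to IO-sized buffers, then retry.
    bool got_buffer;
    RETURN_IF_ERROR(stream->SwitchToIoBuffers(&got_buffer));
    if (got_buffer) tuple = stream->AllocateRow(tuple_size);
  }
  // Only a failure after the switch attempt means we must spill.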

Another change is that now SwitchToIoBuffers() will reset the flag
using_small_buffers_ back to false, in case we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.

With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271 that are essentially stream::using_small_buffers_-related
DCHECKs.

This patch adds all 22 TPC-H queries in test_mem_usage_scaling test
and updates the per-query min mem limits in it. Additionally, it adds
a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some
dead test code.

Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:

	tests/query_test/test_aggregation.py
2015-09-23 11:07:42 -07:00
aacalfa
57dd4d1502 IMPALA-1309: Add support for distinct in group_concat function.
Change-Id: I2790f1d2a7bfd0ecc7ef66cc5d91dafe3414e111
Reviewed-on: http://gerrit.cloudera.org:8080/892
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-09-23 09:42:17 +00:00
Ippokratis Pandis
d58aedff42 IMPALA-1820: Start with small pages for hash tables during repartitioning
The change of the PARTITION_FANOUT from 32 to 16 exposed a pathological case due to
the lack of coordination across concurrently executing spilling nodes of the same query.
In particular, when we repartition a partition we try to initialize hash tables for the
new partitions. But each hash table needs a block (for its nodes). If there were no
IO-sized blocks available, because they had been consumed by other nodes, we would get
into a loop trying to repartition those smaller partitions that couldn't initialize their
hash tables. Each additional repartition would, among other things, need additional
blocks for the new streams. These partitions would end up being very small; still, we
would fail the query upon reaching the MAX_PARTITION_DEPTH limit, which was fixed at 4.

This patch fixes the problem by initializing the hash tables during repartitions with
small pages. That is, the hash tables always first use a 64KB and a 512KB block for their
nodes before switching to IO-sized blocks. This helps the partitioning algorithm to
finish when we end up with partitions that can fit in those small pages. The performance
may not be optimal, but the memory consumption is lower and the algorithm finishes. For
example, without this patch and with PARTITION_FANOUT == 16, running TPC-H Q18 and Q20
needed 3.4GB and 3.1GB respectively. With this patch TPC-H Q18 needs ~1GB and Q20
975MB.

This patch also removes the restriction of stopping repartitioning upon reaching
4 levels of repartitioning. Instead, whenever we repartition we compare the size of
the input partition to the size of the largest new partition. If there is no reduction
in size we stop the algorithm. Otherwise, we keep on repartitioning. That should
help in cases of skew (e.g. due to bad hashing). There is a new MAX_PARTITION_DEPTH limit
of 16. It is very unlikely we will ever hit this limit.
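
A hedged sketch of the new termination condition (approximate names):

  // Stop only when repartitioning fails to shrink the data; otherwise
  // keep going (tolerating skew) up to the new depth limit of 16.
  int64_t largest_child = LargestNewPartition()->build_rows()->num_rows();
  if (largest_child >= input_partition_->build_rows()->num_rows()) {
    return Status("Repartitioning did not reduce the partition size.");
  }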

Change-Id: Ib33fece10585448bc2d07bb39d0535d78b168ccc
Reviewed-on: http://gerrit.cloudera.org:8080/119
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-02-28 00:42:04 +00:00
Alex Behm
f696861c5c Throw error on unrecognized test sections.
Our .test file parser did not abort tests when there
was a malformed test/section. This patch changes that behavior
to report an error and treat the test as failed.

Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.

Arguably, the test file parser should be more flexible in which places
to accept comments, but this patch does not address that problem.

Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-12-02 18:08:09 -08:00
Ippokratis Pandis
87502f829c IMPALA-1471: Bug in spilling of PHJ that was affecting left anti and outer joins.
In cases where we had to spill the probe side of PHJs, we were not only appending
the probe row to the tuple stream to be spilled, but we were also getting into the
regular processing loop with the iterator set to End(). In the case of left anti
and left outer joins, the result was to incorrectly output this row, since it did not
have a match.

This bug had a small perf impact for all spilling joins because we were doing an
unnecessary loop for each probe row we had to spill.

This patch solves the problem by immediately going to the next probe row if the
current row is spilled. Additionally, it fixes a bug in the block mgr where there
was a code path where we did not correctly count the number of pinned buffers.
It also adds tpch-q21 in the set of queries to run in the spilling test.

Change-Id: I762f5c41fe468e4485a4b31dabe2e53f6b49ae24
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5313
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5334
2014-11-20 02:21:14 -08:00
ishaan
23964c19af [CDH5] Fix bad merge in spilling.test
Change-Id: Ia6e30cf5916c737088d8cb969e0167b9d69a599e
2014-10-08 23:19:02 -07:00
Nong Li
de31fa8e21 Disable spilling tests that are too flaky.
Change-Id: I4ac877c3fa8297d873c67f219bb0c75f0001562d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4731
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-10-06 15:18:56 -07:00
Nong Li
e08ffde009 PA/PHJ: Increase fanout to 32 and fix interaction with small buffers.
Small buffers introduced an issue that is exacerbated by the large fanout. A stream can
only be appended to indefinitely once it has grabbed its initial IO-sized buffer. With
small buffers, we don't grab that buffer at the beginning anymore; before this patch, it
was grabbed when the stream first needed it. This means that when one stream needs it,
another stream could already have grabbed it (leaving that stream pinned with multiple
buffers).

This patch has all the streams grab an IO buffer as soon as the first stream needs an
IO buffer. This guarantees that all streams get one before any gets two.

Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
2014-10-06 15:16:16 -07:00
Nong Li
3e632ef6ad Reduce min PA/PHJ mem requirement.
Update PA/PHJ to use small (< IO-sized) buffers initially. Without this we would
not be able to run at the QPS that we need just due to the buffering requirements
of these operators.

Change-Id: Ic8a777d147893567c9590fbab17f561eadb6ee19
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4623
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-10-06 15:14:10 -07:00
ishaan
010cc22a2f [CDH5] Fix test spilling.
The TPC-H dataset in CDH5 does not have double columns. Also, remove round() calls
to test that we get consistent results.

Change-Id: Ia45ef08644ed78b05a08c47422733ab38a26b508
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4595
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-09-26 22:57:02 -07:00
Nong Li
d5c948c351 Increase the mem limit for one of the spilling queries.
Change-Id: I9b52582b2ded82821ecc446762f07d7702dedabf
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4555
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-09-26 12:27:29 -07:00
Nong Li
f03b05ed50 Fix hash table buckets to allocate memory from the BlockMgr.
This was always a TODO. We want memory to come from the block mgr and trigger spilling.

Change-Id: I07f1f79fbbb33068fb2df64510a80a9b008ef73d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4466
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-26 12:26:09 -07:00
Matthew Jacobs
da5198e615 Add spilling test for an analytic fn
Change-Id: Ia93c71c9c2a01f7f04a81593d51f5ca565286b7d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4447
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-09-23 07:26:09 -07:00
Nong Li
8a661d0787 [CDH5] cherry pick conflicts.
Change-Id: Ic11237b7ead4a810b523d6b6095781efbc5bb66b
2014-09-20 19:41:42 -07:00
Nong Li
6b73eec02d PHJ: Fix block management when spilling.
The previous code did not handle well the case where spilling happens while
building the hash table (i.e. partitioning the build rows fit). This caused the
probe partition to be starved, causing queries that should be able to run to fail
with a "not enough buffers" error.

Change-Id: I3a9a84e8800a72ed3ce6f5ab7ff03bc2d6eb7ad8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4403
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-09-20 16:12:21 -07:00
Skye Wanderman-Milne
2a449651da Use CRC hash for 0th partition level.
Change-Id: Ie845e0edb684f13421eea41327b1571b368db21a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4370
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-09-20 16:11:40 -07:00
ishaan
c4b4e010ff Buffered Tuple Stream fixes.
This patch fixes two issues:
  - Add API to buffered block mgr to allow an atomic Unpin and GetNewBlock. This has
    the semantics of unpinning a block and giving the buffer to the new block. This
    is necessary for the tuple stream to make sure another thread does not grab the
    unpinned block in between (see the sketch after this list).
  - Fix buffer management when reading an unpinned stream. Before moving on to a new block (and
    unpinning the current), we need to make sure all the tuples returned from the
    current block are returned up the operator tree.
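
A hedged sketch of the API shape for the first fix (the exact signature
may differ):

  // Atomically unpins 'unpin_block' and hands its buffer to the new
  // block returned in *new_block, all under the block mgr's lock, so no
  // other thread can claim the buffer in between.
  Status BufferedBlockMgr::GetNewBlock(
      Client* client, Block* unpin_block, Block** new_block);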

Change-Id: I95ee58d1019dd971f6a7dc19ecafdfa54cdbf942
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4333
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-20 16:05:11 -07:00