Commit Graph

44 Commits

Tim Armstrong
418c705787 IMPALA-6679,IMPALA-6678: reduce scan reservation
This has two related changes.

IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.

For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoids the need to
estimate the ideal reservation in the planner.
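
As a rough sketch of the idea (illustrative names only, not the actual
code), the ideal reservation can be derived directly from the columns
being scanned, capping each column at the largest "efficient" I/O size:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Sketch: sum the per-column needs, never reserving more for a
  // column than one 8MB "efficient" I/O buffer.
  int64_t ComputeIdealReservation(const std::vector<int64_t>& col_bytes) {
    const int64_t MAX_IO_BUFFER_SIZE = 8LL * 1024 * 1024;
    int64_t ideal = 0;
    for (int64_t bytes : col_bytes) {
      ideal += std::min(bytes, MAX_IO_BUFFER_SIZE);
    }
    return ideal;
  }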

We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.

IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces the reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that columns smaller than 8MB are common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
  can increase the reservation so we can do "efficient" 8MB I/Os for
  large columns.
* The ideal reservation is not available - query performance is affected
  because we can't overlap I/O and compute as much and may do smaller
  (probably 4MB) I/Os. However, we should avoid pathological behaviour
  like tiny I/Os.

When stats are not available, we just default to reserving 4MB per
column, which typically is more memory than required. When stats are
available, the memory required can be reduced below that when heuristics
tell us with high confidence that the column data for most or all files
is smaller than 4MB.
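
A hedged sketch of that reduction (names are illustrative; the real
heuristic lives in the planner):

  #include <algorithm>
  #include <cstdint>

  // The size estimate is biased towards over-estimating, so an estimate
  // below the 4MB default is high-confidence evidence that the column
  // data for most files is small.
  int64_t PerColumnReservation(int64_t est_col_bytes, bool have_stats) {
    const int64_t DEFAULT_COLUMN_RESERVATION = 4LL * 1024 * 1024;
    if (!have_stats) return DEFAULT_COLUMN_RESERVATION;
    return std::min(est_col_bytes, DEFAULT_COLUMN_RESERVATION);
  }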

The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).

Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.

Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Looped the test for a while to flush out
flakiness.

Added targeted planner and query tests for reservation calculations and
increases.

Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
fb5dc9eb48 IMPALA-4835: switch I/O buffers to buffer pool
This reapplies the following squashed patches, which were reverted.

I will fix the known issues with some follow-on patches.

======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation

In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.

* Remove the free buffer cache (which will be replaced by buffer pool's
  own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
  of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
  enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
  function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
  connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
  interfaces - previously callers of ReturnBuffer() were fudging
  the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
  'is_cancelled_'.
* Clarify the logic around calling Close() for the last
  BufferDescriptor.
  -> There appeared to be an implicit assumption that buffers would be
     freed in the order they were returned from the scan range, so that
     the "eos" buffer was returned last. Instead just count the number
     of outstanding buffers to detect the last one.
  -> Touching the is_cancelled_ field without holding a lock was hard to
     reason about - it violated locking rules and it was unclear that it
     was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
  inline at the callsites.

This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.

Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight

======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront

This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.

One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.

There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.

One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.

I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.

Testing:
* Ran core and exhaustive tests.

======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool

This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
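
A simplified sketch of that division (illustrative names; the real
algorithm also rounds to valid buffer sizes and enforces per-column
minimums and maximums):

  #include <cstdint>
  #include <vector>

  // Give each column a share of the scanner's reservation proportional
  // to its on-disk size, so big columns get bigger I/O buffers
  // (overflow handling omitted in this sketch).
  std::vector<int64_t> DivideReservationBetweenColumns(
      const std::vector<int64_t>& col_bytes, int64_t total_reservation) {
    int64_t total_bytes = 0;
    for (int64_t b : col_bytes) total_bytes += b;
    std::vector<int64_t> shares;
    for (int64_t b : col_bytes) {
      shares.push_back(
          total_bytes == 0 ? 0 : total_reservation * b / total_bytes);
    }
    return shares;
  }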

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-04-28 23:41:39 +00:00
Tim Armstrong
161cbe30ff Revert IMPALA-4835 and dependent changes
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"

This reverts commit 25bcb258df.

Revert "IMPALA-6588: don't add empty list of ranges in text scan"

This reverts commit d57fbec6f6.

Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"

This reverts commit 24b4ed0b29.

Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"

This reverts commit 5699b59d0c.

Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"

This reverts commit 65680dc421.

Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-03 04:22:12 +00:00
Tim Armstrong
24b4ed0b29 IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.

* The planner reserves enough memory to run a single scanner per
  scan node.
* The multi-threaded scan node must increase reservation before
  spinning up more threads.
* The scanner implementations must be careful to stay within their
  assigned reservation.

The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.

Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.

I adjusted how the 'mem_limit' is divided between buffer pool and
non-buffer-pool memory for low mem_limits to account for the increase in
buffer pool memory.

Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
  action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
  since the algorithm is non-trivial.

Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.

Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-23 04:17:41 +00:00
Bikramjeet Vig
61d941fad3 IMPALA-6526: Fix spilling test for running on local FS
One of the spilling tests was failing because its minimum buffer pool
memory requirement was higher when run on the local FS than on HDFS.
The fix is to increase the buffer pool limit to a value just above the
minimum so that it still forces spilling to disk on both filesystems.

Testing:
Ran core tests with local FS as target file system. Made sure the
failing test passed.

Change-Id: I50648d7936007a26891cf64d6343c47d9d646596
Reviewed-on: http://gerrit.cloudera.org:8080/9354
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-17 01:40:08 +00:00
Bikramjeet Vig
8fc1eccce4 IMPALA-5519: Allocate fragment's runtime filter memory from Buffer pool
This patch adds changes to the planner to account for memory used by
bloom filters at the fragment instance level. Also adds changes to
allocate memory for those bloom filters from the buffer pool.

Testing:
- Modified Planner Tests and end to end tests to account for memory
  reservation for the runtime filters.
- Modified backend tests and benchmarks to use the bufferpool for
  bloom filter allocation.
- Added an end-to-end test.
- Ran the rest of the core tests.

Change-Id: Iea2759665fb2e8bef9433014a8d42a7ebf99ce1f
Reviewed-on: http://gerrit.cloudera.org:8080/8971
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-13 08:29:03 +00:00
Tim Armstrong
51c7fcd5dc IMPALA-5827: add test for failure to repartition in hash join
Testing:
Ran the test locally.

Change-Id: If6d601f9d15bed4667b50576f07f6216c34ed9c4
Reviewed-on: http://gerrit.cloudera.org:8080/7811
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-25 01:17:46 +00:00
Bikramjeet Vig
b6c02972d6 IMPALA-5788: Fix agg node crash when grouping by nondeterministic exprs
Fixed a bug where Impala crashed during execution of an aggregation
query using nondeterministic grouping expressions. This happens when
it tries to rebuild a spilled partition that can fit in memory and rows
get re-hashed to a partition other than the spilled one due to the use
of nondeterministic expressions.

Testing:
Added a query test to verify successful execution.

Change-Id: Ibdb09239577b3f0a19d710b0d148e882b0b73e23
Reviewed-on: http://gerrit.cloudera.org:8080/7714
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-23 03:59:02 +00:00
Tim Armstrong
852e1bb728 IMPALA-3931: arbitrary fixed-size uda intermediate types
Make many builtin aggregate functions use fixed-length intermediate
types:
* avg()
* ndv()
* stddev(), variance(), etc
* distinctpc(), distinctpcsa()

sample(), appx_median(), histogram() and group_concat() actually
allocate var-len data, so they aren't changed.

This has some major benefits:
* Spill-to-disk works properly with these aggregations.
* Aggregations are more efficient because there is one less pointer
  indirection.
* Aggregations use less memory, because we don't need an extra 12-byte
  StringValue for the indirection.

Adds a special-purpose internal type FIXED_UDA_INTERMEDIATE. The type
is represented in the same way as CHAR - a fixed-size array of bytes,
stored inline in tuples. However, it is not user-visible and does
not support CHAR semantics, i.e. users can't declare tables, functions,
etc. with the type. The pointer and length are passed into aggregate
functions wrapped in a StringVal.
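
For illustration, a hedged sketch of what an update function looks like
under this scheme, assuming the Impala UDF types (FunctionContext,
DoubleVal, StringVal); the real builtins follow this pattern but may
differ in detail:

  // The avg() intermediate is a fixed 16-byte slot stored inline in the
  // tuple; 'dst' is a StringVal whose ptr/len describe that slot.
  struct AvgState {
    double sum;
    int64_t count;
  };

  void AvgUpdate(FunctionContext* ctx, const DoubleVal& src, StringVal* dst) {
    if (src.is_null) return;
    AvgState* avg = reinterpret_cast<AvgState*>(dst->ptr);
    avg->sum += src.val;
    ++avg->count;
  }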

Updates some internal codegen functions to work better with the new
type. E.g. store values directly into the result tuple instead of
via an intermediate stack allocation.

Testing:
This change only affects builtin aggregate functions, for which we
have test coverage already. If we were to allow wider use of this type,
it would need further testing.

Added an analyzer test to ensure we can't use the type for UDAs.

Added a regression test for spilling avg().

Added a regression test for UDA with CHAR intermediate hitting DCHECK.

Perf:
Ran TPC-H locally. TPC-H Q17, which has a high-cardinality AVG(),
improved dramatically.

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(60) | parquet / none / none | 18.44   | -17.54%    | 11.92      | -5.34%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| TPCH(60) | TPCH-Q12 | parquet / none / none | 18.40  | 17.64       | +4.32%     |   0.77%   |   1.09%        | 1           | 5     |
| TPCH(60) | TPCH-Q22 | parquet / none / none | 7.07   | 6.90        | +2.36%     |   0.28%   |   0.30%        | 1           | 5     |
| TPCH(60) | TPCH-Q3  | parquet / none / none | 12.37  | 12.11       | +2.10%     |   0.18%   |   0.15%        | 1           | 5     |
| TPCH(60) | TPCH-Q7  | parquet / none / none | 42.48  | 42.09       | +0.93%     |   2.45%   |   0.80%        | 1           | 5     |
| TPCH(60) | TPCH-Q6  | parquet / none / none | 3.18   | 3.15        | +0.89%     |   0.67%   |   0.76%        | 1           | 5     |
| TPCH(60) | TPCH-Q19 | parquet / none / none | 7.24   | 7.20        | +0.50%     |   0.95%   |   0.67%        | 1           | 5     |
| TPCH(60) | TPCH-Q10 | parquet / none / none | 13.37  | 13.30       | +0.50%     |   0.48%   |   1.39%        | 1           | 5     |
| TPCH(60) | TPCH-Q5  | parquet / none / none | 7.47   | 7.44        | +0.36%     |   0.58%   |   0.54%        | 1           | 5     |
| TPCH(60) | TPCH-Q11 | parquet / none / none | 2.03   | 2.02        | +0.06%     |   0.26%   |   1.95%        | 1           | 5     |
| TPCH(60) | TPCH-Q4  | parquet / none / none | 5.48   | 5.50        | -0.27%     |   0.62%   |   1.12%        | 1           | 5     |
| TPCH(60) | TPCH-Q13 | parquet / none / none | 22.11  | 22.18       | -0.31%     |   0.18%   |   0.55%        | 1           | 5     |
| TPCH(60) | TPCH-Q15 | parquet / none / none | 8.45   | 8.48        | -0.32%     |   0.40%   |   0.47%        | 1           | 5     |
| TPCH(60) | TPCH-Q9  | parquet / none / none | 33.39  | 33.66       | -0.81%     |   0.75%   |   0.59%        | 1           | 5     |
| TPCH(60) | TPCH-Q21 | parquet / none / none | 71.34  | 72.07       | -1.01%     |   1.84%   |   1.79%        | 1           | 5     |
| TPCH(60) | TPCH-Q14 | parquet / none / none | 5.93   | 6.00        | -1.07%     |   0.15%   |   0.69%        | 1           | 5     |
| TPCH(60) | TPCH-Q20 | parquet / none / none | 5.72   | 5.79        | -1.09%     |   0.59%   |   0.51%        | 1           | 5     |
| TPCH(60) | TPCH-Q18 | parquet / none / none | 45.42  | 45.93       | -1.10%     |   1.42%   |   0.50%        | 1           | 5     |
| TPCH(60) | TPCH-Q2  | parquet / none / none | 4.81   | 4.89        | -1.52%     |   1.68%   |   1.01%        | 1           | 5     |
| TPCH(60) | TPCH-Q16 | parquet / none / none | 5.41   | 5.52        | -1.98%     |   0.66%   |   0.73%        | 1           | 5     |
| TPCH(60) | TPCH-Q1  | parquet / none / none | 27.58  | 29.13       | -5.34%     |   0.24%   |   1.51%        | 1           | 5     |
| TPCH(60) | TPCH-Q8  | parquet / none / none | 12.61  | 14.30       | -11.78%    |   6.20%   | * 15.28% *     | 1           | 5     |
| TPCH(60) | TPCH-Q17 | parquet / none / none | 43.74  | 126.58      | I -65.44%  |   1.34%   |   9.60%        | 1           | 5     |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+

Change-Id: Ife90cf27989f98ffb5ef5c39f1e09ce92e8cb87c
Reviewed-on: http://gerrit.cloudera.org:8080/7526
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2017-08-17 03:12:48 +00:00
Tim Armstrong
c4f903033c IMPALA-3200: more buffer pool end-to-end tests
This adds most of the end-to-end tests described in the test plan.
See http://goo.gl/v3Strz.

* End-to-end test for disk spill encryption.
* Admission control test for the case when acquiring initial
  reservation fails.
* Initial reservation acquire failure test
* scratch_limit tests for Join, Agg, Sort, Analytic
* Memory usage scaling tests for Join, Agg, Sort, Analytic

Also splits out the slow sort queries in test_spilling and moves them
to exhaustive so the individual tests run faster and have better
parallelism.

Testing:
Ran all the core tests. Will do a full exhaustive run before
committing.

Change-Id: I554aa5ddfef4f8e75295596e720a14eee1afa17f
Reviewed-on: http://gerrit.cloudera.org:8080/7552
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-07 00:57:46 +00:00
Tim Armstrong
a98b90bd38 IMPALA-4674: Part 2: port backend exec to BufferPool
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
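
A hedged sketch of the claim path (approximate names):

  // During Open(), each node claims its planner-computed minimum
  // reservation from the query-wide InitialReservations object, which
  // was funded centrally when the query started.
  Status ExecNode::ClaimBufferReservation(RuntimeState* state) {
    state->initial_reservations()->Claim(
        &buffer_pool_client_, resource_profile_.min_reservation);
    return Status::OK();
  }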

min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.

Port ExecNodes to use BufferPool:
  * Each ExecNode has to claim its reservation during Open()
  * Port Sorter to use BufferPool.
  * Switch from BufferedTupleStream to BufferedTupleStreamV2
  * Port HashTable to use BufferPool via a Suballocator.

This also makes PAGG memory consumption more efficient (avoids wasting
buffers) and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
  cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
  buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)

TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.

Testing:
* Updated tests to reflect new memory requirements

Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-05 01:03:02 +00:00
Tim Armstrong
7843b472f2 IMPALA-5560: always store CHAR(N) inline in tuple
This is done to simplify the CHAR(N) logic. I believe this is overall an
improvement - any benefits of the out-of-line storage that motivated
this optimisation originally were outweighed by the added complexity.

This also avoids IMPALA-5559 (fe/be have different notions of var-len),
which will unblock IMPALA-3200.

Pros:
* Reduce the number of code paths and improve test coverage.
  (e.g. avoids IMPALA-5559: fe/be have different notions of var-len)
* Reduced memory to store non-NULL data (saves 12-byte StringValue)
* Fewer branches in code -> save CPU cycles.
* If CHAR(N) performance is important, reduced complexity makes it
  easier to implement codegen.

Cons:
* Requires N bytes to store a NULL value.
* May hurt cache locality (although this is speculative in my mind).

The change is mostly mechanical - I removed MAX_CHAR_INLINE_LENGTH
and then removed branches that depended on that.
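
To illustrate the two layouts (the 12-byte figure is from the pros list
above; the types here are schematic, not the actual code):

  // Old: a long CHAR(N) lived out of line, behind a 12-byte StringValue
  // slot in the tuple.
  struct StringValue {
    char* ptr;  // 8 bytes, points at the character data
    int len;    // 4 bytes
  };

  // New: a CHAR(10) slot is simply 10 bytes inside the tuple itself,
  // NULL or not - hence the first con above.
  char char_slot[10];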

Testing:
Ran exhaustive build.

Change-Id: I9c0b823ccff6b0c37f5267c548d096c29b8caac3
Reviewed-on: http://gerrit.cloudera.org:8080/7303
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-30 22:49:40 +00:00
Tim Armstrong
fae36fc77d IMPALA-5497: spilling hash joins that output build rows hit OOM
The bug is that the join tried to bring the next spilled partition into
memory while still holding onto memory from the current partition.
The fix is to return earlier if the output batch is at capacity so
that resources are flushed.

Also reduce some of the redundancy in the loop that drives the spilling
logic and catch some dropped statuses.
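
A hedged, simplified sketch of the fixed control flow (approximate
names):

  // Return as soon as the output batch fills, so the caller consumes it
  // and releases the attached resources *before* we pin the buffers of
  // the next spilled partition.
  if (out_batch->AtCapacity()) return Status::OK();
  RETURN_IF_ERROR(PrepareNextPartition(state));  // only after flushing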

Testing:
The failure was originally reproduced by my IMPALA-4703 patch. I was
able to cause a query failure with the current code by reducing the
memory limit for an existing query. Before the fix, it failed with memory
limits up to 12MB; now it succeeds with 8MB or less.

Ran exhaustive build.

Change-Id: I075388d348499c5692d044ac1bc38dd8dd0b10c7
Reviewed-on: http://gerrit.cloudera.org:8080/7180
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-21 20:56:00 +00:00
Tim Armstrong
96316e3b34 IMPALA-5173: crash with hash join feeding directly into nlj
The background for this bug is that we can't transfer ownership
of BufferedBlockMgr::Blocks that are attached to RowBatches.

The NestedLoopJoinNode accumulates row batches on its right
side and tries to take ownership of the memory, which doesn't
work as expected in this case.

The fix is to copy the data when we encounter one of these
(likely very rare) cases.

Testing:
Added a regression test that produces a crash before the fix and
succeeds after the fix.

Change-Id: I0c04952e591d17e5ff7e994884be4c4c899ae192
Reviewed-on: http://gerrit.cloudera.org:8080/6568
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-04-18 20:11:23 +00:00
Thomas Tauber-Marshall
6a9df54096 IMPALA-3524: Don't process spilled partitions with 0 probe rows
In the partitioned hash join node, if a spilled partition has no probe
rows, building the hash table is unnecessary.

For some build types (right outer, right anti, and full outer), we still
need to process the build side to output unmatched rows (in this case, all
rows since there were no probe rows to match).

Testing: Added some cases to spilling.test. Manually tested these cases
for performance, and they all show around a 6% improvement.

Change-Id: I175b32dd9031e51218b38c37693ac3e31dfab47b
Reviewed-on: http://gerrit.cloudera.org:8080/5389
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-02-06 20:22:33 +00:00
Thomas Tauber-Marshall
d72353d0c9 IMPALA-2932: Extend DistributedPlanner to account for hash table build cost
When deciding between a broadcast or repartition join, Impala calculates
the cost of each join as the total amount of data that is sent over the
network. This ignores some relevant costs, and can lead to bad plans.

One such relevant cost is the work to create the hash table used in the
join. This patch accounts for this by adding the amount of data inserted
into the hash table (the size of the right side of the join) to the
previous cost.

This generally increases the estimated cost of broadcast joins relative
to repartitioning joins, as the broadcast join must build the hash table
on each node the data was broadcast to, so its effect will be to make
repartitioning joins more likely to be chosen, especially in large
clusters.

This patch has not yet been performance tested.

Change-Id: I03a0f56f69c8deae68d48dfdb9dc95b71aec11f1
Reviewed-on: http://gerrit.cloudera.org:8080/4098
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2016-08-29 16:44:22 +00:00
Tim Armstrong
ee53ddb389 IMPALA-1346/1590/2344: fix sorter buffer mgmt when spilling
The Sorter's memory management logic failed to correctly manage buffers
when spilling. It would try to make use of all buffers in the system,
neglecting to account for other operators' buffer usage.

This patch adjusts the logic so that it handles contention for buffers
so long as it can get enough buffers to make progress. Instead of
precalculating the number of buffers it thinks it should be able to
pin, it just makes a best-effort attempt to pin the initial buffers
of as many runs as possible, up to a limit. As long as it can pin three
runs, it can make progress.

Testing:
Added an additional test that failed with OOM before the patch.
An analytic function test that was meant to fail also started succeeding
so I had to adjust the limit there too.

Change-Id: Idfe55cc13c7f2b54cba1d05ade44cbcf6bb573c0
Reviewed-on: http://gerrit.cloudera.org:8080/2908
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-06 17:34:07 -07:00
Tim Armstrong
37ec25396f IMPALA-3344: Simplify sorter and document/enforce invariants.
Clarify relationships between classes, clean up the previous mess
where every class was friends with the others, so there's an actual
distinction between public and private members. TupleIterator
is now no longer tied to TupleSorter, just Run.

Document and enforce invariants in many cases.

Factor out some functions from large functions.

Simplify and document iterator logic.

Make management of buffers when iterating over output stream more
explicitly correct: either use MarkNeedToReturn() or attach block
to the batch as appropriate. The SortedRunMerger didn't handle
resource transfer correctly, except if all the memory came from
the batch's MemPool. This patch fixes the cases when resources
are attached to the batches, but not the 'need_to_return' case.
Document that SortedRunMerger requires 'deep_copy_input' to be true
if batches can have the 'need_to_return' flag set.

Also use the atomic block exchange operation when moving between
blocks in unpinned runs to prevent pin failures at that point.
I explicitly have avoided changing the hairy block management logic
when allocating buffers for merging, that will need addressing in
a follow-up patch.

Add a SpilledRuns counter so that it's more explicit that spilling
occurred.

Testing:
Added some tests for corner cases with empty and NULL strings.
Fixed a test that previously failed with OOM but now succeeds.

Performance:
Benchmarking against old code initially revealed some regressions from
changes in inlining. Force inlining the TupleComparator::operator() and
iterator Next()/Prev() functions helped and performance seems similar or
slightly better on the targeted orderby benchmarks.

Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a
Reviewed-on: http://gerrit.cloudera.org:8080/2826
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-06-02 21:33:08 -07:00
Michael Ho
f7501d2ec1 IMPALA-3332: Free local allocations in sorter.
Sorter can have runaway memory consumption as it never frees
local allocations made in comparator_.Less(). In addition, it
doesn't check for errors generated during expression evaluation
so it may keep sorting even after failures have occurred.

This change fixes the problem by freeing local allocations for
every n invocations of comparator_.Less() where n is the row
batch size specified in the query options. Various error checks
are also added to return early if any error is encountered.
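
A hedged sketch of the pattern (approximate names):

  // Free expr-local allocations every batch-size comparisons so memory
  // allocated inside comparator_.Less() cannot grow without bound.
  bool TupleSorter::Less(TupleRow* lhs, TupleRow* rhs) {
    if (++num_comparisons_ >= state_->batch_size()) {
      // Hypothetical cleanup hook: frees the comparator's expr-local
      // allocations; the real change also checks for errors here.
      FreeLocalAllocations();
      num_comparisons_ = 0;
    }
    return comparator_.Less(lhs, rhs);
  }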

Change-Id: I941729b4836e5dbb827d4313a0b45bc5df2fa8e1
Reviewed-on: http://gerrit.cloudera.org:8080/3116
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:18 -07:00
Tim Armstrong
34c95c9590 IMPALA-2345,2991: test coverage for spilling and sorts
Add missing coverage for sorting by CHAR and VARCHAR.

Add more coverage for spilling sorts.

Fix spilling tests: ensure that they actually reliably spill (many of
them had memory limits high enough that they could run entirely in
memory).

I ran this in a loop for a while to flush out flaky tests. The tests
should be fairly predictable given that they're not run concurrently
with other tests and we allocate enough block manager memory so that
each operator can obtain its reservation.

Change-Id: Ia2d2627a2c327dcdf269ea3216385b1af9dfa305
Reviewed-on: http://gerrit.cloudera.org:8080/2877
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:55 -07:00
Tim Armstrong
212bea529f IMPALA-2994: Temporary workaround for flaky spilling test
The test was recently reenabled in commit
71a0a7d998702781ae44270f8c742b10c34c0efc.

Continue running the test but loosen the memory limit and don't check
the runtime profile. The memory limits for this set of tests needs
revisiting in any case.

Change-Id: I195e8ad3b67c8ff85d5d15c2646a13f5feb57553
Reviewed-on: http://gerrit.cloudera.org:8080/2183
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit 51632f39a45ba9deac9b86bbdb14ff10cbee35ac)
2016-02-17 20:21:57 -08:00
Tim Armstrong
1c102d9d8e Reenable tests that were disabled for IMPALA-1305
A couple of tests were disabled because of IMPALA-1305. Now that the fix
is in, those tests can be reenabled. I ran them in a loop to make sure
that they weren't flaky.

Also fix the spelling mistake in the file name.

Change-Id: I1bfcc619911a92d93b871be3a14852aa11f78da9
Reviewed-on: http://gerrit.cloudera.org:8080/2150
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-13 10:08:13 +00:00
Michael Ho
968c61c940 IMPALA-2824: Restore query options after each test.
A failed test case inside a test file will leave the rest of
the test cases in the file unexecuted. Some test cases may
modify some query options such as memory limit and then
restore them in the subsequent test cases in the same file.
The failure of those test cases will leave the query options
modified, causing cascading failures to other test cases
which aren't expected to be run with the modified query
options (e.g. lowered memory limit). This problem may lead
to broken builds which are recorded in IMPALA-2724 and
IMPALA-2824.

This change fixes the problem above by checking if a test
case modifies any query option and if so, restore those
modified query options to their default values. This change
makes the assumption that a test should not modify an option
specified in its test vector so it's safe to restore the
modified query options to their default values.

Change-Id: Ib88d1dcb6a65183e1afc8eef0c764179a9f6a8ce
Reviewed-on: http://gerrit.cloudera.org:8080/1774
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 03:13:05 +00:00
Michael Ho
ba0bd1d0da IMPALA-2612: Free local allocations once for every row batch when building hash tables.
When building hash tables for the build side in partitioned
hash join or aggregation, we will evaluate the build or probe
side expressions to compute the hash values for each TupleRow.
Evaluation of certain expressions (e.g. CastToChar) requires
"local" memory allocation. "Local" memory allocation is supposed
to be freed after processing each row batch.

However, the calls to free local allocations are missing in
PartitionedHashJoinNode::BuildHashTableInternal() and
PartitionedAggregationNode::ProcessStream(). This causes all
"local" memory allocation to accumulate potentially for the
entire duration of the query or until GetNext() is called.
This may lead to unnecessary memory allocation failure as
memory limit is exceeded.

This patch calls ExecNode::FreeLocalAllocations() at least once
per row-batch when building hash tables. It also adds the missing
checks for the query status in the loop building hash tables.
Please note that QueryMaintenance() isn't called due to its
overhead in memory limit checks.

Change-Id: Idbeab043a45b0aaf6b6a8c560882bd1474a1216d
Reviewed-on: http://gerrit.cloudera.org:8080/1448
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2015-11-26 03:21:46 +00:00
Dan Hecht
84c4c2ce86 IMPALA-2480, IMPALA-2519: Don't force IO-buffer on probe side when spilling PHJ
This fixes a regression introduced with:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce
min mem needed for PAGG/PHJ

Prior to that change, as soon as any partition's stream overflowed
its small buffers, all partitions' streams would be switched
immediately to IO-buffers, which would be satisfied by the initial
buffer "reservation".

After that change, individual streams are switched to IO-buffers on
demand as they overflow their small buffers.  However, that change
also made it so that Partition::Spill() would eagerly switch that
partition's streams to IO-buffers, and fail the query if the buffer
is not available.  The buffer may not be available because the
reserved buffers may be in use by other partition's streams.

We don't need to fail the query if the switch to IO-buffers in
Partition::Spill() fails.  Instead, we should just let the streams
switch on demand as they fill up the small buffers.  When that
happens, if the IO buffer is not available, then we already have a
mechanism to pick partitions to spill until we can get the IO-buffer
(in the worst case it means working our way back down to the initial
reservation).  See AppendRowStreamFull() and BuildHashTables().

The symptom of this regression was that some queries would fail at a
lower memory limit than before.

Also revert the max_block_mgr_memory values back to their originals.

Additional testing: loop custom_cluster/spilling.py.  We should also
remeasure minimum memory required by queries after this change.

Change-Id: I11add15540606d42cd64f2af99f4e96140ae8bb5
Reviewed-on: http://gerrit.cloudera.org:8080/1228
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-10-12 14:41:08 -07:00
Ippokratis Pandis
49b588a714 IMPALA-2265: Sorter was not checking the returned Status of PrepareRead
The sorter was dropping the returned Status of the PrepareRead()
calls on the floor. PrepareRead() tries to Pin() blocks. In some
queries with large sorts, those Pin() calls could fail with OOM,
but because the sorter was ignoring the returned Status it would
happily put the unpinned block in the vector of blocks and eventually
seg fault, because the buffer_desc_ of that block was NULL.

This patch fixes this problem and adds a test that eventually we may
want to move to the exhaustive build because it takes quite some time.
It also changes the comments of the sorter class to the doxygen style.
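
The essence of the fix, sketched with approximate names:

  // Before: the Status was dropped, so an OOM'd Pin() left an unpinned
  // block (buffer_desc_ == NULL) in the vector of blocks.
  //   run->PrepareRead();
  // After: propagate the error instead of crashing later.
  RETURN_IF_ERROR(run->PrepareRead());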

Change-Id: Icad48bcfbb97a68f2d51b015a37a7345ebb5e479
Reviewed-on: http://gerrit.cloudera.org:8080/1156
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:42:03 -07:00
Dan Hecht
df5271e7c1 Fix flaky test_spilling.py test case
The commit:
IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ

recently lowered a couple of limits from 100m to 40m, which appears to
be too aggressive based on occasional GVM failures. Let's bump
it back up.

Change-Id: I2c3cc24841cf3a305785890329d77e4e9e74f6e5
Reviewed-on: http://gerrit.cloudera.org:8080/1125
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-05 11:30:34 -07:00
Ippokratis Pandis
48699de6e3 IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ
PAGG and PHJ were using an all-or-nothing approach wrt spilling. In
particular, they were trying to switch to IO-sized buffers for both
streams (aggregated and unaggregated in PAGG; build and probe in PHJ)
of every partition (currently 16 partitions for a total of 32
streams), even if some of the streams had very few rows, they were
empty or simply they would not spill so there was no need to allocate
IO-buffers for them. That was increasing the min mem needed by those
operators in many queries.

This patch decouples the decision to switch to IO-buffers for each
stream of each partition. Streams will switch to IO-sized buffers
whenever the rows they contain do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both of its streams to IO buffers.

With this change, many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the min mem requirement. For example,
below is the min mem needed (in MBs) for some of the TPC-H queries.
Some need half or less of the memory they needed before:

  TPC-H Q3: 645 -> 240
  TPC-H Q5: 375 -> 245
  TPC-H Q7: 685 -> 265
  TPC-H Q8: 740 -> 250
  TPC-H Q9: 650 -> 400
  TPC-H Q18: 1100 -> 425
  TPC-H Q20: 420 -> 250
  TPC-H Q21: 975 -> 620

To make this small-buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call of
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. In that case,
previously we would treat it as an indication that there is no memory
left, start spilling a partition and switching all streams to
IO-buffers. Now we make a best effort, trying first to call
SwitchToIoBuffers() and, if that is successful, re-attempting the
AllocateRow() call. See IMPALA-2352 for more details.
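
Sketched with approximate names and signatures, the retry is:

  Tuple* tuple = stream->AllocateRow(tuple_size);
  if (tuple == NULL && stream->using_small_buffers()) {
    // Running out of small buffers is not the same as running out of
    // memory: try upgrading the stream to IO-sized buffers, then retry.
    bool got_buffer;
    RETURN_IF_ERROR(stream->SwitchToIoBuffers(&got_buffer));
    if (got_buffer) tuple = stream->AllocateRow(tuple_size);
  }
  // Only a failure after the switch attempt means we must spill.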

Another change is that now SwitchToIoBuffers() will reset the flag
using_small_buffers_ back to false, in case we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.

With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271 that are essentially stream::using_small_buffers_-related
DCHECKs.

This patch adds all 22 TPC-H queries in test_mem_usage_scaling test
and updates the per-query min mem limits in it. Additionally, it adds
a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some
dead test code.

Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:

	tests/query_test/test_aggregation.py
2015-09-23 11:07:42 -07:00
aacalfa
57dd4d1502 IMPALA-1309: Add support for distinct in group_concat function.
Change-Id: I2790f1d2a7bfd0ecc7ef66cc5d91dafe3414e111
Reviewed-on: http://gerrit.cloudera.org:8080/892
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-09-23 09:42:17 +00:00
Ippokratis Pandis
d58aedff42 IMPALA-1820: Start with small pages for hash tables during repartitioning
The change of the PARTITION_FANOUT from 32 to 16 exposed a pathological case due to
the lack of coordination across concurrently executing spilling nodes of the same query.
In particular, when we repartition a partition we try to initialize hash tables for the
new partitions. But each hash table needs a block (for its nodes). If there were no
IO-sized blocks available, because they had been consumed by other nodes, we would get
into a loop trying to repartition those smaller partitions that couldn't initialize their
hash tables. Each additional repartition would, among other things, need additional
blocks for the new streams. These partitions would end up being very small; still, we
would fail the query upon reaching the MAX_PARTITION_DEPTH limit, which was fixed at 4.

This patch fixes the problem by initializing the hash tables during repartitions with
small pages. That is, the hash tables always first use a 64KB and a 512KB block for their
nodes before switching to IO-sized blocks. This helps the partitioning algorithm to
finish when we end up with partitions that can fit in those small pages. The performance
may not be optimal, but the memory consumption is lower and the algorithm finishes. For
example, without this patch and with PARTITION_FANOUT == 16, running TPC-H Q18 and Q20
needed 3.4GB and 3.1GB respectively. With this patch TPC-H Q18 needs ~1GB and Q20
975MB.

This patch also removes the restriction of stopping repartitioning upon reaching
4 levels of repartitioning. Instead, whenever we repartition we compare the size of
the input partition to the size of the largest new partition. If there is no reduction
in size we stop the algorithm. Otherwise, we keep on repartitioning. That should
help in cases of skew (e.g. due to bad hashing). There is a new MAX_PARTITION_DEPTH limit
of 16. It is very unlikely we will ever hit this limit.
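
A hedged sketch of the new termination condition (approximate names):

  // Stop only when repartitioning fails to shrink the data; otherwise
  // keep going (tolerating skew) up to the new depth limit of 16.
  int64_t largest_child = LargestNewPartition()->build_rows()->num_rows();
  if (largest_child >= input_partition_->build_rows()->num_rows()) {
    return Status("Repartitioning did not reduce the partition size.");
  }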

Change-Id: Ib33fece10585448bc2d07bb39d0535d78b168ccc
Reviewed-on: http://gerrit.cloudera.org:8080/119
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-02-28 00:42:04 +00:00
Alex Behm
f696861c5c Throw error on unrecognized test sections.
Our .test file parser did not abort tests when there
was a malformed test/section. This patch changes that behavior
to report an error and treat the test as failed.

Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.

Arguably, the test file parser should be more flexible in which places
to accept comments, but this patch does not address that problem.

Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-12-02 18:08:09 -08:00
Ippokratis Pandis
87502f829c IMPALA-1471: Bug in spilling of PHJ that was affecting left anti and outer joins.
In cases where we had to spill the probe side of PHJs, we were not only appending
the probe row to the tuple stream to be spilled, but we were also getting into the
regular processing loop with the iterator set to End(). In the case of left anti
and left outer joins, the result was to incorrectly output this row, since it did not
have a match.

This bug had a small perf impact for all spilling joins because we were doing an
unnecessary loop for each probe row we had to spill.

This patch solves the problem by immediately going to the next probe row if the
current row is spilled. Additionally, it fixes a bug in the block mgr where there
was a code path where we did not correctly count the number of pinned buffers.
It also adds tpch-q21 in the set of queries to run in the spilling test.

Change-Id: I762f5c41fe468e4485a4b31dabe2e53f6b49ae24
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5313
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5334
2014-11-20 02:21:14 -08:00
ishaan
23964c19af [CDH5] Fix bad merge in spilling.test
Change-Id: Ia6e30cf5916c737088d8cb969e0167b9d69a599e
2014-10-08 23:19:02 -07:00
Nong Li
de31fa8e21 Disable spilling tests that are too flaky.
Change-Id: I4ac877c3fa8297d873c67f219bb0c75f0001562d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4731
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-10-06 15:18:56 -07:00
Nong Li
e08ffde009 PA/PHJ: Increase fanout to 32 and fix interaction with small buffers.
Small buffers introduced an issue that is exacerbated by the large fanout. A stream can
only be appended to indefinitely once it has grabbed its initial IO-sized buffer. With
small buffers, we don't grab that buffer at the beginning anymore; before this patch, it
was grabbed when the stream first needed it. This means that when one stream needs it,
another stream could already have grabbed it (leaving that stream pinned with multiple
buffers).

This patch has all the streams grab an IO buffer as soon as the first stream needs an
IO buffer. This guarantees that all streams get one before any gets two.

Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
2014-10-06 15:16:16 -07:00
Nong Li
3e632ef6ad Reduce min PA/PHJ mem requirement.
Update PA/PHJ to use small (< IO-sized) buffers initially. Without this we would
not be able to run at the QPS that we need just due to the buffering requirements
of these operators.

Change-Id: Ic8a777d147893567c9590fbab17f561eadb6ee19
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4623
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-10-06 15:14:10 -07:00
ishaan
010cc22a2f [CDH5] Fix test spilling.
The TPC-H dataset in CDH5 does not have double columns. Also, remove round() calls
to test that we get consistent results.

Change-Id: Ia45ef08644ed78b05a08c47422733ab38a26b508
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4595
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-09-26 22:57:02 -07:00
Nong Li
d5c948c351 Increase the mem limit for one of the spilling queries.
Change-Id: I9b52582b2ded82821ecc446762f07d7702dedabf
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4555
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-09-26 12:27:29 -07:00
Nong Li
f03b05ed50 Fix hash table buckets to allocate memory from the BlockMgr.
This was always a TODO. We want memory to come from the block mgr and trigger spilling.

Change-Id: I07f1f79fbbb33068fb2df64510a80a9b008ef73d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4466
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-26 12:26:09 -07:00
Matthew Jacobs
da5198e615 Add spilling test for an analytic fn
Change-Id: Ia93c71c9c2a01f7f04a81593d51f5ca565286b7d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4447
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-09-23 07:26:09 -07:00
Nong Li
8a661d0787 [CDH5] cherry pick conflicts.
Change-Id: Ic11237b7ead4a810b523d6b6095781efbc5bb66b
2014-09-20 19:41:42 -07:00
Nong Li
6b73eec02d PHJ: Fix block management when spilling.
The previous code did not handle well the case where spilling happens while
building the hash table (i.e. partitioning the build rows fit). This caused the
probe partition to be starved, causing queries that should be able to run to fail
with a "not enough buffers" error.

Change-Id: I3a9a84e8800a72ed3ce6f5ab7ff03bc2d6eb7ad8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4403
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-09-20 16:12:21 -07:00
Skye Wanderman-Milne
2a449651da Use CRC hash for 0th partition level.
Change-Id: Ie845e0edb684f13421eea41327b1571b368db21a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4370
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-09-20 16:11:40 -07:00
ishaan
c4b4e010ff Buffered Tuple Stream fixes.
This patch fixes two issues:
  - Add API to buffered block mgr to allow an atomic Unpin and GetNewBlock. This has
    the semantics of unpinning a block and giving the buffer to the new block. This
    is necessary for the tuple stream to make sure another thread does not grab the
    unpinned block in between (see the sketch after this list).
  - Fix buffer management when reading an unpinned stream. Before moving on to a new block (and
    unpinning the current), we need to make sure all the tuples returned from the
    current block are returned up the operator tree.
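
A hedged sketch of the API shape for the first fix (the exact signature
may differ):

  // Atomically unpins 'unpin_block' and hands its buffer to the new
  // block returned in *new_block, all under the block mgr's lock, so no
  // other thread can claim the buffer in between.
  Status BufferedBlockMgr::GetNewBlock(
      Client* client, Block* unpin_block, Block** new_block);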

Change-Id: I95ee58d1019dd971f6a7dc19ecafdfa54cdbf942
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4333
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-20 16:05:11 -07:00