Commit Graph

89 Commits

Author SHA1 Message Date
Sailesh Mukil
cbc8c63ef0 IMPALA-7163: Implement a state machine for the QueryState class
This patch adds a state machine for the QueryState class. The motivation
behind this patch is to make the query lifecycle from the point of
view of an executor much easier to reason about, and this patch is key
for a follow-on patch for IMPALA-2990, where the status reporting will
be per-query rather than per-fragment-instance. Currently, the state
machine serves no other purpose; it will mostly be used for
IMPALA-2990.

We introduce 5 possible states for the QueryState which include 3
terminal states (FINISHED, CANCELLED and ERROR) and 2 non-terminal
states (PREPARING, EXECUTING). The transition from one state to the
next is always handled by a single thread which is also the QueryState
thread. This thread will additionally bear the purpose of sending
periodic updates after IMPALA-4063, which is the primary reason behind
having only this thread modify the state of the query.

Counting barriers are introduced to keep a count of how many fragment
instances have finished Preparing and Executing. These barriers also
block until all the fragment instances have finished the respective state.
The fragment instances update the query-wide query status if an error is
hit and unblock the barrier if the query is in the EXECUTING state. The
PREPARING state blocks regardless of whether a fragment instance hit an
error or not, until all the fragment instances have completed
successfully or unsuccessfully, to maintain the invariant that fragment
instances cannot be cancelled until the entire QueryState has finished
PREPARING.
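
The state set and barrier mechanism described above can be pictured with
a small sketch. This is illustrative only and not the actual Impala code:
the enum values follow the names in this message, and CountingBarrier is
a hypothetical stand-in for the real synchronization class.

    // Illustrative sketch only -- not the actual QueryState implementation.
    #include <condition_variable>
    #include <mutex>

    enum class BackendExecState { PREPARING, EXECUTING, FINISHED, CANCELLED, ERROR };

    // Hypothetical counting barrier: Notify() is called once per fragment
    // instance; Wait() blocks until all instances have reported in.
    class CountingBarrier {
     public:
      explicit CountingBarrier(int count) : count_(count) {}
      void Notify() {
        std::lock_guard<std::mutex> l(lock_);
        if (--count_ == 0) done_.notify_all();
      }
      void Wait() {
        std::unique_lock<std::mutex> l(lock_);
        done_.wait(l, [this]() { return count_ == 0; });
      }
     private:
      std::mutex lock_;
      std::condition_variable done_;
      int count_;
    };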

The status reporting protocol has not been changed and remains exactly
as it was.

Testing:
- Added 3 failure points in the query lifecycle using debug actions
  and added tests to validate the same (extension of IMPALA-7376).
- Ran 'core' and 'exhaustive' tests.

Future related work:
1) IMPALA-2990: Make status reporting per-query.
2) Try to logically align the FIS states with the QueryState states.
3) Consider mirroring the QueryState state machine to
CoordinatorBackendState

Change-Id: Iec5670a7db83ecae4656d7bb2ea372d3767ba7fe
Reviewed-on: http://gerrit.cloudera.org:8080/10813
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-08 00:16:18 +00:00
Tim Armstrong
68736bc378 IMPALA-7297: soft memory limit for reservation increases
https://goo.gl/N9LgQt summarises the memory problems I'm trying to
solve here.

Reject reservation increases that would leave < 10% of the memory
available under a query or process mem_limit, which would put the query
or impalad into a state with a heightened chance of OOM. This
avoids OOM in some scenarios where reserved memory is increased
too aggressively and starves non-reserved memory.

One way of thinking about this is that it extends the 80%
limit heuristic for reserved memory so that there's a soft
limit of min(80%, 90% - non-reserved memory consumption).
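
As a rough illustration of the heuristic above, the check might look like
the sketch below. The 0.9 factor and the function name are assumptions for
illustration; this is not the actual MemTracker/ReservationTracker code.

    #include <cstdint>

    // Sketch: reject a reservation increase that would leave less than 10%
    // of the query or process mem_limit available (i.e. a 90% soft limit).
    bool IncreaseExceedsSoftLimit(int64_t current_consumption, int64_t increase,
                                  int64_t mem_limit) {
      const int64_t soft_limit = static_cast<int64_t>(mem_limit * 0.9);
      return current_consumption + increase > soft_limit;
    }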

A follow-on will use this for scanner threads to prevent
starting scanner threads that may exceed the soft limit.

Testing:
Added unit tests for MemTracker and for ReservationTracker.

Change-Id: I39dcf2dd870a59c8b8cda4488fe41ce936d715ac
Reviewed-on: http://gerrit.cloudera.org:8080/10988
Reviewed-by: Todd Lipcon <todd@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-02 01:46:19 +00:00
Bikramjeet Vig
2de9db8fc6 IMPALA-5216: Make admission control queuing async
Implement asynchronous admission control queuing. This is achieved by
running the admission control code-path in a separate thread. Major
changes include: propagating cancellation to the admission control
thread and dequeuing thread, and adding a new Query Operation State
called "PENDING" that represents the state between completion of
planning and starting of query execution.
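
A minimal sketch of the idea of moving admission onto its own thread is
shown below; the names (QueryOpState, AdmitAsync) are hypothetical and
only mirror the description above, not the actual scheduler code.

    #include <functional>
    #include <thread>

    // Hypothetical query operation states, including the new PENDING state
    // covering the window between planning and the start of execution.
    enum class QueryOpState { PLANNING, PENDING, RUNNING, FINISHED, CANCELLED };

    // Sketch: run the admission decision asynchronously so the submitting
    // thread is not blocked while the query waits in the admission queue.
    void AdmitAsync(std::function<void()> admit_and_start) {
      std::thread(std::move(admit_and_start)).detach();
    }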

Testing:
- Added a deterministic end to end test and a session expiry test.
- Ran multiple stress tests successfully with a cancellation probability
of 60% and with different values for the following parameters:
max_requests, queue_wait_timeout_ms. Ensured that the impalad was in a
valid state afterwards (no orphan fragments or wrong metrics).
- Ran all exhaustive tests and ASAN core tests successfully.
- Ran data load successfully.

Change-Id: I989cf5b259afb8f5bc5c35590c94961c81ce88bf
Reviewed-on: http://gerrit.cloudera.org:8080/10060
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-13 15:48:17 +00:00
Tim Armstrong
e12ee485cf IMPALA-6957: calc thread resource requirement in planner
This only factors in fragment execution threads. E.g. this does *not*
try to account for the number of threads on the old Thrift RPC
code path if that is enabled.

This is loosely related to the old VCores estimate, but is different in
that it:
* Directly ties into the notion of required threads in
  ThreadResourceMgr.
* Is a strict upper bound on the number of such threads, rather than
  an estimate.

Does not include "optional" threads. ThreadResourceMgr in the backend
bounds the number of "optional" threads per impalad, so the number of
execution threads on a backend is limited by

  sum(required threads per query) +
      CpuInfo::num_cores() * FLAGS_num_threads_per_core

DCHECKs in the backend enforce that the calculation is correct. They
were actually hit in KuduScanNode because of some races in thread
management leading to multiple "required" threads running. Now the
first thread in the multithreaded scans never exits, which means
that it's always safe for any of the other threads to exit early,
which simplifies the logic a lot.

Testing:
Updated planner tests.

Ran core tests.

Change-Id: I982837ef883457fa4d2adc3bdbdc727353469140
Reviewed-on: http://gerrit.cloudera.org:8080/10256
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-12 01:43:37 +00:00
Tianyi Wang
8e86678d65 IMPALA-5690: Part 1: Rename ostream operators for thrift types
Thrift 0.9.3 implements "ostream& operator<<(ostream&, T)" for Thrift
data types, while Impala did the same for enums and special types
including TNetworkAddress and TUniqueId. To prepare for the upgrade to
Thrift 0.9.3, this patch renames these Impala-defined functions. In the
absence of operator<<, assertion macros like DCHECK_EQ can no longer be
used on non-enum Thrift-defined types.
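
For illustration, the kind of rename described above might look like the
sketch below; the PrintId name and the struct fields are assumptions for
the sketch, not the actual Impala or Thrift definitions.

    #include <cstdint>
    #include <sstream>
    #include <string>

    // Hypothetical stand-in for the generated Thrift TUniqueId type.
    struct TUniqueId {
      int64_t hi = 0;
      int64_t lo = 0;
    };

    // Instead of defining operator<<(ostream&, const TUniqueId&), which would
    // clash with Thrift 0.9.3's own operators, expose a named print function.
    std::string PrintId(const TUniqueId& id) {
      std::ostringstream ss;
      ss << std::hex << id.hi << ":" << id.lo;
      return ss.str();
    }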

Change-Id: I9c303997411237e988ef960157f781776f6fcb60
Reviewed-on: http://gerrit.cloudera.org:8080/9168
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-20 10:28:12 +00:00
Sailesh Mukil
4d6b07f0e2 IMPALA-6792: Fail status reporting if coordinator refuses connections
The ReportExecStatusAux() function is run on a dedicated thread per
fragment instance. This thread will run until the fragment instance
completes executing.

On every attempt to send a report to the coordinator, it will attempt
to send up to 3 RPCs. If all 3 of them fail, then the fragment instance
will cancel itself.

However, there is one case where a failure to send the RPC will not
be considered a failed RPC: if, when we attempt to obtain a
connection, we end up creating a new connection
(via ClientCache::CreateClient()) instead of getting a previously
cached connection, and this new connection fails to even Open(),
it will not be counted as an RPC failure.

This patch counts such an error as a failed RPC too.
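
The retry accounting described above can be sketched roughly as follows;
the names and loop structure are illustrative and not the actual
ReportExecStatusAux() code.

    #include <functional>
    #include <string>

    struct Status {  // hypothetical status type for the sketch
      bool ok = true;
      std::string msg;
    };

    // Sketch: a failure to even Open() a newly created connection now counts
    // against the same retry budget as a failed RPC; once all attempts fail,
    // the fragment instance cancels itself.
    Status TryReportWithRetries(const std::function<Status()>& open_and_send,
                                int max_attempts) {
      Status status;
      for (int attempt = 0; attempt < max_attempts; ++attempt) {
        status = open_and_send();  // connection setup failures count as failed RPCs
        if (status.ok) return status;
      }
      return Status{false, "Report RPC failed after retries; cancelling instance"};
    }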

This patch also clarifies some of the error log messages and introduces
a flag to control the sleep interval between failed ReportExecStatus RPC
retries.

Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Reviewed-on: http://gerrit.cloudera.org:8080/9916
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-05 02:17:10 +00:00
Dan Hecht
408ee4d68c IMPALA-5384, part 1: introduce DmlExecState
This change is based on a patch by Marcel Kornacker.

Move data structures that collect DML operation stats from the
RuntimeState and Coordinator into a new DmlExecState class, which
has its own lock. This removes a dependency on the coordinator's
lock, which will allow further coordinator locking cleanup in the next
patch.

Change-Id: Id4c025917620a7bff2acbeb46464f107ab4b7565
Reviewed-on: http://gerrit.cloudera.org:8080/9793
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-29 06:29:16 +00:00
Michael Ho
077f94a82b IMPALA-6609: Fix ownership of class members in KrpcDataStreamRecvr
A KrpcDataStreamRecvr is co-owned by the singleton KrpcDataStreamMgr
and an exchange node. It's possible that some threads (e.g. RPC service
threads) may still retain reference to the KrpcDataStreamRecvr after its
owning exchange node has been closed. This is problematic as some members
in the receiver (e.g. sender/receiver profiles) are actually owned by the
exchange node so accessing them after the exchange node is closed and
possibly deleted may lead to use-after-free.

This patch changes the ownership of some members in KrpcDataStreamRecvr
to the receiver. In particular, the profiles are now owned by the receiver
and various stat counters and timers are in turn owned by these profiles.
This prevents the use-after-free problem described above. This patch also
moves the access to deferred_rpc_tracker_ in TakeOverEarlySenders() to be
under the sender queue's 'lock_' to prevent another instance of IMPALA-6554.

Testing done: core debug build.

Change-Id: I3378496e2201b16c662b9daa634333480705c61a
Reviewed-on: http://gerrit.cloudera.org:8080/9527
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-12 22:31:01 +00:00
Michael Ho
61ca7be6ec IMPALA-2990: Add a warning message during cancellation
Until IMPALA-2990 is fixed, there is no easy way to
prevent query hangs due to cancellation of fragment
instances in a backend node when it fails to report
status to the coordinator. Given it's rather hard to
diagnose IMPALA-2990, this change adds a log statement
at the point of cancellation to warn about IMPALA-2990.
This hopefully makes diagnostics slightly easier.

Change-Id: Icfd56edfe74707592f4dd9c840550b13cb80e9a0
Reviewed-on: http://gerrit.cloudera.org:8080/9413
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-27 11:43:10 +00:00
Lars Volker
be0f4f9d21 IMPALA-6190/6246: Add instances tab and event sequence
This change adds tracking of the current state during the execution of a
fragment instance. The current state is then reported back to the
coordinator and exposed to users via a new tab in the query detail debug
webpage.

This change also adds an event timeline to fragment instances in the
query profile. The timeline measures the time since backend-local query
start at which particular events complete. Events are derived from the
current state of the execution of a fragment instance. For example:

    - Prepare Finished: 13.436ms (13.436ms)
    - First Batch Produced: 1s022ms (1s008ms)
    - First Batch Sent: 1s022ms (455.558us)
    - ExecInternal Finished: 2s783ms (1s760ms)
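
A rough sketch of how such events could be recorded relative to
backend-local query start is shown below; the class and method names are
illustrative only.

    #include <chrono>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch: record (label, elapsed-since-query-start) pairs as the fragment
    // instance moves through its execution states.
    class EventSequence {
     public:
      EventSequence() : start_(std::chrono::steady_clock::now()) {}
      void MarkEvent(std::string label) {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        events_.emplace_back(std::move(label),
            std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
      }
     private:
      std::chrono::steady_clock::time_point start_;
      std::vector<std::pair<std::string, int64_t>> events_;  // label, ms since start
    };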

I added automated tests for both extensions and additionally verified
the change by manual inspection.

Here are the TPCH performance comparison results between this change and
the previous commit on a 16 node cluster.

+------------+-----------------------+---------+------------+------------+----------------+
| Workload   | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+------------+-----------------------+---------+------------+------------+----------------+
| TPCH(_300) | parquet / none / none | 18.47   | -0.94%     | 9.72       | -1.08%         |
+------------+-----------------------+---------+------------+------------+----------------+

+------------+----------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| Workload   | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |
+------------+----------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| TPCH(_300) | TPCH-Q5  | parquet / none / none | 48.88  | 46.93       |   +4.15%   |   0.14%    |   3.61%        | 1           | 3     |
| TPCH(_300) | TPCH-Q13 | parquet / none / none | 21.64  | 21.15       |   +2.29%   |   2.06%    |   1.84%        | 1           | 3     |
| TPCH(_300) | TPCH-Q11 | parquet / none / none | 1.71   | 1.70        |   +1.12%   |   0.54%    |   2.51%        | 1           | 3     |
| TPCH(_300) | TPCH-Q18 | parquet / none / none | 33.15  | 32.79       |   +1.09%   |   0.13%    |   2.03%        | 1           | 3     |
| TPCH(_300) | TPCH-Q14 | parquet / none / none | 5.95   | 5.90        |   +0.82%   |   2.19%    |   0.49%        | 1           | 3     |
| TPCH(_300) | TPCH-Q1  | parquet / none / none | 13.99  | 13.90       |   +0.63%   |   0.25%    |   1.39%        | 1           | 3     |
| TPCH(_300) | TPCH-Q2  | parquet / none / none | 3.44   | 3.44        |   +0.00%   | * 20.29% * | * 20.76% *     | 1           | 3     |
| TPCH(_300) | TPCH-Q6  | parquet / none / none | 1.21   | 1.22        |   -0.01%   |   0.06%    |   0.06%        | 1           | 3     |
| TPCH(_300) | TPCH-Q20 | parquet / none / none | 3.51   | 3.51        |   -0.11%   |   7.15%    |   7.30%        | 1           | 3     |
| TPCH(_300) | TPCH-Q16 | parquet / none / none | 6.89   | 6.91        |   -0.21%   |   0.65%    |   0.55%        | 1           | 3     |
| TPCH(_300) | TPCH-Q4  | parquet / none / none | 4.78   | 4.80        |   -0.38%   |   0.06%    |   0.59%        | 1           | 3     |
| TPCH(_300) | TPCH-Q19 | parquet / none / none | 30.78  | 31.04       |   -0.83%   |   0.45%    |   1.03%        | 1           | 3     |
| TPCH(_300) | TPCH-Q22 | parquet / none / none | 6.06   | 6.12        |   -1.02%   |   1.51%    |   2.12%        | 1           | 3     |
| TPCH(_300) | TPCH-Q10 | parquet / none / none | 9.43   | 9.58        |   -1.54%   |   0.69%    |   3.30%        | 1           | 3     |
| TPCH(_300) | TPCH-Q21 | parquet / none / none | 93.41  | 95.18       |   -1.86%   |   0.08%    |   0.81%        | 1           | 3     |
| TPCH(_300) | TPCH-Q15 | parquet / none / none | 3.40   | 3.47        |   -1.99%   |   0.72%    |   1.27%        | 1           | 3     |
| TPCH(_300) | TPCH-Q7  | parquet / none / none | 44.98  | 46.24       |   -2.71%   |   1.83%    |   1.27%        | 1           | 3     |
| TPCH(_300) | TPCH-Q3  | parquet / none / none | 28.06  | 29.11       |   -3.61%   |   1.62%    |   1.23%        | 1           | 3     |
| TPCH(_300) | TPCH-Q12 | parquet / none / none | 3.15   | 3.28        |   -3.80%   |   0.96%    |   1.32%        | 1           | 3     |
| TPCH(_300) | TPCH-Q9  | parquet / none / none | 29.47  | 30.80       |   -4.30%   |   0.29%    |   0.34%        | 1           | 3     |
| TPCH(_300) | TPCH-Q17 | parquet / none / none | 4.37   | 4.62        |   -5.33%   |   0.63%    |   0.54%        | 1           | 3     |
| TPCH(_300) | TPCH-Q8  | parquet / none / none | 7.99   | 8.46        |   -5.53%   |   7.95%    |   1.11%        | 1           | 3     |
+------------+----------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+

Here are the TPCDS performance comparison results between this change
and the previous commit on a 16 node cluster. I inspected the Q2 results
and concluded that the variability is unrelated to this change.

+--------------+-----------------------+---------+------------+------------+----------------+
| Workload     | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+--------------+-----------------------+---------+------------+------------+----------------+
| TPCDS(_1000) | parquet / none / none | 13.07   | +0.51%     | 4.27       | +1.83%         |
+--------------+-----------------------+---------+------------+------------+----------------+

+--------------+------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| Workload     | Query      | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |
+--------------+------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| TPCDS(_1000) | TPCDS-Q2   | parquet / none / none | 8.36   | 4.25        | R +96.81%  | * 48.88% * |   0.42%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q8   | parquet / none / none | 1.59   | 1.35        |   +17.86%  | * 13.91% * |   4.01%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q73  | parquet / none / none | 1.81   | 1.71        |   +5.92%   |   5.53%    |   0.15%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q28  | parquet / none / none | 7.26   | 6.95        |   +4.47%   |   1.09%    |   1.11%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q46  | parquet / none / none | 2.36   | 2.30        |   +2.62%   |   1.45%    |   0.40%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q7   | parquet / none / none | 2.78   | 2.73        |   +1.98%   |   1.21%    |   2.23%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q55  | parquet / none / none | 1.05   | 1.03        |   +1.91%   |   1.16%    |   2.20%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q42  | parquet / none / none | 1.05   | 1.04        |   +1.71%   |   0.90%    |   2.63%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q19  | parquet / none / none | 1.67   | 1.65        |   +1.55%   |   1.12%    |   1.96%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q23  | parquet / none / none | 151.75 | 149.94      |   +1.20%   |   3.23%    |   1.83%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q64  | parquet / none / none | 40.25  | 39.79       |   +1.16%   |   0.43%    |   0.28%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q96  | parquet / none / none | 2.25   | 2.22        |   +1.05%   |   1.00%    |   0.11%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q53  | parquet / none / none | 1.60   | 1.58        |   +1.01%   |   1.28%    |   0.04%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q79  | parquet / none / none | 4.17   | 4.13        |   +0.94%   |   0.89%    |   0.06%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q59  | parquet / none / none | 5.74   | 5.71        |   +0.60%   |   1.22%    |   2.56%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q52  | parquet / none / none | 0.89   | 0.89        |   +0.14%   |   0.03%    |   0.63%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q88  | parquet / none / none | 7.10   | 7.12        |   -0.23%   |   0.43%    |   0.47%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q3   | parquet / none / none | 1.10   | 1.11        |   -0.40%   |   0.58%    |   0.36%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q98  | parquet / none / none | 2.30   | 2.31        |   -0.49%   |   3.58%    |   1.04%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q61  | parquet / none / none | 1.87   | 1.89        |   -1.08%   |   1.68%    |   0.14%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q27a | parquet / none / none | 2.93   | 2.96        |   -1.18%   |   1.74%    |   1.54%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q34  | parquet / none / none | 2.23   | 2.27        |   -1.73%   |   1.91%    |   1.32%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q63  | parquet / none / none | 1.56   | 1.60        |   -1.96%   |   1.91%    |   3.33%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q89  | parquet / none / none | 2.64   | 2.70        |   -2.20%   |   1.93%    |   1.88%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q47  | parquet / none / none | 30.41  | 31.17       |   -2.41%   |   1.09%    |   1.52%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q1   | parquet / none / none | 3.77   | 3.86        |   -2.46%   |   1.91%    |   0.61%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q6   | parquet / none / none | 61.67  | 63.34       |   -2.65%   |   3.77%    |   0.31%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q4   | parquet / none / none | 31.11  | 31.96       |   -2.66%   |   0.61%    |   0.77%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q43  | parquet / none / none | 4.10   | 4.22        |   -2.87%   |   1.40%    |   2.85%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q5   | parquet / none / none | 8.30   | 8.56        |   -3.13%   |   1.55%    |   0.47%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q27  | parquet / none / none | 2.28   | 2.35        |   -3.13%   |   1.17%    |   1.56%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q65  | parquet / none / none | 31.74  | 32.77       |   -3.15%   |   1.47%    |   1.11%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q68  | parquet / none / none | 1.56   | 1.62        |   -3.58%   |   9.37%    | * 11.93% *     | 1           | 3     |
+--------------+------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+

(R) Regression: TPCDS(_1000) TPCDS-Q2 [parquet / none / none] (4.25s -> 8.36s [+96.81%])
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+---------+-----------+
| Operator            | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Rows   | Est #Rows |
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+---------+-----------+
| 27:MERGING-EXCHANGE | 22.48%     | 6.97s    | 2.85s    | +144.40%   | * 58.44% * | 11.05s   | 2.86s    | +286.33%   | 1      | 2.51K   | 2.56K     |
| 26:EXCHANGE         | 7.65%      | 2.37s    | 2.43s    | -2.16%     |   1.82%    | 2.46s    | 2.50s    | -1.65%     | 14     | 365     | 2.56K     |
| 23:EXCHANGE         | 8.58%      | 2.66s    | 2.70s    | -1.46%     |   1.67%    | 2.74s    | 2.78s    | -1.47%     | 14     | 516     | 10.64K    |
| 13:AGGREGATE        | 4.21%      | 1.31s    | 1.30s    | +0.65%     |   0.06%    | 1.47s    | 1.43s    | +2.38%     | 14     | 516     | 10.64K    |
| 12:HASH JOIN        | 2.89%      | 896.20ms | 885.79ms | +1.17%     |   1.43%    | 1.06s    | 1.01s    | +4.77%     | 14     | 433.27M | 2.16B     |
| 06:SCAN HDFS        | 2.83%      | 877.34ms | 886.93ms | -1.08%     |   1.23%    | 888.16ms | 906.88ms | -2.06%     | 1      | 365     | 373       |
| 19:EXCHANGE         | 23.20%     | 7.20s    | 3.12s    | +130.58%   | * 56.73% * | 11.33s   | 3.17s    | +256.92%   | 14     | 520     | 10.64K    |
| 05:AGGREGATE        | 12.06%     | 3.74s    | 1.34s    | +178.49%   | * 64.53% * | 6.33s    | 1.53s    | +314.84%   | 14     | 520     | 10.64K    |
| 04:HASH JOIN        | 7.71%      | 2.39s    | 956.81ms | +149.90%   | * 60.36% * | 4.04s    | 1.13s    | +256.75%   | 14     | 442.29M | 2.16B     |
| 03:SCAN HDFS        | 2.83%      | 878.97ms | 894.11ms | -1.69%     |   1.34%    | 890.78ms | 910.22ms | -2.14%     | 1      | 371     | 73.05K    |
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+---------+-----------+

Change-Id: I626456b6afa9101eeeeffd5cda10c4096d63d7f9
Reviewed-on: http://gerrit.cloudera.org:8080/8758
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-24 02:41:31 +00:00
Michael Ho
e714f2b33c IMPALA-2397: Use atomics for IntGauge and IntCounter
This change removes the spinlock in IntGauge and IntCounter
and uses AtomicInt64 instead. As shown in IMPALA-2397, multiple
threads can be contending for the spinlocks of some global metrics
under concurrent queries.

This change also restructures SimpleMetric: it is renamed to ScalarMetric
and broken into two subclasses:
- LockedMetric:
  - a value store for any primitive type (int, float, string, etc.).
  - atomic read and write via GetValue() and SetValue() respectively.

- AtomicMetric:
  - the basis of IntGauge and IntCounter. Supports atomic increment
    of the metric value via the Increment() interface.
  - atomic read and write via GetValue() and SetValue() respectively.
  - only supports the int64_t type.
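
A minimal sketch of the AtomicMetric idea is shown below, using
std::atomic rather than Impala's AtomicInt64 wrapper; names and
interfaces are simplified for illustration.

    #include <atomic>
    #include <cstdint>

    // Sketch: an int64-only metric whose reads, writes and increments are
    // lock-free, replacing the previous spinlock-protected value.
    class AtomicMetric {
     public:
      explicit AtomicMetric(int64_t initial = 0) : value_(initial) {}
      int64_t GetValue() const { return value_.load(std::memory_order_relaxed); }
      void SetValue(int64_t v) { value_.store(v, std::memory_order_relaxed); }
      int64_t Increment(int64_t delta) {
        return value_.fetch_add(delta, std::memory_order_relaxed) + delta;
      }
     private:
      std::atomic<int64_t> value_;
    };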

Change-Id: I48dfa5443cd771916b53541a0ffeaf1bcc7e7606
Reviewed-on: http://gerrit.cloudera.org:8080/9012
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-18 23:31:52 +00:00
Zoltan Borok-Nagy
8047b1dcb4 IMPALA-3703: Store query context in thread-local variables
This commit introduces the ThreadDebugInfo class which can
hold information about the current thread that can be useful
in debug sessions. It needs to be allocated on the stack of
each thread in order to be included in minidumps.

Currently a ThreadDebugInfo object is created in
Thread::SuperviseThread. This object is available in all
the child stack frames through the global function
GetThreadDebugInfo().

ThreadDebugInfo has members for the thread name and instance
id. These are fixed-size char buffers.
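
The description above suggests a layout roughly like the sketch below;
the buffer sizes and helper function are assumptions for illustration.

    #include <cstring>

    // Sketch: fixed-size char buffers so the data lands in the thread's stack
    // frame and is captured by minidumps; a thread-local pointer lets any
    // frame below Thread::SuperviseThread() find it.
    struct ThreadDebugInfo {
      char thread_name[256] = {};
      char instance_id[64] = {};
    };

    thread_local ThreadDebugInfo* thread_debug_info = nullptr;

    void SetThreadName(ThreadDebugInfo* info, const char* name) {
      strncpy(info->thread_name, name, sizeof(info->thread_name) - 1);
    }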

If you have a core dump written by Impala, you can locate the
ThreadDebugInfo for the current thread through the global
pointer impala::thread_debug_info.

In a core file that has been created from a minidump, we need
to select the stack frame that allocated the ThreadDebugInfo
object in order to inspect it. It is currently allocated in
Thread::SuperviseThread().

We can use printf in gdb to print the members, e.g.:
printf "%s\n" thread_debug_info.instance_id

Currently the thread name and instance id are stored.

I created some tests in thread-debug-info-test.cc

Change-Id: I566f7f1db5117c498e86e0bd05b33bdcff624609
Reviewed-on: http://gerrit.cloudera.org:8080/8621
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-18 15:32:59 +00:00
Thomas Tauber-Marshall
f3fa3e017f IMPALA-6081: Fix test_basic_filters runtime profile failure
test_basic_filters has been occasionally failing due to a line missing
from a runtime profile for a particular query.

The problem is that the query returns all of its results before all of
its fragment instances are finished executing (due to a limit). Then,
when one fragment instance reports its status, the coordinator returns
to it a 'cancelled' status, causing all remaining instances for that
backend to be cancelled.

Sometimes this cancellation happens quickly enough that the relevant
fragment instances have not yet sent a status report when they are
cancelled. They will still send a report in finalize, but as the
coordinator only updates its runtime profile for 'ok' status reports,
not 'cancelled', the final runtime profile doesn't end up with any
data for those fragment instances, which means the test does not find
the line in the runtime profile it's checking for.

The fix is to have the coordinator update its runtime profile with
every status report it receives, regardless of error status.

Testing:
- Ran existing runtime profile tests, which rely on profile output,
  in a loop.
- Manually tested some scenarios with failed queries and checked that
  the new profile output is reasonable.
- Added a new e2e test that runs the affected query and checks for the
  presence of info for all expected exec nodes in the profile. This
  repros the underlying issue consistently.

Change-Id: I4f581c7c8039f02a33712515c5bffab942309bba
Reviewed-on: http://gerrit.cloudera.org:8080/8754
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-07 21:07:02 +00:00
Tim Armstrong
7487c5de04 IMPALA-1575: part 2: yield admission control resources
This change releases admission control resources more eagerly,
once the query has finished actively executing. Some resources
(tracked and untracked) are still consumed by the client request
as long as it remains open, e.g. memory for control structures
and the result cache. However, these resources are relatively
small and should not block admission of new queries.

The same as in part 1, query execution is considered to be finished
under any of the following conditions:
1. The query encounters an error and fails
2. The query is cancelled due to the idle query timeout
3. The query reaches eos (or the DML completes)
4. The client cancels the query without closing the query

Admission control resources are released in two ways:
1. by calling AdmissionController::ReleaseQuery() on the coordinator
   promptly after query execution finishes, instead of waiting for
   UnregisterQuery(). This means that the query and its memory is
   no longer considered "admitted".
2. by changing the behaviour of MemTracker::GetPoolMemReserved() so
   that it is aware of when a query has finished executing and does not
   consider its entire memory limit to be "reserved".

The preconditions for releasing an admitted query are subtle because the
queries are being admitted to a distributed system, not just the
coordinator.  The comment for ReleaseAdmissionControlResources()
documents the preconditions and rationale. Note that the preconditions
are not weaker than the preconditions of calling UnregisterQuery()
before this patch.

Testing:
TestAdmissionController is extended to end queries in four ways:
cancellation by client, idle timeout, the last row being fetched,
and the client closing the query. The test uses a mix of all four.
After the query ends, all clients wait for the test to complete
before closing the query or closing the connection. This ensures
that the admission control decisions are based entirely on the
query end behavior. This test works for both query admission control
and mem_limit admission control and can detect both kinds of admission
control resources ("admitted" and "reserved") not being released
promptly.

I ran into a problem similar to IMPALA-3772 with the admission control
tests becoming flaky due to query timeouts on release builds, which I
solved in a similar way by increasing the frequency of statestore
updates.

This is based on an earlier patch by Joe McDonnell.

Change-Id: Ib1fae8dc1c4b0eca7bfa8fadae4a56ef2b37947a
Reviewed-on: http://gerrit.cloudera.org:8080/8581
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-20 04:34:47 +00:00
Thomas Tauber-Marshall
2510fe0aa0 IMPALA-4252: Min-max runtime filters for Kudu
This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on whether it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Values are inserted into min-max filters by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.
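
A rough sketch of a string min-max filter with the 1024-character
truncation described above; this is illustrative and not the actual
Impala MinMaxFilter interface.

    #include <string>

    // Sketch: track the min and max string seen on the build side.
    class StringMinMaxFilter {
     public:
      void Insert(const std::string& v) {
        // Truncating the min bound to a 1024-char prefix is safe because a
        // prefix always compares <= the original value. A real implementation
        // would also truncate the max bound, rounding it up so it remains an
        // upper bound; this sketch keeps the max untruncated for simplicity.
        std::string lower = v.substr(0, 1024);
        if (!has_values_ || lower < min_) min_ = lower;
        if (!has_values_ || v > max_) max_ = v;
        has_values_ = true;
      }
      bool MayContain(const std::string& v) const {
        return !has_values_ || (v >= min_ && v <= max_);
      }
     private:
      bool has_values_ = false;
      std::string min_, max_;
    };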

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that eliminates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Reviewed-on: http://gerrit.cloudera.org:8080/7793
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-17 21:33:51 +00:00
Tim Armstrong
a772f84562 IMPALA-6171: Revert "IMPALA-1575: part 2: yield admission control resources"
This reverts commit fe90867d89.

Change-Id: I3eec4b5a6ff350933ffda0bb80949c5960ecdf25
Reviewed-on: http://gerrit.cloudera.org:8080/8499
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-08 22:03:59 +00:00
Tim Armstrong
fe90867d89 IMPALA-1575: part 2: yield admission control resources
This change releases admission control resources more eagerly,
once the query has finished actively executing. Some resources
(tracked and untracked) are still consumed by the client request
as long as it remains open, e.g. memory for control structures
and the result cache. However, these resources are relatively
small and should not block admission of new queries.

The same as in part 1, query execution is considered to be finished
under any of the following conditions:
1. The query encounters an error and fails
2. The query is cancelled due to the idle query timeout
3. The query reaches eos (or the DML completes)
4. The client cancels the query without closing the query

Admission control resources are released in two ways:
1. by calling AdmissionController::ReleaseQuery() on the coordinator
   promptly after query execution finishes, instead of waiting for
   UnregisterQuery(). This means that the query and its memory is
   no longer considered "admitted".
2. by changing the behaviour of MemTracker::GetPoolMemReserved() so
   that it is aware of when a query has finished executing and does not
   consider its entire memory limit to be "reserved".

The preconditions for releasing an admitted query are subtle because the
queries are being admitted to a distributed system, not just the
coordinator.  The comment for ReleaseAdmissionControlResources()
documents the preconditions and rationale. Note that the preconditions
are not weaker than the preconditions of calling UnregisterQuery()
before this patch.

Testing:
TestAdmissionController is extended to end queries in four ways:
cancellation by client, idle timeout, the last row being fetched,
and the client closing the query. The test uses a mix of all four.
After the query ends, all clients wait for the test to complete
before closing the query or closing the connection. This ensures
that the admission control decisions are based entirely on the
query end behavior. This test works for both query admission control
and mem_limit admission control and can detect both kinds of admission
control resources ("admitted" and "reserved") not being released
promptly.

This is based on an earlier patch by Joe McDonnell.

Change-Id: I80279eb2bda740d7f61420f52db3bfa42a6a51ac
Reviewed-on: http://gerrit.cloudera.org:8080/8323
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-07 05:16:11 +00:00
Tim Armstrong
9174af38c8 IMPALA-1575: Part 1: eagerly release query exec resources
Release of backend resources for query execution on the coordinator
daemon should occur eagerly as soon as query execution is finished
or cancelled. Before this patch some resources managed by QueryState,
like scratch files, were only released when the query was closed
and all of the query's control structures were torn down.

These resources are referred to as "ExecResources" in various places to
distinguish them from resources associated with the client request (like
the result cache) that are still required after the query finishes
executing.

This first change does not solve the admission control problem for two
reasons:
* We don't release the "admitted" memory on the coordinator until
  the query is unregistered.
* Admission control still considers the memory reserved until the
  query memtracker is unregistered, which happens only when the
  QueryState is destroyed: see MemTracker::GetPoolMemReserved().

The flow is mostly similar to initial_reservation_refcnt_, except the
coordinator also holds onto a reference count, which is released
when either the final row is returned or cancellation is initiated.
After the coordinator releases its refcount, the resources can be
freed as soon as local fragments also release their refcounts.

Also clean up Coordinator slightly by preventing runtime_state() from
leaking out to ClientRequestState - instead it's possible to log
the query MemTracker by following parent links in the MemTracker.

This patch is partially based on Joe McDonnell's IMPALA-1575 patch.

Testing:
Ran core tests.

Change-Id: I41ff374b0403f10a145f7fee9b3145953ee32341
Reviewed-on: http://gerrit.cloudera.org:8080/8303
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-28 02:45:35 +00:00
Tim Armstrong
7866eec5bd IMPALA-5895: clean up runtime profile lifecycle
Require callers to explicitly stop counter updating instead of doing it
in the destructor. This replaces ad-hoc logic to stop individual
counters.

Track which counters need to be stopped in separate lists instead of
stopping everything.

Force all RuntimeProfiles to use ObjectPools in a uniform way - the
profile, its counters and its children all are allocated from the
same pool. This is done via a new Create() method.

Consolidate 'time_series_counter_map_lock_' and 'counter_map_lock_'
to reduce complexity of the locking scheme. I didn't see any benefit
to sharding the locks in this way - there are only two time series
counters per fragment instance, which is a small fraction of the
total number of counters.

Change-Id: I45c39ac36c8e3c277213d32f5ae5f14be6b7f0df
Reviewed-on: http://gerrit.cloudera.org:8080/8069
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-20 08:48:38 +00:00
Joe McDonnell
e993b9712c IMPALA-5750: Catch exceptions from boost thread creation
The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.
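
The conversion from an exception to a Status described in point 1 above
might look roughly like this sketch; Thread and Status here are
simplified stand-ins, not the real util/thread.h types.

    #include <functional>
    #include <memory>
    #include <string>
    #include <boost/thread/thread.hpp>

    struct Status {  // hypothetical status type for the sketch
      bool ok = true;
      std::string msg;
      static Status OK() { return Status(); }
    };

    // Sketch: wrap boost::thread construction so a thread_resource_error
    // (e.g. a ulimit being hit) becomes a Status instead of crashing Impala.
    Status CreateThread(std::function<void()> body,
                        std::unique_ptr<boost::thread>* thread) {
      try {
        thread->reset(new boost::thread(std::move(body)));
      } catch (const boost::thread_resource_error& e) {
        return Status{false, std::string("Could not create thread: ") + e.what()};
      }
      return Status::OK();
    }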

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Reviewed-on: http://gerrit.cloudera.org:8080/7730
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-07 03:25:30 +00:00
Joe McDonnell
91f7bc1947 IMPALA-5892: Allow reporting status independent of fragment instance
Queries can hit an error that is not specific to a
particular fragment instance. For example, QueryState::StartFInstances()
calls DescriptorTbl::Create() before any fragment instances
start. This location has no reason to report status via a
particular fragment, and there is currently no way to report
status otherwise. This leads to a query hang, because the
status is never propagated back to the coordinator.

This adds the ability to report status that is not associated
with a particular fragment instance. By reporting status,
the coordinator will now correctly abort the query in the
case of the QueryState::StartFInstances() scenario described.

Change-Id: I4cd98022f1d62a999c7c80ff5474fa8d069eb12c
Reviewed-on: http://gerrit.cloudera.org:8080/7943
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-06 21:26:24 +00:00
Matthew Jacobs
7264c54751 IMPALA-5644,IMPALA-5810: Min reservation improvements
Rejects queries during admission control if:
* the largest (across all backends) min buffer reservation is
  greater than the query mem_limit or buffer_pool_limit
* the sum of the min buffer reservations across the cluster
  is larger than the pool max mem resources
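
The two rejection conditions above can be sketched as a simple check;
the function and parameter names are illustrative only.

    #include <cstdint>

    // Sketch: reject at admission if any backend's min reservation exceeds the
    // per-query memory limit, or the cluster-wide sum exceeds the pool limit.
    bool RejectForMinReservation(int64_t largest_min_reservation,
                                 int64_t query_mem_limit,
                                 int64_t sum_min_reservations,
                                 int64_t pool_max_mem_resources) {
      if (query_mem_limit > 0 && largest_min_reservation > query_mem_limit) {
        return true;
      }
      if (pool_max_mem_resources > 0 && sum_min_reservations > pool_max_mem_resources) {
        return true;
      }
      return false;
    }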

There are some other interesting cases to consider later:
* every per-backend min buffer reservation is less than the
  associated backend's process mem_limit; the current
  admission control code doesn't know about other backend's
  proc mem_limits.

Also reduces minimum non-reservation memory (IMPALA-5810).
See the JIRA for experimental results that show this
slightly improves min memory requirements for small queries.
One reason to tweak this is to compensate for the fact that
BufferedBlockMgr didn't count small buffers against the
BlockMgr limit, but BufferPool counts all buffers against
it.

Testing:
* Adds new test cases in test_admission_controller.py
* Adds BE tests in reservation-tracker-test for the
  reservation-util code.

Change-Id: Iabe87ce8f460356cfe4d1be4d7092c5900f9d79b
Reviewed-on: http://gerrit.cloudera.org:8080/7678
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-22 08:27:12 +00:00
Matthew Jacobs
985698bb6f IMPALA-4833: Compute precise per-host reservation size, pt2
Addresses some comments from the CR:
https://gerrit.cloudera.org/#/c/7630/8

Notably changes the name of initial_reservation_total_bytes
to initial_reservation_total_claims and attempts to clarify
comments.

Change-Id: I1391d99ca15d2ebcaabd564fbe1242806be09f72
Reviewed-on: http://gerrit.cloudera.org:8080/7681
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-18 19:34:35 +00:00
Matthew Jacobs
6c12546561 IMPALA-4833: Compute precise per-host reservation size
Before this change, the per-host reservation size was computed
by the Planner. However, scheduling happens after planning,
so the Planner must assume that all fragments run on all
hosts, and the reservation size is likely much larger than
it needs to be.

This moves the computation of the per-host reservation size
to the BE where it can be computed more precisely. This also
includes a number of plan/profile changes.

Change-Id: Idbcd1e9b1be14edc4017b4907e83f9d56059fbac
Reviewed-on: http://gerrit.cloudera.org:8080/7630
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-12 08:10:07 +00:00
Tim Armstrong
27dcc768d2 IMPALA-5715: (mitigation only) defer destruction of MemTrackers
One potential candidate for the bad MemTracker in IMPALA-5715 is one owned
by a join build sink. I haven't found a clear root cause, but we can
reduce the probability of bugs like this by deferring teardown of the
MemTrackers.

This patch defers destruction of the fragment instance, ExecNode,
DataSink and Codegen MemTrackers until query teardown. Instead
MemTracker::Close() is called at the place where the MemTracker
would have been destroyed to check that all memory is released
and enforce that no more memory is consumed. The entire query
MemTracker subtree is then disconnected in whole from the global
tree in QueryState::ReleaseResources(), instead of the MemTrackers
being incrementally disconnected bottom-up.
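
A hedged sketch of the Close()-then-destroy-later pattern described
above; the method names follow this message but the class is heavily
simplified.

    #include <cassert>
    #include <cstdint>

    // Sketch: Close() checks that all memory was released and forbids further
    // consumption, while the tracker object stays alive until query teardown.
    class MemTracker {
     public:
      void Consume(int64_t bytes) {
        assert(!closed_);  // no new consumption after Close()
        consumption_ += bytes;
      }
      void Release(int64_t bytes) { consumption_ -= bytes; }
      void Close() {
        assert(consumption_ == 0);  // all memory must be released by now
        closed_ = true;
      }
     private:
      int64_t consumption_ = 0;
      bool closed_ = false;
    };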

In cases where the MemTracker is owned by another object, this
required deferring teardown of the owner until query teardown.
E.g. for LlvmCodeGen I added a Close() method to release resources
and deferred calling the destructor.

We want to make this change anyway - see IMPALA-5587.

Testing:
Ran a core ASAN build.

Change-Id: I205abb0076d1ffd08cb93c0f1671c8b81e7fba0f
Reviewed-on: http://gerrit.cloudera.org:8080/7492
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-08 19:43:36 +00:00
Tim Armstrong
a98b90bd38 IMPALA-4674: Part 2: port backend exec to BufferPool
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.

min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.

Port ExecNodes to use BufferPool:
  * Each ExecNode has to claim its reservation during Open()
  * Port Sorter to use BufferPool.
  * Switch from BufferedTupleStream to BufferedTupleStreamV2
  * Port HashTable to use BufferPool via a Suballocator.

This also makes PAGG memory consumption more efficient (avoid wasting buffers)
and improve the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
  cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
  buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)

TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.

Testing:
* Updated tests to reflect new memory requirements

Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-05 01:03:02 +00:00
Henry Robinson
d25db64f0e IMPALA-4905: Don't send empty insert status to coordinator
If the coordinator, in UpdateBackendExecStatus(), receives a report that
includes a TInsertExecStatus, it will call UpdateInsertExecStatus()
which takes the coordinator-wide lock. Avoid doing this for fragment
instances that would only send an empty TInsertExecStatus (including
instances that belong to SELECT queries, not DML queries).

Future changes should fix the locking around the
UpdateBackendExecStatus() path to remove dependencies on
Coordinator::lock_, but this fix is simple and addresses one point of
needless contention.

Change-Id: I314dd8d96922d273c6329266970d249ec8c5c608
Reviewed-on: http://gerrit.cloudera.org:8080/7457
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins
2017-07-25 00:09:19 +00:00
Michael Ho
e0ba5ef6a2 IMPALA-5558/IMPALA-5576: Reopen stale client connection
Previously, the retry logic in DoRpc() only allows retry to
happen if send() didn't complete successfully and the exception
indicates a closed connection. However, send() returning
successfully doesn't guarantee that the bytes have actually
reached the remote peer. According to the man page of send(),
when the message does not fit into the send buffer of the socket,
send() normally blocks. So the payload of RPC may be buffered in
the kernel if there is room for it. TCP allows a connection to
be half-open. If an Impalad node is restarted, a stale client
connection to that node may still allow send() to appear to succeed
even though the payload wasn't sent. However, upon calling recv()
in the RPC call to fetch the response, the client will get a return
value of 0. In which case, thrift will throw an exception as the
connection to the remote peer is closed already. Apparently, the
existing retry logic doesn't quite handle this case. One can
consistently reproduce the problem by warming the client cache
followed by restarting one of the Impalad nodes. It will result in
a series of query failures due to stale connections.

This change augments the retry logic to also retry the entire RPC
if the exception string contains the messages "No more data to read."
or "SSL_read: Connection reset by peer" to capture the case of stale
connections. Our usage of thrift doesn't involve half-open TCP connection
so having a broken connection in recv() indicates the remote end has
closed the socket already. The generated thrift code doesn't knowingly
close the socket before an RPC completes unless the process crashes,
the connection is stale (e.g. the remote node was restarted) or the
remote end fails to read from the client. In any of these cases, the entire
RPC should just be retried with a new connection.
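
The exception-message check described above might look like the sketch
below; it is illustrative and not the actual DoRpc() retry code.

    #include <string>

    // Sketch: treat these Thrift exception messages as signs of a stale,
    // half-open connection, so the whole RPC is retried on a fresh one.
    bool IsStaleConnectionError(const std::string& exception_msg) {
      return exception_msg.find("No more data to read.") != std::string::npos ||
             exception_msg.find("SSL_read: Connection reset by peer") != std::string::npos;
    }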

This change also fixes QueryState::ReportExecStatusAux() to
unconditionally retry up to 3 times when reporting the exec status of a
fragment instance. Previously, it could break out of the loop early
if RPC fails with 'retry_is_safe' == true (e.g. due to recv() timeout)
or if the connection to coordinator fails (IMPALA-5576). Declaring the
RPC to have failed may cause all fragment instances of a query to be
cancelled locally, triggering query hang due to IMPALA-2990. Similarly,
the cancellation RPC is also idempotent so it should be unconditionally
retried up to 3 times with 100ms sleep time in between.

The status reporting is idempotent, as the handler simply ignores the
RPC if it determines that all fragment instances on a given backend
are done, so it should be safe to retry the RPC. This change updates
ApplyExecStatusReport() to handle duplicated status reports with
done bit set. Previously we would drop some other fragment instances'
statuses if we received duplicate 'done' statuses from the same
fragment instance(s).

Testing done: Warmed up client cache by running stress test followed by
restarting some Impalad nodes. Running queries used to fail or hang
consistently in the past. It now works with the patch. Also ran CSL
endurance tests and exhaustive builds.

Change-Id: I4d722c8ad3bf0e78e89887b6cb286c69ca61b8f5
Reviewed-on: http://gerrit.cloudera.org:8080/7284
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-26 09:06:38 +00:00
Michael Ho
b38d9826d7 IMPALA-4192: Disentangle Expr and ExprContext
This change separates Expr and ExprContext. This is a preparatory
step for factoring out static data (e.g. Exprs) of plan fragments
to be shared by multiple plan fragment instances.

This change includes the following:

1. Include aggregate functions (AggFn) as Expr. This separates
   AggFn from its evaluator. AggFn is similar to existing Expr
   as both are represented as a tree of Expr nodes but it doesn't
   really make sense to call Get*Val() on AggFn. This change
   restructures the class hierarchy: much of the existing Expr
   class is now renamed to ScalarExpr. Expr is the parent class
   of both AggFn and ScalarExpr. Expr is defined to be a tree
   with root of either AggFn or ScalarExpr and all descendants
   being ScalarExpr.

2. ExprContext is renamed to ScalarExprEvaluator which is the
   interface for evaluating ScalarExpr; AggFnEvaluator is the
   interface for evaluating AggFn. Multiple evaluators can be
   instantiated per Expr. Expr contains static states of an
   expression while evaluator contains runtime states needed
   for execution (i.e. evaluating the expression).

3. Update all exec nodes to instantiate Expr and their evaluators
   separately. ExecNode::Init() will be responsible for creating
   all the Exprs in an ExecNode while their evaluators are created
   in ExecNode::Prepare(). Certain evaluators are also moved into
   the data structures which actually utilize them. For instance,
   HashTableCtx now owns the build and probe expression evaluators.
   Similarly, TupleRowComparator and Sorter also own the evaluators.
   ExecNode which utilizes these data structures are only responsible
   for creating the expressions used by these data structures.

4. All codegen functions take Exprs instead of evaluators. Also, codegen
   functions will not return error status should the IR function fails the
   LLVM verification step.

5. The assignment of index into the FunctionContext vector is now done
   during the construction of ScalarExpr. Evaluators are only responsible
   for allocating and initializing the FunctionContexts.

6. Open(), Prepare() are now removed from Expr classes. The interface
   for creating any Expr is via either ScalarExpr::Create() or AggFn::Create()
   which will convert a thrift Expr into an initialized Expr object.
   Similarly, the Create() interface is used for creating evaluators from an
   initialized Expr object.

This separation allows the future change to introduce PlanNode data structures.
The plan is to move all ExecNode::Init() logic to PlanNode and call them once
per plan fragment.
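
The class relationships described in points 1 and 2 can be pictured with
the skeletal sketch below; it omits everything except the hierarchy and
is not the real headers.

    class Expr {};                      // tree of expression nodes
    class ScalarExpr : public Expr {};  // root or descendant of any expression tree
    class AggFn : public Expr {};       // root only; its descendants are ScalarExprs

    // Evaluators hold the runtime state; several can share one static Expr.
    class ScalarExprEvaluator {};  // evaluates a ScalarExpr
    class AggFnEvaluator {};       // evaluates an AggFn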

Change-Id: Iefdc9aeeba033355cb9497e3a5d2363627dcf2f3
Reviewed-on: http://gerrit.cloudera.org:8080/5483
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-18 11:08:25 +00:00
Henry Robinson
9caea9bfad IMPALA-5350: Tidy up thread groups for finst exec threads
Put all per-finst threads (profile report, exec thread, async build
thread, scanner threads) together in one group "fragment-execution", and
add finst ID to all their names.

Remove -<tid> suffix from thread names and add a separate column with it
to /threadz web page.

Testing: eyeballed with a join query.

Change-Id: I6fabfcc923863e5f9db4bb8b016c08ff226c079f
Reviewed-on: http://gerrit.cloudera.org:8080/6951
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-01 22:45:53 +00:00
Thomas Tauber-Marshall
eb54287fb4 IMPALA-5167: Reduce the number of Kudu clients created (BE)
Creating Kudu clients is very expensive as each will fetch
metadata from the Kudu master, so we should minimize the
number of Kudu clients that get created.

This patch stores a map from Kudu master addresses to Kudu
clients in the ExecEnv to be used across the BE for all
queries. Another patch will address the FE.
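
A minimal sketch of the master-address-to-client map described above;
the cache structure and placeholder client type are illustrative, not
the actual ExecEnv code.

    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>

    struct KuduClientHandle {};  // placeholder for the real Kudu client type

    // Sketch: one shared client per distinct master-address string, created
    // lazily and reused by all queries on this impalad.
    class KuduClientCache {
     public:
      std::shared_ptr<KuduClientHandle> GetClient(const std::string& master_addresses) {
        std::lock_guard<std::mutex> l(lock_);
        auto& client = clients_[master_addresses];
        if (client == nullptr) client = std::make_shared<KuduClientHandle>();
        return client;
      }
     private:
      std::mutex lock_;
      std::map<std::string, std::shared_ptr<KuduClientHandle>> clients_;
    };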

This relies on a change on the Kudu side that clears
non-covered range entries from the client's cache on
table open (fdc022fe6231af20e307012d98c35b16cbfa7b33)

Testing:
- Ran a stress test on a 10 node cluster: scan of a small
  Kudu table, 1000 concurrent queries, load on the Kudu
  master was reduced significantly, from ~50% cpu to ~5%.
  (with the FE changes included)
- Ran the Kudu e2e tests.
- Manually ran a test with concurrent INSERTs and
  'ALTER TABLE ADD PARTITION' (which is affected by the
  Kudu side change mentioned above) and verified
  correctness.

Change-Id: I6b0c12a256c33e8ef32315b3736cae2dea2ae705
Reviewed-on: http://gerrit.cloudera.org:8080/6792
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-24 22:27:37 +00:00
Marcel Kornacker
368115cdae IMPALA-2550: Switch to per-query exec rpc
Coordinator:
- FragmentInstanceState -> BackendState, which in turn records
  FragmentInstanceStats

QueryState
- does query-wide setup in a separate thread (which also launches
  the instance exec threads)
- has a query-wide 'prepared' state at which point all static setup
  is done and all FragmentInstanceStates are accessible

Also renamed QueryExecState to ClientRequestState.

Simplified handling of execution status (in FragmentInstanceState):
- status only transmitted via ReportExecStatus rpc
- in particular, it's not returned anymore from the Cancel rpc

FIS: Fixed bugs related to partially-prepared state (in Close() and ReleaseThreadToken())

Change-Id: I20769e420711737b6b385c744cef4851cee3facd
Reviewed-on: http://gerrit.cloudera.org:8080/6535
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-09 04:04:50 +00:00
Joe McDonnell
5bb988b1c5 IMPALA-4996: Single-threaded KuduScanNode
This introduces KuduScanNodeMt, the single-threaded version
of KuduScanNode that materializes the tuples in GetNext().
KuduScanNodeMt is enabled by the same condition as
HdfsScanNodeMt: mt_dop is greater than or equal to 1.

To share code between the two implementations, KuduScanNode
and KuduScanNodeMt are now subclasses of KuduScanNodeBase,
which implements the shared code. The KuduScanner is
minimally impacted, as it already had the required GetNext
interface.
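
The resulting class layout, roughly (declarations heavily abbreviated):

  // Scanner-independent logic (client/table handles, distributing the scan
  // ranges) lives in the shared base class.
  class KuduScanNodeBase : public ScanNode { /* shared state */ };

  // Multi-threaded version: scanner threads materialize batches into a queue.
  class KuduScanNode : public KuduScanNodeBase { /* ... */ };

  // Single-threaded version (mt_dop >= 1): materializes tuples in GetNext().
  class KuduScanNodeMt : public KuduScanNodeBase {
    Status GetNext(RuntimeState* state, RowBatch* row_batch, bool* eos) override;
  };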

Since the KuduClient is a heavy-weight object, it is now
shared at the QueryState level. We try to share the
KuduClient as much as possible, but there are times when
the KuduClient cannot be shared. Each Kudu table has
master addresses stored in the Hive Metastore. We only
share KuduClients for tables that have an identical value
for the master addresses. In the ideal case, every Kudu
table will have the same value, but there is no explicit
guarantee of this.

The testing for this is a modified version of
kudu-scan-node.test run with various mt_dop values.

Change-Id: I6e4593300e376bc508b78acaea64ffdd2c73a67a
Reviewed-on: http://gerrit.cloudera.org:8080/6312
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-17 19:33:31 +00:00
Tim Armstrong
663285244e IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool
Add a copy of BufferedTupleStream that allocates memory from BufferPool.
This will replace the original implementation when IMPALA-3200 is
completed.

The major changes are:
* Terminology is updated from "blocks" to "pages"
* No small buffer support is needed (hooray!).
* BufferedTupleStream needs to do its own tracking of the # of rows per
  page, etc. instead of relying on BufferedBlockMgr to do it. A
  wrapper around PageHandle is used.
* Profile counters (unpin, pin, get new block time) are removed -
  similar counters in the BufferPool client are more useful.
* Changed the tuple null indicators so that they are allocated
  before each tuple, rather than in a block at the start of the
  page. This slightly reduces the memory density, but greatly
  simplifies much of the logic. In particular, it avoids problems with
  RowIdx and larger pages with offsets that don't fit in 16 bits.
* Buffer management of read/write streams uses the new pin-counting
  support to separate pinning of the read and write pages.
  This means that the reservation usage of an unpinned read/write
  stream does not fluctuate and the read/write iterators can always
  advance without requiring additional reservation.
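
To illustrate the last point, a sketch of how the two pins are used
(simplified; not the exact BufferPool calls made by the stream):

  // A page can be pinned independently by the read side and the write side,
  // so an unpinned stream needs reservation for at most two pinned pages.
  RETURN_IF_ERROR(pool->Pin(client, write_page));  // writer appends rows here
  RETURN_IF_ERROR(pool->Pin(client, read_page));   // reader scans this page
  ...
  pool->Unpin(client, read_page);  // advancing the read iterator drops only the
                                   // reader's pin; the writer's pin remains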

Testing this required some further changes. TestEnv was refactored
so it can set up either BufferPool or BufferedBlockMgr. Some
BufferPool-related state is added to ExecEnv, QueryState and
RuntimeState, but is only created for backend tests that explicitly
create a BufferPool.

The following is left for future work:
* IMPALA-3808 (large row support) is not added. I've added
  TODOs in the code at the places that will require changes.
* IMPALA-4179 (remove MarkNeedsDeepCopy()) is not fixed, since
  it requires global changes to operators that accumulate memory.

Testing:
All of the BufferedTupleStream unit tests are ported to the new
implementation, except for ones specifically testing the small
buffer functionality.

Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
Reviewed-on: http://gerrit.cloudera.org:8080/5811
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-16 21:07:58 +00:00
Henry Robinson
d074f71d81 IMPALA-3909 (follow-up): Properly qualify min() and max() in header
parquet-column-stats.h was using unqualified references to std::max and
std::min, which we disallow in header files. The reason this compiled is
that some headers included "names.h". This commit flushes those out as
well, and fixes the resulting compilation errors.
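
In header code the fix is simply to qualify the calls rather than rely on the
using-declarations pulled in by names.h, e.g. (simplified example, not the
actual parquet-column-stats.h code):

  // Headers must spell the names out; names.h is only for .cc files.
  void Update(const T& v) {
    min_value_ = std::min(min_value_, v);
    max_value_ = std::max(max_value_, v);
  }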

I tried to make a static_assert method that complained if names.h was
included in a header file, but couldn't do so without needing an extra
macro, e.g.:

USE_IMPORTED_NAMES();

Which seemed too verbose for now.

Change-Id: If8381d5b1fa072e89ac1f7f54959c02bf8e43b4b
Reviewed-on: http://gerrit.cloudera.org:8080/5916
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-02-08 00:57:35 +00:00
Tim Armstrong
cb52b2b8ae IMPALA-3748: add query-wide resource acquisition step
This adds a Prepare() method to QueryState that allows
query execution to fail up-front if resources cannot be
acquired. We don't make full use of this yet, but it
provides a suitable place to acquire a memory reservation
for the query or fail cleanly.

The process memory limit check is moved here so that all
startup memory checks will eventually be consolidated in
one place.
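
A minimal sketch of the kind of up-front check this enables (the field name
and message are illustrative, not the actual code):

  Status QueryState::Prepare() {
    // Fail the whole query on this backend up-front if its memory cannot be
    // admitted, instead of failing later inside individual fragment instances.
    const int64_t bytes_needed = query_mem_to_reserve_;  // illustrative field
    if (!ExecEnv::GetInstance()->process_mem_tracker()->TryConsume(bytes_needed)) {
      return Status(Substitute(
          "Could not reserve $0 bytes for query startup", bytes_needed));
    }
    return Status::OK();
  }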

Also switch a boost::mutex to a SpinLock for consistency
and to improve performance (see lock-benchmark.cc).

Change-Id: Ia21a3c0f0b0a7175116883ef9871b93c8ce8bb81
Reviewed-on: http://gerrit.cloudera.org:8080/5739
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2017-01-24 18:49:24 +00:00
Tim Armstrong
85edc15fef IMPALA-4678: move query MemTracker into QueryState
The query MemTracker for query execution is now owned directly by
QueryState, which greatly simplifies the lifecycle of the MemTracker.
This required various other changes and enabled some simplifications.

* The coordinator's QueryState is constructed earlier before fragments
  are sent out, since we need a MemTracker at that point.
* The global query MemTracker map can be removed.
* The static request pool mem tracker is moved into ExecEnv.
* Temporary query MemTrackers used to evaluate expressions during
  planning do not need to be registered globally and are owned
  directly by the RuntimeState.
* Various cleanup logic is moved around to reflect the other changes.
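
The resulting ownership, roughly (the member shown is illustrative, not the
exact declaration):

  // MemTracker hierarchy after this change:
  //   process tracker (ExecEnv)
  //     -> request pool tracker (ExecEnv)
  //          -> query tracker (owned directly by QueryState)
  //               -> per-fragment-instance trackers (RuntimeStates)
  class QueryState {
    // Lives exactly as long as the QueryState; no global map lookup needed.
    std::unique_ptr<MemTracker> query_mem_tracker_;
  };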

Change-Id: Id6b46652932b5638993623e98d1f0d60d8380ba0
Reviewed-on: http://gerrit.cloudera.org:8080/5630
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-19 02:34:39 +00:00
Tim Armstrong
95ed4434f2 IMPALA-3202,IMPALA-2079: rework scratch file I/O
Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into
TmpFileMgr, in anticipation of it being shared with BufferPool.
TmpFileMgr now handles:
* Scratch space allocation and recycling
* Read and write I/O

The interface is also greatly changed so that it is built around Write()
and Read() calls, abstracting away the details of temporary file
allocation from clients. This means the TmpFileMgr::File class can
be hidden from clients.
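
A rough sketch of the new client-facing shape (names and signatures are
approximate):

  // The client hands over a buffer and a completion callback; picking a
  // scratch file, allocating a range and recycling it all happen inside
  // TmpFileMgr.
  TmpFileMgr::WriteHandle* handle;
  RETURN_IF_ERROR(tmp_file_mgr->Write(buffer, write_complete_callback, &handle));
  ...
  // Later, read the spilled data back using the handle returned by the write.
  RETURN_IF_ERROR(tmp_file_mgr->Read(handle, buffer));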

Write error recovery:
Also implement write error recovery in TmpFileMgr.

If an error occurs while writing to scratch and we have multiple
scratch directories, we will try one of the other directories
before cancelling the query. File-level blacklisting is used to
prevent excessive repeated attempts to resize a scratch file during
a single query. Device-level blacklisting is not implemented because
it is problematic to permanently take a scratch directory out of use.

To reduce the number of error paths, all I/O errors are now handled
asynchronously. Previously errors creating or extending the file were
returned synchronously from WriteUnpinnedBlock(). This required
modifying DiskIoMgr to create the file if not present when opened.

Also set the default max_errors value in the thrift definition file,
so that it is in effect for backend tests.

Future Work:
* Support for recycling variable-length scratch file ranges. I omitted
  this to avoid making the patch even larger.

Testing:
Updated BufferedBlockMgr unit test to reflect changes in behaviour:
* Scratch space is no longer permanently associated with a block, and
  is remapped every time a new block is written to disk.
* Files are now blacklisted - updated existing tests and enabled the
  previously disabled blacklisting test.

Added some basic testing of recycling of scratch file ranges in
the TmpFileMgr unit test.

I also manually tested the code in two ways. First, by removing permissions
for /tmp/impala-scratch and ensuring that a spilling query fails cleanly.
Second, by creating a tiny ramdisk (16M) and running with two scratch
directories: one on /tmp and one on the tiny ramdisk. When spilling, an
out-of-space error is encountered on the tiny ramdisk and Impala spills
the remaining data (72M) to /tmp.

Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17
Reviewed-on: http://gerrit.cloudera.org:8080/5141
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-05 02:26:24 +00:00
Marcel Kornacker
4b2d76dbb5 IMPALA-4014: Introduce query-wide execution state.
This introduces a global structure to coordinate execution
of fragment instances on a backend for a single query.

New classes:
- QueryExecMgr: subsumes FragmentMgr
- QueryState
- FragmentInstanceState: replaces FragmentExecState
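
Roughly how the new classes relate (the members shown are illustrative only):

  // One per impalad: creates and hands out QueryStates, keyed by query id.
  class QueryExecMgr {
    std::unordered_map<TUniqueId, QueryState*> query_state_map_;
  };

  // One per query per backend: query-wide state shared by all of its
  // fragment instances.
  class QueryState {
    std::vector<FragmentInstanceState*> fragment_instance_states_;
  };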

Change-Id: I962ae6b7cb7dc0d07fbb8f70317aeb01d88d400b
Reviewed-on: http://gerrit.cloudera.org:8080/4418
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-12-11 02:29:28 +00:00