Commit Graph

89 Commits

Author SHA1 Message Date
Sailesh Mukil
cbc8c63ef0 IMPALA-7163: Implement a state machine for the QueryState class
This patch adds a state machine for the QueryState class. The motivation
behind this patch is to make the query lifecycle from the point of
view of an executor much easier to reason about, and this patch is key
for a follow-on patch for IMPALA-2990, where the status reporting will
be per-query rather than per-fragment-instance. Currently, the state
machine serves no other purpose; it will mostly be used for
IMPALA-2990.

We introduce 5 possible states for the QueryState which include 3
terminal states (FINISHED, CANCELLED and ERROR) and 2 non-terminal
states (PREPARING, EXECUTING). The transition from one state to the
next is always handled by a single thread which is also the QueryState
thread. This thread will additionally bear the purpose of sending
periodic updates after IMPALA-4063, which is the primary reason behind
having only this thread modify the state of the query.

Counting barriers are introduced to keep a count of how many fragment
instances have finished Preparing and Executing. These barriers also
block until all the fragment instances have finished the respective state.
The fragment instances update the query-wide query status if an error is
hit and unblock the barrier if the query is in the EXECUTING state. The
PREPARING state blocks regardless of whether a fragment instance hit an
error or not, until all the fragment instances have completed
successfully or unsuccessfully, to maintain the invariant that fragment
instances cannot be cancelled until the entire QueryState has finished
PREPARING.
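
The state set and barrier mechanism described above can be pictured with
a small sketch. This is illustrative only and not the actual Impala code:
the enum values follow the names in this message, and CountingBarrier is
a hypothetical stand-in for the real synchronization class.

    // Illustrative sketch only -- not the actual QueryState implementation.
    #include <condition_variable>
    #include <mutex>

    enum class BackendExecState { PREPARING, EXECUTING, FINISHED, CANCELLED, ERROR };

    // Hypothetical counting barrier: Notify() is called once per fragment
    // instance; Wait() blocks until all instances have reported in.
    class CountingBarrier {
     public:
      explicit CountingBarrier(int count) : count_(count) {}
      void Notify() {
        std::lock_guard<std::mutex> l(lock_);
        if (--count_ == 0) done_.notify_all();
      }
      void Wait() {
        std::unique_lock<std::mutex> l(lock_);
        done_.wait(l, [this]() { return count_ == 0; });
      }
     private:
      std::mutex lock_;
      std::condition_variable done_;
      int count_;
    };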

The status reporting protocol has not been changed and remains exactly
as it was.

Testing:
- Added 3 failure points in the query lifecycle using debug actions
  and added tests to validate the same (extension of IMPALA-7376).
- Ran 'core' and 'exhaustive' tests.

Future related work:
1) IMPALA-2990: Make status reporting per-query.
2) Try to logically align the FIS states with the QueryState states.
3) Consider mirroring the QueryState state machine to
CoordinatorBackendState

Change-Id: Iec5670a7db83ecae4656d7bb2ea372d3767ba7fe
Reviewed-on: http://gerrit.cloudera.org:8080/10813
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-08 00:16:18 +00:00
Tim Armstrong
68736bc378 IMPALA-7297: soft memory limit for reservation increases
https://goo.gl/N9LgQt summarises the memory problems I'm trying to
solve here.

Reject reservation increases that would leave < 10% of the memory
available under a query or process mem_limit, which would put the query
or impalad into a state with a heightened chance of OOM. This
avoids OOM in some scenarios where reserved memory is increased
too aggressively and starves non-reserved memory.

One way of thinking about this is that it extends the 80%
limit heuristic for reserved memory so that there's a soft
limit of min(80%, 90% - non-reserved memory consumption).
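
As a rough illustration of the heuristic above, the check might look like
the sketch below. The 0.9 factor and the function name are assumptions for
illustration; this is not the actual MemTracker/ReservationTracker code.

    #include <cstdint>

    // Sketch: reject a reservation increase that would leave less than 10%
    // of the query or process mem_limit available (i.e. a 90% soft limit).
    bool IncreaseExceedsSoftLimit(int64_t current_consumption, int64_t increase,
                                  int64_t mem_limit) {
      const int64_t soft_limit = static_cast<int64_t>(mem_limit * 0.9);
      return current_consumption + increase > soft_limit;
    }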

A follow-on will use this for scanner threads to prevent
starting scanner threads that may exceed the soft limit.

Testing:
Added unit tests for MemTracker and for ReservationTracker.

Change-Id: I39dcf2dd870a59c8b8cda4488fe41ce936d715ac
Reviewed-on: http://gerrit.cloudera.org:8080/10988
Reviewed-by: Todd Lipcon <todd@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-02 01:46:19 +00:00
Bikramjeet Vig
2de9db8fc6 IMPALA-5216: Make admission control queuing async
Implement asynchronous admission control queuing. This is achieved by
running the admission control code-path in a separate thread. Major
changes include: propagating cancellation to the admission control
thread and dequeuing thread, and adding a new Query Operation State
called "PENDING" that represents the state between completion of
planning and starting of query execution.
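
A minimal sketch of the idea of moving admission onto its own thread is
shown below; the names (QueryOpState, AdmitAsync) are hypothetical and
only mirror the description above, not the actual scheduler code.

    #include <functional>
    #include <thread>

    // Hypothetical query operation states, including the new PENDING state
    // covering the window between planning and the start of execution.
    enum class QueryOpState { PLANNING, PENDING, RUNNING, FINISHED, CANCELLED };

    // Sketch: run the admission decision asynchronously so the submitting
    // thread is not blocked while the query waits in the admission queue.
    void AdmitAsync(std::function<void()> admit_and_start) {
      std::thread(std::move(admit_and_start)).detach();
    }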

Testing:
- Added a deterministic end to end test and a session expiry test.
- Ran multiple stress tests successfully with a cancellation probability
of 60% and with different values for the following parameters:
max_requests, queue_wait_timeout_ms. Ensured that the impalad was in a
valid state afterwards (no orphan fragments or wrong metrics).
- Ran all exhaustive tests and ASAN core tests successfully.
- Ran data load successfully.

Change-Id: I989cf5b259afb8f5bc5c35590c94961c81ce88bf
Reviewed-on: http://gerrit.cloudera.org:8080/10060
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-13 15:48:17 +00:00
Tim Armstrong
e12ee485cf IMPALA-6957: calc thread resource requirement in planner
This only factors in fragment execution threads. E.g. this does *not*
try to account for the number of threads on the old Thrift RPC
code path if that is enabled.

This is loosely related to the old VCores estimate, but is different in
that it:
* Directly ties into the notion of required threads in
  ThreadResourceMgr.
* Is a strict upper bound on the number of such threads, rather than
  an estimate.

Does not include "optional" threads. ThreadResourceMgr in the backend
bounds the number of "optional" threads per impalad, so the number of
execution threads on a backend is limited by

  sum(required threads per query) +
      CpuInfo::num_cores() * FLAGS_num_threads_per_core

DCHECKs in the backend enforce that the calculation is correct. They
were actually hit in KuduScanNode because of some races in thread
management leading to multiple "required" threads running. Now the
first thread in the multithreaded scans never exits, which means
that it's always safe for any of the other threads to exit early,
which simplifies the logic a lot.

Testing:
Updated planner tests.

Ran core tests.

Change-Id: I982837ef883457fa4d2adc3bdbdc727353469140
Reviewed-on: http://gerrit.cloudera.org:8080/10256
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-12 01:43:37 +00:00
Tianyi Wang
8e86678d65 IMPALA-5690: Part 1: Rename ostream operators for thrift types
Thrift 0.9.3 implements "ostream& operator<<(ostream&, T)" for Thrift
data types, while Impala did the same for enums and special types
including TNetworkAddress and TUniqueId. To prepare for the upgrade to
Thrift 0.9.3, this patch renames these Impala-defined functions. In the
absence of operator<<, assertion macros like DCHECK_EQ can no longer be
used on non-enum Thrift-defined types.
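
For illustration, the kind of rename described above might look like the
sketch below; the PrintId name and the struct fields are assumptions for
the sketch, not the actual Impala or Thrift definitions.

    #include <cstdint>
    #include <sstream>
    #include <string>

    // Hypothetical stand-in for the generated Thrift TUniqueId type.
    struct TUniqueId {
      int64_t hi = 0;
      int64_t lo = 0;
    };

    // Instead of defining operator<<(ostream&, const TUniqueId&), which would
    // clash with Thrift 0.9.3's own operators, expose a named print function.
    std::string PrintId(const TUniqueId& id) {
      std::ostringstream ss;
      ss << std::hex << id.hi << ":" << id.lo;
      return ss.str();
    }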

Change-Id: I9c303997411237e988ef960157f781776f6fcb60
Reviewed-on: http://gerrit.cloudera.org:8080/9168
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-20 10:28:12 +00:00
Sailesh Mukil
4d6b07f0e2 IMPALA-6792: Fail status reporting if coordinator refuses connections
The ReportExecStatusAux() function is run on a dedicated thread per
fragment instance. This thread will run until the fragment instance
completes executing.

On every attempt to send a report to the coordinator, it will attempt
to send up to 3 RPCs. If all 3 of them fail, then the fragment instance
will cancel itself.

However, there is one case where a failure to send the RPC will not
be considered a failed RPC: if, when we attempt to obtain a
connection, we end up creating a new connection
(via ClientCache::CreateClient()) instead of getting a previously
cached connection, and this new connection fails to even Open(),
it will not be counted as an RPC failure.

This patch counts such an error as a failed RPC too.
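
The retry accounting described above can be sketched roughly as follows;
the names and loop structure are illustrative and not the actual
ReportExecStatusAux() code.

    #include <functional>
    #include <string>

    struct Status {  // hypothetical status type for the sketch
      bool ok = true;
      std::string msg;
    };

    // Sketch: a failure to even Open() a newly created connection now counts
    // against the same retry budget as a failed RPC; once all attempts fail,
    // the fragment instance cancels itself.
    Status TryReportWithRetries(const std::function<Status()>& open_and_send,
                                int max_attempts) {
      Status status;
      for (int attempt = 0; attempt < max_attempts; ++attempt) {
        status = open_and_send();  // connection setup failures count as failed RPCs
        if (status.ok) return status;
      }
      return Status{false, "Report RPC failed after retries; cancelling instance"};
    }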

This patch also clarifies some of the error log messages and introduces
a flag to control the sleep interval between failed ReportExecStatus RPC
retries.

Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Reviewed-on: http://gerrit.cloudera.org:8080/9916
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-05 02:17:10 +00:00
Dan Hecht
408ee4d68c IMPALA-5384, part 1: introduce DmlExecState
This change is based on a patch by Marcel Kornacker.

Move data structures that collect DML operation stats from the
RuntimeState and Coordinator into a new DmlExecState class, which
has its own lock. This removes a dependency on the coordinator's
lock, which will allow further coordinator locking cleanup in the next
patch.

Change-Id: Id4c025917620a7bff2acbeb46464f107ab4b7565
Reviewed-on: http://gerrit.cloudera.org:8080/9793
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-29 06:29:16 +00:00
Michael Ho
077f94a82b IMPALA-6609: Fix ownership of class members in KrpcDataStreamRecvr
A KrpcDataStreamRecvr is co-owned by the singleton KrpcDataStreamMgr
and an exchange node. It's possible that some threads (e.g. RPC service
threads) may still retain reference to the KrpcDataStreamRecvr after its
owning exchange node has been closed. This is problematic as some members
in the receiver (e.g. sender/receiver profiles) are actually owned by the
exchange node so accessing them after the exchange node is closed and
possibly deleted may lead to use-after-free.

This patch changes the ownership of some members in KrpcDataStreamRecvr
to the receiver. In particular, the profiles are now owned by the receiver
and various stat counters and timers are in turn owned by these profiles.
This prevents the use-after-free problem described above. This patch also
moves the access to deferred_rpc_tracker_ in TakeOverEarlySenders() to be
under the sender queue's 'lock_' to prevent another instance of IMPALA-6554.

Testing done: core debug build.

Change-Id: I3378496e2201b16c662b9daa634333480705c61a
Reviewed-on: http://gerrit.cloudera.org:8080/9527
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-12 22:31:01 +00:00
Michael Ho
61ca7be6ec IMPALA-2990: Add a warning message during cancellation
Until IMPALA-2990 is fixed, there is no easy way to
prevent query hangs due to cancellation of fragment
instances in a backend node when it fails to report
status to the coordinator. Given it's rather hard to
diagnose IMPALA-2990, this change adds a log statement
at the point of cancellation to warn about IMPALA-2990.
This hopefully makes diagnostics slightly easier.

Change-Id: Icfd56edfe74707592f4dd9c840550b13cb80e9a0
Reviewed-on: http://gerrit.cloudera.org:8080/9413
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-27 11:43:10 +00:00
Lars Volker
be0f4f9d21 IMPALA-6190/6246: Add instances tab and event sequence
This change adds tracking of the current state during the execution of a
fragment instance. The current state is then reported back to the
coordinator and exposed to users via a new tab in the query detail debug
webpage.

This change also adds an event timeline to fragment instances in the
query profile. The timeline measures the time since backend-local query
start at which particular events complete. Events are derived from the
current state of the execution of a fragment instance. For example:

    - Prepare Finished: 13.436ms (13.436ms)
    - First Batch Produced: 1s022ms (1s008ms)
    - First Batch Sent: 1s022ms (455.558us)
    - ExecInternal Finished: 2s783ms (1s760ms)
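
A rough sketch of how such events could be recorded relative to
backend-local query start is shown below; the class and method names are
illustrative only.

    #include <chrono>
    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch: record (label, elapsed-since-query-start) pairs as the fragment
    // instance moves through its execution states.
    class EventSequence {
     public:
      EventSequence() : start_(std::chrono::steady_clock::now()) {}
      void MarkEvent(std::string label) {
        auto elapsed = std::chrono::steady_clock::now() - start_;
        events_.emplace_back(std::move(label),
            std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count());
      }
     private:
      std::chrono::steady_clock::time_point start_;
      std::vector<std::pair<std::string, int64_t>> events_;  // label, ms since start
    };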

I added automated tests for both extensions and additionally verified
the change by manual inspection.

Here are the TPCH performance comparison results between this change and
the previous commit on a 16 node cluster.

+------------+-----------------------+---------+------------+------------+----------------+
| Workload   | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+------------+-----------------------+---------+------------+------------+----------------+
| TPCH(_300) | parquet / none / none | 18.47   | -0.94%     | 9.72       | -1.08%         |
+------------+-----------------------+---------+------------+------------+----------------+

+------------+----------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| Workload   | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |
+------------+----------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| TPCH(_300) | TPCH-Q5  | parquet / none / none | 48.88  | 46.93       |   +4.15%   |   0.14%    |   3.61%        | 1           | 3     |
| TPCH(_300) | TPCH-Q13 | parquet / none / none | 21.64  | 21.15       |   +2.29%   |   2.06%    |   1.84%        | 1           | 3     |
| TPCH(_300) | TPCH-Q11 | parquet / none / none | 1.71   | 1.70        |   +1.12%   |   0.54%    |   2.51%        | 1           | 3     |
| TPCH(_300) | TPCH-Q18 | parquet / none / none | 33.15  | 32.79       |   +1.09%   |   0.13%    |   2.03%        | 1           | 3     |
| TPCH(_300) | TPCH-Q14 | parquet / none / none | 5.95   | 5.90        |   +0.82%   |   2.19%    |   0.49%        | 1           | 3     |
| TPCH(_300) | TPCH-Q1  | parquet / none / none | 13.99  | 13.90       |   +0.63%   |   0.25%    |   1.39%        | 1           | 3     |
| TPCH(_300) | TPCH-Q2  | parquet / none / none | 3.44   | 3.44        |   +0.00%   | * 20.29% * | * 20.76% *     | 1           | 3     |
| TPCH(_300) | TPCH-Q6  | parquet / none / none | 1.21   | 1.22        |   -0.01%   |   0.06%    |   0.06%        | 1           | 3     |
| TPCH(_300) | TPCH-Q20 | parquet / none / none | 3.51   | 3.51        |   -0.11%   |   7.15%    |   7.30%        | 1           | 3     |
| TPCH(_300) | TPCH-Q16 | parquet / none / none | 6.89   | 6.91        |   -0.21%   |   0.65%    |   0.55%        | 1           | 3     |
| TPCH(_300) | TPCH-Q4  | parquet / none / none | 4.78   | 4.80        |   -0.38%   |   0.06%    |   0.59%        | 1           | 3     |
| TPCH(_300) | TPCH-Q19 | parquet / none / none | 30.78  | 31.04       |   -0.83%   |   0.45%    |   1.03%        | 1           | 3     |
| TPCH(_300) | TPCH-Q22 | parquet / none / none | 6.06   | 6.12        |   -1.02%   |   1.51%    |   2.12%        | 1           | 3     |
| TPCH(_300) | TPCH-Q10 | parquet / none / none | 9.43   | 9.58        |   -1.54%   |   0.69%    |   3.30%        | 1           | 3     |
| TPCH(_300) | TPCH-Q21 | parquet / none / none | 93.41  | 95.18       |   -1.86%   |   0.08%    |   0.81%        | 1           | 3     |
| TPCH(_300) | TPCH-Q15 | parquet / none / none | 3.40   | 3.47        |   -1.99%   |   0.72%    |   1.27%        | 1           | 3     |
| TPCH(_300) | TPCH-Q7  | parquet / none / none | 44.98  | 46.24       |   -2.71%   |   1.83%    |   1.27%        | 1           | 3     |
| TPCH(_300) | TPCH-Q3  | parquet / none / none | 28.06  | 29.11       |   -3.61%   |   1.62%    |   1.23%        | 1           | 3     |
| TPCH(_300) | TPCH-Q12 | parquet / none / none | 3.15   | 3.28        |   -3.80%   |   0.96%    |   1.32%        | 1           | 3     |
| TPCH(_300) | TPCH-Q9  | parquet / none / none | 29.47  | 30.80       |   -4.30%   |   0.29%    |   0.34%        | 1           | 3     |
| TPCH(_300) | TPCH-Q17 | parquet / none / none | 4.37   | 4.62        |   -5.33%   |   0.63%    |   0.54%        | 1           | 3     |
| TPCH(_300) | TPCH-Q8  | parquet / none / none | 7.99   | 8.46        |   -5.53%   |   7.95%    |   1.11%        | 1           | 3     |
+------------+----------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+

Here are the TPCDS performance comparison results between this change
and the previous commit on a 16 node cluster. I inspected the Q2 results
and concluded that the variability is unrelated to this change.

+--------------+-----------------------+---------+------------+------------+----------------+
| Workload     | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+--------------+-----------------------+---------+------------+------------+----------------+
| TPCDS(_1000) | parquet / none / none | 13.07   | +0.51%     | 4.27       | +1.83%         |
+--------------+-----------------------+---------+------------+------------+----------------+

+--------------+------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| Workload     | Query      | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |
+--------------+------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| TPCDS(_1000) | TPCDS-Q2   | parquet / none / none | 8.36   | 4.25        | R +96.81%  | * 48.88% * |   0.42%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q8   | parquet / none / none | 1.59   | 1.35        |   +17.86%  | * 13.91% * |   4.01%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q73  | parquet / none / none | 1.81   | 1.71        |   +5.92%   |   5.53%    |   0.15%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q28  | parquet / none / none | 7.26   | 6.95        |   +4.47%   |   1.09%    |   1.11%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q46  | parquet / none / none | 2.36   | 2.30        |   +2.62%   |   1.45%    |   0.40%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q7   | parquet / none / none | 2.78   | 2.73        |   +1.98%   |   1.21%    |   2.23%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q55  | parquet / none / none | 1.05   | 1.03        |   +1.91%   |   1.16%    |   2.20%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q42  | parquet / none / none | 1.05   | 1.04        |   +1.71%   |   0.90%    |   2.63%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q19  | parquet / none / none | 1.67   | 1.65        |   +1.55%   |   1.12%    |   1.96%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q23  | parquet / none / none | 151.75 | 149.94      |   +1.20%   |   3.23%    |   1.83%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q64  | parquet / none / none | 40.25  | 39.79       |   +1.16%   |   0.43%    |   0.28%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q96  | parquet / none / none | 2.25   | 2.22        |   +1.05%   |   1.00%    |   0.11%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q53  | parquet / none / none | 1.60   | 1.58        |   +1.01%   |   1.28%    |   0.04%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q79  | parquet / none / none | 4.17   | 4.13        |   +0.94%   |   0.89%    |   0.06%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q59  | parquet / none / none | 5.74   | 5.71        |   +0.60%   |   1.22%    |   2.56%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q52  | parquet / none / none | 0.89   | 0.89        |   +0.14%   |   0.03%    |   0.63%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q88  | parquet / none / none | 7.10   | 7.12        |   -0.23%   |   0.43%    |   0.47%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q3   | parquet / none / none | 1.10   | 1.11        |   -0.40%   |   0.58%    |   0.36%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q98  | parquet / none / none | 2.30   | 2.31        |   -0.49%   |   3.58%    |   1.04%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q61  | parquet / none / none | 1.87   | 1.89        |   -1.08%   |   1.68%    |   0.14%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q27a | parquet / none / none | 2.93   | 2.96        |   -1.18%   |   1.74%    |   1.54%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q34  | parquet / none / none | 2.23   | 2.27        |   -1.73%   |   1.91%    |   1.32%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q63  | parquet / none / none | 1.56   | 1.60        |   -1.96%   |   1.91%    |   3.33%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q89  | parquet / none / none | 2.64   | 2.70        |   -2.20%   |   1.93%    |   1.88%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q47  | parquet / none / none | 30.41  | 31.17       |   -2.41%   |   1.09%    |   1.52%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q1   | parquet / none / none | 3.77   | 3.86        |   -2.46%   |   1.91%    |   0.61%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q6   | parquet / none / none | 61.67  | 63.34       |   -2.65%   |   3.77%    |   0.31%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q4   | parquet / none / none | 31.11  | 31.96       |   -2.66%   |   0.61%    |   0.77%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q43  | parquet / none / none | 4.10   | 4.22        |   -2.87%   |   1.40%    |   2.85%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q5   | parquet / none / none | 8.30   | 8.56        |   -3.13%   |   1.55%    |   0.47%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q27  | parquet / none / none | 2.28   | 2.35        |   -3.13%   |   1.17%    |   1.56%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q65  | parquet / none / none | 31.74  | 32.77       |   -3.15%   |   1.47%    |   1.11%        | 1           | 3     |
| TPCDS(_1000) | TPCDS-Q68  | parquet / none / none | 1.56   | 1.62        |   -3.58%   |   9.37%    | * 11.93% *     | 1           | 3     |
+--------------+------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+

(R) Regression: TPCDS(_1000) TPCDS-Q2 [parquet / none / none] (4.25s -> 8.36s [+96.81%])
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+---------+-----------+
| Operator            | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%)  | Max      | Base Max | Delta(Max) | #Hosts | #Rows   | Est #Rows |
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+---------+-----------+
| 27:MERGING-EXCHANGE | 22.48%     | 6.97s    | 2.85s    | +144.40%   | * 58.44% * | 11.05s   | 2.86s    | +286.33%   | 1      | 2.51K   | 2.56K     |
| 26:EXCHANGE         | 7.65%      | 2.37s    | 2.43s    | -2.16%     |   1.82%    | 2.46s    | 2.50s    | -1.65%     | 14     | 365     | 2.56K     |
| 23:EXCHANGE         | 8.58%      | 2.66s    | 2.70s    | -1.46%     |   1.67%    | 2.74s    | 2.78s    | -1.47%     | 14     | 516     | 10.64K    |
| 13:AGGREGATE        | 4.21%      | 1.31s    | 1.30s    | +0.65%     |   0.06%    | 1.47s    | 1.43s    | +2.38%     | 14     | 516     | 10.64K    |
| 12:HASH JOIN        | 2.89%      | 896.20ms | 885.79ms | +1.17%     |   1.43%    | 1.06s    | 1.01s    | +4.77%     | 14     | 433.27M | 2.16B     |
| 06:SCAN HDFS        | 2.83%      | 877.34ms | 886.93ms | -1.08%     |   1.23%    | 888.16ms | 906.88ms | -2.06%     | 1      | 365     | 373       |
| 19:EXCHANGE         | 23.20%     | 7.20s    | 3.12s    | +130.58%   | * 56.73% * | 11.33s   | 3.17s    | +256.92%   | 14     | 520     | 10.64K    |
| 05:AGGREGATE        | 12.06%     | 3.74s    | 1.34s    | +178.49%   | * 64.53% * | 6.33s    | 1.53s    | +314.84%   | 14     | 520     | 10.64K    |
| 04:HASH JOIN        | 7.71%      | 2.39s    | 956.81ms | +149.90%   | * 60.36% * | 4.04s    | 1.13s    | +256.75%   | 14     | 442.29M | 2.16B     |
| 03:SCAN HDFS        | 2.83%      | 878.97ms | 894.11ms | -1.69%     |   1.34%    | 890.78ms | 910.22ms | -2.14%     | 1      | 371     | 73.05K    |
+---------------------+------------+----------+----------+------------+------------+----------+----------+------------+--------+---------+-----------+

Change-Id: I626456b6afa9101eeeeffd5cda10c4096d63d7f9
Reviewed-on: http://gerrit.cloudera.org:8080/8758
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-24 02:41:31 +00:00
Michael Ho
e714f2b33c IMPALA-2397: Use atomics for IntGauge and IntCounter
This change removes the spinlock in IntGauge and IntCounter
and uses AtomicInt64 instead. As shown in IMPALA-2397, multiple
threads can be contending for the spinlocks of some global metrics
under concurrent queries.

This change also restructures SimpleMetric: it is renamed to ScalarMetric
and broken into two subclasses:
- LockedMetric:
  - a value store for any primitive type (int, float, string, etc.).
  - atomic read and write via GetValue() and SetValue() respectively.

- AtomicMetric:
  - the basis of IntGauge and IntCounter. Supports atomic increment
    of the metric value via the Increment() interface.
  - atomic read and write via GetValue() and SetValue() respectively.
  - only supports the int64_t type.
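
A minimal sketch of the AtomicMetric idea is shown below, using
std::atomic rather than Impala's AtomicInt64 wrapper; names and
interfaces are simplified for illustration.

    #include <atomic>
    #include <cstdint>

    // Sketch: an int64-only metric whose reads, writes and increments are
    // lock-free, replacing the previous spinlock-protected value.
    class AtomicMetric {
     public:
      explicit AtomicMetric(int64_t initial = 0) : value_(initial) {}
      int64_t GetValue() const { return value_.load(std::memory_order_relaxed); }
      void SetValue(int64_t v) { value_.store(v, std::memory_order_relaxed); }
      int64_t Increment(int64_t delta) {
        return value_.fetch_add(delta, std::memory_order_relaxed) + delta;
      }
     private:
      std::atomic<int64_t> value_;
    };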

Change-Id: I48dfa5443cd771916b53541a0ffeaf1bcc7e7606
Reviewed-on: http://gerrit.cloudera.org:8080/9012
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-18 23:31:52 +00:00
Zoltan Borok-Nagy
8047b1dcb4 IMPALA-3703: Store query context in thread-local variables
This commit introduces the ThreadDebugInfo class which can
hold information about the current thread that can be useful
in debug sessions. It needs to be allocated on the stack of
each thread in order to be included in minidumps.

Currently a ThreadDebugInfo object is created in
Thread::SuperviseThread. This object is available in all
the child stack frames through the global function
GetThreadDebugInfo().

ThreadDebugInfo has members for the thread name and instance
id. These are fixed-size char buffers.
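
The description above suggests a layout roughly like the sketch below;
the buffer sizes and helper function are assumptions for illustration.

    #include <cstring>

    // Sketch: fixed-size char buffers so the data lands in the thread's stack
    // frame and is captured by minidumps; a thread-local pointer lets any
    // frame below Thread::SuperviseThread() find it.
    struct ThreadDebugInfo {
      char thread_name[256] = {};
      char instance_id[64] = {};
    };

    thread_local ThreadDebugInfo* thread_debug_info = nullptr;

    void SetThreadName(ThreadDebugInfo* info, const char* name) {
      strncpy(info->thread_name, name, sizeof(info->thread_name) - 1);
    }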

If you have a core dump written by Impala, you can locate the
ThreadDebugInfo for the current thread through the global
pointer impala::thread_debug_info.

In a core file that has been created from a minidump, we need
to select the stack frame that allocated the ThreadDebugInfo
object in order to inspect it. It is currently allocated in
Thread::SuperviseThread().

We can use printf in gdb to print the members, e.g.:
printf "%s\n" thread_debug_info.instance_id

Currently the thread name and instance id are stored.

I created some tests in thread-debug-info-test.cc

Change-Id: I566f7f1db5117c498e86e0bd05b33bdcff624609
Reviewed-on: http://gerrit.cloudera.org:8080/8621
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-18 15:32:59 +00:00
Thomas Tauber-Marshall
f3fa3e017f IMPALA-6081: Fix test_basic_filters runtime profile failure
test_basic_filters has been occasionally failing due to a line missing
from a runtime profile for a particular query.

The problem is that the query returns all of its results before all of
its fragment instances are finished executing (due to a limit). Then,
when one fragment instance reports its status, the coordinator returns
to it a 'cancelled' status, causing all remaining instances for that
backend to be cancelled.

Sometimes this cancellation happens quickly enough that the relevant
fragment instances have not yet sent a status report when they are
cancelled. They will still send a report in finalize, but as the
coordinator only updates its runtime profile for 'ok' status reports,
not 'cancelled', the final runtime profile doesn't end up with any
data for those fragment instances, which means the test does not find
the line in the runtime profile it's checking for.

The fix is to have the coordinator update its runtime profile with
every status report it receives, regardless of error status.

Testing:
- Ran existing runtime profile tests, which rely on profile output,
  in a loop.
- Manually tested some scenarios with failed queries and checked that
  the new profile output is reasonable.
- Added a new e2e test that runs the affected query and checks for the
  presence of info for all expected exec nodes in the profile. This
  repros the underlying issue consistently.

Change-Id: I4f581c7c8039f02a33712515c5bffab942309bba
Reviewed-on: http://gerrit.cloudera.org:8080/8754
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-07 21:07:02 +00:00
Tim Armstrong
7487c5de04 IMPALA-1575: part 2: yield admission control resources
This change releases admission control resources more eagerly,
once the query has finished actively executing. Some resources
(tracked and untracked) are still consumed by the client request
as long as it remains open, e.g. memory for control structures
and the result cache. However, these resources are relatively
small and should not block admission of new queries.

The same as in part 1, query execution is considered to be finished
under any of the following conditions:
1. The query encounters an error and fails
2. The query is cancelled due to the idle query timeout
3. The query reaches eos (or the DML completes)
4. The client cancels the query without closing the query

Admission control resources are released in two ways:
1. by calling AdmissionController::ReleaseQuery() on the coordinator
   promptly after query execution finishes, instead of waiting for
   UnregisterQuery(). This means that the query and its memory is
   no longer considered "admitted".
2. by changing the behaviour of MemTracker::GetPoolMemReserved() so
   that it is aware of when a query has finished executing and does not
   consider its entire memory limit to be "reserved".

The preconditions for releasing an admitted query are subtle because the
queries are being admitted to a distributed system, not just the
coordinator.  The comment for ReleaseAdmissionControlResources()
documents the preconditions and rationale. Note that the preconditions
are not weaker than the preconditions of calling UnregisterQuery()
before this patch.

Testing:
TestAdmissionController is extended to end queries in four ways:
cancellation by client, idle timeout, the last row being fetched,
and the client closing the query. The test uses a mix of all four.
After the query ends, all clients wait for the test to complete
before closing the query or closing the connection. This ensures
that the admission control decisions are based entirely on the
query end behavior. This test works for both query admission control
and mem_limit admission control and can detect both kinds of admission
control resources ("admitted" and "reserved") not being released
promptly.

I ran into a problem similar to IMPALA-3772 with the admission control
tests becoming flaky due to query timeouts on release builds, which I
solved in a similar way by increasing the frequency of statestore
updates.

This is based on an earlier patch by Joe McDonnell.

Change-Id: Ib1fae8dc1c4b0eca7bfa8fadae4a56ef2b37947a
Reviewed-on: http://gerrit.cloudera.org:8080/8581
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-20 04:34:47 +00:00
Thomas Tauber-Marshall
2510fe0aa0 IMPALA-4252: Min-max runtime filters for Kudu
This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on whether it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Values are inserted into min-max filters by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.
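
A rough sketch of a string min-max filter with the 1024-character
truncation described above; this is illustrative and not the actual
Impala MinMaxFilter interface.

    #include <string>

    // Sketch: track the min and max string seen on the build side.
    class StringMinMaxFilter {
     public:
      void Insert(const std::string& v) {
        // Truncating the min bound to a 1024-char prefix is safe because a
        // prefix always compares <= the original value. A real implementation
        // would also truncate the max bound, rounding it up so it remains an
        // upper bound; this sketch keeps the max untruncated for simplicity.
        std::string lower = v.substr(0, 1024);
        if (!has_values_ || lower < min_) min_ = lower;
        if (!has_values_ || v > max_) max_ = v;
        has_values_ = true;
      }
      bool MayContain(const std::string& v) const {
        return !has_values_ || (v >= min_ && v <= max_);
      }
     private:
      bool has_values_ = false;
      std::string min_, max_;
    };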

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that eliminates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Reviewed-on: http://gerrit.cloudera.org:8080/7793
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-17 21:33:51 +00:00
Tim Armstrong
a772f84562 IMPALA-6171: Revert "IMPALA-1575: part 2: yield admission control resources"
This reverts commit fe90867d89.

Change-Id: I3eec4b5a6ff350933ffda0bb80949c5960ecdf25
Reviewed-on: http://gerrit.cloudera.org:8080/8499
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-08 22:03:59 +00:00
Tim Armstrong
fe90867d89 IMPALA-1575: part 2: yield admission control resources
This change releases admission control resources more eagerly,
once the query has finished actively executing. Some resources
(tracked and untracked) are still consumed by the client request
as long as it remains open, e.g. memory for control structures
and the result cache. However, these resources are relatively
small and should not block admission of new queries.

The same as in part 1, query execution is considered to be finished
under any of the following conditions:
1. The query encounters an error and fails
2. The query is cancelled due to the idle query timeout
3. The query reaches eos (or the DML completes)
4. The client cancels the query without closing the query

Admission control resources are released in two ways:
1. by calling AdmissionController::ReleaseQuery() on the coordinator
   promptly after query execution finishes, instead of waiting for
   UnregisterQuery(). This means that the query and its memory is
   no longer considered "admitted".
2. by changing the behaviour of MemTracker::GetPoolMemReserved() so
   that it is aware of when a query has finished executing and does not
   consider its entire memory limit to be "reserved".

The preconditions for releasing an admitted query are subtle because the
queries are being admitted to a distributed system, not just the
coordinator.  The comment for ReleaseAdmissionControlResources()
documents the preconditions and rationale. Note that the preconditions
are not weaker than the preconditions of calling UnregisterQuery()
before this patch.

Testing:
TestAdmissionController is extended to end queries in four ways:
cancellation by client, idle timeout, the last row being fetched,
and the client closing the query. The test uses a mix of all four.
After the query ends, all clients wait for the test to complete
before closing the query or closing the connection. This ensures
that the admission control decisions are based entirely on the
query end behavior. This test works for both query admission control
and mem_limit admission control and can detect both kinds of admission
control resources ("admitted" and "reserved") not being released
promptly.

This is based on an earlier patch by Joe McDonnell.

Change-Id: I80279eb2bda740d7f61420f52db3bfa42a6a51ac
Reviewed-on: http://gerrit.cloudera.org:8080/8323
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-07 05:16:11 +00:00
Tim Armstrong
9174af38c8 IMPALA-1575: Part 1: eagerly release query exec resources
Release of backend resources for query execution on the coordinator
daemon should occur eagerly as soon as query execution is finished
or cancelled. Before this patch some resources managed by QueryState,
like scratch files, were only released when the query was closed
and all of the query's control structures were torn down.

These resources are referred to as "ExecResources" in various places to
distinguish them from resources associated with the client request (like
the result cache) that are still required after the query finishes
executing.

This first change does not solve the admission control problem for two
reasons:
* We don't release the "admitted" memory on the coordinator until
  the query is unregistered.
* Admission control still considers the memory reserved until the
  query memtracker is unregistered, which happens only when the
  QueryState is destroyed: see MemTracker::GetPoolMemReserved().

The flow is mostly similar to initial_reservation_refcnt_, except the
coordinator also holds onto a reference count, which is released
when either the final row is returned or cancellation is initiated.
After the coordinator releases its refcount, the resources can be
freed as soon as local fragments also release their refcounts.

Also clean up Coordinator slightly by preventing runtime_state() from
leaking out to ClientRequestState - instead it's possible to log
the query MemTracker by following parent links in the MemTracker.

This patch is partially based on Joe McDonnell's IMPALA-1575 patch.

Testing:
Ran core tests.

Change-Id: I41ff374b0403f10a145f7fee9b3145953ee32341
Reviewed-on: http://gerrit.cloudera.org:8080/8303
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-28 02:45:35 +00:00
Tim Armstrong
7866eec5bd IMPALA-5895: clean up runtime profile lifecycle
Require callers to explicitly stop counter updating instead of doing it
in the destructor. This replaces ad-hoc logic to stop individual
counters.

Track which counters need to be stopped in separate lists instead of
stopping everything.

Force all RuntimeProfiles to use ObjectPools in a uniform way - the
profile, its counters and its children all are allocated from the
same pool. This is done via a new Create() method.

Consolidate 'time_series_counter_map_lock_' and 'counter_map_lock_'
to reduce complexity of the locking scheme. I didn't see any benefit
to sharding the locks in this way - there are only two time series
counters per fragment instance, which is a small fraction of the
total number of counters.

Change-Id: I45c39ac36c8e3c277213d32f5ae5f14be6b7f0df
Reviewed-on: http://gerrit.cloudera.org:8080/8069
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-20 08:48:38 +00:00
Joe McDonnell
e993b9712c IMPALA-5750: Catch exceptions from boost thread creation
The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.

This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method.
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.
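
The conversion from an exception to a Status described in point 1 above
might look roughly like this sketch; Thread and Status here are
simplified stand-ins, not the real util/thread.h types.

    #include <functional>
    #include <memory>
    #include <string>
    #include <boost/thread/thread.hpp>

    struct Status {  // hypothetical status type for the sketch
      bool ok = true;
      std::string msg;
      static Status OK() { return Status(); }
    };

    // Sketch: wrap boost::thread construction so a thread_resource_error
    // (e.g. a ulimit being hit) becomes a Status instead of crashing Impala.
    Status CreateThread(std::function<void()> body,
                        std::unique_ptr<boost::thread>* thread) {
      try {
        thread->reset(new boost::thread(std::move(body)));
      } catch (const boost::thread_resource_error& e) {
        return Status{false, std::string("Could not create thread: ") + e.what()};
      }
      return Status::OK();
    }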

Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.

Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.

Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Reviewed-on: http://gerrit.cloudera.org:8080/7730
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-07 03:25:30 +00:00
Joe McDonnell
91f7bc1947 IMPALA-5892: Allow reporting status independent of fragment instance
Queries can hit an error that is not specific to a
particular fragment instance. For example, QueryState::StartFInstances()
calls DescriptorTbl::Create() before any fragment instances
start. This location has no reason to report status via a
particular fragment, and there is currently no way to report
status otherwise. This leads to a query hang, because the
status is never propagated back to the coordinator.

This adds the ability to report status that is not associated
with a particular fragment instance. By reporting status,
the coordinator will now correctly abort the query in the
case of the QueryState::StartFInstances() scenario described.

Change-Id: I4cd98022f1d62a999c7c80ff5474fa8d069eb12c
Reviewed-on: http://gerrit.cloudera.org:8080/7943
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-06 21:26:24 +00:00
Matthew Jacobs
7264c54751 IMPALA-5644,IMPALA-5810: Min reservation improvements
Rejects queries during admission control if:
* the largest (across all backends) min buffer reservation is
  greater than the query mem_limit or buffer_pool_limit
* the sum of the min buffer reservations across the cluster
  is larger than the pool max mem resources
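
The two rejection conditions above can be sketched as a simple check;
the function and parameter names are illustrative only.

    #include <cstdint>

    // Sketch: reject at admission if any backend's min reservation exceeds the
    // per-query memory limit, or the cluster-wide sum exceeds the pool limit.
    bool RejectForMinReservation(int64_t largest_min_reservation,
                                 int64_t query_mem_limit,
                                 int64_t sum_min_reservations,
                                 int64_t pool_max_mem_resources) {
      if (query_mem_limit > 0 && largest_min_reservation > query_mem_limit) {
        return true;
      }
      if (pool_max_mem_resources > 0 && sum_min_reservations > pool_max_mem_resources) {
        return true;
      }
      return false;
    }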

There are some other interesting cases to consider later:
* every per-backend min buffer reservation is less than the
  associated backend's process mem_limit; the current
  admission control code doesn't know about other backend's
  proc mem_limits.

Also reduces minimum non-reservation memory (IMPALA-5810).
See the JIRA for experimental results that show this
slightly improves min memory requirements for small queries.
One reason to tweak this is to compensate for the fact that
BufferedBlockMgr didn't count small buffers against the
BlockMgr limit, but BufferPool counts all buffers against
it.

Testing:
* Adds new test cases in test_admission_controller.py
* Adds BE tests in reservation-tracker-test for the
  reservation-util code.

Change-Id: Iabe87ce8f460356cfe4d1be4d7092c5900f9d79b
Reviewed-on: http://gerrit.cloudera.org:8080/7678
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-22 08:27:12 +00:00
Matthew Jacobs
985698bb6f IMPALA-4833: Compute precise per-host reservation size, pt2
Addresses some comments from the CR:
https://gerrit.cloudera.org/#/c/7630/8

Notably changes the name of initial_reservation_total_bytes
to initial_reservation_total_claims and attempts to clarify
comments.

Change-Id: I1391d99ca15d2ebcaabd564fbe1242806be09f72
Reviewed-on: http://gerrit.cloudera.org:8080/7681
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-18 19:34:35 +00:00
Matthew Jacobs
6c12546561 IMPALA-4833: Compute precise per-host reservation size
Before this change, the per-host reservation size was computed
by the Planner. However, scheduling happens after planning,
so the Planner must assume that all fragments run on all
hosts, and the reservation size is likely much larger than
it needs to be.

This moves the computation of the per-host reservation size
to the BE where it can be computed more precisely. This also
includes a number of plan/profile changes.

Change-Id: Idbcd1e9b1be14edc4017b4907e83f9d56059fbac
Reviewed-on: http://gerrit.cloudera.org:8080/7630
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-12 08:10:07 +00:00
Tim Armstrong
27dcc768d2 IMPALA-5715: (mitigation only) defer destruction of MemTrackers
One potential candidate for the bad MemTracker in IMPALA-5715 is one owned
by a join build sink. I haven't found a clear root cause, but we can
reduce the probability of bugs like this by deferring teardown of the
MemTrackers.

This patch defers destruction of the fragment instance, ExecNode,
DataSink and Codegen MemTrackers until query teardown. Instead
MemTracker::Close() is called at the place where the MemTracker
would have been destroyed to check that all memory is released
and enforce that no more memory is consumed. The entire query
MemTracker subtree is then disconnected in whole from the global
tree in QueryState::ReleaseResources(), instead of the MemTrackers
being incrementally disconnected bottom-up.
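
A hedged sketch of the Close()-then-destroy-later pattern described
above; the method names follow this message but the class is heavily
simplified.

    #include <cassert>
    #include <cstdint>

    // Sketch: Close() checks that all memory was released and forbids further
    // consumption, while the tracker object stays alive until query teardown.
    class MemTracker {
     public:
      void Consume(int64_t bytes) {
        assert(!closed_);  // no new consumption after Close()
        consumption_ += bytes;
      }
      void Release(int64_t bytes) { consumption_ -= bytes; }
      void Close() {
        assert(consumption_ == 0);  // all memory must be released by now
        closed_ = true;
      }
     private:
      int64_t consumption_ = 0;
      bool closed_ = false;
    };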

In cases where the MemTracker is owned by another object, this
required deferring teardown of the owner until query teardown.
E.g. for LlvmCodeGen I added a Close() method to release resources
and deferred calling the destructor.

We want to make this change anyway - see IMPALA-5587.

Testing:
Ran a core ASAN build.

Change-Id: I205abb0076d1ffd08cb93c0f1671c8b81e7fba0f
Reviewed-on: http://gerrit.cloudera.org:8080/7492
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-08 19:43:36 +00:00
Tim Armstrong
a98b90bd38 IMPALA-4674: Part 2: port backend exec to BufferPool
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.

min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.

Port ExecNodes to use BufferPool:
  * Each ExecNode has to claim its reservation during Open()
  * Port Sorter to use BufferPool.
  * Switch from BufferedTupleStream to BufferedTupleStreamV2
  * Port HashTable to use BufferPool via a Suballocator.

This also makes PAGG memory consumption more efficient (avoid wasting buffers)
and improve the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
  cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
  buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)

TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.

Testing:
* Updated tests to reflect new memory requirements

Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-08-05 01:03:02 +00:00
Henry Robinson
d25db64f0e IMPALA-4905: Don't send empty insert status to coordinator
If the coordinator, in UpdateBackendExecStatus(), receives a report that
includes a TInsertExecStatus, it will call UpdateInsertExecStatus()
which takes the coordinator-wide lock. Avoid doing this for fragment
instances that would only send an empty TInsertExecStatus (including
instances that belong to SELECT queries, not DML queries).

Future changes should fix the locking around the
UpdateBackendExecStatus() path to remove dependencies on
Coordinator::lock_, but this fix is simple and addresses one point of
needless contention.

Change-Id: I314dd8d96922d273c6329266970d249ec8c5c608
Reviewed-on: http://gerrit.cloudera.org:8080/7457
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins
2017-07-25 00:09:19 +00:00
Michael Ho
e0ba5ef6a2 IMPALA-5558/IMPALA-5576: Reopen stale client connection
Previously, the retry logic in DoRpc() only allows retry to
happen if send() didn't complete successfully and the exception
indicates a closed connection. However, send() returning
successfully doesn't guarantee that the bytes have actually
reached the remote peer. According to the man page of send(),
when the message does not fit into the send buffer of the socket,
send() normally blocks. So the payload of RPC may be buffered in
the kernel if there is room for it. TCP allows a connection to
be half-open. If an Impalad node is restarted, a stale client
connection to that node may still allow send() to appear to succeed
even though the payload wasn't sent. However, upon calling recv()
in the RPC call to fetch the response, the client will get a return
value of 0. In which case, thrift will throw an exception as the
connection to the remote peer is closed already. Apparently, the
existing retry logic doesn't quite handle this case. One can
consistently reproduce the problem by warming the client cache
followed by restarting one of the Impalad nodes. It will result in
a series of query failures due to stale connections.

This change augments the retry logic to also retry the entire RPC
if the exception string contains the messages "No more data to read."
or "SSL_read: Connection reset by peer" to capture the case of stale
connections. Our usage of thrift doesn't involve half-open TCP connection
so having a broken connection in recv() indicates the remote end has
closed the socket already. The generated thrift code doesn't knowingly
close the socket before an RPC completes unless the process crashes,
the connection is stale (e.g. the remote node was restarted) or the
remote end fails to read from the client. In any of these cases, the entire
RPC should just be retried with a new connection.
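
The exception-message check described above might look like the sketch
below; it is illustrative and not the actual DoRpc() retry code.

    #include <string>

    // Sketch: treat these Thrift exception messages as signs of a stale,
    // half-open connection, so the whole RPC is retried on a fresh one.
    bool IsStaleConnectionError(const std::string& exception_msg) {
      return exception_msg.find("No more data to read.") != std::string::npos ||
             exception_msg.find("SSL_read: Connection reset by peer") != std::string::npos;
    }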

This change also fixes QueryState::ReportExecStatusAux() to
unconditionally retry up to 3 times when reporting the exec status of a
fragment instance. Previously, it could break out of the loop early
if RPC fails with 'retry_is_safe' == true (e.g. due to recv() timeout)
or if the connection to coordinator fails (IMPALA-5576). Declaring the
RPC to have failed may cause all fragment instances of a query to be
cancelled locally, triggering query hang due to IMPALA-2990. Similarly,
the cancellation RPC is also idempotent so it should be unconditionally
retried up to 3 times with 100ms sleep time in between.

The status reporting is idempotent, as the handler simply ignores the
RPC if it determines that all fragment instances on a given backend
are done, so it should be safe to retry the RPC. This change updates
ApplyExecStatusReport() to handle duplicated status reports with
done bit set. Previously we would drop some other fragment instances'
statuses if we received duplicate 'done' statuses from the same
fragment instance(s).

Testing done: Warmed up client cache by running stress test followed by
restarting some Impalad nodes. Running queries used to fail or hang
consistently in the past. It now works with the patch. Also ran CSL
endurance tests and exhaustive builds.

Change-Id: I4d722c8ad3bf0e78e89887b6cb286c69ca61b8f5
Reviewed-on: http://gerrit.cloudera.org:8080/7284
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-26 09:06:38 +00:00
Michael Ho
b38d9826d7 IMPALA-4192: Disentangle Expr and ExprContext
This change separates Expr and ExprContext. This is a preparatory
step for factoring out static data (e.g. Exprs) of plan fragments
to be shared by multiple plan fragment instances.

This change includes the following:

1. Include aggregate functions (AggFn) as Expr. This separates
   AggFn from its evaluator. AggFn is similar to existing Expr
   as both are represented as a tree of Expr nodes but it doesn't
   really make sense to call Get*Val() on AggFn. This change
   restructures the class hierarchy: much of the existing Expr
   class is now renamed to ScalarExpr. Expr is the parent class
   of both AggFn and ScalarExpr. Expr is defined to be a tree
   with root of either AggFn or ScalarExpr and all descendants
   being ScalarExpr.

2. ExprContext is renamed to ScalarExprEvaluator which is the
   interface for evaluating ScalarExpr; AggFnEvaluator is the
   interface for evaluating AggFn. Multiple evaluators can be
   instantiated per Expr. Expr contains static states of an
   expression while evaluator contains runtime states needed
   for execution (i.e. evaluating the expression).

3. Update all exec nodes to instantiate Expr and their evaluators
   separately. ExecNode::Init() will be responsible for creating
   all the Exprs in an ExecNode while their evaluators are created
   in ExecNode::Prepare(). Certain evaluators are also moved into
   the data structures which actually utilize them. For instance,
   HashTableCtx now owns the build and probe expression evaluators.
   Similarly, TupleRowComparator and Sorter also own the evaluators.
   ExecNode which utilizes these data structures are only responsible
   for creating the expressions used by these data structures.

4. All codegen functions take Exprs instead of evaluators. Also, codegen
   functions will not return error status should the IR function fails the
   LLVM verification step.

5. The assignment of index into the FunctionContext vector is now done
   during the construction of ScalarExpr. Evaluators are only responsible
   for allocating and initializing the FunctionContexts.

6. Open(), Prepare() are now removed from Expr classes. The interface
   for creating any Expr is via either ScalarExpr::Create() or AggFn::Create()
   which will convert a thrift Expr into an initialized Expr object.
   Similarly, the Create() interface is used for creating evaluators from an
   initialized Expr object.

This separation allows the future change to introduce PlanNode data structures.
The plan is to move all ExecNode::Init() logic to PlanNode and call them once
per plan fragment.
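
The class relationships described in points 1 and 2 can be pictured with
the skeletal sketch below; it omits everything except the hierarchy and
is not the real headers.

    class Expr {};                      // tree of expression nodes
    class ScalarExpr : public Expr {};  // root or descendant of any expression tree
    class AggFn : public Expr {};       // root only; its descendants are ScalarExprs

    // Evaluators hold the runtime state; several can share one static Expr.
    class ScalarExprEvaluator {};  // evaluates a ScalarExpr
    class AggFnEvaluator {};       // evaluates an AggFn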

Change-Id: Iefdc9aeeba033355cb9497e3a5d2363627dcf2f3
Reviewed-on: http://gerrit.cloudera.org:8080/5483
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-18 11:08:25 +00:00
Henry Robinson
9caea9bfad IMPALA-5350: Tidy up thread groups for finst exec threads
Put all per-finst threads (profile report, exec thread, async build
thread, scanner threads) together in one group "fragment-execution", and
add finst ID to all their names.

Remove -<tid> suffix from thread names and add a separate column with it
to /threadz web page.

Testing: eyeballed with a join query.

Change-Id: I6fabfcc923863e5f9db4bb8b016c08ff226c079f
Reviewed-on: http://gerrit.cloudera.org:8080/6951
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-01 22:45:53 +00:00
Thomas Tauber-Marshall
eb54287fb4 IMPALA-5167: Reduce the number of Kudu clients created (BE)
Creating Kudu clients is very expensive as each will fetch
metadata from the Kudu master, so we should minimize the
number of Kudu clients that get created.

This patch stores a map from Kudu master addresses to Kudu
clients in the ExecEnv to be used across the BE for all
queries. Another patch will address the FE.
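
A minimal sketch of the master-address-to-client map described above;
the cache structure and placeholder client type are illustrative, not
the actual ExecEnv code.

    #include <map>
    #include <memory>
    #include <mutex>
    #include <string>

    struct KuduClientHandle {};  // placeholder for the real Kudu client type

    // Sketch: one shared client per distinct master-address string, created
    // lazily and reused by all queries on this impalad.
    class KuduClientCache {
     public:
      std::shared_ptr<KuduClientHandle> GetClient(const std::string& master_addresses) {
        std::lock_guard<std::mutex> l(lock_);
        auto& client = clients_[master_addresses];
        if (client == nullptr) client = std::make_shared<KuduClientHandle>();
        return client;
      }
     private:
      std::mutex lock_;
      std::map<std::string, std::shared_ptr<KuduClientHandle>> clients_;
    };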

This relies on a change on the Kudu side that clears
non-covered range entries from the client's cache on
table open (fdc022fe6231af20e307012d98c35b16cbfa7b33)

Testing:
- Ran a stress test on a 10 node cluster: scan of a small
  Kudu table, 1000 concurrent queries, load on the Kudu
  master was reduced significantly, from ~50% cpu to ~5%.
  (with the FE changes included)
- Ran the Kudu e2e tests.
- Manually ran a test with concurrent INSERTs and
  'ALTER TABLE ADD PARTITION' (which is affected by the
  Kudu side change mentioned above) and verified
  correctness.

Change-Id: I6b0c12a256c33e8ef32315b3736cae2dea2ae705
Reviewed-on: http://gerrit.cloudera.org:8080/6792
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-24 22:27:37 +00:00
Marcel Kornacker
368115cdae IMPALA-2550: Switch to per-query exec rpc
Coordinator:
- FragmentInstanceState -> BackendState, which in turn records
  FragmentInstanceStats

QueryState
- does query-wide setup in a separate thread (which also launches
  the instance exec threads)
- has a query-wide 'prepared' state at which point all static setup
  is done and all FragmentInstanceStates are accessible

Also renamed QueryExecState to ClientRequestState.

Simplified handling of execution status (in FragmentInstanceState):
- status only transmitted via ReportExecStatus rpc
- in particular, it's not returned anymore from the Cancel rpc

FIS: Fixed bugs related to partially-prepared state (in Close() and ReleaseThreadToken())

Change-Id: I20769e420711737b6b385c744cef4851cee3facd
Reviewed-on: http://gerrit.cloudera.org:8080/6535
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-09 04:04:50 +00:00
Joe McDonnell
5bb988b1c5 IMPALA-4996: Single-threaded KuduScanNode
This introduces KuduScanNodeMt, the single-threaded version
of KuduScanNode that materializes the tuples in GetNext().
KuduScanNodeMt is enabled by the same condition as
HdfsScanNodeMt: mt_dop is greater than or equal to 1.

To share code between the two implementations, KuduScanNode
and KuduScanNodeMt are now subclasses of KuduScanNodeBase,
which implements the shared code. The KuduScanner is
minimally impacted, as it already had the required GetNext
interface.
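
The resulting class layout, roughly (declarations heavily abbreviated):

  // Scanner-independent logic (client/table handles, distributing the scan
  // ranges) lives in the shared base class.
  class KuduScanNodeBase : public ScanNode { /* shared state */ };

  // Multi-threaded version: scanner threads materialize batches into a queue.
  class KuduScanNode : public KuduScanNodeBase { /* ... */ };

  // Single-threaded version (mt_dop >= 1): materializes tuples in GetNext().
  class KuduScanNodeMt : public KuduScanNodeBase {
    Status GetNext(RuntimeState* state, RowBatch* row_batch, bool* eos) override;
  };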

Since the KuduClient is a heavy-weight object, it is now
shared at the QueryState level. We try to share the
KuduClient as much as possible, but there are times when
the KuduClient cannot be shared. Each Kudu table has
master addresses stored in the Hive Metastore. We only
share KuduClients for tables that have an identical value
for the master addresses. In the ideal case, every Kudu
table will have the same value, but there is no explicit
guarantee of this.

The testing for this is a modified version of
kudu-scan-node.test run with various mt_dop values.

Change-Id: I6e4593300e376bc508b78acaea64ffdd2c73a67a
Reviewed-on: http://gerrit.cloudera.org:8080/6312
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-17 19:33:31 +00:00
Tim Armstrong
663285244e IMPALA-4674: Part 1: port BufferedTupleStream to BufferPool
Add a copy of BufferedTupleStream that allocates memory from BufferPool.
This will replace the original implementation when IMPALA-3200 is
completed.

The major changes are:
* Terminology is updated from "blocks" to "pages"
* No small buffer support is needed (hooray!).
* BufferedTupleStream needs to do its own tracking of the # of rows per
  page, etc. instead of relying on BufferedBlockMgr to do it. A
  wrapper around PageHandle is used.
* Profile counters (unpin, pin, get new block time) are removed -
  similar counters in the BufferPool client are more useful.
* Changed the tuple null indicators so that they are allocated
  before each tuple, rather than in a block at the start of the
  page. This slightly reduces the memory density, but greatly
  simplifies much of the logic. In particular, it avoids problems with
  RowIdx and larger pages with offsets that don't fit in 16 bits.
* Buffer management of read/write streams uses the new pin-counting
  support to separate pinning of the read and write pages.
  This means that the reservation usage of an unpinned read/write
  stream does not fluctuate and the read/write iterators can always
  advance without requiring additional reservation.
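
To illustrate the last point, a sketch of how the two pins are used
(simplified; not the exact BufferPool calls made by the stream):

  // A page can be pinned independently by the read side and the write side,
  // so an unpinned stream needs reservation for at most two pinned pages.
  RETURN_IF_ERROR(pool->Pin(client, write_page));  // writer appends rows here
  RETURN_IF_ERROR(pool->Pin(client, read_page));   // reader scans this page
  ...
  pool->Unpin(client, read_page);  // advancing the read iterator drops only the
                                   // reader's pin; the writer's pin remains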

Testing this required some further changes. TestEnv was refactored
so it can set up either BufferPool or BufferedBlockMgr. Some
BufferPool-related state is added to ExecEnv, QueryState and
RuntimeState, but is only created for backend tests that explicitly
create a BufferPool.

The following is left for future work:
* IMPALA-3808 (large row support) is not added. I've added
  TODOs in the code at the places that will require changes.
* IMPALA-4179 (remove MarkNeedsDeepCopy()) is not fixed, since
  it requires global changes to operators that accumulate memory.

Testing:
All of the BufferedTupleStream unit tests are ported to the new
implementation, except for ones specifically testing the small
buffer functionality.

Change-Id: I7bb47a818564ac19a6be4ca88db0578f4ea0b709
Reviewed-on: http://gerrit.cloudera.org:8080/5811
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-16 21:07:58 +00:00
Henry Robinson
d074f71d81 IMPALA-3909 (follow-up): Properly qualify min() and max() in header
parquet-column-stats.h was using unqualified references to std::max and
std::min, which we disallow in header files. The reason this compiled is
that some headers included "names.h". This commit flushes those out as
well, and fixes the resulting compilation errors.
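
In header code the fix is simply to qualify the calls rather than rely on the
using-declarations pulled in by names.h, e.g. (simplified example, not the
actual parquet-column-stats.h code):

  // Headers must spell the names out; names.h is only for .cc files.
  void Update(const T& v) {
    min_value_ = std::min(min_value_, v);
    max_value_ = std::max(max_value_, v);
  }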

I tried to make a static_assert method that complained if names.h was
included in a header file, but couldn't do so without needing an extra
macro, e.g.:

USE_IMPORTED_NAMES();

Which seemed too verbose for now.

Change-Id: If8381d5b1fa072e89ac1f7f54959c02bf8e43b4b
Reviewed-on: http://gerrit.cloudera.org:8080/5916
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-02-08 00:57:35 +00:00
Tim Armstrong
cb52b2b8ae IMPALA-3748: add query-wide resource acquisition step
This adds a Prepare() method to QueryState that allows
query execution to fail up-front if resources cannot be
acquired. We don't make full use of this yet, but it
provides a suitable place to acquire a memory reservation
for the query or fail cleanly.

The process memory limit check is moved here so that all
startup memory checks will eventually be consolidated in
one place.
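
A minimal sketch of the kind of up-front check this enables (the field name
and message are illustrative, not the actual code):

  Status QueryState::Prepare() {
    // Fail the whole query on this backend up-front if its memory cannot be
    // admitted, instead of failing later inside individual fragment instances.
    const int64_t bytes_needed = query_mem_to_reserve_;  // illustrative field
    if (!ExecEnv::GetInstance()->process_mem_tracker()->TryConsume(bytes_needed)) {
      return Status(Substitute(
          "Could not reserve $0 bytes for query startup", bytes_needed));
    }
    return Status::OK();
  }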

Also switch a boost::mutex to a SpinLock for consistency
and to improve performance (see lock-benchmark.cc).

Change-Id: Ia21a3c0f0b0a7175116883ef9871b93c8ce8bb81
Reviewed-on: http://gerrit.cloudera.org:8080/5739
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2017-01-24 18:49:24 +00:00
Tim Armstrong
85edc15fef IMPALA-4678: move query MemTracker into QueryState
The query MemTracker for query execution is now owned directly by
QueryState, which greatly simplifies the lifecycle of the MemTracker.
This required various other changes and enabled some simplifications.

* The coordinator's QueryState is constructed earlier before fragments
  are sent out, since we need a MemTracker at that point.
* The global query MemTracker map can be removed.
* The static request pool mem tracker is moved into ExecEnv.
* Temporary query MemTrackers used to evaluate expressions during
  planning do not need to be registered globally and are owned
  directly by the RuntimeState.
* Various cleanup logic is moved around to reflect the other changes.
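
The resulting ownership, roughly (the member shown is illustrative, not the
exact declaration):

  // MemTracker hierarchy after this change:
  //   process tracker (ExecEnv)
  //     -> request pool tracker (ExecEnv)
  //          -> query tracker (owned directly by QueryState)
  //               -> per-fragment-instance trackers (RuntimeStates)
  class QueryState {
    // Lives exactly as long as the QueryState; no global map lookup needed.
    std::unique_ptr<MemTracker> query_mem_tracker_;
  };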

Change-Id: Id6b46652932b5638993623e98d1f0d60d8380ba0
Reviewed-on: http://gerrit.cloudera.org:8080/5630
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-19 02:34:39 +00:00
Tim Armstrong
95ed4434f2 IMPALA-3202,IMPALA-2079: rework scratch file I/O
Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into
TmpFileMgr, in anticipation of it being shared with BufferPool.
TmpFileMgr now handles:
* Scratch space allocation and recycling
* Read and write I/O

The interface is also greatly changed so that it is built around Write()
and Read() calls, abstracting away the details of temporary file
allocation from clients. This means the TmpFileMgr::File class can
be hidden from clients.
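
A rough sketch of the new client-facing shape (names and signatures are
approximate):

  // The client hands over a buffer and a completion callback; picking a
  // scratch file, allocating a range and recycling it all happen inside
  // TmpFileMgr.
  TmpFileMgr::WriteHandle* handle;
  RETURN_IF_ERROR(tmp_file_mgr->Write(buffer, write_complete_callback, &handle));
  ...
  // Later, read the spilled data back using the handle returned by the write.
  RETURN_IF_ERROR(tmp_file_mgr->Read(handle, buffer));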

Write error recovery:
Also implement write error recovery in TmpFileMgr.

If an error occurs while writing to scratch and we have multiple
scratch directories, we will try one of the other directories
before cancelling the query. File-level blacklisting is used to
prevent excessive repeated attempts to resize a scratch file during
a single query. Device-level blacklisting is not implemented because
it is problematic to permanently take a scratch directory out of use.

To reduce the number of error paths, all I/O errors are now handled
asynchronously. Previously errors creating or extending the file were
returned synchronously from WriteUnpinnedBlock(). This required
modifying DiskIoMgr to create the file if not present when opened.

Also set the default max_errors value in the thrift definition file,
so that it is in effect for backend tests.

Future Work:
* Support for recycling variable-length scratch file ranges. I omitted
  this to avoid making the patch even larger.

Testing:
Updated BufferedBlockMgr unit test to reflect changes in behaviour:
* Scratch space is no longer permanently associated with a block, and
  is remapped every time a new block is written to disk.
* Files are now blacklisted - updated existing tests and enabled the
  previously disabled blacklisting test.

Added some basic testing of recycling of scratch file ranges in
the TmpFileMgr unit test.

I also manually tested the code in two ways. First, by removing permissions
for /tmp/impala-scratch and ensuring that a spilling query fails cleanly.
Second, by creating a tiny ramdisk (16M) and running with two scratch
directories: one on /tmp and one on the tiny ramdisk. When spilling, an
out-of-space error is encountered on the tiny ramdisk and Impala spills
the remaining data (72M) to /tmp.

Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17
Reviewed-on: http://gerrit.cloudera.org:8080/5141
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-01-05 02:26:24 +00:00
Marcel Kornacker
4b2d76dbb5 IMPALA-4014: Introduce query-wide execution state.
This introduces a global structure to coordinate execution
of fragment instances on a backend for a single query.

New classes:
- QueryExecMgr: subsumes FragmentMgr
- QueryState
- FragmentInstanceState: replaces FragmentExecState
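
Roughly how the new classes relate (the members shown are illustrative only):

  // One per impalad: creates and hands out QueryStates, keyed by query id.
  class QueryExecMgr {
    std::unordered_map<TUniqueId, QueryState*> query_state_map_;
  };

  // One per query per backend: query-wide state shared by all of its
  // fragment instances.
  class QueryState {
    std::vector<FragmentInstanceState*> fragment_instance_states_;
  };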

Change-Id: I962ae6b7cb7dc0d07fbb8f70317aeb01d88d400b
Reviewed-on: http://gerrit.cloudera.org:8080/4418
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2016-12-11 02:29:28 +00:00