Move parquet classes into exec/parquet.
Move CollectionColumnReader and ParquetLevelDecoder into separate files.
Remove unnecessary 'encoding_' field from ParquetLevelDecoder.
Switch BOOLEAN decoding to use composition instead of inheritance. This
lets the boolean decoding use the faster batched implementations in
ScalarColumnReader and avoids some confusing aspects of the class
hierarchy, like the ReadValueBatch() implementation on the base class
that was shared between BoolColumnReader and CollectionColumnReader.
Improve compile times by instantiating BitPacking templates in a
separate file (this appears to give a 30s+ speedup for
compiling parquet-column-readers.cc).
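As a rough sketch of the explicit-instantiation pattern this relies on
(the file and function names below are illustrative, not the actual
Impala sources):

  // bit-packing.h: declare the template and suppress implicit
  // instantiation in every translation unit that includes this header.
  #include <cstdint>

  template <typename T>
  int64_t UnpackValues(int bit_width, const uint8_t* in, int64_t in_bytes,
                       int64_t num_values, T* out);

  extern template int64_t UnpackValues<uint32_t>(int, const uint8_t*,
                                                 int64_t, int64_t, uint32_t*);

  // bit-packing-instantiations.cc: the one translation unit that pays the
  // compile-time cost of instantiating the heavily inlined template.
  #include "bit-packing.inline.h"  // hypothetical header with the definition

  template int64_t UnpackValues<uint32_t>(int, const uint8_t*,
                                          int64_t, int64_t, uint32_t*);

Other files (such as parquet-column-readers.cc) then only see the
declaration and link against the pre-instantiated symbols instead of
re-instantiating them.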
Testing:
Ran exhaustive tests.
Change-Id: I0efd5c50b781fe9e3c022b33c66c06cfb529c0b8
Reviewed-on: http://gerrit.cloudera.org:8080/11949
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this fix, Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcpy'd as-is into slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00
Different parts of Impala treat these timestamps differently:
- string conversion leads to an invalid representation that cannot be
converted back to a timestamp
- timezone conversions handle the overflowing time part and give
a valid timestamp result (at least since CCTZ; I did not check
older versions of Impala)
- Parquet writing inserts these timestamps as they are, so the
resulting Parquet file will also contain corrupt timestamps
The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamps outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part are corrupt, then the error about the corrupt
time part is returned.
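A minimal standalone sketch of the kind of range check involved (this
is not the actual TimestampValue code; names are illustrative):

  #include <cstdint>
  #include <optional>

  // A valid time-of-day is in [0, 24 hours) expressed in nanoseconds.
  constexpr int64_t NANOS_PER_DAY = 24LL * 60 * 60 * 1000 * 1000 * 1000;

  // Returns the time-of-day if it is in range; otherwise nullopt, so the
  // caller can write NULL to the slot and raise the new error code.
  std::optional<int64_t> ValidateTimeOfDayNanos(int64_t nanos) {
    if (nanos < 0 || nanos >= NANOS_PER_DAY) return std::nullopt;
    return nanos;
  }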
Testing:
- added a new scanner test that reads a corrupted Parquet file
with edge values
Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Reviewed-on: http://gerrit.cloudera.org:8080/11521
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the same patch except with fixes for the test failures
on EC and S3 noted in the JIRA.
This allows graceful shutdown of executors and partially graceful
shutdown of coordinators (new operations fail, old operations can
continue).
Details:
* In order to allow future admin commands, this is implemented with
function-like syntax and does not add any reserved words.
* ALL privilege is required on the server
* The coordinator impalad that the client is connected to can be shut
down directly with ":shutdown()".
* Remote shutdown of another impalad is supported, e.g. with
":shutdown('hostname')", so that non-coordinators can be shut down
and for the convenience of the client, which does not have to
connect to the specific impalad. There is no assumption that the
other impalad is registered in the statestore; just that the
coordinator can connect to the other daemon's thrift endpoint.
This simplifies things and allows shutdown in various important
cases, e.g. statestore down.
* The shutdown time limit can be overridden to force a quicker or
slower shutdown by specifying a deadline in seconds after the
statement is executed.
* If shutting down, a banner is shown on the root debug page.
Workflow:
1. (if a coordinator) clients are prevented from submitting
queries to this coordinator via some out-of-band mechanism,
e.g. load balancer
2. the shutdown process is started via ":shutdown()"
3. a bit is set in the statestore and propagated to coordinators,
which stop scheduling fragment instances on this daemon
(if an executor).
4. the query startup grace period (which is ideally set to the AC
queueing delay plus some additional leeway) expires
5. once the daemon is quiesced (i.e. no fragments, no registered
queries), it shuts itself down.
6. If the daemon does not successfully quiesce (e.g. rogue clients,
long-running queries), after a longer timeout (counted from the start
of the shutdown process) it will shut down anyway (a minimal sketch of
this logic follows the list).
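A minimal sketch of the grace-period / deadline behaviour in steps 4-6,
assuming hypothetical IsQuiesced()/ExitCleanly() helpers (this is not
the actual Impala shutdown code):

  #include <chrono>
  #include <thread>

  void ShutdownLoop(std::chrono::seconds grace, std::chrono::seconds deadline,
                    bool (*IsQuiesced)(), void (*ExitCleanly)()) {
    const auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(grace);  // let already-scheduled work arrive
    while (std::chrono::steady_clock::now() - start < deadline) {
      if (IsQuiesced()) break;  // no fragments and no registered queries
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    ExitCleanly();  // shut down even if not quiesced once the deadline passes
  }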
What this does:
* Executors can be shut down without causing a service-wide outage
* Shutting down an executor will not disrupt any short-running queries
and will wait for long-running queries up to a threshold.
* Coordinators can be shut down without query failures only if
there is an out-of-band mechanism to prevent submission of more
queries to the shut down coordinator. If queries are submitted to
a coordinator after shutdown has started, they will fail.
* Long running queries or other issues (e.g. stuck fragments) will
slow down but not prevent eventual shutdown.
Limitations:
* The startup grace period needs to be configured to be greater than
the latency of statestore updates + scheduling + admission +
coordinator startup. Otherwise a coordinator may send a
fragment instance to the shutting down impalad. (We could
automate this configuration as a follow-on)
* The startup grace period means a minimum latency for shutdown,
even if the cluster is idle.
* We depend on the statestore detecting the process going down
if queries are still running on that backend when the timeout
expires. This may still be subject to existing problems,
e.g. IMPALA-2990.
Tests:
* Added parser, analysis and authorization tests.
* End-to-end test of shutting down impalads.
* End-to-end test of shutting down then restarting an executor while
queries are running.
* End-to-end test of shutting down a coordinator
- New queries cannot be started on coord, existing queries continue to run
- Exercises various Beeswax and HS2 operations.
Change-Id: I8f3679ef442745a60a0ab97c4e9eac437aef9463
Reviewed-on: http://gerrit.cloudera.org:8080/11484
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I started by converting scan and spill-to-disk because the
cancellation there is always meant to be internal to the scan and
spill-to-disk subsystems.
I updated all places that checked for TErrorCode::CANCELLED to treat
CANCELLED_INTERNALLY the same.
This is to aid triage and debugging of bugs like IMPALA-7418,
where an "internal" cancellation leaks out into the query state:
with a separate error code it is easy to tell when that has happened.
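As a trivial sketch of what "treat them the same" means at call sites
(the real TErrorCode values come from Impala's generated thrift code;
this enum is just illustrative):

  enum class ErrorCode { OK, CANCELLED, CANCELLED_INTERNALLY };

  // Call sites that previously checked only CANCELLED now accept both codes.
  bool IsCancelled(ErrorCode code) {
    return code == ErrorCode::CANCELLED ||
           code == ErrorCode::CANCELLED_INTERNALLY;
  }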
Testing:
Ran exhaustive tests.
Change-Id: If25d5b539d68981359e4d881cae7b08728ba2999
Reviewed-on: http://gerrit.cloudera.org:8080/11464
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This allows graceful shutdown of executors and partially graceful
shutdown of coordinators (new operations fail, old operations can
continue).
Details:
* In order to allow future admin commands, this is implemented with
function-like syntax and does not add any reserved words.
* ALL privilege is required on the server
* The coordinator impalad that the client is connected to can be shut
down directly with ":shutdown()".
* Remote shutdown of another impalad is supported, e.g. with
":shutdown('hostname')", so that non-coordinators can be shut down
and for the convenience of the client, which does not have to
connect to the specific impalad. There is no assumption that the
other impalad is registered in the statestore; just that the
coordinator can connect to the other daemon's thrift endpoint.
This simplifies things and allows shutdown in various important
cases, e.g. statestore down.
* The shutdown time limit can be overridden to force a quicker or
slower shutdown by specifying a deadline in seconds after the
statement is executed.
* If shutting down, a banner is shown on the root debug page.
Workflow:
1. (if a coordinator) clients are prevented from submitting
queries to this coordinator via some out-of-band mechanism,
e.g. load balancer
2. the shutdown process is started via ":shutdown()"
3. a bit is set in the statestore and propagated to coordinators,
which stop scheduling fragment instances on this daemon
(if an executor).
4. the query startup grace period (which is ideally set to the AC
queueing delay plus some additional leeway) expires
5. once the daemon is quiesced (i.e. no fragments, no registered
queries), it shuts itself down.
6. If the daemon does not successfully quiesce (e.g. rogue clients,
long-running queries), after a longer timeout (counted from the start
of the shutdown process) it will shut down anyway.
What this does:
* Executors can be shut down without causing a service-wide outage
* Shutting down an executor will not disrupt any short-running queries
and will wait for long-running queries up to a threshold.
* Coordinators can be shut down without query failures only if
there is an out-of-band mechanism to prevent submission of more
queries to the shut down coordinator. If queries are submitted to
a coordinator after shutdown has started, they will fail.
* Long running queries or other issues (e.g. stuck fragments) will
slow down but not prevent eventual shutdown.
Limitations:
* The startup grace period needs to be configured to be greater than
the latency of statestore updates + scheduling + admission +
coordinator startup. Otherwise a coordinator may send a
fragment instance to the shutting down impalad. (We could
automate this configuration as a follow-on)
* The startup grace period means a minimum latency for shutdown,
even if the cluster is idle.
* We depend on the statestore detecting the process going down
if queries are still running on that backend when the timeout
expires. This may still be subject to existing problems,
e.g. IMPALA-2990.
Tests:
* Added parser, analysis and authorization tests.
* End-to-end test of shutting down impalads.
* End-to-end test of shutting down then restarting an executor while
queries are running.
* End-to-end test of shutting down a coordinator
- New queries cannot be started on coord, existing queries continue to run
- Exercises various Beeswax and HS2 operations.
Change-Id: I4d5606ccfec84db4482c1e7f0f198103aad141a0
Reviewed-on: http://gerrit.cloudera.org:8080/10744
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The error text with AES-GCM enabled looks like:
Error reading 44 bytes from scratch file
'/tmp/impala-scratch/0:0_d43635d0-8f55-485e-8899-907af289ac86' on
backend tarmstrong-box:22000 at offset 0: verification of read data
failed.
OpenSSL error in EVP_DecryptFinal:
139634997483216:error:0607C083:digital envelope
routines:EVP_CIPHER_CTX_ctrl:no cipher set:evp_enc.c:610:
139634997483216:error:0607C083:digital envelope
routines:EVP_CIPHER_CTX_ctrl:no cipher set:evp_enc.c:610:
139634997483216:error:0607C083:digital envelope
routines:EVP_CIPHER_CTX_ctrl:no cipher set:evp_enc.c:610:
139634997483216:error:0607C083:digital envelope
routines:EVP_CIPHER_CTX_ctrl:no cipher set:evp_enc.c:610:
Testing:
Added a backend test to exercise the code path and verify the error
code.
Change-Id: I0652d6cdfbb4e543dd0ca46b7cc65edc4e41a2d8
Reviewed-on: http://gerrit.cloudera.org:8080/10204
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When an impalad is in executor-only mode, it receives no
catalog updates. As a result, lib-cache entries are never
refreshed. A consequence is that udf queries can return
incorrect results or may not run due to resolution issues.
Both cases are caused by the executor using a stale copy
of the lib file. For incorrect results, an old version of
the method may be used. Resolution issues can come up if
a method is added to a lib file.
The solution in this change is to capture the coordinator's
view of the lib file's last modified time when planning.
This last modified time is then shipped with the plan to
executors. Executors must then use both the lib file path
and the last modified time as a key for the lib-cache.
If the coordinator's last modified time is more recent than
the executor's lib-cache entry, then the entry is refreshed.
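A rough sketch of keying the cache on (path, last modified time), with
all names made up rather than taken from the real LibCache:

  #include <cstdint>
  #include <map>
  #include <string>
  #include <utility>

  struct LibEntry { std::string local_path; };

  class LibCacheSketch {
   public:
    // Refresh the entry if the coordinator saw a newer mtime than we cached.
    LibEntry* Get(const std::string& hdfs_path, int64_t coord_mtime) {
      auto it = cache_.find(hdfs_path);
      if (it == cache_.end() || coord_mtime > it->second.first) {
        cache_[hdfs_path] = {coord_mtime, Download(hdfs_path)};
        it = cache_.find(hdfs_path);
      }
      return &it->second.second;
    }
   private:
    LibEntry Download(const std::string& path) { return {path}; }
    std::map<std::string, std::pair<int64_t, LibEntry>> cache_;
  };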
Brief discussion of alternatives:
- lib-cache always checks last modified time
+ easy/local change to lib-cache
- adds an fs lookup always. rejected for this reason
- keep the last modified time in the catalog
- the bound on staleness is too loose. Consider the case where
functions f1, f2, f3 are created with last modified times of
t1, t2, t3, and the function's last modified time is treated as a
low-watermark: if the cache entry has a more recent time, use it.
Such a scheme would allow the version at t2 to persist, and an old
function may keep the state from converging to the latest. This
could end up with strange cases where different versions of the lib
are used across executors for a single query.
In contrast, the change in this patch relies on the statestore to
push versions forward at all coordinators, and so will push all
versions at all caches forward as well.
Testing:
- added an e2e custom cluster test
Change-Id: Icf740ea8c6a47e671427d30b4d139cb8507b7ff6
Reviewed-on: http://gerrit.cloudera.org:8080/9697
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The serialization format of a row batch relies on
tuple offsets. In its current form, the tuple offsets
are int32s. This means that it is impossible to generate
a valid serialization of a row batch that is larger
than INT_MAX.
This changes RowBatch::SerializeInternal() to return an
error if trying to serialize a row batch larger than INT_MAX.
This prevents a DCHECK on debug builds when creating a row
larger than 2GB.
This also changes the compression logic in RowBatch::Serialize()
to avoid a DCHECK if LZ4 will not be able to compress the
row batch. Instead, it returns an error.
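A sketch of the two size checks described above (standalone, not the
actual RowBatch code; LZ4_MAX_INPUT_SIZE comes from lz4.h):

  #include <climits>
  #include <cstdint>
  #include <string>
  #include <lz4.h>

  // Returns an empty string on success, otherwise an error message.
  std::string CheckSerializedSize(int64_t uncompressed_size) {
    if (uncompressed_size > INT_MAX) {
      return "Row batch too large: tuple offsets are 32-bit";
    }
    if (uncompressed_size > LZ4_MAX_INPUT_SIZE) {
      return "Row batch too large to be compressed with LZ4";
    }
    return "";
  }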
This modifies row-batch-serialize-test to verify behavior at
each of the limits. Specifically:
RowBatches up to size LZ4_MAX_INPUT_SIZE succeed.
RowBatches with size range [LZ4_MAX_INPUT_SIZE+1, INT_MAX]
fail on LZ4 compression.
RowBatches with size > INT_MAX fail with RowBatch too large.
Change-Id: I3b022acdf3bc93912d6d98829b30e44b65890d91
Reviewed-on: http://gerrit.cloudera.org:8080/9367
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
The encoding was added in an early version of the Parquet
spec and deprecated even in the Parquet 1.0 spec.
Parquet-MR switched to generating RLE at the same time as
the spec changed in mid-2013. Impala always wrote RLE:
see commit 6e293090e6.
The Impala implementation of BIT_PACKED was never correct
because it implemented little-endian bit unpacking instead of
the big-endian unpacking required by the spec for levels.
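For reference, a standalone sketch of what the spec requires for
BIT_PACKED levels, i.e. MSB-first (big-endian) unpacking (assumes
bit_width <= 8; not the actual Impala decoder):

  #include <cstdint>
  #include <vector>

  std::vector<uint8_t> UnpackLevelsBigEndian(const uint8_t* in, int bit_width,
                                             int num_levels) {
    std::vector<uint8_t> out(num_levels);
    int64_t bit_pos = 0;  // bit offset from the start of the buffer
    for (int i = 0; i < num_levels; ++i) {
      uint32_t v = 0;
      for (int b = 0; b < bit_width; ++b, ++bit_pos) {
        int bit = (in[bit_pos / 8] >> (7 - bit_pos % 8)) & 1;  // MSB first
        v = (v << 1) | bit;
      }
      out[i] = static_cast<uint8_t>(v);
    }
    return out;
  }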
Testing:
Updated tests to reflect expected behaviour for supported
and unsupported def level encodings.
Cherry-picks: not for 2.x.
Change-Id: I12c75b7f162dd7de8e26cf31be142b692e3624ae
Reviewed-on: http://gerrit.cloudera.org:8080/9241
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds the following details to the error message encountered
on failure to get minimum memory reservation:
- which ReservationTracker hit its limit
- top 5 admitted queries that are consuming the most memory under the
ReservationTracker that hit its limit
Testing:
- added tests to reservation-tracker-test.cc that verify the error
message returned for different cases.
- tested "initial reservation failed" condition manually to verify
the error message returned.
Change-Id: Ic4675fe923b33fdc4ddefd1872e6d6b803993d74
Reviewed-on: http://gerrit.cloudera.org:8080/8781
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
There is not much benefit in printing the stack trace when
Thrift RPC hits an error. As long as we print enough info about
the error and identify the caller, that should be sufficient.
In fact, stack crawls have been observed to cause unnecessary
CPU spikes in the past. This change replaces Status() with
Status::Expected() in DoRpc(), RetryRpc(), RetryRpcRecv() and
Coordinator::BackendState::Exec() to avoid unnecessary stack crawls.
Testing done: private core build. Verified error strings with
test_rpc_timeout.py and test_rpc_exception.py
Change-Id: Ia83294494442ef21f7934f92ba9112e80d81fa58
Reviewed-on: http://gerrit.cloudera.org:8080/8788
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, we implicitly created a local string object from the
char* in argv[0] when calling InitAuth(). This string object goes
out of scope once InitAuth() returns, but the pointer to this local
string's buffer is passed to the Sasl library, which may reference
it after the local string has been deleted, leading to a use-after-free.
This bug is exposed by the recent change to enable Kerberos with KRPC,
as we now always initialize Sasl even if Kerberos is not enabled.
This change fixes the problem above by making a copy of 'appname'
passed to InitAuth(). Also, the new code enforces that multiple
calls to InitAuth() must use the same 'appname' or it will fail.
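A minimal sketch of the ownership fix, with a made-up function name
(the real change is inside Impala's InitAuth()):

  #include <string>

  // Keep a process-lifetime copy so the pointer handed to Sasl stays valid.
  static std::string* sasl_appname = nullptr;

  bool InitAuthSketch(const std::string& appname) {
    if (sasl_appname == nullptr) {
      sasl_appname = new std::string(appname);  // intentionally never freed
    } else if (*sasl_appname != appname) {
      return false;  // repeated calls must pass the same appname
    }
    // sasl_appname->c_str() can now safely be handed to the Sasl library.
    return true;
  }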
Testing done: Verified rpc-mgr-test and thrift-server-test no longer
fail in ASAN build.
Change-Id: I1f29c2396df114264dfc23726b8ba778f50e12e9
Reviewed-on: http://gerrit.cloudera.org:8080/8777
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
This change augments the message of TErrorCode::DATASTREAM_SENDER_TIMEOUT
to include the source address when KRPC is enabled. The source address is
not readily available in Thrift. The new message includes the destination
plan node id in case there are multiple exchange nodes in a fragment instance.
Testing done: Confirmed the error message by testing with following options:
"--stress_datastream_recvr_delay_ms=90000 datastream_sender_timeout_ms=1000"
Change-Id: Ie3e83773fe6feda057296e7d5544690aa9271fa0
Reviewed-on: http://gerrit.cloudera.org:8080/8751
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
When Lz4Compressor::MaxOutputLen() returns 0, it
means that the input is too large to compress.
When Lz4Compressor::ProcessBlock() was invoked with
an input that was too large, it silently produced a bogus
result. This bogus result even decompresses
successfully, but not to the data that was
originally compressed.
After this commit, Lz4Compressor::ProcessBlock()
will return an error if it cannot compress the
input.
I also added a comment on Codec::MaxOutputLen()
stating that a return value of 0 means that the
input is too large.
I added some checks after the invocations of
MaxOutputLen() where the compressor can be
an Lz4Compressor.
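A sketch of the caller-side check using the raw LZ4 API
(LZ4_compressBound() returns 0 when the input is too large, which is
what backs the MaxOutputLen() convention; this is not the actual Codec
code):

  #include <vector>
  #include <lz4.h>

  bool CompressWithLz4(const std::vector<char>& in, std::vector<char>* out) {
    if (in.size() > LZ4_MAX_INPUT_SIZE) return false;  // too large to compress
    const int max_out = LZ4_compressBound(static_cast<int>(in.size()));
    if (max_out == 0) return false;  // defensive: also means "too large"
    out->resize(max_out);
    const int written = LZ4_compress_default(
        in.data(), out->data(), static_cast<int>(in.size()), max_out);
    if (written <= 0) return false;
    out->resize(written);
    return true;
  }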
I added an automated test case to
be/src/util/decompress-test.cc.
Change-Id: Ifb0bc4ed98c5d7b628b791aa90ead36347b9fbb8
Reviewed-on: http://gerrit.cloudera.org:8080/8748
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
KuduRPC has support for Kerberos. However, since Impala's client transport
still uses the Thrift transport stack, we need to make sure that a single
security configuration applies to both internal communication (KuduRPC)
and external communication (Thrift's TSaslTransport).
This patch changes InitAuth() to start Sasl regardless of security
configuration, since KRPC uses plain SASL for negotiation on insecure
clusters.
It also moves some utility code out of authentication.cc into
auth-util.cc for reuse by the RpcMgr while enabling Kerberos.
The MiniKDC related code is moved out of thrift-server-test.cc into a
new file called mini-kdc-wrapper.h/cc. This file exposes a new class
MiniKdcWrapper which can be easily used by the tests to configure the
Kerberos environment, create the keytab, start the KDC and also
initialize the Impala security library.
Tests are added to rpc-mgr-test for Kerberos tests over KRPC.
thrift-server-test also has a mechanical change to use MiniKdcWrapper.
Also tested on a live cluster configured to use Kerberos.
Change-Id: I8cec5cca5fdb4b1d46bab19e86cb1a8a3ad718fd
Reviewed-on: http://gerrit.cloudera.org:8080/8270
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
This patch implements a new data stream service which utilizes KRPC.
Similar to the thrift RPC implementation, there are 3 major components
to the data stream services: KrpcDataStreamSender serializes and sends
row batches materialized by a fragment instance to a KrpcDataStreamRecvr.
KrpcDataStreamMgr is responsible for routing an incoming row batch to
the appropriate receiver. The data stream service runs on the port
FLAGS_krpc_port which is 29000 by default.
Unlike the implementation with thrift RPC, KRPC provides an asynchronous
interface for invoking remote methods. As a result, KrpcDataStreamSender
doesn't need to create a thread per connection. There is one connection
between two Impalad nodes for each direction (i.e. client and server).
Multiple queries can multiplex on the same connection for transmitting
row batches between two Impalad nodes. The asynchronous interface also
avoids the possibility that a thread is stuck in the RPC code
for an extended amount of time without checking for cancellation. A TransmitData()
call with KRPC is in essence a trio of RpcController, a serialized protobuf
request buffer and a protobuf response buffer. The call is invoked via a
DataStreamService proxy object. The serialized tuple offsets and row batches
are sent via "sidecars" in KRPC to avoid extra copy into the serialized
request buffer.
Each impalad node creates a singleton DataStreamService object at start-up
time. All incoming calls are served by a service thread pool created as part
of DataStreamService. By default, the number of service threads equals the
number of logical cores. The service threads are shared across all queries so
the RPC handler should avoid blocking as much as possible. In the thrift RPC
implementation, a thrift thread handling a TransmitData() RPC may block
for an extended period of time if the receiver has not yet been created when
the call arrives. In the KRPC implementation, we store TransmitData() or
EndDataStream() requests which arrive before the receiver is ready in a
per-receiver early sender list stored in KrpcDataStreamMgr. These RPC calls
will be processed and responded to when the receiver is created or when a
timeout occurs.
Similarly, there is limited space in the sender queues in KrpcDataStreamRecvr.
If adding a row batch to a queue in KrpcDataStreamRecvr would cause the buffer
limit to be exceeded, the request will be stashed in a queue for deferred
processing. The stashed RPC requests will not be responded to until they are
processed, so as to exert back pressure on the senders. An alternative would
be to reply with an error, requiring the request / row batches to be sent
again. That may end up consuming more network bandwidth than the thrift RPC
implementation. This change adopts the behavior of allowing one stashed
request per sender.
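A generic sketch of the deferred-RPC idea (not the KrpcDataStreamMgr
code): responses for stashed requests are withheld until the receiver
can accept the batch, which is what exerts back pressure on the sender.

  #include <deque>
  #include <functional>
  #include <mutex>

  class DeferredRpcQueue {
   public:
    void Stash(std::function<void()> respond) {
      std::lock_guard<std::mutex> l(lock_);
      pending_.push_back(std::move(respond));  // do not respond yet
    }
    // Called when the receiver is created or buffer space frees up.
    void ProcessOne() {
      std::function<void()> respond;
      {
        std::lock_guard<std::mutex> l(lock_);
        if (pending_.empty()) return;
        respond = std::move(pending_.front());
        pending_.pop_front();
      }
      respond();  // unblocks the sender so it can transmit its next batch
    }
   private:
    std::mutex lock_;
    std::deque<std::function<void()>> pending_;
  };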
All rpc requests and responses are serialized using protobuf. The equivalent of
TRowBatch would be ProtoRowBatch which contains a serialized header about the
meta-data of the row batch and two Kudu Slice objects which contain pointers to
the actual data (i.e. tuple offsets and tuple data).
This patch is based on an abandoned patch by Henry Robinson.
TESTING
-------
* Builds {exhaustive/debug, core/release, asan} passed with FLAGS_use_krpc=true.
TO DO
-----
* Port some BE tests to KRPC services.
Change-Id: Ic0b8c1e50678da66ab1547d16530f88b323ed8c1
Reviewed-on: http://gerrit.cloudera.org:8080/8023
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:
- The DiskIoMgr hits EOF unexpectedly, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in the IO buffer.
In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.
To fix this, we change the I/O manager to always return DISK_IO_ERROR
for errors and we abort a query if we receive such an error in the
scanner.
This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.
I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.
Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.
Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: <UTF8 Garbage>
Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Reviewed-on: http://gerrit.cloudera.org:8080/8011
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Adds GetBackendAddress() (which is host:port) to the error messages for
SCRATCH_LIMIT_EXCEEDED, SCRATCH_READ_TRUNCATED, and
SCRATCH_ALLOCATION_FAILED.
Testing:
* Unit tests assert the string is updated for SCRATCH_LIMIT_EXCEEDED
and SCRATCH_ALLOCATION_FAILED. SCRATCH_READ_TRUNCATED doesn't
have an existing test, and I didn't add a new one.
* Manually testing a query that spills after "chmod 000 /tmp/impala-scratch":
$ chmod 000 /tmp/impala-scratch
$ impala-shell
[dev:21000] > set mem_limit=100m;
MEM_LIMIT set to 100m
[dev:21000] > select count(*) from tpch_parquet.lineitem join tpch_parquet.orders on l_orderkey = o_orderkey;
Query: select count(*) from tpch_parquet.lineitem join tpch_parquet.orders on l_orderkey = o_orderkey
Query submitted at: 2017-09-11 11:07:06 (Coordinator: http://dev:25000)
Query progress can be monitored at: http://dev:25000/query_plan?query_id=5c48ff8f4103c194:1b40a6c00000000
WARNINGS: Could not create files in any configured scratch directories (--scratch_dirs=/tmp/impala-scratch) on backend 'dev:22002'. See logs for previous errors that may have prevented creating or writing scratch files.
Opening '/tmp/impala-scratch/5c48ff8f4103c194:1b40a6c00000000_08e8d63b-169d-4571-a0fe-c48fa08d73e6' for write failed with errno=13 description=Error(13): Permission denied
Opening '/tmp/impala-scratch/5c48ff8f4103c194:1b40a6c00000000_08e8d63b-169d-4571-a0fe-c48fa08d73e6' for write failed with errno=13 description=Error(13): Permission denied
Opening '/tmp/impala-scratch/5c48ff8f4103c194:1b40a6c00000000_08e8d63b-169d-4571-a0fe-c48fa08d73e6' for write failed with errno=13 description=Error(13): Permission denied
Change-Id: If31a50fdf6031312d0348d48aeb8f9688274cac2
Reviewed-on: http://gerrit.cloudera.org:8080/7816
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The boost thread constructor will throw boost::thread_resource_error
if it is unable to spawn a thread on the system
(e.g. due to a ulimit). This uncaught exception crashes
Impala. Systems with a large number of nodes and threads
are hitting this limit.
This change catches the exception from the thread
constructor and converts it to a Status. This requires
several changes:
1. util/thread.h's Thread constructor is now private
and all Threads are constructed via a new Create()
static factory method (see the sketch after this list).
2. util/thread-pool.h's ThreadPool requires that Init()
be called after the ThreadPool is constructed.
3. To propagate the Status, Threads cannot be created in
constructors, so this is moved to initialization methods
that can return Status.
4. Threads now use unique_ptr's for management in all
cases. Threads cannot be used as stack-allocated local
variables or direct declarations in classes.
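A sketch of the factory pattern from item 1 (std::thread and
std::system_error are used here for brevity; the real code wraps
boost::thread and catches boost::thread_resource_error):

  #include <memory>
  #include <string>
  #include <system_error>
  #include <thread>

  class ThreadSketch {
   public:
    template <typename F>
    static bool Create(const std::string& name, F f,
                       std::unique_ptr<ThreadSketch>* out, std::string* err) {
      try {
        out->reset(new ThreadSketch(std::thread(std::move(f))));
        return true;
      } catch (const std::system_error& e) {
        *err = "Could not create thread " + name + ": " + e.what();
        return false;  // the caller converts this into a Status
      }
    }
    ~ThreadSketch() { if (t_.joinable()) t_.join(); }
   private:
    explicit ThreadSketch(std::thread t) : t_(std::move(t)) {}
    std::thread t_;
  };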
Query execution code paths will now handle the error:
1. If the scan node fails to spawn any scanner thread,
it will abort the query.
2. Failing to spawn a fragment instance from the query
state in StartFInstances() will correctly report the error
to the coordinator and tear down the query.
Testing:
This introduces the parameter thread_creation_fault_injection,
which will cause Thread::Create() calls in eligible
locations to fail randomly roughly 1% of the time.
Quite a few locations of Thread::Create() and
ThreadPool::Init() are necessary for startup and cannot
be eligible. However, all the locations used for query
execution are marked as eligible and governed by this
parameter. The code was tested by setting this parameter
to true and running queries to verify that queries either
run to completion with the correct result or fail with
appropriate status.
Change-Id: I15a2f278dc71892b7fec09593f81b1a57ab725c0
Reviewed-on: http://gerrit.cloudera.org:8080/7730
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Augment the error message to mention that oversubscription is likely the
problem and hint at solutions.
Change-Id: I8e367e1b0cb08e11fdd0546880df23b785e3b7c9
Reviewed-on: http://gerrit.cloudera.org:8080/7861
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Sometimes the client is not open when the debug action fires at the
start of Open() or Prepare(). In that case we should set the
probability when the client is opened later.
This caused one of the large row tests to start failing with a "failed
to repartition" error in the aggregation. The error is a false positive
caused by two distinct keys hashing to the same partition. Removing the
check allows the query to succeed because the keys hash to different
partitions in the next round of repartitioning.
If we repeatedly get unlucky and have collisions, the query will still
fail when it reaches MAX_PARTITION_DEPTH.
Testing:
Ran TestSpilling in a loop for a couple of hours, including the
exhaustive-only tests.
Change-Id: Ib26b697544d6c2312a8e1fe91b0cf8c0917e5603
Reviewed-on: http://gerrit.cloudera.org:8080/7771
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Adds support for a "max_row_size" query option that instructs Impala
to reserve enough memory to process rows of the specified size. For
spilling operators, the planner reserves enough memory to process
rows of this size. The advantage of this compared to simply
specifying larger values for min_spillable_buffer_size and
default_spillable_buffer_size is that operators may be able to
handle larger rows without increasing the size of all their
buffers.
The default value is 512KB. I picked that number because it doesn't
increase minimum reservations *too* much even with smaller buffers
like 64kb but should be large enough for almost all reasonable
workloads.
This is implemented in the aggs and joins using the variable page size
support added to BufferedTupleStream in an earlier commit. The synopsis
is that each stream requires reservation for one default-sized page
per read and write iterator, and temporarily requires reservation
for a max-sized page when reading or writing larger pages. The
max-sized write reservation is released immediately after the row
is appended and the max-size read reservation is released after
advancing to the next row.
The sorter and analytic simply use max-sized buffers for all pages
in the stream.
Testing:
Updated existing planner tests to reflect default max_row_size. Added
new planner tests to test the effect of the query option.
Added "set" test to check validation of query option.
Added end-to-end tests exercising spilling operators with large rows
with and without spilling induced by SET_DENY_RESERVATION_PROBABILITY.
Change-Id: Ic70f6dddbcef124bb4b329ffa2e42a74a1826570
Reviewed-on: http://gerrit.cloudera.org:8080/7629
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Rejects queries during admission control if:
* the largest (across all backends) min buffer reservation is
greater than the query mem_limit or buffer_pool_limit
* the sum of the min buffer reservations across the cluster
is larger than the pool max mem resources
There are some other interesting cases to consider later:
* every per-backend min buffer reservation is less than the
associated backend's process mem_limit; the current
admission control code doesn't know about other backend's
proc mem_limits.
Also reduces minimum non-reservation memory (IMPALA-5810).
See the JIRA for experimental results that show this
slightly improves min memory requirements for small queries.
One reason to tweak this is to compensate for the fact that
BufferedBlockMgr didn't count small buffers against the
BlockMgr limit, but BufferPool counts all buffers against
it.
Testing:
* Adds new test cases in test_admission_controller.py
* Adds BE tests in reservation-tracker-test for the
reservation-util code.
Change-Id: Iabe87ce8f460356cfe4d1be4d7092c5900f9d79b
Reviewed-on: http://gerrit.cloudera.org:8080/7678
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
Remove the BTS_BLOCK_OVERFLOW error code, which is no longer used and
which referenced --read_size.
Improve the flag description. The output is now:
-read_size ((Advanced) The preferred I/O request size in bytes to issue to
HDFS or the local filesystem. Increasing the read size will increase
memory requirements. Decreasing the read size may decrease I/O
throughput.) type: int32 default: 8388608
Testing:
Tested that Impala built and basic queries could run.
Change-Id: I3c20a9d55f89170b11f569c90b7f2949ddbe4211
Reviewed-on: http://gerrit.cloudera.org:8080/7623
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.
Port ExecNodes to use BufferPool:
* Each ExecNode has to claim its reservation during Open()
* Port Sorter to use BufferPool.
* Switch from BufferedTupleStream to BufferedTupleStreamV2
* Port HashTable to use BufferPool via a Suballocator.
This also makes PAGG memory consumption more efficient (avoids wasting
buffers) and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)
TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.
Testing:
* Updated tests to reflect new memory requirements
Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This change separates Expr and ExprContext. This is a preparatory
step for factoring out static data (e.g. Exprs) of plan fragments
to be shared by multiple plan fragment instances.
This change includes the following:
1. Include aggregate functions (AggFn) as Expr. This separates
AggFn from its evaluator. AggFn is similar to existing Expr
as both are represented as a tree of Expr nodes but it doesn't
really make sense to call Get*Val() on AggFn. This change
restructures the class hierarchy: much of the existing Expr
class is now renamed to ScalarExpr. Expr is the parent class
of both AggFn and ScalarExpr. Expr is defined to be a tree
with root of either AggFn or ScalarExpr and all descendants
being ScalarExpr.
2. ExprContext is renamed to ScalarExprEvaluator which is the
interface for evaluating ScalarExpr; AggFnEvaluator is the
interface for evaluating AggFn. Multiple evaluators can be
instantiated per Expr. Expr contains static states of an
expression while evaluator contains runtime states needed
for execution (i.e. evaluating the expression).
3. Update all exec nodes to instantiate Expr and their evaluators
separately. ExecNode::Init() will be responsible for creating
all the Exprs in an ExecNode while their evaluators are created
in ExecNode::Prepare(). Certain evaluators are also moved into
the data structures which actually utilize them. For instance,
HashTableCtx now owns the build and probe expression evaluators.
Similarly, TupleRowComparator and Sorter also own the evaluators.
ExecNode which utilizes these data structures are only responsible
for creating the expressions used by these data structures.
4. All codegen functions take Exprs instead of evaluators. Also, codegen
functions will not return an error status should the IR function fail the
LLVM verification step.
5. The assignment of index into the FunctionContext vector is now done
during the construction of ScalarExpr. Evaluators are only responsible
for allocating and initializing the FunctionContexts.
6. Open(), Prepare() are now removed from Expr classes. The interface
for creating any Expr is via either ScalarExpr::Create() or AggFn::Create()
which will convert a thrift Expr into an initialized Expr object.
Similarly, a Create() interface is used for creating evaluators from an
initialized Expr object.
This separation allows the future change to introduce PlanNode data structures.
The plan is to move all ExecNode::Init() logic to PlanNode and call them once
per plan fragment.
Change-Id: Iefdc9aeeba033355cb9497e3a5d2363627dcf2f3
Reviewed-on: http://gerrit.cloudera.org:8080/5483
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
The stream defaults to pages of default_page_len_. If a row doesn't
fit in that page, it will allocate another page up to max_page_len_
bytes and append a single row to that page, then immediately unpin
the page. This means that when writing a stream, the large
page only needs to be kept in memory temporarily, which helps with
memory requirements. E.g. consider a hash join that is repartitioning
1 unpinned stream into 16 unpinned streams. We will need
default_page_len_ * 15 + max_page_len_ * 2 bytes of reservation because
when processing a large row we only need one large write buffer at a
time.
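The reservation arithmetic in that example, written out as a small
helper (purely illustrative):

  #include <cstdint>

  // 16 output streams each need a default-sized write page, except that at
  // any moment one of them may instead hold a max-sized page for a large row,
  // and the input stream needs a max-sized read buffer for that same row.
  int64_t RepartitionReservation(int64_t default_page_len, int64_t max_page_len,
                                 int num_output_streams) {
    return default_page_len * (num_output_streams - 1) + max_page_len * 2;
  }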
Also switches the stream to lazily allocating write pages, so that
we don't need to allocate a page until we know the size of the row
to go in it. This required a mechanism to "save" reservation in
PrepareForRead()/PrepareForWrite(). A SubReservation API is added
to BufferPool for this purpose and the stream now saves read and
write reservation for lazy page allocation. It also saves reservation
instead of double-pinning pages in the read/write case.
The large row cases are not as optimised for memory consumption or
performance - queries processing very large numbers of large rows
are an extreme edge case that is likely to hit other performance
bottlenecks first. Pages with large rows can have up to 50%
internal fragmentation.
To avoid duplicating more logic between AddRow() and AllocateRow()
I restructured things so that AddRowSlow() is implemented in terms
of AllocateRowSlow(). AllocateRow() now takes a function as an
argument to populate the row.
Testing:
* Added tests for the case where 0 rows are added to the stream
* Extend BigRow to exercise the new code.
* Also test large strings and read/write streams.
Change-Id: I2861c58efa7bc1aeaa5b7e2f043c97cb3985c8f5
Reviewed-on: http://gerrit.cloudera.org:8080/6638
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Adds Impala support for TIMESTAMP types stored in Kudu.
Impala stores TIMESTAMP values in 96-bits and has nanosecond
precision. Kudu's timestamp is a 64-bit microsecond delta
from the Unix epoch (called UNIXTIME_MICROS), so a conversion
is necessary.
When writing to Kudu, TIMESTAMP values in nanoseconds are
rounded to the nearest microsecond.
When reading from Kudu, the KuduScanner returns
UNIXTIME_MICROS with 8 bytes of padding so Impala can convert
the value to a TimestampValue in-line and copy the entire
row.
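A sketch of the precision conversion (standalone, ignoring pre-epoch
values, and not the actual KuduTableSink/KuduScanner code):

  #include <cstdint>

  // Writing to Kudu: round nanoseconds to the nearest microsecond.
  int64_t NanosToKuduMicros(int64_t nanos_since_epoch) {
    return (nanos_since_epoch + 500) / 1000;
  }

  // Reading from Kudu: exact, but only microsecond precision is available.
  int64_t KuduMicrosToNanos(int64_t unixtime_micros) {
    return unixtime_micros * 1000;
  }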
Testing:
Updated the functional_kudu schema to use TIMESTAMPs instead
of converting to STRING, so this provides some decent
coverage. Some BE tests were added, and some EE tests as
well.
TODO: Support pushing down TIMESTAMP predicates
TODO: Support TIMESTAMPs in range partitioning expressions
Change-Id: Iae6ccfffb79118a9036fb2227dba3a55356c896d
Reviewed-on: http://gerrit.cloudera.org:8080/6526
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.
After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
will not adjust it if the property is absent. No adjustment will be
applied to data written by Impala.
New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE <file> will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.
HDFS tables created by Impala using CREATE TABLE LIKE <other table>
will copy the property of the table that is copied.
This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag will be taken into account only if
parquet.mr.int96.write.zone table property is not set and ignored
otherwise.
Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Reviewed-on: http://gerrit.cloudera.org:8080/5939
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Support allocating with mmap instead of TCMalloc to give more control
over memory usage. Also tell Linux to back larger buffers with huge
pages when possible to reduce TLB pressure. The main complication is
that memory returned by mmap() is not necessarily aligned to a huge
page boundary, so we need to "fix up" the mapping ourselves.
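A Linux-specific sketch of the fix-up (over-allocate, trim to a 2MB
boundary, then advise huge pages); it assumes 'len' is a multiple of
the 2MB huge page size, omits error handling, and is not the actual
buffer pool allocator:

  #include <sys/mman.h>
  #include <cstddef>
  #include <cstdint>

  void* MmapHugeAligned(size_t len) {
    const size_t HUGE = 2 * 1024 * 1024;
    uint8_t* p = static_cast<uint8_t*>(mmap(nullptr, len + HUGE,
        PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (p == MAP_FAILED) return nullptr;
    uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    size_t head = (HUGE - addr % HUGE) % HUGE;  // bytes before the 2MB boundary
    uint8_t* aligned = p + head;
    if (head > 0) munmap(p, head);         // trim the unaligned head
    munmap(aligned + len, HUGE - head);    // trim the leftover tail
    madvise(aligned, len, MADV_HUGEPAGE);  // request transparent huge pages
    return aligned;
  }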
Adds additional memory metrics, since we previously relied on the
assumption that all memory was allocated through TCMalloc.
memory.total-used tracks the total across the buffer pool and
TCMalloc. When the buffer pool is not present, they just report
the TCMalloc values.
This can be enabled with the --mmap_buffers flag. The transparent
huge pages support can be disabled with the --madvise_huge_pages
startup flag.
At some point this should become the default, but it requires
more work to validate performance and resource usage (virtual address
space, etc).
Testing:
Added some unit tests to test edge cases and the different supported
flags. Many pre-existing tests also exercise the modified code.
Change-Id: Ifbc748f74adcbbdcfa45f3ec7df98284925acbd6
Reviewed-on: http://gerrit.cloudera.org:8080/6474
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Adds tests for read errors from permissions (i.e. open() fails),
corrupt data (integrity check fails) and truncated files (read() fails).
Fixes a couple of bugs:
* Truncated reads were not detected in TmpFileMgr
* IoMgr buffers weren't returned on error paths (this isn't a true leak
but results in DCHECKs being hit).
Change-Id: I3f2b93588dd47f70a4863ecad3b5556c3634ccb4
Reviewed-on: http://gerrit.cloudera.org:8080/6562
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The test should allow Unpin() to fail with a scratch allocation error to
handle the case where the first write fails and blacklists the scratch
disk around the same time that the second write starts. Usually either
the second write succeeds because it started before the first write
failed or it fails with CANCELLED because the
BufferedBlockMgr::is_cancelled_ flag is set. There is a small
window for a race after the disk is blacklisted in TmpFileMgr but
before BufferedBlockMgr::WriteComplete() is called.
Testing:
I was able to reproduce the problem locally by adding some delays
to the test. I added a variant of the WriteError test that more reliably
reproduces the bug. Ran both WriteError tests in a loop locally to try
to flush out flakiness.
Change-Id: I9878d7000b03a64ee06c2088a8c30e318fe1d2a3
Reviewed-on: http://gerrit.cloudera.org:8080/5940
Tested-by: Impala Public Jenkins
Reviewed-by: Michael Ho <kwho@cloudera.com>
- Removes the runtime unknown disk ID reporting and instead moves
it to the explain plan as a counter that prints the number of
scan ranges missing disk IDs in the corresponding HDFS scan nodes.
- Adds a warning to the header of query profile/explain plan with a
list of tables missing disk ids.
- Removes reference to enabling dfs block metadata configuration,
since it doesn't apply anymore.
- Removes VolumeId terminology from the runtime profile.
Change-Id: Iddb132ff7ad66f3291b93bf9d8061bd0525ef1b2
Reviewed-on: http://gerrit.cloudera.org:8080/5828
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins
Before this patch, we would simply read the INT96 Parquet timestamp
representation and assume that it's valid. However, not all bit
permutations represent a valid timestamp. One of the boost functions
raised an exception (that we didn't catch) when passed an invalid
boost date object, which resulted in a crash. This patch fixes the
problem by validating that the date falls into the 1400..9999 year
range as we are scanning Parquet.
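Sketch of the shape of the guard (the exact Julian-day bounds for
1400-01-01 and 9999-12-31 are not reproduced here; this is not the
actual scanner code):

  #include <cstdint>

  // Runs before the value is handed to boost, which throws on invalid dates.
  bool IsValidParquetJulianDay(int32_t julian_day, int32_t min_julian_day,
                               int32_t max_julian_day) {
    return julian_day >= min_julian_day && julian_day <= max_julian_day;
  }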
Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac
Reviewed-on: http://gerrit.cloudera.org:8080/5343
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Second part of IMPALA-3710, which removed the IGNORE DML
option and changed the following errors on Kudu DML
operations to be ignored:
1) INSERT where the PK already exists
2) UPDATE/DELETE where the PK doesn't exist
This changes other data-related errors to be ignored as
well:
3) NULLs in non-nullable columns, i.e. null constraint
violations.
4) Rows with PKs that are in an 'uncovered range'.
It became clear that we can't differentiate between (3) and
(4) because both return a Kudu 'NotFound' error code. The
Impala error codes have been simplified as well: we just
report a generic KUDU_NOT_FOUND error in these cases.
This also adds some metadata to the thrift report sent to
the coordinator from sinks so the total number of rows with
errors can be added to the profile. Note that this does not
include a breakdown of error counts by type/code because we
cannot differentiate between all of these cases yet.
An upcoming change will add this new info to the beeswax
interface and show it in the shell output (IMPALA-3713).
Testing: Updated kudu_crud tests to check the number of rows
with errors.
Change-Id: I4eb1ad91dc355ea51de261c3a14df0f9d28c879c
Reviewed-on: http://gerrit.cloudera.org:8080/4985
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch prevents an invalid decimal type in an Avro file schema from
crashing Impala. Most invalid Avro schemas are caught by the frontend,
but file schemas still need to be validated by the backend.
After this patch files with bad schemas are skipped.
Testing:
This was hit very rarely by the scanner fuzzing. Added a regression test that
scans a file with a bad schema.
Change-Id: I25a326ee2220bc14d3b5f887dc288b4adf859cfc
Reviewed-on: http://gerrit.cloudera.org:8080/4876
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
1.) IMPALA-4134: Use Kudu AUTO FLUSH
Improves performance of writes to Kudu up to 4.2x in
bulk data loading tests (load 200 million rows from
lineitem).
2.) IMPALA-3704: Improve errors on PK conflicts
The Kudu client reports an error for every PK conflict,
and all errors were being returned in the error status.
As a result, an insert/update/delete could return an error status
with thousands of errors reported. This changes the error
handling to log all reported errors as warnings and
return only the first error in the query error status.
3.) Improve the DataSink reporting of the insert stats.
The per-partition stats returned by the data sink weren't
useful for Kudu sinks. Firstly, the number of appended rows
was not being displayed in the profile. Secondly, the
'stats' field isn't populated for Kudu tables and thus was
confusing in the profile, so it is no longer printed if it
is not set in the thrift struct.
Testing: Ran local tests, including new tests to verify
the query profile insert stats. Manual cluster testing was
conducted of the AUTO FLUSH functionality, and that testing
informed the default mutation buffer value of 100MB which
was found to provide good results.
Change-Id: I5542b9a061b01c543a139e8722560b1365f06595
Reviewed-on: http://gerrit.cloudera.org:8080/4728
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
The scheduler crashed with a segmentation fault when there were no
backends registered: After not being able to find a local backend (none
are configured at all) in ComputeScanRangeAssignment(), the previous
code would eventually try to return the top of
assignment_ctx.assignment_heap in SelectRemoteBackendHost(), but that
heap would be empty. Subsequently, when using the IP address of that
heap node, a segmentation fault would occur.
This change adds a check and aborts scheduling with an error. It also
contains a test.
Change-Id: I6d93158f34841ea66dc3682290266262c87ea7ff
Reviewed-on: http://gerrit.cloudera.org:8080/4776
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Adds code comments and issues a warning for Parquet files
with num_rows=0 but at least one non-empty row group.
Change-Id: I72ccf00191afddb8583ac961f1eaf11e5eb28791
Reviewed-on: http://gerrit.cloudera.org:8080/4696
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*
A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:
find . | grep "\.java\|\.py\|\.h\|\.cc" | \
xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g
along with some manual fixes.
After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/
Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
This patch implements basic in-memory buffer management, with
reservations managed by ReservationTrackers.
Locks are fine-grained so that the buffer pool can scale to many
concurrent queries.
Includes basic tests for buffer pool setup, allocation and reservations.
Change-Id: I4bda61c31cc02d26bc83c3d458c835b0984b86a0
Reviewed-on: http://gerrit.cloudera.org:8080/4070
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Currently we can only disable spilling via a startup option which means
we need to restart the cluster for this.
This patch adds a new query option 'SCRATCH_LIMIT' that limits the amount of
scratch directory space that can be used. This would be useful to prevent
runaway queries or to prevent queries from spilling when that is not desired.
This also adds a 'ScratchSpace' counter to the runtime profile of the
BlockMgr that keeps track of the scratch space allocated.
Valid values for the SCRATCH_LIMIT query option are:
- unspecified or a limit of -1 means no limit
- a limit of 0 (zero) means spilling is disabled
- an int (= number of bytes)
- a float followed by "M" (MB) or "G" (GB)
Testing:
A new test file "test_scratch_limit.py" was added for testing functionality.
Change-Id: Ibf8842626ded1345b632a0ccdb9a580e6a0ad470
Reviewed-on: http://gerrit.cloudera.org:8080/4497
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch adds a configurable timeout for all backend client
RPC to avoid query hang issue.
Prior to this change, Impala did not set a socket send/recv timeout
for the backend client, so an RPC would wait forever for data. In
extreme cases of a bad network or a destination host with a kernel
panic, the sender will not get a response and the RPC will hang.
A query hang is hard to detect. If the hang happens in
ExecRemoteFragment() or CancelPlanFragments(), the query cannot be
cancelled unless you restart the coordinator.
Added a send/recv timeout to all RPCs to avoid query hangs. For the
catalog client, the default timeout is kept at 0 (no timeout) because
ExecDdl() could take a very long time if a table has many partitions,
mainly waiting for the HMS API call.
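A sketch of setting the timeouts on a Thrift client socket (the real
code plumbs these values through Impala's client cache; this standalone
helper is illustrative only):

  #include <memory>
  #include <string>
  #include <thrift/transport/TSocket.h>

  using apache::thrift::transport::TSocket;

  std::shared_ptr<TSocket> MakeBackendSocket(
      const std::string& host, int port,
      int send_timeout_ms, int recv_timeout_ms) {
    auto socket = std::make_shared<TSocket>(host, port);
    socket->setSendTimeout(send_timeout_ms);  // 0 means wait forever
    socket->setRecvTimeout(recv_timeout_ms);
    return socket;
  }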
Added a wrapper RetryRpcRecv() to wait longer for the receiver's
response. This is needed by certain RPCs: for example, for
TransmitData() sent by a DataStreamSender, the receiver could hold
the response to apply back pressure.
If an RPC fails, the connection is left in an unrecoverable state, so
we don't put the underlying connection back into the cache but close
it. This makes sure a broken connection won't cause more RPC failures.
Added a retry for the CancelPlanFragment RPC. This reduces the chance
that the cancel request gets lost due to an unstable network, but it
can cause cancellation to take longer and makes test_lifecycle.py more
flaky. The metric num-fragments-in-flight might not be 0 yet due to
previous tests, so the test was modified to check the metric delta
instead of comparing to 0 to reduce flakiness. However, this might not
capture some failures.
Besides the new EE test, I used the following iptables rule to
inject network failure to verify RPCs never hang.
1. Block network traffic on a port completely
iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slowdown network
iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP
Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Reviewed-on: http://gerrit.cloudera.org:8080/3343
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in decompressor and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.
Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.
In addition, the byte size of a tuple is still assumed to be
within 32-bit. If it needs to be upgraded to 64-bit, it will be
done in a separate change.
A new test has been added to test the decompression of a 2GB
snappy block compressed text file.
Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f
Reviewed-on: http://gerrit.cloudera.org:8080/3575
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This reverts commit 1ffb2bd5a2a2faaa759ebdbaf49bf00aa8f86b5e.
Unbreak the packaging builds for now.
Change-Id: Id079acb83d35b51ba4dfe1c8042e1c5ec891d807
Reviewed-on: http://gerrit.cloudera.org:8080/3543
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Michael Ho <kwho@cloudera.com>
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in decompressor and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.
Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.
In addition, the byte size of a tuple is still assumed to be
within 32-bit. If it needs to be upgraded to 64-bit, it will be
done in a separate change.
Change-Id: I7ed28083d809a86d801a9c063a0aa32c50d32b20
Reviewed-on: http://gerrit.cloudera.org:8080/2781
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins