CFB mode is a stream cipher and is secure when used with a different nonce/IV
for every message. However it can be a performance bottleneck.
CTR mode is also stream cipher and is secure, 4~6x faster than CFB mode in
OpenSSL. AES-CTR+SHA256 is about 40~70% faster than AES-CFB+SHA256.
CTR mode is used if OpenSSL version>=1.0.1 at runtime, otherwise
fall back to using CFB mode.
Testing:
run runtime tmp-file-mgr-test, openssl-util-test, buffer-pool-test and
buffered-tuple-stream-test
The ut case openssl-util-test.EncryptInPlace tests encryption in both modes.
Change-Id: I9debc240615dd8cdbf00ec8730cff62ffef52aff
Reviewed-on: http://gerrit.cloudera.org:8080/8861
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Implement a test that generates random decimal numbers in the pytest
framework, performs a random mathemtaical operation in Impala and
verifies that the result is correct by doing the same operating using
the Python decimal module. We try to generate not only completely random
decimal numbers, but also numbers that have interesting properties, such
as the number being a power of two.
Change-Id: I4328125de5c583ec8ead1f78d9a08703b18b2d85
Reviewed-on: http://gerrit.cloudera.org:8080/8898
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
Allow to specify Oracle-style hint on INSERT/UPSERT statements. For example,
- insert /* +noshuffle */ into table functional.alltypes partition(year,
month) select * from functional.alltypes;
- upsert /* +noshuffle */ into functional_kudu.alltypes select * from
functional.alltypes;
Testing:
Add unit tests to ParserTest#TestPlanHints
Add plan check tests to PlannerTest#testInsert, PlannerTest#testKuduUpsert
Add tests to ToSqlTest#planHintsTest
Change-Id: Ied7629d70197a0270cdc0853e00cc021fdb4dc20
Reviewed-on: http://gerrit.cloudera.org:8080/8676
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This patch maps a signed integer logical type in parquet to a supported
Impala column type. This change introduces the following mapping -
INT_8 -> TINYINT
INT_16 -> SMALLINT
INT_32 -> INT
INT_64 -> BIGINT
Also, added a parquet file with the following schema for testing -
schema {
optional int32 id;
optional int32 tinyint_col (INT_8);
optional int32 smallint_col (INT_16);
optional int32 int_col;
optional int64 bigint_col;
}
Change-Id: I47a8371858c9597c6a440808cf6f933532468927
Reviewed-on: http://gerrit.cloudera.org:8080/8548
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under replicated. If impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Resubmitted with filesystem type check.
Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8
Reviewed-on: http://gerrit.cloudera.org:8080/8916
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Without this patch, redaction is applied to every field in the
runtime profile. This approach has an undesired side effect when
Kerberos auth + email redaction is in place.
Since the redaction applies to every field, even principals
(from Connected/Delegated User fields) are redacted, as the Kerberos
principal format generally pattern matches with an email redactor
template.
This is particularly problematic for monitoring tools that consume
runtime profiles and use these fields to group the queries by user.
This patch fixes the problem by redacting only the following sensitive
fields.
- Query Statement
- Error logs (since they can contain column references etc.)
- Query Status
- Query Plan
Other fields in the runtime profile are left unredacted.
Change-Id: Iae3b6726009bf458a7ec73131e5d659b12ab73cf
Reviewed-on: http://gerrit.cloudera.org:8080/8934
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins
This commit makes idle_session_timeout a query option.
idle_session_timeout currently can be set as a command line
option, which will be the default timeout for sessions.
HS2 sessions can override it with a smaller value by setting
it in the configuration overlay of HS2 OpenSession().
However, we can't override idle_session_timeout for JDBC/ODBC
connections, because we cannot put this in the connection string.
This commit is a workaround for this problem, it allows JDBC/ODBC
connections to set the session timeout as a query option
with the SET statement.
After this commit, the session timeout can be overridden to
any value, i.e. the command line flag idle_session_timeout
doesn't limit this option anymore.
I created an automated test case in JdbcTest.java based on
test_hs2.py::test_concurrent_session_mixed_idle_timeout. I also
extended the test_session_expiration and test_set_and_unset
test suites.
Change-Id: I32e2775f80da387b0df4195fe2c5435b3f8e585e
Reviewed-on: http://gerrit.cloudera.org:8080/8490
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.
This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.
Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.
Testing
-------
Ran all the backend and end-end tests with no failures.
Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Reviewed-on: http://gerrit.cloudera.org:8080/8034
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Avoid the circular dependency between ReservationTracker::lock_ and
MemTracker::child_trackers_lock_ by not acquiring
ReservationTracker::lock_ in GetReservation(), where an atomic
operation is sufficient.
Testing:
Added a unit test that reproed the deadlock.
Change-Id: Id7adbe961a925075422c685690dd3d1609779ced
Reviewed-on: http://gerrit.cloudera.org:8080/8933
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, all HdfsFileHandles are owned and constructed
by the file handle cache. When the file handle cache
is disabled or the file handle is not eligible for
caching, the HdfsFileHandle is stored exclusively in
ScanRange::exclusive_hdfs_fh_, but the HdfsFileHandle still
comes from the file handle cache. It is created via a call to
DiskIoMgr::GetCachedHdfsFileHandle() with 'require_new_handle'
set to true and destroyed via
DiskIoMgr::ReleaseCachedHdfsFileHandle() with 'destroy_handle'
set to true.
Recent testing has revealed that the lock on the file handle
cache is a bottleneck for workloads with many small remote
files. There is no benefit to storing these exclusive file
handles in the file handle cache, as they do not participate
in the caching.
This change introduces DiskIoMgr::GetExclusiveHdfsFileHandle()
and DiskIoMgr::ReleaseExclusiveHdfsFileHandle(). These are
equivalent to the Get/ReleaseCachedHdfsFileHandle() calls, except
they bypass the file handle cache and create/destroy the
file handle directly. ScanRange::Open()/Close(), which
populates and frees ScanRange::exclusive_hdfs_fh_, now uses
these new calls rather than accessing the file handle cache.
This avoids the locking entirely, solving the bottleneck.
To draw a distinction between the two codepaths, HdfsFileHandle
is now an abstract class with two subclasses:
- CachedHdfsFileHandles cover all handles that live in file handle
cache. Get/ReleaseCachedHdfsFileHandle() use this subclass.
- ExclusiveHdfsFileHandles cover all cases where a file handle
does not come from the cache. The new
Get/ReleaseExclusiveHdfsFileHandle() use this subclass.
Separately, testing revealed that increasing the number of
partitions for the file handle cache also fixes the contention
problem. This changes the file handle cache to make the number
of partitions configurable via startup parameter
num_file_handle_cache_partitions. This allows mitigation of
future bottlenecks without a patch.
Change-Id: I4ab52b0884a909a4faeb6692f32d45878ea2838f
Reviewed-on: http://gerrit.cloudera.org:8080/8945
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
When materialising a nested collection, has_template_tuple() should use
the template tuple for the collection, not the top-level tuple.
Testing:
Added tests based on nested-types-basic.test that operate on a simple
partitioned table. The tests reliably crashed Impala before the fix.
Change-Id: Ic808b824ce3b31af0539036d8ca23d17b18deab4
Reviewed-on: http://gerrit.cloudera.org:8080/8947
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, when debug mode is enabled, any query using codegen can result
in an Impala daemon crash as it hits a DCHECK.
This patch ensures the DCHECK is hit only when specific condition is met to
avoid the crash. That condition here is to DCHECK only when 'emit_perf_map_'
evaluates to True ensuring 'perf_map_lock_' is not empty when asserted.
Change-Id: I93e2b1efb325100d01d398e68e789d87b877167e
Reviewed-on: http://gerrit.cloudera.org:8080/8923
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Because there was no obvious subcategory of known issues to
put this one in, I made a new subcategory 'startup issues'.
Change-Id: Ib039d0102878f1c05470371f581cb258287b9bc0
Reviewed-on: http://gerrit.cloudera.org:8080/7388
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
The bug is that ConvertToInt128(), by design, only sets 'overflow' if an
overflow occurs. This means that caller needs to initialise the overflow
variable to false, otherwise the value when overflow occurs is
undefined.
Testing:
Reprod the expr-test failure under ASAN, confirmed that it passed after
this fix.
Change-Id: Ifd7ac691155442ba7cba71dd3647208b7c1c0bf9
Reviewed-on: http://gerrit.cloudera.org:8080/8929
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This patch fixes several issues related to the min/max aggregate
functions and their handling of 'nan' and 'inf':
- Previously, if 'inf' or '-inf' was the only value for the min/max
and codegen was being used, the result would be incorrect. This
occurred, for example in the case of 'inf' and 'min', because we
set an initial value of numeric_limits::max, which is less than
'inf', so the returned min was numeric_limits::max when it should be
'inf'. The fix is to set the initial value to
numeric_limits::infinity.
- Previously, if one of the values was 'nan', the result of min/max
was non-deterministic depending on the order the values were
evaluated in. This occurs because 'nan' < or > 'any value' is always
false, so if the first value added was 'nan', all other comparisons
would be false and 'nan' would be returned, whereas if the first
value wasn't 'nan' then the 'nan' wouldn't be returned. The fix is
to treat 'nan' specially and to always return 'nan' if there is a
single 'nan' value.
Testing:
- Added e2e tests for both scenarios, as well as adding a little extra
nan/inf coverage for other aggregate functions.
Change-Id: Ia1e206105937ce5afc75ca5044597d39b3dc6a81
Reviewed-on: http://gerrit.cloudera.org:8080/8854
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Apparently test_query_cancellation_during_fetch hangs occasionally
in Jenkins builds. The Impala debug page shows the query being
cancelled, however, on the host the ImpalaShell process related to
that query is still running.
Since I had no luck in reproducing the issue locally I only have a
theory what might be going on here: The query is cancelled
successfully on Impala backend and when the test tries to get the
stdout and stderr from the ImpalaShell it gets stuck. It might be
the case that ImpalaShell process fetching the query results holds
the stdout. According to the documentation of subprocess.communicate()
it may cause issues to fetch data when the data size is large or
unlimited, that we can consider to be the case here.
As a workaround there is a new optional parameter to
util.ImpalaShell to omit the stdout because this test wouldn't use
it anyway and we get rid of fetching the large result from
ImpalaShell.
Change-Id: I082c83b91b6d0c527de92c7992f0dc9d1b290433
Reviewed-on: http://gerrit.cloudera.org:8080/8852
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, the RPC layer accesses many gflags directly to take
certain decisions, eg. whether to turn on encryption,
authentication, etc.
Since the RPC layer is to be used more like a library, these should
be configurable options that are passed to the Messenger
(which is the API endpoint for the application using the RPC layer),
instead of the RPC layer itself directly accessing these flags.
This patch converts the following flags to Messenger options and moves
the flag definitions to server_base.cc which is the "application" in
Kudu that uses the Messenger:
FLAGS_rpc_default_keepalive_time_ms
FLAGS_rpc_negotiation_timeout_ms
FLAGS_rpc_authentication
FLAGS_rpc_encryption
FLAGS_rpc_tls_ciphers
FLAGS_rpc_tls_min_protocol
FLAGS_rpc_certificate_file
FLAGS_rpc_private_key_file
FLAGS_rpc_ca_certificate_file
FLAGS_rpc_private_key_password_cmd
FLAGS_keytab_file
Most of the remaining flags are test or benchmark related flags. There
may be a few more flags that can be moved out and converted to options,
but we can leave that as future work if we decide to move them.
In addition to the cherry-pick above, this change also updates Impala
code to pass the key_tab file to InitKerberosForServer() which was changed
by this Kudu patch.
Change-Id: Ia21814ffb6e283c2791985b089878b579905f0ba
Reviewed-on: http://gerrit.cloudera.org:8080/8789
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert <danburkert@apache.org>
Reviewed-on: http://gerrit.cloudera.org:8080/8878
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds a query rewriter to remove redundant explicit casts to
a string type (string, char, varchar) from binary predicates of the form
"cast(<non-const expr> to <string type>) <eq/ne op> <string constant>".
The cast is redundant if the predicate evaluation is the same even if
the cast is removed and the constant is converted to the original type
of the expression. For example:
cast(int_col as string) = '123456' -> int_col = 123456
Performance:
For the following query on a table having 6001215 records -
select * from tpch.lineitem where cast(l_linenumber as string) = '0'
+-----------------+-----------+--------+
| | Scan Time |
+-----------------+-----------+--------+
| | Avg | St dev |
| Without rewrite | 1s406ms | 44ms |
| With rewrite | 1s099ms | 28ms |
+-----------------+-----------+--------+
Testing:
- Added unit tests to ExprRewriteRulesTest
- Added functional test to expr.test
- Current FE planner tests and BE expr-test run successfully with this
change.
Change-Id: I91b7c6452d0693115f9b9ed9ba09f3ffe0f36b2b
Reviewed-on: http://gerrit.cloudera.org:8080/8660
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds a new flag FLAGS_tcmalloc_max_total_thread_cache_bytes
which specifies the maximum size in bytes which the total of all TCMalloc
thread caches can grow to. By default, it's set to 0 and the default value
in TCMalloc library is used. This change also always enables "aggressive
decommit" feature in TCMalloc instead of just validating that it's enabled
in the code by default.
Testing done: core build;
test_tpch.py with FLAGS_tcmalloc_max_total_thread_cache_bytes set.
Change-Id: Ib968ee7d20458143ef6ac14ad3ac2c4d84d31dc5
Reviewed-on: http://gerrit.cloudera.org:8080/8906
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This change makes sure backend connections with KRPC are always
kept alive by disabling the idle connection detection logic.
Idle connections all tend to be closed and re-opened around the
same time, which may easily lead to negotiation timeouts. Until
KUDU-279 is fixed, closing idle connections is also racy and leads
to query failures.
This disablement was validated with Kudu's rpc-test.
Change-Id: I0871dd9c9bbe455466b9b2d2a2bbedec79cf0775
Reviewed-on: http://gerrit.cloudera.org:8080/8910
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, if an error is encountered during the creation of a
handcrafted codegen method, then the resulting IR is left in an
incomplete state. This patch ensures that all such IRs are cleaned up
(method is deleted from the module) before the llvm module is finalized.
Testing:
- added a backend test to exercise the added code path.
- tested manually by executing the following query:
select * from charTable A, charTable B
where A.charColumn = B.charColumn and A.charColumn = 'foo';
and looking at the logs to verify that 'InsertRuntimeFilters' and
'FilterContextInsert' methods have been removed.
Change-Id: If975cfb3906482b36dd6ede32ca81de6fcee1d7f
Reviewed-on: http://gerrit.cloudera.org:8080/8541
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, a server connection being idle for more than
FLAGS_rpc_default_keepalive_time_ms ms will be closed.
However, certain services (e.g. Impala) using KRPC may want to
keep the idle connections alive for various reasons (e.g. sheer
number of connections to re-establish, negotiation overhead
in a secure cluster). To avoid idle connection from being
closed, one currently have to set FLAGS_rpc_default_keepalive_time_ms
to a very large value.
This change implements a cleaner solution by disabling idle
connection scanning if FLAGS_rpc_default_keepalive_time_ms is
set to any negative value. This avoids the unnecessary
overhead of scanning for idle server connections and alleviates
the user from having to pick a random large number to make sure
the connection is always kept alive.
Change-Id: I6161b9e753f05620784565a417d248acf8e7050a
Reviewed-on: http://gerrit.cloudera.org:8080/8831
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <todd@apache.org>
Reviewed-on: http://gerrit.cloudera.org:8080/8909
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
In order to compute the modulo of two decimals, we need to bring the
underlying datatype to the same scale first. It turns out we could
overflow when scaling up one of the values.
In this patch we fix the problem by using a larger data type when we
detect that the scaled up value will not fit into the original data
type.
Testing:
- Added some expr tests that reproduce the issue.
Change-Id: I27c7f25f68353c19c315e1639311ec06b2dea686
Reviewed-on: http://gerrit.cloudera.org:8080/8833
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
In this patch we implement rounding when casting string to decimal if
DECIMAL_V2 is enabled. The backend method that parses strings and
converts them to decimals is refactored to make it easier to understand.
Testing:
- Added some BE tests.
Change-Id: Icd8b92727fb384e6ff2d145e4aab7ae5d27db26d
Reviewed-on: http://gerrit.cloudera.org:8080/8774
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
The layout of the Impala-auxiliary-tests/tests directory is
changing to allow for different kinds of tests to be saved
there. But just in case the new functional sub-directory
does not exist, preserve backwards compatibility with the
older layout.
Change-Id: Ifb2bbbebc38bbaf3d6a4ad01fa8dd918b7d99b3b
Reviewed-on: http://gerrit.cloudera.org:8080/8896
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
precision.
This commit follows 16d8dd58.
This patch adds a test case that inspects the thrift profile of a
completed query, and verifies that the "Start Time" and
"End Time" of the query have nanosecond precision. We chose to
work with the thrift profile directly, rather than parse the debug
web page, as it is the thrift profile which is consumed by
management API clients of Impala.
Change-Id: Id3421a34cc029ebca551730084c7cbd402d5c109
Reviewed-on: http://gerrit.cloudera.org:8080/8784
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This adds a workaround for an issue reported on the user mailing list.
Some systems are configured such that the auth_to_local mapping provided
by the krb5 library doesn't work properly for service accounts.
This patch adds a new configuration which allows Kudu to disregard the
system auth_to_local rules and instead just map kerberos principals to
their first component, which is typically the username.
Change-Id: I2e893493f52965ea54d2ceaac83d375285b49486
Reviewed-on: http://gerrit.cloudera.org:8080/8373
Reviewed-by: Alexey Serbin <aserbin@cloudera.com>
Reviewed-by: Dan Burkert <danburkert@apache.org>
Tested-by: Kudu Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/8875
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
As a follow-on to centralizing into one parent pom, we can now manage
thirdparty dependency versions in Java a little bit more clearly.
Upgrades SLF4J, commons.io:
slf4j: 1.7.5 -> 1.7.25
commons.io: 2.4 -> 2.6
The SLF4J upgrade is nice to be able to run under Java9. The release
notes at https://www.slf4j.org/news.html are uneventful.
Commons IO 2.6 supports Java 9 and is source and binary compatible,
per https://commons.apache.org/proper/commons-io/upgradeto2_6.html and
https://commons.apache.org/proper/commons-io/upgradeto2_5.html.
Removes the following dependencies:
htrace-core
hadoop-mapreduce-client-core
hive-shims
com.stumbleupon:async
commons-dbcp
jdo-api
I ran "mvn dependency:analyze" and these were some (but not all)
of the "Unused declared dependencies found." Spelunking in git logs,
these dependencies are from 2013 and possibly from an effort
to run with dependencies from the filesystem. They don't seem
to be required anymore.
Stops pulling in an old version of hadoop-client and kite-data-core in
testdata/TableFlattener by using the same versions as the Hadoop we use.
Doing so was unnecessarily causing us to download extra, old Hadoop
jars, and the new Hadoop jars seem to work just as well. This is the
kind of divergence that centralizing the versions into variables will
help with.
Creates variables for:
junit.version
slf4j.version
hadoop.version
commons-io.version
httpcomponents.core.version
thrift.version
kite.version (controlled via $IMPALA_KITE_VERSION in impala-config.sh)
Cleans up unused IMPALA_PARQUET_URL variables in impala-config.sh. We
only download Parquet via Maven, rather than downloading it in the
toolchain, so this variable wasn't doing anything.
I ran the core tests with this change.
Change-Id: I717e0625dfe0fdbf7e9161312e9e80f405a359c5
Reviewed-on: http://gerrit.cloudera.org:8080/8853
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
Using fsck breaks non-HDFS builds: local, S3, and Isilon.
This reverts commit 5a7c10ec3d.
Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc
Reviewed-on: http://gerrit.cloudera.org:8080/8880
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
In the commit 4feb4f3a, the third party library pcg-cpp was excluded
from the clang-tidy check. It could make unexpected side effect, so
fixing some warnings from clang-tidy is better than avoidance of the
check.
Change-Id: I591d30373cb13f0eb89afbe16d81b1d3fb783365
Reviewed-on: http://gerrit.cloudera.org:8080/8829
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This commit introduces the ThreadDebugInfo class which can
hold information about the current thread that can be useful
in debug sessions. It needs to be allocated on the stack of
each thread in order to include it to minidumps.
Currently a ThreadDebugInfo object is created in
Thread::SuperviseThread. This object is available in all
the child stack frames through the global function
GetThreadDebugInfo().
ThreadDebugInfo has members for the thread name and instance
id. These are fixed size char buffers.
If you have a core dump written by Impala, you can locate the
ThreadDebugInfo for the current thread through the global
pointer impala::thread_debug_info.
In a core file that has been created from a minidump, we need
to select the stack frame that allocated the ThreadDebugInfo
object in order to inspect it. It is currently allocated in
Thread::SuperviseThread().
We can use printf in gdb to print the members, e.g.:
printf "%s\n" thread_debug_info.instance_id
Currently the thread name and instance id is stored.
I created some tests in thread-debug-info-test.cc
Change-Id: I566f7f1db5117c498e86e0bd05b33bdcff624609
Reviewed-on: http://gerrit.cloudera.org:8080/8621
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV()
function.
Testing:
- modified/improved existing functional tests
- core/hdfs run passed
Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b
Reviewed-on: http://gerrit.cloudera.org:8080/8840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
When the data loading finishes, it is possible for some HDFS blocks to
be under replicated. If impala gets the metadata before the replication
is done, some tests may fail. This patch adds a replication waiting step
in the data loading script.
Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322
Reviewed-on: http://gerrit.cloudera.org:8080/8846
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
We saw some failures on the exhaustive release build because the
compiler assumed that the pointer to the intermediate struct that is
used for computing decimal average was aligned.
To fix the problem, we mark the struct with a "packed" attribute so
that the compiler does not expect it to be aligned.
Testing:
- Ran the failing test locally on an release build and it passed.
Change-Id: Id25ec6e20dde3f50fb37a22135b355ad251809e0
Reviewed-on: http://gerrit.cloudera.org:8080/8836
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
Impala partitions and sorts rows according to the target table's
partitioning scheme before inserting them into Kudu in order to
improve the performance of large inserts.
A recent change added the ability to create unpartitioned Kudu tables,
but Impala still does the partitioning/sorting for them even though
its wasted work.
This patch modifies the planner to not add the partition/sort for Kudu
inserts if the table is unpartitioned, unless the clustered/shuffle
hints are used.
It also removes the exchange in the case where the partition exprs are
all constant.
Testing:
- Added planner tests for inserting into an unpartitioned Kudu table,
with and without hints, and for when the partition exprs are
constant.
- Ran the existing correctness tests for inserts into unpartitioned
Kudu tables in kudu_create.test
Change-Id: I3e01a7dd5284767a25df3218656746a5d0ee4632
Reviewed-on: http://gerrit.cloudera.org:8080/8810
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The current default for krpc_port is 29000, which conflicts
with Sentry's WebUI. This changes the default to 27000,
which is currently free.
Core tests pass with this change.
Change-Id: Iaf5ccedfd9bc1eff9786e4b019c1bb68bf757300
Reviewed-on: http://gerrit.cloudera.org:8080/8841
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins
Impala shell can accidentally convert certain
literal strings to lowercase. Impala shell splits
each command into tokens and then converts the
first token to lowercase to figure out how it
should execute the command. The splitting is done
by spaces only. Thus, if the user types a TAB
after the SELECT, the first token after the split
becomes the SELECT plus whatever comes after it.
Testing:
TestImpalaShellInteractive.test_case_sensitive_command
TestImpalaShellInteractive.test_unexpected_conversion_for_literal_string_to_lowercase
TestImpalaShell.test_var_substitution
Change-Id: Ifdce9781d1d97596c188691b62a141b9bd137610
Reviewed-on: http://gerrit.cloudera.org:8080/8762
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This patch fixes a regression introduced as part of IMPALA-1788, where
an expression like 'CAST(0 AS DECIMAL(14))' is rewritten as a
NumericLiteral expression of type DECIMAL(14,0). The query had
another NumericLiteral of type TINYINT. While analyzing the DISTINCT
aggregation clause of the SELECT query, AggregateInfo::create() removes
duplicate expressions from groupingExprs.
NumericLiteral::localEquals() is used to check for equality. Now
since the method does not consider expression types, a TINYINT literal
is considered to be duplicate of a DECIMAL literal. This results in a
query like the following to fail:
SELECT DISTINCT CAST(0 AS DECIMAL(14), 0 FROM functional.alltypes
We propose to fix the issue by accounting for types as well
when comparing analyzed numeric literals.
A test case has been added to AnalyzeStmtsTest.
Change-Id: Ia88d54088dfd128b103759dc01103b6c35bf6257
Reviewed-on: http://gerrit.cloudera.org:8080/8448
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds the following details to the error message encountered
on failure to get minimum memory reservation:
- which ReservationTracker hit its limit
- top 5 admitted queries that are consuming the most memory under the
ReservationTracker that hit its limit
Testing:
- added tests to reservation-tracker-test.cc that verify the error
message returned for different cases.
- tested "initial reservation failed" condition manually to verify
the error message returned.
Change-Id: Ic4675fe923b33fdc4ddefd1872e6d6b803993d74
Reviewed-on: http://gerrit.cloudera.org:8080/8781
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
Testing under ASAN:
- reproduced locally
- does not reproduce after fix
- locally ran test_aggregation.py which passed
Change-Id: Ia695201e61d8afc23636f826264635c85d3a228a
Reviewed-on: http://gerrit.cloudera.org:8080/8838
Tested-by: Impala Public Jenkins
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
The two Kudu loads and Hive UDFs can all run in parallel. This
should shave about 4 minutes off of the data load. (Current
timings are 3.5, 4, and 0.6 minutes, see below.)
I've run dataload with this change many times.
Loading Kudu functional (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu.log)...
Loading workload 'functional-query' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 3 min 29 sec)
Loading Kudu TPCH (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu-tpch.log)...
Loading workload 'tpch' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 4 min 0 sec)
Loading Hive UDFs (logging to /home/ubuntu/Impala/logs/data_loading/build-and-copy-hive-udfs.log)...
Loading Hive UDFs OK (Took: 0 min 41 sec)
Change-Id: I7e93ee5a77ec9271b980b88bef7ad512ecbe0407
Reviewed-on: http://gerrit.cloudera.org:8080/8822
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
Some tests use the local user's group name to construct SQLs, which may
lead to syntax errors when group name contains dots. We need to quote
the group names in SQL to avoid this error. Besides, a test in
test_admission_controller uses '\w+' to match the local user name. This
expression cannot match usernames with dots, which causes test failure
as well. Instead, we should use '\S+'.
Change-Id: Ib8ae15bb6a929dc48d3ad2176c8b3fafff87f32b
Reviewed-on: http://gerrit.cloudera.org:8080/8807
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Removes properties that are already defined in the impala-parent pom.
I ran the tests.
Change-Id: I6812e11bb41716450ef29bb523773479e9f76eec
Reviewed-on: http://gerrit.cloudera.org:8080/8827
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
Currently implementation of rand/random built-in functions
use rand_r of C library. We recognized its randomness was poor.
pcg32 of third party library shows better randomness than rand_r.
Testing:
Revise unit test in expr-test
Add E2E test to random.test
Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684
Reviewed-on: http://gerrit.cloudera.org:8080/8355
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This patch adds a new MemTracker under the Process MemTracker called
"TCMalloc Overhead" which accounts for different cache freelists
maintained by TCMalloc. This added accounting also helps bring down
the amount of untracked memory.
An example dump of the Process MemTracker now looks like:
Process: Limit=8.34 GB Total=119.10 MB Peak=119.10 MB
Buffer Pool: Free Buffers: Total=0
Buffer Pool: Clean Pages: Total=0
Buffer Pool: Unused Reservation: Total=0
TCMalloc Overhead: Total=11.42 MB
Untracked Memory: Total=107.69 MB
Testing:
Tested manually by checking the memz webpage.
Change-Id: I602e9d5e8e8d7470dcfe4addde3265057c16263a
Reviewed-on: http://gerrit.cloudera.org:8080/8782
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.
SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.
Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html
The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.
Testing:
- added functional tests
- core/hdfs run passed
Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Reviewed-on: http://gerrit.cloudera.org:8080/8569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, DataStreamSender::FlushAndSendEos() makes some wrong
assumption about the format string of the error status returned
from DoTransmitDataRpc(). In particular, it assumes that the format
string has exactly one substitution argument. This is a pretty bad
assumption as new types of errors could be returned from
DoTransmitDataRpc() and they can take different numbers of
substitution arguments. Recent changes in the format string
TErrorCode::RPC_RECV_TIMEOUT exposed this bug.
This change fixes the problem by removing this bad usage of Status()
in data-stream-sender.cc
Change-Id: I1cc04db670069a7df484e2f2bafb93c5315b9c82
Reviewed-on: http://gerrit.cloudera.org:8080/8817
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This commit links together all the individual pom.xml files to have a
new "impala-parent" pom as the parent. This enables de-duplicating all
the repository configuration.
I ran the build to test this.
Change-Id: Id744e4357ee4d8e4be4e5490b2159bb76a2192f0
Reviewed-on: http://gerrit.cloudera.org:8080/8753
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
"be/src/kudu/security/test/mini_kdc.cc" uses lsof, which
doesn't exist on the base ubuntu:16.04 Docker image; adding
it in.
Change-Id: I6a458f2ef0313b2d08d6dd21290f8a38fa6d07f7
Reviewed-on: http://gerrit.cloudera.org:8080/8813
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The fix for HIVE-3140 started indenting multi-line comments,
which breaks Impala testing when run against Hive 2.1.1.
To test this using the pure test runner proved difficult
since it would require extensive changes to support both
row_regexes (since the columns changed order) and subset
support (since the number of rows changed).
Instead, we manually verify the hints are present in the
output in the python test.
The fact that the hints have been reformatted leaves us
in an uncertain state as to whether they actually get applied,
so a new test case has been added to run EXPLAIN SELECT
on the view and verify the joins happen exactly as we
expect.
Testing: Ran the views-ddl test against Impala mini-cluster
setups using both Hive 2.1.1 and Hive 1.1.0
Change-Id: I49e53b1230520ca6e850af28078526e6627d69de
Reviewed-on: http://gerrit.cloudera.org:8080/8719
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins