impala

mirror of https://github.com/apache/impala.git synced 2025-12-30 12:02:10 -05:00

Author	SHA1	Message	Date
Xianda Ke	514dfaf9fd	IMPALA-6128: Add support for AES-CTR encryption when spilling to disk CFB mode is a stream cipher and is secure when used with a different nonce/IV for every message. However, it can be a performance bottleneck. CTR mode is also a stream cipher and is secure, and is 4~6x faster than CFB mode in OpenSSL. AES-CTR+SHA256 is about 40~70% faster than AES-CFB+SHA256. CTR mode is used if OpenSSL version>=1.0.1 at runtime; otherwise we fall back to using CFB mode. Testing: run runtime tmp-file-mgr-test, openssl-util-test, buffer-pool-test and buffered-tuple-stream-test The ut case openssl-util-test.EncryptInPlace tests encryption in both modes. Change-Id: I9debc240615dd8cdbf00ec8730cff62ffef52aff Reviewed-on: http://gerrit.cloudera.org:8080/8861 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 05:39:09 +00:00
Taras Bobrovytsky	f810458ca4	IMPALA-6231: Implement decimal_v2 fuzz test Implement a test that generates random decimal numbers in the pytest framework, performs a random mathematical operation in Impala and verifies that the result is correct by doing the same operation using the Python decimal module. We try to generate not only completely random decimal numbers, but also numbers that have interesting properties, such as the number being a power of two. Change-Id: I4328125de5c583ec8ead1f78d9a08703b18b2d85 Reviewed-on: http://gerrit.cloudera.org:8080/8898 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Zach Amsden <zamsden@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 03:03:52 +00:00
Jinchul	99962d2e81	IMPALA-4168: Adds Oracle-style hint placement for INSERT/UPSERT Allow specifying Oracle-style hints on INSERT/UPSERT statements. For example, - insert /* +noshuffle */ into table functional.alltypes partition(year, month) select * from functional.alltypes; - upsert /* +noshuffle */ into functional_kudu.alltypes select * from functional.alltypes; Testing: Add unit tests to ParserTest#TestPlanHints Add plan check tests to PlannerTest#testInsert, PlannerTest#testKuduUpsert Add tests to ToSqlTest#planHintsTest Change-Id: Ied7629d70197a0270cdc0853e00cc021fdb4dc20 Reviewed-on: http://gerrit.cloudera.org:8080/8676 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 03:03:49 +00:00
aphadke	38461c524f	IMPALA-5052: Read and write signed integer logical types in Parquet This patch maps a signed integer logical type in parquet to a supported Impala column type. This change introduces the following mapping - INT_8 -> TINYINT INT_16 -> SMALLINT INT_32 -> INT INT_64 -> BIGINT Also, added a parquet file with the following schema for testing - schema { optional int32 id; optional int32 tinyint_col (INT_8); optional int32 smallint_col (INT_16); optional int32 int_col; optional int64 bigint_col; } Change-Id: I47a8371858c9597c6a440808cf6f933532468927 Reviewed-on: http://gerrit.cloudera.org:8080/8548 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Tianyi Wang <twang@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-09 04:55:59 +00:00
Tianyi Wang	c4d950b9e9	IMPALA-3887: Wait for HDFS replication in data loading When the data loading finishes, it is possible for some HDFS blocks to be under replicated. If impala gets the metadata before the replication is done, some tests may fail. This patch adds a replication waiting step in the data loading script. Resubmitted with filesystem type check. Change-Id: I64d9a8ea1d0a32b40047321b50a7139a8f48eac8 Reviewed-on: http://gerrit.cloudera.org:8080/8916 Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-09 03:24:36 +00:00
Bharath Vissapragada	6a87eb20a5	IMPALA-6348: Redact only sensitive fields in runtime profiles Without this patch, redaction is applied to every field in the runtime profile. This approach has an undesired side effect when Kerberos auth + email redaction is in place. Since the redaction applies to every field, even principals (from Connected/Delegated User fields) are redacted, as the Kerberos principal format generally pattern matches with an email redactor template. This is particularly problematic for monitoring tools that consume runtime profiles and use these fields to group the queries by user. This patch fixes the problem by redacting only the following sensitive fields. - Query Statement - Error logs (since they can contain column references etc.) - Query Status - Query Plan Other fields in the runtime profile are left unredacted. Change-Id: Iae3b6726009bf458a7ec73131e5d659b12ab73cf Reviewed-on: http://gerrit.cloudera.org:8080/8934 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-06 22:54:17 +00:00
Zoltan Borok-Nagy	ce65b43d47	IMPALA-2248: Make idle_session_timeout a query option This commit makes idle_session_timeout a query option. idle_session_timeout currently can be set as a command line option, which will be the default timeout for sessions. HS2 sessions can override it with a smaller value by setting it in the configuration overlay of HS2 OpenSession(). However, we can't override idle_session_timeout for JDBC/ODBC connections, because we cannot put this in the connection string. This commit is a workaround for this problem, it allows JDBC/ODBC connections to set the session timeout as a query option with the SET statement. After this commit, the session timeout can be overridden to any value, i.e. the command line flag idle_session_timeout doesn't limit this option anymore. I created an automated test case in JdbcTest.java based on test_hs2.py::test_concurrent_session_mixed_idle_timeout. I also extended the test_session_expiration and test_set_and_unset test suites. Change-Id: I32e2775f80da387b0df4195fe2c5435b3f8e585e Reviewed-on: http://gerrit.cloudera.org:8080/8490 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-06 01:47:47 +00:00
Pranay	302ec25b2e	IMPALA-5522: Use tracked memory for DictDecoder and DictEncoder Currently the DictDecoder and DictEncoder classes use std::vector to store the tables mapping codeword to value and vice-versa. It is hard to detect the memory usage of these tables when they become very large, since this memory is not accounted for by Impala's memory management infrastructure. This patch uses the memory tracker of HdfsScanner to track the memory used by the dictionary in the DictDecoder class. Similarly, it uses the memory tracker of HdfsTableSink to track the memory used by the dictionary in the DictEncoder class. Memory for the dictionary, stored as std::vector, is still allocated from std::allocator, but the amount allocated is accounted for by introducing a counter which is incremented and decremented as the memory is consumed and released by the vector. Testing ------- Ran all the backend and end-end tests with no failures. Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299 Reviewed-on: http://gerrit.cloudera.org:8080/8034 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-06 01:30:36 +00:00
Tim Armstrong	d25607d01b	IMPALA-6362: avoid Reservation/MemTracker deadlock Avoid the circular dependency between ReservationTracker::lock_ and MemTracker::child_trackers_lock_ by not acquiring ReservationTracker::lock_ in GetReservation(), where an atomic operation is sufficient. Testing: Added a unit test that reproed the deadlock. Change-Id: Id7adbe961a925075422c685690dd3d1609779ced Reviewed-on: http://gerrit.cloudera.org:8080/8933 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-05 22:57:21 +00:00
Joe McDonnell	d1a0510bfe	IMPALA-6364: Bypass file handle cache for ineligible files Currently, all HdfsFileHandles are owned and constructed by the file handle cache. When the file handle cache is disabled or the file handle is not eligible for caching, the HdfsFileHandle is stored exclusively in ScanRange::exclusive_hdfs_fh_, but the HdfsFileHandle still comes from the file handle cache. It is created via a call to DiskIoMgr::GetCachedHdfsFileHandle() with 'require_new_handle' set to true and destroyed via DiskIoMgr::ReleaseCachedHdfsFileHandle() with 'destroy_handle' set to true. Recent testing has revealed that the lock on the file handle cache is a bottleneck for workloads with many small remote files. There is no benefit to storing these exclusive file handles in the file handle cache, as they do not participate in the caching. This change introduces DiskIoMgr::GetExclusiveHdfsFileHandle() and DiskIoMgr::ReleaseExclusiveHdfsFileHandle(). These are equivalent to the Get/ReleaseCachedHdfsFileHandle() calls, except they bypass the file handle cache and create/destroy the file handle directly. ScanRange::Open()/Close(), which populates and frees ScanRange::exclusive_hdfs_fh_, now uses these new calls rather than accessing the file handle cache. This avoids the locking entirely, solving the bottleneck. To draw a distinction between the two codepaths, HdfsFileHandle is now an abstract class with two subclasses: - CachedHdfsFileHandles cover all handles that live in file handle cache. Get/ReleaseCachedHdfsFileHandle() use this subclass. - ExclusiveHdfsFileHandles cover all cases where a file handle does not come from the cache. The new Get/ReleaseExclusiveHdfsFileHandle() use this subclass. Separately, testing revealed that increasing the number of partitions for the file handle cache also fixes the contention problem. This changes the file handle cache to make the number of partitions configurable via startup parameter num_file_handle_cache_partitions. This allows mitigation of future bottlenecks without a patch. Change-Id: I4ab52b0884a909a4faeb6692f32d45878ea2838f Reviewed-on: http://gerrit.cloudera.org:8080/8945 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-05 21:21:46 +00:00
Tim Armstrong	d3ff67b8b3	IMPALA-6370: fix partitioned parquet tables with nested types When materialising a nested collection, has_template_tuple() should use the template tuple for the collection, not the top-level tuple. Testing: Added tests based on nested-types-basic.test that operate on a simple partitioned table. The tests reliably crashed Impala before the fix. Change-Id: Ic808b824ce3b31af0539036d8ca23d17b18deab4 Reviewed-on: http://gerrit.cloudera.org:8080/8947 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-05 20:44:21 +00:00
manaswinimaharana	eea8ade36d	IMPALA-6296: Avoid crash caused by DCHECK in Codegen in debug mode Currently, when debug mode is enabled, any query using codegen can result in an Impala daemon crash as it hits a DCHECK. This patch ensures the DCHECK is hit only when a specific condition is met, to avoid the crash. The condition is to DCHECK only when 'emit_perf_map_' evaluates to true, ensuring 'perf_map_lock_' is not empty when asserted. Change-Id: I93e2b1efb325100d01d398e68e789d87b877167e Reviewed-on: http://gerrit.cloudera.org:8080/8923 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-05 02:18:05 +00:00
John Russell	b4a73a68fb	IMPALA-4978 / IMPALA-5631: [DOCS] Add FQDN known issue Because there was no obvious subcategory of known issues to put this one in, I made a new subcategory 'startup issues'. Change-Id: Ib039d0102878f1c05470371f581cb258287b9bc0 Reviewed-on: http://gerrit.cloudera.org:8080/7388 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-04 07:27:32 +00:00
Tim Armstrong	4f4912c532	IMPALA-6355: fix overflow DCHECK in decimal mod The bug is that ConvertToInt128(), by design, only sets 'overflow' if an overflow occurs. This means that caller needs to initialise the overflow variable to false, otherwise the value when overflow occurs is undefined. Testing: Reprod the expr-test failure under ASAN, confirmed that it passed after this fix. Change-Id: Ifd7ac691155442ba7cba71dd3647208b7c1c0bf9 Reviewed-on: http://gerrit.cloudera.org:8080/8929 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-04 02:30:48 +00:00
Thomas Tauber-Marshall	96b976aff3	IMPALA-6295: Fix min/max handling of 'nan' and 'inf' This patch fixes several issues related to the min/max aggregate functions and their handling of 'nan' and 'inf': - Previously, if 'inf' or '-inf' was the only value for the min/max and codegen was being used, the result would be incorrect. This occurred, for example in the case of 'inf' and 'min', because we set an initial value of numeric_limits::max, which is less than 'inf', so the returned min was numeric_limits::max when it should be 'inf'. The fix is to set the initial value to numeric_limits::infinity. - Previously, if one of the values was 'nan', the result of min/max was non-deterministic depending on the order the values were evaluated in. This occurs because 'nan' < or > 'any value' is always false, so if the first value added was 'nan', all other comparisons would be false and 'nan' would be returned, whereas if the first value wasn't 'nan' then the 'nan' wouldn't be returned. The fix is to treat 'nan' specially and to always return 'nan' if there is a single 'nan' value. Testing: - Added e2e tests for both scenarios, as well as adding a little extra nan/inf coverage for other aggregate functions. Change-Id: Ia1e206105937ce5afc75ca5044597d39b3dc6a81 Reviewed-on: http://gerrit.cloudera.org:8080/8854 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-04 01:23:43 +00:00
Gabor Kaszab	7810d1f9a2	IMPALA-6318: Adjustment for hanging query cancellation test Apparently test_query_cancellation_during_fetch hangs occasionally in Jenkins builds. The Impala debug page shows the query being cancelled, however, on the host the ImpalaShell process related to that query is still running. Since I had no luck in reproducing the issue locally I only have a theory what might be going on here: The query is cancelled successfully on Impala backend and when the test tries to get the stdout and stderr from the ImpalaShell it gets stuck. It might be the case that ImpalaShell process fetching the query results holds the stdout. According to the documentation of subprocess.communicate() it may cause issues to fetch data when the data size is large or unlimited, that we can consider to be the case here. As a workaround there is a new optional parameter to util.ImpalaShell to omit the stdout because this test wouldn't use it anyway and we get rid of fetching the large result from ImpalaShell. Change-Id: I082c83b91b6d0c527de92c7992f0dc9d1b290433 Reviewed-on: http://gerrit.cloudera.org:8080/8852 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-03 20:32:24 +00:00
Sailesh Mukil	6f2ebadf8d	KUDU-2228: Make Messenger options configurable Currently, the RPC layer accesses many gflags directly to take certain decisions, eg. whether to turn on encryption, authentication, etc. Since the RPC layer is to be used more like a library, these should be configurable options that are passed to the Messenger (which is the API endpoint for the application using the RPC layer), instead of the RPC layer itself directly accessing these flags. This patch converts the following flags to Messenger options and moves the flag definitions to server_base.cc which is the "application" in Kudu that uses the Messenger: FLAGS_rpc_default_keepalive_time_ms FLAGS_rpc_negotiation_timeout_ms FLAGS_rpc_authentication FLAGS_rpc_encryption FLAGS_rpc_tls_ciphers FLAGS_rpc_tls_min_protocol FLAGS_rpc_certificate_file FLAGS_rpc_private_key_file FLAGS_rpc_ca_certificate_file FLAGS_rpc_private_key_password_cmd FLAGS_keytab_file Most of the remaining flags are test or benchmark related flags. There may be a few more flags that can be moved out and converted to options, but we can leave that as future work if we decide to move them. In addition to the cherry-pick above, this change also updates Impala code to pass the key_tab file to InitKerberosForServer() which was changed by this Kudu patch. Change-Id: Ia21814ffb6e283c2791985b089878b579905f0ba Reviewed-on: http://gerrit.cloudera.org:8080/8789 Tested-by: Kudu Jenkins Reviewed-by: Dan Burkert <danburkert@apache.org> Reviewed-on: http://gerrit.cloudera.org:8080/8878 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-03 11:45:48 +00:00
Bikramjeet Vig	545163bb0a	IMPALA-5929: Remove redundant explicit casts to string This patch adds a query rewriter to remove redundant explicit casts to a string type (string, char, varchar) from binary predicates of the form "cast(<non-const expr> to <string type>) <eq/ne op> <string constant>". The cast is redundant if the predicate evaluation is the same even if the cast is removed and the constant is converted to the original type of the expression. For example: cast(int_col as string) = '123456' -> int_col = 123456 Performance: For the following query on a table having 6001215 records - select * from tpch.lineitem where cast(l_linenumber as string) = '0' - Scan Time without rewrite: Avg 1s406ms, St dev 44ms; Scan Time with rewrite: Avg 1s099ms, St dev 28ms. Testing: - Added unit tests to ExprRewriteRulesTest - Added functional test to expr.test - Current FE planner tests and BE expr-test run successfully with this change. Change-Id: I91b7c6452d0693115f9b9ed9ba09f3ffe0f36b2b Reviewed-on: http://gerrit.cloudera.org:8080/8660 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-03 01:15:42 +00:00
Michael Ho	f168a7d472	IMPALA-5528: Add a flag to tune TCMalloc total thread caches size This change adds a new flag FLAGS_tcmalloc_max_total_thread_cache_bytes which specifies the maximum size in bytes which the total of all TCMalloc thread caches can grow to. By default, it's set to 0 and the default value in TCMalloc library is used. This change also always enables "aggressive decommit" feature in TCMalloc instead of just validating that it's enabled in the code by default. Testing done: core build; test_tpch.py with FLAGS_tcmalloc_max_total_thread_cache_bytes set. Change-Id: Ib968ee7d20458143ef6ac14ad3ac2c4d84d31dc5 Reviewed-on: http://gerrit.cloudera.org:8080/8906 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-02 23:25:04 +00:00
Michael Ho	a550b5b560	IMPALA-5557: Disable rpc_default_keepalive_time_ms This change makes sure backend connections with KRPC are always kept alive by disabling the idle connection detection logic. Idle connections all tend to be closed and re-opened around the same time, which may easily lead to negotiation timeouts. Until KUDU-279 is fixed, closing idle connections is also racy and leads to query failures. This disablement was validated with Kudu's rpc-test. Change-Id: I0871dd9c9bbe455466b9b2d2a2bbedec79cf0775 Reviewed-on: http://gerrit.cloudera.org:8080/8910 Reviewed-by: Michael Ho <kwho@cloudera.com> Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-02 22:25:39 +00:00
Bikramjeet Vig	4a45c829bd	IMPALA-6177: Cleanup incomplete handcrafted IRs before finalizing module Currently, if an error is encountered during the creation of a handcrafted codegen method, then the resulting IR is left in an incomplete state. This patch ensures that all such IRs are cleaned up (method is deleted from the module) before the llvm module is finalized. Testing: - added a backend test to exercise the added code path. - tested manually by executing the following query: select * from charTable A, charTable B where A.charColumn = B.charColumn and A.charColumn = 'foo'; and looking at the logs to verify that 'InsertRuntimeFilters' and 'FilterContextInsert' methods have been removed. Change-Id: If975cfb3906482b36dd6ede32ca81de6fcee1d7f Reviewed-on: http://gerrit.cloudera.org:8080/8541 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-30 05:31:38 +00:00
Michael Ho	fac2b2cd9b	KUDU-2237: Allow idle server connection scanning to be disabled Currently, a server connection being idle for more than FLAGS_rpc_default_keepalive_time_ms ms will be closed. However, certain services (e.g. Impala) using KRPC may want to keep the idle connections alive for various reasons (e.g. sheer number of connections to re-establish, negotiation overhead in a secure cluster). To keep idle connections from being closed, one currently has to set FLAGS_rpc_default_keepalive_time_ms to a very large value. This change implements a cleaner solution by disabling idle connection scanning if FLAGS_rpc_default_keepalive_time_ms is set to any negative value. This avoids the unnecessary overhead of scanning for idle server connections and relieves the user of having to pick a random large number to make sure the connection is always kept alive. Change-Id: I6161b9e753f05620784565a417d248acf8e7050a Reviewed-on: http://gerrit.cloudera.org:8080/8831 Tested-by: Kudu Jenkins Reviewed-by: Todd Lipcon <todd@apache.org> Reviewed-on: http://gerrit.cloudera.org:8080/8909 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-23 01:34:23 +00:00
Taras Bobrovytsky	abab176e25	IMPALA-6300: Fix decimal modulo overflow In order to compute the modulo of two decimals, we need to bring the underlying datatype to the same scale first. It turns out we could overflow when scaling up one of the values. In this patch we fix the problem by using a larger data type when we detect that the scaled up value will not fit into the original data type. Testing: - Added some expr tests that reproduce the issue. Change-Id: I27c7f25f68353c19c315e1639311ec06b2dea686 Reviewed-on: http://gerrit.cloudera.org:8080/8833 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-22 19:53:38 +00:00
Taras Bobrovytsky	a16fe803ca	IMPALA-5014: Part 1: Round when casting string to decimal In this patch we implement rounding when casting string to decimal if DECIMAL_V2 is enabled. The backend method that parses strings and converts them to decimals is refactored to make it easier to understand. Testing: - Added some BE tests. Change-Id: Icd8b92727fb384e6ff2d145e4aab7ae5d27db26d Reviewed-on: http://gerrit.cloudera.org:8080/8774 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-22 11:39:08 +00:00
David Knupp	4ce23f72ba	Move symlinked auxiliary tests/* to tests/functional/* The layout of the Impala-auxiliary-tests/tests directory is changing to allow for different kinds of tests to be saved there. But just in case the new functional sub-directory does not exist, preserve backwards compatibility with the older layout. Change-Id: Ifb2bbbebc38bbaf3d6a4ad01fa8dd918b7d99b3b Reviewed-on: http://gerrit.cloudera.org:8080/8896 Reviewed-by: David Knupp <dknupp@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-22 04:48:53 +00:00
Zoram Thanga	b581a9d1ee	IMPALA-6225: Part 2: Query profile date-time strings should have ns precision. This commit follows `16d8dd58`. This patch adds a test case that inspects the thrift profile of a completed query, and verifies that the "Start Time" and "End Time" of the query have nanosecond precision. We chose to work with the thrift profile directly, rather than parse the debug web page, as it is the thrift profile which is consumed by management API clients of Impala. Change-Id: Id3421a34cc029ebca551730084c7cbd402d5c109 Reviewed-on: http://gerrit.cloudera.org:8080/8784 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-21 04:26:33 +00:00
Todd Lipcon	ac9ca569db	KUDU-2198. Allow disregarding system-wide auth-to-local mapping This adds a workaround for an issue reported on the user mailing list. Some systems are configured such that the auth_to_local mapping provided by the krb5 library doesn't work properly for service accounts. This patch adds a new configuration which allows Kudu to disregard the system auth_to_local rules and instead just map kerberos principals to their first component, which is typically the username. Change-Id: I2e893493f52965ea54d2ceaac83d375285b49486 Reviewed-on: http://gerrit.cloudera.org:8080/8373 Reviewed-by: Alexey Serbin <aserbin@cloudera.com> Reviewed-by: Dan Burkert <danburkert@apache.org> Tested-by: Kudu Jenkins Reviewed-on: http://gerrit.cloudera.org:8080/8875 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-21 00:13:42 +00:00
Philip Zeyliger	f755910e97	Remove unused deps, centralize some pom versions, upgrade SLF4J and commons-io. As a follow-on to centralizing into one parent pom, we can now manage thirdparty dependency versions in Java a little bit more clearly. Upgrades SLF4J, commons.io: slf4j: 1.7.5 -> 1.7.25 commons.io: 2.4 -> 2.6 The SLF4J upgrade is nice to be able to run under Java9. The release notes at https://www.slf4j.org/news.html are uneventful. Commons IO 2.6 supports Java 9 and is source and binary compatible, per https://commons.apache.org/proper/commons-io/upgradeto2_6.html and https://commons.apache.org/proper/commons-io/upgradeto2_5.html. Removes the following dependencies: htrace-core hadoop-mapreduce-client-core hive-shims com.stumbleupon:async commons-dbcp jdo-api I ran "mvn dependency:analyze" and these were some (but not all) of the "Unused declared dependencies found." Spelunking in git logs, these dependencies are from 2013 and possibly from an effort to run with dependencies from the filesystem. They don't seem to be required anymore. Stops pulling in an old version of hadoop-client and kite-data-core in testdata/TableFlattener by using the same versions as the Hadoop we use. Doing so was unnecessarily causing us to download extra, old Hadoop jars, and the new Hadoop jars seem to work just as well. This is the kind of divergence that centralizing the versions into variables will help with. Creates variables for: junit.version slf4j.version hadoop.version commons-io.version httpcomponents.core.version thrift.version kite.version (controlled via $IMPALA_KITE_VERSION in impala-config.sh) Cleans up unused IMPALA_PARQUET_URL variables in impala-config.sh. We only download Parquet via Maven, rather than downloading it in the toolchain, so this variable wasn't doing anything. I ran the core tests with this change. Change-Id: I717e0625dfe0fdbf7e9161312e9e80f405a359c5 Reviewed-on: http://gerrit.cloudera.org:8080/8853 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-20 22:04:18 +00:00
David Knupp	2fb11fb732	Revert "IMPALA-3887: Wait for HDFS replication in data loading" Using fsck breaks non-HDFS builds: local, S3, and Isilon. This reverts commit `5a7c10ec3d`. Change-Id: I0b12a42049543ca0b267b5146a0bbcdd2316abfc Reviewed-on: http://gerrit.cloudera.org:8080/8880 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-19 23:26:29 +00:00
Jinchul	061287d912	IMPALA-5754: Rollback the exclusion of clang-tidy check for pcg-cpp In the commit `4feb4f3a`, the third party library pcg-cpp was excluded from the clang-tidy check. That could have unexpected side effects, so fixing some warnings from clang-tidy is better than avoiding the check. Change-Id: I591d30373cb13f0eb89afbe16d81b1d3fb783365 Reviewed-on: http://gerrit.cloudera.org:8080/8829 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-12-18 18:37:45 +00:00
Zoltan Borok-Nagy	8047b1dcb4	IMPALA-3703: Store query context in thread-local variables This commit introduces the ThreadDebugInfo class which can hold information about the current thread that can be useful in debug sessions. It needs to be allocated on the stack of each thread in order to include it in minidumps. Currently a ThreadDebugInfo object is created in Thread::SuperviseThread. This object is available in all the child stack frames through the global function GetThreadDebugInfo(). ThreadDebugInfo has members for the thread name and instance id. These are fixed size char buffers. If you have a core dump written by Impala, you can locate the ThreadDebugInfo for the current thread through the global pointer impala::thread_debug_info. In a core file that has been created from a minidump, we need to select the stack frame that allocated the ThreadDebugInfo object in order to inspect it. It is currently allocated in Thread::SuperviseThread(). We can use printf in gdb to print the members, e.g.: printf "%s\n" thread_debug_info.instance_id Currently the thread name and instance id are stored. I created some tests in thread-debug-info-test.cc Change-Id: I566f7f1db5117c498e86e0bd05b33bdcff624609 Reviewed-on: http://gerrit.cloudera.org:8080/8621 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-18 15:32:59 +00:00
Alex Behm	1f7b3b00e9	IMPALA-5310: Part 3: Use SAMPLED_NDV() in COMPUTE STATS. Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV() function. Testing: - modified/improved existing functional tests - core/hdfs run passed Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b Reviewed-on: http://gerrit.cloudera.org:8080/8840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:58:59 +00:00
Tianyi Wang	5a7c10ec3d	IMPALA-3887: Wait for HDFS replication in data loading When the data loading finishes, it is possible for some HDFS blocks to be under replicated. If impala gets the metadata before the replication is done, some tests may fail. This patch adds a replication waiting step in the data loading script. Change-Id: I88dfb7165b7515b3e96111436be490f2068ec322 Reviewed-on: http://gerrit.cloudera.org:8080/8846 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:53:56 +00:00
Taras Bobrovytsky	7256fcefb4	IMPALA-6284: Mark the intermediate decimal avg struct as packed We saw some failures on the exhaustive release build because the compiler assumed that the pointer to the intermediate struct that is used for computing decimal average was aligned. To fix the problem, we mark the struct with a "packed" attribute so that the compiler does not expect it to be aligned. Testing: - Ran the failing test locally on a release build and it passed. Change-Id: Id25ec6e20dde3f50fb37a22135b355ad251809e0 Reviewed-on: http://gerrit.cloudera.org:8080/8836 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 03:26:43 +00:00
Thomas Tauber-Marshall	3f1f706393	IMPALA-6297: Don't partition/sort for DML on unpartitioned Kudu table Impala partitions and sorts rows according to the target table's partitioning scheme before inserting them into Kudu in order to improve the performance of large inserts. A recent change added the ability to create unpartitioned Kudu tables, but Impala still does the partitioning/sorting for them even though it's wasted work. This patch modifies the planner to not add the partition/sort for Kudu inserts if the table is unpartitioned, unless the clustered/shuffle hints are used. It also removes the exchange in the case where the partition exprs are all constant. Testing: - Added planner tests for inserting into an unpartitioned Kudu table, with and without hints, and for when the partition exprs are constant. - Ran the existing correctness tests for inserts into unpartitioned Kudu tables in kudu_create.test Change-Id: I3e01a7dd5284767a25df3218656746a5d0ee4632 Reviewed-on: http://gerrit.cloudera.org:8080/8810 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 03:06:19 +00:00
Joe McDonnell	e0c0848f15	IMPALA-5948: Change Kudu RPC port to 27000 The current default for krpc_port is 29000, which conflicts with Sentry's WebUI. This changes the default to 27000, which is currently free. Core tests pass with this change. Change-Id: Iaf5ccedfd9bc1eff9786e4b019c1bb68bf757300 Reviewed-on: http://gerrit.cloudera.org:8080/8841 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 02:42:30 +00:00
Jinchul	bfbcd1fe86	IMPALA-4664: Unexpected string conversion in Shell Impala shell can accidentally convert certain literal strings to lowercase. Impala shell splits each command into tokens and then converts the first token to lowercase to figure out how it should execute the command. The splitting is done by spaces only. Thus, if the user types a TAB after the SELECT, the first token after the split becomes the SELECT plus whatever comes after it. Testing: TestImpalaShellInteractive.test_case_sensitive_command TestImpalaShellInteractive.test_unexpected_conversion_for_literal_string_to_lowercase TestImpalaShell.test_var_substitution Change-Id: Ifdce9781d1d97596c188691b62a141b9bd137610 Reviewed-on: http://gerrit.cloudera.org:8080/8762 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-15 21:32:20 +00:00
Zoram Thanga	f4eb00123f	IMPALA-6114: Require type equality for NumericLiteral::localEquals(). This patch fixes a regression introduced as part of IMPALA-1788, where an expression like 'CAST(0 AS DECIMAL(14))' is rewritten as a NumericLiteral expression of type DECIMAL(14,0). The query had another NumericLiteral of type TINYINT. While analyzing the DISTINCT aggregation clause of the SELECT query, AggregateInfo::create() removes duplicate expressions from groupingExprs. NumericLiteral::localEquals() is used to check for equality. Since the method does not consider expression types, a TINYINT literal is considered to be a duplicate of a DECIMAL literal. This causes a query like the following to fail: SELECT DISTINCT CAST(0 AS DECIMAL(14)), 0 FROM functional.alltypes We propose to fix the issue by accounting for types as well when comparing analyzed numeric literals. A test case has been added to AnalyzeStmtsTest. Change-Id: Ia88d54088dfd128b103759dc01103b6c35bf6257 Reviewed-on: http://gerrit.cloudera.org:8080/8448 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-15 01:28:13 +00:00
Bikramjeet Vig	de29925912	IMPALA-6222: Add details to error msg on failure to get min reservation This patch adds the following details to the error message encountered on failure to get minimum memory reservation: - which ReservationTracker hit its limit - top 5 admitted queries that are consuming the most memory under the ReservationTracker that hit its limit Testing: - added tests to reservation-tracker-test.cc that verify the error message returned for different cases. - tested "initial reservation failed" condition manually to verify the error message returned. Change-Id: Ic4675fe923b33fdc4ddefd1872e6d6b803993d74 Reviewed-on: http://gerrit.cloudera.org:8080/8781 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-14 22:34:30 +00:00
Alex Behm	3cbbaf3b30	IMPALA-6319: Fix alloc/free mismatch. Testing under ASAN: - reproduced locally - does not reproduce after fix - locally ran test_aggregation.py which passed Change-Id: Ia695201e61d8afc23636f826264635c85d3a228a Reviewed-on: http://gerrit.cloudera.org:8080/8838 Tested-by: Impala Public Jenkins Reviewed-by: Jim Apple <jbapple-impala@apache.org>	2017-12-14 16:58:53 +00:00
Philip Zeyliger	11dbb3952a	IMPALA-6070: Parallelize another bit of data load. The two Kudu loads and Hive UDFs can all run in parallel. This should shave about 4 minutes off of the data load. (Current timings are 3.5, 4, and 0.6 minutes, see below.) I've run dataload with this change many times. Loading Kudu functional (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu.log)... Loading workload 'functional-query' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 3 min 29 sec) Loading Kudu TPCH (logging to /home/ubuntu/Impala/logs/data_loading/load-kudu-tpch.log)... Loading workload 'tpch' using exploration strategy 'core' in table formats 'kudu/none/none' OK (Took: 4 min 0 sec) Loading Hive UDFs (logging to /home/ubuntu/Impala/logs/data_loading/build-and-copy-hive-udfs.log)... Loading Hive UDFs OK (Took: 0 min 41 sec) Change-Id: I7e93ee5a77ec9271b980b88bef7ad512ecbe0407 Reviewed-on: http://gerrit.cloudera.org:8080/8822 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-14 02:28:40 +00:00
stiga-huang	5c593be59c	IMPALA-6301: Fix test failures when username or group name contains dots Some tests use the local user's group name to construct SQL statements, which may lead to syntax errors when the group name contains dots. We need to quote the group names in SQL to avoid this error. Besides, a test in test_admission_controller uses '\w+' to match the local user name. This expression cannot match usernames with dots, which causes a test failure as well. Instead, we should use '\S+'. Change-Id: Ib8ae15bb6a929dc48d3ad2176c8b3fafff87f32b Reviewed-on: http://gerrit.cloudera.org:8080/8807 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-13 23:06:45 +00:00
Philip Zeyliger	2fcbf36c32	IMPALA-6270: remove redundant version properties Removes properties that are already defined in the impala-parent pom. I ran the tests. Change-Id: I6812e11bb41716450ef29bb523773479e9f76eec Reviewed-on: http://gerrit.cloudera.org:8080/8827 Reviewed-by: Zach Amsden <zamsden@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-13 22:48:10 +00:00
Jinchul	4feb4f3a54	IMPALA-5754: Improve randomness of rand()/random() The current implementation of the rand/random built-in functions uses rand_r from the C library. We found its randomness to be poor. pcg32, from a third-party library, shows better randomness than rand_r. Testing: Revise unit test in expr-test Add E2E test to random.test Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684 Reviewed-on: http://gerrit.cloudera.org:8080/8355 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-12-13 10:04:40 +00:00
Bikramjeet Vig	9a6e5fa996	IMPALA-5848: Account for TCMalloc overhead in MemTracker This patch adds a new MemTracker under the Process MemTracker called "TCMalloc Overhead" which accounts for different cache freelists maintained by TCMalloc. This added accounting also helps bring down the amount of untracked memory. An example dump of the Process MemTracker now looks like: Process: Limit=8.34 GB Total=119.10 MB Peak=119.10 MB Buffer Pool: Free Buffers: Total=0 Buffer Pool: Clean Pages: Total=0 Buffer Pool: Unused Reservation: Total=0 TCMalloc Overhead: Total=11.42 MB Untracked Memory: Total=107.69 MB Testing: Tested manually by checking the memz webpage. Change-Id: I602e9d5e8e8d7470dcfe4addde3265057c16263a Reviewed-on: http://gerrit.cloudera.org:8080/8782 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-13 02:05:21 +00:00
Alex Behm	0936e32966	IMPALA-5310: Part 2: Add SAMPLED_NDV() function. Adds a new SAMPLED_NDV() aggregate function that is intended to be used in COMPUTE STATS TABLESAMPLE. This patch only adds the function itself. Integration with COMPUTE STATS will come in a separate patch. SAMPLED_NDV() estimates the number of distinct values (NDV) based on a sample of data and the corresponding sampling rate. The main idea is to collect several x/y data points where x is the number of rows and y is the corresponding NDV estimate. These data points are used to fit an objective function to the data such that the true NDV can be extrapolated. The aggregate function maintains a fixed number of HyperLogLog intermediates to compute the x/y points. Several objective functions are fit and the best-fit one is used for extrapolation. Adds the MPFIT C library to perform curve fitting: https://www.physics.wisc.edu/~craigm/idl/cmpfit.html The library is a C port from Fortran. Scipy uses the Fortran version of the library for curve fitting. Testing: - added functional tests - core/hdfs run passed Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820 Reviewed-on: http://gerrit.cloudera.org:8080/8569 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-12 22:20:18 +00:00
Michael Ho	e57c77f095	IMPALA-6308: Fix bad Status() usage in data-stream-sender.cc Previously, DataStreamSender::FlushAndSendEos() made a wrong assumption about the format string of the error status returned from DoTransmitDataRpc(). In particular, it assumed that the format string has exactly one substitution argument. This is a bad assumption, as new types of errors could be returned from DoTransmitDataRpc() and they can take different numbers of substitution arguments. Recent changes in the format string of TErrorCode::RPC_RECV_TIMEOUT exposed this bug. This change fixes the problem by removing this bad usage of Status() in data-stream-sender.cc Change-Id: I1cc04db670069a7df484e2f2bafb93c5315b9c82 Reviewed-on: http://gerrit.cloudera.org:8080/8817 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-12 21:05:54 +00:00
Philip Zeyliger	d2fe9f437e	IMPALA-6270: create Impala parent pom This commit links together all the individual pom.xml files to have a new "impala-parent" pom as the parent. This enables de-duplicating all the repository configuration. I ran the build to test this. Change-Id: Id744e4357ee4d8e4be4e5490b2159bb76a2192f0 Reviewed-on: http://gerrit.cloudera.org:8080/8753 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-12 04:30:15 +00:00
Philip Zeyliger	ebc00c75ac	Add 'lsof' to bootstrap_system. "be/src/kudu/security/test/mini_kdc.cc" uses lsof, which doesn't exist on the base ubuntu:16.04 Docker image; adding it in. Change-Id: I6a458f2ef0313b2d08d6dd21290f8a38fa6d07f7 Reviewed-on: http://gerrit.cloudera.org:8080/8813 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-12 04:01:42 +00:00
Zach Amsden	245df3c69a	IMPALA-6245: Tolerate column indenting from Hive The fix for HIVE-3140 started indenting multi-line comments, which breaks Impala testing when run against Hive 2.1.1. To test this using the pure test runner proved difficult since it would require extensive changes to support both row_regexes (since the columns changed order) and subset support (since the number of rows changed). Instead, we manually verify the hints are present in the output in the python test. The fact that the hints have been reformatted leaves us in an uncertain state as to whether they actually get applied, so a new test case has been added to run EXPLAIN SELECT on the view and verify the joins happen exactly as we expect. Testing: Ran the views-ddl test against Impala mini-cluster setups using both Hive 2.1.1 and Hive 1.1.0 Change-Id: I49e53b1230520ca6e850af28078526e6627d69de Reviewed-on: http://gerrit.cloudera.org:8080/8719 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-12 00:17:56 +00:00
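
The tokenization problem described in the IMPALA-4664 entry above can be illustrated with a minimal Python sketch. The function names and the sample query are hypothetical and are not taken from the Impala shell source; the sketch only assumes the behaviour stated in the commit message (split on spaces, lowercase the first token).

```python
def first_token_space_only(line):
    # Hypothetical stand-in for the old behaviour: split on single spaces
    # only, then lowercase the first token to decide how to run the command.
    return line.split(' ')[0].lower()


def first_token_any_whitespace(line):
    # Hypothetical alternative: split on any whitespace run (space, TAB,
    # newline). maxsplit=1 leaves the rest of the statement untouched.
    return line.split(None, 1)[0].lower()


# A TAB, not a space, follows SELECT.
query = "SELECT\t'AbC' FROM t"

print(repr(first_token_space_only(query)))      # the string literal is lowercased too
print(repr(first_token_any_whitespace(query)))  # only the keyword is lowercased
```

With a space-only split, the first token swallows the TAB and the quoted literal, so lowercasing it also alters the literal; splitting on any whitespace isolates the keyword.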

6492 Commits