Instead of materializing empty rows when computing count(*), we use
the data stored in the Parquet RowGroup.num_rows field. The Parquet
scanner tuple is modified to have one slot into which we will write the
num rows statistic. The aggregate function is changed from count to a
special sum function that gets initialized to 0. We also add a rewrite
rule so that count(<literal>) is rewritten to count(*) in order to make
sure that this optimization is applied in all cases.
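The effect can be sketched in Python (a simplified stand-in for the real
Parquet metadata and Impala's actual implementation):

```python
# Simplified stand-in for Parquet footer metadata; the real code reads
# RowGroup.num_rows from the file footer instead of materializing rows.
def count_star_from_metadata(row_groups):
    total = 0  # the special sum aggregate starts at 0, so no rows => 0, not NULL
    for rg in row_groups:
        total += rg["num_rows"]  # one value per row group, no rows scanned
    return total

# count(<literal>) is rewritten to count(*) so the same fast path applies.
print(count_star_from_metadata([{"num_rows": 100}, {"num_rows": 250}]))  # 350
print(count_star_from_metadata([]))  # 0
```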
Testing:
- Added functional and planner tests
Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Reviewed-on: http://gerrit.cloudera.org:8080/6812
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, the fault injection utility would inject a fault on
every third RPC call for ReportExecStatus() RPCs. As shown
in IMPALA-5588, with an unfortunate sequence in which other
RPCs happen between the retry of ReportExecStatus() RPC in
QueryState::ReportExecStatusAux(), ReportExecStatus() can
hit injected faults 3 times in a row, causing the query
to be cancelled in QueryState::ReportExecStatusAux().
This change fixes the problem by reducing the fault injection
frequency to once every 16 RPC calls for ReportExecStatus(),
CancelQueryFInstances() and ExecQueryFInstances() RPCs.
Also incorporated the fix by Michael Brown for a python bug in
test_rpc_exception.py so that tests hitting an unexpected exception
will re-throw it for better diagnosis of test failures.
Change-Id: I0ce4445e8552a22f23371bed1196caf7d0a3f312
Reviewed-on: http://gerrit.cloudera.org:8080/7310
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
1. Report incorrect results count in the console log table. Previously,
the stress test knew about incorrect results but only reported them to
the console log inline. It was on the caller to find this. Now
we have a summed count.
2. Fail the process if there are errors, incorrect results, or timeouts.
Previously, the stress test just counted these, but would not fail its
process. This leads to a much stricter pass criteria for the stress
test. This will allow CI to fail and alert a maintainer that something
went wrong.
Testing:
I modified the result hashes for queries in a local runtime_info.json
and observed the reporting of incorrect results, incremented incorrect
results counts, and ultimately process failure.
Change-Id: I9f2174a527193ae01be45b8ed56315c465883346
Reviewed-on: http://gerrit.cloudera.org:8080/7282
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This is similar to the single-node execution optimisation, but applies
to slightly larger queries that should run in a distributed manner but
won't benefit from codegen.
This adds a new query option disable_codegen_rows_threshold that
defaults to 50,000. If fewer than this number of rows are processed
by a plan node per impalad, the cost of codegen almost certainly
outweighs the benefit.
Using rows processed as a threshold is justified by a simple
model that assumes the cost of codegen and execution per row for
the same operation are proportional. E.g. if x is the complexity
of the operation, n is the number of rows processed, C is a
constant factor giving the cost of codegen, and Ec/Ei are constant
factors giving the per-row cost of codegen'd and interpreted execution,
then the cost of the codegen'd operator is C * x + Ec * x * n
and the cost of the interpreted operator is Ei * x * n. Rearranging
means that interpretation is cheaper if n < C / (Ei - Ec), i.e. that
(at least with the simplified model) it makes sense to choose
interpretation or codegen based on a constant threshold. The
model also implies that it is somewhat safer to choose codegen
because the additional cost of codegen is O(1) but the additional
cost of interpretation is O(n).
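Plugging illustrative numbers into the model (these constants are made up
for demonstration, not measured Impala costs) shows the constant threshold:

```python
# Illustrative constants, not measured values: codegen cost C, per-row
# execution costs Ec (codegen'd) and Ei (interpreted), per unit complexity x.
C, Ec, Ei = 100_000.0, 1.0, 3.0

def total_cost(x, n, codegen):
    # codegen'd: C * x + Ec * x * n; interpreted: Ei * x * n
    return C * x + Ec * x * n if codegen else Ei * x * n

# Interpretation wins iff Ei*x*n < C*x + Ec*x*n  <=>  n < C / (Ei - Ec),
# independent of the complexity x -- hence a constant row-count threshold.
threshold = C / (Ei - Ec)
print(threshold)  # 50000.0
assert total_cost(2.0, 40_000, codegen=False) < total_cost(2.0, 40_000, codegen=True)
assert total_cost(2.0, 60_000, codegen=True) < total_cost(2.0, 60_000, codegen=False)
```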
I ran some experiments with TPC-H Q1, varying the input table size, to
determine the cut-over point at which codegen becomes beneficial.
The cutover was around 150k rows per node for both text and parquet.
At 50k rows per node disabling codegen was very beneficial - around
0.12s versus 0.24s. To be somewhat conservative I set the default
threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the
cutover tends to be higher because there are plan nodes that process
many fewer than the max rows.
Fix a couple of minor issues in the frontend - the numNodes_
calculation could return 0 for Kudu, and the single node optimization
didn't correctly handle the case of a scan node with conjuncts, a limit,
and missing stats (it considered the estimate still valid).
Testing:
Updated e2e tests that set disable_codegen to set
disable_codegen_rows_threshold to 0, so that those tests run both
with and without codegen still.
Added an e2e test to make sure that the optimisation is applied in
the backend.
Added planner tests for various cases where codegen should and shouldn't
be disabled.
Perf:
Added a targeted perf test for a join+agg over a small input, which
benefits from this change.
Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e
Reviewed-on: http://gerrit.cloudera.org:8080/7153
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Problem: IMPALA-4029 introduced the use of the flatbuffers serialization
library for storing file and block metadata. That change reduced the
effectiveness of the Thrift compact protocol (when
--compact_catalog_topic is used), thereby causing a 2X increase in
catalog update topic size when the compact protocol is used.
Fix: LZ4-compress the catalog topic updates before they are sent to the
statestore when --compact_catalog_topic is set to true.
Results: ~4X reduction in catalog update topic size
Change-Id: I2f725cd8596205e6101d5b56abf08125faa30b0a
Reviewed-on: http://gerrit.cloudera.org:8080/7268
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, the retry logic in DoRpc() only allows retry to
happen if send() didn't complete successfully and the exception
indicates a closed connection. However, send() returning
successfully doesn't guarantee that the bytes have actually
reached the remote peer. According to the man page of send(),
when the message does not fit into the send buffer of the socket,
send() normally blocks. So the payload of RPC may be buffered in
the kernel if there is room for it. TCP allows a connection to
be half-open. If an Impalad node is restarted, a stale client
connection to that node may still allow send() to appear to succeed
even though the payload wasn't sent. However, upon calling recv()
in the RPC call to fetch the response, the client will get a return
value of 0, in which case Thrift will throw an exception as the
connection to the remote peer is already closed. The existing retry
logic doesn't handle this case. One can consistently reproduce the
problem by warming the client cache and then restarting one of the
Impalad nodes, which results in a series of query failures due to
stale connections.
This change augments the retry logic to also retry the entire RPC
if the exception string contains the messages "No more data to read."
or "SSL_read: Connection reset by peer" to capture the case of stale
connections. Our usage of Thrift doesn't involve half-open TCP connections,
so a broken connection in recv() indicates the remote end has already
closed the socket. The generated Thrift code doesn't knowingly
close the socket before an RPC completes unless the process crashes,
the connection is stale (e.g. the remote node was restarted) or the
remote end fails to read from the client. In all of these cases, the entire
RPC should just be retried with a new connection.
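The stale-connection check amounts to substring matching on the exception
message; a minimal sketch (the helper name is hypothetical, the strings are
the ones quoted above):

```python
# Hypothetical helper, not Impala's actual code: decide whether an exception
# message indicates a stale connection whose RPC is safe to retry in full.
RETRYABLE_MESSAGES = (
    "No more data to read.",
    "SSL_read: Connection reset by peer",
)

def is_stale_connection(exc_message: str) -> bool:
    # The remote end closed the socket (e.g. the node was restarted), so the
    # whole RPC can be retried on a fresh connection.
    return any(msg in exc_message for msg in RETRYABLE_MESSAGES)

print(is_stale_connection("TTransportException: No more data to read."))  # True
print(is_stale_connection("SSL_read: Resource temporarily unavailable"))  # False
```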
This change also fixes QueryState::ReportExecStatusAux() to
unconditionally retry up to 3 times when reporting the exec status of a
fragment instance. Previously, it may break out of the loop early
if RPC fails with 'retry_is_safe' == true (e.g. due to recv() timeout)
or if the connection to coordinator fails (IMPALA-5576). Declaring the
RPC to have failed may cause all fragment instances of a query to be
cancelled locally, triggering query hang due to IMPALA-2990. Similarly,
the cancellation RPC is also idempotent so it should be unconditionally
retried up to 3 times with 100ms sleep time in between.
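The bounded retry described above can be sketched as follows (a simplified
model, not the actual QueryState code; the RPC callable is a placeholder):

```python
import time

def retry_rpc(rpc, attempts=3, sleep_s=0.1):
    """Retry an idempotent RPC up to `attempts` times, sleeping between tries.

    Placeholder sketch: `rpc` is any callable that raises on failure.
    """
    last_exc = None
    for i in range(attempts):
        try:
            return rpc()
        except Exception as e:  # in Impala this would be a Thrift transport error
            last_exc = e
            if i + 1 < attempts:
                time.sleep(sleep_s)  # 100ms between attempts
    raise last_exc
```

Because the status-report handler ignores duplicate reports for finished
fragment instances, retrying the whole RPC this way is safe.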
The status reporting is idempotent as the handler simply ignores the
RPC if it determines that all fragment instances on a given backend
are done, so it is safe to retry the RPC. This change updates
ApplyExecStatusReport() to handle duplicated status reports with
done bit set. Previously we would drop some other fragment instances'
statuses if we received duplicate 'done' statuses from the same
fragment instance(s).
Testing done: Warmed up the client cache by running the stress test,
then restarted some Impalad nodes. Running queries used to fail or hang
consistently; with this patch they work. Also ran CSL endurance
tests and exhaustive builds.
Change-Id: I4d722c8ad3bf0e78e89887b6cb286c69ca61b8f5
Reviewed-on: http://gerrit.cloudera.org:8080/7284
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
The hash join node currently does not apply limits correctly.
This issue is masked most of the time since the planner sticks
an exchange node on top of most joins, but it is exposed when
NUM_NODES=1.
Change-Id: I414124f8bb6f8b2af2df468e1c23418d05a0e29f
Reviewed-on: http://gerrit.cloudera.org:8080/6778
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Kudu recently added the ability to alter a column's default value
and storage attributes (KUDU-861). This patch adds the ability to
modify these from Impala using ALTER.
It also supports altering a column's comment for non-Kudu tables.
It does not support setting a column to be a primary key or
changing a column's nullability, because those are not supported on
the Kudu side yet.
Syntax:
ALTER TABLE <table> ALTER [COLUMN] <column>
SET <attr> <value> [<attr> <value> [<attr> <value>...]]
where <attr> is one of:
- DEFAULT, BLOCK_SIZE, ENCODING, COMPRESSION (Kudu tables)
- COMMENT (non-Kudu tables)
ALTER TABLE <table> ALTER [COLUMN] <column> DROP DEFAULT
This is similar to the existing CHANGE statement:
ALTER TABLE <table> CHANGE <column> <new_col_name> <type>
[COMMENT <comment>]
but the new syntax is more natural for setting column properties
when the column name and type are not being changed. Both ALTER
COLUMN and CHANGE COLUMN operations use AlterTableAlterColStmt and
are sent to the catalog as ALTER_COLUMN operations.
Testing:
- Added FE tests to ParserTest and AnalyzeDDLTest
- Added EE tests to test_kudu.py
Change-Id: Id2e8bd65342b79644a0fdcd925e6f17797e89ad6
Reviewed-on: http://gerrit.cloudera.org:8080/6955
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
IMPALA-3040 was initially fixed by using a timeout with HDFS caching
tests; however, some test executions against slow-running builds such as
ASAN indicate this timeout may not be high enough.
Use the specific_build_type_timeout() method to set a much higher
timeout for slower builds such as ASAN. This allows us to virtually
ignore timeout values on slow builds, but doesn't force us to
unconditionally increase the timeout in a release or debug build.
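The idea behind specific_build_type_timeout() can be sketched like this
(build-type names and multipliers here are illustrative assumptions, not
the test framework's actual values):

```python
def build_type_timeout(base_timeout_s, build_type, slow_build_timeout_s=None):
    # Illustrative sketch: slow builds (ASAN and similar) get a much larger
    # timeout so flakiness from slow execution is effectively ignored, while
    # release/debug builds keep the tight base timeout.
    slow_builds = {"asan", "tsan", "ubsan", "code_coverage"}
    if build_type in slow_builds:
        return slow_build_timeout_s or base_timeout_s * 10
    return base_timeout_s

print(build_type_timeout(20, "release"))  # 20
print(build_type_timeout(20, "asan"))     # 200
```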
Testing:
Ran all tests that use get_num_cache_requests() in a loop 100 times each
under an ASAN build. All test iterations passed.
Change-Id: I80f1c8a0e634a3726c53ef7297c5b162dd57a3a2
Reviewed-on: http://gerrit.cloudera.org:8080/7115
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
After the fix for IMPALA-5388, all TSSLExceptions thrown are
treated as fatal errors and the query fails. It turns out that
this is too strict: in a secure cluster under load, queries
can easily hit a timeout waiting for an RPC response.
When running without SSL, we call RetryRpcRecv() to retry the recv
part of an RPC if the TSocket underlying the RPC gets an EAGAIN
during recv(). This change extends that logic to cover secure
connection. In particular, we pattern match against the exception
string "SSL_read: Resource temporarily unavailable" which corresponds
to EAGAIN error code being thrown in the SSL_read() path.
Similarly, we will handle closed connection in send() path with
secure connection by pattern matching against the exception string
"TTransportException: Transport not open". To verify that the exception
is thrown during the send part of an RPC call, the RPC client interface
has been augmented to take a bool* argument which is set to true after
the send part of the RPC has completed but before the recv part starts.
If DoRpc() catches an exception and the send part isn't done yet, the
entire RPC is retried if the exception string matches certain substrings
which are safe to retry.
The fault injection utility has also been updated to distinguish between
timeouts and lost connections, to exercise different error handling
in the send and recv paths.
Change-Id: I8243d4cac93c453e9396b0e24f41e147c8637b8c
Reviewed-on: http://gerrit.cloudera.org:8080/7229
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes three issues:
1. File handle caching is expected to be disabled for
remote files (using exclusive HDFS file handles),
however the file handles are still being cached.
2. The retry logic for exclusive file handles is broken,
causing the number of open files to be incorrect.
3. There is no test coverage for disabling the file
handle cache.
To fix issue #1, when a scan range is requesting an
exclusive file handle from the cache, it will always
request a newly opened file handle. It will also destroy
the file handle when the scan range is closed.
To fix issue #2, exclusive file handles will no longer
retry IOs. Since the exclusive file handle is always
a fresh file handle, it will never have a bad file
handle from the cache. This returns the logic to
its state before IMPALA-4623 in these cases. If a
file handle is borrowed from the cache, then the
code will continue to retry once with a fresh handle.
To fix issue #3, custom_cluster/test_hdfs_fd_caching.py
now does both positive and negative tests for the file
handle cache. It verifies that setting
max_cached_file_handles to zero disables caching. It
also verifies that caching is disabled on remote
files. (This change will resolve IMPALA-5390.)
Change-Id: I4c03696984285cc9ce463edd969c5149cd83a861
Reviewed-on: http://gerrit.cloudera.org:8080/7181
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Impala is case insensitive for column names and generally deals
with them in all lower case. Kudu is case sensitive. This can
lead to problems when a table is created externally in Kudu
with column names containing upper case letters.
This patch solves the problem by having KuduColumn always store
its name in lower case, so that general Impala code that has been
written expecting lower cased column names can use Column.getName()
safely.
It also adds the method KuduColumn.getKuduName(), which returns
the column name in the case that it appears in Kudu. Any code that
passes column names into the Kudu API must call this method first
to get the correct column name.
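The two-name scheme can be sketched as a simplified Python model of the
Java KuduColumn class:

```python
class KuduColumn:
    """Simplified model: store the name lower-cased for Impala's
    case-insensitive code, but keep the exact Kudu-side casing around."""

    def __init__(self, kudu_name: str):
        self._kudu_name = kudu_name     # casing as it appears in Kudu
        self._name = kudu_name.lower()  # what general Impala code sees

    def get_name(self) -> str:
        return self._name

    def get_kudu_name(self) -> str:
        # Required whenever a name is passed back into the Kudu API.
        return self._kudu_name

col = KuduColumn("UserId")
print(col.get_name())       # userid
print(col.get_kudu_name())  # UserId
```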
There are four specific situations fixed by this patch:
- When ordering on a Kudu column, the Analyzer would create
two SlotDescriptors that point to the same column because
registerSlotRef() was being called with inconsistent casing.
It is now always called with the lower cased names.
- 'ADD RANGE PARTITION' would fail to find the range partition
column if it isn't all lower case in Kudu.
- 'ALTER TABLE DROP COLUMN' and 'ALTER TABLE CHANGE' only worked
if the column name was specified in Kudu case.
- 'CREATE EXTERNAL TABLE' called on a Kudu table with column names
that differ only in case now returns an error, since Impala has
no way of handling this situation.
Testing:
- Added e2e tests in test_kudu.py.
- Manually edited functional_kudu to change column names to have
mixed casing and ran the kudu tests.
Change-Id: I14aba88510012174716691b9946e1c7d54d01b44
Reviewed-on: http://gerrit.cloudera.org:8080/6902
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
This change avoids printing blank lines when the Impala
shell fetches 0 rows from a statement.
Change-Id: I6e18ce36be07ee90a16b007b1e30d5255ef8a839
Reviewed-on: http://gerrit.cloudera.org:8080/7055
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The null_count in the statistics field is updated each time a null
value is encountered by the Parquet table writer. The value is written
to the Parquet header if the row group has one or more null values.
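The bookkeeping amounts to a per-row-group null counter; a minimal sketch
(not the actual C++ writer code):

```python
def row_group_null_count(values):
    # Count null (None) values as the writer encounters them; the result is
    # recorded in the row group's column statistics when it is one or more.
    null_count = 0
    for v in values:
        if v is None:
            null_count += 1
    return null_count

print(row_group_null_count([1, None, 3, None, None]))  # 3
print(row_group_null_count([1, 2, 3]))                 # 0
```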
Testing: Modified the existing end-to-end test in the
test_insert_parquet.py file to make sure each parquet header has
the appropriate null_count. Verified the correctness of the nulltable
test and added an additional test which populates a parquet file with
the functional_parquet.zipcode_incomes table and ensures that the
expected null_count is populated.
Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Reviewed-on: http://gerrit.cloudera.org:8080/7058
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds the command line option --ca_cert to the common test
infra CLI options for use alongside --use-ssl. This is useful when
testing against a secured Impala cluster in which the SSL certs are
self-signed. This will allow the SSL request to be validated. Using this
option will also suppress noisy console warnings like:
InsecureRequestWarning: Unverified HTTPS request is being made. Adding
certificate verification is strongly advised. See:
https://urllib3.readthedocs.org/en/latest/security.html
We also go further in this patch and use the warnings module to print
these SSL-related warnings once and only once, instead of all over the
place. In the case of the stress test, this greatly reduces the noise in
the console log.
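Python's warnings module supports print-once behavior directly via the
"once" filter, which reports each distinct warning only the first time it
is issued:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # "once" suppresses repeats of the same warning message/category.
    warnings.simplefilter("once")
    for _ in range(5):
        warnings.warn("Unverified HTTPS request is being made.", UserWarning)

print(len(caught))  # 1
```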
Testing:
- quick concurrent_select.py calls with and without --ca_cert to observe
that connections still get made and the test runs smoothly. Some of
this testing occurred without warning suppression, so that I could be
sure the InsecureRequestWarnings were not occurring when using
--ca_cert anymore.
- ensured warnings are printed once, not multiple times
Change-Id: Ifb9e466e4b7cde704cdc4cf98159c068c0a400a9
Reviewed-on: http://gerrit.cloudera.org:8080/7152
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
This patch includes a change to the framework to permit the passing
of a username to the run_stmt_in_hive() method in the ImpalaTestSuite
class, but retains the same default value as before.
This is to allow a test to issue a 'select count(*) from foo' query
through hive. Hive needs to set up a job to perform this query, and
HDFS write access to do so. In typical cases, the HDFS user is 'hdfs';
however, it may be necessary to change this depending on the cluster.
On a local mini-cluster, the username appears to be irrelevant, so
this won't affect locally run tests.
Tested by running the core set of tests on a local minicluster to
make sure there were no regressions. Also confirmed that the test
in question now passes on a remote physical cluster.
Change-Id: I1cc8824800e4339874b9c4e3a84969baf848d941
Reviewed-on: http://gerrit.cloudera.org:8080/7046
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
Bug:
When Sentry-based authorization is enabled, a user that isn't authorized
to EXPLAIN a statement that uses a view can still access unauthorized
information, such as the view's definition, by running the statement and
asking for the query profile or the execution summary.
Fix:
During query compilation, determine if the user can access the runtime
profile or the execution summary. Upon a user's request for a runtime
profile or execution summary, determine, based on that information and
the requesting user, whether the runtime profile (or execution summary)
or an authorization error is returned.
The authorization rule enforced is the following:
- User A runs statement S, A asks for profile, A has profile access:
Runtime profile is returned
- User A runs statement S, A asks for profile, A doesn't have profile access:
Authorization error
- User A runs statement S, user B asks for profile:
Authorization error.
This patch doesn't enforce access to the runtime profile or execution summary
through the Web UI.
Change-Id: I2255d587367c2d328590ae8534a5406c4b0c9b15
Reviewed-on: http://gerrit.cloudera.org:8080/7064
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This fixes three issues with the file handle cache.
The first issue is that ReopenCachedHdfsFileHandle can
destroy the passed in file handle without removing the
reference to it. The old file handle then refers to
a piece of memory that is not a handle in the cache,
so future use of the handle fails with an assert. The
fix is to always overwrite the reference to the file
handle when it has been destroyed.
The second issue is that query_test/test_hdfs_fd_caching.py
should run on anything that supports the hdfs commandline
and tolerate query failure. Its logic is not specific to
file handle caching, so it has been renamed to
query_test/test_hdfs_file_mods.py.
Finally, custom_cluster/test_hdfs_fd_caching.py should not
be running on remote files (S3, ADLS, Isilon, remote
clusters). The file handle cache semantics won't apply on
those platforms.
Change-Id: Iee982fa5e964f6c8969b2eb7e5f3eca89e793b3a
Reviewed-on: http://gerrit.cloudera.org:8080/7020
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, DoRpc() blacklisted only a couple of conditions
under which the RPC shouldn't be retried on exception. This is fragile
as the errors could have happened after the payload has been
successfully sent to the destination. Such aggressive retry
behavior can lead to duplicated row batches being sent, causing
wrong results in queries.
This change fixes the problem by whitelisting the conditions
in which the RPC can be retried. Specifically, it pattern-matches
against certain errors in TSocket::write_partial() in the thrift
library and only retries the RPC in those cases. With SSL enabled,
we will never retry. We should investigate whether there are some
cases in which it's safe to retry.
This change also adds fault injection in the TransmitData() RPC
caller's path to emulate different exception cases.
Change-Id: I176975f2aa521d5be8a40de51067b1497923d09b
Reviewed-on: http://gerrit.cloudera.org:8080/7063
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This patch sets the default --cm-port (for the CM ApiResource
initialization) based on a new flag, --use-tls, which enables test infra
to talk to CM clusters with TLS enabled. It is still possible to set a
port override, but in general it will not be needed.
Reference:
https://cloudera.github.io/cm_api/epydoc/5.4.0/cm_api.api_client.ApiResource-class.html#__init__
Testing:
Connected both to TLS-disabled and TLS-enabled CM instances. Before this
patch, we would fail hard when trying to talk to the TLS-enabled CM
instance.
Change-Id: Ie7dfa6c400687f3c5ccaf578fd4fb17dedd6eded
Reviewed-on: http://gerrit.cloudera.org:8080/7107
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This change executes the tests added to subplans.test and removes
a test which incorrectly references subplannull_data.test (a file
which does not exist).
Change-Id: I02b4f47553fb8f5fe3425cde2e0bcb3245c39b91
Reviewed-on: http://gerrit.cloudera.org:8080/7038
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
After a single Impalad is restarted, it is possible that the order in
which it receives roles and privileges from the statestore is incorrect. The
correct order is for the role to appear first in the update, before the
privilege that references it.
If a user updates a role, its catalog version number can become larger
than the catalog numbers of the privileges that reference it. This
causes the role to come after the privilege in the initial metastore
update.
The issue is fixed by doing two passes over the catalog objects in the
Impalad. The first pass updates the top level objects. The second pass
updates the dependent objects.
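The two-pass update can be sketched as follows (a simplified model; the
object kinds and field names are illustrative, not the catalog's actual
types):

```python
def apply_catalog_update(objects, catalog):
    # Pass 1: top-level objects (roles), so later references can resolve.
    for obj in objects:
        if obj["kind"] == "role":
            catalog["roles"][obj["name"]] = obj
    # Pass 2: dependent objects (privileges that reference a role).
    for obj in objects:
        if obj["kind"] == "privilege":
            assert obj["role"] in catalog["roles"], "role must precede privilege"
            catalog["privileges"].append(obj)
    return catalog

# A privilege arriving before its role no longer breaks the update:
update = [{"kind": "privilege", "role": "admin", "priv": "ALL"},
          {"kind": "role", "name": "admin"}]
catalog = apply_catalog_update(update, {"roles": {}, "privileges": []})
print("admin" in catalog["roles"])  # True
```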
Testing:
- Added a test that reproduced the problem.
Change-Id: I7072e95b74952ce5a51ea1b6e2ae3e80fb0940e0
Reviewed-on: http://gerrit.cloudera.org:8080/7004
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
On developer machines it can happen that /tmp/minidumps does not exist
when test_minidump_relative_path is executed. In this case, errors from
rmtree should be ignored.
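shutil.rmtree supports this directly via ignore_errors:

```python
import os
import shutil

path = "/tmp/minidumps-nonexistent-for-demo"
# ignore_errors=True makes rmtree a no-op instead of raising when the
# directory (e.g. /tmp/minidumps) does not exist.
shutil.rmtree(path, ignore_errors=True)
print(os.path.exists(path))  # False
```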
Change-Id: Ifab76a30898805d2df5e7452079a536d8747ac50
Reviewed-on: http://gerrit.cloudera.org:8080/7062
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
While support for TIMESTAMP columns in Kudu tables has been
committed (IMPALA-5137), that change does not support TIMESTAMP
column default values.
This supports CREATE TABLE syntax to specify the default
values, but more importantly this fixes the loading of Kudu
tables that may have had default values set on
UNIXTIME_MICROS columns, e.g. if the table was created via
the python client. This involves fixing KuduColumn to hide
the LiteralExpr representing the default value because it
will be a BIGINT if the column type is TIMESTAMP. The default value is
only needed for toSql() and toStringValue(), so helper
functions are added to KuduColumn to encapsulate the special
logic for TIMESTAMP.
TODO: Add support and tests for ALTER setting the default
value (when IMPALA-4622 is committed).
Change-Id: I655910fb4805bb204a999627fa9f68e43ea8aaf2
Reviewed-on: http://gerrit.cloudera.org:8080/6936
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This is a migration from an old and broken script from another
repository. Example use:
bin/single_node_perf_run.py --ninja --workloads targeted-perf \
--load --scale 4 --iterations 20 --num_impalads 3 \
--start_minicluster --query_names PERF_AGG-Q3 \
$(git rev-parse HEAD~1) $(git rev-parse HEAD)
The script can load data, run benchmarks, and compare the statistics
of those runs for significant differences in performance. It glues
together buildall.sh, bin/load-data.py, bin/run-workload.py, and
tests/benchmark/report_benchmark_results.py.
Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9
Reviewed-on: http://gerrit.cloudera.org:8080/6818
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
The PARQUET_FILE_SIZE query option doesn't work with ADLS because the
AdlFileSystem doesn't have a notion of block sizes, and Impala depends
on the filesystem remembering the block size, which is then used as the
target Parquet file size (this is done for HDFS so that the Parquet file
size and block size match even if PARQUET_FILE_SIZE isn't a valid
block size).
We special-case ADLS just like we do S3 to bypass the
FileSystem block size and instead use the requested
PARQUET_FILE_SIZE as the output partition's block_size (and consequently
the target Parquet file size).
Testing: Re-enabled test_insert_parquet_verify_size() for ADLS.
Also fixed a miscellaneous bug with the ADLS client listing helper function.
Change-Id: I474a913b0ff9b2709f397702b58cb1c74251c25b
Reviewed-on: http://gerrit.cloudera.org:8080/7018
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
This change loads the missing tables in TPC-DS. In addition,
it also fixes up the loading of the partitioned table store_sales
so all partitions will be loaded. The existing TPC-DS queries are
also updated to use the parameters for qualification runs as noted
in the TPC-DS specification. Some hard-coded partition filters were
also removed. They were there due to the lack of dynamic partitioning
in the past. Some missing TPC-DS queries are also added to this change,
including query28 which discovered the infamous IMPALA-5251.
Having all tables in TPC-DS available paves the way for us to include
all supported TPC-DS queries in our functional testing. Due to the change
in the data, planner tests and the E2E tests have different results than
before. The results of the E2E tests were compared against runs done with
Netezza and Vertica. The divergences were all due to the truncation behavior
of decimal types in DECIMAL_V1.
Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Reviewed-on: http://gerrit.cloudera.org:8080/6877
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, every scan range maintains a file handle, even
when multiple scan ranges are accessing the same file.
Opening the file handles causes load on the NameNode, which
can lead to scaling issues.
There are two parts to this change:
1. Enable file handle caching by default for local files
2. Share the file handle between scan ranges from the same
file
Local scan ranges no longer maintain their own Hdfs file
handles. On each read, the io thread will get the Hdfs file
handle from the cache (opening it if necessary) and use
that for the read. This allows multiple scan ranges on the
same file to use the same file handle. Since the file
offsets are no longer consistent for an individual scan
range, all Hdfs reads need to either use hdfsPread or do
a seek before reading. Additionally, since Hdfs read
statistics are maintained on the file handle, the read
statistics must be retrieved and cleared after each read.
To manage contention, the file handle cache is now
partitioned by a hash of the key into independent
caches with independent locks. The allowed capacity
of the file handle cache is split evenly among the
partitions. File handles are evicted independently
for each partition. The file handle cache maintains
ownership of the file handles at all times, but it
will not evict a file handle that is in use.
If max_cached_file_handles is set to 0, or the
scan range is accessing data cached by HDFS, or the
scan range is remote, the scan range will get a
file handle from the cache and hold it until the
scan range is closed. This mimics the existing behavior,
except the file handle stays in the cache and is owned
by the cache. Since it is in use, it will not be evicted.
If a file handle in the cache becomes invalid,
it may result in Read() calls failing. Consequently,
if Read() encounters an error using a file handle
from the cache, it will destroy the handle and
retry once with a new file handle. Any subsequent
error is unrelated to the file handle cache and
will be returned.
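The partitioning scheme can be sketched as a simplified Python model (not
the actual DiskIoMgr code; eviction here is a placeholder, and the in-use
protection is omitted):

```python
class PartitionedHandleCache:
    """Simplified model: the cache is split into independently locked
    partitions chosen by a hash of the (file, mtime) key, and the total
    capacity is divided evenly among the partitions."""

    def __init__(self, capacity, num_partitions=16):
        self.partitions = [{} for _ in range(num_partitions)]
        self.per_partition_capacity = max(1, capacity // num_partitions)

    def _partition(self, key):
        return hash(key) % len(self.partitions)

    def get(self, key, open_fn):
        part = self.partitions[self._partition(key)]
        if key not in part:
            # Evict within this partition only, never across partitions.
            if len(part) >= self.per_partition_capacity:
                part.pop(next(iter(part)))  # placeholder eviction policy
            part[key] = open_fn(key)
        return part[key]

cache = PartitionedHandleCache(capacity=32)
h1 = cache.get(("/data/f1", 111), lambda k: "handle:" + k[0])
h2 = cache.get(("/data/f1", 111), lambda k: "handle:" + k[0])
print(h1 is h2)  # True: same file handle reused across scan ranges
```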
Tests:
query_test/test_hdfs_fd_caching.py copies the files from
an existing table into a new directory and uses that to
create an external table. It queries the external table,
then uses the hdfs commandline to manipulate the hdfs file
(delete, move, etc). It queries again to make sure we
don't crash. Then, it runs "invalidate metadata". It
checks the row counts before the modification and after
"invalidate metadata", but it does not check the results
in between.
custom_cluster/test_hdfs_fd_caching.py starts up a cluster
with a small file handle cache size. It verifies that a
file handle can be reused (i.e. rerunning a query does
not result in more file handles cached). It also verifies
that the cache capacity is enforced.
Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Reviewed-on: http://gerrit.cloudera.org:8080/6478
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
The main idea of this patch is to use table stats to
extrapolate the row counts for new/modified partitions.
Existing behavior:
- Partitions that lack the row count stat are ignored
when estimating the cardinality of HDFS scans. Such
partitions effectively have an estimated row count
of zero.
- We always use the row count stats for partitions that
have one. The row count may be inaccurate if data in
such partitions has changed significantly.
Summary of changes:
- Enhance COMPUTE STATS to also store the total number
of file bytes in the table.
- Use the table-level row count and file bytes stats
to estimate the number of rows in a scan.
- A new impalad startup flag is added to enable/disable
the extrapolation behavior. The feature is disabled by
default. Note that even with the feature disabled,
COMPUTE STATS stores the file bytes so you can enable
the feature without having to run COMPUTE STATS again.
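The extrapolation amounts to a simple proportion between rows and file bytes. A minimal sketch, assuming the stats follow the usual convention of -1 meaning "missing" (names are illustrative, not Impala's actual code):

```python
def extrapolate_num_rows(stats_num_rows, stats_total_bytes, scan_bytes):
    """Estimate the rows in a scan by assuming rows are proportional
    to file bytes: rows_per_byte * bytes_to_scan (sketch)."""
    if stats_num_rows < 0 or stats_total_bytes <= 0:
        return -1  # stats missing; no estimate possible
    rows_per_byte = stats_num_rows / float(stats_total_bytes)
    return int(round(rows_per_byte * scan_bytes))
```

This is why storing the total file bytes alongside the row count is enough: new or modified partitions contribute bytes, and the table-level ratio converts those bytes into an estimated row count.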
Testing:
- Added new FE unit test
- Added new EE test
Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/6840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This patch leverages the AdlFileSystem in Hadoop to allow
Impala to talk to the Azure Data Lake Store. This patch has
functional changes as well as adds test infrastructure for
testing Impala over ADLS.
We do not support ACLs on ADLS since the Hadoop ADLS
connector does not integrate ADLS ACLs with Hadoop users/groups.
For testing, we use the azure-data-lake-store-python client
from Microsoft. This client seems to have some consistency
issues. For example, a drop table through Impala will delete
the files in ADLS; however, listing that directory through
the python client immediately after the drop will still show
the files. This behavior is unexpected since ADLS claims to be
strongly consistent. Some tests have been skipped due to this
limitation with the tag SkipIfADLS.slow_client. Tracked by
IMPALA-5335.
The azure-data-lake-store-python client also only works on CentOS 6.6
and newer, so the python dependencies for Azure will not be downloaded
when the TARGET_FILESYSTEM is not "adls". ADLS tests are expected
to run on a machine running CentOS 6.6 or newer.
Note: This is only a test limitation, not a functional one. Clusters
with older OSes like CentOS 6.4 will still work with ADLS.
Added another dependency to bootstrap_build.sh for the ADLS Python
client.
Testing: Ran core tests with and without TARGET_FILESYSTEM as
'adls' to make sure that all tests pass and that nothing breaks.
Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542
Reviewed-on: http://gerrit.cloudera.org:8080/6910
Tested-by: Impala Public Jenkins
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Before this patch, Impala relied on INVALIDATE METADATA to load
externally added UDFs from HMS. The problem with this approach is that
INVALIDATE METADATA affects all databases and tables in the entire
cluster.
In this patch, we add a REFRESH FUNCTIONS <db> statement that reloads
the functions of a database from HMS. We return a list of updated and
removed db functions to the issuing Impalad in order to update its
local catalog cache.
Testing:
- Ran a private build which passed.
Change-Id: I3625c88bb51cca833f3293c224d3f0feb00e6e0b
Reviewed-on: http://gerrit.cloudera.org:8080/6878
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
Holding client_request_state_map_lock_ and CRS::lock_ together in certain
paths could potentially block the impalad from registering new queries.
The most common occurrence of this is while loading the webpage of a
query while the query planning is still in progress. Since we hold the
CRS::lock_ during planning, it blocks the web page from loading, which
in turn blocks incoming queries by holding client_request_state_map_lock_.
This patch makes client_request_state_map_lock_ a terminal lock so that
we don't have interleaving locking with CRS::lock_.
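A terminal lock is one under which no other lock is ever acquired, so it can never participate in a blocking chain. The discipline can be sketched like this (illustrative Python; the real code is C++ in the impalad):

```python
import threading

class ClientRequestStateMap:
    """Sketch: the map lock is terminal -- hold it only for the
    lookup or insert itself, never across per-query work."""
    def __init__(self):
        self._lock = threading.Lock()  # terminal: no lock taken under it
        self._map = {}

    def register(self, query_id, crs):
        with self._lock:
            self._map[query_id] = crs

    def get(self, query_id):
        # Release the map lock before the caller touches CRS::lock_,
        # so a slow query (e.g. still planning) can't block registration.
        with self._lock:
            return self._map.get(query_id)
```

With this ordering, the web page handler may still wait on a query's own lock, but new queries can always register.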
Testing: Tested it locally by adding a long sleep in
JniFrontend.createExecRequest() and still was able to refresh the web UI
and run parallel queries. Also added a custom cluster test that does the
same sequence of actions by injecting a metadata loading pause.
Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
Reviewed-on: http://gerrit.cloudera.org:8080/6707
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins
Syntax:
<tableref> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]
The first number specifies the percent of table bytes to sample.
The second number specifies the random seed to use.
The sampling is coarse-grained. Impala keeps randomly adding
files to the sample until at least the desired percentage of
file bytes has been reached.
Examples:
SELECT * FROM t TABLESAMPLE SYSTEM(10)
SELECT * FROM t TABLESAMPLE SYSTEM(50) REPEATABLE(1234)
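The coarse-grained selection can be sketched as follows (a sketch, not the planner's actual code; REPEATABLE supplies the seed):

```python
import random

def sample_files(files, percent, seed=None):
    """files: list of (path, size_bytes) tuples. Randomly add whole
    files until at least `percent` of total bytes is covered (sketch)."""
    total_bytes = sum(size for _, size in files)
    target_bytes = total_bytes * percent / 100.0
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)  # seed makes it repeatable
    selected, covered = [], 0
    for path, size in shuffled:
        if covered >= target_bytes:
            break
        selected.append(path)
        covered += size
    return selected
```

Because whole files are added, the sample can overshoot the requested percentage; it is a lower bound on bytes, not an exact fraction.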
Testing:
- Added parser, analyser, planner, and end-to-end tests
- Private core/hdfs run passed
Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Reviewed-on: http://gerrit.cloudera.org:8080/6868
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
We use the new libHDFS API hdfsGetLastExceptionRootCause() to return
the last seen HDFS error on that thread.
This patch depends on the recent HDFS commit:
fda86ef2a3
Testing: A test has been added which puts HDFS in safe mode and then
verifies that we see a 255 error with the root cause.
Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d
Reviewed-on: http://gerrit.cloudera.org:8080/6894
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
The sortby() hint is superseded by the SORT BY SQL clause, which has
been introduced in IMPALA-4166. This change removes the hint.
Change-Id: I83e1cd6fa7039035973676322deefbce00d3f594
Reviewed-on: http://gerrit.cloudera.org:8080/6885
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, updates to the query state in ClientRequestState were
not immediately reflected in the query profile, potentially
leading to the profile showing an incorrect state for an extended
period during execution.
In particular, queries were being shown in the 'CREATED' state
long after they had started 'RUNNING'.
The fix is to update the profile whenever the state is updated.
Testing:
- Extended existing hs2 tests and added a beeswax test to check
for expected query states in the profile
Change-Id: I952319b7308a24d4e2dff924199c0c771bce25b3
Reviewed-on: http://gerrit.cloudera.org:8080/6923
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
By default, Kudu assumes it has 80% of system memory which
is far too high for the minicluster. This sets a mem limit
of 2gb and lowers the limit of the block cache. These values
were tested on a gerrit-verify-dryrun job as well as an
exhaustive run.
This patch also simplifies TestKuduMemLimits which was
unnecessarily creating a large table during test execution.
Change-Id: I7fd7e1cd9dc781aaa672a2c68c845cb57ec885d5
Reviewed-on: http://gerrit.cloudera.org:8080/6844
Reviewed-by: Todd Lipcon <todd@apache.org>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This change switches to a new Breakpad version, which includes fixes for
Breakpad bugs #681 and #728. The toolchain change was reviewed here:
https://gerrit.cloudera.org/6866
The change also undoes the workaround introduced in IMPALA-3794.
In addition to running test_breakpad.py in a loop for a while, I
verified that the test fails with the old toolchain version
(88e5b2) and works with the new one (ffe3e4).
To test #728 I added a sleep() call before SendContinueSignalToChild()
and then killed the parent process, manually observing that the child
would die, too.
Change-Id: Ic541ccd565f2bb51f68c085747fc47ae8c905d19
Reviewed-on: http://gerrit.cloudera.org:8080/6883
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
A memory intensive UDF test takes a while to completely finish and for
the memory in Impala to be completely freed. This caused a problem in
ASAN builds (and potentially in normal builds) because we would start
the next test right away, before the memory is freed.
We fix the issue by checking that all fragments finish executing before
starting the next test.
Testing:
- Ran a private ASAN build which passed.
Change-Id: I0555b5327945c522f70f449caa1214ee0bfd84fe
Reviewed-on: http://gerrit.cloudera.org:8080/6893
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
A test that was recently added, test_observability::test_scan_summary,
uses an HBase table. It needs to be restricted not to run on S3,
localFS or Isilon.
Change-Id: I9863cf3f885eb1d2152186de34e093497af83d99
Reviewed-on: http://gerrit.cloudera.org:8080/6859
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Adds Impala support for TIMESTAMP types stored in Kudu.
Impala stores TIMESTAMP values in 96-bits and has nanosecond
precision. Kudu's timestamp is a 64-bit microsecond delta
from the Unix epoch (called UNIXTIME_MICROS), so a conversion
is necessary.
When writing to Kudu, TIMESTAMP values in nanoseconds are
rounded to the nearest microsecond.
When reading from Kudu, the KuduScanner returns
UNIXTIME_MICROS with 8 bytes of padding so Impala can convert
the value to a TimestampValue in-line and copy the entire
row.
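The write-path conversion can be sketched as integer rounding from nanoseconds to Kudu's UNIXTIME_MICROS (illustrative; the real conversion is done in the C++ Kudu table sink):

```python
def ns_to_unixtime_micros(unix_time_ns):
    """Round a nanosecond Unix timestamp to the nearest microsecond,
    Kudu's UNIXTIME_MICROS representation (sketch)."""
    if unix_time_ns >= 0:
        return (unix_time_ns + 500) // 1000
    # Round negative values (pre-epoch) symmetrically, away from zero.
    return -((-unix_time_ns + 500) // 1000)
```

On the read path the reverse is lossless: every microsecond value is exactly representable in Impala's 96-bit nanosecond-precision TIMESTAMP.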
Testing:
Updated the functional_kudu schema to use TIMESTAMPs instead
of converting to STRING, so this provides some decent
coverage. Some BE tests were added, and some EE tests as
well.
TODO: Support pushing down TIMESTAMP predicates
TODO: Support TIMESTAMPs in range partitioning expressions
Change-Id: Iae6ccfffb79118a9036fb2227dba3a55356c896d
Reviewed-on: http://gerrit.cloudera.org:8080/6526
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
For scan nodes, previously only HDFS tables showed the name
of the table in the 'Detail' section for the scan node. This
change adds the table name for all scan node types (Kudu,
HBase, and DataSource).
Testing:
- Added an e2e test in test_observability.
Change-Id: If4fd13f893aea4e7df8a2474d7136770660e4324
Reviewed-on: http://gerrit.cloudera.org:8080/6832
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds functionality to write and read parquet::Statistics for
Decimal, String, and Timestamp values. As an exception, we don't read
statistics for CHAR columns, since CHAR support is broken in Impala
(IMPALA-1652).
This change also switches from using the deprecated fields 'min' and
'max' to populate the new fields 'min_value' and 'max_value' in
parquet::Statistics, which were added in parquet-format pull request #46.
The HdfsParquetScanner preferentially reads the new fields if they are
populated and if the column order 'TypeDefinedOrder' has been used to
compute the statistics. For columns without a column order set or with
only the deprecated fields populated, the scanner will read them only if
they are of simple numeric type, i.e. boolean, integer, or floating
point.
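The reader's choice between the new and deprecated fields can be sketched as follows (illustrative names; the actual logic is C++ in HdfsParquetScanner):

```python
# Types whose old, unspecified sort order happens to match the
# type-defined order, so the deprecated fields are still trustworthy.
SIMPLE_NUMERIC = {"boolean", "int32", "int64", "float", "double"}

def readable_stats(stats, column_order_is_type_defined, physical_type):
    """Return the (min, max) pair to use, or None if the statistics
    cannot be trusted for this column (sketch). `stats` is a dict
    standing in for parquet::Statistics."""
    if (stats.get("min_value") is not None
            and stats.get("max_value") is not None
            and column_order_is_type_defined):
        # New fields plus TypeDefinedOrder: safe for any type.
        return stats["min_value"], stats["max_value"]
    if (stats.get("min") is not None and stats.get("max") is not None
            and physical_type in SIMPLE_NUMERIC):
        # Deprecated fields: only trust simple numeric types.
        return stats["min"], stats["max"]
    return None
```

This is exactly the fallback path the Hive-written test file exercises: only the deprecated fields are populated, so statistics are usable for numeric columns and ignored for everything else.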
This change removes the validation of the Parquet statistics written
by Hive from the tests, since Hive does not write the new fields. Instead
it adds a parquet file written by Hive that uses the deprecated fields
for its statistics. It uses that file to exercise the fallback logic for
supported types in a test.
This change also cleans up the interface of ParquetPlainEncoder in
parquet-common.h.
Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Reviewed-on: http://gerrit.cloudera.org:8080/6563
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
The Parquet file column reader may fail in the middle
of producing a scratch tuple batch for various reasons
such as exceeding the memory limit or cancellation. In that
case, the scratch tuple batch may not have materialized
all the rows in a row group. We shouldn't erroneously
report that the file is corrupted, since the
column reader didn't completely read the entire row group.
A new test case is added to verify that we won't see this
error message. A new failpoint phase GETNEXT_SCANNER is
also added to differentiate it from the GETNEXT in the
scan node itself.
Change-Id: I9138039ec60fbe9deff250b8772036e40e42e1f6
Reviewed-on: http://gerrit.cloudera.org:8080/6787
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Coordinator:
- FragmentInstanceState -> BackendState, which in turn records
FragmentInstanceStats
QueryState:
- does query-wide setup in a separate thread (which also launches
the instance exec threads)
- has a query-wide 'prepared' state at which point all static setup
is done and all FragmentInstanceStates are accessible
Also renamed QueryExecState to ClientRequestState.
Simplified handling of execution status (in FragmentInstanceState):
- status only transmitted via ReportExecStatus rpc
- in particular, it's not returned anymore from the Cancel rpc
FIS: Fixed bugs related to partially-prepared state (in Close() and ReleaseThreadToken())
Change-Id: I20769e420711737b6b385c744cef4851cee3facd
Reviewed-on: http://gerrit.cloudera.org:8080/6535
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes IMPALA-4873 by adding the capability to supply a dict
'test_file_vars' to run_test_case(). Keys in this dict will be replaced
with their values inside test queries before they are executed.
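The substitution amounts to a simple replace pass over the query text before execution. A minimal sketch of the behavior (function name is illustrative):

```python
def apply_test_file_vars(query, test_file_vars):
    """Replace each key of test_file_vars with its value in the
    query text before the query is executed (sketch)."""
    if test_file_vars:
        for name, value in test_file_vars.items():
            query = query.replace(name, value)
    return query
```

This lets a test parameterize its .test file with placeholders (for example a unique database name) that are resolved per test run.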
Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70
Reviewed-on: http://gerrit.cloudera.org:8080/6817
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
This was a poorly written test that relies on assumptions about
the behavior of 'rand' and the order that rows get processed in
a table that Impala doesn't actually guarantee.
The new version is still sensitive to the precise behavior of
'rand()', but shouldn't be flaky unless that behavior is changed.
Change-Id: If1ba8154c2b6a8d508916d85391b95885ef915a9
Reviewed-on: http://gerrit.cloudera.org:8080/6775
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.
After this change:
Impala reads Parquet MR timestamp data and adjusts values using a time
zone from a table property (parquet.mr.int96.write.zone), if set, and
does not adjust them if the property is absent. No adjustment is
applied to data written by Impala.
New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE <file> will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.
HDFS tables created by Impala using CREATE TABLE LIKE <other table>
will copy the property of the table that is copied.
This change also affects the way Impala deals with
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag is taken into account only if the
parquet.mr.int96.write.zone table property is not set, and is
ignored otherwise.
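The resulting decision logic for a Parquet file can be sketched like this (illustrative names; the real check happens in the C++ Parquet scanner):

```python
def int96_adjustment_zone(file_written_by_hive, table_write_zone,
                          convert_legacy_hive_utc_flag, local_zone):
    """Return the time zone used to adjust INT96 timestamps read from
    a Parquet file, or None for no adjustment (sketch)."""
    if not file_written_by_hive:
        return None  # data written by Impala is never adjusted
    if table_write_zone is not None:
        # parquet.mr.int96.write.zone wins over the legacy flag.
        return table_write_zone
    if convert_legacy_hive_utc_flag:
        # --convert_legacy_hive_parquet_utc_timestamps applies only
        # when the table property is absent.
        return local_zone
    return None
```

Note the precedence: the per-table property, when present, completely shadows the cluster-wide legacy flag.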
Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Reviewed-on: http://gerrit.cloudera.org:8080/5939
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins