impala

mirror of https://github.com/apache/impala.git synced 2026-01-06 06:01:03 -05:00

Author	SHA1	Message	Date
Zoram Thanga	b581a9d1ee	IMPALA-6225: Part 2: Query profile date-time strings should have ns precision. This commit follows `16d8dd58`. This patch adds a test case that inspects the thrift profile of a completed query, and verifies that the "Start Time" and "End Time" of the query have nanosecond precision. We chose to work with the thrift profile directly, rather than parse the debug web page, as it is the thrift profile which is consumed by management API clients of Impala. Change-Id: Id3421a34cc029ebca551730084c7cbd402d5c109 Reviewed-on: http://gerrit.cloudera.org:8080/8784 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-21 04:26:33 +00:00
Alex Behm	1f7b3b00e9	IMPALA-5310: Part 3: Use SAMPLED_NDV() in COMPUTE STATS. Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV() function. Testing: - modified/improved existing functional tests - core/hdfs run passed Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b Reviewed-on: http://gerrit.cloudera.org:8080/8840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-12-16 04:58:59 +00:00
Tim Armstrong	dc1282fbc9	IMPALA-6241: timeout in admission control test under ASAN The fix for IMPALA-6241 is to increase the timeout for all slow builds. While testing that fix, I discovered that the ASAN build detection logic was failing silently, resulting in it assuming that it was testing a DEBUG build. The error was: Unexpected DW_AT_name in first CU: /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.9.1.src/projects/compiler-rt/lib/asan/asan_preinit.cc; choosing DEBUG The fix for that issue is to remove the build type detection heuristic and instead just write a file with the build type as part of the build process. Testing: Before this change I was able to reproduce locally every 5-10 test iterations. After this change I haven't seen it reproduce. Change-Id: Ia4ed949cac99b9925f72e19e4adaa2ead370b536 Reviewed-on: http://gerrit.cloudera.org:8080/8652 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-29 03:28:22 +00:00
Thomas Tauber-Marshall	2510fe0aa0	IMPALA-4252: Min-max runtime filters for Kudu This patch implements min-max filters for runtime filters. Each runtime filter generates a bloom filter or a min-max filter, depending on if it has HDFS or Kudu targets, respectively. In RuntimeFilterGenerator in the planner, each hash join node generates a bloom and min-max filter for each equi-join predicate, but only those filters that end up being assigned to a target make it into the final plan. Min-max filters are only assigned to Kudu scans if the target expr is a column, as Kudu doesn't support bounds on general exprs, and only if the join op is '=' and not 'is distinct from', as Kudu doesn't support returning NULLs if a bound is set. Min-max filters are inserted into by the PartitionedHashJoinBuilder. Codegen is used to eliminate branching on the type of filter. String min-max filters truncate their bounds at 1024 chars, so that the max amount of memory used by min-max filters is negligible. For now, min-max filters are only applied at the KuduScanner, which passes them into the Kudu client. Future work will address applying min-max filters at HDFS scan nodes and applying bloom filters at Kudu scan nodes. Functional Testing: - Added new planner tests and updated the old ones. (in old tests, a lot of runtime filters are renumbered as we always generate min-max filters even if they don't end up getting assigned and they take up some of the RF ids). - Updated existing runtime filter tests to work with Kudu. - Added e2e tests for min-max filter specific functionality. Perf Testing: - All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu, timings are averages of 3 runs. - Ran a contrived query with a filter that does not eliminate any rows (full self join of lineitem). The difference in running time was negligible - 24.46s with filters on, 24.15s with filters off for a ~1% slowdown. 
- Ran a contrived query with a filter that eliminates all rows (self join on lineitem with a join condition that never matches). The filters resulted in a significant speedup - 0.26s with filters on, 1.46s with filters off for a ~5.6x speedup. This query is added to targeted-perf. Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c Reviewed-on: http://gerrit.cloudera.org:8080/7793 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-17 21:33:51 +00:00
Vuk Ercegovac	6a2b7a64fb	IMPALA-4704: Turns on client connections when local catalog initialized. Currently, impalad starts beeswax and hs2 servers even if the catalog has not yet been initialized. As a result, client connections see an error message stating that the impalad is not yet ready. This patch changes the impalad startup sequence to wait until the catalog is received before opening beeswax and hs2 ports and starting their servers. Testing: - python e2e tests that start a cluster without a catalog and check that client connections are rejected as expected. Change-Id: I52b881cba18a7e4533e21a78751c2e35c3d4c8a6 Reviewed-on: http://gerrit.cloudera.org:8080/8202 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-13 21:14:14 +00:00
Tim Armstrong	839c45777b	IMPALA-6106: handle comments before set in test parser The tpcds-q22a.test test file has a comment before a "set" command. The regex used to match "set" commands does not handle preceding comments, which are part of the query statement. Testing: Ran the test with the below command and confirmed that DECIMAL_V2 was automatically set back to 0. impala-py.test tests/query_test/test_tpcds_queries.py -k 22a \ --capture=no Change-Id: Id549dd3369dd163f3b3c8fe5685a52e0e6b2d134 Reviewed-on: http://gerrit.cloudera.org:8080/8384 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-10-26 00:35:39 +00:00
Philip Zeyliger	c9740b43d1	IMPALA-5908: Allow SET to unset modified query options. The query 'SET <option>=""' will now unset an option within the session, reverting it to its default state. This change became necessary when "SET" started returning an empty string for unset options which don't have a default. The test infrastructure (impala_test_suite.py) resets options to what it thinks is its defaults, and, when this broke, some ASAN builds started to fail, presumably due to a timing issue with how we re-use connections between tests. Previously, SessionState copied over the default options from the server when the session was created and then mutated that. To support unsetting options at the session layer, this change keeps a pointer to the default server settings, keeps separately the mutations, and overlays the options each time they're requested. Similarly, for configuration overlays that happen per-query, the overlay is now done explicitly, because empty per-query overlay values (key=..., value="") now have no effect. Because "set key=''" is ambiguous between "set to the empty string" and "unset", it's now impossible to set to the empty string, at the session layer, an option that is configured at a previous layer. In practice, this is just debug_action and request_pool. debug_action is essentially an internal tool. For request_pool, this means that setting the default request_pool via impalad command line is now a bad idea, as it can't be cleared at a per-session level. For request_pool, the correct course of action for users is to use placement rules, and to have a default placement rule. Testing: * Added a simple test that triggered this side-effect without this code. Specifically, "impala-python infra/python/env/bin/py.test tests/metadata/test_set.py -s" with the modified set.test triggers. * Amended tests/custom_cluster/test_admission_controller.py; it was useful for testing these code paths. 
* Added cases to query-options-test to check behavior for both defaulted and non-defaulted values. * Added a custom cluster test that checks that overlays are working against * Ran an ASAN build where this was triggering previously. Change-Id: Ia8c383e68064f839cb5000118901dff77b4e5cb9 Reviewed-on: http://gerrit.cloudera.org:8080/8070 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-10-05 03:04:38 +00:00
Tim Wood	1969c56c2f	IMPALA-5986: Correct set-option logic to recognize digits in names. Arose during work for IMPALA-5376; prevents tests from passing consistently. Change-Id: Ia3ba641553ff827dbd4673b9fe7ed7d9d5e83052 Reviewed-on: http://gerrit.cloudera.org:8080/8166 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-09-29 01:50:31 +00:00
Tianyi Wang	34d63e9dea	IMPALA-3516: Avoid writing to /tmp in testing Currently some parts of the tests write to /tmp: 1. PlannerTest result files are written to /tmp/PlannerTest 2. FE tests load libfesupport, which writes logs to /tmp 3. Updated results in EE tests (run-tests.py --update_results) are written to /tmp This patch changes them to write to $IMPALA_HOME/logs. Specifically: 1. PlannerTest result files are written to $IMPALA_FE_TEST_LOGS_DIR/PlannerTest 2. libfesupport logs are written to $IMPALA_FE_TEST_LOGS_DIR 3. Updated EE test results are written to $IMPALA_EE_TEST_LOGS_DIR Change-Id: I9e503eb7d333c1b89dc8aea87cf30504838c44f9 Reviewed-on: http://gerrit.cloudera.org:8080/8047 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-09-13 07:36:04 +00:00
Tim Armstrong	b1edaf215e	IMPALA-5902: add ThreadSanitizer build This is sufficient to get Impala to come up and run queries with thread sanitizer enabled. I have not triaged or fixed the data races that are reported, that is left for follow-on work. Change-Id: I22f8faeefa5e157279c5973fe28bc573b7606d50 Reviewed-on: http://gerrit.cloudera.org:8080/7977 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-09-07 01:22:41 +00:00
Tim Armstrong	caefd86136	IMPALA-5830: SET_DENY_RESERVATION_PROBABILITY test Add a targeted test that confirms that setting the query option will force spilling. Testing: Ran test_spilling locally. Change-Id: Ida6b55b2dee0779b1739af5d75943518ec40d6ce Reviewed-on: http://gerrit.cloudera.org:8080/7809 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-08-29 23:01:10 +00:00
Dan Hecht	5f323124ae	IMPALA-4990: fix run_tests.py --update_results Seems to have broken with some recent commits. Change-Id: I9c22e197662228158d7935ebfb12d9b3691eb499 Reviewed-on: http://gerrit.cloudera.org:8080/6151 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-08-23 04:49:13 +00:00
Matthew Jacobs	4456ead841	IMPALA-5760: Revert IMPALA-4795 Revert commit `3059024bd8` for IMPALA-4795: Allow fetching function obj from catalog using signature This commit seems to cause TestUdfExecution.test_java_udfs to fail periodically. IMPALA-4795 wasn't a critical fix, so let's just revert it until we know we can fix the flaky test at the same time. Change-Id: Iae56a75e8ec44af6dae50f18869a486e5f8b608c Reviewed-on: http://gerrit.cloudera.org:8080/7616 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Impala Public Jenkins	2017-08-08 22:13:11 +00:00
Tim Armstrong	507bd8be7e	IMPALA-4674: Part 1: remove old aggs and joins This is intended to be merged at the same time as Part 2 but is separated out to make the change more reviewable. Part 2 assumes that it does not need special logic to handle this mode (e.g. because the old aggs and joins don't use reservation). Disable the --enable_partitioned_{aggregation,hash_join} options and remove all product and test code associated with them. Change-Id: I5ce2236d37c0ced188a4a81f7e00d4b8ac98e7e9 Reviewed-on: http://gerrit.cloudera.org:8080/7102 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-08-02 01:49:12 +00:00
Bikramjeet Vig	3059024bd8	IMPALA-4795: Allow fetching function obj from catalog using signature Fixed a bug where the catalog throws a NullPointerException if trying to fetch a function object using function signature. This exception prevented the code paths in CatalogOpExecutor::HandleDropFunction and ImpalaServer::CatalogUpdateCallback to be exercised which prevented removal of a recreated function object necessary for maintaining metadata consistency. Change-Id: I2cfad0213a79d39b77ad9aff701a93f93be4bf7f Reviewed-on: http://gerrit.cloudera.org:8080/7479 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-08-01 01:53:04 +00:00
Matthew Jacobs	7a1ff1e5e9	IMPALA-5539: Fix Kudu timestamp with -use_local_tz_for_unix_ts The -use_local_tz_for_unix_timestamp_conversion flag exists to specify if TIMESTAMPs should be interpreted as localtime or UTC when converting to/from Unix time via builtins: from_unixtime(bigint unixtime) unix_timestamp(string datetime[, ...]) unix_timestamp(timestamp datetime) However, the KuduScanner was calling into code that, when the gflag above was set, interpreted Unix times as local time. Unfortunately the write path (KuduTableSink) and some FE TIMESTAMP code (see KuduUtil.java) did not have this behavior, i.e. we were handling the gflag inconsistently. Tests: * Adds a custom cluster test to run Kudu test cases with -use_local_tz_for_unix_timestamp_conversion. * Adds tests for the new builtin unix_micros_to_utc_timestamp() which run in a custom cluster test (added test_local_tz_conversion.py) as well as in the regular tests (added to test_exprs.py). Change-Id: I423a810427353be76aa64442044133a9a22cdc9b Reviewed-on: http://gerrit.cloudera.org:8080/7311 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-07-19 22:17:13 +00:00
Tim Armstrong	c4d284f3cc	IMPALA-5483: Automatically disable codegen for small queries This is similar to the single-node execution optimisation, but applies to slightly larger queries that should run in a distributed manner but won't benefit from codegen. This adds a new query option disable_codegen_rows_threshold that defaults to 50,000. If fewer than this number of rows are processed by a plan node per impalad, the cost of codegen almost certainly outweighs the benefit. Using rows processed as a threshold is justified by a simple model that assumes the cost of codegen and execution per row for the same operation are proportional. E.g. if x is the complexity of the operation, n is the number of rows processed, C is a constant factor giving the cost of codegen and Ec/Ei are constant factors giving the cost of codegen'd and interpreted execution, then the cost of the codegen'd operator is C * x + Ec * x * n and the cost of the interpreted operator is Ei * x * n. Rearranging means that interpretation is cheaper if n < C / (Ei - Ec), i.e. that (at least with the simplified model) it makes sense to choose interpretation or codegen based on a constant threshold. The model also implies that it is somewhat safer to choose codegen because the additional cost of codegen is O(1) but the additional cost of interpretation is O(n). I ran some experiments with TPC-H Q1, varying the input table size, to determine the cut-over point at which codegen became beneficial. The cutover was around 150k rows per node for both text and parquet. At 50k rows per node disabling codegen was very beneficial - around 0.12s versus 0.24s. To be somewhat conservative I set the default threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the cutover tends to be higher because there are plan nodes that process many fewer than the max rows.
Fix a couple of minor issues in the frontend - the numNodes_ calculation could return 0 for Kudu, and the single node optimization didn't correctly handle the case of a scan node with conjuncts, a limit and missing stats (it considered the estimate still valid). Testing: Updated e2e tests that set disable_codegen to set disable_codegen_rows_threshold to 0, so that those tests run both with and without codegen still. Added an e2e test to make sure that the optimisation is applied in the backend. Added planner tests for various cases where codegen should and shouldn't be disabled. Perf: Added a targeted perf test for a join+agg over a small input, which benefits from this change. Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e Reviewed-on: http://gerrit.cloudera.org:8080/7153 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-29 21:14:59 +00:00
Joe McDonnell	53287df0a1	IMPALA-5488: Fix handling of exclusive HDFS file handles This change fixes three issues: 1. File handle caching is expected to be disabled for remote files (using exclusive HDFS file handles), however the file handles are still being cached. 2. The retry logic for exclusive file handles is broken, leading the number of open files to be incorrect. 3. There is no test coverage for disabling the file handle cache. To fix issue #1, when a scan range is requesting an exclusive file handle from the cache, it will always request a newly opened file handle. It also will destroy the file handle when the scan range is closed. To fix issue #2, exclusive file handles will no longer retry IOs. Since the exclusive file handle is always a fresh file handle, it will never have a bad file handle from the cache. This returns the logic to its state before IMPALA-4623 in these cases. If a file handle is borrowed from the cache, then the code will continue to retry once with a fresh handle. To fix issue #3, custom_cluster/test_hdfs_fd_caching.py now does both positive and negative tests for the file handle cache. It verifies that setting max_cached_file_handles to zero disables caching. It also verifies that caching is disabled on remote files. (This change will resolve IMPALA-5390.) Change-Id: I4c03696984285cc9ce463edd969c5149cd83a861 Reviewed-on: http://gerrit.cloudera.org:8080/7181 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-21 09:42:34 +00:00
David Knupp	adbb0b7f81	IMPALA-5413: Add a hive user for test_seq_writer_hive_compatibility. This patch includes a change to the framework to permit the passing of a username to the run_stmt_in_hive() method in the ImpalaTestSuite class, but retains the same default value as before. This is to allow a test to issue a 'select count(*) from foo' query through hive. Hive needs to set up a job to perform this query, and HDFS write access to do so. In typical cases, the HDFS user is 'hdfs'; however, it may be necessary to change this depending on the cluster. On a local mini-cluster, the username appears to be irrelevant, so this won't affect locally run tests. Tested by running the core set of tests on a local minicluster to make sure there were no regressions. Also confirmed that the test in question now passes on a remote physical cluster. Change-Id: I1cc8824800e4339874b9c4e3a84969baf848d941 Reviewed-on: http://gerrit.cloudera.org:8080/7046 Reviewed-by: David Knupp <dknupp@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-10 02:26:13 +00:00
Joe McDonnell	5f9f704bde	IMPALA-5386: Fix ReopenCachedHdfsFileHandle failure case This fixes three issues with the file handle cache. The first issue is that ReopenCachedHdfsFileHandle can destroy the passed-in file handle without removing the reference to it. The old file handle then refers to a piece of memory that is not a handle in the cache, so future use of the handle fails with an assert. The fix is to always overwrite the reference to the file handle when it has been destroyed. The second issue is that query_test/test_hdfs_fd_caching.py should run on anything that supports the hdfs commandline and tolerate query failure. Its logic is not specific to file handle caching, so it has been renamed to query_test/test_hdfs_file_mods.py. Finally, custom_cluster/test_hdfs_fd_caching.py should not be running on remote files (S3, ADLS, Isilon, remote clusters). The file handle cache semantics won't apply on those platforms. Change-Id: Iee982fa5e964f6c8969b2eb7e5f3eca89e793b3a Reviewed-on: http://gerrit.cloudera.org:8080/7020 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-09 01:45:37 +00:00
Taras Bobrovytsky	6604083f51	IMPALA-5355: Fix the order of Sentry roles and privileges After a single Impalad is restarted, it is possible that the order in which it receives roles and privileges from the statestore is incorrect. The correct order is for the role to appear first in the update, before the privilege that references it. If a user updates a role, its catalog version number can become larger than the catalog numbers of the privileges that reference it. This causes the role to come after the privilege in the initial metastore update. The issue is fixed by doing two passes over the catalog objects in the Impalad. The first pass updates the top level objects. The second pass updates the dependent objects. Testing: - Added a test that reproduced the problem. Change-Id: I7072e95b74952ce5a51ea1b6e2ae3e80fb0940e0 Reviewed-on: http://gerrit.cloudera.org:8080/7004 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-07 00:30:46 +00:00
Jim Apple	07a7138817	Add a script to test performance on a developer machine This is a migration from an old and broken script from another repository. Example use: bin/single_node_perf_run.py --ninja --workloads targeted-perf \ --load --scale 4 --iterations 20 --num_impalads 3 \ --start_minicluster --query_names PERF_AGG-Q3 \ $(git rev-parse HEAD~1) $(git rev-parse HEAD) The script can load data, run benchmarks, and compare the statistics of those runs for significant differences in performance. It glues together buildall.sh, bin/load-data.py, bin/run-workload.py, and tests/benchmark/report_benchmark_results.py. Change-Id: I70ba7f3c28f612a370915615600bf8dcebcedbc9 Reviewed-on: http://gerrit.cloudera.org:8080/6818 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-05-31 08:10:48 +00:00
Sailesh Mukil	50bd015f2d	IMPALA-5333: Add support for Impala to work with ADLS This patch leverages the AdlFileSystem in Hadoop to allow Impala to talk to the Azure Data Lake Store. This patch has functional changes as well as adds test infrastructure for testing Impala over ADLS. We do not support ACLs on ADLS since the Hadoop ADLS connector does not integrate ADLS ACLs with Hadoop users/groups. For testing, we use the azure-data-lake-store-python client from Microsoft. This client seems to have some consistency issues. For example, a drop table through Impala will delete the files in ADLS, however, listing that directory through the python client immediately after the drop, will still show the files. This behavior is unexpected since ADLS claims to be strongly consistent. Some tests have been skipped due to this limitation with the tag SkipIfADLS.slow_client. Tracked by IMPALA-5335. The azure-data-lake-store-python client also only works on CentOS 6.6 and over, so the python dependencies for Azure will not be downloaded when the TARGET_FILESYSTEM is not "adls". While running ADLS tests, the expectation will be that it runs on a machine that is at least running CentOS 6.6. Note: This is only a test limitation, not a functional one. Clusters with older OSes like CentOS 6.4 will still work with ADLS. Added another dependency to bootstrap_build.sh for the ADLS Python client. Testing: Ran core tests with and without TARGET_FILESYSTEM as 'adls' to make sure that all tests pass and that nothing breaks. Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542 Reviewed-on: http://gerrit.cloudera.org:8080/6910 Tested-by: Impala Public Jenkins Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>	2017-05-25 19:35:24 +00:00
Attila Jeges	21f9063304	Revert "IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet" Reverting IMPALA-2716 as SparkSQL does not agree with the approach taken. More details can be found at: https://issues.apache.org/jira/browse/SPARK-12297 Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd Reviewed-on: http://gerrit.cloudera.org:8080/6896 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-23 01:46:22 +00:00
Michael Ho	249632b308	IMPALA-5197: Erroneous corrupted Parquet file message The Parquet file column reader may fail in the middle of producing a scratch tuple batch for various reasons such as exceeding memory limit or cancellation. In which case, the scratch tuple batch may not have materialized all the rows in a row group. We shouldn't erroneously report that the file is corrupted in this case as the column reader didn't completely read the entire row group. A new test case is added to verify that we won't see this error message. A new failpoint phase GETNEXT_SCANNER is also added to differentiate it from the GETNEXT in the scan node itself. Change-Id: I9138039ec60fbe9deff250b8772036e40e42e1f6 Reviewed-on: http://gerrit.cloudera.org:8080/6787 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-09 09:27:39 +00:00
Marcel Kornacker	368115cdae	IMPALA-2550: Switch to per-query exec rpc Coordinator: - FragmentInstanceState -> BackendState, which in turn records FragmentInstanceStats QueryState - does query-wide setup in a separate thread (which also launches the instance exec threads) - has a query-wide 'prepared' state at which point all static setup is done and all FragmentInstanceStates are accessible Also renamed QueryExecState to ClientRequestState. Simplified handling of execution status (in FragmentInstanceState): - status only transmitted via ReportExecStatus rpc - in particular, it's not returned anymore from the Cancel rpc FIS: Fixed bugs related to partially-prepared state (in Close() and ReleaseThreadToken()) Change-Id: I20769e420711737b6b385c744cef4851cee3facd Reviewed-on: http://gerrit.cloudera.org:8080/6535 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-09 04:04:50 +00:00
Lars Volker	12f3ecceab	IMPALA-5287: Test skip.header.line.count on gzip This change fixed IMPALA-4873 by adding the capability to supply a dict 'test_file_vars' to run_test_case(). Keys in this dict will be replaced with their values inside test queries before they are executed. Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70 Reviewed-on: http://gerrit.cloudera.org:8080/6817 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-09 01:36:46 +00:00
Attila Jeges	5803a0b074	IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet Before this change: Hive adjusts timestamps by subtracting the local time zone's offset from all values when writing data to Parquet files. Hive is internally inconsistent because it behaves differently for other file formats. As a result of this adjustment, Impala may read "incorrect" timestamp values from Parquet files written by Hive. After this change: Impala reads Parquet MR timestamp data and adjusts values using a time zone from a table property (parquet.mr.int96.write.zone), if set, and will not adjust it if the property is absent. No adjustment will be applied to data written by Impala. New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE LIKE <file> will set the table property to UTC if the global flag --set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true. HDFS tables created by Impala using CREATE TABLE LIKE <other table> will copy the property of the table that is copied. This change also affects the way Impala deals with --convert_legacy_hive_parquet_utc_timestamps global flag (introduced in IMPALA-1658). The flag will be taken into account only if parquet.mr.int96.write.zone table property is not set and ignored otherwise. Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6 Reviewed-on: http://gerrit.cloudera.org:8080/5939 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-02 20:24:08 +00:00
Dimitris Tsirogiannis	e2c53a8bdf	IMPALA-5147: Add the ability to exclude hosts from query execution This commit introduces a new startup option, termed 'is_executor', that determines whether an impalad process can execute query fragments. The 'is_executor' option determines if a specific host will be included in the scheduler's backend configuration and hence included in scheduling decisions. Testing: - Added a custom cluster test. - Added a new scheduler test. Change-Id: I5d2ff7f341c9d2b0649e4d14561077e166ad7c4d Reviewed-on: http://gerrit.cloudera.org:8080/6628 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2017-04-26 01:45:40 +00:00
Sailesh Mukil	edcc593ee5	IMPALA-5244 test_hdfs_file_open_fail fails on local filesystem build This test had to be skipped for non HDFS filesystems. Change-Id: I5318a5eb27b15fed5df770b9c3ea23e7e1a97a4c Reviewed-on: http://gerrit.cloudera.org:8080/6723 Reviewed-by: Michael Ho <kwho@cloudera.com> Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Impala Public Jenkins	2017-04-25 10:50:18 +00:00
Tim Armstrong	fb2a78567a	IMPALA-5231: skip test_explain_level on non-HDFS systems Some details of the plans change if we're not running against a 3-node minicluster. The point of these tests is to avoid unintended changes to the explain format, so we don't need to run it against all FSes. Change-Id: I604f83695e956ef6bc85b5d1bc754ccb1378eda1 Reviewed-on: http://gerrit.cloudera.org:8080/6703 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-04-20 22:50:52 +00:00
Matthew Jacobs	532c5f2605	IMPALA-5079: Flaky Kudu tests; fix HS2 connection timeouts Fixes the HS2 timeouts for _all_ Kudu EE tests. Previously only 2 classes had the timeout set, but all the Kudu tests appear to be susceptible to this issue. Change-Id: Ibc48b4b7ae65ddf4bba087d079d4e4032f4d5f0f Reviewed-on: http://gerrit.cloudera.org:8080/6616 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-04-13 09:06:50 +00:00
Dimitris Tsirogiannis	296df3c826	IMPALA-4041: Limit catalog and admission control updates to coordinators With this commit we add the ability to limit catalog updates to a limited set of coordinator nodes. A new startup option, termed 'is_coordinator', is added to indicate if a node is a coordinator. Coordinators accept connections through HS2 and Beeswax interfaces and can also participate in query execution. Non-coordinator nodes do not receive catalog updates from the statestore, do not initialize a query scheduler and cannot accept Beeswax and HS2 client connections. Testing: - Added a custom cluster test that launches a cluster in which the number of coordinators is less than the cluster size and runs a number of smoke queries. - Successfully ran exhaustive tests. Change-Id: I5f2c74abdbcd60ac050efa323616bd41182ceff3 Reviewed-on: http://gerrit.cloudera.org:8080/6344 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2017-03-28 22:27:25 +00:00
Joe McDonnell	6441ca65bd	IMPALA-5039: Fix variability in parquet dictionary filtering test The tests for dictionary filtering look at how many row groups are processed and how many are filtered by matching text in the profile. However, the number of row groups processed and filtered by any individual fragment depends on how the work is split and how many impalads are running. This causes variability in the test output. To fix this, the test needs a way to aggregate the results across fragments. This fix introduces the following syntax for specifying these aggregates: aggregate(function_name, field_name): expected_value This searches the runtime profile for lines that contain 'field_name: number'. It skips the averaged fragment, as this is derived from all the other fragments. Currently, only SUM is implemented, and the expected_value is required to be an integer. It should be easy to implement other interesting functions like COUNT and MIN/MAX. It would also be possible to extend it to floats. Switching the dictionary filtering tests over to this new syntax eliminates the variability in the tests. Change-Id: I6b7b84d973b3ac678a24e82900f2637d569158bb Reviewed-on: http://gerrit.cloudera.org:8080/6301 Tested-by: Impala Public Jenkins Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2017-03-13 17:37:15 +00:00
Matthew Jacobs	815c76f9cb	IMPALA-4828: Alter Kudu schema outside Impala may crash on read Creating a table in Impala, changing the column schema outside of Impala, and then reading again in Impala may result in a crash. Impala may attempt to dereference pointers that aren't there. This happens if a string column is dropped and then a new, non string column is added with the old string column's name. The Kudu scan token contains the projection schema, and that is validated when opening the Kudu scanner (with the exception of KUDU-1881), but the issue is that during planning, Impala assumes the types/nullability of columns haven't changed when creating the scan tokens. This is fixed by adding a check when creating the scan token, and failing the query if the column types changed. Impala then relies on the Kudu client to properly validate that the underlying schema is still represented by the scan token, and that deserialization will fail if it no longer matches. Test cases were added for this particular crash scenario, which now fails during planning as expected. This does not attempt to validate the Kudu client validation at deserialization time, though that would be valuable coverage to add in the future. Columns being removed don't produce a crash; the query fails gracefully. A test was added for this case. Columns being added should not affect this scenario, but a test was added anyway. Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a Reviewed-on: http://gerrit.cloudera.org:8080/5840 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-17 23:09:39 +00:00
David Knupp	894bb77855	IMPALA-4839: Remove implicit 'localhost' for KUDU_MASTER_HOSTS The Kudu query tests were failing on a remote cluster because the Kudu master was always set to '127.0.0.1', with no way to override it. This patch corrects the issue with a number of changes: - Add a pytest command line option to specify an arbitrary Kudu master - Consolidate the place where the default Kudu master is derived. It had been stored both in the env and in tests/common/__init__.py, with different files looking to different places. For now, just look to the env, and remove the value from __init__.py. - The kudu_client test fixture in conftest.py was using the connect() method from impala.dbapi (part of the Impyla library), without specifying the host param. In the absence of that, the default value is 'localhost', so add the host param to the connect() call. - Define the various defaults for pytest config as constants at the top of conftest.py. Change-Id: I9df71480a165f4ce21ae3edab6ce7227fbf76f77 Reviewed-on: http://gerrit.cloudera.org:8080/5877 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-14 21:51:39 +00:00
Thomas Tauber-Marshall	82290d61ad	IMPALA-4895: Memory limit exceeded in test_outer_joins A recent change (IMPALA-3524) removed a 'CATCH' section for a mem limit exceeded error because the other changes in the patch reduced the memory requirements for that particular query and the error was no longer being hit. This seemed okay because the point of the test wasn't to trigger the mem limit exceeded error, and I manually verified that the situation the test was addressing was still covered even without the error being hit. It turns out, though, that the test still hits the error in some situations (local-filesystem and non-partitioned-aggs-and-joins builds). The fix is to make the test more permissive by adding '__NO_ERROR_' as one of the options in the 'CATCH: ANY_OF' section, so that it passes whether or not the mem limit is exceeded. Change-Id: I4731a3e83dd2142a1d83be963f83cd1847472295 Reviewed-on: http://gerrit.cloudera.org:8080/5941 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-09 00:50:15 +00:00
Attila Jeges	c452595bff	IMPALA-1670,IMPALA-4141: Support multiple partitions in ALTER TABLE ADD PARTITION Just like Hive, Impala should support multiple partitions in ALTER TABLE ADD PARTITION statements. The syntax is as follows: ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec1 [location_spec1] [cache_spec1] PARTITION partition_spec2 [location_spec2] [cache_spec2] ... Grammar was modified to handle the new syntax. Introduced PartitionDef class to capture the repeatable part of the statement. TPartitionDef is the name of the corresponding thrift class. AlterTableAddPartitionStmt and CatalogOpExecutor classes were also modified to work with a list of partitions. Duplicate partition specs are rejected in AlterTableAddPartitionStmt.analyze(). Added FE, E2E and integration tests. Change-Id: Iddbc951f2931f488f7048c9780260f6b49100750 Reviewed-on: http://gerrit.cloudera.org:8080/4144 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-04 01:47:23 +00:00
David Knupp	f590bc0da6	IMPALA-4750: Rename test infra classes so they don't mimic test classes. This patch addresses warning messages from pytest re: the imported TestMatrix, TestVector, and TestDimension classes, which were being collected as potential test classes. The fix was to simply prepend the class names with Impala- git grep -l 'TestDimension' \| xargs \ sed -i 's/TestDimension/ImpalaTestDimension/g' git grep -l 'TestMatrix' \| xargs \ sed -i 's/TestMatrix/ImpalaTestMatrix/g' git grep -l 'TestVector' \| xargs \ sed -i 's/TestVector/ImpalaTestVector/g' The tests all passed in an exhaustive run on the upstream jenkins server: http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/ Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1 Reviewed-on: http://gerrit.cloudera.org:8080/5794 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-01-26 23:40:22 +00:00
Lars Volker	8b7f876649	IMPALA-4722: Disable log caching in test_scratch_disk test_scratch_disk fails sporadically when trying to assert the presence of log messages. This is probably caused by log caching, since after such failures the log files do contain the lines in question. I manually tested this by running the tests repeatedly for 2 days (10k runs). To make future diagnosis of similar problems easier, this change also adds more output to assert_impalad_log_contains(). Change-Id: I9f21284338ee7b4374aca249b6556282b0148389 Reviewed-on: http://gerrit.cloudera.org:8080/5669 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-01-12 18:58:48 +00:00
Lars Volker	25ebf586e0	IMPALA-4689: Fix computation of last active time The last active time in impala-server.cc#L1806 is in milliseconds, but the TimestampValue c'tor expects seconds. This change also renames some variables to make their meaning more explicit, aiming to prevent similar bugs in the future. This change also fixes a bug that occurred when during startup of the local minicluster the operating system PIDs would wrap around. This way the first impalad would not be the one with the smallest PID and ImpalaCluster.get_first_impalad() would return the wrong one. I ran git-clang-format on the change. Change-Id: I283564c8d8e145d44d9493f4201555d3a1087edf Reviewed-on: http://gerrit.cloudera.org:8080/5546 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2017-01-04 12:12:04 +00:00
Tim Armstrong	88448d1d4a	IMPALA-4586: don't constant fold in backend This patch ensures that setting the query option enable_expr_rewrites=false will disable both constant folding in the frontend (which it did already) and constant caching in the backend (which is enabled in this patch). This gives a way for users to revert to the old behaviour of non-deterministic UDFs before these optimisations were added in Impala 2.8. Before this patch, the backend would cache values based on IsConstant(). This meant that there was no way to override caching of values of non-deterministic UDFs, e.g. with enable_expr_rewrites. After this patch, we only cache literal values in the backend. This offers the same performance as before in the common case where the frontend will constant fold the expressions anyway. Also rename some functions to more cleanly separate the backend concepts of "constant" expressions and expressions that can be evaluated without a TupleRow. In a future change (IMPALA-4617) we should remove the IsConstant() analysis logic from the backend entirely and pass the information from the frontend. We should also fix isConstant() in the frontend so that it only returns true when it is safe to constant-fold the expression (IMPALA-4606). Once that is done, we could revert back to using IsConstant() instead of IsLiteral(). Testing: Added targeted test to test constant folding of UDFs: we expect different results depending on whether constant folding is enabled. Also run TestUdfs with expr rewrites enabled and disabled, since this can exercise different code paths. Refactored test_udfs somewhat to avoid running uninteresting combinations of query options for targeted tests and removed some 'drop * if not exists' statements that aren't necessary when using unique_database. This change revealed flakiness in test_mem_limit, which seems to have only worked by coincidence. Updated TrackAllocation() to actually set the query status when a memory limit is exceeded. 
Looped this test for a while to make sure it isn't flaky any more. Also fix other test bugs where the vector argument is modified in-place, which can leak out to other tests. Change-Id: I0c76e3c8a8d92749256c312080ecd7aac5d99ce7 Reviewed-on: http://gerrit.cloudera.org:8080/5391 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2016-12-08 04:53:53 +00:00
Taras Bobrovytsky	1083639ff2	IMPALA-4585: Allow the $DATABASE template in the CATCH section In a recent change (IMPALA-4363) we introduced a change where all file paths in .test files should be replaced with '__HDFS_FILENAME__'. This caused problems for tests on non-HDFS file systems and we also lost some test coverage. This patch fixes the problem by allowing the $DATABASE template in the catch section of the .test file. Change-Id: If0f6ae8dea7ac4cdaf0c61ebd8f0c589c353a96e Reviewed-on: http://gerrit.cloudera.org:8080/5372 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2016-12-08 02:20:50 +00:00
Taras Bobrovytsky	858f5c2197	IMPALA-4363: Add Parquet timestamp validation Before this patch, we would simply read the INT96 Parquet timestamp representation and assume that it's valid. However, not all bit permutations represent a valid timestamp. One of the boost functions raised an exception (that we didn't catch) when passed an invalid boost date object, which resulted in a crash. This patch fixes the problem by validating that the date falls into the 1400..9999 year range as we are scanning Parquet. Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac Reviewed-on: http://gerrit.cloudera.org:8080/5343 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-12-03 06:41:07 +00:00
Michael Ho	a41918d443	Fix E2E test infrastructure to handle missing exceptions correctly This change fixes a bug in the E2E infrastructure that handles the case when an expected exception wasn't thrown. The code was expecting test_section['CATCH'] to be a string, but in reality it's a list of strings. It also clarifies the error message about the missing exception. This change also enforces that the CATCH subsection in tests cannot be empty. Change-Id: I7d83c5db59e8a239e4e70694a1e625af6f21419c Reviewed-on: http://gerrit.cloudera.org:8080/5260 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-12-01 23:43:03 +00:00
Thomas Tauber-Marshall	3833707dbd	IMPALA-4466: Improve Kudu CRUD test coverage The results in the test files were verified by hand. This patch also introduces a new test section 'DML_RESULTS', which takes the name of a table as a comment and the contents of the table as its body and then verifies that the body matches the actual contents of the table. This makes it easy to check that a DML operation has the desired effect on the contents of a table, rather than always having to add another test case that runs a select on the table. For now, this section cannot be used in a test along with the RESULTS or ERRORS sections. TODO: Refactor the DML test case handling (IMPALA-4471) Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6 Reviewed-on: http://gerrit.cloudera.org:8080/4953 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-17 02:54:30 +00:00
Thomas Tauber-Marshall	e6e2baea33	IMPALA-4372: 'Describe formatted' returns types in upper case A recent change caused 'describe formatted' to display the types in all upper case, but we want 'describe formatted' to match Hive's 'describe' output, which displays the types in lower case. This patch also fixes several problems with test_describe_formatted, which was encountering an error but reporting success. Change-Id: I274b97d4d1247244247fb38a5ca7f4c10bba8d22 Reviewed-on: http://gerrit.cloudera.org:8080/4861 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-15 05:38:12 +00:00
Tim Armstrong	d7246d64c7	IMPALA-1430,IMPALA-4108: codegen all builtin aggregate functions This change enables codegen for all builtin aggregate functions, e.g. timestamp functions and group_concat. There are several parts to the change: * Adding support for generic UDAs. Previously the codegen code did not handle multiple input arguments or NULL return values. * Defaulting to using the UDA interface when there is not a special codegen path (we have implementations of all builtin aggregate functions for the interpreted path). * Remove all the logic to disable codegen for the special cases that now are supported. Also fix the generation of code to get/set NULL bits since I needed to add functionality there anyway. Testing: Add tests that check that codegen was enabled for builtin aggregate functions. Also fix some gaps in the preexisting tests. Also add tests for UDAs that check input/output nulls are handled correctly, in anticipation of enabling codegen for arbitrary UDAs. The tests are run with both codegen enabled and disabled. To avoid flaky tests, we switch the UDF tests to use "unique_database". Perf: Ran local TPC-H and targeted perf. Spent a lot of time on TPC-H Q1, since my original approach regressed it ~5%. In the end the problem was to do with the ordering of loads/stores to the slot and null bit in the generated code: the previous version of the code exploited some properties of the particular aggregate function. I ended up replicating this behaviour to avoid regressing perf. Change-Id: Id9dc21d1d676505d3617e1e4f37557397c4fb260 Reviewed-on: http://gerrit.cloudera.org:8080/4655 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 03:27:12 +00:00
Thomas Tauber-Marshall	3be4b3efd0	IMPALA-1169: Admission control info on the queries debug webpage This patch adds a new event, 'Queued', to the query event log to indicate when a query is queued by the admission controller. This means that queries on the '/queries' page that are currently queued will display this as their 'Last Event', making it possible to see which queries are currently queued. It also adds a column to show the resource pool associated with the queries, and it updates the wording of the first event that gets marked for each query from 'Start execution' to 'Query submitted', since this is before planning and admission control and therefore execution hasn't actually started yet. Change-Id: I504e3c829a14318721e3a42de6281bcc578f7283 Reviewed-on: http://gerrit.cloudera.org:8080/4756 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-07 23:26:02 +00:00
Matthew Jacobs	50f7753d2b	IMPALA-3771: Expose kudu client timeout and set default The Kudu client timeout was too low for Impala usage. This sets the default timeout to 3 minutes and exposes it as a gflag. New timeout tests were added. Change-Id: Iad95e8e38aad4f76d21bac6879db6c02b3c3e045 Reviewed-on: http://gerrit.cloudera.org:8080/4849 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-05 06:43:45 +00:00
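
The IMPALA-5039 entry above describes an aggregate(function_name, field_name): expected_value check that sums 'field_name: number' lines across a runtime profile while skipping the averaged fragment. A minimal Python sketch of that idea follows. It is not the code from the patch: the function name sum_profile_field, the section markers "Averaged Fragment" and "Fragment", and the sample profile layout are all assumptions made for illustration.

```python
import re


def sum_profile_field(profile_text, field_name,
                      averaged_marker="Averaged Fragment",
                      fragment_marker="Fragment"):
    """Sum integer values of 'field_name: <int>' lines in a profile.

    Hypothetical helper: the two marker strings are assumed section
    headers, not something confirmed by the commit message.
    """
    pattern = re.compile(r"\b" + re.escape(field_name) + r": (\d+)")
    total = 0
    in_averaged = False
    for line in profile_text.splitlines():
        stripped = line.strip()
        # Entering the averaged section: ignore its counters, since they
        # are derived from the per-fragment values.
        if stripped.startswith(averaged_marker):
            in_averaged = True
            continue
        # Any other fragment header ends the averaged section.
        if stripped.startswith(fragment_marker):
            in_averaged = False
            continue
        if in_averaged:
            continue
        match = pattern.search(stripped)
        if match:
            total += int(match.group(1))
    return total


# Made-up profile text, used only to exercise the helper.
sample_profile = """Averaged Fragment F00:
  NumRowGroups: 2
Fragment F00:
  Instance a:
    NumRowGroups: 1
  Instance b:
    NumRowGroups: 3
"""

print(sum_profile_field(sample_profile, "NumRowGroups"))
```

Summing over instances rather than matching per-fragment values is what makes the expected total independent of how the work is split across impalads, which is the source of variability the commit message describes.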

263 Commits