impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 18:02:33 -05:00

Author	SHA1	Message	Date
Tim Armstrong	3ebf30a2a4	IMPALA-6847: work around high memory estimates for AC Adds the MAX_MEM_ESTIMATE_FOR_ADMISSION query option, which takes effect if and only if * Memory-based admission control is enabled for the pool * No mem_limit is set (i.e. best practices are not being followed) In that case min(MAX_MEM_ESTIMATE_FOR_ADMISSION, mem_estimate) is used for admission control instead of mem_estimate. This provides a way to override the planner's estimate if it happens to be incorrect and is preventing the query from running. Setting MEM_LIMIT is usually a better alternative but sometimes it is not feasible to set MEM_LIMIT for each individual query. Testing: Added an admission control test to verify that the query option allows queries with high estimates to run. Also tested manually on a minicluster started with: start-impala-cluster.py --impalad_args='-vmodule admission-controller=3 \ -default_pool_mem_limit 12884901888' Change-Id: Ia5fc32a507ad0f00f564dfe4f954a829ac55d14e Reviewed-on: http://gerrit.cloudera.org:8080/10058 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-18 01:18:20 +00:00
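The rule in IMPALA-6847 can be sketched roughly as below. This is an illustration only, not Impala's implementation: the function name and parameters are hypothetical, and treating a non-positive option value as "unset" is an assumption. What admission control does when mem_limit is set is outside the scope of the sketch.

```python
def estimate_for_admission(mem_estimate, max_mem_estimate_for_admission,
                           memory_based_ac_enabled, mem_limit_set):
    """Return the memory estimate that admission control would use.

    Hypothetical helper: all names here are illustrative.
    """
    # The option only applies when memory-based admission control is
    # enabled for the pool and the query has no mem_limit.
    option_applies = (
        memory_based_ac_enabled
        and not mem_limit_set
        and max_mem_estimate_for_admission > 0  # assumption: <= 0 means unset
    )
    if option_applies:
        # Cap the planner's estimate; never raise it.
        return min(max_mem_estimate_for_admission, mem_estimate)
    return mem_estimate
```

With a planner estimate of 20 GB and the option set to 12 GB, the sketch returns 12 GB. An estimate already below the cap is left unchanged.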
Fredy Wijaya	51cf5b27fc	IMPALA-6850: Print actual error message on Sentry error The patch puts the output of Sentry to $IMPALA_CLUSTER_LOGS_DIR/sentry/sentry.out to follow the same convention as other service output logs. Testing: - Injected some failure in run-sentry-service.sh script to see if the error message was captured Change-Id: I76627bb5b986a548ec6e4f12b555bd6fc8c4dab8 Reviewed-on: http://gerrit.cloudera.org:8080/10064 Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com> Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-14 01:41:38 +00:00
Joe McDonnell	d481cd4842	IMPALA-6372: Go parallel for Hive dataload This changes generate-schema-statements.py to produce separate SQL files for different file formats for Hive. This changes load-data.py to go parallel on these separate Hive SQL files. For correctness, the text version of all tables must be loaded before any of the other file formats. load-data.py runs DDLs to create the tables in Impala and goes parallel. Currently, there are some minor dependencies so that text tables must be created prior to creating the other table formats. This changes the definitions of some tables in testdata/datasets/functional/functional_schema_template.sql to remove these dependencies. Now, the DDLs for the text tables can run in parallel to the other file formats. To unify the parallelism for Impala and Hive, load-data.py now uses a single fixed-size pool of processes to run all SQL files rather than spawning a thread per SQL file. This also modifies the locations that do invalidate to use refresh where possible and eliminate global invalidates. For debuggability, different SQL executions output to different log files rather than to standard out. If an error occurs, this will point out the relevant log file. This saves about 10-15 minutes on dataload (including for GVO). Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd Reviewed-on: http://gerrit.cloudera.org:8080/8894 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-14 00:16:26 +00:00
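The scheduling idea in IMPALA-6372 (one fixed-size pool, with the text-format files finishing before any other format starts) can be sketched as follows. This is not the code in load-data.py: `run_sql_file` and `load_all` are hypothetical names, and a thread pool stands in for the process pool the commit describes to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def run_sql_file(path):
    # Placeholder: a real loader would execute the SQL file and send its
    # output to a per-file log rather than to standard out.
    return path


def load_all(text_files, other_files, pool_size=4, runner=run_sql_file):
    """Run SQL files through one fixed-size pool, text format first."""
    finished = []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Phase 1: consuming the map() results blocks until every
        # text-format file has completed.
        finished.extend(pool.map(runner, text_files))
        # Phase 2: the remaining formats share the same pool.
        finished.extend(pool.map(runner, other_files))
    return finished
```

Using one pool for both phases keeps the degree of parallelism bounded regardless of how many SQL files exist, which is the stated motivation for replacing the thread-per-file approach.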
Tianyi Wang	9a751f00b8	IMPALA-6822: Add a query option to control shuffling by distinct exprs IMPALA-4794 changed the distinct aggregation behavior to shuffling by both grouping exprs and the distinct expr. It's slower in queries where the NDVs of grouping exprs are high and data are uniformly distributed among groups. This patch adds a query option controlling this behavior, letting users switch to the old plan. Change-Id: Icb4b4576fb29edd62cf4b4ba0719c0e0a2a5a8dc Reviewed-on: http://gerrit.cloudera.org:8080/9949 Reviewed-by: Tianyi Wang <twang@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-12 22:01:35 +00:00
stiga-huang	818cd8fa27	IMPALA-5717: Support for reading ORC data files This patch integrates the orc library into Impala and implements HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner supplies input needed from the orc-reader, tracks memory consumption of the reader and transfers the reader's output (orc::ColumnVectorBatch) into impala::RowBatch. The ORC version we used is release-1.4.3. A startup option --enable_orc_scanner is added for this feature. It's set to true by default. Setting it to false will fail queries on ORC tables. Currently, we only support reading primitive types. Writing into ORC tables is not supported either. Tests - Most of the end-to-end tests can run on ORC format. - Add tpcds, tpch tests for ORC. - Add some ORC specific tests. - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library is not robust for corrupt files (ORC-315). Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4 Reviewed-on: http://gerrit.cloudera.org:8080/9134 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-11 05:13:02 +00:00
Zoltan Borok-Nagy	2ee914d5b3	IMPALA-5903: Inconsistent specification of result set and result set metadata Before this commit it was quite random which DDL operations returned a result set and which didn't. With this commit, every DDL operation returns a summary of its execution. They declare their result set schema in Frontend.java, and provide the summary in CatalogOpExecutor.java. Updated the tests according to the new behavior. Change-Id: Ic542fb8e49e850052416ac663ee329ee3974e3b9 Reviewed-on: http://gerrit.cloudera.org:8080/9090 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-11 02:21:48 +00:00
Philip Zeyliger	6dc13d933b	Remove Yarn from minicluster by default. (2nd try) Remove Yarn from minicluster by default. Turns out that we start Yarn as part of the minicluster, but we never use it. (HiveServer2 is configured to run MR jobs "locally" in process.) Likely, this Yarn integration is a vestige of Yarn/Llama integration. We can save memory by not starting it by default. There are some less-common tools like tests/comparison/cluster.py which use Yarn (and Hadoop Streaming). In deference to those tools, I've left a mechanism to start Yarn rather than excising it altogether. After running buildall the regular way, add Yarn to the cluster by running: testdata/cluster/admin -y start_cluster I tested by running core tests. I did not test the kerberized minicluster. [Due to a git mishap, a version of this was previously checked in and reverted.] Change-Id: I97053a44bbe32048e6c35cc28680d1c7696af13f Reviewed-on: http://gerrit.cloudera.org:8080/9970 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-10 20:15:30 +00:00
Philip Zeyliger	01b6995abf	Revert "Remove Yarn from minicluster by default." This reverts commit c05df104570fa2cb7067599bbe3b87740ca9f09e. Change-Id: I00151795581d22a9852cceaca1d21325d68dbe59 Reviewed-on: http://gerrit.cloudera.org:8080/9969 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Philip Zeyliger <philip@cloudera.com>	2018-04-10 16:21:09 +00:00
Philip Zeyliger	942781d80f	Remove Yarn from minicluster by default. Turns out that we start Yarn as part of the minicluster, but we never use it. (HiveServer2 is configured to run MR jobs "locally" in process.) Likely, this Yarn integration is a vestige of Yarn/Llama integration. We can save memory by not starting it by default. There are some less-common tools like tests/comparison/cluster.py which use Yarn (and Hadoop Streaming). In deference to those tools, I've left a mechanism to start Yarn rather than excising it altogether. After running buildall the regular way, add Yarn to the cluster by running: testdata/cluster/admin -y start_cluster I tested by running core tests. I did not test the kerberized minicluster. Change-Id: I5504cc40b89e3c6d53fac0b7aa4b395fa63e8d79	2018-04-10 09:17:28 -07:00
Tim Armstrong	2995be8238	IMPALA-5607: part 1: breaking extract/date_part changes This is the compatibility-breaking part of Jinchul Kim's change to add additional units. To support nanoseconds we need to widen the output type of these functions. We also change the meaning of "milliseconds" to include the seconds component. Cherry-picks: not for 2.x Change-Id: I42d83712d9bb3a4900bec38a9c009dcf2a1fe019 Reviewed-on: http://gerrit.cloudera.org:8080/9957 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-10 04:00:37 +00:00
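One reading of the "milliseconds" change in IMPALA-5607 is sketched below. It assumes the old unit returned only the sub-second part and the new one folds in the seconds component. The function names are hypothetical and the code only illustrates the arithmetic; it is not Impala's extract/date_part implementation.

```python
from datetime import datetime


def millisecond_without_seconds(ts: datetime) -> int:
    # Assumed old meaning: only the fractional-second part, 0..999.
    return ts.microsecond // 1000


def millisecond_with_seconds(ts: datetime) -> int:
    # Assumed new meaning: seconds component included, 0..59999.
    return ts.second * 1000 + ts.microsecond // 1000
```

For a timestamp with 37 seconds and 123 milliseconds, the two readings give 123 and 37123 respectively.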
Thomas Tauber-Marshall	d437f956ca	IMPALA-6338: Disable flaky bloom filter test The underlying issue in IMPALA-6338 causes successful queries that are cancelled internally due to all results having been returned to, in rare cases, have info missing from the profile. This has caused flaky tests but has low impact on users, and unfortunately with the current query lifecycle logic in the coordinator, there is no simple solution. There is ongoing work to improve query lifecycle logic in the coordinator holistically, see IMPALA-5384. This work will eventually address the underlying cause of IMPALA-6338. Until then, we disable the tests that have been flaky. Change-Id: Ie30b88fb8fb7780fc3a7153c05fdc3606145ce35 Reviewed-on: http://gerrit.cloudera.org:8080/9822 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-09 21:52:57 +00:00
Philip Zeyliger	2896b8d127	IMPALA-6070: Expose using Docker to run tests faster. Allows running the tests that make up the "core" suite in about 2 hours. By comparison, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend tends to run in about 3.5 hours. This commit: * Adds "echo" statements in a few places, to facilitate timing. * Adds --skip-parallel/--skip-serial flags to run-tests.py, and exposes them in run-all-tests.sh. * Marks TestRuntimeFilters as a serial test. This test runs queries that need > 1GB of memory, and, combined with other tests running in parallel, can kill the parallel test suite. * Adds "test-with-docker.py", which runs a full build, data load, and executes tests inside of Docker containers, generating a timeline at the end. In short, one container is used to do the build and data load, and then this container is re-used to run various tests in parallel. All logs are left on the host system. Besides the obvious win of getting test results more quickly, this commit serves as an example of how to get various bits of Impala development working inside of Docker containers. For example, Kudu relies on atomic rename of directories, which isn't available in most Docker filesystems, and entrypoint.sh works around it. In addition, the timeline generated by the build suggests where further optimizations can be made. Most obviously, dataload eats up a precious ~30-50 minutes, on a largely idle machine. This work is significantly CPU and memory hungry. It was developed on a 32-core, 120GB RAM Google Compute Engine machine. I've worked out parallelism configurations such that it runs nicely on 60GB of RAM (c4.8xlarge) and over 100GB (eg., m4.10xlarge, which has 160GB). There is some simple logic to guess at some knobs, and there are knobs. By and large, EC2 and GCE price machines linearly, so, if CPU usage can be kept up, it's not wasteful to run on bigger machines. 
Change-Id: I82052ef31979564968effef13a3c6af0d5c62767 Reviewed-on: http://gerrit.cloudera.org:8080/9085 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-06 06:40:07 +00:00
Philip Zeyliger	8e5f923158	Loosen hive-exec.jar glob pattern in copy-udfs-udas.sh. This commit slightly loosens the coupling between IMPALA_HIVE_VERSION and "hive.version" in the Maven sense. Cherry-picks: not for 2.x Change-Id: Ifbe6f5208b4ad0ffc9cbfe4e93d712ce698beb23 Reviewed-on: http://gerrit.cloudera.org:8080/9925 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-05 06:54:53 +00:00
Bikramjeet Vig	75e1bd1bcd	IMPALA-6771: Fix in-predicate set up bug Fixes a bug that introduced default initialized values in the set data structure used to check for set membership that can cause wrong results. Testing: Added a test case that checks for the same. Change-Id: I7e776dbcb7ee4a9b64e1295134a27d332f5415b6 Reviewed-on: http://gerrit.cloudera.org:8080/9891 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-04 21:51:29 +00:00
Fredy Wijaya	8173e9ab4d	IMPALA-6571: NullPointerException in SHOW CREATE TABLE for HBase tables This patch fixes the NullPointerException in SHOW CREATE TABLE for HBase tables. Testing: - Moved the content of hbase-show-create-table.test back to show-create-table.test - Ran show-create-table end-to-end tests Change-Id: Ibe018313168fac5dcbd80be9a8f28b71a2c0389b Reviewed-on: http://gerrit.cloudera.org:8080/9884 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-04 00:12:30 +00:00
Philip Zeyliger	6b18b00310	IMPALA-6776: Increase region move timeout. Some builds are experiencing slow HBase region moves in the test minicluster. Trying to increase the timeout from 10s to 60s. Change-Id: Ic62719f1b1aad463bcdc18d0803e780ebb0f8b18 Reviewed-on: http://gerrit.cloudera.org:8080/9892 Reviewed-by: Zach Amsden <zamsden@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-03 21:41:56 +00:00
Fredy Wijaya	08d386f0fc	IMPALA-6724: Allow creating/dropping functions with the same name as built-ins This patch removes restriction on creating a function with the same name as the built-in function. The reason for lifting the restriction is to avoid a name clash when introducing new built-in functions. The patch also fixes some inconsistent behavior when creating or dropping a function when the name specified is fully-qualified or not. Refer to the below tables for more information.

Create function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function created              | Function created              |
| Yes     | No          | Different than built-in | Function created              | Function created              |
| No      | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Same name exception           | Function created              |
| No      | No          | Different than built-in | Function created              | Function created              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Drop function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function dropped              | Function dropped              |
| Yes     | No          | Different than built-in | Function dropped              | Function dropped              |
| No      | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Cannot modify system database | Function dropped              |
| No      | No          | Different than built-in | Function dropped              | Function dropped              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Select function (no new behavior):
+---------+-------------+-------------------------+--------------------------------------------------------+
| FQ Name | Built-in DB | Function Name           | Behavior                                               |
+---------+-------------+-------------------------+--------------------------------------------------------+
| Yes     | Yes         | Same as built-in        | Function in the specified database (built-in) executed |
| Yes     | Yes         | Different than built-in | Unknown function exception                             |
| Yes     | No          | Same as built-in        | Function in the specified database executed            |
| Yes     | No          | Different than built-in | Function in the specified database executed            |
| No      | Yes         | Same as built-in        | Built-in function executed                             |
| No      | Yes         | Different than built-in | Unknown function exception                             |
| No      | No          | Same as built-in        | Built-in function executed                             |
| No      | No          | Different than built-in | Function in the current database executed              |
+---------+-------------+-------------------------+--------------------------------------------------------+

Testing: - Ran front-end tests - Added end-to-end DDL function tests Cherry-picks: not for 2.x Change-Id: Ic30df56ac276970116715c14454a5a2477b185fa Reviewed-on: http://gerrit.cloudera.org:8080/9800 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-02 21:12:31 +00:00
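The "Create function" table in IMPALA-6724 reduces to a small decision rule, sketched below as a reading of the table rather than the front-end's actual analysis code. The function and parameter names are hypothetical. The boolean arguments mirror the table's first three columns.

```python
def create_function_outcome(fq_name, builtin_db, same_as_builtin,
                            new_behavior=True):
    """Outcome of CREATE FUNCTION per the commit's table (illustrative)."""
    if new_behavior:
        # After the patch only the target database matters.
        if builtin_db:
            return "Cannot modify system database"
        return "Function created"
    # Before the patch, a name clash was rejected unless the name was
    # fully qualified with a database other than the built-in one.
    if same_as_builtin and not (fq_name and not builtin_db):
        return "Same name exception"
    if builtin_db:
        return "Cannot modify system database"
    return "Function created"
```

The row that motivates the change is the unqualified name in a user database that matches a built-in: it raised a same-name exception before and now creates the function.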
Fredy Wijaya	c3ab27681f	IMPALA-6739: Exception in ALTER TABLE SET statements The patch fixes issues with executing ALTER TABLE SET statements when there are no matching partitions. The patch also removes incorrect precondition i.e. (partitionSet == null \|\| !partitionSet.isEmpty()) in ALTER TABLE SET statements because a partitionSet can be null when PARTITION is not specified in the ALTER TABLE SET statement and partitionSet can be empty when there is no matching partition. For example: Matching partitions (partitionSet != null && !partitionSet.isEmpty()): > alter table functional.alltypesagg partition(year=2009, month=1) set fileformat parquet; No matching partitions (partitionSet != null && partitionSet.isEmpty()): > alter table functional.alltypesagg partition(year=2009, month=1) set fileformat parquet; No partition specified (partitionSet == null): > alter table functional.alltypesagg set fileformat parquet; Testing: - Added a new test - Ran all front-end tests Change-Id: I793e827d5cf5b7986bd150dd9706df58da3417f3 Reviewed-on: http://gerrit.cloudera.org:8080/9819 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-02 21:05:40 +00:00
Thomas Tauber-Marshall	832974383c	IMPALA-6445: Test for kudu master address with whitespace A concern was brought up that Impala might not handle kudu master addresses containing whitespace correctly. Turns out that the Kudu client takes care of stripping whitespace, so it works, but it would be good to have a test to ensure it continues to work. Change-Id: I1857b8dbcb5af66d69f7620368cd3b9b85ae7576 Reviewed-on: http://gerrit.cloudera.org:8080/9876 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-04-02 20:29:51 +00:00
Bikramjeet Vig	4a39e7c29f	IMPALA-5980: Upgrade to LLVM 5.0.1 Highlighting a few changes in LLVM: - Minor changes to some function signatures - Minor changes to error handling - Split Bitcode/ReaderWriter.h - https://reviews.llvm.org/D26502 - Introduced an optional new GVN optimization pass. Needed to fix a bunch of new clang-tidy warnings. Testing: Ran core and ASAN tests successfully. Performance: Ran single node TPC-H and targeted perf with scale factor 60. Both improved on average. Identified regression in "primitive_filter_in_predicate" which will be addressed by IMPALA-6621. +-------------------+-----------------------+---------+------------+------------+----------------+ \| Workload \| File Format \| Avg (s) \| Delta(Avg) \| GeoMean(s) \| Delta(GeoMean) \| +-------------------+-----------------------+---------+------------+------------+----------------+ \| TARGETED-PERF(60) \| parquet / none / none \| 22.29 \| -0.12% \| 3.90 \| +3.16% \| \| TPCH(60) \| parquet / none / none \| 15.97 \| -3.64% \| 10.14 \| -4.92% \| +-------------------+-----------------------+---------+------------+------------+----------------+ +-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Num Clients \| Iters \| +-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+ \| TARGETED-PERF(60) \| PERF_LIMIT-Q1 \| parquet / none / none \| 0.01 \| 0.00 \| R +156.43% \| * 25.80% * \| * 17.14% * \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_in_predicate \| parquet / none / none \| 3.39 \| 1.92 \| R +76.33% \| 3.23% \| 4.37% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_string_non_selective \| parquet / 
none / none \| 1.25 \| 1.11 \| +12.46% \| 3.41% \| 5.36% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_decimal_selective \| parquet / none / none \| 1.40 \| 1.25 \| +12.25% \| 3.57% \| 3.44% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_string_like \| parquet / none / none \| 16.87 \| 15.65 \| +7.78% \| 5.05% \| 0.37% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_min_max_runtime_filter \| parquet / none / none \| 1.79 \| 1.71 \| +4.77% \| 0.71% \| 1.73% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_broadcast_join_2 \| parquet / none / none \| 0.60 \| 0.58 \| +3.64% \| 3.19% \| 3.81% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_string_selective \| parquet / none / none \| 0.95 \| 0.93 \| +2.91% \| 5.23% \| 5.85% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_broadcast_join_3 \| parquet / none / none \| 4.33 \| 4.21 \| +2.83% \| 5.46% \| 3.25% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_groupby_bigint_lowndv \| parquet / none / none \| 4.59 \| 4.47 \| +2.82% \| 3.73% \| 1.14% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_conjunct_ordering_3 \| parquet / none / none \| 0.20 \| 0.19 \| +2.65% \| 4.76% \| 2.24% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_AGG-Q1 \| parquet / none / none \| 2.49 \| 2.43 \| +2.31% \| 1.06% \| 1.93% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_AGG-Q6 \| parquet / none / none \| 2.04 \| 2.00 \| +2.09% \| 3.51% \| 2.80% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q3 \| parquet / none / none \| 12.37 \| 12.17 \| +1.62% \| 0.80% \| 2.45% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_STRING-Q5 \| parquet / none / none \| 4.52 \| 4.45 \| +1.54% \| 1.23% \| 1.08% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q6 \| parquet / none / none \| 2.95 \| 2.91 \| +1.33% \| 1.92% \| 1.67% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_STRING-Q4 \| parquet / none / none \| 3.71 \| 3.66 \| +1.26% \| 0.34% \| 0.53% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q1 \| parquet / none / none \| 18.69 \| 18.47 \| +1.19% \| 0.75% \| 0.31% \| 1 \| 5 \| \| TARGETED-PERF(60) \| 
PERF_STRING-Q7 \| parquet / none / none \| 8.15 \| 8.07 \| +0.99% \| 3.92% \| 1.58% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_groupby_decimal_highndv \| parquet / none / none \| 31.31 \| 31.01 \| +0.97% \| 1.74% \| 1.14% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q5 \| parquet / none / none \| 7.59 \| 7.53 \| +0.78% \| 0.38% \| 0.99% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_AGG-Q4 \| parquet / none / none \| 21.25 \| 21.09 \| +0.76% \| 0.76% \| 0.75% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_conjunct_ordering_4 \| parquet / none / none \| 0.24 \| 0.24 \| +0.75% \| 3.14% \| 4.76% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q19 \| parquet / none / none \| 7.88 \| 7.82 \| +0.74% \| 2.39% \| 2.64% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_orderby_bigint \| parquet / none / none \| 5.10 \| 5.07 \| +0.61% \| 0.74% \| 0.54% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_STRING-Q3 \| parquet / none / none \| 3.61 \| 3.59 \| +0.60% \| 1.45% \| 0.90% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_orderby_all \| parquet / none / none \| 27.63 \| 27.48 \| +0.55% \| 0.85% \| 0.10% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q4 \| parquet / none / none \| 5.81 \| 5.79 \| +0.45% \| 1.65% \| 2.16% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q13 \| parquet / none / none \| 23.49 \| 23.43 \| +0.27% \| 0.83% \| 0.63% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q21 \| parquet / none / none \| 68.88 \| 68.76 \| +0.18% \| 0.22% \| 0.19% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_groupby_decimal_lowndv.test \| parquet / none / none \| 4.38 \| 4.37 \| +0.09% \| 2.45% \| 0.45% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_conjunct_ordering_5 \| parquet / none / none \| 10.40 \| 10.40 \| +0.07% \| 0.77% \| 0.50% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_long_predicate \| parquet / none / none \| 222.37 \| 222.23 \| +0.06% \| 0.25% \| 0.25% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q8 \| parquet / none / none \| 10.65 \| 10.65 \| +0.03% \| 0.55% \| 1.40% \| 1 \| 5 \| \| TARGETED-PERF(60) \| 
primitive_shuffle_join_one_to_many_string_with_groupby \| parquet / none / none \| 261.84 \| 261.87 \| -0.01% \| 0.91% \| 0.74% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_AGG-Q3 \| parquet / none / none \| 9.44 \| 9.45 \| -0.02% \| 0.92% \| 1.33% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q16 \| parquet / none / none \| 5.21 \| 5.21 \| -0.02% \| 1.46% \| 1.64% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_top-n_all \| parquet / none / none \| 34.58 \| 34.62 \| -0.11% \| 0.22% \| 0.19% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_topn_bigint \| parquet / none / none \| 4.24 \| 4.25 \| -0.13% \| 6.66% \| 2.03% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_STRING-Q2 \| parquet / none / none \| 3.23 \| 3.24 \| -0.34% \| 2.03% \| 0.32% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_broadcast_join_1 \| parquet / none / none \| 0.18 \| 0.18 \| -0.40% \| 6.16% \| 2.45% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_exchange_broadcast \| parquet / none / none \| 46.27 \| 46.51 \| -0.52% \| 7.83% \| * 15.60% * \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_groupby_bigint_pk \| parquet / none / none \| 114.32 \| 114.92 \| -0.52% \| 0.24% \| 0.61% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q22 \| parquet / none / none \| 6.66 \| 6.70 \| -0.53% \| 1.39% \| 0.84% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q20 \| parquet / none / none \| 5.78 \| 5.81 \| -0.62% \| 1.25% \| 0.67% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q2 \| parquet / none / none \| 2.53 \| 2.55 \| -0.64% \| 3.86% \| 3.72% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_AGG-Q5 \| parquet / none / none \| 0.58 \| 0.58 \| -0.75% \| 0.99% \| 6.89% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_AGG-Q7 \| parquet / none / none \| 2.05 \| 2.07 \| -0.86% \| 2.16% \| 4.73% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_shuffle_join_union_all_with_groupby \| parquet / none / none \| 54.86 \| 55.34 \| -0.87% \| 0.25% \| 0.66% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_conjunct_ordering_2 \| parquet / none / none \| 7.52 \| 7.59 \| -0.98% \| 1.53% \| 1.73% \| 1 \| 5 \| 
\| TPCH(60) \| TPCH-Q9 \| parquet / none / none \| 36.43 \| 36.79 \| -1.00% \| 1.60% \| 7.39% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_STRING-Q1 \| parquet / none / none \| 2.79 \| 2.82 \| -1.10% \| 1.15% \| 2.25% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q11 \| parquet / none / none \| 1.95 \| 1.97 \| -1.18% \| 3.14% \| 2.24% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_AGG-Q2 \| parquet / none / none \| 10.98 \| 11.11 \| -1.24% \| 0.77% \| 1.45% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_small_join_1 \| parquet / none / none \| 0.22 \| 0.22 \| -1.34% \| * 13.03% * \| * 12.31% * \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q7 \| parquet / none / none \| 42.82 \| 43.41 \| -1.37% \| 1.63% \| 1.51% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_empty_build_join_1 \| parquet / none / none \| 3.30 \| 3.35 \| -1.54% \| 2.15% \| 1.27% \| 1 \| 5 \| \| TARGETED-PERF(60) \| PERF_STRING-Q6 \| parquet / none / none \| 10.34 \| 10.54 \| -1.81% \| 0.24% \| 2.02% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_groupby_bigint_highndv \| parquet / none / none \| 32.80 \| 33.46 \| -1.98% \| 1.29% \| 0.61% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_decimal_non_selective \| parquet / none / none \| 1.62 \| 1.67 \| -3.01% \| 0.79% \| 1.65% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_conjunct_ordering_1 \| parquet / none / none \| 0.13 \| 0.14 \| -3.36% \| 8.66% \| * 12.66% * \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_exchange_shuffle \| parquet / none / none \| 84.92 \| 87.96 \| -3.46% \| 1.46% \| 1.50% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q12 \| parquet / none / none \| 6.98 \| 7.31 \| -4.57% \| 1.03% \| 7.13% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q18 \| parquet / none / none \| 47.54 \| 50.39 \| -5.64% \| 5.70% \| 5.53% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_bigint_non_selective \| parquet / none / none \| 0.88 \| 0.96 \| -7.81% \| 4.27% \| 5.97% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q15 \| parquet / none / none \| 8.14 \| 9.15 \| -11.09% \| 0.63% \| * 10.44% * \| 1 \| 5 \| \| 
TPCH(60) \| TPCH-Q10 \| parquet / none / none \| 12.66 \| 14.28 \| -11.34% \| 4.32% \| 1.14% \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q17 \| parquet / none / none \| 10.31 \| 12.59 \| -18.14% \| 0.65% \| 3.72% \| 1 \| 5 \| \| TARGETED-PERF(60) \| primitive_filter_bigint_selective \| parquet / none / none \| 0.14 \| 0.19 \| I -27.60% \| * 32.55% * \| * 39.78% * \| 1 \| 5 \| \| TPCH(60) \| TPCH-Q14 \| parquet / none / none \| 6.10 \| 11.00 \| I -44.55% \| 4.06% \| 3.84% \| 1 \| 5 \| +-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+ Change-Id: Ib0a15cb53feab89e7b35a56b67b3b30eb3e62c6b Reviewed-on: http://gerrit.cloudera.org:8080/9584 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-28 04:25:27 +00:00
Taras Bobrovytsky	8fec1911e5	IMPALA-6230, IMPALA-6468: Fix the output type of round() and related fns Before this patch, the output type of round() ceil() floor() trunc() was not always the same as the input type. It was also inconsistent in general. For example, round(double) returned an integer, but round(double, int) returned a double. After looking at other database systems, we decided that the guideline should be that the output type should be the same as the input type. In this patch, we change the behavior of the previously mentioned functions so that if a double is given then a double is returned. We also modify the rounding behavior to always round away from zero. Before, we were rounding towards positive infinity in some cases. Testing: - Updated tests - Ran an exhaustive build which passed. Cherry-picks: not for 2.x Change-Id: I77541678012edab70b182378b11ca8753be53f97 Reviewed-on: http://gerrit.cloudera.org:8080/9346 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-24 04:43:01 +00:00
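The difference between the two rounding rules named in the commit above shows up on negative ties. The sketch below assumes the change concerns how halfway values are resolved; both function names are hypothetical and neither is Impala's implementation. Both return a float, matching the "double in, double out" guideline.

```python
import math


def round_half_away_from_zero(x: float) -> float:
    # Round the magnitude, then restore the sign, so ties move away from 0.
    return math.copysign(math.floor(abs(x) + 0.5), x)


def round_half_toward_positive_infinity(x: float) -> float:
    # Ties always move up, so negative ties move toward 0.
    return float(math.floor(x + 0.5))
```

For 2.5 both rules give 3.0. For -2.5 the first gives -3.0 and the second gives -2.0.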
Vuk Ercegovac	2894884deb	IMPALA-6670: refresh lib-cache entries from plan When an impalad is in executor-only mode, it receives no catalog updates. As a result, lib-cache entries are never refreshed. A consequence is that udf queries can return incorrect results or may not run due to resolution issues. Both cases are caused by the executor using a stale copy of the lib file. For incorrect results, an old version of the method may be used. Resolution issues can come up if a method is added to a lib file. The solution in this change is to capture the coordinator's view of the lib file's last modified time when planning. This last modified time is then shipped with the plan to executors. Executors must then use both the lib file path and the last modified time as a key for the lib-cache. If the coordinator's last modified time is more recent than the executor's lib-cache entry, then the entry is refreshed. Brief discussion of alternatives: - lib-cache always checks last modified time + easy/local change to lib-cache - adds an fs lookup always. rejected for this reason - keep the last modified time in the catalog - bound on staleness is too loose. consider the case where fn's f1, f2, f3 are created with last modified times of t1, t2, t3. treat the fn's last modified time as a low-watermark; if the cache entry has a more recent time, use it. Such a scheme would allow the version at t2 to persist. An old fn may keep the state from converging to the latest. This could end up with strange cases where different versions of the lib are used across executors for a single query. In contrast, the change in this patch relies on the statestore to push versions forward at all coordinators, so will push all versions at all caches forward as well. 
Testing: - added an e2e custom cluster test Change-Id: Icf740ea8c6a47e671427d30b4d139cb8507b7ff6 Reviewed-on: http://gerrit.cloudera.org:8080/9697 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-24 04:38:53 +00:00
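The refresh rule from IMPALA-6670 (an executor refreshes its lib-cache entry when the last modified time shipped with the plan is newer than its cached copy) can be sketched as follows. The class and parameter names (`LibCache`, `loader`, `coordinator_mtime`) are hypothetical and are not taken from the Impala code base; the loader is injected so the sketch needs no real filesystem.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CacheEntry:
    mtime: int      # last modified time of the cached copy
    payload: bytes  # contents of the lib file


class LibCache:
    """Toy executor-side cache keyed on path plus last modified time."""

    def __init__(self, loader: Callable[[str], Tuple[int, bytes]]):
        # loader(path) -> (mtime, payload); stands in for a filesystem read.
        self._loader = loader
        self._entries: Dict[str, CacheEntry] = {}
        self.loads = 0  # number of times the loader was called

    def get(self, path: str, coordinator_mtime: int) -> bytes:
        entry = self._entries.get(path)
        # Reload only when there is no entry, or the coordinator saw a
        # newer file than the one cached here. No lookup otherwise.
        if entry is None or coordinator_mtime > entry.mtime:
            mtime, payload = self._loader(path)
            self.loads += 1
            entry = CacheEntry(mtime, payload)
            self._entries[path] = entry
        return entry.payload
```

Because the comparison uses a time already carried in the plan, the common case costs no extra filesystem lookup, which is the reason the commit gives for rejecting the "always check last modified time" alternative.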
Philip Zeyliger	783de170c9	IMPALA-4277: Support multiple versions of Hadoop ecosystem Adds support for building against two sets of Hadoop ecosystem components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE, which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for Hadoop 3, Hive 2, and so on). We intend (in a trivial follow-on change soon) to make 3 the new default and to explicitly deprecate 2, but this change does not switch the default yet. We support both to facilitate a smoother transition, but support will be removed soon in the Impala 3.x line. The switch is done at build time, following the pattern from IMPALA-5184 (build fe against both Hive 1 & 2 APIs). Switching back and forth requires running 'cmake' again. Doing this at build-time avoids complicating the Java code with classloader configuration. There are relatively few incompatible APIs. This implementation encapsulates that by extracting some Java code into fe/src/compat-minicluster-profile-{2,3}. (This follows the pattern established by IMPALA-5184, but, to avoid a proliferation of directories, I've moved the Hive files into the same tree.) For Maven, I introduced Maven "profiles" to handle the two cases where the dependencies (and exclusions) differ. These are driven by the $IMPALA_MINICLUSTER_PROFILE environment variable. For Sentry, exception class names changed. We work around this by adding "isSentry...(Exception)" methods with two different implementations. Sentry is also doing some odd shading, whereby some exceptions are "sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism to create a SentryAuthProvider is slightly different. 
The easiest way to see the differences is to run: diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java The Sentry work is based on a change by Zach Amsden. In addition, we recently added an explicit "refresh" permission. In Sentry 2, this required creating an ImpalaPrivilegeModel to capture that. It's a slight customization of Hive's equivalent class. For Parquet, the difference is even more mechanical. The package names have gone from "parquet" to "org.apache.parquet". The affected code was extracted into ParquetHelper, but only one copy exists. The second copy is generated at build-time using sed. In the rare cases where we need to behave differently at runtime, MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates what version we were built against. One of the cases is the results expected by various frontend tests. I avoided the issue by translating one error string into another, which handled the divergence in one place, rather than complicating the several locations which look for "No FileSystem for scheme..." errors. The HBase APIs we use for splitting regions at test time changed. This patch includes a re-write of that code for the new APIs. This piece was contributed by Zach Amsden. To work with newer versions of dependencies, I updated the version of httpcomponents.core we use to 4.4.9. We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase binaries to s3://native-toolchain, and amended the shell scripts to launch the right things. There are minor mechanical differences. Some of this was based on earlier work by Joe McDonnell and Zach Amsden. Hive's logging is changed in Hive 2, necessitating creating a log4j2.properties template and using it appropriately. 
Furthermore, Hadoop3's new shell script re-writes do a certain amount of classpath de-duplication, causing some issues with locating the relevant logging configurations. Accommodations exist in the code to deal with that. parquet-filtering.test was updated to turn off stats filtering. Older Hive didn't write Parquet statistics, but newer Hive does. By turning off stats filtering, we test what the test had intended to test. For views-compatibility.test, it seems that Hive 2 has fixed certain bugs that we were testing for in Hive. I've added a HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that. For AuthorizationTest, different hive versions show slightly different things for extended output. To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing to see here. rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%) rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%) rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => 
common}/etc/hadoop/conf/kms-site.xml.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%) rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%) CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This was done manually, but without changing anything except what Java required in terms of accessibility and boilerplate. rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%) copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%) Testing: Ran core & exhaustive tests with both profiles. Cherry-picks: not for 2.x. Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7 Reviewed-on: http://gerrit.cloudera.org:8080/9716 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-23 20:56:00 +00:00
Tim Armstrong	588e1d46e9	IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner Impala already supported RLE encoding for levels and dictionary pages, so the only task was to integrate it into BoolColumnReader. A new benchmark, rle-benchmark.cc is added to test the speed of RLE decoding for different bit widths and run lengths. There might be a small performance impact on PLAIN encoded booleans, because of the additional branch when the cache of BoolColumnReader is filled. As the cache size is 128, I considered this to be outside the "hot loop". Testing: As Impala cannot write RLE encoded bool columns at the moment, parquet-mr was used to create a test file, testdata/data/rle_encoded_bool.parquet tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table that uses this file, and tries to query from it. Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f Reviewed-on: http://gerrit.cloudera.org:8080/9403 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-22 02:47:33 +00:00
Tianyi Wang	d03b66ca35	IMPALA-6394: Restart HDFS only when no replication progress is made In wait-hdfs-replication, the frequent and eager restart might slow the HDFS replication down. HDFS should be restarted only if no progress is made in a certain amount of time, and we should wait longer before failing the data loading. Testing: It's tested with a fake HDFS fsck script. Change-Id: Ib059480254643dc032731b4b3c55204a93b61e77 Reviewed-on: http://gerrit.cloudera.org:8080/9698 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-22 00:41:16 +00:00
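The policy described in the IMPALA-6394 commit above (restart HDFS only after a period with no replication progress, and give up only after several such restarts) can be sketched in Python. The real logic lives in a shell script; every name and default below (`wait_for_replication`, `stall_timeout_s`, `max_restarts`, `poll_interval_s`) is hypothetical, and the clock and sleep functions are injected so the sketch can be exercised without a cluster.

```python
def wait_for_replication(get_under_replicated, restart_hdfs, now, sleep,
                         stall_timeout_s=60.0, max_restarts=3,
                         poll_interval_s=1.0):
    """Return True once no blocks are under-replicated, False on giving up.

    get_under_replicated() -> int  current count of under-replicated blocks
    restart_hdfs()                 side effect: restart the HDFS services
    now() -> float                 monotonic clock in seconds
    sleep(seconds)                 blocks for the given time
    """
    best = get_under_replicated()
    last_progress = now()
    restarts = 0
    while best > 0:
        sleep(poll_interval_s)
        current = get_under_replicated()
        if current < best:
            # Progress was made: remember it and reset the stall timer.
            best = current
            last_progress = now()
        elif now() - last_progress >= stall_timeout_s:
            # Stalled for the whole timeout: restart, or give up.
            if restarts == max_restarts:
                return False
            restart_hdfs()
            restarts += 1
            last_progress = now()
    return True
```

Resetting the timer on every drop in the block count is what keeps a slow but steadily progressing replication from being interrupted by an eager restart.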
Bikramjeet Vig	3d65f856f7	IMPALA-6621: Improve set lookup performance for in-predicate evaluation Currently when using a SET_LOOKUP strategy for in-predicates in Impala we use a std::set object for checking membership. This patch takes a hybrid approach based on benchmarking results and uses boost::flat_set for int, big int, and float datatypes and boost::unordered_set for the rest (tiny int, small int, double, string, timestamp, decimal). The intent of this change is to fix a regression when upgrading the toolchain to use LLVM 5.0.1 (IMPALA-5980). Performance: Ran a query for each data type with a large in predicate containing 500 elements on a single node with mt_dop set to 1. +-----------+---------------+----------+---------------+----------+ \| Data Type \| Llvm 3 hybrid \| Llvm 3 \| Llvm 5 hybrid \| Llvm 5 \| +-----------+---------------+----------+---------------+----------+ \| Table used: tpch100_parquet.lineitem \| +-----------+---------------+----------+--------------+-----------+ \| big int \| 17s782ms \| 13s941ms \| 13s201ms \| 25s604ms \| \| string \| 40s750ms \| 64s \| 40s723ms \| 73s \| \| decimal \| 13s929ms \| 22s272ms \| 13s710ms \| 34s338ms \| \| int \| 19s368ms \| 11s308ms \| 9s169ms \| 15s254ms \| +-----------+---------------+----------+--------------+-----------+ \| Table used: alltypes with 33638400 rows \| +-----------+---------------+----------+--------------+-----------+ \| double \| 5s726ms \| 5s894ms \| 5s595ms \| 6s592ms \| \| small int \| 4s776ms \| 5s057ms \| 4s740ms \| 5s358ms \| \| float \| 7s223ms \| 6s397ms \| 6s287ms \| 6s926ms \| +-----------+---------------+----------+---------------+----------+ Also added a targeted perf query that uses a large in-predicate over a decimal column. Testing: - Ran expr-test and test_exprs successfully. 
Change-Id: Ifd1627d779d10a16468cc3c2d0bc26a497e048df Reviewed-on: http://gerrit.cloudera.org:8080/9570 Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-21 00:40:10 +00:00
Philip Zeyliger	5c8da5d13a	Consistently use Java 1.7 compiler. We use Java 1.7 in fe/pom.xml, where most of our Java code is. For consistency, this updates the rest of our Maven configurations to use the same version of Java. A change I'm working with uses try-with-resources in HBase splitting, which is how I ran into this. Testing: ran core tests Change-Id: I6cecddf367f00185a14a8b08c03456e3b756bd70 Reviewed-on: http://gerrit.cloudera.org:8080/9600 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-17 04:08:53 +00:00
Tim Armstrong	e148c1a7c3	IMPALA-6589: remove invalid DCHECK in parquet reader The DCHECK was only valid if the Parquet file metadata is internally consistent, with the number of values reported by the metadata matching the number of encoded levels. The DCHECK was intended to directly detect misuse of the RleBatchDecoder interface, which would lead to incorrect results. However, our other test coverage for reading Parquet files is sufficient to test the correctness of level decoding. Testing: Added a minimal corrupt test file that reproduces the issue. Change-Id: Idd6e09f8c8cca8991be5b5b379f6420adaa97daa Reviewed-on: http://gerrit.cloudera.org:8080/9556 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-17 02:52:19 +00:00
Fredy Wijaya	41a516f949	IMPALA-6655: Add owner information on database creation Add owner information on database creation. > create database foo; > describe database extended foo; +---------+----------+---------+ \| name \| location \| comment \| +---------+----------+---------+ \| foo \| \| \| \| Owner: \| \| \| \| \| user1 \| USER \| +---------+----------+---------+ Testing: - Ran end-to-end query and metadata tests Change-Id: Id74ec9bd3cb7954999305e9cd9085cbf50921a78 Reviewed-on: http://gerrit.cloudera.org:8080/9637 Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-16 19:28:37 +00:00
Alex Behm	42abe8139e	IMPALA-5270: Pass resolved exprs into analytic SortInfo. The bug was that the SortInfo of analytics was given ordering exprs that were not fully resolved against their input (e.g. inline views were not resolved). As a result, the SortInfo logic did not materialize exprs like rand() coming from inline views. The fix is to pass fully resolved exprs to the analytic SortInfo, and then the existing materialization logic properly handles non-deterministic built-ins and UDFs. The code around sort generation was rather convoluted and difficult to understand. I overhauled SortInfo to unify the different uses of it under a common codepath. After that cleanup, the fix for this issue was trivial. Testing: - Locally ran planner tests - Locally ran analytic EE tests in test_queries.py - Core/hdfs run passed Change-Id: Id2b3f4e5e3f1fd441a63160db3c703c432fbb072 Reviewed-on: http://gerrit.cloudera.org:8080/9631 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-15 02:00:46 +00:00
Philip Zeyliger	45aee121eb	Removing (broken) retries from split-hbase.sh. The retries in split-hbase.sh don't work in the common case, because $MINIKDC_PRINC_HIVE is not set in non-kerberized (common) environments. The regular data load scripts (create-load-data.sh) have code to manage that, but split-hbase.sh blindly forges ahead, leading to errors like: /home/impdev/Impala/testdata/bin/split-hbase.sh: line 49: MINIKDC_PRINC_HIVE: unbound variable Error in /home/impdev/Impala/testdata/bin/create-load-data.sh at line 48: LOAD_DATA_ARGS="" Since this hasn't been working, I opted to remove it entirely, as a failure on the line where HBase splitting actually failed would be significantly more useful than the error here. A search of mailing lists suggested that I was at least the second person to have run into this. (In my case, I did break HBase splitting, but it took me a second to identify the error, since the log was spammed with unrelated information relating to the cluster restart.) Testing: core tests. Change-Id: I715891c9e744f21002330c3ae3ebc14095d94ffd Reviewed-on: http://gerrit.cloudera.org:8080/9588 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-15 01:32:14 +00:00
Grant Henke	03794389fc	IMPALA-6551: Change Kudu TPCDS and TPCH columns to DECIMAL Before Kudu supported DECIMAL columns, the TPCDS and TPCH columns were adjusted to use DOUBLE in place of DECIMAL. This patch undoes that change now that Kudu supports DECIMAL. Testing: - Updated concurrent_select.py - Updated test_tpch_queries.py - Exercised by the Kudu planner tests Change-Id: I2f7e4464dc6705cadd610a82c459390a9c0dfe4f Reviewed-on: http://gerrit.cloudera.org:8080/9484 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-14 21:38:06 +00:00
Vincent Tran	0d7787fe4d	IMPALA-5315: Cast to timestamp fails for YYYY-M-D format This change allows casting of a string in 'lazy' date/time format to timestamp. The supported lazy date formats are: yyyy-[M]M-[d]d yyyy-[M]M-[d]d [H]H:[m]m:[s]s[.SSSSSSSSS] [H]H:[m]m:[s]s[.SSSSSSSSS] We will incur a SCAN performance penalty (approximately 1/2 TotalReadThroughput) when the string is in one of these lazy date/time formats. Testing: Benchmarked the performance consequence by executing this SQL on a private build over 3.8 billion rows: select min(cast (time_string as timestamp)) from private.impala_5315 Added tests for valid and invalid date/time format strings in expr-test.cc to be in line with existing tests for CAST() function. Added end-to-end tests into exprs.test and select-lazy-timestamp.test to exercise the new function within the context of a query. Added tests to exercise the leading and trailing white space trimming behaviour in default and lazy date/time string format (IMPALA-6630). Change-Id: Ib9a184a09d7e7783f04d47588537612c2ecec28f Reviewed-on: http://gerrit.cloudera.org:8080/7009 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-13 22:10:18 +00:00
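The three lazy formats listed in the IMPALA-5315 commit above can be recognized with a pair of regular expressions. The following Python sketch is an illustration only, not Impala's parser: the function name `parse_lazy_timestamp` and the returned tuple layout are hypothetical, a date-only string is assumed to mean midnight, and a time-only string is returned with `None` in the date fields.

```python
import re
from datetime import date

_DATE = r"(\d{4})-(\d{1,2})-(\d{1,2})"                    # yyyy-[M]M-[d]d
_TIME = r"(\d{1,2}):(\d{1,2}):(\d{1,2})(?:\.(\d{1,9}))?"  # [H]H:[m]m:[s]s[.S...]


def _time_fields(h, mi, sec, frac):
    h, mi, sec = int(h), int(mi), int(sec)
    if h > 23 or mi > 59 or sec > 59:
        return None
    # Right-pad the fraction to nine digits: ".5" means 500000000 ns.
    return (h, mi, sec, int((frac or "").ljust(9, "0")))


def parse_lazy_timestamp(text):
    """Return (year, month, day, hour, minute, second, nanos) or None."""
    s = text.strip()  # leading and trailing whitespace is ignored
    m = re.fullmatch(rf"{_DATE}(?: {_TIME})?", s)
    if m:
        y, mo, d = (int(g) for g in m.groups()[:3])
        try:
            date(y, mo, d)  # rejects month 13, day 31 in April, and so on
        except ValueError:
            return None
        if m.group(4) is None:  # date only
            return (y, mo, d, 0, 0, 0, 0)
        t = _time_fields(*m.groups()[3:])
        return None if t is None else (y, mo, d) + t
    m = re.fullmatch(_TIME, s)
    if m:  # time only
        t = _time_fields(*m.groups())
        return None if t is None else (None, None, None) + t
    return None
```

Accepting one or two digits per field requires checking each field boundary instead of reading fixed offsets, which may be related to the scan penalty the commit mentions for lazily formatted strings.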
Grant Henke	6d8ce64020	IMPALA-6635: Add DECIMAL type to Kudu predicates This patch enables pushing scan predicates on DECIMAL columns down to Kudu. Testing: - Added Planner decimal predicate test to kudu.test - Added Planner decimal in-list test to kudu-selectivity.test Change-Id: I2569a9e1d58f1c58884d58633d46348364888ed7 Reviewed-on: http://gerrit.cloudera.org:8080/9578 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-13 20:31:43 +00:00
Philip Zeyliger	8e1bf0e99a	IMPALA-6341, IMPALA-5917: Reduce mem-limit for start-impala-cluster. We've observed empirically that giving Impala 80% of system memory doesn't leave enough room for the minicluster and ASAN overhead, leading to the OOM killer striking during test runs (sometimes). This commit reduces the threshold to 70%. This commit also reduces the memory usage of semi-joins-exhaustive.test by roughly halving the number of records it deals with. This was necessary for tests to pass on a machine with 32GB of RAM. Testing: I've run the ASAN build (more) happily with this change. I've run exhaustive tests on a 32GB machine. Change-Id: Iabca7a95560bd27c2de2b0a147ee9a3c45199db7 Reviewed-on: http://gerrit.cloudera.org:8080/9395 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-13 00:39:38 +00:00
Tianyi Wang	c7a58b8a73	IMPALA-6394: Restart HDFS when blocks are under replicated HDFS sometimes fails to fully replicate all the blocks in 30 seconds and no progress is made. This patch tries to restart HDFS several times before aborting the data loading. Change-Id: Iefd4c2fc6c287f054e385de52bdc42b0bdbd7915 Reviewed-on: http://gerrit.cloudera.org:8080/9469 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-09 22:54:47 +00:00
Zoltan Borok-Nagy	5d044e0cb2	IMPALA-6542: Fix inconsistent write path of Parquet min/max statistics Quick fix of Parquet write path until the Parquet community agrees on the ordering of floating point numbers. The behavior follows the way fmax()/fmin() works, ie. Impala will only write NaN into the stats when all the values are NaNs. This behavior is aligned with the quick fix of Parquet-CPP. Added e2e tests as well. Change-Id: I3957806948f7c661af4be5495f2ec92d1e9fc9d6 Reviewed-on: http://gerrit.cloudera.org:8080/9381 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-08 07:34:41 +00:00
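The write-side rule from the IMPALA-6542 commit above (follow fmin()/fmax(), so NaN ends up in the statistics only when every value is NaN) can be sketched in Python. The function name `min_max_stats` is hypothetical and this is not the Impala writer; note that in this sketch an empty input also yields NaN for both bounds.

```python
import math


def min_max_stats(values):
    """Compute (min, max) over floats, skipping NaN like fmin()/fmax().

    Both results are NaN only when no non-NaN value was seen.
    """
    lo = hi = math.nan
    for v in values:
        if math.isnan(v):
            continue  # a NaN never replaces a real bound
        if math.isnan(lo) or v < lo:
            lo = v
        if math.isnan(hi) or v > hi:
            hi = v
    return lo, hi
```

Skipping NaN explicitly avoids the order dependence of a plain `v < lo` comparison, which is always false when either side is NaN and so lets an early NaN stick as the bound.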
Tim Armstrong	73e90d237e	IMPALA-6592: add test for invalid parquet codecs IMPALA-6592 revealed a gap in test coverage for files with invalid/unsupported Parquet codecs. This adds a test that reproduces the bug that was present in my IMPALA-4835 patch. master is unaffected by this bug. I also hid the conversion tables and made the conversion go through functions that validate the enum values, to make it easier to track down problems like this in the future. Testing: Ran exhaustive tests. Change-Id: I1502ea7b7f39aa09f0ed2677e84219b37c64c416 Reviewed-on: http://gerrit.cloudera.org:8080/9500 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-08 04:48:36 +00:00
Tim Armstrong	7376ca29b4	IMPALA-6595: fix crash in NljBuilder::Close() The bug is that the right child of a blocking join node could be closed before the builder if an error was encountered when sending a batch to the sink. This hits a DCHECK because Buffers owned by the sink may still be accounted against the child node. Testing: Added the test that originally triggered the problem. It reproduced the failure when based on the IMPALA-4835 patch, but I can't reproduce the failure after rebase onto master. Change-Id: Ie46b87a4889d7cee907124796c830db41125cf15 Reviewed-on: http://gerrit.cloudera.org:8080/9493 Tested-by: Impala Public Jenkins Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2018-03-06 16:12:05 +00:00
Taras Bobrovytsky	b0027575cb	IMPALA-6405: Error when string to decimal cast overflows Before this patch, when there was an error when converting a string to a decimal, a NULL was returned. In this patch, we change this behavior so that an error is returned if decimal_v2 is enabled. We also add a warning if there is an underflow. The reasoning is that we want stricter behavior in decimal_v2. Testing: - Added some EE tests. - Ran an exhaustive build, which passed. Change-Id: Icffccac1c1c2361447ae4b0de9b6c2ec7de071db Reviewed-on: http://gerrit.cloudera.org:8080/9339 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-06 03:29:47 +00:00
Tim Armstrong	161cbe30ff	Revert IMPALA-4835 and dependent changes Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit" This reverts commit `25bcb258df`. Revert "IMPALA-6588: don't add empty list of ranges in text scan" This reverts commit `d57fbec6f6`. Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool" This reverts commit `24b4ed0b29`. Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront" This reverts commit `5699b59d0c`. Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation" This reverts commit `65680dc421`. Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb Reviewed-on: http://gerrit.cloudera.org:8080/9480 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-03 04:22:12 +00:00
Joe McDonnell	fd66890bf1	IMPALA-6579: Always force reload Kudu tables for dataload When loading from an up-to-date snapshot, dataload will load all of the metadata and load data into HDFS. Then, it will skip load-data.py for functional/exhaustive, tpch/core, and tpcds/core. It will invoke a special round of load-data.py calls to populate Kudu tables, and it always runs these with a force reload. However, when loading from an old snapshot, dataload will still load all of the metadata and load the data into HDFS, but then it will still invoke load-data.py for functional/exhaustive, tpch/core, and tpcds/core. These invocations mostly do DDLs with very few load statements. However, these invocations are a problem for Kudu. The metadata of Impala tables referencing Kudu entities have been imported along with all the other metadata, but the Kudu entities have not been created, as they are separate from HDFS. This means that Kudu tables are not really valid in this circumstance. Since Kudu has been added to the list of data formats for tpch/core (see IMPALA-6475), load-data.py with tpch/core will attempt to insert into these invalid Kudu tables. To avoid this, always force reload any Kudu tables. generate-schema-statements.py will always generate a drop table statement before any create of a Kudu table. This guarantees that the create will also create the corresponding Kudu entity. Change-Id: I2d07f3513c543e2590f2f62b96b37472316868ee Reviewed-on: http://gerrit.cloudera.org:8080/9445 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-25 03:04:58 +00:00
Joe McDonnell	0f33370b8b	IMPALA-6580: Use LOAD DATA LOCAL for decimal tables IMPALA-5752 added support for Kudu decimal. As a part of that, it added Kudu versions of decimal_tbl and decimal_tiny. Kudu tables are created and loaded even on local tests, so these tables are loaded when they previously weren't. The LOAD sections for these tables rely on executing HDFS commands to copy data to appropriate locations. These HDFS commands cannot work on local tests, causing this failure. Untangling when to execute LOAD sections is complicated, so this simply switches the decimal_tbl and decimal_tiny to do LOAD DATA LOCAL calls, which do not rely on HDFS commands. Change-Id: I1f717917269d116c07a6f17944583f5e8faf2932 Reviewed-on: http://gerrit.cloudera.org:8080/9438 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-24 03:59:18 +00:00
Tim Armstrong	24b4ed0b29	IMPALA-4835: Part 3: switch I/O buffers to buffer pool This is the final patch to switch the Disk I/O manager to allocate all buffers from the buffer pool and to reserve the buffers required for a query upfront. * The planner reserves enough memory to run a single scanner per scan node. * The multi-threaded scan node must increase reservation before spinning up more threads. * The scanner implementations must be careful to stay within their assigned reservation. The row-oriented scanners were most straightforward, since they only have a single scan range active at a time. A single I/O buffer is sufficient to scan the whole file but more I/O buffers can improve I/O throughput. Parquet is more complex because it issues a scan range per column and the sizes of the columns on disk are not known during planning. To deal with this, the reservation in the frontend is based on a heuristic involving the file size and # columns. The Parquet scanner can then divvy up reservation to columns based on the size of column data on disk. I adjusted how the 'mem_limit' is divided between buffer pool and non buffer pool memory for low mem_limits to account for the increase in buffer pool memory. Testing: * Added more planner tests to cover reservation calcs for scan node. * Test scanners for all file formats with the reservation denial debug action, to test behaviour when the scanners hit reservation limits. * Updated memory and buffer pool limits for tests. * Added unit tests for dividing reservation between columns in parquet, since the algorithm is non-trivial. Perf: I ran TPC-H and targeted perf locally comparing with master. Both showed small improvements of a few percent and no regressions of note. Cluster perf tests showed no significant change. 
Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786 Reviewed-on: http://gerrit.cloudera.org:8080/8966 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-23 04:17:41 +00:00
aphadke	7ce519f92b	IMPALA-6008: Creating a UDF from a shared library with a .ll extension crashes impala Impala crashes on creating a UDF from a shared library (.so file) which was renamed to have .ll extension. CreateFile() call in GetSymbols() fails and returns on error and does not close the codegen object. This patch closes the codegen object on failure. This avoids hitting a DCHECK later up in the stack. The chain of failures also invokes the DiagnosticHandlerFn. RuntimeState object is NULL when the DiagnosticHandlerFn gets called in this case. This change also adds a check before accessing it for logging. [localhost:21000] > create function foo4 (string, string) returns string location '/tmp/bad_udf.ll' symbol='MyAwesomeUdf'; Query: create function foo4 (string, string) returns string location '/tmp/bad_udf.ll' symbol='MyAwesomeUdf' ERROR: AnalysisException: Could not load binary: /tmp/bad_udf.ll LLVM diagnostic error: Invalid bitcode signature Change-Id: Id060668802ca9c80367cdc0e8a823b968d549bbb Reviewed-on: http://gerrit.cloudera.org:8080/9154 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-23 03:50:39 +00:00
Grant Henke	0c8eba076c	IMPALA-5752: Add support for DECIMAL on Kudu tables Adds support for the Kudu DECIMAL type introduced in Kudu 1.7.0. Note: Adding support for Kudu decimal min/max filters is tracked in IMPALA-6533. Tests: * Added Kudu create with decimal test to AnalyzeDDLTest.java * Added Kudu table_format to test_decimal_queries.py ** Both decimal.test and decimal-exprs.test workloads * Added decimal queries to the following Kudu workloads: kudu_create.test kudu_delete.test kudu_insert.test kudu_update.test ** kudu_upsert.test Change-Id: I3a9fe5acadc53ec198585d765a8cfb0abe56e199 Reviewed-on: http://gerrit.cloudera.org:8080/9368 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-23 00:03:54 +00:00
Csaba Ringhofer	5a1f432e81	IMPALA-4167: Support insert plan hints for CREATE TABLE AS SELECT This change adds support for "clustered", "noclustered", "shuffle" and "noshuffle" hints in CTAS statement. Example: create /* +clustered,noshuffle */ table t partitioned by (year, month) as select * from functional.alltypes The effect of these hints is the same as in insert statements: clustered: Sort locally by partition expression before insert to ensure that only one partition is written at a time. The goal is to reduce the number of files kept open / buffers kept in memory simultaneously. noclustered: Do not sort by primary key before insert to Kudu table. No effect on HDFS tables currently, as this is their default behavior. shuffle: Forces the planner to add an exchange node that repartitions by the partition expression of the output table. This means that a partition will be written only by a single node, which minimizes the global number of simultaneous writes. If only one partition is written (because all partitioning columns are constant or the target table is not partitioned), then the shuffle hint leads to a plan where all rows are merged at the coordinator where the table sink is executed. noshuffle: Do not add exchange node before insert to partitioned tables. The parser needed some modifications to be able to differentiate between CREATE statements that allow hints (CTAS), and CREATE statements that do not (every other type of CREATE statements). As a result, KW_CREATE was moved from tbl_def_without_col_defs to statement rules. Testing: The parser tests mirror the tests of INSERT, while analysis and planner tests are minimal, as the underlying logic is the same as for INSERT. Query tests are not created, as the hints have no effect on the DDL part of CTAS, and the actual query run is the same as in the insert case. 
Change-Id: I8d74bca999da8ae1bb89427c70841f33e3c56ab0 Reviewed-on: http://gerrit.cloudera.org:8080/8400 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-22 20:43:44 +00:00
Zoltan Borok-Nagy	881e00a8bf	IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN If the first number in a row group written by Impala is NaN, then Impala writes incorrect statistics in the metadata. This will result in incorrect results when filtering the data. This commit fixes the read path when encountering NaNs in Parquet min/max statistics. If min and max are both NaN, we can't use the statistics at all. If only one of them is NaN, the other still can be used. I added some tests to QueryTest/parquet-stats.test Change-Id: If3897fc1426541239223670812f59e2bed32f455 Reviewed-on: http://gerrit.cloudera.org:8080/9358 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-22 00:57:46 +00:00
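The read-side rule from the IMPALA-6538 commit above (ignore the statistics when min and max are both NaN, and keep using the other bound when only one is NaN) can be sketched as a row-group skipping check. This is a simplified illustration rather than the Impala scanner: the function name `can_skip_row_group` is hypothetical, and the predicate is assumed to be a closed range `[pred_lo, pred_hi]`.

```python
import math


def can_skip_row_group(stat_min, stat_max, pred_lo, pred_hi):
    """Return True if no value in [pred_lo, pred_hi] can be in the group.

    A NaN bound is treated as unknown, so it never justifies skipping.
    """
    lo = None if math.isnan(stat_min) else stat_min
    hi = None if math.isnan(stat_max) else stat_max
    if lo is not None and pred_hi < lo:
        return True   # the whole predicate range lies below the minimum
    if hi is not None and pred_lo > hi:
        return True   # the whole predicate range lies above the maximum
    return False      # unknown or overlapping: the group must be read
```

Returning `False` whenever a bound is unusable errs on the side of reading the row group, which costs I/O but cannot drop matching rows.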
Joe McDonnell	baec8cae34	IMPALA-4874: Increase maximum KRPC message size The default value for rpc_max_message_size is 50MB. Impala currently requires support for messages of up to 2GB. This changes the value of rpc_max_message_size to INT_MAX for Impala. Testing: - Added a test to test_very_large_strings that generates a row with multiple large strings. This row requires that the RPC framework successfully transmit over 400MB. This works for both KRPC and Thrift. This query operates under the same amount of memory as other queries in large_strings.test. - Tested separately that larger row sizes also work, including tests up to almost 2GB. Change-Id: I876bba0536e1d85e41eacd9c0aeccfe5c2126e58 Reviewed-on: http://gerrit.cloudera.org:8080/9337 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-21 03:17:57 +00:00
Michael Ho	62d8462e13	IMPALA-5518: Allocate KrpcDataStreamRecvr RowBatch tuples from BufferPool Previously, tuple pointers of a row batch were allocated from the heap via malloc() and tuple data was allocated from the MemPool associated with the RowBatch. This change converts the allocations of tuple pointers and tuple data to use BufferPool for row batches allocated from KrpcDataStreamRecvr. The primary motivation for this change is to take advantage of the fact that buffers allocated from BufferPool always go back to the per-core arena they came from when they are freed. This alleviates the TCMalloc imbalance between the RPC service threads and the fragment execution threads. As described in IMPALA-5518, row batches are always allocated from the service threads' TCMalloc cache and placed into the fragment execution threads' TCMalloc cache when they're freed. This leads to underflow and overflow in those threads' caches and high contention for the spinlock of the central free list. With BufferPool, the memory always goes back to its originating arena so this kind of imbalance is less likely to occur. This also dovetails with the long term plan to put most allocations under BufferPool and have each operator in the plan reserve an appropriate amount of memory before execution. Note that the proper reservation mechanism of the exchange node hasn't yet been implemented in this change so the buffer pool client handle used for allocating buffers has an ad-hoc set-up with no reservation limit and the root reservation tracker as parent. This needs to be fixed as part of IMPALA-6524. The default buffer pool limit is also bumped to 85% to account for the extra usage from the exchange nodes. The minimum buffer size is also lowered to 8KB to reduce the amount of memory wasted, as a row batch's tuple pointers / tuple data can sometimes be much smaller than 64KB. Testing done: Debug core build. 
Change-Id: If4b1a45f68b9df0d3b539511e15aff15700246f2 Reviewed-on: http://gerrit.cloudera.org:8080/9344 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-20 04:08:11 +00:00

1824 Commits