Commit Graph

1833 Commits

Author SHA1 Message Date
Thomas Tauber-Marshall
87be63e321 IMPALA-6821: Push down limits into Kudu
This patch takes advantage of a recent change in Kudu (KUDU-16) that
exposes the ability to set limits on KuduScanners. Since each
KuduScanner corresponds to a scan token, and there will be multiple
scan tokens per query, this is only a performance optimization for
cases where the limit is smaller than the number of rows per token;
Impala still needs to apply the limit on our side for cases where
the limit is greater than the number of rows per token.
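
For illustration, a query of this shape is the intended beneficiary
(tpch_kudu.lineitem is the Kudu table used in the perf numbers below;
any Kudu-backed table behaves the same way):

  -- The limit is pushed to each KuduScanner; the Kudu scan node still
  -- enforces it across all scan tokens.
  SELECT * FROM tpch_kudu.lineitem LIMIT 10;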

Testing:
- Added e2e tests for various situations where limits are applied at
  a Kudu scan node.
- For the query 'select * from tpch_kudu.lineitem limit 1', a best
  case perf scenario for this change where the limit is highly
  effective, the time spent in the Kudu scan node was reduced from
  6.107ms to 3.498ms (avg over 3 runs).
- For the query 'select count(*) from (select * from
  tpch_kudu.lineitem limit 1000000) v', a worst case perf scenario for
  this change where the limit is ineffective, the time spent in the
  Kudu scan node was essentially unchanged, 32.815ms previously vs.
  29.532ms (avg over 3 runs).

Change-Id: Ibe35e70065d8706b575e24fe20902cd405b49941
Reviewed-on: http://gerrit.cloudera.org:8080/10119
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 21:55:11 +00:00
Zoltan Borok-Nagy
1e79f14798 IMPALA-6314: Add run time scalar subquery check for uncorrelated subqueries
If a scalar subquery is used with a binary predicate or in an
arithmetic expression, it must return only one row/column to be
valid. If this cannot be guaranteed at parse time through a
single-row aggregate or a LIMIT clause, Impala fails the query.

E.g., currently the following query is not allowed:
SELECT bigint_col
FROM alltypesagg
WHERE id = (SELECT id FROM alltypesagg WHERE id = 1)

However, it would be allowed if the query contained
a LIMIT 1 clause, or instead of id it was max(id).
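
For reference, these variants were already accepted before this change,
because a single row is guaranteed at analysis time (same table as above):

  -- Single row guaranteed via LIMIT:
  SELECT bigint_col FROM alltypesagg
  WHERE id = (SELECT id FROM alltypesagg WHERE id = 1 LIMIT 1);

  -- Single row guaranteed via a single-row aggregate:
  SELECT bigint_col FROM alltypesagg
  WHERE id = (SELECT max(id) FROM alltypesagg WHERE id = 1);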

This commit makes the example valid by introducing a
runtime check to test if the subquery returns a single
row. If the subquery returns more than one row, it
aborts the query with an error.

I added a new node type, called CardinalityCheckNode. It
is created during planning on top of the subquery when
needed, then during execution it checks if its child
only returns a single row.

I extended the frontend tests and e2e tests as well.

Change-Id: I0f52b93a60eeacedd242a2f17fa6b99c4fc38e06
Reviewed-on: http://gerrit.cloudera.org:8080/9005
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-27 20:06:56 +00:00
Philip Zeyliger
2e6a63e31e IMPALA-6070: Further improvements to test-with-docker.
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory setting and parallelism, and trying to speed
things up.

Bug fixes:

* Embarrassingly, I was still skipping thrift-server-test in the backend
  tests. This was a mistake in handling feedback from my last review.

* I made the timeline a little bit taller to clip less.

Adding workloads:

* I added the RAT licensing check.

* I added exhaustive runs. This led me to model the suites a little
  bit more in Python, with a class representing a suite with a
  bunch of data about the suite. It's not perfect and still
  coupled with the entrypoint.sh shell script, but it feels
  workable. As part of adding exhaustive tests, I had
  to re-work the timeout handling, since now different
  suites meaningfully have different timeouts.

Speed ups:

* To speed up test runs, I added a mechanism to split py.test suites into
  multiple shards with a py.test argument. This involved a little bit of work in
  conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.

  Furthermore, I moved a bit more logic about managing the
  list of suites into Python.

* Doing the full build with "-notests" and only building
  the backend tests in the relevant target that needs them. This speeds
  up "docker commit" significantly by removing about 20GB from the
  container. I had to indicate that expr-codegen-test depends on
  expr-codegen-test-ir, which was missing.

* I sped up copying the Kudu data: previously I did
  both a move and a copy; now I do a move followed by a move. One
  of the moves is cross-filesystem and therefore slow, but this halves
  the amount of copying.

Memory usage:

* I tweaked the memlimit_gb settings to have a higher default. I've been
  fighting empirically to have the tests run well on c4.8xlarge and
  m4.10xlarge.

The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:

 * The non-first Impalad does very little JVM work, so having
   an Xmx keeps it small, even in the parallel tests.
 * Datanodes do work, but they essentially were never garbage
   collecting, because JVM defaults let them use up to 1/4th
   the machine memory. (I observed this based on RSS at the
   end of the run; nothing fancier.) Adding Xms/Xmx settings
   helped.
 * Similarly, I piped the settings through to HBase.

A few daemons still run without resource limitations, but they don't
seem to be a problem.

Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:47:29 +00:00
Zoltan Borok-Nagy
25422c74b2 IMPALA-6934: Wrong results with EXISTS subquery containing ORDER BY, LIMIT, and OFFSET
Queries may return wrong results if an EXISTS subquery has
an ORDER BY with a LIMIT and OFFSET clause. The EXISTS
subquery may incorrectly evaluate to TRUE even though it is
FALSE.
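
A sketch of the affected query shape (functional.alltypestiny is a small
test table, so the OFFSET skips every row the subquery could produce and
EXISTS should evaluate to FALSE):

  SELECT t.id
  FROM functional.alltypestiny t
  WHERE EXISTS (SELECT id FROM functional.alltypestiny
                ORDER BY id LIMIT 1 OFFSET 100);

Before the fix, the rewrite could drop the OFFSET, so the predicate
could incorrectly evaluate to TRUE.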

The bug was found during the code review of IMPALA-6314
(https://gerrit.cloudera.org/#/c/9005/). It turned out that
QueryStmt.setLimit() wipes the offset. I modified it to
keep the offset expr.

Added tests to 'PlannerTest/subquery-rewrite.test' and
'QueryTest/subquery.test'

Change-Id: I9693623d3d0a8446913261252f8e4a07935645e0
Reviewed-on: http://gerrit.cloudera.org:8080/10218
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:12:38 +00:00
Tim Armstrong
d879fa9930 IMPALA-6905: support regexes with more verifiers
Support row_regex and other lines for the subset and superset verifiers,
which previously assumed that lines in the actual and expected had to
match exactly.

Use in test_stats_extrapolation to make the test more robust to
irrelevant changes in the explain plan.

Testing:
Manually modified a superset and a subset test to check that tests fail
as expected.

Change-Id: Ia7a28d421c8e7cd84b14d07fcb71b76449156409
Reviewed-on: http://gerrit.cloudera.org:8080/10155
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 00:56:36 +00:00
Joe McDonnell
da363a99a4 IMPALA-6899: Optimize the HDFS commands used in dataload
HDFS commandline calls can be expensive due to JVM
startup and other costs. Since most HDFS commandline
calls can take multiple paths, one way to reduce
execution time is to consolidate multiple HDFS
commands into a single HDFS call. Since HDFS put
commands will follow symbolic links and can copy
recursively, this can allow for further consolidation
by creating the full directory structure and
copying it in a single HDFS call.

This does several of these optimizations throughout
the dataload codepath. It saves a few seconds here
and there:
Loading Hive Builtins: 1:10 -> 0:30
Loading custom schemas: 0:35 -> 0:20
Loading Hive UDFs: 0:45 -> 0:25

Change-Id: I0934353329dc7312394fc4457ab8db2a272c6282
Reviewed-on: http://gerrit.cloudera.org:8080/10120
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-23 21:29:59 +00:00
Joe McDonnell
5bc5279b07 IMPALA-6898: Avoid duplicate Kudu load during full dataload
testdata/bin/create-load-data.sh does bin/load-data.py for
functional/exhaustive, tpch/core, and tpcds/core in a
first phase, then it loads functional and tpch for Kudu
in a second phase. For a full dataload, this second phase
is not necessary. functional/exhaustive and tpch/core
already include Kudu.

This avoids the second phase when doing a full dataload.
The second phase is still necessary when loading from
a snapshot, and this does not change that behavior.

This saves a couple minutes off of full dataload.

Change-Id: Ic023d230f99126ed37795106c38faae5f0cb608e
Reviewed-on: http://gerrit.cloudera.org:8080/10128
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-21 01:08:50 +00:00
Tianyi Wang
8e86678d65 IMPALA-5690: Part 1: Rename ostream operators for thrift types
Thrift 0.9.3 implements "ostream& operator<<(ostream&, T)" for thrift
data types, while Impala did the same for enums and special types
including TNetworkAddress and TUniqueId. To prepare for the upgrade to
thrift 0.9.3, this patch renames these Impala-defined functions. In the
absence of operator<<, assertion macros like DCHECK_EQ can no longer be
used on non-enum thrift-defined types.

Change-Id: I9c303997411237e988ef960157f781776f6fcb60
Reviewed-on: http://gerrit.cloudera.org:8080/9168
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-20 10:28:12 +00:00
Thomas Tauber-Marshall
b68e06997c IMPALA-6880: disable flaky bloom filter test
This test is made flaky by IMPALA-6338. While that is being worked on,
temporarily disable this test.

Change-Id: I595645b0f2875614294adc7abb4572aec1be8ad5
Reviewed-on: http://gerrit.cloudera.org:8080/10122
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-19 23:43:19 +00:00
Tim Armstrong
3ebf30a2a4 IMPALA-6847: work around high memory estimates for AC
Adds MAX_MEM_ESTIMATE_FOR_ADMISSION query option, which takes
effect if and only if
* Memory-based admission control is enabled for the pool
* No mem_limit is set (i.e. best practices are not being followed)

In that case min(MAX_MEM_ESTIMATE_FOR_ADMISSION, mem_estimate)
is used for admission control instead of mem_estimate.
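
A usage sketch (the 2gb value and the query are arbitrary, and the
assumption is that the option accepts the same memory-size syntax as
MEM_LIMIT):

  SET MAX_MEM_ESTIMATE_FOR_ADMISSION=2gb;
  -- Admission control now uses min(2GB, planner estimate) instead of
  -- the raw planner estimate.
  SELECT count(*) FROM tpch.lineitem;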

This provides a way to override the planner's estimate if
it happens to be incorrect and is preventing the query from
running. Setting MEM_LIMIT is usually a better alternative
but sometimes it is not feasible to set MEM_LIMIT for each
individual query.

Testing:
Added an admission control test to verify that query option allows
queries with high estimates to run.

Also tested manually on a minicluster started with:

  start-impala-cluster.py --impalad_args='-vmodule admission-controller=3 \
      -default_pool_mem_limit 12884901888'

Change-Id: Ia5fc32a507ad0f00f564dfe4f954a829ac55d14e
Reviewed-on: http://gerrit.cloudera.org:8080/10058
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-18 01:18:20 +00:00
Fredy Wijaya
51cf5b27fc IMPALA-6850: Print actual error message on Sentry error
The patch redirects the output of Sentry to
$IMPALA_CLUSTER_LOGS_DIR/sentry/sentry.out to follow the
same convention as other service output logs.

Testing:
- Injected some failure in run-sentry-service.sh script to see if the
  error message was captured

Change-Id: I76627bb5b986a548ec6e4f12b555bd6fc8c4dab8
Reviewed-on: http://gerrit.cloudera.org:8080/10064
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-14 01:41:38 +00:00
Joe McDonnell
d481cd4842 IMPALA-6372: Go parallel for Hive dataload
This changes generate-schema-statements.py to produce
separate SQL files for different file formats for Hive.
This changes load-data.py to go parallel on these
separate Hive SQL files. For correctness, the text
version of all tables must be loaded before any
of the other file formats.

load-data.py runs DDLs to create the tables in Impala
and goes parallel. Currently, there are some minor
dependencies so that text tables must be created
prior to creating the other table formats. This
changes the definitions of some tables in
testdata/datasets/functional/functional_schema_template.sql
to remove these dependencies. Now, the DDLs for the
text tables can run in parallel to the other file formats.

To unify the parallelism for Impala and Hive, load-data.py
now uses a single fixed-size pool of processes to run all
SQL files rather than spawning a thread per SQL file.

This also modifies the locations that issue an invalidate to
use refresh where possible, eliminating global
invalidates.

For debuggability, different SQL executions output to
different log files rather than to standard out. If an
error occurs, this will point out the relevant log
file.

This saves about 10-15 minutes on dataload (including
for GVO).

Change-Id: I34b71e6df3c8f23a5a31451280e35f4dc015a2fd
Reviewed-on: http://gerrit.cloudera.org:8080/8894
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-14 00:16:26 +00:00
Tianyi Wang
9a751f00b8 IMPALA-6822: Add a query option to control shuffling by distinct exprs
IMPALA-4794 changed the distinct aggregation behavior to shuffling by
both grouping exprs and the distinct expr. It's slower in queries
where the NDVs of grouping exprs are high and data are uniformly
distributed among groups. This patch adds a query option controlling
this behavior, letting users switch to the old plan.
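
A sketch of how a user would opt back into the old plan; the option name
and value below are assumptions, since they are not spelled out in this
message:

  SET SHUFFLE_DISTINCT_EXPRS=false;  -- assumed name/value; falls back to
                                     -- the pre-IMPALA-4794 plan
  SELECT year, count(DISTINCT id) FROM functional.alltypes GROUP BY year;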

Change-Id: Icb4b4576fb29edd62cf4b4ba0719c0e0a2a5a8dc
Reviewed-on: http://gerrit.cloudera.org:8080/9949
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-12 22:01:35 +00:00
stiga-huang
818cd8fa27 IMPALA-5717: Support for reading ORC data files
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing to ORC
tables is not supported yet.
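
A minimal usage sketch (since writes are not supported, the data files
must be written by Hive or another engine; the table name and location
are placeholders):

  CREATE EXTERNAL TABLE orc_events (id BIGINT, name STRING)
  STORED AS ORC
  LOCATION '/warehouse/orc_events';         -- placeholder path
  SELECT count(*) FROM orc_events;
  -- With --enable_orc_scanner=false the SELECT would fail instead.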

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 05:13:02 +00:00
Zoltan Borok-Nagy
2ee914d5b3 IMPALA-5903: Inconsistent specification of result set and result set metadata
Before this commit it was quite random which DDL operations
returned a result set and which didn't.

With this commit, every DDL operation returns a summary of
its execution. The operations declare their result set schema in
Frontend.java and provide the summary in CatalogOpExecutor.java.
Updated the tests according to the new behavior.

Change-Id: Ic542fb8e49e850052416ac663ee329ee3974e3b9
Reviewed-on: http://gerrit.cloudera.org:8080/9090
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 02:21:48 +00:00
Philip Zeyliger
6dc13d933b Remove Yarn from minicluster by default. (2nd try)
Remove Yarn from minicluster by default.

Turns out that we start Yarn as part of the minicluster, but we never use it.
(HiveServer2 is configured to run MR jobs "locally" in process.) Likely, this
Yarn integration is a vestige of Yarn/Llama integration.  We can save memory by
not starting it by default.

There are some less-common tools like tests/comparison/cluster.py which use
Yarn (and Hadoop Streaming). In deference to those tools, I've left a mechanism
to start Yarn rather than excising it altogether. After running
buildall the regular way, add Yarn to the cluster by running:
  testdata/cluster/admin -y start_cluster

I tested by running core tests. I did not test the kerberized minicluster.

[Due to a git mishap, a version of this was previously checked in and reverted.]

Change-Id: I97053a44bbe32048e6c35cc28680d1c7696af13f
Reviewed-on: http://gerrit.cloudera.org:8080/9970
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-10 20:15:30 +00:00
Philip Zeyliger
01b6995abf Revert "Remove Yarn from minicluster by default."
This reverts commit c05df104570fa2cb7067599bbe3b87740ca9f09e.

Change-Id: I00151795581d22a9852cceaca1d21325d68dbe59
Reviewed-on: http://gerrit.cloudera.org:8080/9969
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Philip Zeyliger <philip@cloudera.com>
2018-04-10 16:21:09 +00:00
Philip Zeyliger
942781d80f Remove Yarn from minicluster by default.
Turns out that we start Yarn as part of the minicluster, but we never use it.
(HiveServer2 is configured to run MR jobs "locally" in process.) Likely, this
Yarn integration is a vestige of Yarn/Llama integration.  We can save memory by
not starting it by default.

There are some less-common tools like tests/comparison/cluster.py which use
Yarn (and Hadoop Streaming). In deference to those tools, I've left a mechanism
to start Yarn rather than excising it altogether. After running
buildall the regular way, add Yarn to the cluster by running:
  testdata/cluster/admin -y start_cluster

I tested by running core tests. I did not test the kerberized minicluster.

Change-Id: I5504cc40b89e3c6d53fac0b7aa4b395fa63e8d79
2018-04-10 09:17:28 -07:00
Tim Armstrong
2995be8238 IMPALA-5607: part 1: breaking extract/date_part changes
This is the compatibility-breaking part of Jinchul Kim's change
to add additional units. To support nanoseconds we need to
widen the output type of these functions. We also change
the meaning of "milliseconds" to include the seconds component.
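
A sketch of the new semantics (the expected value is derived from the
description above, not quoted from this change):

  SELECT extract(cast('2018-04-10 10:00:02.123' AS TIMESTAMP), 'millisecond');
  -- "millisecond" now includes the seconds component, so this yields 2123
  -- (2 * 1000 + 123) rather than 123, and the output type is widened.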

Cherry-picks: not for 2.x

Change-Id: I42d83712d9bb3a4900bec38a9c009dcf2a1fe019
Reviewed-on: http://gerrit.cloudera.org:8080/9957
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-10 04:00:37 +00:00
Thomas Tauber-Marshall
d437f956ca IMPALA-6338: Disable flaky bloom filter test
The underlying issue in IMPALA-6338 causes successful queries that are
cancelled internally (because all results have been returned) to, in
rare cases, have info missing from the profile. This has caused flaky
tests but has low impact on users, and unfortunately, with the current
query lifecycle logic in the coordinator, there is no simple solution.

There is ongoing work to improve query lifecycle logic in the
coordinator holistically, see IMPALA-5384. This work will eventually
address the underlying cause of IMPALA-6338. Until then, we disable
the tests that have been flaky.

Change-Id: Ie30b88fb8fb7780fc3a7153c05fdc3606145ce35
Reviewed-on: http://gerrit.cloudera.org:8080/9822
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-09 21:52:57 +00:00
Philip Zeyliger
2896b8d127 IMPALA-6070: Expose using Docker to run tests faster.
Allows running the tests that make up the "core" suite in about 2 hours.
By comparison, https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/buildTimeTrend
tends to run in about 3.5 hours.

This commit:
* Adds "echo" statements in a few places, to facilitate timing.
* Adds --skip-parallel/--skip-serial flags to run-tests.py,
  and exposes them in run-all-tests.sh.
* Marks TestRuntimeFilters as a serial test. This test runs
  queries that need > 1GB of memory, and, combined with
  other tests running in parallel, can kill the parallel test
  suite.
* Adds "test-with-docker.py", which runs a full build, data load,
  and executes tests inside of Docker containers, generating
  a timeline at the end. In short, one container is used
  to do the build and data load, and then this container is
  re-used to run various tests in parallel. All logs are
  left on the host system.

Besides the obvious win of getting test results more quickly, this
commit serves as an example of how to get various bits of Impala
development working inside of Docker containers. For example, Kudu
relies on atomic rename of directories, which isn't available in most
Docker filesystems, and entrypoint.sh works around it.

In addition, the timeline generated by the build suggests where further
optimizations can be made. Most obviously, dataload eats up a precious
~30-50 minutes, on a largely idle machine.

This work is significantly CPU and memory hungry. It was developed on a
32-core, 120GB RAM Google Compute Engine machine. I've worked out
parallelism configurations such that it runs nicely on 60GB of RAM
(c4.8xlarge) and over 100GB (e.g., m4.10xlarge, which has 160GB). There is
some simple logic to guess at some knobs, and there are knobs.  By and
large, EC2 and GCE price machines linearly, so, if CPU usage can be kept
up, it's not wasteful to run on bigger machines.

Change-Id: I82052ef31979564968effef13a3c6af0d5c62767
Reviewed-on: http://gerrit.cloudera.org:8080/9085
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-06 06:40:07 +00:00
Philip Zeyliger
8e5f923158 Loosen hive-exec.jar glob pattern in copy-udfs-udas.sh.
This commit slightly loosens the coupling between IMPALA_HIVE_VERSION
and "hive.version" in the Maven sense.

Cherry-picks: not for 2.x

Change-Id: Ifbe6f5208b4ad0ffc9cbfe4e93d712ce698beb23
Reviewed-on: http://gerrit.cloudera.org:8080/9925
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-05 06:54:53 +00:00
Bikramjeet Vig
75e1bd1bcd IMPALA-6771: Fix in-predicate set up bug
Fixes a bug where default-initialized values were introduced into the set
data structure used to check for set membership, which could cause wrong
results.

Testing:
Added a test case that checks for the same.

Change-Id: I7e776dbcb7ee4a9b64e1295134a27d332f5415b6
Reviewed-on: http://gerrit.cloudera.org:8080/9891
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-04 21:51:29 +00:00
Fredy Wijaya
8173e9ab4d IMPALA-6571: NullPointerException in SHOW CREATE TABLE for HBase tables
This patch fixes the NullPointerException in SHOW CREATE TABLE for HBase
tables.
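
An example of the statement that used to hit the NullPointerException
(assuming functional_hbase.alltypes from the test warehouse; any
HBase-backed table applies):

  SHOW CREATE TABLE functional_hbase.alltypes;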

Testing:
- Moved the content of hbase-show-create-table.test back to
  show-create-table.test
- Ran show-create-table end-to-end tests

Change-Id: Ibe018313168fac5dcbd80be9a8f28b71a2c0389b
Reviewed-on: http://gerrit.cloudera.org:8080/9884
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-04 00:12:30 +00:00
Philip Zeyliger
6b18b00310 IMPALA-6776: Increase region move timeout.
Some builds are experiencing slow HBase region moves in the
test minicluster. Trying to increase the timeout from 10s to 60s.

Change-Id: Ic62719f1b1aad463bcdc18d0803e780ebb0f8b18
Reviewed-on: http://gerrit.cloudera.org:8080/9892
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-03 21:41:56 +00:00
Fredy Wijaya
08d386f0fc IMPALA-6724: Allow creating/dropping functions with the same name as built-ins
This patch removes the restriction on creating a function with the same
name as a built-in function. The reason for lifting the restriction is
to avoid a name clash when introducing new built-in functions. The patch
also fixes some inconsistent behavior when creating or dropping a
function, depending on whether the specified name is fully qualified.

Refer to the tables below for more information.

Create function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function created              | Function created              |
| Yes     | No          | Different than built-in | Function created              | Function created              |
| No      | Yes         | Same as built-in        | Same name exception           | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Same name exception           | Function created              |
| No      | No          | Different than built-in | Function created              | Function created              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Drop function:
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| FQ Name | Built-in DB | Function Name           | Existing Behavior             | New Behavior                  |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+
| Yes     | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| Yes     | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| Yes     | No          | Same as built-in        | Function dropped              | Function dropped              |
| Yes     | No          | Different than built-in | Function dropped              | Function dropped              |
| No      | Yes         | Same as built-in        | Cannot modify system database | Cannot modify system database |
| No      | Yes         | Different than built-in | Cannot modify system database | Cannot modify system database |
| No      | No          | Same as built-in        | Cannot modify system database | Function dropped              |
| No      | No          | Different than built-in | Function dropped              | Function dropped              |
+---------+-------------+-------------------------+-------------------------------+-------------------------------+

Select function (no new behavior):
+---------+-------------+-------------------------+--------------------------------------------------------+
| FQ Name | Built-in DB | Function Name           | Behavior                                               |
+---------+-------------+-------------------------+--------------------------------------------------------+
| Yes     | Yes         | Same as built-in        | Function in the specified database (built-in) executed |
| Yes     | Yes         | Different than built-in | Unknown function exception                             |
| Yes     | No          | Same as built-in        | Function in the specified database executed            |
| Yes     | No          | Different than built-in | Function in the specified database executed            |
| No      | Yes         | Same as built-in        | Built-in function executed                             |
| No      | Yes         | Different than built-in | Unknown function exception                             |
| No      | No          | Same as built-in        | Built-in function executed                             |
| No      | No          | Different than built-in | Function in the current database executed              |
+---------+-------------+-------------------------+--------------------------------------------------------+
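
A sketch of the "No | No | Same as built-in" row of the create table above:
a function whose name collides with a built-in can now be created in a user
database (the database, library path, and symbol are placeholders):

  USE udf_test;                                  -- any non-system database
  CREATE FUNCTION concat(STRING, STRING) RETURNS STRING
  LOCATION '/test-warehouse/libudfsample.so'     -- placeholder path
  SYMBOL='MyConcat';                             -- placeholder symbol
  -- Per the select table, an unqualified concat(...) still resolves to
  -- the built-in; qualify it as udf_test.concat(...) to call the new one.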

Testing:
- Ran front-end tests
- Added end-to-end DDL function tests

Cherry-picks: not for 2.x

Change-Id: Ic30df56ac276970116715c14454a5a2477b185fa
Reviewed-on: http://gerrit.cloudera.org:8080/9800
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 21:12:31 +00:00
Fredy Wijaya
c3ab27681f IMPALA-6739: Exception in ALTER TABLE SET statements
The patch fixes issues with executing ALTER TABLE SET statements when
there are no matching partitions.

The patch also removes the incorrect precondition, i.e.
(partitionSet == null || !partitionSet.isEmpty()), in ALTER TABLE SET
statements: a partitionSet can be null when PARTITION is not
specified in the ALTER TABLE SET statement, and it can be
empty when there is no matching partition. For example:

Matching partitions (partitionSet != null && !partitionSet.isEmpty()):
> alter table functional.alltypesagg partition(year=2009, month=1)
  set fileformat parquet;

No matching partitions (partitionSet != null && partitionSet.isEmpty()):
> alter table functional.alltypesagg partition(year=2009, month=13)
  set fileformat parquet;

No partition specified (partitionSet == null):
> alter table functional.alltypesagg set fileformat parquet;

Testing:
- Added a new test
- Ran all front-end tests

Change-Id: I793e827d5cf5b7986bd150dd9706df58da3417f3
Reviewed-on: http://gerrit.cloudera.org:8080/9819
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 21:05:40 +00:00
Thomas Tauber-Marshall
832974383c IMPALA-6445: Test for kudu master address with whitespace
A concern was brought up that Impala might not handle kudu master
addresses containing whitespace correctly. Turns out that the Kudu
client takes care of stripping whitespace, so it works, but it would
be good to have a test to ensure it continues to work.

Change-Id: I1857b8dbcb5af66d69f7620368cd3b9b85ae7576
Reviewed-on: http://gerrit.cloudera.org:8080/9876
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-04-02 20:29:51 +00:00
Bikramjeet Vig
4a39e7c29f IMPALA-5980: Upgrade to LLVM 5.0.1
Highlighting a few changes in LLVM:
- Minor changes to some function signatures
- Minor changes to error handling
- Split Bitcode/ReaderWriter.h - https://reviews.llvm.org/D26502
- Introduced an optional new GVN optimization pass.

Needed to fix a bunch of new clang-tidy warnings.

Testing:
Ran core and ASAN tests successfully.

Performance:
Ran single node TPC-H and targeted perf with scale factor 60. Both
improved on average. Identified regression in
"primitive_filter_in_predicate" which will be addressed by IMPALA-6621.

+-------------------+-----------------------+---------+------------+------------+----------------+
| Workload          | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+-------------------+-----------------------+---------+------------+------------+----------------+
| TARGETED-PERF(60) | parquet / none / none | 22.29   | -0.12%     | 3.90       | +3.16%         |
| TPCH(60)          | parquet / none / none | 15.97   | -3.64%     | 10.14      | -4.92%         |
+-------------------+-----------------------+---------+------------+------------+----------------+

+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| Workload          | Query                                                  | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Num Clients | Iters |
+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+
| TARGETED-PERF(60) | PERF_LIMIT-Q1                                          | parquet / none / none | 0.01   | 0.00        | R +156.43% | * 25.80% * | * 17.14% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_in_predicate                          | parquet / none / none | 3.39   | 1.92        | R +76.33%  |   3.23%    |   4.37%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_non_selective                  | parquet / none / none | 1.25   | 1.11        |   +12.46%  |   3.41%    |   5.36%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_decimal_selective                     | parquet / none / none | 1.40   | 1.25        |   +12.25%  |   3.57%    |   3.44%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_like                           | parquet / none / none | 16.87  | 15.65       |   +7.78%   |   5.05%    |   0.37%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_min_max_runtime_filter                       | parquet / none / none | 1.79   | 1.71        |   +4.77%   |   0.71%    |   1.73%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_2                             | parquet / none / none | 0.60   | 0.58        |   +3.64%   |   3.19%    |   3.81%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_string_selective                      | parquet / none / none | 0.95   | 0.93        |   +2.91%   |   5.23%    |   5.85%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_3                             | parquet / none / none | 4.33   | 4.21        |   +2.83%   |   5.46%    |   3.25%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_lowndv                        | parquet / none / none | 4.59   | 4.47        |   +2.82%   |   3.73%    |   1.14%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_3                          | parquet / none / none | 0.20   | 0.19        |   +2.65%   |   4.76%    |   2.24%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q1                                            | parquet / none / none | 2.49   | 2.43        |   +2.31%   |   1.06%    |   1.93%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q6                                            | parquet / none / none | 2.04   | 2.00        |   +2.09%   |   3.51%    |   2.80%        | 1           | 5     |
| TPCH(60)          | TPCH-Q3                                                | parquet / none / none | 12.37  | 12.17       |   +1.62%   |   0.80%    |   2.45%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q5                                         | parquet / none / none | 4.52   | 4.45        |   +1.54%   |   1.23%    |   1.08%        | 1           | 5     |
| TPCH(60)          | TPCH-Q6                                                | parquet / none / none | 2.95   | 2.91        |   +1.33%   |   1.92%    |   1.67%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q4                                         | parquet / none / none | 3.71   | 3.66        |   +1.26%   |   0.34%    |   0.53%        | 1           | 5     |
| TPCH(60)          | TPCH-Q1                                                | parquet / none / none | 18.69  | 18.47       |   +1.19%   |   0.75%    |   0.31%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q7                                         | parquet / none / none | 8.15   | 8.07        |   +0.99%   |   3.92%    |   1.58%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_decimal_highndv                      | parquet / none / none | 31.31  | 31.01       |   +0.97%   |   1.74%    |   1.14%        | 1           | 5     |
| TPCH(60)          | TPCH-Q5                                                | parquet / none / none | 7.59   | 7.53        |   +0.78%   |   0.38%    |   0.99%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q4                                            | parquet / none / none | 21.25  | 21.09       |   +0.76%   |   0.76%    |   0.75%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_4                          | parquet / none / none | 0.24   | 0.24        |   +0.75%   |   3.14%    |   4.76%        | 1           | 5     |
| TPCH(60)          | TPCH-Q19                                               | parquet / none / none | 7.88   | 7.82        |   +0.74%   |   2.39%    |   2.64%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_orderby_bigint                               | parquet / none / none | 5.10   | 5.07        |   +0.61%   |   0.74%    |   0.54%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q3                                         | parquet / none / none | 3.61   | 3.59        |   +0.60%   |   1.45%    |   0.90%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_orderby_all                                  | parquet / none / none | 27.63  | 27.48       |   +0.55%   |   0.85%    |   0.10%        | 1           | 5     |
| TPCH(60)          | TPCH-Q4                                                | parquet / none / none | 5.81   | 5.79        |   +0.45%   |   1.65%    |   2.16%        | 1           | 5     |
| TPCH(60)          | TPCH-Q13                                               | parquet / none / none | 23.49  | 23.43       |   +0.27%   |   0.83%    |   0.63%        | 1           | 5     |
| TPCH(60)          | TPCH-Q21                                               | parquet / none / none | 68.88  | 68.76       |   +0.18%   |   0.22%    |   0.19%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_decimal_lowndv.test                  | parquet / none / none | 4.38   | 4.37        |   +0.09%   |   2.45%    |   0.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_5                          | parquet / none / none | 10.40  | 10.40       |   +0.07%   |   0.77%    |   0.50%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_long_predicate                               | parquet / none / none | 222.37 | 222.23      |   +0.06%   |   0.25%    |   0.25%        | 1           | 5     |
| TPCH(60)          | TPCH-Q8                                                | parquet / none / none | 10.65  | 10.65       |   +0.03%   |   0.55%    |   1.40%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_shuffle_join_one_to_many_string_with_groupby | parquet / none / none | 261.84 | 261.87      |   -0.01%   |   0.91%    |   0.74%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q3                                            | parquet / none / none | 9.44   | 9.45        |   -0.02%   |   0.92%    |   1.33%        | 1           | 5     |
| TPCH(60)          | TPCH-Q16                                               | parquet / none / none | 5.21   | 5.21        |   -0.02%   |   1.46%    |   1.64%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_top-n_all                                    | parquet / none / none | 34.58  | 34.62       |   -0.11%   |   0.22%    |   0.19%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_topn_bigint                                  | parquet / none / none | 4.24   | 4.25        |   -0.13%   |   6.66%    |   2.03%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q2                                         | parquet / none / none | 3.23   | 3.24        |   -0.34%   |   2.03%    |   0.32%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_broadcast_join_1                             | parquet / none / none | 0.18   | 0.18        |   -0.40%   |   6.16%    |   2.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_exchange_broadcast                           | parquet / none / none | 46.27  | 46.51       |   -0.52%   |   7.83%    | * 15.60% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_pk                            | parquet / none / none | 114.32 | 114.92      |   -0.52%   |   0.24%    |   0.61%        | 1           | 5     |
| TPCH(60)          | TPCH-Q22                                               | parquet / none / none | 6.66   | 6.70        |   -0.53%   |   1.39%    |   0.84%        | 1           | 5     |
| TPCH(60)          | TPCH-Q20                                               | parquet / none / none | 5.78   | 5.81        |   -0.62%   |   1.25%    |   0.67%        | 1           | 5     |
| TPCH(60)          | TPCH-Q2                                                | parquet / none / none | 2.53   | 2.55        |   -0.64%   |   3.86%    |   3.72%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q5                                            | parquet / none / none | 0.58   | 0.58        |   -0.75%   |   0.99%    |   6.89%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q7                                            | parquet / none / none | 2.05   | 2.07        |   -0.86%   |   2.16%    |   4.73%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_shuffle_join_union_all_with_groupby          | parquet / none / none | 54.86  | 55.34       |   -0.87%   |   0.25%    |   0.66%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_2                          | parquet / none / none | 7.52   | 7.59        |   -0.98%   |   1.53%    |   1.73%        | 1           | 5     |
| TPCH(60)          | TPCH-Q9                                                | parquet / none / none | 36.43  | 36.79       |   -1.00%   |   1.60%    |   7.39%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q1                                         | parquet / none / none | 2.79   | 2.82        |   -1.10%   |   1.15%    |   2.25%        | 1           | 5     |
| TPCH(60)          | TPCH-Q11                                               | parquet / none / none | 1.95   | 1.97        |   -1.18%   |   3.14%    |   2.24%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_AGG-Q2                                            | parquet / none / none | 10.98  | 11.11       |   -1.24%   |   0.77%    |   1.45%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_small_join_1                                 | parquet / none / none | 0.22   | 0.22        |   -1.34%   | * 13.03% * | * 12.31% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q7                                                | parquet / none / none | 42.82  | 43.41       |   -1.37%   |   1.63%    |   1.51%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_empty_build_join_1                           | parquet / none / none | 3.30   | 3.35        |   -1.54%   |   2.15%    |   1.27%        | 1           | 5     |
| TARGETED-PERF(60) | PERF_STRING-Q6                                         | parquet / none / none | 10.34  | 10.54       |   -1.81%   |   0.24%    |   2.02%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_groupby_bigint_highndv                       | parquet / none / none | 32.80  | 33.46       |   -1.98%   |   1.29%    |   0.61%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_decimal_non_selective                 | parquet / none / none | 1.62   | 1.67        |   -3.01%   |   0.79%    |   1.65%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_conjunct_ordering_1                          | parquet / none / none | 0.13   | 0.14        |   -3.36%   |   8.66%    | * 12.66% *     | 1           | 5     |
| TARGETED-PERF(60) | primitive_exchange_shuffle                             | parquet / none / none | 84.92  | 87.96       |   -3.46%   |   1.46%    |   1.50%        | 1           | 5     |
| TPCH(60)          | TPCH-Q12                                               | parquet / none / none | 6.98   | 7.31        |   -4.57%   |   1.03%    |   7.13%        | 1           | 5     |
| TPCH(60)          | TPCH-Q18                                               | parquet / none / none | 47.54  | 50.39       |   -5.64%   |   5.70%    |   5.53%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_bigint_non_selective                  | parquet / none / none | 0.88   | 0.96        |   -7.81%   |   4.27%    |   5.97%        | 1           | 5     |
| TPCH(60)          | TPCH-Q15                                               | parquet / none / none | 8.14   | 9.15        |   -11.09%  |   0.63%    | * 10.44% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q10                                               | parquet / none / none | 12.66  | 14.28       |   -11.34%  |   4.32%    |   1.14%        | 1           | 5     |
| TPCH(60)          | TPCH-Q17                                               | parquet / none / none | 10.31  | 12.59       |   -18.14%  |   0.65%    |   3.72%        | 1           | 5     |
| TARGETED-PERF(60) | primitive_filter_bigint_selective                      | parquet / none / none | 0.14   | 0.19        | I -27.60%  | * 32.55% * | * 39.78% *     | 1           | 5     |
| TPCH(60)          | TPCH-Q14                                               | parquet / none / none | 6.10   | 11.00       | I -44.55%  |   4.06%    |   3.84%        | 1           | 5     |
+-------------------+--------------------------------------------------------+-----------------------+--------+-------------+------------+------------+----------------+-------------+-------+

Change-Id: Ib0a15cb53feab89e7b35a56b67b3b30eb3e62c6b
Reviewed-on: http://gerrit.cloudera.org:8080/9584
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-28 04:25:27 +00:00
Taras Bobrovytsky
8fec1911e5 IMPALA-6230, IMPALA-6468: Fix the output type of round() and related fns
Before this patch, the output type of round(), ceil(), floor(), and
trunc() was not always the same as the input type. It was also
inconsistent in general. For example, round(double) returned an
integer, but round(double, int) returned a double.

After looking at other database systems, we decided that the guideline
should be that the output type should be the same as the input type. In
this patch, we change the behavior of the previously mentioned functions
so that if a double is given then a double is returned.

We also modify the rounding behavior to always round away from zero.
Before, we were rounding towards positive infinity in some cases.
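
A few examples of the new behavior (the results follow from the rules
above and are not quoted from this change):

  SELECT round(cast(1.2345 AS DOUBLE), 2);  -- DOUBLE in, DOUBLE out: 1.23
  SELECT round(cast(-2.5 AS DOUBLE));       -- away from zero: -3 (as DOUBLE)
  SELECT floor(cast(3.7 AS DOUBLE));        -- now a DOUBLE, not an integer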

Testing:
- Updated tests
- Ran an exhaustive build which passed.

Cherry-picks: not for 2.x

Change-Id: I77541678012edab70b182378b11ca8753be53f97
Reviewed-on: http://gerrit.cloudera.org:8080/9346
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-24 04:43:01 +00:00
Vuk Ercegovac
2894884deb IMPALA-6670: refresh lib-cache entries from plan
When an impalad is in executor-only mode, it receives no
catalog updates. As a result, lib-cache entries are never
refreshed. A consequence is that udf queries can return
incorrect results or may not run due to resolution issues.
Both cases are caused by the executor using a stale copy
of the lib file. For incorrect results, an old version of
the method may be used. Resolution issues can come up if
a method is added to a lib file.

The solution in this change is to capture the coordinator's
view of the lib file's last modified time when planning.
This last modified time is then shipped with the plan to
executors. Executors must then use both the lib file path
and the last modified time as a key for the lib-cache.
If the coordinator's last modified time is more recent than
the executor's lib-cache entry, then the entry is refreshed.

Brief discussion of alternatives:

- lib-cache always checks last modified time
  + easy/local change to lib-cache
  - adds an fs lookup always. rejected for this reason

- keep the last modified time in the catalog
  - bound on staleness is too loose. consider the case where
    fn's f1, f2, f3 are created with last modified times of
    t1, t2, t3. treat the fn's last modified time as a low-watermark;
    if the cache entry has a more recent time, use it. Such a scheme
    would allow the version at t2 to persist. An old fn may keep the
    state from converging to the latest. This could end up with strange
    cases where different versions of the lib are used across executors
    for a single query.

    In contrast, the change in this patch relies on the statestore to
    push versions forward at all coordinators, so it will push all
    versions at all caches forward as well.

Testing:
- added an e2e custom cluster test

Change-Id: Icf740ea8c6a47e671427d30b4d139cb8507b7ff6
Reviewed-on: http://gerrit.cloudera.org:8080/9697
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-24 04:38:53 +00:00
Philip Zeyliger
783de170c9 IMPALA-4277: Support multiple versions of Hadoop ecosystem
Adds support for building against two sets of Hadoop ecosystem
components. The control variable is IMPALA_MINICLUSTER_PROFILE_OVERRIDE,
which can either be set to 2 (for Hadoop 2, Hive 1, and so on) or 3 (for
Hadoop 3, Hive 2, and so on).

We intend (in a trivial follow-on change soon) to make 3 the new default
and to explicitly deprecate 2, but this change does not switch the
default yet. We support both to facilitate a smoother transition, but
support for profile 2 will be removed soon in the Impala 3.x line.

The switch is done at build time, following the pattern from IMPALA-5184
(build fe against both Hive 1 & 2 APIs). Switching back and forth
requires running 'cmake' again. Doing this at build-time avoids
complicating the Java code with classloader configuration.

There are relatively few incompatible APIs. This implementation
encapsulates that by extracting some Java code into
fe/src/compat-minicluster-profile-{2,3}. (This follows the
pattern established by IMPALA-5184 (build fe against both Hive 1 & 2
APIs), but, to avoid a proliferation of directories, I've moved the
Hive files into the same tree and consolidated the Hive changes into
the same directory structure.)

For Maven, I introduced Maven "profiles" to handle the two cases where
the dependencies (and exclusions) differ. These are driven by the
$IMPALA_MINICLUSTER_PROFILE environment variable.

For Sentry, exception class names changed. We work around this by adding
"isSentry...(Exception)" methods with two different implementations.
Sentry is also doing some odd shading, whereby some exceptions are
"sentry.org.apache.sentry..."; we handle both. Similarly, the mechanism
to create a SentryAuthProvider is slightly different. The easiest way to
see the differences is to run:

  diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/util/SentryUtil.java
  diff -u fe/src/compat-minicluster-profile-{2,3}/java/org/apache/impala/authorization/SentryAuthProvider.java

The Sentry work is based on a change by Zach Amsden.

In addition, we recently added an explicit "refresh" permission.  In
Sentry 2, this required creating an ImpalaPrivilegeModel to capture
that. It's a slight customization of Hive's equivalent class.

For Parquet, the difference is even more mechanical. The package names
changed from "parquet" to "org.apache.parquet". The affected code
was extracted into ParquetHelper, but only one copy exists. The second
copy is generated at build-time using sed.

In the rare cases where we need to behave differently at runtime,
MiniclusterProfile.MINICLUSTER_PROFILE is a class which encapsulates
what version we were built against. One of the cases is the results
expected by various frontend tests. I avoided the issue by translating
one error string into another, which handled the divergence in one place,
rather than complicating the several locations which look for "No
FileSystem for scheme..." errors.

The HBase APIs we use for splitting regions at test time changed.
This patch includes a re-write of that code for the new APIs. This
piece was contributed by Zach Amsden.

To work with newer versions of dependencies, I updated the version of
httpcomponents.core we use to 4.4.9.

We (Thomas Tauber-Marshall and I) uploaded new Hadoop/Hive/Sentry/HBase
binaries to s3://native-toolchain, and amended the shell scripts to
launch the right things. There are minor mechanical differences.  Some
of this was based on earlier work by Joe McDonnell and Zach Amsden.
Hive's logging is changed in Hive 2, necessitating creating a
log4j2.properties template and using it appropriately. Furthermore,
Hadoop3's new shell script re-writes do a certain amount of classpath
de-duplication, causing some issues with locating the relevant logging
configurations. Accommodations exist in the code to deal with that.

parquet-filtering.test was updated to turn off stats filtering. Older
Hive didn't write Parquet statistics, but newer Hive does. By turning
off stats filtering, we test what the test had intended to test.

For views-compatibility.test, it seems that Hive 2 has fixed certain
bugs that we were testing for in Hive. I've added a
HIVE=SUCCESS_PROFILE_3_ONLY mechanism to capture that.

For AuthorizationTest, different hive versions show slightly different
things for extended output.

To facilitate easier reviewing, the following files are 100% renames as identified by git; nothing
to see here.

 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetCatalogsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetColumnsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetFunctionsReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetInfoReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetSchemasReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/hive/service/rpc/thrift/TGetTablesReq.java (100%)
 rename fe/src/{compat-hive-1 => compat-minicluster-profile-2}/java/org/apache/impala/compat/MetastoreShim.java (100%)
 rename fe/src/{compat-hive-2 => compat-minicluster-profile-3}/java/org/apache/impala/compat/MetastoreShim.java (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-acls.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/kms-site.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/hadoop/conf/yarn-site.xml.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-common (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-master (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/init.d/kudu-tserver (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/master.conf.tmpl (100%)
 rename testdata/cluster/node_templates/{cdh5 => common}/etc/kudu/tserver.conf.tmpl (100%)

CreateTableLikeFileStmt had a chunk of code moved to ParquetHelper.java. This
was done manually, but without changing anything except what Java required in
terms of accessibility and boilerplate.

 rewrite fe/src/main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java (80%)
 copy fe/src/{main/java/org/apache/impala/analysis/CreateTableLikeFileStmt.java => compat-minicluster-profile-3/java/org/apache/impala/analysis/ParquetHelper.java} (77%)

Testing: Ran core & exhaustive tests with both profiles.
Cherry-picks: not for 2.x.

Change-Id: I7a2ab50331986c7394c2bbfd6c865232bca975f7
Reviewed-on: http://gerrit.cloudera.org:8080/9716
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-23 20:56:00 +00:00
Tim Armstrong
588e1d46e9 IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.
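
In rough outline (table name and HDFS paths here are illustrative, not
the ones the test actually uses), the table creation and query look
something like:

  -- assumes rle_encoded_bool.parquet has been copied to /test-warehouse/rle_bool/
  create external table rle_bool_sketch like parquet
    '/test-warehouse/rle_bool/rle_encoded_bool.parquet'
    stored as parquet location '/test-warehouse/rle_bool';
  select count(*) from rle_bool_sketch;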

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Reviewed-on: http://gerrit.cloudera.org:8080/9403
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-22 02:47:33 +00:00
Tianyi Wang
d03b66ca35 IMPALA-6394: Restart HDFS only when no replication progress is made
In wait-hdfs-replication, the frequent and eager restart might slow the
HDFS replication down. HDFS should be restarted only if no progress is
made in a certain amount of time, and we should wait longer before
failing the data loading.

Testing: It's tested with a fake HDFS fsck script.

Change-Id: Ib059480254643dc032731b4b3c55204a93b61e77
Reviewed-on: http://gerrit.cloudera.org:8080/9698
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-22 00:41:16 +00:00
Bikramjeet Vig
3d65f856f7 IMPALA-6621: Improve set lookup performance for in-predicate evaluation
Currently, when using a SET_LOOKUP strategy for in-predicates in Impala
we use a std::set object for checking membership. This patch takes a
hybrid approach based on benchmarking results and uses boost::flat_set
for int, big int, and float data types and boost::unordered_set for the
rest (tiny int, small int, double, string, timestamp, decimal).

The intent of this change is to fix a regression when upgrading the
toolchain to use LLVM 5.0.1 (IMPALA-5980).

Performance:
Ran a query for each data type with a large in predicate containing
500 elements on a single node with mt_dop set to 1.

+-----------+---------------+----------+---------------+----------+
| Data Type | Llvm 3 hybrid |  Llvm 3  | Llvm 5 hybrid |  Llvm 5  |
+-----------+---------------+----------+---------------+----------+
|           Table used: tpch100_parquet.lineitem                  |
+-----------+---------------+----------+---------------+----------+
| big int   | 17s782ms      | 13s941ms | 13s201ms      | 25s604ms |
| string    | 40s750ms      | 64s      | 40s723ms      | 73s      |
| decimal   | 13s929ms      | 22s272ms | 13s710ms      | 34s338ms |
| int       | 19s368ms      | 11s308ms | 9s169ms       | 15s254ms |
+-----------+---------------+----------+---------------+----------+
|           Table used: alltypes with 33638400 rows               |
+-----------+---------------+----------+---------------+----------+
| double    | 5s726ms       | 5s894ms  | 5s595ms       | 6s592ms  |
| small int | 4s776ms       | 5s057ms  | 4s740ms       | 5s358ms  |
| float     | 7s223ms       | 6s397ms  | 6s287ms       | 6s926ms  |
+-----------+---------------+----------+---------------+----------+

Also added a targeted perf query that uses a large in-predicate
over a decimal column.
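
A sketch of that query shape (the column is my assumption and the
literal list is abbreviated, so this is not the actual targeted query):

  select count(*) from tpch100_parquet.lineitem
  where l_extendedprice in (1.01, 2.02, 3.03 /* ...hundreds more literals... */);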

Testing:
- Ran expr-test and test_exprs successfully.

Change-Id: Ifd1627d779d10a16468cc3c2d0bc26a497e048df
Reviewed-on: http://gerrit.cloudera.org:8080/9570
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-21 00:40:10 +00:00
Philip Zeyliger
5c8da5d13a Consistently use Java 1.7 compiler.
We use Java 1.7 in fe/pom.xml, where most of our Java code is. For
consistency, this updates the rest of our Maven configurations to use
the same version of Java. A change I'm working on uses
try-with-resources in HBase splitting, which is how I ran into
this.

Testing: ran core tests

Change-Id: I6cecddf367f00185a14a8b08c03456e3b756bd70
Reviewed-on: http://gerrit.cloudera.org:8080/9600
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-17 04:08:53 +00:00
Tim Armstrong
e148c1a7c3 IMPALA-6589: remove invalid DCHECK in parquet reader
The DCHECK was only valid if the Parquet file metadata is internally
consistent, with the number of values reported by the metadata
matching the number of encoded levels.

The DCHECK was intended to directly detect misuse of the RleBatchDecoder
interface, which would lead to incorrect results. However, our other
test coverage for reading Parquet files is sufficient to test the
correctness of level decoding.

Testing:
Added a minimal corrupt test file that reproduces the issue.

Change-Id: Idd6e09f8c8cca8991be5b5b379f6420adaa97daa
Reviewed-on: http://gerrit.cloudera.org:8080/9556
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-17 02:52:19 +00:00
Fredy Wijaya
41a516f949 IMPALA-6655: Add owner information on database creation
Add owner information on database creation.

> create database foo;
> describe database extended foo;
+---------+----------+---------+
| name    | location | comment |
+---------+----------+---------+
| foo     |          |         |
| Owner:  |          |         |
|         | user1    | USER    |
+---------+----------+---------+

Testing:
- Ran end-to-end query and metadata tests

Change-Id: Id74ec9bd3cb7954999305e9cd9085cbf50921a78
Reviewed-on: http://gerrit.cloudera.org:8080/9637
Reviewed-by: Fredy Wijaya <fwijaya@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-16 19:28:37 +00:00
Alex Behm
42abe8139e IMPALA-5270: Pass resolved exprs into analytic SortInfo.
The bug was that the SortInfo of analytics was given
ordering exprs that were not fully resolved against their
input (e.g. inline views were not resolved).
As a result, the SortInfo logic did not materialize exprs
like rand() coming from inline views.

The fix is to pass fully resolved exprs to the analytic
SortInfo, and then the existing materialization logic
properly handles non-deterministic built-ins and UDFs.
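
An illustrative query shape that was affected (table and aliases are my
own, for the example only):

  select id, row_number() over (order by r)
  from (select id, rand() as r from functional.alltypes) v;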

The code around sort generation was rather convoluted
and difficult to understand. I overhauled SortInfo to
unify the different uses of it under a common codepath.
After that cleanup, the fix for this issue was trivial.

Testing:
- Locally ran planner tests
- Locally ran analytic EE tests in test_queries.py
- Core/hdfs run passed

Change-Id: Id2b3f4e5e3f1fd441a63160db3c703c432fbb072
Reviewed-on: http://gerrit.cloudera.org:8080/9631
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-15 02:00:46 +00:00
Philip Zeyliger
45aee121eb Removing (broken) retries from split-hbase.sh.
The retries in split-hbase.sh don't work in the common case,
because $MINIKDC_PRINC_HIVE is not set in non-kerberized (common)
environments. The regular data load scripts (create-load-data.sh)
have code to manage that, but split-hbase.sh blindly forges ahead,
leading to errors like:

  /home/impdev/Impala/testdata/bin/split-hbase.sh: line 49: MINIKDC_PRINC_HIVE: unbound variable
  Error in /home/impdev/Impala/testdata/bin/create-load-data.sh at line 48: LOAD_DATA_ARGS=""

Since this hasn't been working, I opted to remove it entirely, as a failure on
the line where HBase splitting actually failed would be significantly more
useful than the error here. A search of mailing lists suggested that I was at
least the second person to have run into this. (In my case, I did break HBase
splitting, but it took me a second to identify the error, since the log was
spammed with unrelated information relating to the cluster restart.)

Testing: core tests.

Change-Id: I715891c9e744f21002330c3ae3ebc14095d94ffd
Reviewed-on: http://gerrit.cloudera.org:8080/9588
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-15 01:32:14 +00:00
Grant Henke
03794389fc IMPALA-6551: Change Kudu TPCDS and TPCH columns to DECIMAL
Before Kudu supported DECIMAL columns, the TPCDS and TPCH
columns were adjusted to use DOUBLE in place of DECIMAL. This
patch undoes that change now that Kudu supports DECIMAL.

Testing:
 - Updated concurrent_select.py
 - Updated test_tpch_queries.py
 - Exercised by the Kudu planner tests

Change-Id: I2f7e4464dc6705cadd610a82c459390a9c0dfe4f
Reviewed-on: http://gerrit.cloudera.org:8080/9484
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-14 21:38:06 +00:00
Vincent Tran
0d7787fe4d IMPALA-5315: Cast to timestamp fails for YYYY-M-D format
This change allows casting of a string in 'lazy' date/time
format to timestamp. The supported lazy date formats are:
  yyyy-[M]M-[d]d
  yyyy-[M]M-[d]d [H]H:[m]m:[s]s[.SSSSSSSSS]
  [H]H:[m]m:[s]s[.SSSSSSSSS]
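
For example (values are illustrative), the change allows casts such as:

  select cast('2018-3-9' as timestamp);
  select cast('2018-3-9 1:5:7.5' as timestamp);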

We will incur a SCAN performance penalty (approximately 1/2
TotalReadThroughput) when the string is in one of these
lazy date/time formats.

Testing:
Benchmarked the performance consequence by executing this SQL on
a private build over 3.8 billion rows:
select min(cast (time_string as timestamp)) from private.impala_5315

Added tests for valid and invalid date/time format strings
in expr-test.cc to be inline with existing tests for CAST() function.

Added end-to-end tests into exprs.test and
select-lazy-timestamp.test to exercise the new function within
the context of a query.

Added tests to exercise the leading and trailing white space trimming
behaviour in default and lazy date/time string formats (IMPALA-6630).

Change-Id: Ib9a184a09d7e7783f04d47588537612c2ecec28f
Reviewed-on: http://gerrit.cloudera.org:8080/7009
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-13 22:10:18 +00:00
Grant Henke
6d8ce64020 IMPALA-6635: Add DECIMAL type to Kudu predicates
This patch enables pushing scan predicates on
DECIMAL columns down to Kudu.

Testing:
- Added Planner decimal predicate test to kudu.test
- Added Planner decimal in-list test to kudu-selectivity.test

Change-Id: I2569a9e1d58f1c58884d58633d46348364888ed7
Reviewed-on: http://gerrit.cloudera.org:8080/9578
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-13 20:31:43 +00:00
Philip Zeyliger
8e1bf0e99a IMPALA-6341, IMPALA-5917: Reduce mem-limit for start-impala-cluster.
We've observed empirically that giving Impala 80% of system memory
doesn't leave enough room for the minicluster and ASAN overhead, leading
to the OOM killer sometimes striking during test runs. This commit
reduces the threshold to 70%.

This commit also reduces the memory usage of semi-joins-exhaustive.test
by roughly halving the number of records it deals with. This was
necessary for tests to pass on a machine with 32GB of RAM.

Testing: I've run the ASAN build (more) happily with this change.
I've run exhaustive tests on a 32GB machine.

Change-Id: Iabca7a95560bd27c2de2b0a147ee9a3c45199db7
Reviewed-on: http://gerrit.cloudera.org:8080/9395
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-13 00:39:38 +00:00
Tianyi Wang
c7a58b8a73 IMPALA-6394: Restart HDFS when blocks are under replicated
HDFS sometimes fails to fully replicate all the blocks in 30 seconds
and no progress is made. This patch tries to restart HDFS several times
before aborting the data loading.

Change-Id: Iefd4c2fc6c287f054e385de52bdc42b0bdbd7915
Reviewed-on: http://gerrit.cloudera.org:8080/9469
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-09 22:54:47 +00:00
Zoltan Borok-Nagy
5d044e0cb2 IMPALA-6542: Fix inconsistent write path of Parquet min/max statistics
Quick fix of Parquet write path until the Parquet community
agrees on the ordering of floating point numbers.

The behavior follows the way fmax()/fmin() work, i.e. Impala
will only write NaN into the stats when all the values are NaNs.
This behavior is aligned with the quick fix of Parquet-CPP.

Added e2e tests as well.

Change-Id: I3957806948f7c661af4be5495f2ec92d1e9fc9d6
Reviewed-on: http://gerrit.cloudera.org:8080/9381
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-08 07:34:41 +00:00
Tim Armstrong
73e90d237e IMPALA-6592: add test for invalid parquet codecs
IMPALA-6592 revealed a gap in test coverage for files with
invalid/unsupported Parquet codecs. This adds a test that reproduces the
bug that was present in my IMPALA-4835 patch. master is unaffected by
this bug.

I also hid the conversion tables and made the conversion go through
functions that validate the enum values, to make it easier to track down
problems like this in the future.

Testing:
Ran exhaustive tests.

Change-Id: I1502ea7b7f39aa09f0ed2677e84219b37c64c416
Reviewed-on: http://gerrit.cloudera.org:8080/9500
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-08 04:48:36 +00:00
Tim Armstrong
7376ca29b4 IMPALA-6595: fix crash in NljBuilder::Close()
The bug is that the right child of a blocking join node could be closed
before the builder if an error was encountered when sending a batch to
the sink. This hits a DCHECK because Buffers owned by the sink may still
be accounted against the child node.

Testing:
Added the test that originally triggered the problem. It reproduced the
failure when based on the IMPALA-4835 patch, but I can't reproduce
the failure after rebase onto master.

Change-Id: Ie46b87a4889d7cee907124796c830db41125cf15
Reviewed-on: http://gerrit.cloudera.org:8080/9493
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-03-06 16:12:05 +00:00
Taras Bobrovytsky
b0027575cb IMPALA-6405: Error when string to decimal cast overflows
Before this patch, when there was an error converting a string to
a decimal, a NULL was returned. In this patch, we change this behavior
so that an error is returned if decimal_v2 is enabled. We also add a
warning if there is an underflow.

The reasoning is that we want stricter behavior in decimal_v2.
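
For example (values are illustrative):

  set decimal_v2=true;
  select cast('99999' as decimal(3,0));  -- now an error rather than NULL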

Testing:
- Added some EE tests.
- Ran an exhaustive build, which passed.

Change-Id: Icffccac1c1c2361447ae4b0de9b6c2ec7de071db
Reviewed-on: http://gerrit.cloudera.org:8080/9339
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-06 03:29:47 +00:00
Tim Armstrong
161cbe30ff Revert IMPALA-4835 and dependent changes
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"

This reverts commit 25bcb258df.

Revert "IMPALA-6588: don't add empty list of ranges in text scan"

This reverts commit d57fbec6f6.

Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"

This reverts commit 24b4ed0b29.

Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"

This reverts commit 5699b59d0c.

Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"

This reverts commit 65680dc421.

Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-03 04:22:12 +00:00