Commit Graph

1363 Commits

Author SHA1 Message Date
Taras Bobrovytsky
f810458ca4 IMPALA-6231: Implement decimal_v2 fuzz test
Implement a test that generates random decimal numbers in the pytest
framework, performs a random mathematical operation in Impala and
verifies that the result is correct by performing the same operation using
the Python decimal module. We try to generate not only completely random
decimal numbers, but also numbers that have interesting properties, such
as the number being a power of two.
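
A minimal sketch of the idea, not the test added by this patch: it generates random decimals, occasionally substituting a power of two, and computes the reference result with Python's decimal module. The helper names, the 20% chance of an "interesting" value, and the precision bound of 38 are assumptions for illustration. The step that would send the expression to Impala is left out.

```python
import random
from decimal import Decimal, localcontext

MAX_PRECISION = 38  # assumed maximum decimal precision


def random_decimal(rng, max_digits=MAX_PRECISION):
    """Return a random Decimal, sometimes an exact power of two."""
    if rng.random() < 0.2:
        # An "interesting" value: build the power of two as an int so it is exact.
        return Decimal(2 ** rng.randint(0, 100))
    digits = rng.randint(1, max_digits)
    scale = rng.randint(0, digits)
    unscaled = rng.choice((1, -1)) * rng.randint(0, 10 ** digits - 1)
    # Constructing from a string is exact regardless of the context precision.
    return Decimal("%de-%d" % (unscaled, scale))


def expected_result(op, left, right):
    """Compute the reference result with enough precision to stay exact."""
    with localcontext() as ctx:
        ctx.prec = 100  # headroom for exact +, - and * on 38-digit operands
        if op == "+":
            return left + right
        if op == "-":
            return left - right
        if op == "*":
            return left * right
        raise ValueError("unsupported operator: %s" % op)


rng = random.Random(0)
a, b = random_decimal(rng), random_decimal(rng)
op = rng.choice("+-*")
# The query text a fuzz test might send, together with the value to compare against.
print("select %s %s %s" % (format(a, "f"), op, format(b, "f")))
print("expected:", format(expected_result(op, a, b), "f"))
```

Seeding the generator makes a failing case reproducible, which matters for a test that is random by design.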

Change-Id: I4328125de5c583ec8ead1f78d9a08703b18b2d85
Reviewed-on: http://gerrit.cloudera.org:8080/8898
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-10 03:03:52 +00:00
aphadke
38461c524f IMPALA-5052: Read and write signed integer logical types in Parquet
This patch maps a signed integer logical type in parquet to a supported
Impala column type. This change introduces the following mapping -

  INT_8  -> TINYINT
  INT_16 -> SMALLINT
  INT_32 -> INT
  INT_64 -> BIGINT

Also, added a parquet file with the following schema for testing -

  schema {
    optional int32 id;
    optional int32 tinyint_col (INT_8);
    optional int32 smallint_col (INT_16);
    optional int32 int_col;
    optional int64 bigint_col;
  }
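
The mapping above can be written as a small lookup table. This is only an illustrative sketch: the function name is hypothetical, and the fallback for columns without a logical-type annotation (int32 to INT, int64 to BIGINT) is an assumption that matches the unannotated columns in the test schema.

```python
# Parquet signed-integer logical type -> Impala column type (from the commit message).
LOGICAL_TO_IMPALA = {
    "INT_8": "TINYINT",
    "INT_16": "SMALLINT",
    "INT_32": "INT",
    "INT_64": "BIGINT",
}

# Assumed fallback when a column carries no logical-type annotation.
PHYSICAL_TO_IMPALA = {"int32": "INT", "int64": "BIGINT"}


def impala_type(physical_type, logical_type=None):
    """Resolve the Impala type for one Parquet column."""
    if logical_type is not None:
        return LOGICAL_TO_IMPALA[logical_type]
    return PHYSICAL_TO_IMPALA[physical_type]


# The columns of the test schema, as (name, physical type, logical type).
schema = [
    ("id", "int32", None),
    ("tinyint_col", "int32", "INT_8"),
    ("smallint_col", "int32", "INT_16"),
    ("int_col", "int32", None),
    ("bigint_col", "int64", None),
]
for name, physical, logical in schema:
    print(name, impala_type(physical, logical))
```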

Change-Id: I47a8371858c9597c6a440808cf6f933532468927
Reviewed-on: http://gerrit.cloudera.org:8080/8548
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-09 04:55:59 +00:00
Bharath Vissapragada
6a87eb20a5 IMPALA-6348: Redact only sensitive fields in runtime profiles
Without this patch, redaction is applied to every field in the
runtime profile. This approach has an undesired side effect when
Kerberos auth + email redaction is in place.

Since the redaction applies to every field, even principals
(from Connected/Delegated User fields) are redacted, as the Kerberos
principal format generally pattern matches with an email redactor
template.

This is particularly problematic for monitoring tools that consume
runtime profiles and use these fields to group the queries by user.

This patch fixes the problem by redacting only the following sensitive
fields.

- Query Statement
- Error logs (since they can contain column references etc.)
- Query Status
- Query Plan

Other fields in the runtime profile are left unredacted.
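
A rough sketch of field-scoped redaction, under stated assumptions: the email pattern, the replacement text, and the dict representation of a profile are all hypothetical, and real redaction rules are configured by the operator. Only the list of sensitive field names comes from this message.

```python
import re

# Hypothetical email-style redaction rule.
EMAIL_RULE = re.compile(r"[\w.+-]+@[\w.-]+")

# Field names from the commit message.
SENSITIVE_FIELDS = {"Query Statement", "Error logs", "Query Status", "Query Plan"}


def redact_profile(profile):
    """Apply the rule to sensitive fields only; pass other fields through."""
    return {
        name: EMAIL_RULE.sub("*redacted*", value) if name in SENSITIVE_FIELDS else value
        for name, value in profile.items()
    }


profile = {
    "Connected User": "alice@EXAMPLE.COM",  # Kerberos principal, looks like an email
    "Query Statement": "select * from t where email = 'bob@example.com'",
}
redacted = redact_profile(profile)
print(redacted["Connected User"])
print(redacted["Query Statement"])
```

The principal survives because its field is not in the sensitive set, so a monitoring tool can still group queries by user.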

Change-Id: Iae3b6726009bf458a7ec73131e5d659b12ab73cf
Reviewed-on: http://gerrit.cloudera.org:8080/8934
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-06 22:54:17 +00:00
Zoltan Borok-Nagy
ce65b43d47 IMPALA-2248: Make idle_session_timeout a query option
This commit makes idle_session_timeout a query option.

idle_session_timeout currently can be set as a command line
option, which will be the default timeout for sessions.
HS2 sessions can override it with a smaller value by setting
it in the configuration overlay of HS2 OpenSession().

However, we can't override idle_session_timeout for JDBC/ODBC
connections, because we cannot put this in the connection string.

This commit is a workaround for this problem: it allows JDBC/ODBC
connections to set the session timeout as a query option
with the SET statement.

After this commit, the session timeout can be overridden to
any value, i.e. the command line flag idle_session_timeout
doesn't limit this option anymore.

I created an automated test case in JdbcTest.java based on
test_hs2.py::test_concurrent_session_mixed_idle_timeout. I also
extended the test_session_expiration and test_set_and_unset
test suites.

Change-Id: I32e2775f80da387b0df4195fe2c5435b3f8e585e
Reviewed-on: http://gerrit.cloudera.org:8080/8490
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-06 01:47:47 +00:00
Tim Armstrong
d3ff67b8b3 IMPALA-6370: fix partitioned parquet tables with nested types
When materialising a nested collection, has_template_tuple() should use
the template tuple for the collection, not the top-level tuple.

Testing:
Added tests based on nested-types-basic.test that operate on a simple
partitioned table. The tests reliably crashed Impala before the fix.

Change-Id: Ic808b824ce3b31af0539036d8ca23d17b18deab4
Reviewed-on: http://gerrit.cloudera.org:8080/8947
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-05 20:44:21 +00:00
Gabor Kaszab
7810d1f9a2 IMPALA-6318: Adjustment for hanging query cancellation test
Apparently test_query_cancellation_during_fetch hangs occasionally
in Jenkins builds. The Impala debug page shows the query being
cancelled, however, on the host the ImpalaShell process related to
that query is still running.

Since I had no luck in reproducing the issue locally I only have a
theory what might be going on here: The query is cancelled
successfully on Impala backend and when the test tries to get the
stdout and stderr from the ImpalaShell it gets stuck. It might be
the case that ImpalaShell process fetching the query results holds
the stdout. According to the documentation of subprocess.communicate(),
fetching data may cause issues when the data size is large or
unlimited, which we can consider to be the case here.
As a workaround there is a new optional parameter to
util.ImpalaShell to omit the stdout because this test wouldn't use
it anyway and we get rid of fetching the large result from
ImpalaShell.
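
The workaround can be sketched roughly as follows. This is not the util.ImpalaShell change itself: the function and parameter names are hypothetical, and a Python child process that prints a large result stands in for the shell.

```python
import subprocess
import sys


def run_and_collect(cmd, omit_stdout=False):
    """Run cmd; optionally discard stdout instead of buffering it in memory."""
    stdout_target = subprocess.DEVNULL if omit_stdout else subprocess.PIPE
    proc = subprocess.Popen(cmd, stdout=stdout_target, stderr=subprocess.PIPE)
    # communicate() buffers every piped stream, so a discarded stdout
    # leaves only stderr to collect.
    out, err = proc.communicate()
    return proc.returncode, out, err


# A child process that prints a large result, standing in for the shell.
child = [sys.executable, "-c", "print('x' * 1000000)"]
returncode, out, err = run_and_collect(child, omit_stdout=True)
print(returncode, out)
```

When stdout is not a pipe, communicate() returns None in its place, so callers that never look at the output lose nothing.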

Change-Id: I082c83b91b6d0c527de92c7992f0dc9d1b290433
Reviewed-on: http://gerrit.cloudera.org:8080/8852
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-03 20:32:24 +00:00
Taras Bobrovytsky
a16fe803ca IMPALA-5014: Part 1: Round when casting string to decimal
In this patch we implement rounding when casting string to decimal if
DECIMAL_V2 is enabled. The backend method that parses strings and
converts them to decimals is refactored to make it easier to understand.
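
To illustrate the difference rounding makes, here is a small sketch using Python's decimal module. It assumes ties round away from zero (ROUND_HALF_UP) and that the non-rounding path truncates; neither detail is stated in this message, and the function name is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def cast_string_to_decimal(text, scale, round_result=True):
    """Parse text and fit it to the given scale, rounding or truncating."""
    quantum = Decimal(1).scaleb(-scale)  # e.g. scale=2 -> Decimal('0.01')
    mode = ROUND_HALF_UP if round_result else ROUND_DOWN
    return Decimal(text).quantize(quantum, rounding=mode)


print(cast_string_to_decimal("1.235", 2))                      # rounded
print(cast_string_to_decimal("1.235", 2, round_result=False))  # truncated
```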

Testing:
- Added some BE tests.

Change-Id: Icd8b92727fb384e6ff2d145e4aab7ae5d27db26d
Reviewed-on: http://gerrit.cloudera.org:8080/8774
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-22 11:39:08 +00:00
Zoram Thanga
b581a9d1ee IMPALA-6225: Part 2: Query profile date-time strings should have ns
precision.

This commit follows 16d8dd58.

This patch adds a test case that inspects the thrift profile of a
completed query, and verifies that the "Start Time" and
"End Time" of the query have nanosecond precision. We chose to
work with the thrift profile directly, rather than parse the debug
web page, as it is the thrift profile which is consumed by
management API clients of Impala.
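
A check of this kind could look like the sketch below. The timestamp layout ("YYYY-MM-DD HH:MM:SS" plus nine fractional digits) is an assumption, and the function name is hypothetical; the real test reads the values from the thrift profile.

```python
import re

# Assumed layout: date, time, then exactly nine fractional digits.
NS_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{9}$")


def has_ns_precision(value):
    """True if the date-time string carries nanosecond precision."""
    return NS_TIMESTAMP.match(value) is not None


print(has_ns_precision("2017-12-21 04:26:33.123456789"))
print(has_ns_precision("2017-12-21 04:26:33.123456"))
```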

Change-Id: Id3421a34cc029ebca551730084c7cbd402d5c109
Reviewed-on: http://gerrit.cloudera.org:8080/8784
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-21 04:26:33 +00:00
Alex Behm
1f7b3b00e9 IMPALA-5310: Part 3: Use SAMPLED_NDV() in COMPUTE STATS.
Modifies COMPUTE STATS TABLESAMPLE to use the new SAMPLED_NDV()
function.

Testing:
- modified/improved existing functional tests
- core/hdfs run passed

Change-Id: I6ec0831f77698695975e45ec0bc0364c765d819b
Reviewed-on: http://gerrit.cloudera.org:8080/8840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-16 04:58:59 +00:00
Jinchul
bfbcd1fe86 IMPALA-4664: Unexpected string conversion in Shell
Impala shell can accidentally convert certain
literal strings to lowercase. Impala shell splits
each command into tokens and then converts the
first token to lowercase to figure out how it
should execute the command. The splitting is done
by spaces only. Thus, if the user types a TAB
after the SELECT, the first token after the split
becomes the SELECT plus whatever comes after it.
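
The behaviour described above is easy to reproduce in plain Python. The two helper names are hypothetical, and splitting on any whitespace is shown only as one possible remedy; this message does not spell out the exact fix.

```python
def first_token_space_only(command):
    # Splitting on the space character only, as described above.
    return command.split(" ")[0].lower()


def first_token_any_whitespace(command):
    # str.split() with no argument splits on any run of whitespace.
    return command.split()[0].lower()


command = "SELECT\t'ABC'"
print(repr(first_token_space_only(command)))      # the literal is lowercased too
print(repr(first_token_any_whitespace(command)))  # only the keyword is affected
```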

Testing:
TestImpalaShellInteractive.test_case_sensitive_command
TestImpalaShellInteractive.test_unexpected_conversion_for_literal_string_to_lowercase
TestImpalaShell.test_var_substitution

Change-Id: Ifdce9781d1d97596c188691b62a141b9bd137610
Reviewed-on: http://gerrit.cloudera.org:8080/8762
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-15 21:32:20 +00:00
stiga-huang
5c593be59c IMPALA-6301: Fix test failures when username or group name contains dots
Some tests use the local user's group name to construct SQLs, which may
lead to syntax errors when group name contains dots. We need to quote
the group names in SQL to avoid this error. Besides, a test in
test_admission_controller uses '\w+' to match the local user name. This
expression cannot match usernames with dots, which causes test failure
as well. Instead, we should use '\S+'.
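
The difference between the two patterns can be checked directly with Python's re module. The user name below is made up for illustration.

```python
import re

user = "first.last"  # hypothetical user name containing a dot

# \w matches letters, digits and underscore, so the dot stops the match.
print(re.fullmatch(r"\w+", user))
# \S matches any non-whitespace character, including the dot.
print(re.fullmatch(r"\S+", user))
```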

Change-Id: Ib8ae15bb6a929dc48d3ad2176c8b3fafff87f32b
Reviewed-on: http://gerrit.cloudera.org:8080/8807
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-13 23:06:45 +00:00
Philip Zeyliger
2fcbf36c32 IMPALA-6270: remove redundant version properties
Removes properties that are already defined in the impala-parent pom.

I ran the tests.

Change-Id: I6812e11bb41716450ef29bb523773479e9f76eec
Reviewed-on: http://gerrit.cloudera.org:8080/8827
Reviewed-by: Zach Amsden <zamsden@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-13 22:48:10 +00:00
Jinchul
4feb4f3a54 IMPALA-5754: Improve randomness of rand()/random()
The current implementation of the rand()/random() built-in functions
uses rand_r from the C library, whose randomness we found to be poor.
pcg32 from a third-party library shows better randomness than rand_r.
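
For readers unfamiliar with pcg32, the following is a pure-Python rendering of the publicly documented PCG32 (XSH-RR) generator. It is for illustration only: the class name is hypothetical, and the patch itself uses a third-party C/C++ library rather than anything like this.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1
PCG_MULT = 6364136223846793005  # multiplier of the underlying 64-bit LCG


class Pcg32:
    """Pure-Python PCG32 (XSH-RR variant)."""

    def __init__(self, seed, seq=0):
        # Same seeding sequence as the reference pcg32_srandom_r().
        self.state = 0
        self.inc = ((seq << 1) | 1) & MASK64  # the stream increment must be odd
        self.next_u32()
        self.state = (self.state + seed) & MASK64
        self.next_u32()

    def next_u32(self):
        old = self.state
        # Advance the 64-bit linear congruential state.
        self.state = (old * PCG_MULT + self.inc) & MASK64
        # Output function: xorshift the high bits, then rotate right.
        xorshifted = (((old >> 18) ^ old) >> 27) & MASK32
        rot = old >> 59
        return ((xorshifted >> rot) | (xorshifted << ((-rot) & 31))) & MASK32

    def random(self):
        """Uniform float in [0, 1)."""
        return self.next_u32() / float(1 << 32)


rng = Pcg32(seed=42, seq=54)
print([rng.next_u32() for _ in range(3)])
```

The generator keeps 64 bits of state and a per-stream increment, and it returns only a permuted 32-bit slice of that state on each call.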

Testing:
Revise unit test in expr-test
Add E2E test to random.test

Change-Id: Idafdd5fe7502ff242c76a91a815c565146108684
Reviewed-on: http://gerrit.cloudera.org:8080/8355
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-12-13 10:04:40 +00:00
Alex Behm
0936e32966 IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.
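
The fit-and-extrapolate step can be sketched in a simplified form. Everything below is an assumption made for illustration: the two candidate models (linear and logarithmic in the row count), the use of ordinary least squares in place of MPFIT, and all names. The objective functions actually used by SAMPLED_NDV() are not listed in this message.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b, sse)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse


# Hypothetical candidate models, each defined by a transform of the row count.
CANDIDATES = {"linear": lambda x: x, "log": math.log}


def extrapolate_ndv(points, total_rows):
    """Fit every candidate to (rows, ndv) points; extrapolate with the best fit."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    best = None
    for name, transform in CANDIDATES.items():
        a, b, sse = fit_line([transform(x) for x in xs], ys)
        if best is None or sse < best[0]:
            best = (sse, name, a + b * transform(total_rows))
    return best[1], best[2]


# Made-up sample points: NDV estimates after 100, 200, 300 and 400 rows.
points = [(100, 50), (200, 100), (300, 150), (400, 200)]
print(extrapolate_ndv(points, total_rows=1000))
```

In the real function the y values come from HyperLogLog intermediates, not from hand-written points.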

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Reviewed-on: http://gerrit.cloudera.org:8080/8569
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-12 22:20:18 +00:00
Philip Zeyliger
d2fe9f437e IMPALA-6270: create Impala parent pom
This commit links together all the individual pom.xml files to have a
new "impala-parent" pom as the parent. This enables de-duplicating all
the repository configuration.

I ran the build to test this.

Change-Id: Id744e4357ee4d8e4be4e5490b2159bb76a2192f0
Reviewed-on: http://gerrit.cloudera.org:8080/8753
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-12 04:30:15 +00:00
Zach Amsden
245df3c69a IMPALA-6245: Tolerate column indenting from Hive
The fix for HIVE-3140 started indenting multi-line comments,
which breaks Impala testing when run against Hive 2.1.1.

To test this using the pure test runner proved difficult
since it would require extensive changes to support both
row_regexes (since the columns changed order) and subset
support (since the number of rows changed).

Instead, we manually verify the hints are present in the
output in the python test.

The fact that the hints have been reformatted leaves us
in an uncertain state as to whether they actually get applied,
so a new test case has been added to run EXPLAIN SELECT
on the view and verify the joins happen exactly as we
expect.

Testing: Ran the views-ddl test against Impala mini-cluster
setups using both Hive 2.1.1 and Hive 1.1.0

Change-Id: I49e53b1230520ca6e850af28078526e6627d69de
Reviewed-on: http://gerrit.cloudera.org:8080/8719
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-12 00:17:56 +00:00
Thomas Tauber-Marshall
b4cf5f2174 IMPALA-6298: Skip test_profile_fragment_instances on local filesystem
test_profile_fragment_instances was recently added to verify that the
final runtime profile for a query has the expected fragments and exec
nodes. The test fails on local filesystem builds, though, as it
assumes there will be 3 impalads and therefore 3 fragment instances,
but there is only 1 impalad on local filesystem builds.

The fix is to disable the test on local filesystem builds.

Change-Id: I2c98f160406081626f17709809b8efee9eae1450
Reviewed-on: http://gerrit.cloudera.org:8080/8809
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-11 21:45:43 +00:00
Thomas Tauber-Marshall
f3fa3e017f IMPALA-6081: Fix test_basic_filters runtime profile failure
test_basic_filters has been occasionally failing due to a line missing
from a runtime profile for a particular query.

The problem is that the query returns all of its results before all of
its fragment instances are finished executing (due to a limit). Then,
when one fragment instance reports its status, the coordinator returns
to it a 'cancelled' status, causing all remaining instances for that
backend to be cancelled.

Sometimes this cancellation happens quickly enough that the relevant
fragment instances have not yet sent a status report when they are
cancelled. They will still send a report in finalize, but as the
coordinator only updates its runtime profile for 'ok' status reports,
not 'cancelled', the final runtime profile doesn't end up with any
data for those fragment instances, which means the test does not find
the line in the runtime profile it's checking for.

The fix is to have the coordinator update its runtime profile with
every status report it receives, regardless of error status.

Testing:
- Ran existing runtime profile tests, which rely on profile output,
  in a loop.
- Manually tested some scenarios with failed queries and checked that
  the new profile output is reasonable.
- Added a new e2e test that runs the affected query and checks for the
  presence of info for all expected exec nodes in the profile. This
  repros the underlying issue consistently.

Change-Id: I4f581c7c8039f02a33712515c5bffab942309bba
Reviewed-on: http://gerrit.cloudera.org:8080/8754
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-07 21:07:02 +00:00
Michael Ho
ed72910e96 IMPALA-6262: Always initialize runtime profile for DataSink
This change moves the creation of the runtime profile from DataSink::Prepare()
to the ctor of DataSink derived classes. This makes sure that DataSink::Close()
and other functions can access the profile even if the DataSink fails to initialize.

Testing done: Added a test case which triggers failure in the initialization of output
expressions in a HdfsTableSink. Impalad crashed consistently without the fix.

Change-Id: I2a683000ef180027b929dbebe78bc2a530a4767e
Reviewed-on: http://gerrit.cloudera.org:8080/8770
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-07 09:47:09 +00:00
Zoltan Borok-Nagy
d52fa75cb9 IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
IMPALA-3798 disabled per-scan filtering for sequence-
based scanners due to a race between runtime filter
arrival and header splits processing.

This commit enables per-scan filtering again for the
sequence based files. In HdfsScanNode::ProcessSplit()
we check if the current range is the header of a
sequence file. If so, and the filters reject the file,
the whole file is skipped.

If it is not a sequence header, but the filters reject
the partition, we call RangeComplete() on the current
scan range.

Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Reviewed-on: http://gerrit.cloudera.org:8080/8684
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-07 07:13:29 +00:00
Tianyi Wang
c505a8159b IMPALA-6210: Add query id to lineage graph logging
Some tools use lineage graph logging to collect query metrics. Currently
only query hash is present in this log. Adding query id into it makes
such accounting easier.

Testing: The equality of query id in the query profile and lineage log
is checked in test_lineage.py. A test for TUniqueIdUtil is added to the
FE tests.

Change-Id: I4adbd02df37a234dbb79f58b7c46ca11a914229f
Reviewed-on: http://gerrit.cloudera.org:8080/8589
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-06 00:52:19 +00:00
Gabor Kaszab
75121819be IMPALA-6265 Query cancellation test enhancements
In the query cancellation tests it is essential to wait until the
query gets to a desired state (waiting_to_finish, fetching) and then
cancel it. Apparently, query execution is slower under ASAN than on a
Release build. As a result, a hard-coded timeout threshold is not
sufficient to cover all the builds, or would have to be set to a wastefully
high value.
As a solution the query state is checked on the Impala debug page in
intervals until it reaches the desired state or the maximum retry
attempt value is reached.
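
The polling approach can be sketched as below. The function name, the default attempt count and interval, and the get_state callable are hypothetical; in the real test the state is read from the Impala debug page.

```python
import time


def wait_for_state(get_state, desired_states, max_attempts=30,
                   interval_s=0.5, sleep=time.sleep):
    """Poll get_state() until it returns a desired state or attempts run out."""
    for _ in range(max_attempts):
        state = get_state()
        if state in desired_states:
            return state
        sleep(interval_s)
    raise AssertionError(
        "query did not reach any of %s after %d attempts"
        % (sorted(desired_states), max_attempts))


# Simulated sequence of states, standing in for repeated debug page lookups.
states = iter(["running", "running", "fetching"])
print(wait_for_state(lambda: next(states), {"fetching"}, sleep=lambda s: None))
```

A fast build leaves the loop after the first successful check, while a slow build simply uses more attempts, so no single timeout has to fit every build type.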

Change-Id: Ie0bff485a872df7be8efd784314a6ca91aaadd11
Reviewed-on: http://gerrit.cloudera.org:8080/8713
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-05 21:40:11 +00:00
Vuk Ercegovac
633dbff71d IMPALA-1422: support a constant on LHS of IN predicates.
Currently, constant expressions for the LHS of the IN predicate
are not supported. This patch adds this support as a rewrite in
StmtRewriter (where subqueries are rewritten to joins). Since
there is a nested-loop variant of left semijoin, support for IN
is handled by not erring out. NOT IN is handled by a rewrite to
corresponding NOT EXISTS predicate. Support for NOT IN with a
correlated subquery is not included in this change.

Re-organized the frontend subquery analysis tests to expand coverage.

Testing:
- added frontend subquery analysis tests
- added e2e tests

Change-Id: I0d69889a3c72e90be9d4ccf47d2816819ae32acb
Reviewed-on: http://gerrit.cloudera.org:8080/8322
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-02 04:09:05 +00:00
Taras Bobrovytsky
575b5a20e6 IMPALA-5017: Error on decimal overflow
Before this patch, decimal operations would either silently overflow (in
the case of sum() and avg()), or produce a warning.

In this patch, the behaviour is changed so that an error is produced in
the case of overflow when DECIMAL_V2 is enabled. Decimal v1 behaviour is
unchanged.
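
As a conceptual sketch of an overflow-checked sum, the code below models decimals as unscaled integers with at most 38 digits. The names and the exception type are assumptions, and Impala's backend implementation is not shown in this message.

```python
MAX_UNSCALED = 10 ** 38 - 1  # largest unscaled value of a 38-digit decimal


def checked_sum(unscaled_values):
    """Sum unscaled decimal values, raising instead of overflowing silently."""
    total = 0
    for value in unscaled_values:
        total += value
        if abs(total) > MAX_UNSCALED:
            raise OverflowError("decimal sum exceeds 38 digits of precision")
    return total


print(checked_sum([1, 2, 3]))
try:
    checked_sum([MAX_UNSCALED, 1])
except OverflowError as e:
    print("error:", e)
```

The check runs once per input row, which fits the message's note that the added overflow checks are what slow sum() and avg() down.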

We introduce overflow checks when computing sum() and avg(). This
results in a ~30% performance regression when we are in decimal v2 mode
compared to decimal v1.

Benchmarks:
  Query:
  select sum(dec_38_19) from decimal_tbl
    Decimal v1: 11.57s
    Decimal v2: 16.58s

  Query:
  select avg(dec_38_19) from decimal_tbl
    Decimal v1: 12.08s
    Decimal v2: 17.08s

The performance regression is not as bad if we are computing the sum or
average of decimal column with a lower precision:

  Query:
  select sum(dec_9_5) from decimal_tbl
    Decimal v1: 11.06s
    Decimal v2: 13.08s

  Query:
  select avg(dec_9_5) from decimal_tbl
    Decimal v1: 11.56s
    Decimal v2: 13.57s

Testing:
- Added several end to end tests.
- Updated Expr tests to check for error in case of overflow.

Change-Id: Id98a92c9a9469ec8cf14e518c741a2dab7053019
Reviewed-on: http://gerrit.cloudera.org:8080/8404
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-01 23:23:01 +00:00
Lars Volker
ea8d2ba7f6 IMPALA-6255: Add device names to DiskIoMgr thread names
This change adds device names to the DiskIoMgr thread names. It will
make them easier to identify during debugging.

Change-Id: I30faeda6db8846e4aad64ce29ca811366d84910b
Reviewed-on: http://gerrit.cloudera.org:8080/8669
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-01 05:51:59 +00:00
Alex Behm
b3d8a507cb IMPALA-5310: Add COMPUTE STATS TABLESAMPLE.
Adds the TABLESAMPLE clause for COMPUTE STATS.

Syntax:
COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]

Computes and replaces the table-level row count and total file size,
as well as all table-level column statistics. Existing partition-level
row counts are not modified.
The TABLESAMPLE clause can be used to limit the scanned data volume to
a desired percentage. When sampling, the unmodified results of the
COMPUTE STATS queries are sent to the CatalogServer. There, the stats
are extrapolated before storing them into the HMS so as not to confuse
other engines like Hive/SparkSQL which may rely on the shared HMS
fields being accurate.

Limitations
- Only works for HDFS tables
- TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS
- TABLESAMPLE requires --enable_stats_extrapolation=true

Changes to EXPLAIN
The stored statistics from the HMS are more clearly displayed under
a 'stored statistics' section. Example:

00:SCAN HDFS [functional.alltypes, RANDOM]
   partitions=24/24 files=24 size=478.45KB
   stored statistics:
     table: rows=7300 size=478.45KB
     partitions: 24/24 rows=7300
     columns: all

Testing:
- added new functional tests
- core/hdfs run passed

Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7
Reviewed-on: http://gerrit.cloudera.org:8080/8136
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-29 22:37:01 +00:00
Tim Armstrong
72ed4fc887 Update incubator-impala -> impala URLs
This fixes push_to_asf.py and various other scripts that had the Apache
repo location hard-coded. Also fixed the location of the github mirror
and mailing list archives.

Testing:
Ran push_to_asf.py to check I got the URL right. Checked a couple of the
github and mailing list URLs to make sure the new URL is valid.

Change-Id: Ie49221300340ef34bdd7c01670c35bdbbce3e84f
Reviewed-on: http://gerrit.cloudera.org:8080/8685
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-29 20:58:50 +00:00
Gabor Kaszab
6d9da17288 IMPALA-1144: Fix exception when cancelling query in Impala-shell with CTRL-C
Issue 1: When a query is cancelled via CTRL-C while being executed in Impala-shell,
an exception is thrown from the Impala backend saying 'Invalid query handle'.
This is because one ImpalaClient was making RPCs while another ImpalaClient
cancelled the query on the backend. As a result, RPC handlers in ImpalaServer
try to access a ClientRequestState that has been cleared from the backend. The
issue is reliably reproducible in both the wait_to_finish and fetch states of
the query.

As a solution the query cancellation is indicated to ImpalaClient via a bool
flag. Once a cancellation originated exception reaches Impala shell this flag
is checked to decide whether to suppress the error or not.
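
The flag-based suppression can be sketched like this. The class, method, and exception choices are hypothetical stand-ins and do not reflect the shell's real API.

```python
class ShellClient:
    """Hypothetical stand-in for the shell's client object."""

    def __init__(self):
        self.is_query_cancelled = False

    def cancel_query(self):
        # Record that the cancellation came from this shell (e.g. CTRL-C).
        self.is_query_cancelled = True


def call_rpc(client, rpc):
    """Run rpc(); swallow errors that stem from our own cancellation."""
    try:
        return rpc()
    except RuntimeError:
        if client.is_query_cancelled:
            return None  # expected fallout of the cancel, so stay quiet
        raise  # a genuine error: surface it to the user


def failing_rpc():
    # Simulates the backend response after the query state was cleared.
    raise RuntimeError("Invalid query handle")


client = ShellClient()
client.cancel_query()
print(call_rpc(client, failing_rpc))
```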

Issue 2: Every time a query was cancelled, a 'use db' command was issued
automatically. This happened for historical reasons but is no longer needed
(see Jira for more details).

Change-Id: I6cefaf1dae78baae238289816a7cb9d210fb38e2
Reviewed-on: http://gerrit.cloudera.org:8080/8549
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-29 03:44:51 +00:00
Tim Armstrong
dc1282fbc9 IMPALA-6241: timeout in admission control test under ASAN
The fix for IMPALA-6241 is to increase the timeout for all slow builds.

While testing that fix, I discovered that the ASAN build detection logic
was failing silently, resulting in it assuming that it was testing a
DEBUG build. The error was:

  Unexpected DW_AT_name in first CU:
  /data/jenkins/workspace/verify-impala-toolchain-package-build/label/ec2-package-ubuntu-16-04/toolchain/source/llvm/llvm-3.9.1.src/projects/compiler-rt/lib/asan/asan_preinit.cc;
  choosing DEBUG

The fix for that issue is to remove the build type detection heuristic
and instead just write a file with the build type as part of the build process.
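
The replacement approach amounts to a write on the build side and a read on the test side, roughly as sketched here. The file name, function names, and the build type string are hypothetical.

```python
import os
import tempfile

BUILD_TYPE_FILE = "build_type.txt"  # hypothetical file name


def write_build_type(build_dir, build_type):
    """Build side: record the build type verbatim."""
    with open(os.path.join(build_dir, BUILD_TYPE_FILE), "w") as f:
        f.write(build_type + "\n")


def read_build_type(build_dir):
    """Test side: read the recorded value; a missing file raises an error."""
    with open(os.path.join(build_dir, BUILD_TYPE_FILE)) as f:
        return f.read().strip()


build_dir = tempfile.mkdtemp()
write_build_type(build_dir, "ADDRESS_SANITIZER")
print(read_build_type(build_dir))
```

Unlike a heuristic that falls back to DEBUG, a missing or unreadable file fails loudly, so a misdetected build type cannot go unnoticed.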

Testing:
Before this change I was able to reproduce locally every 5-10 test
iterations. After this change I haven't seen it reproduce.

Change-Id: Ia4ed949cac99b9925f72e19e4adaa2ead370b536
Reviewed-on: http://gerrit.cloudera.org:8080/8652
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-29 03:28:22 +00:00
Dimitris Tsirogiannis
a88c3b9c52 Revert "IMPALA-5538: Use explicit catalog versions for deleted objects"
This reverts commit dd340b8810.
This commit caused a number of issues tracked in IMPALA-6001. The
issues were due to the lack of atomicity between the catalog version
change and the addition to the delete log of a catalog object.

Conflicts:
	be/src/service/impala-server.cc

Change-Id: I3a2cddee5d565384e9de0e61b3b7d0d9075e0dce
Reviewed-on: http://gerrit.cloudera.org:8080/8667
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-29 02:19:50 +00:00
Thomas Tauber-Marshall
7dd28ff431 IMPALA-6201: Fix test_basic_filters on ASAN
TestRuntimeFilters.test_basic_filters is flaky on ASAN as sometimes
the runtime filters aren't received within the specified
RUNTIME_FILTER_WAIT_TIME_MS.

This patch increases the timeout for ASAN builds.

Change-Id: I8c20cbb75a9b6da73137f220657aa75dea9dfdce
Reviewed-on: http://gerrit.cloudera.org:8080/8646
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-28 03:01:39 +00:00
Gabor Kaszab
88cb68cfbe IMPALA-2181: Add query option levels for display
Four display levels are introduced for each query option: REGULAR, ADVANCED,
DEVELOPMENT and DEPRECATED. When the query options are displayed in Impala
shell using SET then only the REGULAR and ADVANCED options are shown. A new
command called SET ALL shows all the options grouped by their option levels.

When the query options are displayed through the SET SQL statement then the
result set would contain an extra column indicating the level of each option.
Similarly to Impala shell here the SET command only displays the REGULAR and
ADVANCED options while SET ALL shows them all.
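
The display rule can be sketched as a filter over option levels. The level names come from this message; the function name and the placeholder options are made up, since the real options and their levels are not listed here.

```python
from collections import OrderedDict

LEVELS = ("REGULAR", "ADVANCED", "DEVELOPMENT", "DEPRECATED")
DEFAULT_VISIBLE = ("REGULAR", "ADVANCED")


def options_to_show(options, show_all=False):
    """Group (name, value, level) triples by level, in LEVELS order."""
    wanted = LEVELS if show_all else DEFAULT_VISIBLE
    grouped = OrderedDict((level, []) for level in wanted)
    for name, value, level in options:
        if level in grouped:
            grouped[level].append((name, value))
    return grouped


# Placeholder options for illustration.
options = [
    ("opt_a", "1", "REGULAR"),
    ("opt_b", "0", "ADVANCED"),
    ("opt_c", "", "DEVELOPMENT"),
    ("opt_d", "", "DEPRECATED"),
]
print(list(options_to_show(options)))                 # plain SET
print(list(options_to_show(options, show_all=True)))  # SET ALL
```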

If the Impala shell connects to an Impala daemon that predates this change
then all the options would be displayed in the REGULAR group.

Change-Id: I75720d0d454527e1a0ed19bb43cf9e4f018ce1d1
Reviewed-on: http://gerrit.cloudera.org:8080/8447
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-28 00:31:15 +00:00
Thomas Tauber-Marshall
abd9b0e70a IMPALA-4591: Bound Kudu client error mem usage
Previously, Kudu client errors could grow in size unbounded,
potentially causing the process to be killed. This patch sets a
bound on the mem that can be used for these error messages, with
the size determined by the flag 'kudu_error_buffer_size'.

If the errors for a Kudu client exceed this size, the query will fail,
as some errors will be dropped and we won't be able to tell if all of
the errors can be safely ignored.
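
The bounding logic can be illustrated with a small buffer that tracks its own size. This is a conceptual sketch with hypothetical names; the actual limit is enforced through the Kudu client and the 'kudu_error_buffer_size' flag.

```python
class BoundedErrorBuffer:
    """Collects error messages up to a fixed number of bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.errors = []
        self.overflowed = False

    def add(self, message):
        size = len(message.encode("utf-8"))
        if self.used_bytes + size > self.max_bytes:
            self.overflowed = True  # this error is dropped
            return False
        self.errors.append(message)
        self.used_bytes += size
        return True


def query_status(buffer):
    # Once any error was dropped, we can no longer prove the rest are ignorable.
    return "FAILED" if buffer.overflowed else "OK"


buf = BoundedErrorBuffer(max_bytes=32)
buf.add("key already present")
buf.add("key already present")  # does not fit, so it is dropped
print(query_status(buf))
```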

Testing:
- Added a custom cluster test that verifies that a query that exceeds
  the limit fails.

Change-Id: I186ddb3f3b5865e08f17dba57cf6640591d06b14
Reviewed-on: http://gerrit.cloudera.org:8080/8464
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-27 22:28:37 +00:00
Vuk Ercegovac
628f19ed0b IMPALA-6092: avoid drop/create function interactions in e2e tests
The e2e unit tests for udfs can interact via the backend
lib_cache, causing test flakes. IMPALA-6215 explains a
race between the lib_cache and UdfExecutor in the frontend
which is likely the root cause.
Two e2e tests use the same jar (test_java_udfs and
test_udf_invalid_symbol), test_udf_invalid_symbol drops a
function from that jar, which causes the use of that jar to
fail in the test_java_udfs test. Since the state of lib_cache
is per process, its state causes these interactions across
unit tests.
This change avoids the interactions by using separate jars for
the separate tests.

Change-Id: Ica3538788b1d2ab5e361261e2ade62780b838e65
Reviewed-on: http://gerrit.cloudera.org:8080/8593
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-27 21:20:20 +00:00
Tim Armstrong
1a7b0d0bdc IMPALA-6227: deflake admission stress tests
The problem was that, during the initial admission decision phase, some
queries were initially queued then dequeued once memory came available.
All of the accounting in the test implicitly relies on queries not being
dequeued until queries are later explicitly ended, so if this happened,
the test broke in multiple subtle ways.

This happened because the query only scanned a small number of
rows, which could be all buffered on the receiver side of the
exchange even before the client fetched any rows from the coordinator.
This means that the reserved memory on some backends could increase
then decrease during the initial admission phase, resulting in a
query being queued then dequeued.

The fix is to increase the number of rows returned by the query so that
all fragments remain active during the initial admission phase.
This increased test execution time somewhat, so I also had to bump the
queue wait timeout for the admission stress tests (they assume that
queries don't time out in the queue).

Testing:
Ran the test under debug, release and ASAN builds, i.e.

  impala-py.test tests/custom_cluster/test_admission_controller.py \
    --workload_exploration_strategy="functional-query:exhaustive"

I looped the mem_limit test for a while to confirm it didn't reproduce
(it reproduced reliably every 2-3 iterations before this fix).

It still reproduces every 5-10 runs with exhaustive+release, so I
need to do further work to make it more robust.

Change-Id: Iafb3af0ce68f96e5d713dbb3b37dd0b50ea66bb4
Reviewed-on: http://gerrit.cloudera.org:8080/8631
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-23 07:48:18 +00:00
Vuk Ercegovac
21a96ed2e3 IMPALA-4985: use parquet stats of nested types for dynamic pruning
Currently, parquet row-groups can be pruned at run-time using
min/max stats when predicates (in, binary) are specified for
column scalar types. This patch extends pruning to nested types
for the same class of predicates. A nested value is an instance
of a nested type (struct, array, map). A nested value consists of
other nested and scalar values (as declared by its type).
Predicates that can be used for row-group pruning must be applied to
nested scalar values. In addition, the parent of the nested scalar
must also be required, that is, not empty. The latter requirement
is conservative: some filters that could be used for pruning are
not used for correctness reasons.

Testing:
- extended nested-types-parquet-stats e2e test cases.

Change-Id: I0c99e20cb080b504442cd5376ea3e046016158fe
Reviewed-on: http://gerrit.cloudera.org:8080/8480
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-22 22:00:16 +00:00
Tim Armstrong
7487c5de04 IMPALA-1575: part 2: yield admission control resources
This change releases admission control resources more eagerly,
once the query has finished actively executing. Some resources
(tracked and untracked) are still consumed by the client request
as long as it remains open, e.g. memory for control structures
and the result cache. However, these resources are relatively
small and should not block admission of new queries.

The same as in part 1, query execution is considered to be finished
under any of the following conditions:
1. The query encounters an error and fails
2. The query is cancelled due to the idle query timeout
3. The query reaches eos (or the DML completes)
4. The client cancels the query without closing the query

Admission control resources are released in two ways:
1. by calling AdmissionController::ReleaseQuery() on the coordinator
   promptly after query execution finishes, instead of waiting for
   UnregisterQuery(). This means that the query and its memory are
   no longer considered "admitted".
2. by changing the behaviour of MemTracker::GetPoolMemReserved() so
   that it is aware of when a query has finished executing and does not
   consider its entire memory limit to be "reserved".

The preconditions for releasing an admitted query are subtle because the
queries are being admitted to a distributed system, not just the
coordinator.  The comment for ReleaseAdmissionControlResources()
documents the preconditions and rationale. Note that the preconditions
are not weaker than the preconditions of calling UnregisterQuery()
before this patch.

Testing:
TestAdmissionController is extended to end queries in four ways:
cancellation by client, idle timeout, the last row being fetched,
and the client closing the query. The test uses a mix of all four.
After the query ends, all clients wait for the test to complete
before closing the query or closing the connection. This ensures
that the admission control decisions are based entirely on the
query end behavior. This test works for both query admission control
and mem_limit admission control and can detect both kinds of admission
control resources ("admitted" and "reserved") not being released
promptly.

I ran into a problem similar to IMPALA-3772 with the admission control
tests becoming flaky due to query timeouts on release builds, which I
solved in a similar way by increasing the frequency of statestore
updates.

This is based on an earlier patch by Joe McDonnell.

Change-Id: Ib1fae8dc1c4b0eca7bfa8fadae4a56ef2b37947a
Reviewed-on: http://gerrit.cloudera.org:8080/8581
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-20 04:34:47 +00:00
Lars Volker
10334260d8 IMPALA-6109: xfail TestHdfsUnknownErrors::test_hdfs_safe_mode_error_255
The test puts the HDFS name node into safe mode to trigger an "Unknown
Error 255" and verifies that the error details can be obtained correctly
via the libHDFS API. However, putting the name node into safe mode can
trip up HBase (HBASE-18738), which causes sporadic failures of our other
HBase tests. To prevent this, we xfail the test until the HBase issue
has been addressed (or we find a better way to trigger a 255 error).
IMPALA-6212 tracks re-enabling the test in the future.

Change-Id: I55979bed07147409949b798d4beb7a3b3b7ec5c3
Reviewed-on: http://gerrit.cloudera.org:8080/8590
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-18 03:22:27 +00:00
Thomas Tauber-Marshall
2510fe0aa0 IMPALA-4252: Min-max runtime filters for Kudu
This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on whether it has HDFS or Kudu targets, respectively.

In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.

Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.

Min-max filters are populated by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.

For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.

Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.

Functional Testing:
- Added new planner tests and updated the old ones. (in old tests, a
  lot of runtime filters are renumbered as we always generate min-max
  filters even if they don't end up getting assigned and they take up
  some of the RF ids).
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.

Perf Testing:
- All tests run on Kudu stress cluster (10 nodes) and tpch_100_kudu,
  timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
  (full self join of lineitem). The difference in running time was
  negligible - 24.46s with filters on, 24.15s with filters off for
  a ~1% slowdown.
- Ran a contrived query with a filter that eliminates all rows (self
  join on lineitem with a join condition that never matches). The
  filters resulted in a significant speedup - 0.26s with filters on,
  1.46s with filters off for a ~5.6x speedup. This query is added to
  targeted-perf.

Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Reviewed-on: http://gerrit.cloudera.org:8080/7793
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-17 21:33:51 +00:00
Tim Armstrong
ae116b5bf7 IMPALA-4177,IMPALA-6039: batched bit reading and rle decoding
Switch the decoders to using more batch-oriented interfaces. As an
intermediate step this doesn't make the interfaces of LevelDecoder
or DictDecoder batch-oriented, only the lower-level utility classes.

The next step would be to change those interfaces to be batch-oriented
and make the corresponding optimisations in parquet. This could deliver much
larger perf improvements than the current patch.

The high-level changes are:
* BitReader -> BatchedBitReader, which is built to unpack runs of 32
  bit-packed values efficiently.
* RleDecoder -> RleBatchDecoder, which exposes the repeated and literal
  runs to the caller and uses BatchedBitReader to unpack literal runs
  efficiently.
* Dict decoding uses RleBatchDecoder to decode repeated runs efficiently
  and uses the BitPacking utilities to unpack and encode in a single
  step.
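
To illustrate what "exposing the repeated and literal runs to the caller" can look like, here is a short Python sketch of a decoder for the Parquet RLE/bit-packed hybrid encoding that yields whole runs instead of single values. It is not the RleBatchDecoder code; the function names are hypothetical and the sketch favours clarity over speed.

```python
def read_uleb128(buf, pos):
    """Read an unsigned LEB128 varint; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def unpack_bits(buf, pos, bit_width, count):
    """Unpack `count` LSB-first bit-packed values in one step."""
    nbytes = (bit_width * count + 7) // 8
    bits = int.from_bytes(buf[pos:pos + nbytes], "little")
    mask = (1 << bit_width) - 1
    values = [(bits >> (i * bit_width)) & mask for i in range(count)]
    return values, pos + nbytes


def decode_runs(buf, bit_width):
    """Yield ("repeated", value, count) or ("literal", values) per run."""
    pos = 0
    value_bytes = (bit_width + 7) // 8
    while pos < len(buf):
        header, pos = read_uleb128(buf, pos)
        if header & 1:
            # Literal run: header >> 1 is the number of 8-value groups.
            values, pos = unpack_bits(buf, pos, bit_width, (header >> 1) * 8)
            yield ("literal", values)
        else:
            # Repeated run: one value stored once, repeated header >> 1 times.
            value = int.from_bytes(buf[pos:pos + value_bytes], "little")
            pos += value_bytes
            yield ("repeated", value, header >> 1)
```

A caller that receives a repeated run can handle all of its values with a single lookup or fill, which is the kind of saving a batch-oriented interface makes possible.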

Also removes an older benchmark that isn't too interesting (since
the batch-oriented approach to encoding and decoding is so much
faster than the value-by-value approach).

Testing:
* Ran core tests.
* Updated unit tests to exercise new code.
* Added test coverage for the deprecated bit-packed level encoding to
  verify that it still works (there was no coverage previously).

Perf:
Single-node benchmarks showed a few % performance gain. 16 node cluster
benchmarks only showed a gain for TPC-H nested.

Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6
Reviewed-on: http://gerrit.cloudera.org:8080/8267
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-16 21:23:09 +00:00
Philip Zeyliger
155bb77649 Remove unused/defunct Maven repositories.
Removes three Maven repositories. davidtrott and codehaus both don't
exist any more, so they're not doing anyone any good. (We had previously
cleaned up Codehaus in IMPALA-5224, but a reference was resurrected.)
The libphonenumber repo was simply misconfigured: the library exists in
Maven central in the "normal" place, and a subdirectory repo is
unnecessary.

To test this, I ran "buildall" after removing ~/.m2/ on my machine.

Change-Id: I79eb6c483561726c7cbaf86874001f1979128720
Reviewed-on: http://gerrit.cloudera.org:8080/8497
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2017-11-16 15:51:48 +00:00
Vuk Ercegovac
6769220e28 IMPALA-6198: marks a test as debug-only
The test_catalog_wait test uses flags that are only compiled
for debug binaries. This change marks the test as debug-only
so that it does not break release tests.

Change-Id: I92640b8192545cccea0411c04cc5fcf59fbefed0
Reviewed-on: http://gerrit.cloudera.org:8080/8573
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-16 12:04:17 +00:00
Tim Armstrong
9502e21d03 IMPALA-6188: make test_top_n_reclaim less flaky
Testing:
Previously I needed ~20 iterations to get the test to fail on my local
machine. After these changes I haven't been able to reproduce the
failure.

Change-Id: I2bea7b0f770dec362a6df075da4e340402bd1d5d
Reviewed-on: http://gerrit.cloudera.org:8080/8562
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-11-16 04:57:39 +00:00
Zoltan Borok-Nagy
6539e89c81 IMPALA-2235: Fix current db when shell auto-reconnects
The ImpalaShell didn't issue the 'USE <current-db>' command after
reconnecting to the Impala daemon. Therefore the client session
used the default DB after reconnection, not the previously selected DB.

Setting the current DB is done by the _validate_database method.
Before this commit it appended the "use <db>" command to the
command queue of the Cmd class. But, at this point we might already
have commands in the command queue that will run before the
"use <db>" command. In case of reconnection, we want to invoke
the USE command right away.

Also, the command processed by the precmd() method can entirely skip
the command queue, therefore it is not enough to insert the USE
command to the front of the command queue. We need to issue the
USE command with the onecmd() method to execute it immediately.

I extended the _validate_database method with an "immediately" flag.
If this flag is true, _validate_database will use the onecmd() method.
Otherwise, it will append the USE command to the command queue to
maintain the previous behaviour.
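
The difference between the two code paths can be sketched with Python's standard cmd module. This is a toy shell, not the ImpalaShell source: the MiniShell class, its `executed` list and the do_select handler are hypothetical, and only the "immediately" flag, onecmd() and the command queue come from the description above.

```python
import cmd


class MiniShell(cmd.Cmd):
    """Toy shell that records which commands have actually run."""

    def __init__(self):
        super().__init__()
        self.current_db = None
        self.executed = []

    def do_use(self, db):
        self.current_db = db
        self.executed.append("use " + db)

    def do_select(self, arg):
        self.executed.append("select " + arg)

    def _validate_database(self, db, immediately=False):
        command = "use " + db
        if immediately:
            # Run now, ahead of anything already waiting in the queue.
            self.onecmd(command)
        else:
            # Previous behaviour: run after the commands already queued.
            self.cmdqueue.append(command)
```

With a "select 1" already queued, calling `_validate_database("tpch", immediately=True)` switches the database before that query runs, whereas the default path leaves the USE command behind it in the queue.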

I added a new automated test suite named test_shell_interactive_reconnect.py
to the "custom cluster" tests. It sets the default database, and after
reconnection it checks if the shell set it again automatically.

One test case checks if the shell set the DB after manually reconnecting
to the impala daemon by issuing the CONNECT command.
The other test case checks if the shell set the DB after automatic
reconnection due to cluster restart.

I needed to back up the impala shell history file because I didn't
want to pollute it with the test cases (just like the way it is done in
tests/shell/test_shell_interactive.py). I created utility functions for
this in tests/shell/util.py and now test_shell_interactive.py and
the newly created test suite are using these utility functions.

Change-Id: I40dfa00ba0314d356fe8617446f516505c925e5e
Reviewed-on: http://gerrit.cloudera.org:8080/8368
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-15 22:42:22 +00:00
Tim Wood
a8c123b55a IMPALA-6160: Allow multiple statements in a Query object.
Testing:
- Reproduced problem with bin/run-workload.py.
- Ran bin/run-workload.py --workloads=tpch,targeted-perf,tpcds
  --impalads=localhost:21000,localhost:21001,localhost:21002
  --results_json_file=$PWD/perf_results/IMPALA-6160.json
  --query_iterations=3 --table_formats=parquet/none --plan_first
  --query_names='.*' (Close to command line that single_node_perf_run.py
  builds.)
- Manually reviewed perf_results/IMPALA-6160.json to verify presence of
  plans and proper splitting of query batches.

Change-Id: Iac86af181b7c42655f21d2c1efd4652dd35d9297
Reviewed-on: http://gerrit.cloudera.org:8080/8513
Tested-by: Impala Public Jenkins
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
2017-11-15 19:38:30 +00:00
Vuk Ercegovac
6a2b7a64fb IMPALA-4704: Turns on client connections when local catalog initialized.
Currently, impalad starts beeswax and hs2 servers even if the
catalog has not yet been initialized. As a result, client
connections see an error message stating that the impalad
is not yet ready.

This patch changes the impalad startup sequence to wait
until the catalog is received before opening beeswax and hs2 ports
and starting their servers.

Testing:
- python e2e tests that start a cluster without a catalog
  and check that client connections are rejected as expected.

Change-Id: I52b881cba18a7e4533e21a78751c2e35c3d4c8a6
Reviewed-on: http://gerrit.cloudera.org:8080/8202
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-13 21:14:14 +00:00
Tianyi Wang
fdf94a4003 IMPALA-6164: Fix stale query profile in TestAlwaysFalseFilter
TestAlwaysFalseFilter gets the query profile without fetching all the
rows, resulting in a stale query profile and failing the test. With
this patch all the rows are fetched before getting the query profile.
This is enough to get the final profile because the query profile
finalization is performed in Coordinator::GetNext after we hit eos.
A bug in Base64Decode related to query profile decoding is also fixed.
Currently Base64Decode may produce incorrect output length if the output
parameter is not initialized with 0.

Testing: TestAlwaysFalseFilter is run and passes 1000 times. It doesn't
pass 1000 times consecutively without this patch.

Change-Id: I04bb76d20541fa035d88167b593d1b8bc3873e89
Reviewed-on: http://gerrit.cloudera.org:8080/8498
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-10 04:27:59 +00:00
Thomas Tauber-Marshall
3a1073c87c IMPALA-6173: Fix SHOW CREATE TABLE for unpartitioned Kudu tables
IMPALA-5546 added the ability to create unpartitioned Kudu tables, but
when SHOW CREATE TABLE is run on such a table it still prints
'PARTITION BY' without a partition clause. This patch removes the
'PARTITION BY' from the output.

Testing:
- Added test that runs SHOW CREATE on an unpartitioned Kudu table.

Change-Id: Icc327266cfb8b5c05efec97348528cea6904bb20
Reviewed-on: http://gerrit.cloudera.org:8080/8506
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-09 23:59:13 +00:00
Tim Armstrong
a772f84562 IMPALA-6171: Revert "IMPALA-1575: part 2: yield admission control resources"
This reverts commit fe90867d89.

Change-Id: I3eec4b5a6ff350933ffda0bb80949c5960ecdf25
Reviewed-on: http://gerrit.cloudera.org:8080/8499
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-08 22:03:59 +00:00
Tim Armstrong
bf9c2f521f IMPALA-6151: add query-level fragment/backend counters
This adds NumBackends, NumFragments and NumFragmentInstances
counters to the query profile to make it easier to manually or
programmatically analyse the query.

Also add a num-queries-registered metric to track the number
of queries that have been executed but are not yet unregistered.

Testing:
Ran "select count(*) from alltypessmall" and checked profile:

    Backend startup latencies: Count: 3, min / max: 1ms / 1ms, 25th %-ile: 1ms, 50th %-ile: 1ms, 75th %-ile: 1ms, 90th %-ile: 1ms, 95th %-ile: 1ms, 99.9th %-ile: 1ms
    Per Node Peak Memory Usage: tarmstrong-box:22000(1.10 MB) tarmstrong-box:22001(1.02 MB) tarmstrong-box:22002(1.02 MB)
     - FiltersReceived: 0 (0)
     - FinalizationTimer: 0.000ns
     - NumBackends: 3 (3)
     - NumFragmentInstances: 4 (4)
     - NumFragments: 2 (2)

Ran some query tests (both beeswax and HS2) and manually checked the
num-queries-registered metric on the /metrics page when the queries
were running and after they finished. Added the metric to
test_metrics_are_zero() to make sure that there are no accounting
errors.
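
As an example of analysing these counters programmatically, the sketch below pulls the "name: pretty (raw)" counter lines out of profile text shaped like the excerpt above. The function name and the regular expression are assumptions based only on that excerpt, not on a documented profile format.

```python
import re

# Matches lines such as "     - NumBackends: 3 (3)" and captures the name
# and the raw integer in parentheses.
COUNTER_RE = re.compile(r"^\s*- (\w+): .*\((\d+)\)\s*$", re.MULTILINE)


def parse_counters(profile_text):
    """Return {counter_name: raw_value} for counters with a raw integer."""
    return {name: int(raw) for name, raw in COUNTER_RE.findall(profile_text)}


sample = """\
     - FiltersReceived: 0 (0)
     - FinalizationTimer: 0.000ns
     - NumBackends: 3 (3)
     - NumFragmentInstances: 4 (4)
     - NumFragments: 2 (2)
"""
counters = parse_counters(sample)
```

Lines without a parenthesised raw value, such as FinalizationTimer, are skipped by this pattern.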

Change-Id: I3df350414733e98d1ec28adc1c98f45bb0c4e3e9
Reviewed-on: http://gerrit.cloudera.org:8080/8461
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-07 21:44:34 +00:00