Commit Graph

66 Commits

Author SHA1 Message Date
Joe McDonnell
0c7c6a335e IMPALA-11977: Fix Python 3 broken imports and object model differences
Python 3 changed some object model methods:
 - __nonzero__ was removed in favor of __bool__
 - func_dict / func_name were removed in favor of __dict__ / __name__
 - The next() function was deprecated in favor of __next__
   (Code locations should use next(iter) rather than iter.next())
 - metaclasses are specified a different way
 - Locations that specify __eq__ should also specify __hash__

Python 3 also moved some packages around (urllib2, Queue, httplib,
etc), and this adapts the code to use the new locations (usually
handled on Python 2 via future). This also fixes the code to
avoid referencing exception variables outside the exception block
and variables outside of a comprehension. Several of these seem
like false positives, but it is better to avoid the warning.

This fixes these pylint warnings:
bad-python3-import
eq-without-hash
metaclass-assignment
next-method-called
nonzero-method
exception-escape
comprehension-escape

Testing:
 - Ran core tests
 - Ran release exhaustive tests

Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee
Reviewed-on: http://gerrit.cloudera.org:8080/19592
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
eb66d00f9f IMPALA-11974: Fix lazy list operators for Python 3 compatibility
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.

Python 2:
range(0,5) == [0,1,2,3,4]
True

Python 3:
range(0,5) == [0,1,2,3,4]
False

The fix is to wrap locations with list(). i.e.

Python 3:
list(range(0,5)) == [0,1,2,3,4]
True

Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).

Most of the changes were done via these futurize fixes:
 - libfuturize.fixes.fix_xrange_with_import
 - lib2to3.fixes.fix_map
 - lib2to3.fixes.fix_filter

This eliminates the pylint warnings:
 - xrange-builtin
 - range-builtin-not-iterating
 - map-builtin-not-iterating
 - zip-builtin-not-iterating
 - filter-builtin-not-iterating
 - reduce-builtin
 - deprecated-itertools-function

Testing:
 - Ran core job

Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Joe McDonnell
2b550634d2 IMPALA-11952 (part 2): Fix print function syntax
Python 3 now treats print as a function and requires
the parenthesis in invocation.

print "Hello World!"
is now:
print("Hello World!")

This fixes all locations to use the function
invocation. This is more complicated when the output
is being redirected to a file or when avoiding the
usual newline.

print >> sys.stderr , "Hello World!"
is now:
print("Hello World!", file=sys.stderr)

To support this properly and guarantee equivalent behavior
between python 2 and python 3, all files that use print
now add this import:
from __future__ import print_function

This also fixes random flake8 issues that intersect with
the changes.

Testing:
 - check-python-syntax.sh shows no errors related to print

Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351
Reviewed-on: http://gerrit.cloudera.org:8080/19552
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-02-28 17:11:50 +00:00
Steve Carlin
bb9fb663ce IMPALA-10778: Allow impala-shell to connect directly to HS2
Impala-shell already uses HS2 protocol to connect to Impalad.
This commit allows impala-shell to connect to any server (for
example, Hive) using the hs2 protocol. This will be done via
the "--strict_hs2_protocol" option.

When the "--strict_hs2_protocol" option is turned on, only features
supported by hs2 will work. For instance, "runtime-profile" is an
impalad specific feature and will be disabled.

The "--strict_hs2_protocol" will only work on servers that abide
by the strict definition of what is supported by HS2. So one will
be able to connect to Hive in this mode, but connections to Impala
will not work. Any feature supported by Hive (e.g. kerberos
authentication) should work as well.

Note: While authentication should work, the test framework is not
set up to create an HS2 server that does authentication at this point
so this feature should be used with caution.
Change-Id: I674a45640a4a7b3c9a577830dbc7b16a89865a9e
Reviewed-on: http://gerrit.cloudera.org:8080/17660
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-27 09:45:59 +00:00
Csaba Ringhofer
94f67a3432 IMPALA-7825: Upgrade Thrift version to 0.11.0
Before this patch Impala mainly used Thrift 0.9.3, but it was
possible to compile Impala shell with Thrift 0.11.0, so the 0.11.0
Thrift lib was already included in the toolchain.

Most of the changes are related to replacing boost:: with std::
shared_ptr-s in cpp code (this is a continuation of patch by Sahil).

The Thrift upgrade also needs an Impyla release with Thrift 0.11.0, as
Impala's test framework relies on Impyla. A thrift_sasl release is also
needed, because it currently pins Thrift version to 0.9.3 for Python 2.

The current patch uses alpha releases from Impyla and thrift_sasl that
use thrift 0.11.0.

Notable side effects:
- old logic to compile thrift for impala-shell with 0.11.0 was removed
- impala_shell's utf8 handling had to be updated as the new 0.11.0
  compilation happens with no_utf8strings. This also made things a
  bit faster, e.g the following is ~0.22s instead of ~0.25
  shell/impala_shell.py \
    -B -q "select * from functional_parquet.alltypes;" > /dev/null
- THRIFT-3921 changed the stream operators to print an enum's name
  instead of its number, leading to slightly different messages
  in some cases.
- "templates" was added to the thift generator's parameters to avoid
  a compilation issue (related to IMPALA-10600). I didn't notice any
  change in compilation time. This option generated .tcc files with
  templetized readers/writers for Thrift types. Currently we don't
  use these, but they could potentially speed up (de)serialization.

Testing:
- ran Impyla's test suite with Python 2 and 3
- ran core tests

Change-Id: Idd13f177b4f7acc07872ea6399035aa180ef6ab6
Reviewed-on: http://gerrit.cloudera.org:8080/17170
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-27 13:36:54 +00:00
Joe McDonnell
2357958e73 IMPALA-10304: Fix log level and format for pytests
Recent testing showed that the pytests are not
respecting the log level and format set in
conftest.py's configure_logging(). It is using
the default log level of WARNING and the
default formatter.

The issue is that logging.basicConfig() is only
effective the first time it is called. The code
in lib/python/impala_py_lib/helpers.py does a
call to logging.basicConfig() at the global
level, and conftest.py imports that file. This
renders the call in configure_logging()
ineffective.

To avoid this type of confusion, logging.basicConfig()
should only be called from the main() functions for
libraries. This removes the call in lib/python/impala_py_lib
(as it is not needed for a library without a main function).
It also fixes up various other locations to move the
logging.basicConfig() call to the main() function.

Testing:
 - Ran the end to end tests and custom cluster tests
 - Confirmed the logging format
 - Added an assert in configure_logging() to test that
   the INFO log level is applied to the root logger.

Change-Id: I5d91b7f910b3606c50bcba4579179a0bc8c20588
Reviewed-on: http://gerrit.cloudera.org:8080/16679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-30 15:32:21 +00:00
Joe McDonnell
bfdc5bf6af IMPALA-9702: Cleanup unique_database directories
If there are external tables in a database, drop database cascade
won't remove the external table locations. If those locations
are inside the database, then the database directory does not
get removed. Some tests that use unique_database fail when
running for the second time (or with a data snapshot) due to the
preexisting files.

This adds code to remove the database directory for
unique_database. It also adds some debugging statements that
list the files at the beginning of bin/run-all-tests.sh and
again at the end.

Testing:
 - Ran a core job and verified that the unique database
   directories are being removed
 - Ran TestMixedPartitions::test_incompatible_avro_partition_in_non_avro_table()
   multiple times and it passes when it previously failed.

Change-Id: I0530c028e5e7c241dfc054f04c78e2a045c2d035
Reviewed-on: http://gerrit.cloudera.org:8080/16015
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-02 14:01:23 +00:00
David Knupp
bc9d7e063d IMPALA-3343, IMPALA-9489: Make impala-shell compatible with python 3.
This is the main patch for making the the impala-shell cross-compatible with
python 2 and python 3. The goal is wind up with a version of the shell that will
pass python e2e tests irrepsective of the version of python used to launch the
shell, under the assumption that the test framework itself will continue to run
with python 2.7.x for the time being.

Notable changes for reviewers to consider:

- With regard to validating the patch, my assumption is that simply passing
  the existing set of e2e shell tests is sufficient to confirm that the shell
  is functioning properly. No new tests were added.

- A new pytest command line option was added in conftest.py to enable a user
  to specify a path to an alternate impala-shell executable to test. It's
  possible to use this to point to an instance of the impala-shell that was
  installed as a standalone python package in a separate virtualenv.

  Example usage:
  USE_THRIFT11_GEN_PY=true impala-py.test --shell_executable=/<path to virtualenv>/bin/impala-shell -sv shell/test_shell_commandline.py

  The target virtualenv may be based on either python3 or python2. However,
  this has no effect on the version of python used to run the test framework,
  which remains tied to python 2.7.x for the foreseeable future.

- The $IMPALA_HOME/bin/impala-shell.sh now sets up the impala-shell python
  environment independenty from bin/set-pythonpath.sh. The default version
  of thrift is thrift-0.11.0 (See IMPALA-9489).

- The wording of the header changed a bit to include the python version
  used to run the shell.

    Starting Impala Shell with no authentication using Python 3.7.5
    Opened TCP connection to localhost:21000
    ...

    OR

    Starting Impala Shell with LDAP-based authentication using Python 2.7.12
    Opened TCP connection to localhost:21000
    ...

- By far, the biggest hassle has been juggling str versus unicode versus
  bytes data types. Python 2.x was fairly loose and inconsistent in
  how it dealt with strings. As a quick demo of what I mean:

  Python 2.7.12 (default, Nov 12 2018, 14:36:49)
  [GCC 5.4.0 20160609] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> d = 'like a duck'
  >>> d == str(d) == bytes(d) == unicode(d) == d.encode('utf-8') == d.decode('utf-8')
  True

  ...and yet there are weird unexpected gotchas.

  >>> d.decode('utf-8') == d.encode('utf-8')
  True
  >>> d.encode('utf-8') == bytearray(d, 'utf-8')
  True
  >>> d.decode('utf-8') == bytearray(d, 'utf-8')   # fails the eq property?
  False

  As a result, this was inconsistency was reflected in the way we handled
  strings in the impala-shell code, but things still just worked.

  In python3, there's a much clearer distinction between strings and bytes, and
  as such, much tighter type consistency is expected by standard libs like
  subprocess, re, sqlparse, prettytable, etc., which are used throughout the
  shell. Even simple calls that worked in python 2.x:

  >>> import re
  >>> re.findall('foo', b'foobar')
  ['foo']

  ...can throw exceptions in python 3.x:

  >>> import re
  >>> re.findall('foo', b'foobar')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/data0/systest/venvs/py3/lib/python3.7/re.py", line 223, in findall
      return _compile(pattern, flags).findall(string)
  TypeError: cannot use a string pattern on a bytes-like object

  Exceptions like this resulted in a many, if not most shell tests failing
  under python 3.

  What ultimately seemed like a better approach was to try to weed out as many
  existing spurious str.encode() and str.decode() calls as I could, and try to
  implement what is has colloquially been called a "unicode sandwich" -- namely,
  "bytes on the outside, unicode on the inside, encode/decode at the edges."

  The primary spot in the shell where we call decode() now is when sanitising
  input...

  args = self.sanitise_input(args.decode('utf-8'))

  ...and also whenever a library like re required it. Similarly, str.encode()
  is primarily used where a library like readline or csv requires is.

- PYTHONIOENCODING needs to be set to utf-8 to override the default setting for
  python 2. Without this, piping or redirecting stdout results in unicode errors.

- from __future__ import unicode_literals was added throughout

Testing:

  To test the changes, I ran the e2e shell tests the way we always do (against
  the normal build tarball), and then I set up a python 3 virtual env with the
  shell installed as a package, and manually ran the tests against that.

  No effort has been made at this point to come up with a way to integrate
  testing of the shell in a python3 environment into our automated test
  processes.

Change-Id: Idb004d352fe230a890a6b6356496ba76c2fab615
Reviewed-on: http://gerrit.cloudera.org:8080/15524
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-18 05:13:50 +00:00
Laszlo Gaal
c97191b6a5 IMPALA-9626: Use Python from the toolchain for Impala
Historically Impala used the Python2 version that was available on
the hosting platform, as long as that version was at least v2.6.
This caused constant headache as all Python syntax had to be kept
compatible with Python 2.6 (for Centos 6). It also caused a recent problem
on Centos 8: here the system Python version was compiled with the
system's GCC version (v8.3), which was much more recent than the Impala
standard compiler version (GCC 4.9.2). When the Impala virtualenv was
built, the system Python version supplied C compiler switches for models
containing native code that were unknown for the Impala version of GCC,
thus breaking virtualenv installation.

This patch changes the Impala virtualenv to always use the Python2
version from the toolchain, which is built with the toolchain compiler.

This ensures that
- Impala always has a known Python 2.7 version for all its scripts,
- virtualenv modules based on native code will always be installable, as
  the Python environment and the modules are built with the same compiler
  version.

Additional changes:
- Add an auto-use fixture to conftest.py to check that the tests are
  being run with Python 2.7.x
- Make bootstrap_toolchain.py independent from the Impala virtualenv:
  remove the dependency on the "sh" library

Tests:
- Passed core-mode tests on CentOS 7.4
- Passed core-mode tests in Docker-based mode for centos:7
  and ubuntu:16.04

Most content in this patch was developed but not published earlier
by Tim Armstrong.

Change-Id: Ic7b40cef89cfb3b467b61b2d54a94e708642882b
Reviewed-on: http://gerrit.cloudera.org:8080/15624
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-16 01:08:00 +00:00
David Knupp
f9cf70d035 IMPALA-9129: Add a test fixture that cleans up intentional core dumps
Some negative tests produce core dumps intentionally. We should have a
way of removing these as part of test cleanup.

For custom cluster tests, it's likely the cores may actually be generated
during the base class setup phase, which means it's too early for the
test fixture to really be useful. Such was the case with the test case
TestAuthorizationProvider::test_invalid_provider_flag. In this instance,
we had to add the same steps directly to the tests.

Testing done:
For test_invalid_provider_flag, I made sure I had pre-existing core files
in the IMPALA_HOME directory, then ran the test to confirm new cores were
removed.

-- 2019-11-06 19:53:27,303 INFO  MainThread: Removing core.impalad.61852 created by test_invalid_provider_flag
-- 2019-11-06 19:53:27,375 INFO  MainThread: Removing core.impalad.61856 created by test_invalid_provider_flag
-- 2019-11-06 19:53:27,450 INFO  MainThread: Removing core.impalad.61849 created by test_invalid_provider_flag

...and then made sure the pre-existing cores were still present.

Change-Id: I778f27e820a6983894c1294d35627ddb04f5a51a
Reviewed-on: http://gerrit.cloudera.org:8080/14640
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-08 02:22:59 +00:00
Bharath Vissapragada
72c9370856 IMPALA-8717: impala-shell support for HS2 HTTP endpoint
Adds impala-shell support to connect to HiveServer2 HTTP endpoint.
Relies on toolchain change at https://gerrit.cloudera.org/#/c/13725/.

Use --protocol='hs2-http' to enable this behavior.

Example usages:
---------------
impala-shell --protocol='hs2-http'  (No auth)
impala-shell --protocol='hs2-http' --ldap -u..... (PLAIN auth)
impala-shell --protocol-'hs2-http' --ssl --ca_cert... (TLS)
impala-shell --protocol='hs2-http' --ldap --ssl --ca_cert... (LDAP +
TLS)

Limitations:
-----------
- Does not support Kerberos (-k) due to lack ot SPNEGO support.

Testing:
--------
- Parameterized existing shell tests to support this combination.
- Added shell test coverage for LDAP auth.

Change-Id: I8323950857dfe1c1dfd5377fde79f87bc2ce9534
Reviewed-on: http://gerrit.cloudera.org:8080/13746
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
2019-07-29 05:43:48 +00:00
Tim Armstrong
9ecbe7d3dc IMPALA-8553,IMPALA-8552: fix checks for remote cluster
Apparently IMPALA_REMOTE_URL is not generally used for remote cluster
tests: only --testing_remote_cluster is reliably set. Fix the
is_remote_cluster() implementation to take into account
REMOTE_DATA_LOAD and --testing_remote_cluster in addition to
IMPALA_REMOTE_URL. Consistently use is_remote_cluster() in
other tests instead of checking the pytest flag directly.

There were a few lifecycle headaches with how
ImpalaTestClusterProperties is used:
* common.environ is imported from conftest, which means that
  the top-level code in the file runs *before* pytest
  command-line arguments have been registered and parsed.
* ImpalaTestClusterProperties is used by various code,
  like build_flavor_timeout(), which runs before pytest
  command-line arguments have been parsed.
* ImpalaTestClusterProperties is called from non-pytest
  scripts like start-impala-cluster.py, so the command-line
  arguments are not available.

I dealt with the above challenges by making a few changes
to do the detection later:
* Lazily initializing a singleton ImpalaTestClusterProperties.
  This was not strictly necessary but makes the whole problem
  less sensitive to import order and module dependencies.
* Adding cluster_properties fixture to make ImpalaTestClusterProperties
  available in tests without additional boilerplate.
* Removing the caching of the local/remote build calculation.
  ImpalaTestClusterProperties is instantiated outside of python
  tests, but is_remote_cluster() is only called from python tests,
  so if we check flags in is_remote_cluster() we'll get the
  right results reliably.

As a workaround to unblock remote tests, also assume catalog_v1 if
accessing the web UI fails.

Testing:
Ran core tests against a regular minicluster.

Ran tests against a remote cluster

Change-Id: Ifa6b2a1391f53121d3d7c00c5cf0a57590899ce4
Reviewed-on: http://gerrit.cloudera.org:8080/13386
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-20 20:27:31 +00:00
Todd Lipcon
800f635855 IMPALA-8667. Remove --pull_incremental_stats flag
This flag was added as a "chicken bit" -- so we could disable the new
feature if we had some problems with it. It's been out in the wild for a
number of months and we haven't seen any such problems, so at this point
let's stop maintaining the old code path.

Change-Id: I8878fcd8a2462963c7db3183a003bb9816dda8f9
Reviewed-on: http://gerrit.cloudera.org:8080/13671
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-19 01:07:00 +00:00
Tim Armstrong
2ca7f8e7c0 IMPALA-7995: part 1: fixes for e2e dockerised impala tests
This fixes all core e2e tests running on my local dockerised
minicluster build. I do not yet have a CI job or script running
but I wanted to get feedback on these changes sooner. The second
part of the change will include the CI script and any follow-on
fixes required for the exhaustive tests.

The following fixes were required:
* Detect docker_network from TEST_START_CLUSTER_ARGS
* get_webserver_port() does not depend on the caller passing in
  the default webserver port. It failed previously because it
  relied on start-impala-cluster.py setting -webserver_port
  for *all* processes.
* Add SkipIf markers for tests that don't make sense or are
  non-trivial to fix for containerised Impala.
* Support loading Impala-lzo plugin from host for tests that depend on
  it.
* Fix some tests that had 'localhost' hardcoded - instead it should
  be $INTERNAL_LISTEN_HOST, which defaults to localhost.
* Fix bug with sorting impala daemons by backend port, which is
  the same for all dockerised impalads.

Testing:
I ran tests locally as follows after having set up a docker network and
starting other services:

  ./buildall.sh -noclean -notests -ninja
  ninja -j $IMPALA_BUILD_THREADS docker_images
  export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster"
  export FE_TEST=false
  export BE_TEST=false
  export JDBC_TEST=false
  export CLUSTER_TEST=false
  ./bin/run-all-tests.sh

Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755
Reviewed-on: http://gerrit.cloudera.org:8080/12639
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-13 02:42:32 +00:00
Thomas Tauber-Marshall
a4ad8f35f7 IMPALA-8390: clean up test vectors in test_cancellation.py
Due to changes to TestCancellation made in IMPALA-7205 that were not
reflected in TestCancellationSerial and TestCancellationFullSort,
test_cancel_insert has not been running at all and test_cancel_sort
has been running with unintended parameters.

This patch re-enables test_cancel_insert, while including a number of
constraints on its parameters to keep test execution time reasonable.
It also fixes an incorrect constraint on test_cancel_sort.

The patch also makes some related improvements:
- Removes an xfail on test_cancel_insert related to a bug that is
  fixed now.
- When ImpalaTestVector.get_value() is called with a value name that
  does not actually exist in the vector, the result is a StopIteration
  exception. Due to python's questionable habit of using exceptions
  for flow control, StopIteration is frequently treated not as an
  error but as the normal end of iteration, which can result in
  unexpected behavior, eg. when pytest_generate_tests raises a
  StopIteration pytest just silently ignores it and drops the test
  case. This patch modifies get_value() to instead raise a ValueError
  in this situation.
- When a test has no vectors generated for it, the name of the test is
  now included in the logged warning.

Testing:
- Ran full core and exhaustive runs and verified that the expected
  test cases are run for test_cancellation.py now

Change-Id: I9673fe82bda5314aff6a51d1961759ff286fbf6f
Reviewed-on: http://gerrit.cloudera.org:8080/12960
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-13 00:27:19 +00:00
Austin Nobis
515ded0035 IMPALA-7918: Remove support for authorization policy file
This patch removes support for the authorization_policy_file. When the
flag is passed, the backend will issue a warning message that the flag
is being ignored.

Tests relying on the authorization_policy_file flag have been updated to
rely on sentry server instead.

Testing:
- Ran all FE tests
- Ran all E2E tests

Change-Id: Ic2a52c2d5d35f58fbff8c088fb0bf30169625ebd
Reviewed-on: http://gerrit.cloudera.org:8080/12637
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-25 20:23:33 +00:00
Philip Zeyliger
6d5ca479f6 IMPALA-7666: Propagate name of test into CLIENT_IDENTIFIER.
To facilitate correlating test failures (where we sometimes know things
like fragment id) with the tests that generated those queries, we can
stuff the test name into CLIENT_IDENTIFIER.

The mechanics here are to create a global, tests.common.current_node to
store the current test, create a plugin in conftest to set this when
entering a test, and then configuring connections as they're created
deep in test code.

Change-Id: I2f685fd16982d73ad3fc0f4a7578c5ad83b9a84c
Reviewed-on: http://gerrit.cloudera.org:8080/12177
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-11 05:34:07 +00:00
Sahil Takiar
691f9d9ff9 IMPALA-6249: Expose several build flags via web UI
Exposes a list of build flags via the impalad web UI. The build flags
can be viewed on the root page under the "Version" section. They can
be accessed via other tests through the debug version of the root page
(e.g. adding &json to the URL). The build flags are listed in a JSON
array so that they can be parsed easily. This should help run Impala
tests against a remote Impala cluster.

The build flags are read in CMakeLists.txt and then stored in
preprocessor variables.

Three build flags are exposed as part of this commit:
- Is_NDEBUG = [true, false]
    - Whether NDEBUG was true or false at compile time
- CMake_Build_Type = [DEBUG, RELEASE, ADDRESS_SANITIZER, TIDY, UBSAN,
  UBSAN_FULL, TSAN, CODE_COVERAGE_RELEASE, CODE_COVERAGE_DEBUG]
    - The value of CMAKE_BUILD_TYPE at compile time
- Library_Link_Type = [DYNAMIC, STATIC]
    - Derived from the compile time value of BUILD_SHARED_LIBS

There are a few other minor changes that are apart of this commit:

* The patch modifies environ.py so that it supports fetching build metadata
for both local and remote clusters.

* The tests under the tests/webserver directory were not being run because
'webserver' was not whitelisted in tests/run-tests.py. This patch fixes
that and addresses several test failures in run-tests.py.

* It reverts part of IMPALA-6947 so that their is no dependency from
start-impala-cluster.py to environ.py. The timeout discussed IMPALA-6947
is now set at compile time.

Testing:

Added new tests to webserver/test_web_pages.py to ensure that the build
flags are being set. Some tests are only run when run against a local
cluster because we have no way of getting the build info from a remote
cluster, whereas local clusters contain a .cmake_build_type file.

Change-Id: I47e3ad4cbf844909bdaf22a6f9d7bd915dce3f19
Reviewed-on: http://gerrit.cloudera.org:8080/11410
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-05 22:47:31 +00:00
Fredy Wijaya
0cd9151801 IMPALA-7713: Add test coverage for catalogd restart when authorization is enabled
This patch adds a test coverage for catalogd restart when authorization
is enabled to ensure all privileges in the impalad's catalogs get reset
after the catalogd restart to avoid stale privileges in the impalad's
catalogs, which can pose a security issue.

Testing:
- Ran all E2E authorization tests
- Added a new test

Change-Id: Ib9a168697401cf0b83c7a193fa477888b48cb369
Reviewed-on: http://gerrit.cloudera.org:8080/11696
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-17 05:33:05 +00:00
Tim Armstrong
d05f73f415 IMPALA-7647: Add HS2/Impyla dimension to TestQueries
I used some ideas from Alex Leblang's abandoned patch:
https://gerrit.cloudera.org/#/c/137/ in order to run .test files through
HS2. The advantage of using Impyla is that much of the code will be
reusable for any Python client implementing the standard Python dbapi
and does not require us implementing yet another thrift client.

This gives us better coverage of non-trivial result sets from HS2,
including handling of NULLs, error logs and more interesting result
sets than the basic HS2 tests.

I added HS2 coverage to TestQueries, which has a reasonable variety of
queries and covers the data types in alltypes. I also added
TestDecimalQueries, TestStringQuery and TestCharFormats to get coverage
of DECIMAL, CHAR and VARCHAR that aren't in alltypes. Coverage of
results sets with NULLs was limited so I added a couple of queries.

Places where results differ from Beeswax:
* Impyla is a Python dbapi client so must convert timestamps into python datetime
  objects, which only have microsecond precision. Therefore result
  timestamps within nanosecond precision are truncated.
* The HS2 interface reports the NULL type as BOOLEAN as a workaround for
  IMPALA-914.
* The Beeswax interface reported VARCHAR as STRING, but HS2 reports
  VARCHAR.

I dealt with different results by adding additional result sections so
that the expected differences between the clients/protocols were
explicit.

Limitations:
* Not all of the same methods are implemented as for beeswax, so some
  tests that have more complicated interactions with the client will not
  work with HS2 yet.
* We don't have a way to get the affected row count for inserts.

I also simplified the ImpalaConnection API by removing some unnecessary
methods and moved some generic methods to the base class.

Testing:
* Confirmed that it detected IMPALA-7588 by re-applying the buggy patch.
* Ran exhaustive and CentOS6 tests.

Change-Id: I9908ccc4d3df50365be8043b883cacafca52661e
Reviewed-on: http://gerrit.cloudera.org:8080/11546
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-09 00:45:10 +00:00
Thomas Tauber-Marshall
7eb64a9be7 IMPALA-7576: Add a timeout for all E2E tests
We've been seeing a lot of hangs in tests lately. This can waste
test resources by keeping machines busy, cause loss of coverage when
subsequent tests don't run, and can be difficult to diagnose if its
not clear which test hung.

This patch introduces a timeout of 2 hours for normal builds and 4
hours for slow builds for all tests run under pytest by using the
pytest-timeout plugin.

The timeouts were chosen to be generous to avoid false positives.
In recent runs I examined, the longest running test is
test_decimal_fuzz, which took 63 minutes in a DEBUG build and
162 minutes in an ASAN build.

Testing:
- Ran locally with a reduced timeout and confirmed the test is
  timed out when expected.

Change-Id: I301dd27a9767bfaef2756282014ef457a31956bd
Reviewed-on: http://gerrit.cloudera.org:8080/11447
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-17 21:49:01 +00:00
Vuk Ercegovac
72ee4a4275 IMPALA-7425: Change incremental stats to pull from catalogd.
Currently, incremental stats can consume a substantial
amount of metadata memory (per table, partition, column).
This metadata is transmitted from catalogd to all coordinators.
As a result, memory is used for all loaded tables that use
incremental stats all the time at all coordinators. A consequence
is that coordinators and catalogd die from OOM more often
when incremental stats are used and more network bandwidth is used.

This change removes incremental stats from impalads. These stats
are only needed when computing incremental statistics and merging
new results with the existing results. They are not used by queries.
As a result, the change requires that coordinators fetch
incremental stats directly from catalogd when computing incremental stats.
In addition, catalogd no longer sends incremental stats to coordinators
via the statestore.

The option is enabled by setting a new flag, --pull_incremental_statistics,
on the catalogd and all impalad coordinators.

Testing:
  - manual testing
  - added end-to-end tests with --pull_incremental_statistics enabled
    for the compute-stats-incremental.test
  - added fe CatalogTest for new catalogd service method
  - passes exhaustive tests when --pull_incremental_statistics is enabled
    and disabled

Change-Id: I9d564808ca5157afe4e091909ca6cdac76e60d6e
Reviewed-on: http://gerrit.cloudera.org:8080/11193
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-05 20:49:54 +00:00
Todd Lipcon
30bb0b3d89 tests: ensure consistent logging format across tests
Many of the test modules included calls to 'logging.basicConfig' at
global scope in their implementation. This meant that by just importing
one of these files, other tests would inherit their logging format. This
is typically a bad idea in Python -- modules should not have side
effects like this on import.

The format was additionally inconsistent. In some cases we had a "--"
prepended to the format, and in others we didn't. The "--" is very
useful since it lets developers copy-paste query-test output back into
the shell to reproduce an issue.

This patch fixes the above by centralizing the logging configuration in
a pytest hook that runs prior to all pytests. A few other non-pytest
related tools now configure logging in their "main" code which is only
triggered when the module is executed directly.

I tested that, with this change, logs still show up properly in the .xml
output files from 'run-tests.py' as well as when running tests manually
from impala-py.test

Change-Id: I55ef0214b43f87da2d71804913ba4caa964f789f
Reviewed-on: http://gerrit.cloudera.org:8080/11225
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-18 04:21:00 +00:00
Todd Lipcon
4aec50484a IMPALA-7308. Support Avro tables in LocalCatalog
This adds support for loading Avro-formatted tables in LocalCatalog. In
the case that the table properties indicate a table is Avro-formatted,
the semantics are identical to the existing catalog implementation:

- if an explicit avro schema is specified, it overrides the schema
  provided by the HMS
- if no explicit avro schema is specified, one is inferred, and then the
  inferred schema takes the place of the one provided by the HMS (thus
  promoting columns like TINYINT to INT)
- on COMPUTE STATS, if any discrepancy is discovered between the HMS
  schema and the inferred schema, an error is emitted.

The semantics for LocalCatalog are slightly different in the case of
tables which have not been configured as Avro format on the table level:

The existing implementation has the behavior that, when a table is
loaded, all partitions are inspected, and, if any partition is
discovered with Avro format, the above rules are applied. This has some
very unexpected results, described in an earlier email to
dev@impala.apache.org [1]. To summarize that email thread, the existing
behavior was decided to be unintuitive and inconsistent with Hive.
Additionally, this behavior requires loading all partitions up-front,
which gets in the goal of lazy/granular metadata loading in
LocalCatalog.

Thus, the LocalCatalog implementation differs as follows:

- the "schema override" behavior ONLY occurs if the Avro file format has
  been selected at a table level.

- if an Avro partition is added to a non-Avro table, and that partition
  has a schema that isn't compatible with the table's schema, an error
  will occur on read.

The thread additionally discusses adding an error message on "alter" to
prevent users from adding an Avro partition to a table with an
incompatible schema. To keep the scope of this patch minimal, that is
not yet implemented here. I filed IMPALA-7309 to change the behavior of
the existing catalog implementation to match.

A new test verifies the behavior, set to 'xfail' when running on the
existing catalog implementation.

[1] https://lists.apache.org/thread.html/fb68c54bd66a40982ee17f9f16f87a4112220a5df035a311bda310f1@%3Cdev.impala.apache.org%3E

Change-Id: Ie4b86c8203271b773a711ed77558ec3e3070cb69
Reviewed-on: http://gerrit.cloudera.org:8080/10970
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
2018-08-07 17:38:04 +00:00
Michael Ho
8d7f638654 IMPALA-7212: Removes --use_krpc flag and remove old DataStream services
This change removes the flag --use_krpc which allows users
to fall back to using Thrift based implementation of DataStream
services. This flag was originally added during development of
IMPALA-2567. It has served its purpose.

As we port more ImpalaInternalServices to use KRPC, it's becoming
increasingly burdensome to maintain parallel implementation of the
RPC handlers. Therefore, going forward, KRPC is always enabled.
This change removes the Thrift based implemenation of DataStreamServices
and also simplifies some of the tests which were skipped when KRPC
is disabled.

Testing done: core debug build.

Change-Id: Icfed200751508478a3d728a917448f2dabfc67c3
Reviewed-on: http://gerrit.cloudera.org:8080/10835
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-24 02:36:50 +00:00
Joe McDonnell
6887fc2190 IMPALA-7238: Use custom timeout for create unique database
test_kudu.TestCreateExternalTables() saw a timeout when
creating the unique database for its tests.

__unique_conn() opens a connection, creates a unique database,
then returns another connection in that database. It takes
a custom timeout argument, but the timeout is only for the
returned connection. The first connection to create the
unique database uses the default timeout of 45 seconds.

This patch changes the first connection to use the custom
timeout. For Kudu tests, this is 5 minutes rather than 45
seconds.

Change-Id: I4f2beb5bc027a4bb44e854bf1dd8919807a92ea0
Reviewed-on: http://gerrit.cloudera.org:8080/10862
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-07-17 16:45:34 +00:00
Vuk Ercegovac
4653637b9e IMPALA-6933: Avoids db name collisions for Kudu tests
Kudu tests generate temporary db names in a way so that its
unlikely, yet possible to collide. A recent test failure
indicates such a collision came up. The fix changes the
way that the name is generated so that it includes the
classes name for which the db name is generated. This db name
will make it easier to see which test created it and the name
will not collide with other names generated by other tests.

Testing:
- ran the updated test locally

Change-Id: I7c2f8a35fec90ae0dabe80237d83954668b47f6e
Reviewed-on: http://gerrit.cloudera.org:8080/10513
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-29 20:03:54 +00:00
Philip Zeyliger
2e6a63e31e IMPALA-6070: Further improvements to test-with-docker.
This commit tackles a few additions and improvements to
test-with-docker. In general, I'm adding workloads (e.g., exhaustive,
rat-check), tuning memory setting and parallelism, and trying to speed
things up.

Bug fixes:

* Embarassingly, I was still skipping thrift-server-test in the backend
  tests. This was a mistake in handling feedback from my last review.

* I made the timeline a little bit taller to clip less.

Adding workloads:

* I added the RAT licensing check.

* I added exhaustive runs. This led me to model the suites a little
  bit more in Python, with a class representing a suite with a
  bunch of data about the suite. It's not perfect and still
  coupled with the entrypoint.sh shell script, but it feels
  workable. As part of adding exhaustive tests, I had
  to re-work the timeout handling, since now different
  suites meaningfully have different timeouts.

Speed ups:

* To speed up test runs, I added a mechanism to split py.test suites into
  multiple shards with a py.test argument. This involved a little bit of work in
  conftest.py, and exposing $RUN_CUSTOM_CLUSTER_TESTS_ARGS in run-all-tests.sh.

  Furthermore, I moved a bit more logic about managing the
  list of suites into Python.

* Doing the full build with "-notests" and only building
  the backend tests in the relevant target that needs them. This speeds
  up "docker commit" significantly by removing about 20GB from the
  container.  I had to indicates that expr-codegen-test depends on
  expr-codegen-test-ir, which was missing.

* I sped up copying the Kudu data: previously I did
  both a move and a copy; now I'm doing a move followed by a move. One
  of the moves is cross-filesystem so is slow, but this does half the
  amount of copying.

Memory usage:

* I tweaked the memlimit_gb settings to have a higher default. I've been
  fighting empirically to have the tests run well on c4.8xlarge and
  m4.10xlarge.

The more memory a minicluster and test suite run uses, the fewer parallel
suites we can run. By observing the peak processes at the tail of a run (with a
new "memory_usage" function that uses a ps/sort/awk trick) and by observing
peak container total_rss, I found that we had several JVMs that
didn't have Xmx settings set. I added Xms/Xmx settings in a few
places:

 * The non-first Impalad does very little JVM work, so having
   an Xmx keeps it small, even in the parallel tests.
 * Datanodes do work, but they essentially were never garbage
   collecting, because JVM defaults let them use up to 1/4th
   the machine memory. (I observed this based on RSS at the
   end of the run; nothing fancier.) Adding Xms/Xmx settings
   helped.
 * Similarly, I piped the settings through to HBase.

A few daemons still run without resource limitations, but they don't
seem to be a problem.

Change-Id: I43fe124f00340afa21ad1eeb6432d6d50151ca7c
Reviewed-on: http://gerrit.cloudera.org:8080/10123
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-26 20:47:29 +00:00
Michael Ho
3b72a6c0da IMPALA-2567: Enable KRPC by default
This change enables the switch to use KRPC by default.
This change also fixes a bug in KrpcDataStreamMgr to
check if maintenance thread was started before calling
Join() on it. This shows up in BE tests as the maintenance
thread isn't started in them.

Testing done: exhaustive build.

Change-Id: Iae736c1c1351758969b4d84e34fc5b2d048660a0
Reviewed-on: http://gerrit.cloudera.org:8080/9461
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-05 08:57:40 +00:00
Lars Volker
a8fc9f0fc7 IMPALA-6508: add KRPC test flag
This change adds a flag "--use_krpc" to start-impala-cluster.py. The
flag is currently passed as an argument to the impalad daemon. In the
future it will also enable KRPC for the catalogd and statestored
daemons.

This change also adds a flag "--test_krpc" to pytest. When running tests
using "impala-py.test --test_krpc", the test cluster will be started
by passing "--use_krpc" to start-impala-cluster.py (see above).

This change also adds a SkipIf to skip tests based on whether the
cluster was started with KRPC support or not.

- SkipIf.not_krpc can be used to mark a test that depends on KRPC.
- SkipIf.not_thrift can be used to mark a test that depends on Thrift
  RPC.

This change adds a meta test to make sure that the new SkipIf decorators
work correctly. The test should be removed as soon as real tests have
been added with the new decorators.

Change-Id: Ie01a5de2afac4a0f43d5fceff283f6108ad6a3ab
Reviewed-on: http://gerrit.cloudera.org:8080/9291
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-16 09:26:01 +00:00
David Knupp
894bb77855 IMPALA-4839: Remove implicit 'localhost' for KUDU_MASTER_HOSTS
The Kudu query tests were failing on a remote cluster because the Kudu
master was always set to '127.0.0.1', with no way to override it.

This patch corrects the issue with a number of changes:

- Add a pytest command line option to specify an arbitrary Kudu master

- Consolidate the place where the default Kudu master is derived. It
  had been stored both in the env and in tests/common/__init__.py,
  with different files looking to different places. For now, just look
  to the env, and remove the value from __init__.py.

- The kudu_client test fixture in conftest.py was using the connect()
  method from impala.dbapi (part of the Impyla library), without
  specifying the host param. In the absence of that, the default value
  is 'localhost', so add the host param to the connect() call.

- Define the various defaults for pytest config as constants at the top
  of conftest.py.

Change-Id: I9df71480a165f4ce21ae3edab6ce7227fbf76f77
Reviewed-on: http://gerrit.cloudera.org:8080/5877
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-14 21:51:39 +00:00
David Knupp
f590bc0da6 IMPALA-4750: Rename test infra classes so they don't mimic test classes.
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prepend
the class names with Impala-

git grep -l 'TestDimension' | xargs \
    sed -i 's/TestDimension/ImpalaTestDimension/g'

git grep -l 'TestMatrix' | xargs \
    sed -i 's/TestMatrix/ImpalaTestMatrix/g'

git grep -l 'TestVector' | xargs \
    sed -i 's/TestVector/ImpalaTestVector/g'

The tests all passed in an exhaustive run on the upstream jenkins
server:

http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/

Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-26 23:40:22 +00:00
David Knupp
6c5f8e3f5e IMPALA-4639: Add pytest option and xfail markers for tests that only run locally.
As we're beginning to run Impala end-to-end tests on remote clusters, we're
finding some tests that do not pass for infrastructure-related reasons (as
opposed to product issues.) It would be useful to be able to xfail any tests
that we know to be problematic within a given module, yet still run the
others. This way, we can get passing test runs as we're ironing out those
infrastructure issues.

Change-Id: Id4d6e46dc1e64ad20c727ccb19af7a9f3daf917f
Reviewed-on: http://gerrit.cloudera.org:8080/5446
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-12-15 02:45:50 +00:00
Thomas Tauber-Marshall
d15f86cb6f IMPALA-4454: test_kudu.TestShowCreateTable flaky
The cause of the flakiness is Kudu CREATE TABLE operations
that are sometimes taking a long time, leading to timeouts
in the hiveserver2 connection. This patch adds the ability
for tests using the 'conn' pytest fixture to specify a
timeout to connect(), and sets a timeout of 5 minutes for
this test.

Change-Id: I2727c27ff66140ac4043bcad332cd4e1d72b255f
Reviewed-on: http://gerrit.cloudera.org:8080/5040
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-11 20:04:01 +00:00
Michael Brown
ac516670b6 IMPALA-4352: test infra: store Impala/Kudu primary keys in object model
Test infrastructure, including the random query generator and the data
migrator, needs to know the primary keys of Impala/Kudu tables. This
test infrastructure keeps Python object models of the tables and
columns. This patch adds the ability to read from source Impala/Kudu
tables via SHOW CREATE TABLE and store primary keys as proper
attributes. The patch also adds tests that ensure the test
infrastructure is always able to read and store the primary keys. This
helps find breakages sooner rather than later. For example, if a
regression to "SHOW CREATE TABLE" or the test infrastructure makes us no
longer able to parse primary keys, GVO or other CI will find the
breakage faster than running the query generator.

I also fixed some flake8 issues in files I touched. There were several
files that had a lot of white space warnings, and I wanted to keep the
patch from getting too large.

Change-Id: Ib654b6cd0e8c2a172ffb7330497be4d4a751e6e5
Reviewed-on: http://gerrit.cloudera.org:8080/4873
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-11-05 19:27:17 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses but it can still be overriden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and silently ignored. The Kudu
   tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu) the existence of the other delegate and the use of delegates in
   general has led to confusion. The Kudu delegate only exists to provide
   functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00
Alex Behm
ab9e54bc42 IMPALA-3491: Use unique database fixture in test_ddl.py.
Adds new parametrization to the unique database fixture:
- num_dbs: allows creating multiple unique databases at once;
  the 2nd, 3rd, etc. datbase name is generated by appending
  "2", "3", etc., to the first database name
- sync_ddl: allows creating the dabatases(s) with sync_ddl
  which is needed by most tests in test_ddl.py

Testing: I ran debug/core and debug/exhaustive on HDFS and
core/debug on S3. Also ran the test locally in a loop on
exhaustive.

Change-Id: Idf667dd5e960768879c019e2037cf48ad4e4241b
Reviewed-on: http://gerrit.cloudera.org:8080/4155
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 02:47:02 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Michael Brown
5112e65be2 Revert "Revert "Add Kudu test helpers""
This reverts commit f8dd5413b65d30646c3745dfc738ed812d50a51f and
effectively re-adds commit 9248dcb70478b8f93f022893776a0960f45fdc28. The
difference between this patch and its original is that I fixed the
changes introduced in infra/python/bootstrap_virtualenv.py to be
python2.4-compatible:

- removed the use of str.format(), preferring a str.join() pattern
- removed the call of the exit() builtin to prefer sys.exit()

The only testing I did for this patch was to ensure
CDH Impala-packaging-on-demand works.

Change-Id: I02ed97473868eacf45b25abe89b41e6fa2fce325
Reviewed-on: http://gerrit.cloudera.org:8080/3160
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-05-24 16:40:59 -07:00
Shiraz Ali
08eff2bc09 Revert "Add Kudu test helpers"
This reverts commit 9248dcb70478b8f93f022893776a0960f45fdc28.
2016-05-20 08:46:00 -07:00
casey
36b524f68c Add Kudu test helpers
Changes:

1) Add the python Kudu module to the virtualenv. Building the virtualenv
is much slower now because Cython and numpy are required. To help with
the rebuild time --no-cache was removed. That option was added to help
when using the dev version of impyla, the version number would be the
same but the module contents were different and the cache used the old
module contents.

2) Add some py.test fixtures to help create Kudu and Impala connections.

Change-Id: I8e5e22b38d5bd09a36238e66a69aa42d1a941de7
Reviewed-on: http://gerrit.cloudera.org:8080/2855
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2016-05-19 19:45:48 -07:00
Skye Wanderman-Milne
9b51b2b6e6 IMPALA-2835: introduce PARQUET_FALLBACK_SCHEMA_RESOLUTION query option
This patch introduces a new query option,
PARQUET_FALLBACK_SCHEMA_RESOLUTION which allows Parquet files' schemas
to be resolved by either name or position.  It's "fallback" because
eventually field IDs will be the primary schema resolution scheme, and
we don't want to create an option that we will have to change the name
of later. The default is still by position. I chose to do a query
option because it will make testing easier and also be easier to
diagnose resolution problems quickly in the field. If users want to
switch the default behavior to be by name (like Hive), they can use
the --default_query_options flag.

This patch also introduces a new test section, SHELL, which can be
used to execute shell commands in a .test file. This is useful for
copying files into test tables.

Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6
Reviewed-on: http://gerrit.cloudera.org:8080/2384
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2016-04-02 04:04:25 +00:00
Michael Brown
58219eac2c IMPALA-2537: EE tests: create and use unique database fixture
To speed up tests and reduce flakiness, introduce a pytest fixture
whereby a test maintainer may request a database unique to his test.
Such databases are suitable for tests that need to create tables within
Python test code. Because the database name is unique to the test, the
test can create any tables within that database it wants without fear
that the same tables will be picked up by another test. Unique databases
effectively guarantee a unique namespace for tables.

To generate the database name, we use the CRC32 checksum of the test's
so-called pytest test ID. This ID is a long string containing the test's
module path, class (if applicable), function name, and parameter set
(e.g., vector). We then concatenate the CRC32 checksum with the test
function name, so that it's easier to identify the test to which the
database belongs. The test author may also override the prefix by
parametrizing the fixture.

We then use a pytest fixture to create the database, hand the name to
the test using the fixture, and clean up the database automatically
after the test completes.

The command `impala-py.test --fixtures` executed from the tests/
directory explains the full usage.

Finally, we modify a few tests to show how test maintainers can use this
fixture.

Not supported here are databases used by .test files, creation of hive
databases, databases with special CREATE parameters such as LOCATION and
COMMENT, or asking the fixture to create multiple databases. Also not
supported would be attempted parallel runs of the same test with the
same test parameters.

Testing:

1. Manual testing of the fixture usage, both in vanilla and
   parametrized context.

2. Manual runs of the tests modified.

3. An exhaustive exploration strategy test run.

Change-Id: I74d200da8a59379388e1edfbb849828f92a1b3b7
Reviewed-on: http://gerrit.cloudera.org:8080/1821
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-03-16 18:29:57 +00:00
Casey Ching
074e5b4349 Remove hashbang from non-script python files
Many python files had a hashbang and the executable bit set though
they were not intended to be run a standalone script. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.

Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-04 05:26:07 +00:00
ishaan
dbc78aaa2c Enable isilon end to end tests for Impala.
This patch introduces changes to run tests against Isilon, combined with minor cleanup of
the test and client code.
For Isilon, it:
  - Populates the SkipIfIsilon class with appropriate pytest markers.
  - Introduces a new default for the hdfs client in order to connect to Isilon.
  - Cleans up a few test files take the underlying filesystem into account.
  - Cleans up the interface for metadata/test_insert_behaviour, query_test/test_ddl

On the client side, we introduce a wrapper around a few pywebhdfs's methods, specifically:
  - delete_file_dir does not throw an error if the file does not exist.
  - get_file_dir_status automatically strips the leading '/'

Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f
Reviewed-on: http://gerrit.cloudera.org:8080/370
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-05-27 22:25:12 +00:00
Matthew Jacobs
0224eb980c Add flag to skip HBase pytests
Adds the pytest flag --skip_hbase to skip HBase tests.

Change-Id: I57088640df22ce24d2b8f697d9dbe36ebf57f8f4
Reviewed-on: http://gerrit.cloudera.org:8080/374
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2015-05-13 00:59:27 +00:00
Mike Yoder
75a97d3d7e [CDH5] Kerberize mini-cluster and Impala daemons
This is the first iteration of a kerberized development environment.
All the daemons start and use kerberos, with the sole exception of the
hive metastore.  This is sufficient to test impala authentication.

When buildall.sh is run using '-kerberize', it will stop before
loading data or attempting to run tests.

Loading data into the cluster is known to not work at this time, the
root causes being that Beeline -> HiveServer2 -> MapReduce throws
errors, and Beeline -> HiveServer2 -> HBase has problems.  These are
left for later work.

However, the impala daemons will happily authenticate using kerberos
both from clients (like the impala shell) and amongst each other.
This means that if you can get data into the mini-cluster, you could
query it.

Usage:
* Supply a '-kerberize' option to buildall.sh, or
* Supply a '-kerberize' option to create-test-configuration.sh, then
  'run-all.sh -format', re-source impala-config.sh, and then start
  impala daemons as usual.  You must reformat the cluster because
  kerberizing it will change all the ownership of all files in HDFS.

Notable changes:
* Added clean start/stop script for the llama-minikdc
* Creation of Kerberized HDFS - namenode and datanodes
* Kerberized HBase (and Zookeeper)
* Kerberized Hive (minus the MetaStore)
* Kerberized Impala
* Loading of data very nearly working

Still to go:
* Kerberize the MetaStore
* Get data loading working
* Run all tests
* The unknown unknowns
* Extensive testing

Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019
Reviewed-by: Michael Yoder <myoder@cloudera.com>
Tested-by: jenkins
2014-09-05 12:36:21 -07:00
Taras Bobrovytsky
568e851774 Added option to specify the scale factor for pytest
This allows execution of tests on a cluster with multiple scale factors.

For example:
py.test <test file> --impalad <cluster ip>:21000 --scale_factor 300gb

Change-Id: I5230a6ef354def44b984eab2ac8a01989b9a471c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3051
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3215
2014-07-15 14:44:37 -07:00
casey
2351266d0e Replace single process mini-dfs with multiple processes
This should allow individual service components, such as a single nodemanager,
to be shutdown for failure testing. The mini-cluster bundled with hadoop is a
single process that does not expose the ability to control individual roles.
Now each role can be controlled and configured independently of the others.

Change-Id: Ic1d42e024226c6867e79916464d184fce886d783
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1432
Tested-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-04-23 18:24:05 -07:00