impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Joe McDonnell	1913ab46ed	IMPALA-14501: Migrate most scripts from impala-python to impala-python3 To remove the dependency on Python 2, existing scripts need to use python3 rather than python. These commands find those locations (for impala-python and regular python): git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python git grep bin/python | grep -v python3 This removes or switches most of these locations by various means: 1. If a python file has a #!/bin/env impala-python (or python) but doesn't have a main function, it removes the hash-bang and makes sure that the file is not executable. 2. Most scripts can simply switch from impala-python to impala-python3 (or python to python3) with minimal changes. 3. The cm-api pypi package (which doesn't support Python 3) has been replaced by the cm-client pypi package and interfaces have changed. Rather than migrating the code (which hasn't been used in years), this deletes the old code and stops installing cm-api into the virtualenv. The code can be restored and revamped if there is any interest in interacting with CM clusters. 4. This switches tests/comparison over to impala-python3, but this code has bit-rotted. Some pieces can be run manually, but it can't be fully verified with Python 3. It shouldn't hold back the migration on its own. 5. This also replaces locations of impala-python in comments / documentation / READMEs. 6. kazoo (used for interacting with HBase) needed to be upgraded to a version that supports Python 3. The newest version of kazoo requires upgrades of other component versions, so this uses kazoo 2.8.0 to avoid needing other upgrades. The two remaining uses of impala-python are: - bin/cmake_aux/create_virtualenv.sh - bin/impala-env-versioned-python These will be removed separately when we drop Python 2 support completely. In particular, these are useful for testing impala-shell with Python 2 until we stop supporting Python 2 for impala-shell. The docker-based tests still use /usr/bin/python, but this can be switched over independently (and doesn't impact impala-python) Testing: - Ran core job - Ran build + dataload on Centos 7, Redhat 8 - Manual testing of individual scripts (except some bitrotted areas like the random query generator) Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc Reviewed-on: http://gerrit.cloudera.org:8080/23468 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2025-10-22 16:30:17 +00:00
Riza Suminto	28cff4022d	IMPALA-14333: Run impala-py.test using Python3 Running exhaustive tests with env var IMPALA_USE_PYTHON3_TESTS=true reveals some tests that require adjustment. This patch made such adjustment, which mostly revolves around encoding differences and string vs bytes type in Python3. This patch also switch the default to run pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true. The following are the details: Change hash() function in conftest.py to crc32() to produce deterministic hash. Hash randomization is enabled by default since Python 3.3 (see https://docs.python.org/3/reference/datamodel.html#object.__hash__). This cause test sharding (like --shard_tests=1/2) produce inconsistent set of tests per shard. Always restart minicluster during custom cluster tests if --shard_tests argument is set, because test order may change and affect test correctness, depending on whether running on fresh minicluster or not. Moved one test case from delimited-latin-text.test to test_delimited_text.py for easier binary comparison. Add bytes_to_str() as a utility function to decode bytes in Python3. This is often needed when inspecting the return value of subprocess.check_output() as a string. Implement DataTypeMetaclass.__lt__ to substitute DataTypeMetaclass.__cmp__ that is ignored in Python3 (see https://peps.python.org/pep-0207/). Fix WEB_CERT_ERR difference in test_ipv6.py. Fix trivial integer parsing in test_restart_services.py. Fix various encoding issues in test_saml2_sso.py, test_shell_commandline.py, and test_shell_interactive.py. Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1. Switch to binary comparison in test_iceberg.py where needed. Specify text mode when calling tempfile.NamedTemporaryFile(). Simplify create_impala_shell_executable_dimension to skip testing dev and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The reason is that several UTF-8 related tests in test_shell_commandline.py break in Python3 pytest + Python2 impala-shell combo. This skipping already happen automatically in build OS without system Python2 available like RHEL9 (IMPALA_SYSTEM_PYTHON2 env var is empty). Removed unused vector argument and fixed some trivial flake8 issues. Several test logic require modification due to intermittent issue in Python3 pytest. These include: Add _run_query_with_client() in test_ranger.py to allow reusing a single Impala client for running several queries. Ensure clients are closed when the test is done. Mark several tests in test_ranger.py with SkipIfFS.hive because they run queries through beeline + HiveServer2, but Ozone and S3 build environment does not start HiveServer2 by default. Increase the sleep period from 0.1 to 0.5 seconds per iteration in test_statestore.py and mark TestStatestore to execute serially. This is because TServer appears to shut down more slowly when run concurrently with other tests. Handle the deprecation of Thread.setDaemon() as well. Always force_restart=True each test method in TestLoggingCore, TestShellInteractiveReconnect, and TestQueryRetries to prevent them from reusing minicluster from previous test method. Some of these tests destruct minicluster (kill impalad) and will produce minidump if metrics verifier for next tests fail to detect healthy minicluster state. Testing: Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true. Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4 Reviewed-on: http://gerrit.cloudera.org:8080/23319 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-09-03 10:01:29 +00:00
Riza Suminto	9c87cf41bf	IMPALA-13396: Unify tmp dir management in CustomClusterTestSuite There are many custom cluster tests that require creating temporary directory. The temporary directory typically live within a scope of test method and cleaned afterwards. However, some test do create temporary directory directly and forgot to clean them afterwards, leaving junk dirs under /tmp/ or $LOG_DIR. This patch unify the temporary directory management inside CustomClusterTestSuite. It introduce new 'tmp_dir_placeholders' arg in CustomClusterTestSuite.with_args() that list tmp dirs to create. 'impalad_args', 'catalogd_args', and 'impala_log_dir' now accept formatting pattern that is replaceable by a temporary dir path, defined through 'tmp_dir_placeholders'. There are few occurrences where mkdtemp is called and not replaceable by this work, such as tests/comparison/cluster.py. In that case, this patch change them to supply prefix arg so that developer knows that it comes from Impala test script. This patch also addressed several flake8 errors in modified files. Testing: - Pass custom cluster tests in exhaustive mode. - Manually run few modified tests and observe that the temporary dirs are created and removed under logs/custom_cluster_tests/ as the tests go. Change-Id: I8dd665e8028b3f03e5e33d572c5e188f85c3bdf5 Reviewed-on: http://gerrit.cloudera.org:8080/21836 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-10-02 01:25:39 +00:00
Joe McDonnell	0c7c6a335e	IMPALA-11977: Fix Python 3 broken imports and object model differences Python 3 changed some object model methods: - __nonzero__ was removed in favor of __bool__ - func_dict / func_name were removed in favor of __dict__ / __name__ - The next() function was deprecated in favor of __next__ (Code locations should use next(iter) rather than iter.next()) - metaclasses are specified a different way - Locations that specify __eq__ should also specify __hash__ Python 3 also moved some packages around (urllib2, Queue, httplib, etc), and this adapts the code to use the new locations (usually handled on Python 2 via future). This also fixes the code to avoid referencing exception variables outside the exception block and variables outside of a comprehension. Several of these seem like false positives, but it is better to avoid the warning. This fixes these pylint warnings: bad-python3-import eq-without-hash metaclass-assignment next-method-called nonzero-method exception-escape comprehension-escape Testing: - Ran core tests - Ran release exhaustive tests Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee Reviewed-on: http://gerrit.cloudera.org:8080/19592 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	aa4050b4d9	IMPALA-11976: Fix use of deprecated functions/fields removed in Python 3 Python 3 moved several things around or removed deprecated functions / fields: - sys.maxint was removed, but sys.maxsize provides similar functionality - long was removed, but int provides the same range - file() was removed, but open() already provided the same functionality - Exception.message was removed, but str(exception) is equivalent - Some encodings (like hex) were moved to codecs.encode() - string.letters -> string.ascii_letters - string.lowercase -> string.ascii_lowercase - string.strip was removed This fixes all of those locations. Python 3 also has slightly different rounding behavior from round(), so this changes round() to use future's builtins.round() to get the Python 3 behavior. This fixes the following pylint warnings: - file-builtin - long-builtin - invalid-str-codec - round-builtin - deprecated-string-function - sys-max-int - exception-message-attribute Testing: - Ran cores tests Change-Id: I094cd7fd06b0d417fc875add401d18c90d7a792f Reviewed-on: http://gerrit.cloudera.org:8080/19591 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	eb66d00f9f	IMPALA-11974: Fix lazy list operators for Python 3 compatibility Python 3 changes list operators such as range, map, and filter to be lazy. Some code that expects the list operators to happen immediately will fail. e.g. Python 2: range(0,5) == [0,1,2,3,4] True Python 3: range(0,5) == [0,1,2,3,4] False The fix is to wrap locations with list(). i.e. Python 3: list(range(0,5)) == [0,1,2,3,4] True Since the base operators are now lazy, Python 3 also removes the old lazy versions (e.g. xrange, ifilter, izip, etc). This uses future's builtins package to convert the code to the Python 3 behavior (i.e. xrange -> future's builtins.range). Most of the changes were done via these futurize fixes: - libfuturize.fixes.fix_xrange_with_import - lib2to3.fixes.fix_map - lib2to3.fixes.fix_filter This eliminates the pylint warnings: - xrange-builtin - range-builtin-not-iterating - map-builtin-not-iterating - zip-builtin-not-iterating - filter-builtin-not-iterating - reduce-builtin - deprecated-itertools-function Testing: - Ran core job Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f Reviewed-on: http://gerrit.cloudera.org:8080/19589 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	2b550634d2	IMPALA-11952 (part 2): Fix print function syntax Python 3 now treats print as a function and requires the parenthesis in invocation. print "Hello World!" is now: print("Hello World!") This fixes all locations to use the function invocation. This is more complicated when the output is being redirected to a file or when avoiding the usual newline. print >> sys.stderr , "Hello World!" is now: print("Hello World!", file=sys.stderr) To support this properly and guarantee equivalent behavior between python 2 and python 3, all files that use print now add this import: from __future__ import print_function This also fixes random flake8 issues that intersect with the changes. Testing: - check-python-syntax.sh shows no errors related to print Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351 Reviewed-on: http://gerrit.cloudera.org:8080/19552 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-02-28 17:11:50 +00:00
Joe McDonnell	e7fc18c4ea	IMPALA-10608: Update kudu-python version and remove some unused packages This updates kudu-python to version 1.14.0 (from 1.2.0). As part of this, it disables ccache for bootstrap_virtualenv.py. ccache wasn't working anyway, because pip install uses random temporary directories. It also needs to copy a few files to the build directory for the Kudu install. The advantage to upgrading is that the new version no longer has a numpy dependency. Additionally, this modifies a few minor packages: - virtualenv moves to the latest version prior to the rewrite that accompanied version 20 (i.e. 16.10.7). - setuptools moves to the last version that supports python 2.7 (44.1.1) - remove botos3, ipython, and ordereddict These changes speed up installing the virtualenv Before: real 3m11.956s user 2m49.620s sys 0m14.266s After: real 1m38.798s user 1m33.591s sys 0m8.112s Testing: - Hand tests, GVO run Change-Id: Ib47770df9e46de448fe2bffef7abe2c3aa942fb9 Reviewed-on: http://gerrit.cloudera.org:8080/17231 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-31 03:17:24 +00:00
Tim Armstrong	5989900ae8	IMPALA-9618: fix some usability issues with dev env Automatically assume IMPALA_HOME is the source directory in a couple of places. Delete the cache_tables.py script and MINI_DFS_BASE_DATA_DIR config var which had both bit-rotted and were unused. Allow setting IMPALA_CLUSTER_NODES_DIR to put the minicluster nodes, most important the data, in a different location, e.g. on a different filesystem. Testing: I set up a dev environment using this code and was able to load data and run some tests. Change-Id: Ibd8b42a6d045d73e3ea29015aa6ccbbde278eec7 Reviewed-on: http://gerrit.cloudera.org:8080/15687 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-04-09 08:01:24 +00:00
Joe McDonnell	0163a10332	IMPALA-9068: Use different directories for external vs managed warehouse Hive 3 changed the typical storage model for tables to split them between two directories: - hive.metastore.warehouse.dir stores managed tables (which is now defined to be only transactional tables) - hive.metastore.warehouse.external.dir stores external tables (everything that is not a transactional table) In more recent commits of Hive, there is now validation that the external tables cannot be stored in the managed directory. In order to adopt these newer versions of Hive, we need to use separate directories for external vs managed warehouses. Most of our test tables are not transactional, so they would reside in the external directory. To keep the test changes small, this uses /test-warehouse for the external directory and /test-warehouse/managed for the managed directory. Having the managed directory be a subdirectory of /test-warehouse means that the data snapshot code should not need to change. The Hive 2 configuration doesn't change as it does not have this concept. Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION to 7 for USE_CDP_HIVE=true. This means that dataload will uses a separate location for data as compared to USE_CDP_HIVE=false. That should reduce conflicts between the two configurations. Testing: - Ran exhaustive tests with USE_CDP_HIVE=false - Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version) - Verified that dataload succeeds and tests are able to run with a newer Hive version. Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec Reviewed-on: http://gerrit.cloudera.org:8080/15026 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-01-24 17:29:15 +00:00
Tim Armstrong	0cc9371f0f	[stress] pull out query and runtime info loading This is an incremental step to reduce the size of concurrent_select.py. It is now "only" 2019 lines long. Testing: Ran various command-line invocations locally, including random and DML queries. Running a cluster stress test with the modifications. Change-Id: If65069cd1678cdf71091bd2601bc1fc1d745cec5 Reviewed-on: http://gerrit.cloudera.org:8080/12576 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-02-27 00:08:44 +00:00
Michael Brown	df83d562e2	IMPALA-8175: improve tests_minicluster_obj Adjust minicluster impalad pgrep detection usage to be compatible with CentOS 6 pgrep. Skip the test if not in a minicluster, because the test will fail. Don't run the test in exhaustive: it's most important this test run pre-merge, which is core. I was able to use sh -c "pgrep ..." and impala-py.test to test this locally on CentOS 6 and Ubuntu 16. Change-Id: I558b3157bb16ef3d169c0d3e795e03700a17ffe4 Reviewed-on: http://gerrit.cloudera.org:8080/12412 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-02-09 01:56:39 +00:00
Michael Brown	971cf179f6	IMPALA-7460 part 1: require user to install Paramiko and Fabric - Remove Fabric and Paramiko as requirements. They aren't needed by anything in buildall.sh. - Add a means to install into the impala-python virtual environment by hand. impala-pip is fine for this. - Add another requirements file for extended testing. The dependency situation is messy and untangling that out of impala-python and into lib/python should be out of the scope of IMPALA-7460. - Update core tests, which cover real regressions that have happened in the past, to run against locations that don't require a Paramiko import. This moves some logic out of concurrent_select.py into a thinner module. - Insulate ssh_util from globally-scoped import so that it only imports when needed. Testing: - This works in my development environment. - This works in my downstream stress and query gen environments. - This works when doing a full data load. - Impala still builds on a variety of OSs. Todo: - A subsequent review will update the versions. Change-Id: Ibf9010a0387b52c95b7bda5d1d4606eba1008b65 Reviewed-on: http://gerrit.cloudera.org:8080/11264 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-08-23 00:20:15 +00:00
Michael Brown	c5a9b43db4	IMPALA-3496: stress test: print version info Print the version info of each impalad that's used in a stress test run, sorted by host name. Testing done: $ tests/stress/concurrent_select.py [redacted cluster options] --tpcds-db null --max-queries 0 Cluster Impalad Version Info: host2.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385281c74baa6f1a4d10d44f3411a4303abc) Built on Tue Jul 25 07:06:27 PDT 2017 host3.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385281c74baa6f1a4d10d44f3411a4303abc) Built on Tue Jul 25 07:06:27 PDT 2017 host4.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385281c74baa6f1a4d10d44f3411a4303abc) Built on Tue Jul 25 07:06:27 PDT 2017 host5.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385281c74baa6f1a4d10d44f3411a4303abc) Built on Tue Jul 25 07:06:27 PDT 2017 host6.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385281c74baa6f1a4d10d44f3411a4303abc) Built on Tue Jul 25 07:06:27 PDT 2017 2017-07-25 12:38:52,732 12793 Thread-1 INFO:cluster[691]:Finding impalad binary location ... Change-Id: Ie4b40783ddae6b1bfb2bb4e28c0e3bf97ab944c5 Reviewed-on: http://gerrit.cloudera.org:8080/7501 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Michael Brown <mikeb@cloudera.com>	2017-07-26 13:16:08 +00:00
Michael Brown	428b5a1bfe	IMPALA-5263: test infra: support CA bundles with secure clusters This patch adds the command line option --ca_cert to the common test infra CLI options for use alongside --use-ssl. This is useful when testing against a secured Impala cluster in which the SSL certs are self-signed. This will allow the SSL request to be validated. Using this option will also suppress noisy console warnings like: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html We also go further in this patch and use the warnings module to print these SSL-related warnings once and only once, instead of all over the place. In the case of the stress test, this greatly reduces the noise in the console log. Testing: - quick concurrent_select.py calls with and without --ca_cert to observe that connections still get made and the test runs smoothly. Some of this testing occurred without warning suppression, so that I could be sure the InsecureRequestWarnings were not occurring when using --ca_cert anymore. - ensured warnings are printed once, not multiple times Change-Id: Ifb9e466e4b7cde704cdc4cf98159c068c0a400a9 Reviewed-on: http://gerrit.cloudera.org:8080/7152 Reviewed-by: David Knupp <dknupp@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-13 19:25:57 +00:00
Michael Brown	4882910226	IMPALA-5455: test infra: default --cm-port based on --use-tls This patch sets the default --cm-port (for the CM ApiResource initialization) based on a new flag, --use-tls, which enables test infra to talk to CM clusters with TLS enabled. It is still possible to set a port override, but in general it will not be needed. Reference: https://cloudera.github.io/cm_api/epydoc/5.4.0/cm_api.api_client.ApiResource-class.html#__init__ Testing: Connected both to TLS-disabled and TLS-enabled CM instances. Before this patch, we would fail hard when trying to talk to the TLS-enabled CM instance. Change-Id: Ie7dfa6c400687f3c5ccaf578fd4fb17dedd6eded Reviewed-on: http://gerrit.cloudera.org:8080/7107 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-08 05:49:37 +00:00
Michael Brown	8b459dffec	IMPALA-5162,IMPALA-5163: stress test support on secure clusters This patch adds support for running the stress test (concurrent_select.py) and loading nested data (load_nested.py) into a Kerberized, SSL-enabled Impala cluster. It assumes the calling user already has a valid Kerberos ticket. One way to do that is: 1. Get access to a keytab and krb5.config 2. Set KRB5_CONFIG and KRB5CCNAME appropriately 3. Run kinit(1) 4. Run load_nested.py and/or concurrent_select.py within this environment. Because our Python clients already support Kerberos and SSL, we simply need to make sure to use the correct options when calling the entry points and initializing the clients: Impala: Impyla Hive: Impyla HDFS: hdfs.ext.kerberos.KerberosClient With this patch, I was able to manually do a short concurrent_select.py run against a secure cluster without connection or auth errors, and I was able to do the same with load_nested.py for a cluster that already had TPC-H loaded. Follow-ons for future cleanup work: IMPALA-5263: support CA bundles when running stress test against SSL'd Impala IMPALA-5264: fix InsecurePlatformWarning under stress test with SSL Change-Id: I0daad57bb8ceeb5071b75125f11c1997ed7e0179 Reviewed-on: http://gerrit.cloudera.org:8080/6763 Reviewed-by: Matthew Mulder <mmulder@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-02 04:56:01 +00:00
Lars Volker	ef4c9958d0	IMPALA-4047: Remove occurrences of 'CDH'/'cdh' from repo This change removes some of the occurrences of the strings 'CDH'/'cdh' from the Impala repository. References to Cloudera-internal Jiras have been replaced with upstream Jira issues on issues.cloudera.org. For several categories of occurrences (e.g. pom.xml files, DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to remove the occurrences left after this change. Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8 Reviewed-on: http://gerrit.cloudera.org:8080/4187 Reviewed-by: Jim Apple <jbapple@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-13 00:40:41 +00:00
Michael Brown	a35e438096	IMPALA-4207: test infra: move Hive options from connection to cluster options Various test tools and frameworks, including the stress test, random query generator, and nested types loader, share common modules. This change IMPALA-3980: qgen: re-enable Hive as a target database made changes to tests.comparison.cli_options, the shared command line option module, and to tests.comparison.cluster, the shared module for modeling various Impala clusters. Those changes were for the random query generator, but didn't take into account the other shared entry points. It was possible to call some of those entry points in such a way as to produce an exception, because the Hive-related options are now required for miniclusters, but the Hive-related options weren't always being initialized in those entry points. The simple fix is to say that, because Hive settings are now needed to create Minicluster objects, make the Hive options initialized with cluster options, not connection options. While I was making these changes, I fixed all flake8 problems in this file. Testing: - qgen/minicluster unit tests (regression test) - full private data load job, including load_nested.py (bug verification) - data_generator.py run (regression test), long enough to verify connection to the minicluster, using both Hive and Impala - discrepancy_searcher.py run (regression test), long enough verify connection to the minicluster, using both Hive and Impala - concurrent_select.py (in typical mode using a CM host, this is a regression check; from the command line against the minicluster, this is a bug verification) Change-Id: I2a2915e6db85ddb3d8e1bce8035eccd0c9324b4b Reviewed-on: http://gerrit.cloudera.org:8080/4555 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2016-09-29 02:10:17 +00:00
Sahil Takiar	0780d2c8af	IMPALA-3980: qgen: re-enable Hive as a target database Changes: * Added hive cli options back in (removed in commit "Stress test: Various changes") * Modifications so that if --use-hive is specified, a Hive connection is actually created * A few minor bug fixes so that the RQG can be run locally * Modified MiniCluster to use HADOOP_CONF_DIR and HIVE_CONF_DIR rather than a hard-coded file under IMPALA_HOME * Fixed fe/src/test/resources/hive-default.xml so that it is a valid XML file, it was missing a few element terminators that cause an exception in the cluster.py file Testing: * Hive integration tested locally by invoking the data generator via the command: ./data-generator.py \ --db-name=functional \ --use-hive \ --min-row-count=50 \ --max-row-count=100 \ --storage-file-formats textfile \ --use-postgresql \ --postgresql-user stakiar and the discrepancy checker via the command: ./discrepancy-checker.py \ --db-name=functional \ --use-hive \ --use-postgresql \ --postgresql-user stakiar \ --test-db-type HIVE \ --timeout 300 \ --query-count 50 \ --profile hive * The output of the above two commands is essentially the same as the Impala output, however, about 20% of the queries will fail when the discrepancy checker is run * Regression testing done by running Leopard in a local VM running Ubuntu 14.04, and by running the discrepancy checker against Impala while inside an Impala Docker container Change-Id: Ifb1199b50a5b65c21de7876fb70cc03bda1a9b46 Reviewed-on: http://gerrit.cloudera.org:8080/4011 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>	2016-09-27 22:24:59 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files_txt | xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Casey Ching	31ccd85605	Stress test: Fix stack trace collection A lot of stuff got messed up during the switch to the cluster model... Changes: 1) find_crashed_impalads() returned a list but the caller expected a dict. 2) for_each_impalad() ignored the parameter 'impalads' and instead used all impalads in the cluster. 3) find_last_backtrace() returned the oldest core dump instead of the newest. 4) num_successive_errors_needed_to_abort was effectively hard-coded to 2. I'm not sure how that happened. 5) Catch EOFError when getting a query from the work queue. This happens when the work queue is shutdown but there are workers waiting for an item. 6) Ignore connection errors due to an unresponsive impalad. When the load on an impalad get very high it randomly stops responding to client requests. Reducing the load seems to help. 7) Added various log messages. Change-Id: Icb823dc47a51874b0f8a0b20f966a556752f7796 Reviewed-on: http://gerrit.cloudera.org:8080/2176 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Casey Ching <casey@cloudera.com>	2016-03-10 21:49:53 +00:00
Tim Armstrong	0589b86481	Changes to allow running stress test against MiniCluster Miscellaneous fixes to allow running the binary mem_limit search against a local mini cluster of varying size. Change-Id: Ic87f8e6eeae97791c9e3d69355aac45d366a1882 Reviewed-on: http://gerrit.cloudera.org:8080/2209 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 01:30:11 +00:00
Casey Ching	f8363a4dc1	Fix nested TPC-H data loading on python 2.6 Element.iter() is only available in 2.7. Both 2.6 and 2.7 have Element.getiterator(). This change was tested on both python versions. The full load_nested.py script worked in both cases. Change-Id: I0475c202e96828e9783207e72955a5e9ba7a01d0 Reviewed-on: http://gerrit.cloudera.org:8080/1860 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-01-22 00:17:10 +00:00
Casey Ching	72d1889c08	IMPALA-2873: Fix nested TPC-H data loading In commit 960808 I forgot to update the data-loading script for the conversion of a shell script to a python script. It turns out there were a couple of other little problems too. I checked manually that the data was loaded after these changes. Change-Id: Id81fc423348515ab446835868025cb839c77f52c Reviewed-on: http://gerrit.cloudera.org:8080/1851 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-01-21 05:42:17 +00:00
Casey Ching	f288867833	Stress test: Various changes The major changes are: 1) Collect backtrace and fatal log on crash. 2) Poll memory usage. The data is only displayed at this time. 3) Support kerberos. 4) Add random queries. 5) Generate random and TPC-H nested data on a remote cluster. The random data generator was converted to use MR for scaling. 6) Add a cluster abstraction to run data loading for #5 on a remote or local cluster. This also moves and consolidates some Cloudera Manager utilities that were in the stress test. 7) Cleanup the wrappers around impyla. That stuff was getting messy. Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7 Reviewed-on: http://gerrit.cloudera.org:8080/1298 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-01-20 23:00:25 +00:00
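
The IMPALA-14333 entry in the log above notes that Python 3 randomizes hash() per process, so hash-based test sharding (--shard_tests=1/2) could select a different set of tests on each run, and that the fix was to switch to crc32(). A minimal sketch of that idea follows. The helper names `shard_for` and `select_shard` and the sample test IDs are hypothetical and are not taken from Impala's conftest.py.

```python
import zlib


def shard_for(test_id, num_shards):
    """Map a test ID to a 1-based shard number that is stable across processes."""
    # hash() of a str varies between interpreter runs in Python 3 unless
    # PYTHONHASHSEED is pinned; CRC32 of the encoded ID never changes.
    return zlib.crc32(test_id.encode("utf-8")) % num_shards + 1


def select_shard(test_ids, shard, num_shards):
    """Keep only the tests that belong to the requested shard."""
    return [t for t in test_ids if shard_for(t, num_shards) == shard]


if __name__ == "__main__":
    # Hypothetical test IDs, for illustration only.
    ids = ["test_a.py::test_x", "test_a.py::test_y", "test_b.py::test_z"]
    for shard in (1, 2):
        print(shard, select_shard(ids, shard, 2))
```

Because the shard number depends only on the test ID, two separate pytest processes given shards 1/2 and 2/2 should cover every test exactly once between them.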

27 Commits
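
Several of the Python 3 migration entries in the log above (IMPALA-11977, IMPALA-11976, IMPALA-11974, IMPALA-11973, and IMPALA-11952) describe behavior differences in prose. The sketch below collects a few of them as assertions that should hold when run under Python 3. The `Sample` class is hypothetical and exists only to illustrate the object-model method names; none of this is copied from the Impala tree.

```python
from __future__ import absolute_import, division, print_function

import sys

# range() is lazy in Python 3, so wrap it in list() before comparing to a list.
assert range(0, 5) != [0, 1, 2, 3, 4]
assert list(range(0, 5)) == [0, 1, 2, 3, 4]

# "/" is true division; "//" keeps an integer result for indices and counts.
assert 7 / 2 == 3.5
assert 7 // 2 == 3

# Python 3 round() sends exact halves to the nearest even integer.
assert round(2.5) == 2
assert round(3.5) == 4

# Advance iterators with the next() builtin rather than iter.next().
it = iter([10, 20])
assert next(it) == 10

# Exception.message is gone; str(exception) gives the text.
try:
    raise ValueError("bad input")
except ValueError as e:
    text = str(e)
assert text == "bad input"


class Sample(object):
    """Hypothetical class showing the Python 3 object-model method names."""

    def __init__(self, values):
        self.values = tuple(values)

    def __bool__(self):  # Python 3 name for __nonzero__
        return len(self.values) > 0

    def __eq__(self, other):
        return isinstance(other, Sample) and self.values == other.values

    def __hash__(self):  # defining __eq__ alone leaves the class unhashable
        return hash(self.values)


assert not Sample([])
assert len({Sample([1]), Sample([1])}) == 1

# print is a function; redirect output with the file= keyword.
print("Hello World!", file=sys.stderr)
```

The `from __future__` line has no effect on Python 3; it is kept here because the commits added it so that the same files behave identically on Python 2.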