To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3
This removes or switches most of these locations by various means:
1. If a Python file has a #!/bin/env impala-python (or python) hash-bang but
   no main function, the hash-bang is removed and the file is made
   non-executable.
2. Most scripts can simply switch from impala-python to impala-python3
(or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
replaced by the cm-client pypi package and interfaces have changed.
Rather than migrating the code (which hasn't been used in years), this
deletes the old code and stops installing cm-api into the virtualenv.
The code can be restored and revamped if there is any interest in
interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
bit-rotted. Some pieces can be run manually, but it can't be fully
verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces occurrences of impala-python in comments, documentation,
   and READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
version that supports Python 3. The newest version of kazoo requires
upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
needing other upgrades.
The two remaining uses of impala-python are:
- bin/cmake_aux/create_virtualenv.sh
- bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.
The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python).
Testing:
- Ran core job
- Ran build + dataload on Centos 7, Redhat 8
- Manual testing of individual scripts (except some bitrotted areas like the
random query generator)
Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted those that
needed an integer result (e.g. indices, counts of records) to the
integer division '//' operator. Some code was also using relative
imports and needed to be adjusted to handle absolute_import.
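For example, a file migrated this way starts with the future imports and
uses '//' where an integer is required (a minimal, self-contained sketch):

  from __future__ import absolute_import, division, print_function

  items = [1, 2, 3, 4, 5]
  # True division now returns a float on Python 2 as well: 5 / 2 == 2.5
  average = sum(items) / len(items)
  # Indices must use integer division '//': 5 // 2 == 2
  middle = items[len(items) // 2]
  print(average, middle)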
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc reader, tracks the reader's memory
consumption, and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing into ORC
tables is not supported yet either.
Tests:
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust for corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the command line option --ca_cert to the common test
infra CLI options for use alongside --use-ssl. This is useful when
testing against a secured Impala cluster in which the SSL certs are
self-signed. This will allow the SSL request to be validated. Using this
option will also suppress noisy console warnings like:
InsecureRequestWarning: Unverified HTTPS request is being made. Adding
certificate verification is strongly advised. See:
https://urllib3.readthedocs.org/en/latest/security.html
We also go further in this patch and use the warnings module to print
these SSL-related warnings once and only once, instead of all over the
place. In the case of the stress test, this greatly reduces the noise in
the console log.
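A minimal sketch of the suppression (module paths are the standard urllib3
ones; the exact filter we install may differ):

  import warnings
  from urllib3.exceptions import InsecureRequestWarning

  # Emit each InsecureRequestWarning once per location instead of on
  # every request.
  warnings.simplefilter('once', InsecureRequestWarning)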
Testing:
- quick concurrent_select.py calls with and without --ca_cert to observe
that connections still get made and the test runs smoothly. Some of
this testing occurred without warning suppression, so that I could be
sure the InsecureRequestWarnings were not occurring when using
--ca_cert anymore.
- ensured warnings are printed once, not multiple times
Change-Id: Ifb9e466e4b7cde704cdc4cf98159c068c0a400a9
Reviewed-on: http://gerrit.cloudera.org:8080/7152
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
This patch sets the default --cm-port (for the CM ApiResource
initialization) based on a new flag, --use-tls, which enables test infra
to talk to CM clusters with TLS enabled. It is still possible to set a
port override, but in general it will not be needed.
Reference:
https://cloudera.github.io/cm_api/epydoc/5.4.0/cm_api.api_client.ApiResource-class.html#__init__
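A hedged sketch of the initialization this enables (host name hypothetical;
7183 is CM's usual TLS port, and the arguments follow the ApiResource docs
linked above):

  from cm_api.api_client import ApiResource

  # With --use-tls, default to CM's TLS port instead of 7180.
  api = ApiResource('cm-host.example.com', server_port=7183,
                    username='admin', password='admin', use_tls=True)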
Testing:
Connected both to TLS-disabled and TLS-enabled CM instances. Before this
patch, we would fail hard when trying to talk to the TLS-enabled CM
instance.
Change-Id: Ie7dfa6c400687f3c5ccaf578fd4fb17dedd6eded
Reviewed-on: http://gerrit.cloudera.org:8080/7107
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
This patch adds support for running the stress test
(concurrent_select.py) and loading nested data (load_nested.py) into a
Kerberized, SSL-enabled Impala cluster. It assumes the calling user
already has a valid Kerberos ticket. One way to do that is:
1. Get access to a keytab and krb5.config
2. Set KRB5_CONFIG and KRB5CCNAME appropriately
3. Run kinit(1)
4. Run load_nested.py and/or concurrent_select.py within this
environment.
Because our Python clients already support Kerberos and SSL, we simply
need to make sure to use the correct options when calling the entry
points and initializing the clients:
Impala: Impyla
Hive: Impyla
HDFS: hdfs.ext.kerberos.KerberosClient
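A minimal sketch of what the Impala client initialization amounts to (host
name hypothetical; Impyla's connect() supports these options):

  from impala.dbapi import connect

  # Kerberos + SSL connection to Impala via Impyla.
  conn = connect(host='impala-host.example.com', port=21050,
                 use_ssl=True, auth_mechanism='GSSAPI',
                 kerberos_service_name='impala')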
With this patch, I was able to manually do a short concurrent_select.py
run against a secure cluster without connection or auth errors, and I
was able to do the same with load_nested.py for a cluster that already
had TPC-H loaded.
Follow-ons for future cleanup work:
IMPALA-5263: support CA bundles when running stress test against SSL'd
Impala
IMPALA-5264: fix InsecurePlatformWarning under stress test with SSL
Change-Id: I0daad57bb8ceeb5071b75125f11c1997ed7e0179
Reviewed-on: http://gerrit.cloudera.org:8080/6763
Reviewed-by: Matthew Mulder <mmulder@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
- Set up log handler to append, not truncate. This was the cause of
IMPALA-4775.
Other improvements:
- Log a thread name, not thread ID. Thread names are more useful.
- Use ISO 8601-like timestamps
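A minimal sketch of the handler setup these fixes imply (log path
hypothetical):

  import logging

  # Open the log in append mode rather than truncating it (IMPALA-4775).
  handler = logging.FileHandler('/tmp/discrepancy_searcher.log', mode='a')
  handler.setFormatter(logging.Formatter(
      '%(asctime)s %(threadName)s %(levelname)s: %(message)s',
      datefmt='%Y-%m-%dT%H:%M:%S'))  # ISO 8601-like timestamps
  logging.getLogger().addHandler(handler)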
I tested that running disrepancy_searcher.py doesn't overwrite its logs
anymore. One such command that could reproduce it is:
tests/comparison/discrepancy_searcher.py \
--use-postgresql \
--query-count 1 \
--db-name tpch_kudu
I also ensured the stress test (concurrent_select.py) still logged to
its file.
Change-Id: I2b7af5b2be20f3c6f38d25612f6888433c62d693
Reviewed-on: http://gerrit.cloudera.org:8080/5746
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
- Rework the discrepancy searcher to run DML statements. We do this by
  using the query profile to choose a table, copying that table, and
  generating a statement that will INSERT into that copy. We chose a slow
  copy over other methods because INSERTing into a copy is a more
  reliable test: it keeps table sizes from getting out of hand and avoids
  time-consuming replay to reproduce a particular statement.
- Introduce a statement generator stub. The real generator work is
tracked in IMPALA-4351 and IMPALA-4353. Here we simply generate a
basic INSERT INTO ... VALUES statement to make sure our general query
execution flow is working.
- Add query profile stub for DML statements (INSERT-only at this time).
Since we'll want INSERT INTO ... SELECT very soon, this inherits from
DefaultProfile. Also add building blocks for choosing random
statements in the DefaultProfile.
- Improve the concept of an "execution mode" and add new modes. Before,
we had "RAW", "CREATE_TABLE_AS", and "CREATE_VIEW_AS". The idea here
is that some random SELECT queries could be generated as "CREATE
TABLE|VIEW AS" at execution time, based on weights in the query
profile. First, we remove the use of raw string literals for this,
since raw string literals can be error-prone, and introduce a
StatementExecutionMode class to contain a namespace for the enumerated
statement execution modes. Second, we introduce a couple new execution
modes. The first is DML_SETUP: this is a DML statement that needs to
be run in both the test and reference databases concurrently. For our
purposes, it's the INSERT ... SELECT that copies data from the chosen
random table into the table copy. The second is DML_TEST: this is a
randomly-generated DML statement (see the sketch after the testing notes
below).
- Switch to using absolute imports in many places. There was a mix of
absolute and relative imports happening here, and they were causing
problems, especially when comparing data types. In Python,
<class 'db_types.Int'> != <class 'tests.comparison.db_types.Int'>.
Using
from __future__ import absolute_import
didn't seem to catch the relative import usage anyway, so I haven't
employed that.
- Rename some, but not nearly all, names from "query" to "statement".
  Renaming everything is a rather large undertaking, leading to much
  larger diffs and testing (IMPALA-4602).
- Fix a handful of flake8 warnings. There are a bunch that went unfixed
for over- and under-indentation.
- Testing
o ./discrepancy_searcher.py runs with and without --explain-only, and
with --profile default and --profile dmlonly. For tpch_kudu data, it
seems sufficient to use a --timeout of about 300.
o Leopard run to make sure standard SELECT-only generation still works
o Generated random stress queries locally
o Generated random data locally
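For reference, a hypothetical sketch of the execution-mode namespace
described in the bullet above (the real class may differ in detail):

  class StatementExecutionMode(object):
    # SELECT-oriented modes that existed before this change.
    RAW, CREATE_TABLE_AS, CREATE_VIEW_AS = range(3)
    # New DML modes: DML_SETUP runs in both the test and reference
    # databases; DML_TEST is the randomly-generated DML statement itself.
    DML_SETUP, DML_TEST = range(3, 5)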
Change-Id: Ia4c63a2223185d0e056cc5713796772e5d1b8414
Reviewed-on: http://gerrit.cloudera.org:8080/5387
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
Various test tools and frameworks, including the stress test, random
query generator, and nested types loader, share common modules. The
change "IMPALA-3980: qgen: re-enable Hive as a target database" made
changes to tests.comparison.cli_options, the shared command line
option module, and to tests.comparison.cluster, the shared module for
modeling various Impala clusters. Those changes were for the random
query generator, but didn't take into account the other shared entry
points. It was possible to call some of those entry points in such a way
as to produce an exception, because the Hive-related options are now
required for miniclusters, but the Hive-related options weren't always
being initialized in those entry points.
The simple fix: because Hive settings are now needed to create
Minicluster objects, initialize the Hive options along with the cluster
options, not the connection options. While I was making these changes, I
fixed all flake8 problems in this file.
Testing:
- qgen/minicluster unit tests (regression test)
- full private data load job, including load_nested.py (bug
verification)
- data_generator.py run (regression test), long enough to verify
connection to the minicluster, using both Hive and Impala
- discrepancy_searcher.py run (regression test), long enough to verify
connection to the minicluster, using both Hive and Impala
- concurrent_select.py (in typical mode using a CM host, this is a
regression check; from the command line against the minicluster, this
is a bug verification)
Change-Id: I2a2915e6db85ddb3d8e1bce8035eccd0c9324b4b
Reviewed-on: http://gerrit.cloudera.org:8080/4555
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Changes:
* Added Hive CLI options back in (removed in commit "Stress test: Various changes")
* Modifications so that if --use-hive is specified, a Hive connection is actually created
* A few minor bug fixes so that the RQG can be run locally
* Modified MiniCluster to use HADOOP_CONF_DIR and HIVE_CONF_DIR rather than a hard-coded
file under IMPALA_HOME
* Fixed fe/src/test/resources/hive-default.xml so that it is a valid XML file; it was
  missing a few element terminators that caused an exception in cluster.py
Testing:
* Hive integration tested locally by invoking the data generator via the command:
./data-generator.py \
--db-name=functional \
--use-hive \
--min-row-count=50 \
--max-row-count=100 \
--storage-file-formats textfile \
--use-postgresql \
--postgresql-user stakiar
and the discrepancy checker via the command:
./discrepancy-checker.py \
--db-name=functional \
--use-hive \
--use-postgresql \
--postgresql-user stakiar \
--test-db-type HIVE \
--timeout 300 \
--query-count 50 \
--profile hive
* The output of the above two commands is essentially the same as the Impala output;
  however, about 20% of the queries fail when the discrepancy checker is run
* Regression testing done by running Leopard in a local VM running Ubuntu 14.04, and by
running the discrepancy checker against Impala while inside an Impala Docker container
Change-Id: Ifb1199b50a5b65c21de7876fb70cc03bda1a9b46
Reviewed-on: http://gerrit.cloudera.org:8080/4011
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Miscellaneous fixes to allow running the binary mem_limit search against
a local mini cluster of varying size.
Change-Id: Ic87f8e6eeae97791c9e3d69355aac45d366a1882
Reviewed-on: http://gerrit.cloudera.org:8080/2209
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
The major changes are:
1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
remote or local cluster. This also moves and consolidates some
Cloudera Manager utilities that were in the stress test.
7) Clean up the wrappers around impyla. That stuff was getting
messy.
Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
With this change, the random query generator can run continuously on Hive,
and approximately half of its generated queries run successfully.
1. The connect timeout from Impyla to HS2 was too small;
increased it to match Impala's.
2. The query timeout for Hive queries was too short;
made it configurable so we can experiment with different values.
3. Hive does not support a 'with' clause in subqueries,
but interestingly supports it at the top level.
Added a profile flag "use_nested_with" to disable nested with's.
4. Hive does not support 'having' without 'group by'.
Added a profile flag "use_having_without_groupby" to always
generate a group by with having.
5. Hive does not support "interval" keyword for timestamp.
Added a profile 'restrict' list to restrict certain functions,
and added 'dateAdd' to this list for Hive.
6. Hive 'greatest' and 'least' UDFs do not do implicit type casting
like other databases. Modified the query generator to choose only args of
the same type for these, and HiveSqlWriter to add a cast, as there
were still some lingering issues like UDFs on int returning bigint.
7. Hive always orders NULLs first in ORDER BY ASC,
unlike other databases,
and does not have any 'NULLS FIRST' or 'NULLS LAST' option.
Thus the only workaround is to add a "nulls_order_asc" flag
to the profile, and pass it in to the ref database's SqlWriter
to generate the 'NULLS FIRST' or 'NULLS LAST' statement on that end.
8. Hive strangely does not support multiple sort keys in a window
without a frame specification. The workaround is for HiveSqlWriter
to add 'rows unbounded preceding' to specify the default frame if
there are no existing frames.
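For reference, a hypothetical sketch of the profile knobs introduced above
(attribute names follow the commit text; the real profile class differs):

  class HiveProfile(object):
      def __init__(self):
          self.use_nested_with = False             # workaround 3
          self.use_having_without_groupby = False  # workaround 4
          self.restricted_funcs = ['dateAdd']      # workaround 5
          self.nulls_order_asc = True              # workaround 7: NULLS FIRST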
Change-Id: I2a5b07e37378f695de1b50af49845283468b4f0f
Reviewed-on: http://gerrit.cloudera.org:8080/619
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Example output of --help:
Options:
--debug-log-file=DEBUG_LOG_FILE
Path to debug log file. [default:
/tmp/concurrent_select.py.log]
--cm-host=host name The host name of the CM server.
--cm-port=port number
The port of the CM server. [default: 7180]
--cm-user=user name The name of the CM user. [default: admin]
--cm-password=password
The password for the CM user. [default: admin]
--cm-cluster-name=name
If CM manages multiple clusters, use this to
specify which cluster to use.
Change-Id: I614383f4a65e700348572204e3d8fd5670f5bcf7
Reviewed-on: http://gerrit.cloudera.org:8080/472
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
('--use-hive', action='store_true', default=False,
help='Use Hive')
('--hive-host', default='localhost',
help="The name of the host running the HS2")
('--hive-port', default=10000, type=int,
help="The hs2 port of the host")
('--hive-user', default='hive',
help="The user name to use when connecting to HiveServer2")
('--hive-password', default='hive',
help="The password to use when connecting to HiveServer2")
('--hdfs-host',
help='The host of HDFS backing Hive tables, necessary for external HiveServer2')
('--hdfs-port',
help='The port of HDFS backing Hive tables, necessary for external HiveServer2')
These configurations allow the generator to talk to an external HiveServer2, so that
it can be used as a standalone tool running against a Hive cluster in automated Hive testing.
The Hive connection is backed by Impyla.
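A hedged sketch of how these options might map onto an Impyla connection
(the keyword arguments are Impyla's; using 'PLAIN' auth for user/password
HS2 logins is an assumption):

  from impala.dbapi import connect

  conn = connect(host='localhost', port=10000, user='hive',
                 password='hive', auth_mechanism='PLAIN')
  cursor = conn.cursor()
  cursor.execute('SHOW TABLES')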
Impyla has been fixed to work with Hive on the latest patch:
a1053ce73e
Change-Id: I29b5c8937babf711f8c93ceb3c91fb75cd91d8eb
Reviewed-on: http://gerrit.cloudera.org:8080/553
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Summary of changes:
1) Simplified type system. The old system was overly complicated for
the task of query generation. The modeling of types used to mirror
the types used in Impala. For simplicity, the new system only uses a
subset of types: Boolean, Char, Decimal, Float, Int, and Timestamp.
2) Functions now have fully typed signatures. Previously you had to
know which functions accepted which inputs; now arbitrary
permutations of functions can be generated. The chance of being
able to add a new function without needing to change the query
generation logic is much higher now.
3) Query generation profiles. The randomness of the previous version
was hardcoded in various places throughout the query generator.
Now there is a profile to determine which SQL features should be
used. There is still a lot of room for improvement in terms of
intuitiveness and documentation for configuring the profiles.
4) Greater diversity of queries. Besides the function permutations,
various restrictions to simplify query generation have been
removed. Also, constants are now used in queries.
5) Eliminate spinning and infinite loops. The old version would
sometimes "hope" that a generated SQL element would be compatible
with the context and, if not, try again, which could lead
to noticeable spinning and/or infinite loops.
6) Catchup with Impala 2.0 features: subqueries, analytics, and
Char/VarChar.
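To make items 1-3 concrete, a purely illustrative sketch (all names are
hypothetical, not the real classes):

  # Simplified types (item 1)
  class Int(object): pass
  class Boolean(object): pass

  # A fully typed function signature (item 2)
  class Signature(object):
      def __init__(self, name, return_type, *arg_types):
          self.name = name
          self.return_type = return_type
          self.arg_types = arg_types

  GREATER_THAN = Signature('GreaterThan', Boolean, Int, Int)

  # A profile weight controlling how often a feature is used (item 3)
  PROFILE_WEIGHTS = {'analytic_functions': 0.1, 'subqueries': 0.3}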
Change-Id: Ia25f4e85d6a06f7958a906aa42d9f90d63675bc0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5640
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins