impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Joe McDonnell	0c7c6a335e	IMPALA-11977: Fix Python 3 broken imports and object model differences Python 3 changed some object model methods: - __nonzero__ was removed in favor of __bool__ - func_dict / func_name were removed in favor of __dict__ / __name__ - The iterator next() method was renamed to __next__ (Code locations should use next(iter) rather than iter.next()) - metaclasses are specified a different way - Locations that specify __eq__ should also specify __hash__ Python 3 also moved some packages around (urllib2, Queue, httplib, etc), and this adapts the code to use the new locations (usually handled on Python 2 via future). This also fixes the code to avoid referencing exception variables outside the exception block and variables outside of a comprehension. Several of these seem like false positives, but it is better to avoid the warning. This fixes these pylint warnings: bad-python3-import eq-without-hash metaclass-assignment next-method-called nonzero-method exception-escape comprehension-escape Testing: - Ran core tests - Ran release exhaustive tests Change-Id: I988ae6c139142678b0d40f1f4170b892eabf25ee Reviewed-on: http://gerrit.cloudera.org:8080/19592 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-03-09 17:17:57 +00:00
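A minimal sketch of the object model changes listed in IMPALA-11977, written so that it runs on both Python 2 and Python 3. The `Batch` class and `describe` helper are hypothetical names invented for illustration and do not come from the Impala tree.

```python
class Batch(object):
    """Hypothetical immutable container used only to illustrate the changes."""

    def __init__(self, rows):
        self.rows = tuple(rows)

    def __bool__(self):
        # Python 3 truthiness hook.
        return len(self.rows) > 0

    # Python 2 still looks for __nonzero__, so alias it to the same method.
    __nonzero__ = __bool__

    def __eq__(self, other):
        return isinstance(other, Batch) and self.rows == other.rows

    def __ne__(self, other):
        # Python 2 does not derive != from __eq__.
        return not self == other

    def __hash__(self):
        # Defining __eq__ alone makes the class unhashable on Python 3.
        return hash(self.rows)

    def __iter__(self):
        return iter(self.rows)


def describe(func):
    # func.__name__ works on both versions; func.func_name is Python 2 only.
    return func.__name__


batch = Batch([1, 2, 3])
first = next(iter(batch))            # next(it) rather than it.next()
unique = {Batch([1]), Batch([1])}    # equal objects collapse to one set entry
```

Aliasing `__nonzero__` to `__bool__` is one way to keep a single code path while both interpreters are supported. The commit itself may rely on the `future` package in places instead.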
Joe McDonnell	c233634d74	IMPALA-11975: Fix Dictionary methods to work with Python 3 Python 3 made the main dictionary methods lazy (items(), keys(), values()). This means that code that uses those methods may need to wrap the call in list() to get a list immediately. Python 3 also removed the old iter* lazy variants. This changes all locations to use Python 3 dictionary methods and wraps calls with list() appropriately. This also changes all iteritems(), itervalues(), iterkeys() locations to items(), values(), keys(), etc. Python 2 will not use the lazy implementation of these, so there is a theoretical performance impact. Our python code is mostly for tests and the performance impact is minimal. Python 2 will be deprecated when Python 3 is functional. This addresses these pylint warnings: dict-iter-method dict-keys-not-iterating dict-values-not-iterating Testing: - Ran core tests Change-Id: Ie873ece54a633a8a95ed4600b1df4be7542348da Reviewed-on: http://gerrit.cloudera.org:8080/19590 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
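The list() wrapping described in IMPALA-11975 matters most in two situations: when the dictionary is modified during iteration, and when the result is indexed. A short sketch follows. The `counts` dictionary is made-up sample data.

```python
counts = {"a": 1, "b": 2, "c": 3}

# On Python 3, keys() is a lazy view. Deleting entries while iterating over
# the view raises RuntimeError, so take a snapshot with list() first.
for key in list(counts.keys()):
    if counts[key] % 2 == 1:
        del counts[key]

# Views cannot be indexed on Python 3; list() gives a real list.
remaining_values = list(counts.values())
first_value = remaining_values[0]

# Plain iteration needs no wrapper. items() replaces the removed iteritems().
total = sum(value for _, value in counts.items())
```

On Python 2 the same calls already return lists, so the extra list() only costs a copy. That copy is the theoretical performance impact the commit message mentions.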
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
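A small sketch of the division change handled in IMPALA-11973. With the `__future__` import in place, `/` is true division on both interpreters, and `//` is used wherever an integer result is required. The function names are hypothetical examples, not code from the commit.

```python
# Must be the first statement in the module; a no-op on Python 3.
from __future__ import absolute_import, division, print_function


def midpoint_index(items):
    # An index must be an integer, so use floor division explicitly.
    return len(items) // 2


def average(values):
    # True division: the result keeps its fractional part on Python 2 as well.
    return sum(values) / len(values)


rows = [10, 20, 30, 40, 50]
middle = rows[midpoint_index(rows)]
mean = average([1, 2])
```

Without the import, `average([1, 2])` would return 1 on Python 2. This is the kind of silent difference the commit flushes out.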
Michael Brown	db7facdee0	IMPALA-4351,IMPALA-4353: [qgen] randomly generate INSERT statements - Generate INSERT statements that are either INSERT ... VALUES or INSERT ... SELECT - On both types of INSERTs, we either insert into all columns, or into some column list. If the column list exists, all primary keys will be present, and 0 or more additional columns will also be in the list. The ordering of the column list is random. - For INSERT ... SELECT, occasionally generate a WITH clause - For INSERT ... VALUES, generate non-null constants for the primary keys, but for the non-primary keys, randomly generate a value expression. The type system in the random statement/query generator isn't sophisticated enough to know the implicit type of a SELECT item or a value expression. It knows it will be some INT-based type, but not if it's going to be a SMALLINT or a BIGINT. To get around this, the easiest thing seems to be to explicitly cast the SELECT items or value expressions to the columns' so-called exact_type attribute. Much of the testing here involved running discrepancy_searcher.py --explain-only on both tpch_kudu and a random HDFS table, using both the default profile and DML-only profile. This was done to quickly find bugs in the statement generation, as they tend to bubble up as analysis errors. I expect to make other changes as follow on patches and more random statements find small test issues. For actual use against Kudu data, you need to migrate data from Kudu into PostgreSQL 5 (instructions tests/comparison/POSTGRES.txt) and run something like: tests/comparison/discrepancy_searcher.py \ --use-postgresql \ --postgresql-port 5433 \ --profile dmlonly \ --timeout 300 \ --db-name tpch_kudu \ --query-count 10 Change-Id: I842b41f0eed07ab30ec76d8fc3cdd5affb525af6 Reviewed-on: http://gerrit.cloudera.org:8080/5486 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-01-13 01:31:47 +00:00
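The column-list rule and the explicit cast described in IMPALA-4351/IMPALA-4353 can be sketched roughly as below. This is an illustration under stated assumptions, not the query generator's actual code: the `Column` tuple, both function names, and the sample table are hypothetical. Only the `exact_type` attribute name is taken from the commit message.

```python
import random
from collections import namedtuple

# Hypothetical stand-in for the generator's column model.
Column = namedtuple("Column", ["name", "exact_type", "is_primary_key"])


def choose_column_list(columns, rng):
    """Pick every primary key plus 0 or more other columns, in random order."""
    primary_keys = [c for c in columns if c.is_primary_key]
    others = [c for c in columns if not c.is_primary_key]
    chosen = primary_keys + rng.sample(others, rng.randint(0, len(others)))
    rng.shuffle(chosen)
    return chosen


def build_insert_values(table_name, columns, value_exprs):
    """Render INSERT ... VALUES, casting each expression to the column's exact type."""
    names = ", ".join(c.name for c in columns)
    casts = ", ".join(
        "CAST({0} AS {1})".format(value_exprs[c.name], c.exact_type)
        for c in columns)
    return "INSERT INTO {0} ({1}) VALUES ({2})".format(table_name, names, casts)


columns = [
    Column("id", "BIGINT", True),
    Column("qty", "SMALLINT", False),
    Column("price", "INT", False),
]
rng = random.Random(0)
chosen = choose_column_list(columns, rng)
sql = build_insert_values(
    "t_copy", chosen, {"id": "1", "qty": "2 + 3", "price": "10 * 4"})
print(sql)
```

The cast sidesteps the problem the commit describes: the generator knows an expression is INT-based but not whether it is a SMALLINT or a BIGINT, so each value is pinned to the target column's type.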
Michael Brown	54665120cb	IMPALA-4355: random query generator: modify statement execution flow to support DML - Rework the discrepancy searcher to run DML statements. We do this by using the query profile to choose a table, copy that table, and generate a statement that will INSERT into that copy. We chose a slow copy over other methods because INSERTing into a copy is a more reliable test that prevents table sizes from getting out of hand or time-consuming replay to reproduce a particular statement. - Introduce a statement generator stub. The real generator work is tracked in IMPALA-4351 and IMPALA-4353. Here we simply generate a basic INSERT INTO ... VALUES statement to make sure our general query execution flow is working. - Add query profile stub for DML statements (INSERT-only at this time). Since we'll want INSERT INTO ... SELECT very soon, this inherits from DefaultProfile. Also add building blocks for choosing random statements in the DefaultProfile. - Improve the concept of an "execution mode" and add new modes. Before, we had "RAW", "CREATE_TABLE_AS", and "CREATE_VIEW_AS". The idea here is that some random SELECT queries could be generated as "CREATE TABLE|VIEW AS" at execution time, based on weights in the query profile. First, we remove the use of raw string literals for this, since raw string literals can be error-prone, and introduce a StatementExecutionMode class to contain a namespace for the enumerated statement execution modes. Second, we introduce a couple new execution modes. The first is DML_SETUP: this is a DML statement that needs to be run in both the test and reference databases concurrently. For our purposes, it's the INSERT ... SELECT that copies data from the chosen random table into the table copy. The second is DML_TEST: this is a randomly-generated DML statement. - Switch to using absolute imports in many places. There was a mix of absolute and relative imports happening here, and they were causing problems, especially when comparing data types. In Python, <class 'db_types.Int'> != <class 'tests.comparison.db_types.Int'>. Using from __future__ import absolute_import didn't seem to catch the relative import usage anyway, so I haven't employed that. - Rename some, but not nearly all, names from "query" to "statement". Doing this is a rather large undertaking leading to much larger diffs and testing (IMPALA-4602). - Fix a handful of flake8 warnings. There are a bunch that went unfixed for over- and under-indentation. - Testing o ./discrepancy_searcher.py runs with and without --explain-only, and with --profile default and --profile dmlonly. For tpch_kudu data, it seems sufficient to use a --timeout of about 300. o Leopard run to make sure standard SELECT-only generation still works o Generated random stress queries locally o Generated random data locally Change-Id: Ia4c63a2223185d0e056cc5713796772e5d1b8414 Reviewed-on: http://gerrit.cloudera.org:8080/5387 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-01-12 21:40:39 +00:00
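The class identity problem mentioned in IMPALA-4355 can be reproduced in a few lines. The sketch below simulates the same source being loaded under two module names, which is roughly what happens when a file is imported both relatively and absolutely. The `load_as` helper and the one-line `SOURCE` are assumptions made for the demonstration. Only the two module names come from the commit message.

```python
import types

# Stand-in for the contents of a db_types.py file.
SOURCE = "class Int(object):\n    pass\n"


def load_as(module_name):
    """Execute SOURCE as a fresh module object with the given name."""
    module = types.ModuleType(module_name)
    exec(SOURCE, module.__dict__)
    return module


relative = load_as("db_types")
absolute = load_as("tests.comparison.db_types")

same_class = relative.Int is absolute.Int
cross_check = isinstance(relative.Int(), absolute.Int)
print(relative.Int, absolute.Int, same_class, cross_check)
```

Each load creates a distinct class object, so identity, equality, and `isinstance` checks across the two all fail even though the source is identical. Importing the module through a single absolute path avoids the duplicate.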
Michael Brown	ac516670b6	IMPALA-4352: test infra: store Impala/Kudu primary keys in object model Test infrastructure, including the random query generator and the data migrator, needs to know the primary keys of Impala/Kudu tables. This test infrastructure keeps Python object models of the tables and columns. This patch adds the ability to read from source Impala/Kudu tables via SHOW CREATE TABLE and store primary keys as proper attributes. The patch also adds tests that ensure the test infrastructure is always able to read and store the primary keys. This helps find breakages sooner rather than later. For example, if a regression to "SHOW CREATE TABLE" or the test infrastructure makes us no longer able to parse primary keys, GVO or other CI will find the breakage faster than running the query generator. I also fixed some flake8 issues in files I touched. There were several files that had a lot of white space warnings, and I wanted to keep the patch from getting too large. Change-Id: Ib654b6cd0e8c2a172ffb7330497be4d4a751e6e5 Reviewed-on: http://gerrit.cloudera.org:8080/4873 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-11-05 19:27:17 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files.txt | xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Casey Ching	f288867833	Stress test: Various changes The major changes are: 1) Collect backtrace and fatal log on crash. 2) Poll memory usage. The data is only displayed at this time. 3) Support kerberos. 4) Add random queries. 5) Generate random and TPC-H nested data on a remote cluster. The random data generator was converted to use MR for scaling. 6) Add a cluster abstraction to run data loading for #5 on a remote or local cluster. This also moves and consolidates some Cloudera Manager utilities that were in the stress test. 7) Clean up the wrappers around impyla. That stuff was getting messy. Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7 Reviewed-on: http://gerrit.cloudera.org:8080/1298 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-01-20 23:00:25 +00:00
Taras Bobrovytsky	f15e4f9033	Add nested types to Query Generator - Parsing of describe statements with nested types - Random query generation that involves nested types - Query flattening (converts a query for a dataset with nested types to an equivalent query for a flattened dataset) Change-Id: If013d104fb90864dcf0934ef92157b95e917e7e8 Reviewed-on: http://gerrit.cloudera.org:8080/1375 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-11-26 00:40:29 +00:00
Taras Bobrovytsky	7431ece2a0	Avoid grouping by Floats and other Query Generator fixes - Fixed bug in contains_agg where we didn't check if self.is_func before checking args - Fixed bug in get_sql_for_data_type where we used VarChar.MAX instead of String.MAX for String data type - Improved naming of query execution threads - Renamed create_aggregate to require_aggregate because aggregate may not be created when it's set to false Change-Id: I2d7601983113e1452f9da54eac62e9c4dd6b7a7f	2015-04-01 13:06:52 -07:00
casey	b013495e1d	Misc updates to the query generator (part 1 of 2) Summary of changes: 1) Simplified type system. The old system was overly complicated for the task of query generation. The modeling of types used to mirror the types used in Impala. For simplicity, the new system only uses a subset of types: Boolean, Char, Decimal, Float, Int, and Timestamp. 2) Functions now have fully typed signatures. Previously you had to know which functions accepted which inputs, now arbitrary permutations of functions can be generated. The chance of being able to add a new function without needing to change the query generation logic is much higher now. 3) Query generation profiles. The randomness of the previous version was hardcoded in various places throughout the query generator. Now there is a profile to determine which SQL features should be used. There is still a lot of room for improvement in terms of intuitiveness and documentation for configuring the profiles. 4) Greater diversity of queries. Besides the function permutations, various restrictions to simplify query generation have been removed. Also constants are used in queries. 5) Eliminate spinning and infinite loops. The old version would sometimes "hope" that a generated SQL element would be compatible with the context and, if not, it would try again, which would lead to noticeable spinning and/or infinite loops. 6) Catch up with Impala 2.0 features: subqueries, analytics, and Char/VarChar. Change-Id: Ia25f4e85d6a06f7958a906aa42d9f90d63675bc0 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5640 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: jenkins	2014-12-19 03:30:44 -08:00

11 Commits