impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Joe McDonnell	c5a0ec8bdf	IMPALA-11980 (part 1): Put all thrift-generated python code into the impala_thrift_gen package This puts all of the thrift-generated python code into the impala_thrift_gen package. This is similar to what Impyla does for its thrift-generated python code, except that it uses the impala_thrift_gen package rather than impala._thrift_gen. This is a preparatory patch for fixing the absolute import issues. This patches all of the thrift files to add the python namespace. This has code to apply the patching to the thirdparty thrift files (hive_metastore.thrift, fb303.thrift) to do the same. Putting all the generated python into a package makes it easier to understand where the imports are getting code. When the subsequent change rearranges the shell code, the thrift generated code can stay in a separate directory. This uses isort to sort the imports for the affected Python files with the provided .isort.cfg file. This also adds an impala-isort shell script to make it easy to run. Testing: - Ran a core job Change-Id: Ie2927f22c7257aa38a78084efe5bd76d566493c0 Reviewed-on: http://gerrit.cloudera.org:8080/20169 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>	2025-04-15 17:03:02 +00:00
Joe McDonnell	8d5adfd0ba	IMPALA-13123: Add option to run tests with Python 3 This introduces the IMPALA_USE_PYTHON3_TESTS environment variable to select whether to run tests using the toolchain Python 3. This is an experimental option, so it defaults to false, continuing to run tests with Python 2. This fixes a first batch of Python 2 vs 3 issues: - Deciding whether to open a file in bytes mode or text mode - Adapting to APIs that operate on bytes in Python 3 (e.g. codecs) - Eliminating 'basestring' and 'unicode' locations in tests/ by using the recommendations from future ( https://python-future.org/compatible_idioms.html#basestring and https://python-future.org/compatible_idioms.html#unicode ) - Uses impala-python3 for bin/start-impala-cluster.py All fixes leave the Python 2 path working normally. Testing: - Ran an exhaustive run with Python 2 to verify nothing broke - Verified that the new environment variable works and that it uses Python 3 from the toolchain when specified Change-Id: I177d9b8eae9b99ba536ca5c598b07208c3887f8c Reviewed-on: http://gerrit.cloudera.org:8080/21474 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2024-12-17 07:28:51 +00:00
Joe McDonnell	eb66d00f9f	IMPALA-11974: Fix lazy list operators for Python 3 compatibility Python 3 changes list operators such as range, map, and filter to be lazy. Some code that expects the list operators to happen immediately will fail. e.g. Python 2: range(0,5) == [0,1,2,3,4] True Python 3: range(0,5) == [0,1,2,3,4] False The fix is to wrap locations with list(). i.e. Python 3: list(range(0,5)) == [0,1,2,3,4] True Since the base operators are now lazy, Python 3 also removes the old lazy versions (e.g. xrange, ifilter, izip, etc). This uses future's builtins package to convert the code to the Python 3 behavior (i.e. xrange -> future's builtins.range). Most of the changes were done via these futurize fixes: - libfuturize.fixes.fix_xrange_with_import - lib2to3.fixes.fix_map - lib2to3.fixes.fix_filter This eliminates the pylint warnings: - xrange-builtin - range-builtin-not-iterating - map-builtin-not-iterating - zip-builtin-not-iterating - filter-builtin-not-iterating - reduce-builtin - deprecated-itertools-function Testing: - Ran core job Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f Reviewed-on: http://gerrit.cloudera.org:8080/19589 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	ba3518366a	IMPALA-11952 (part 4): Fix odds and ends: Octals, long, lambda, etc. There are a variety of small python 3 syntax differences: - Octal constants need to start with 0o rather than just 0 - Long constants are not supported (i.e. numbers ending with L) - Lambda syntax is slightly different - The 'ur' string mode is no longer supported Testing: - check-python-syntax.sh now passes Change-Id: Ie027a50ddf6a2a0db4b34ec9b49484ce86947f20 Reviewed-on: http://gerrit.cloudera.org:8080/19554 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Michael Smith <michael.smith@cloudera.com>	2023-02-28 17:11:50 +00:00
Csaba Ringhofer	39413a1811	IMPALA-5051: Add INT64 timestamp write support in Parquet Add query option "parquet_timestamp_type" that chooses the Parquet type used when writing TIMESTAMP columns. This is an experimental feature at the moment, because these types are not widely adopted in other Hadoop components yet. For this reason the query option is added as "development" level, and the default behavior is not changed. The following options can be used: INT96_NANOS (default): This is the same as the old behavior, can represent any timestamp that can be handled by Impala. INT64_MILLIS, INT64_MICROS: Can encode the whole [1400..10000) range handled by Impala at the cost of reduced precision. Values are rounded towards minus infinity during writing. INT64_NANOS: Can encode a reduced range without losing nanosecond precision: [1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807] Values outside this range are converted to NULLs without warning. The change was done completely in the backend and all TIMESTAMP columns are written using the type set in the query option. An alternative design would have been to implement some parts in the frontend by adding TIMESTAMP->BIGINT conversion functions to the query plan, which would make it easier to add the possibility of per-column setting in the future. I chose the current design because it seemed much simpler and there are no clear plans for the per-column setting. Most of the code will be still useful if we decide to go the other way in the future. All types are written without conversion to UTC (the way Impala always wrote timestamps), and this information is expressed in the new Parquet logical types by setting isAdjustedToUTC to false. The old logical type (converted_type) is not set, because old readers do not read isAdjustedToUTC, and assume that TIMESTAMP_MILLIS and TIMESTAMP_MICROS are written in UTC. These readers can still read int64 timestamp columns as INT_64.
Testing: - added unit tests for new TimestampValue->int64 functions - add EE tests for checking values / min-max stats / metadata written for int64 Parquet timestamps - ran core tests Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f Reviewed-on: http://gerrit.cloudera.org:8080/12247 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2019-04-09 08:54:23 +00:00
Csaba Ringhofer	0906e0817c	IMPALA-7889: Write new logical types in Parquet Fill the LogicalType field in Parquet schemas for columns that have an associated logical type. ConvertedType still has to be filled to remain compatible with older readers. Testing: - added new tests to check both logical and converted types to test_insert_parquet.py Change-Id: I6f377950845683ab9c6dea79f4c54db0359d0b91 Reviewed-on: http://gerrit.cloudera.org:8080/12004 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-01-17 02:48:39 +00:00
Zoltan Borok-Nagy	ccf19f9f8f	IMPALA-5842: Write page index in Parquet files This commit builds on the previous work of Pooja Nilangekar: https://gerrit.cloudera.org/#/c/7464/ The commit implements the write path of PARQUET-922: "Add column indexes to parquet.thrift". As specified in the parquet-format, Impala writes the page indexes just before the footer. This allows much more efficient page filtering than using the same information from the 'statistics' field of DataPageHeader. I updated Pooja's python tests as well. Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9 Reviewed-on: http://gerrit.cloudera.org:8080/9693 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-05-17 20:22:02 +00:00
Lars Volker	6251d8b4dd	IMPALA-3909: Populate min/max statistics in Parquet writer Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Reviewed-on: http://gerrit.cloudera.org:8080/5611 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2017-02-02 06:44:48 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files.txt | xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Skye Wanderman-Milne	a78f3a8ca5	IMPALA-2069: add USE_UTF8_PARQUET_STRINGS query option This option toggles whether the parquet writer will use the UTF8 annotation for string columns. This patch includes a test that writes a table with or without this option, then verifies that the annotation is or isn't present using a new get_parquet_metadata Python utility. Change-Id: I030c9f5c6272e09c1ce133f66234e3cfb26b68d4 Reviewed-on: http://gerrit.cloudera.org:8080/2531 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-17 05:58:39 +00:00
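
The Python 3 compatibility commits above (IMPALA-11974, IMPALA-11973, IMPALA-11952 part 4) describe three idioms: wrapping lazy list operators in list(), using '//' where an integer result is needed, and writing octal constants with the 0o prefix. The following is a minimal sketch of those idioms, not code taken from the Impala tree; all function and constant names here are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: the names below are hypothetical, not Impala code.
from __future__ import absolute_import, division, print_function


def lazy_range_equals_list():
    # On Python 3, range() is lazy, so comparing it to a list is False.
    return range(0, 5) == [0, 1, 2, 3, 4]


def materialized_range_equals_list():
    # Wrapping with list() restores the eager behavior on Python 3.
    return list(range(0, 5)) == [0, 1, 2, 3, 4]


def midpoint_index(num_records):
    # Indices and record counts need an integer, so use floor division.
    return num_records // 2


def true_ratio(numerator, denominator):
    # With "from __future__ import division" (or on Python 3), '/' is
    # true division and does not round to an integer.
    return numerator / denominator


# Octal constants must be written with the 0o prefix (a bare leading 0
# is a syntax error on Python 3).
OCTAL_MODE = 0o755

if __name__ == "__main__":
    print(lazy_range_equals_list())
    print(materialized_range_equals_list())
    print(midpoint_index(7), true_ratio(7, 2), OCTAL_MODE)
```

The __future__ import has no effect on Python 3 itself; it only makes Python 2 follow the same division and import rules, which is the approach IMPALA-11973 describes.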

11 Commits
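
The INT64_NANOS range quoted in the IMPALA-5051 message follows from storing a signed 64-bit count of nanoseconds relative to 1970-01-01 00:00:00. The sketch below reproduces both bounds and shows floor rounding when precision is reduced to microseconds. It assumes that epoch-relative encoding and is not Impala's backend code; format_nanos and nanos_to_micros_floor are hypothetical helper names.

```python
# Illustrative sketch only: helper names are hypothetical, not Impala code.
from datetime import datetime, timedelta

INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
NANOS_PER_SECOND = 10 ** 9
EPOCH = datetime(1970, 1, 1)


def format_nanos(nanos_since_epoch):
    # divmod floors toward minus infinity, so the fractional part is
    # always in [0, 10**9), including for pre-1970 values.
    seconds, frac_nanos = divmod(nanos_since_epoch, NANOS_PER_SECOND)
    whole = EPOCH + timedelta(seconds=seconds)
    # datetime only has microsecond precision, so the nanosecond
    # fraction is appended as text.
    return "%s.%09d" % (whole.isoformat(" "), frac_nanos)


def nanos_to_micros_floor(nanos_since_epoch):
    # Floor division rounds toward minus infinity, matching the rounding
    # the commit message describes for the reduced-precision types.
    return nanos_since_epoch // 1000


if __name__ == "__main__":
    print(format_nanos(INT64_MIN))
    print(format_nanos(INT64_MAX))
    print(nanos_to_micros_floor(-1))
```

Running this prints the two bounds given in the commit message, 1677-09-21 00:12:43.145224192 and 2262-04-11 23:47:16.854775807, followed by -1.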