convert_legacy_hive_parquet_utc_timestamps and
use_local_tz_for_unix_timestamp_conversions were controllable only by
flags until now. After this change, the old flags are only used on the
Coordinator to set the defaults for the corresponding query options.
If default_query_options also sets these query options, that setting
takes precedence over the old flags.
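A minimal sketch of the intended precedence, with invented function and
map shapes for illustration (not the actual Impala coordinator code):

  #include <map>
  #include <string>

  // Stand-in for the real gflag; declared here so the sketch is self-contained.
  bool FLAGS_convert_legacy_hive_parquet_utc_timestamps = false;

  // The legacy flag only supplies a default; an explicit entry in
  // --default_query_options overrides it.
  std::string EffectiveOptionValue(
      const std::map<std::string, std::string>& default_query_options) {
    const std::string name = "convert_legacy_hive_parquet_utc_timestamps";
    auto it = default_query_options.find(name);
    if (it != default_query_options.end()) return it->second;
    return FLAGS_convert_legacy_hive_parquet_utc_timestamps ? "true" : "false";
  }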
Possible follow-up work:
- the old flags could be deprecated, as default_query_options
can be used to set these options at the server level
- testing this functionality no longer needs custom cluster
tests; rewriting the existing tests could speed up test
execution
Testing:
- expr-test was rewritten to use the query option instead of the flag
- extended the custom cluster tests for the old flags to also use the
query options
- ran related tests
Change-Id: I5c4252d1c8f8e224c1d0e8234f09374bcc0c6f68
Reviewed-on: http://gerrit.cloudera.org:8080/16469
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The ORC scanner uses TimestampValue::FromUnixTimeNanos() to convert
the sec + nano representation to Impala's TimestampValue (day + nano).
FromUnixTimeNanos() was affected by the flag
use_local_tz_for_unix_timestamp_conversions, even though that global
option should not affect ORC. By default there was no conversion, but
if the flag was set to 1, timestamps were interpreted as UTC and
converted to local time.
This could be solved by creating a UTC version of FromUnixTimeNanos(),
but I decided to change the interface instead, in the hope of making
the To/From timestamp functions less confusing.
Changes:
- Fixed the bug by passing UTC as timezone in the ORC scanner.
- Changed the interface of these TimestampValue functions to expect
a timezone pointer and to interpret null as UTC, skipping conversion
(see the sketch after this list). It would also be possible to pass
the actual UTC timezone and check for it in the functions, but it is
probably easier to optimize the inlined functions this way.
- Moved the checking of use_local_tz_for_unix_timestamp_conversions to
RuntimeState and added property time_zone_for_unix_time_conversions()
to return the timezone to use in Unix time conversions. This makes
TimestampValue's interface clearer and makes it easy to replace the
flag with a query option if we want to.
- Changed RuntimeState and the Parquet scanner to skip timezone
conversion if convert_legacy_hive_parquet_utc_timestamps=1 but the
timezone is UTC. This allows users to avoid the performance penalty
of this flag by setting query option timezone to UTC in their
session (IMPALA-7557). CCTZ does not handle this case well;
conversions are actually slower with fixed-offset timezones
(including UTC) than with timezones that have DST or historical
rule changes.
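A minimal sketch of the changed conversion entry point; the simplified
types and helper names below are assumptions for illustration, not the
real TimestampValue code:

  #include <cstdint>

  struct Timezone;  // stand-in for the CCTZ-backed timezone type

  struct TimestampValue {
    // Simplified signature: a null timezone pointer means "interpret the
    // Unix time as UTC and skip conversion", which is what the ORC scanner
    // now passes.
    static TimestampValue FromUnixTimeNanos(int64_t unix_time, int64_t nanos,
                                            const Timezone* local_tz) {
      TimestampValue result = UtcFromUnixTimeNanos(unix_time, nanos);
      if (local_tz != nullptr) result.UtcToLocal(*local_tz);
      return result;
    }

    // Hypothetical helpers assumed by the sketch.
    static TimestampValue UtcFromUnixTimeNanos(int64_t unix_time, int64_t nanos);
    void UtcToLocal(const Timezone& tz);
  };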
Postponed changes:
- Didn't remove the UTC versions of the functions yet, as that would
require changing (and possibly rethinking) several BE tests and
benchmarks (IMPALA-9409).
Tests:
- Added regression tests for ORC and other file formats to
check that they are not affected by this flag.
- Extended test_hive_parquet_timestamp_conversion.py to cover the case
when convert_legacy_hive_parquet_utc_timestamps=1 and timezone=UTC.
Also did some cleanup there to use query option timezone instead of
env var TZ.
Change-Id: I14e2a7e512ccd013d5d9fe480a5467ed4c46b76e
Reviewed-on: http://gerrit.cloudera.org:8080/15222
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add the -d option and -f option to the following commands:
`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`
The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_.", which improves performance of these commands on S3 since
S3 does not support metadata-only renames.
The -f option "Overwrites the destination if it already exists";
combined with HADOOP-13884, this mitigates S3 consistency issues by
avoiding a HEAD request to check whether the destination file exists.
Added the method 'copy_from_local' to the BaseFilesystem class.
Refactored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate or not worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to an
HDFS-compatible target fs.
Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Refactored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or the
PyWebHdfsClientWithChmod.
Testing:
* Ran core tests on HDFS and S3
Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Reviewed-on: http://gerrit.cloudera.org:8080/14311
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def-level RLE decoder: if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading a batch of
def levels value by value whenever the next def level is not in a
repeated run. We still use the old approach for decoding rep levels.
There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.
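A minimal sketch of the fast path described above, assuming made-up
names for the RLE decoder and materialization hooks (the real column
reader differs):

  #include <algorithm>
  #include <cstdint>

  // Stand-ins for the real RLE decoder and materialization hooks; the names
  // and signatures are assumptions made for this illustration.
  struct RleDecoder {
    // Reports a pending run of identical def levels; returns false if the
    // next levels are not in a repeated run.
    bool NextRepeatedRun(uint8_t* level, int* run_len) {
      *level = 1; *run_len = 0; return false;
    }
    void ConsumeRun(int /*n*/) {}  // advance past the consumed part of the run
  };
  void MaterializeValues(int /*n*/) {}  // handle a run of non-NULL values at once
  void MaterializeNulls(int /*n*/) {}   // handle a run of NULL values at once
  int DecodeAndMaterializeOneByOne(RleDecoder* /*d*/, int n) { return n; }  // old path
  constexpr uint8_t kMaxDefLevel = 1;

  void DecodeBatch(RleDecoder* def_levels, int num_values) {
    int i = 0;
    while (i < num_values) {
      uint8_t level;
      int run_len;
      if (def_levels->NextRepeatedRun(&level, &run_len) && run_len > 0) {
        // Fast path: the def level is repeated for a long run, so decide
        // NULL vs. non-NULL once for the whole run instead of per value.
        run_len = std::min(run_len, num_values - i);
        if (level == kMaxDefLevel) {
          MaterializeValues(run_len);
        } else {
          MaterializeNulls(run_len);
        }
        def_levels->ConsumeRun(run_len);
        i += run_len;
      } else {
        // Slow path: decode and cache a batch of def levels and read them
        // value by value, as before.
        i += DecodeAndMaterializeOneByOne(def_levels, num_values - i);
      }
    }
  }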
These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.
Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.
Perf:
Running TPC-H scale factor 60 on both uncompressed and snappy-compressed
Parquet showed a ~4% speedup overall.
Microbenchmarks show that scans doing only dictionary decoding on
uncompressed Parquet are ~75% faster:
set mt_dop=1;
select min(l_returnflag) from lineitem;
Testing:
We have alltypesagg with a mix of NULL and non-NULL values.
Many tables have long runs of non-NULL values.
Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising the pos slot in flattened
nested-types scans.
* Extended the dict test to cover longer runs.
* A larger version of complextypestbl with interesting collection
shapes (NULL collections, empty collections, etc.), particularly runs
of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.
Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If convert_legacy_hive_parquet_utc_timestamps=true and the Parquet
file was written by parquet-mr (also used by Hive), then timestamps are
converted from UTC to local time during scanning. Stat filtering
did not handle this case correctly and compared UTC min/max values
from stats with local min/max values from predicates. This could
lead to skipping row groups incorrectly.
Note that parquet-mr only writes stats if min and max are equal,
because it cannot order timestamps correctly, so the only case
affected here is when every value is the same in the column chunk.
It would be possible to implement stat filtering correctly, but
this is non-trivial because of DST and historical timezone rule
changes.
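A small illustration of the mismatch, using made-up values rather than
the actual code paths:

  #include <cstdint>
  #include <iostream>

  int main() {
    // Suppose every value in the column chunk is 2018-01-01 08:00:00 UTC and
    // the local time zone is UTC-8, so the scan produces 2018-01-01 00:00:00.
    const int64_t kStatsMinMaxUtc = 1514793600;       // stats stay in UTC
    const int64_t kPredicateConstLocal = 1514764800;  // "ts = 2018-01-01 00:00:00"
    // Buggy stat filtering compared the local-time predicate constant against
    // the UTC min/max, so a row group that actually matches looks skippable.
    const bool skip_row_group = kPredicateConstLocal < kStatsMinMaxUtc ||
                                kPredicateConstLocal > kStatsMinMaxUtc;
    std::cout << "row group incorrectly skipped: " << std::boolalpha
              << skip_row_group << std::endl;
    return 0;
  }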
Testing:
- added a Hive generated parquet file + custom cluster test
that could reproduce this issue
Change-Id: Id4c02230993f2390c03d513f08bae2e9d3d538fa
Reviewed-on: http://gerrit.cloudera.org:8080/11431
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions (see the sketch after this list).
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases.
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is used
whenever the current time-zone is referenced afterwards in an
execution node.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip.
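A minimal, self-contained sketch of the CCTZ conversions the new code
relies on; the zone name and surrounding program are illustrative, only
the cctz calls themselves are the real library API:

  #include <chrono>
  #include <iostream>
  #include <string>
  #include "cctz/time_zone.h"

  int main() {
    // Load a zone from the compiled IANA database (e.g. the one shipped via
    // --hdfs_zone_info_zip or the default /usr/share/zoneinfo).
    cctz::time_zone tz;
    if (!cctz::load_time_zone("America/Los_Angeles", &tz)) return 1;

    // Absolute time -> civil (wall-clock) time in that zone.
    const auto now = std::chrono::system_clock::now();
    const cctz::civil_second cs = cctz::convert(now, tz);
    std::cout << cs.year() << "-" << cs.month() << "-" << cs.day() << std::endl;
    std::cout << cctz::format("%Y-%m-%d %H:%M:%S %z", now, tz) << std::endl;

    // Civil time -> absolute time point in that zone.
    const auto tp = cctz::convert(cctz::civil_second(2020, 1, 1, 0, 0, 0), tz);
    std::cout << cctz::format("%Y-%m-%d %H:%M:%S %z", tp, tz) << std::endl;
    return 0;
  }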
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.
After this change:
Impala reads Parquet MR timestamp data and adjusts the values using
the time zone from the table property parquet.mr.int96.write.zone, if
it is set, and does not adjust them if the property is absent. No
adjustment is applied to data written by Impala.
New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE <file> will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.
HDFS tables created by Impala using CREATE TABLE LIKE <other table>
will copy the property of the table that is copied.
This change also affects the way Impala deals with the
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag is taken into account only if the
parquet.mr.int96.write.zone table property is not set, and is ignored
otherwise.
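A minimal sketch of the resulting precedence when reading parquet-mr
INT96 timestamps; the function and parameter names are illustrative,
not the actual scanner code:

  #include <string>

  // Stand-in for the real gflag.
  bool FLAGS_convert_legacy_hive_parquet_utc_timestamps = false;

  // Returns the time zone used to adjust INT96 timestamps in a parquet-mr
  // file on read, or "" if no adjustment is applied. Files written by Impala
  // are never adjusted, so this is only consulted for parquet-mr files.
  std::string Int96WriteZone(const std::string& table_property_zone) {
    // 1. parquet.mr.int96.write.zone, if set, wins.
    if (!table_property_zone.empty()) return table_property_zone;
    // 2. Otherwise the legacy flag (if set) keeps its old UTC-to-local behaviour.
    if (FLAGS_convert_legacy_hive_parquet_utc_timestamps) return "UTC";
    // 3. Otherwise the values are left unadjusted.
    return "";
  }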
Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Reviewed-on: http://gerrit.cloudera.org:8080/5939
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This patch addresses warning messages from pytest about the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prepend
"Impala" to the class names:
git grep -l 'TestDimension' | xargs \
sed -i 's/TestDimension/ImpalaTestDimension/g'
git grep -l 'TestMatrix' | xargs \
sed -i 's/TestMatrix/ImpalaTestMatrix/g'
git grep -l 'TestVector' | xargs \
sed -i 's/TestVector/ImpalaTestVector/g'
The tests all passed in an exhaustive run on the upstream jenkins
server:
http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/
Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This test may simply never have been run in GMT or UTC; it appears to
have an easy-to-make off-by-one error.
Change-Id: Iac4943085b0693deb380499cd0e141eb672bead8
Reviewed-on: http://gerrit.cloudera.org:8080/5061
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
No changes to writing were made. No changes to reading Impala-written
files were made.
Hive writes TIMESTAMP values to Parquet files differently than Impala
does. Hive converts the value from local time to UTC before writing;
Impala does not. This change adds a startup flag that converts UTC
to local time when reading files written by Hive.
The Hive-file detection actually checks for "parquet-mr" (which is the
library Hive uses) in the file metadata. A slight possibility exists
that TIMESTAMP values written by something other than Hive but also
using parquet-mr may become incorrect. The possibility should be very
small because TIMESTAMP values are stored and encoded in a non-standard
way that other applications are unlikely to be aware of.
Flags from be/src/exec/hdfs-parquet-scanner.cc:
-convert_legacy_hive_parquet_utc_timestamps (When true, TIMESTAMPs
read from files written by Parquet-MR (used by Hive) will be
converted from UTC to local time. Writes are unaffected.) type: bool
default: false
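A minimal sketch of the detection described above, using a simplified
view of the Parquet footer; apart from the "parquet-mr" string and the
created_by footer field, the names here are assumptions:

  #include <string>

  // Stand-in for the real gflag.
  bool FLAGS_convert_legacy_hive_parquet_utc_timestamps = false;

  // The Parquet footer carries a creator string; parquet-mr (the library Hive
  // uses to write Parquet) identifies itself there.
  bool FileWrittenByParquetMr(const std::string& created_by) {
    return created_by.find("parquet-mr") != std::string::npos;
  }

  // Convert scanned TIMESTAMPs from UTC to local time only when the flag is
  // on and the file looks like it was written by parquet-mr.
  bool NeedsUtcToLocalConversion(const std::string& created_by) {
    return FLAGS_convert_legacy_hive_parquet_utc_timestamps &&
           FileWrittenByParquetMr(created_by);
  }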
Change-Id: I79a499fe24049b7025ee2dd76c9c3e07010d346a
Reviewed-on: http://gerrit.cloudera.org:8080/35
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins