convert_legacy_hive_parquet_utc_timestamps and
use_local_tz_for_unix_timestamp_conversions were controllable only by
flags until now. After this change, the old flags are only used on the
Coordinator to set the defaults for the corresponding query options.
If default_query_options also sets these query options, that setting
takes precedence over the old flags.
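A minimal sketch of the intended precedence, with invented function and
map shapes for illustration (not the actual Impala coordinator code):

  #include <map>
  #include <string>

  // Stand-in for the real gflag; declared here so the sketch is self-contained.
  bool FLAGS_convert_legacy_hive_parquet_utc_timestamps = false;

  // The legacy flag only supplies a default; an explicit entry in
  // --default_query_options overrides it.
  std::string EffectiveOptionValue(
      const std::map<std::string, std::string>& default_query_options) {
    const std::string name = "convert_legacy_hive_parquet_utc_timestamps";
    auto it = default_query_options.find(name);
    if (it != default_query_options.end()) return it->second;
    return FLAGS_convert_legacy_hive_parquet_utc_timestamps ? "true" : "false";
  }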
Possible follow-up work:
- the old flags could be deprecated, as default_query_options
can be used to set these options at the server level
- testing this functionality no longer needs custom cluster
tests; rewriting the existing tests could speed up test
execution
Testing:
- expr-test was rewritten to use the query option instead of the flag
- extended the custom cluster tests for the old flags to also use the
query options
- ran related tests
Change-Id: I5c4252d1c8f8e224c1d0e8234f09374bcc0c6f68
Reviewed-on: http://gerrit.cloudera.org:8080/16469
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The ORC scanner uses TimestampValue::FromUnixTimeNanos() to convert
the sec + nano representation to Impala's TimestampValue (day + nano).
FromUnixTimeNanos() was affected by the flag
use_local_tz_for_unix_timestamp_conversions, even though that global
option should not affect ORC. By default there was no conversion, but
if the flag was set to 1, timestamps were interpreted as UTC and
converted to local time.
This could be solved by creating a UTC version of FromUnixTimeNanos(),
but I decided to change the interface instead, in the hope of making
the To/From timestamp functions less confusing.
Changes:
- Fixed the bug by passing UTC as timezone in the ORC scanner.
- Changed the interface of these TimestampValue functions to expect
a timezone pointer and to interpret null as UTC, skipping conversion
(see the sketch after this list). It would also be possible to pass
the actual UTC timezone and check for it in the functions, but it is
probably easier to optimize the inlined functions this way.
- Moved the checking of use_local_tz_for_unix_timestamp_conversions to
RuntimeState and added property time_zone_for_unix_time_conversions()
to return the timezone to use in Unix time conversions. This makes
TimestampValue's interface clearer and makes it easy to replace the
flag with a query option if we want to.
- Changed RuntimeState and the Parquet scanner to skip timezone
conversion if convert_legacy_hive_parquet_utc_timestamps=1 but the
timezone is UTC. This allows users to avoid the performance penalty
of this flag by setting query option timezone to UTC in their
session (IMPALA-7557). CCTZ does not handle this case well;
conversions are actually slower with fixed-offset timezones
(including UTC) than with timezones that have DST or historical
rule changes.
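A minimal sketch of the changed conversion entry point; the simplified
types and helper names below are assumptions for illustration, not the
real TimestampValue code:

  #include <cstdint>

  struct Timezone;  // stand-in for the CCTZ-backed timezone type

  struct TimestampValue {
    // Simplified signature: a null timezone pointer means "interpret the
    // Unix time as UTC and skip conversion", which is what the ORC scanner
    // now passes.
    static TimestampValue FromUnixTimeNanos(int64_t unix_time, int64_t nanos,
                                            const Timezone* local_tz) {
      TimestampValue result = UtcFromUnixTimeNanos(unix_time, nanos);
      if (local_tz != nullptr) result.UtcToLocal(*local_tz);
      return result;
    }

    // Hypothetical helpers assumed by the sketch.
    static TimestampValue UtcFromUnixTimeNanos(int64_t unix_time, int64_t nanos);
    void UtcToLocal(const Timezone& tz);
  };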
Postponed changes:
- Didn't remove the UTC versions of the functions yet, as that would
require changing (and possibly rethinking) several BE tests and
benchmarks (IMPALA-9409).
Tests:
- Added regression tests for ORC and other file formats to
check that they are not affected by this flag.
- Extended test_hive_parquet_timestamp_conversion.py to cover the case
when convert_legacy_hive_parquet_utc_timestamps=1 and timezone=UTC.
Also did some cleanup there to use query option timezone instead of
env var TZ.
Change-Id: I14e2a7e512ccd013d5d9fe480a5467ed4c46b76e
Reviewed-on: http://gerrit.cloudera.org:8080/15222
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add the -d option and -f option to the following commands:
`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ]. <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`
The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_.", which improves performance of these commands on S3 since
S3 does not support metadata-only renames.
The -f option "Overwrites the destination if it already exists";
combined with HADOOP-13884, this mitigates S3 consistency issues by
avoiding a HEAD request to check whether the destination file exists.
Added the method 'copy_from_local' to the BaseFilesystem class.
Refactored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate or not worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to an
HDFS-compatible target fs.
Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Refactored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or the
PyWebHdfsClientWithChmod.
Testing:
* Ran core tests on HDFS and S3
Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Reviewed-on: http://gerrit.cloudera.org:8080/14311
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def-level RLE decoder: if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading a batch of
def levels value by value whenever the next def level is not in a
repeated run. We still use the old approach for decoding rep levels.
There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.
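A minimal sketch of the fast path described above, assuming made-up
names for the RLE decoder and materialization hooks (the real column
reader differs):

  #include <algorithm>
  #include <cstdint>

  // Stand-ins for the real RLE decoder and materialization hooks; the names
  // and signatures are assumptions made for this illustration.
  struct RleDecoder {
    // Reports a pending run of identical def levels; returns false if the
    // next levels are not in a repeated run.
    bool NextRepeatedRun(uint8_t* level, int* run_len) {
      *level = 1; *run_len = 0; return false;
    }
    void ConsumeRun(int /*n*/) {}  // advance past the consumed part of the run
  };
  void MaterializeValues(int /*n*/) {}  // handle a run of non-NULL values at once
  void MaterializeNulls(int /*n*/) {}   // handle a run of NULL values at once
  int DecodeAndMaterializeOneByOne(RleDecoder* /*d*/, int n) { return n; }  // old path
  constexpr uint8_t kMaxDefLevel = 1;

  void DecodeBatch(RleDecoder* def_levels, int num_values) {
    int i = 0;
    while (i < num_values) {
      uint8_t level;
      int run_len;
      if (def_levels->NextRepeatedRun(&level, &run_len) && run_len > 0) {
        // Fast path: the def level is repeated for a long run, so decide
        // NULL vs. non-NULL once for the whole run instead of per value.
        run_len = std::min(run_len, num_values - i);
        if (level == kMaxDefLevel) {
          MaterializeValues(run_len);
        } else {
          MaterializeNulls(run_len);
        }
        def_levels->ConsumeRun(run_len);
        i += run_len;
      } else {
        // Slow path: decode and cache a batch of def levels and read them
        // value by value, as before.
        i += DecodeAndMaterializeOneByOne(def_levels, num_values - i);
      }
    }
  }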
These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.
Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.
Perf:
Running TPC-H scale factor 60 on both uncompressed and snappy-compressed
Parquet showed a ~4% speedup overall.
Microbenchmarks show that scans doing only dictionary decoding on
uncompressed Parquet are ~75% faster:
set mt_dop=1;
select min(l_returnflag) from lineitem;
Testing:
We have alltypesagg with a mix of NULL and non-NULL values.
Many tables have long runs of non-NULL values.
Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising the pos slot in flattened
nested-types scans.
* Extended the dict test to cover longer runs.
* A larger version of complextypestbl with interesting collection
shapes (NULL collections, empty collections, etc.), particularly runs
of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.
Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If convert_legacy_hive_parquet_utc_timestamps=true and the Parquet
file was written by parquet-mr (also used by Hive), then timestamps are
converted from UTC to local time during scanning. Stat filtering
did not handle this case correctly and compared UTC min/max values
from stats with local min/max values from predicates. This could
lead to skipping row groups incorrectly.
Note that parquet-mr only writes stats if min and max are equal,
because it cannot order timestamps correctly, so the only case
affected here is when every value is the same in the column chunk.
It would be possible to implement stat filtering correctly, but
this is non-trivial because of DST and historical timezone rule
changes.
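A small illustration of the mismatch, using made-up values rather than
the actual code paths:

  #include <cstdint>
  #include <iostream>

  int main() {
    // Suppose every value in the column chunk is 2018-01-01 08:00:00 UTC and
    // the local time zone is UTC-8, so the scan produces 2018-01-01 00:00:00.
    const int64_t kStatsMinMaxUtc = 1514793600;       // stats stay in UTC
    const int64_t kPredicateConstLocal = 1514764800;  // "ts = 2018-01-01 00:00:00"
    // Buggy stat filtering compared the local-time predicate constant against
    // the UTC min/max, so a row group that actually matches looks skippable.
    const bool skip_row_group = kPredicateConstLocal < kStatsMinMaxUtc ||
                                kPredicateConstLocal > kStatsMinMaxUtc;
    std::cout << "row group incorrectly skipped: " << std::boolalpha
              << skip_row_group << std::endl;
    return 0;
  }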
Testing:
- added a Hive generated parquet file + custom cluster test
that could reproduce this issue
Change-Id: Id4c02230993f2390c03d513f08bae2e9d3d538fa
Reviewed-on: http://gerrit.cloudera.org:8080/11431
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions (see the sketch after this list).
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases.
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is used
whenever the current time-zone is referenced afterwards in an
execution node.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip.
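A minimal, self-contained sketch of the CCTZ conversions the new code
relies on; the zone name and surrounding program are illustrative, only
the cctz calls themselves are the real library API:

  #include <chrono>
  #include <iostream>
  #include <string>
  #include "cctz/time_zone.h"

  int main() {
    // Load a zone from the compiled IANA database (e.g. the one shipped via
    // --hdfs_zone_info_zip or the default /usr/share/zoneinfo).
    cctz::time_zone tz;
    if (!cctz::load_time_zone("America/Los_Angeles", &tz)) return 1;

    // Absolute time -> civil (wall-clock) time in that zone.
    const auto now = std::chrono::system_clock::now();
    const cctz::civil_second cs = cctz::convert(now, tz);
    std::cout << cs.year() << "-" << cs.month() << "-" << cs.day() << std::endl;
    std::cout << cctz::format("%Y-%m-%d %H:%M:%S %z", now, tz) << std::endl;

    // Civil time -> absolute time point in that zone.
    const auto tp = cctz::convert(cctz::civil_second(2020, 1, 1, 0, 0, 0), tz);
    std::cout << cctz::format("%Y-%m-%d %H:%M:%S %z", tp, tz) << std::endl;
    return 0;
  }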
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this change:
Hive adjusts timestamps by subtracting the local time zone's offset
from all values when writing data to Parquet files. Hive is internally
inconsistent because it behaves differently for other file formats. As
a result of this adjustment, Impala may read "incorrect" timestamp
values from Parquet files written by Hive.
After this change:
Impala reads Parquet MR timestamp data and adjusts the values using
the time zone from the table property parquet.mr.int96.write.zone, if
it is set, and does not adjust them if the property is absent. No
adjustment is applied to data written by Impala.
New HDFS tables created by Impala using CREATE TABLE and CREATE TABLE
LIKE <file> will set the table property to UTC if the global flag
--set_parquet_mr_int96_write_zone_to_utc_on_new_tables is set to true.
HDFS tables created by Impala using CREATE TABLE LIKE <other table>
will copy the property of the table that is copied.
This change also affects the way Impala deals with the
--convert_legacy_hive_parquet_utc_timestamps global flag (introduced
in IMPALA-1658). The flag is taken into account only if the
parquet.mr.int96.write.zone table property is not set, and is ignored
otherwise.
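A minimal sketch of the resulting precedence when reading parquet-mr
INT96 timestamps; the function and parameter names are illustrative,
not the actual scanner code:

  #include <string>

  // Stand-in for the real gflag.
  bool FLAGS_convert_legacy_hive_parquet_utc_timestamps = false;

  // Returns the time zone used to adjust INT96 timestamps in a parquet-mr
  // file on read, or "" if no adjustment is applied. Files written by Impala
  // are never adjusted, so this is only consulted for parquet-mr files.
  std::string Int96WriteZone(const std::string& table_property_zone) {
    // 1. parquet.mr.int96.write.zone, if set, wins.
    if (!table_property_zone.empty()) return table_property_zone;
    // 2. Otherwise the legacy flag (if set) keeps its old UTC-to-local behaviour.
    if (FLAGS_convert_legacy_hive_parquet_utc_timestamps) return "UTC";
    // 3. Otherwise the values are left unadjusted.
    return "";
  }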
Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Reviewed-on: http://gerrit.cloudera.org:8080/5939
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This patch addresses warning messages from pytest about the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prepend
"Impala" to the class names:
git grep -l 'TestDimension' | xargs \
sed -i 's/TestDimension/ImpalaTestDimension/g'
git grep -l 'TestMatrix' | xargs \
sed -i 's/TestMatrix/ImpalaTestMatrix/g'
git grep -l 'TestVector' | xargs \
sed -i 's/TestVector/ImpalaTestVector/g'
The tests all passed in an exhaustive run on the upstream jenkins
server:
http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/
Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This test may simply never have been run in GMT or UTC; it appears to
have an easy-to-make off-by-one error.
Change-Id: Iac4943085b0693deb380499cd0e141eb672bead8
Reviewed-on: http://gerrit.cloudera.org:8080/5061
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
No changes to writing were made. No changes to reading Impala-written
files were made.
Hive writes TIMESTAMP values to Parquet files differently than Impala
does. Hive converts the value from local time to UTC before writing;
Impala does not. This change adds a startup flag that converts UTC
to local time when reading files written by Hive.
The Hive-file detection actually checks for "parquet-mr" (which is the
library Hive uses) in the file metadata. A slight possibility exists
that TIMESTAMP values written by something other than Hive but also
using parquet-mr may become incorrect. The possibility should be very
small because TIMESTAMP values are stored and encoded in a non-standard
way that other applications are unlikely to be aware of.
Flags from be/src/exec/hdfs-parquet-scanner.cc:
-convert_legacy_hive_parquet_utc_timestamps (When true, TIMESTAMPs
read from files written by Parquet-MR (used by Hive) will be
converted from UTC to local time. Writes are unaffected.) type: bool
default: false
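A minimal sketch of the detection described above, using a simplified
view of the Parquet footer; apart from the "parquet-mr" string and the
created_by footer field, the names here are assumptions:

  #include <string>

  // Stand-in for the real gflag.
  bool FLAGS_convert_legacy_hive_parquet_utc_timestamps = false;

  // The Parquet footer carries a creator string; parquet-mr (the library Hive
  // uses to write Parquet) identifies itself there.
  bool FileWrittenByParquetMr(const std::string& created_by) {
    return created_by.find("parquet-mr") != std::string::npos;
  }

  // Convert scanned TIMESTAMPs from UTC to local time only when the flag is
  // on and the file looks like it was written by parquet-mr.
  bool NeedsUtcToLocalConversion(const std::string& created_by) {
    return FLAGS_convert_legacy_hive_parquet_utc_timestamps &&
           FileWrittenByParquetMr(created_by);
  }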
Change-Id: I79a499fe24049b7025ee2dd76c9c3e07010d346a
Reviewed-on: http://gerrit.cloudera.org:8080/35
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins