In this patch, we add support for reading zstd encoded text files.
This includes:
1. support for reading zstd files written by Hive, which uses streaming
compression.
2. support for reading zstd files compressed by the standard zstd
library, which uses block compression.
To support decompressing both formats, a ProcessBlockStreaming function
is added to the zstd decompressor.
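For reference, a minimal standalone sketch (illustrative names, not the
Impala code itself) of how the zstd streaming API decodes both variants,
assuming the input holds one or more complete zstd frames:
  #include <zstd.h>

  #include <stdexcept>
  #include <string>

  // Decompresses a buffer holding one or more zstd frames. The streaming
  // API accepts both a single ZSTD_compress()-style frame and a
  // concatenation of frames produced by a streaming writer.
  std::string ZstdDecompressAll(const char* in, size_t in_len) {
    ZSTD_DCtx* dctx = ZSTD_createDCtx();
    if (dctx == nullptr) throw std::runtime_error("ZSTD_createDCtx failed");
    std::string out;
    char buf[64 * 1024];
    ZSTD_inBuffer input = {in, in_len, 0};
    while (input.pos < input.size) {
      ZSTD_outBuffer output = {buf, sizeof(buf), 0};
      size_t ret = ZSTD_decompressStream(dctx, &output, &input);
      if (ZSTD_isError(ret)) {
        ZSTD_freeDCtx(dctx);
        throw std::runtime_error(ZSTD_getErrorName(ret));
      }
      out.append(buf, output.pos);
    }
    ZSTD_freeDCtx(dctx);
    return out;
  }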
Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end-to-end tests:
1. Hive and Impala integration: for four compression codecs, write in
Hive and read from Impala.
2. zstd library and Impala integration: copy a file compressed by the
zstd library to HDFS, and read it from Impala.
Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Reviewed-on: http://gerrit.cloudera.org:8080/15023
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A Hudi Read Optimized Table contains multiple versions of Parquet files.
To load the table correctly, Impala needs to recognize a Hudi Read
Optimized Table as an HdfsTable and load the latest version of each file
using HoodieROTablePathFilter.
Tests
- Unit test for Hudi in FileMetadataLoader
- Create table tests in functional_schema_template.sql
- Query tests in hudi-parquet.test
Change-Id: I65e146b347714df32fe968409ef2dde1f6a25cdf
Reviewed-on: http://gerrit.cloudera.org:8080/14711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Implements the read path for the DATE type in the ORC scanner. The
internal representation of a date is an int32 holding the number of
days since the Unix epoch in the proleptic Gregorian calendar.
Similarly to the Parquet implementation (IMPALA-7370) this
representation introduces an interoperability issue between Impala
and older versions of Hive (before 3.1). For more details see the
commit message of the mentioned Parquet implementation.
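For illustration only (not Impala's DateValue code), the same day-count
interpretation using C++20's proleptic Gregorian std::chrono calendar:
  #include <chrono>
  #include <cstdio>

  int main() {
    using namespace std::chrono;
    // A DATE slot is an int32 day offset from 1970-01-01 (Unix epoch),
    // interpreted in the proleptic Gregorian calendar.
    int32_t days_since_epoch = 18262;  // illustrative value
    year_month_day ymd{sys_days{days{days_since_epoch}}};
    std::printf("%d-%02u-%02u\n", int(ymd.year()), unsigned(ymd.month()),
                unsigned(ymd.day()));  // prints 2020-01-01
    return 0;
  }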
Change-Id: I672a2cdd2452a46b676e0e36942fd310f55c4956
Reviewed-on: http://gerrit.cloudera.org:8080/14982
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The root type should be a struct as far as I know, and this was
checked with a DCHECK, leading to crashes in fuzz tests. This
change replaces the DCHECK with returning an error message.
Testing:
- added corrupt ORC file and e2e test
Change-Id: I7fba8cffbcdf8f647e27e2d5ee9e6716a4492b9b
Reviewed-on: http://gerrit.cloudera.org:8080/15021
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
orc::ColumnSelector::updateSelectedByTypeId can throw an exception on
malformed ORC files. The exception wasn't caught by Impala, so it
caused the program to terminate.
The fix is to simply catch the exception and return with a parse error
instead.
Testing:
* added corrupt ORC file and e2e test
Change-Id: I2f706bc832298cb5089e539b7a818cb86d02199f
Reviewed-on: http://gerrit.cloudera.org:8080/14994
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive can write timestamps that are outside Impala's valid
range (Impala: 1400-9999, Hive: 0001-9999). This change adds
validation logic to ORC reading that replaces out-of-range
timestamps with NULLs and adds a warning to the query.
The logic is very similar to the existing validation in
Parquet. Some differences:
- "time of day" is not checked separately as it doesn't make
sense with ORC's encoding
- instead of the column name, only the column id is added to the warning
Testing:
- added a simple EE test that scans an existing ORC file
Change-Id: I8ee2ba83a54f93d37e8832e064f2c8418b503490
Reviewed-on: http://gerrit.cloudera.org:8080/14832
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The goal is to let JDBC clients get constraint information
from Impala tables. We implement two new metadata operations in
impala-hs2-server, GetPrimaryKeys and GetCrossReference, which are
already implemented in Hive's HS2. The thrift
definitions are copied from Hive's TCLIService.thrift. In FE, these
two operations are implemented to get the information from tables
in the catalog.
Much like for GetColumns(), tables need to be loaded before PK/FK
information can be returned. We wait for the PK/FK tables to load.
In the implementation, PK/FK information is returned
ONLY if the user has access to ALL the columns involved in the PK/FK
relationship.
Testing:
- Added three test tables to our test datasets since most of our FE tests
relied on dummy tables or testdata. It was difficult to test PK/FK with
these methods. Also, we can build on this testdata in future when we make
optimizer improvements.
- Added unit tests in AuthorizationTest and JDBCtest.
- Added e2e test in test_hs2.py
- This patch modifies AnalyzeDDLTests and ToSqlTests to rely on the newly
added dataset instead of dummy tables for pk/fk tests.
Caveats:
- Ranger needs OWNER user information for authorization. Since this is HMS
metadata that we do not aggressively load, this information is not available
for IncompleteTables. Some foreign key tables (fact tables, for example)
might have FK/PK relationships with several PK tables, some of which might
not be loaded in the catalog. Currently we have no way to check column
privileges without owner user information for such tables, so we do not
return keys involving their columns. Therefore, when Ranger is used, there
may be missing PK/FK relationships for parent tables that are not loaded.
This can be tracked in IMPALA-9172.
- Retrieval of constraints is not yet supported in LocalCatalog mode. See
IMPALA-9158.
Change-Id: I8942dfbbd4a3be244eed1c61ac2ce17069960477
Reviewed-on: http://gerrit.cloudera.org:8080/14720
Reviewed-by: Vihang Karajgaonkar <vihang@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch the supported year range for DATE type started with
year 0. This contradicts the ANSI SQL standard that defines the valid
DATE value range to be 0001-01-01 to 9999-12-31.
Change-Id: Iefdf1c036834763f52d44d0c39a25a1f04e41e07
Reviewed-on: http://gerrit.cloudera.org:8080/14349
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change is a follow-up to IMPALA-7368 and adds support for the DATE
type to the Avro scanner.
Similarly to Parquet, Avro uses the DATE logical type for dates. The
DATE logical type annotates an INT32 that stores the number of days
since the Unix epoch, 1 January 1970.
This representation introduces an Avro interoperability issue between
Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
up to 1582-10-05 and Gregorian calendar for dates starting with
1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses the proleptic Gregorian calendar, extending the Gregorian
calendar backward to dates preceding its official introduction on
1582-10-15.
This means that pre-1582-10-15 dates written to an Avro table by Hive
will be read back incorrectly by Impala.
Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.
Dependency changes:
- BE uses avro 1.7.4-p5 from native-toolchain.
Change-Id: I7a9d5b93a22cf3a00244037e187f8c145cacc959
Reviewed-on: http://gerrit.cloudera.org:8080/13944
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit implements page filtering based on the Parquet page index.
The page index is read and evaluated by the HdfsParquetScanner. First,
we determine the row ranges we are interested in; based on those row
ranges, we determine the candidate pages for each column that we are
reading.
We still issue one ScanRange per column chunk, but we specify
sub-ranges that cover the candidate pages, i.e. we don't read
the whole column chunk, only fractions of it.
Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B.
This means we need to implement some row-skipping logic
when we read the data pages. This logic is implemented in
BaseScalarColumnReader and ScalarColumnReader. Collection column
readers know nothing about page filtering.
Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.
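As a rough illustration of the page-selection step (a standalone sketch
with illustrative names, not the scanner's actual code): given each
page's first_row_index from the Parquet OffsetIndex and the surviving
row ranges, keep only the pages that overlap some range.
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct RowRange { int64_t first_row; int64_t last_row; };  // inclusive

  std::vector<int> SelectCandidatePages(
      const std::vector<int64_t>& first_row_index,  // per page, ascending
      int64_t num_rows_in_chunk,
      const std::vector<RowRange>& ranges) {
    std::vector<int> candidates;
    const int num_pages = static_cast<int>(first_row_index.size());
    for (int i = 0; i < num_pages; ++i) {
      int64_t page_first = first_row_index[i];
      int64_t page_last = (i + 1 < num_pages) ? first_row_index[i + 1] - 1
                                              : num_rows_in_chunk - 1;
      bool overlaps = std::any_of(ranges.begin(), ranges.end(),
          [&](const RowRange& r) {
            return r.first_row <= page_last && page_first <= r.last_row;
          });
      if (overlaps) candidates.push_back(i);
    }
    return candidates;
  }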
Testing:
* added some unit tests for the row range and
page selection logic
* generated various Parquet files with Parquet-MR
* enabled page index writing and ran selective queries against
tables written by Impala. Current tests are likely to use page
filtering transparently.
Performance:
* Measured locally, observed 3x to 20x speedups for selective queries.
The speedup was proportional to the I/O operations that needed to be
done.
* The TPC-H benchmark didn't show a significant performance change.
This is not a surprise since the data is not sorted in any useful
way, so the main goal was to not introduce a perf regression.
TODO:
* measure performance for remote reads
Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Reviewed-on: http://gerrit.cloudera.org:8080/12065
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change is a follow-up to IMPALA-7368 and adds support for DATE
type to the parquet scanner/writer. CREATE TABLE LIKE PARQUET
statements associated with data files that contain dates are also
supported.
Parquet uses the DATE logical type for dates. The DATE logical type
annotates an INT32 that stores the number of days from the Unix epoch,
1 January 1970.
This representation introduces a parquet interoperability issue
between Impala and older versions of Hive:
- Before version 3.1, Hive used Julian calendar to represent dates
up to 1582-10-05 and Gregorian calendar for dates starting with
1582-10-15. Dates between 1582-10-05 and 1582-10-15 were lost.
- Impala uses the proleptic Gregorian calendar, extending the Gregorian
calendar backward to dates preceding its official introduction on
1582-10-15.
This means that pre-1582-10-15 dates written to a parquet table by
Hive will be read back incorrectly by Impala and vice versa.
Note that Hive 3.1 switched to proleptic Gregorian calendar too, so
for Hive 3.1+ this is no longer an issue.
Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Reviewed-on: http://gerrit.cloudera.org:8080/13189
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.
This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.
The changes are as follows:
- Support for DATE literal syntax.
- Explicit casting between DATE and other types (note that invalid
casts will fail with an error just like invalid DECIMAL_V2 casts,
while failed casts to other types do not lead to a warning or error):
- from STRING to DATE. The string value must be formatted as
yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
the time component is optional. If the time component is
present, it will be truncated silently.
- from DATE to STRING. The resulting string value is formatted as
yyyy-MM-dd.
- from TIMESTAMP to DATE. The source timestamp's time of day
component is ignored.
- from DATE to TIMESTAMP. The target timestamp's time of day
component is set to 00:00:00.
- Implicit casting between DATE and other types:
- from STRING to DATE if the source string value is used in a
context where a DATE value is expected.
- from DATE to TIMESTAMP if the source date value is used in a
context where a TIMESTAMP value is expected.
- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
implicit conversions are now all possible, the existing function
overload resolution logic is not adequate anymore.
For example, it resolves the
if(false, '2011-01-01', DATE '1499-02-02') function call to the
if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
function, instead of the if(BOOLEAN, DATE, DATE) version.
This is clearly wrong, so the function overload resolution logic had
to be changed to resolve function calls to the best-fit overloaded
function definition if there are multiple applicable candidates.
An overloaded function definition is an applicable candidate for a
function call if each actual parameter in the function call either
matches the corresponding formal parameter's type (without casting)
or is implicitly castable to that type.
When looking for the best-fit applicable candidate, a parameter
match score (i.e. the number of actual parameters in the function
call that match their corresponding formal parameter's type without
casting) is calculated and the applicable candidate with the highest
parameter match score is chosen.
There's one more issue that the new resolution logic has to address:
if two applicable candidates have the same parameter match score and
the only difference between the two is that the first one requires a
STRING -> TIMESTAMP implicit cast for some of its parameters while
the second one requires a STRING -> DATE implicit cast for the same
parameters then the first candidate has to be chosen not to break
backward compatibility.
E.g. the year('2019-02-15') function call must resolve to
year(TIMESTAMP) instead of year(DATE). Note that year(DATE) is not
implemented yet, so this is not an issue at the moment, but it will
be in the future.
When the resolution algorithm considers overloaded function
definitions, it first orders them lexicographically by the types in
their parameter lists. To ensure the backward compatible behavior,
the PrimitiveType.DATE enum value has to come after
PrimitiveType.TIMESTAMP. (A sketch of this resolution logic follows
the change list below.)
- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.
These items are tightly coupled and it makes sense to implement them
in one change-set.
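The sketch referenced above: a standalone illustration of the best-fit
overload resolution, with simplified stand-in types and helpers rather
than the FE's actual classes.
  #include <cstddef>
  #include <vector>

  enum class Type { BOOLEAN, STRING, DATE, TIMESTAMP };

  // Only the implicit casts relevant to this example.
  bool ImplicitlyCastable(Type from, Type to) {
    return (from == Type::STRING &&
            (to == Type::DATE || to == Type::TIMESTAMP)) ||
           (from == Type::DATE && to == Type::TIMESTAMP);
  }

  struct Candidate { std::vector<Type> params; };

  // Returns the index of the applicable candidate with the most exact
  // parameter matches, or -1 if none is applicable. Ties keep the
  // earlier candidate, which is why TIMESTAMP overloads must sort
  // before DATE ones so that year('2019-02-15') keeps resolving to
  // year(TIMESTAMP).
  int ResolveOverload(const std::vector<Candidate>& candidates,
                      const std::vector<Type>& args) {
    int best = -1, best_score = -1;
    for (size_t i = 0; i < candidates.size(); ++i) {
      const std::vector<Type>& params = candidates[i].params;
      if (params.size() != args.size()) continue;
      int score = 0;
      bool applicable = true;
      for (size_t j = 0; j < args.size(); ++j) {
        if (args[j] == params[j]) {
          ++score;
        } else if (!ImplicitlyCastable(args[j], params[j])) {
          applicable = false;
          break;
        }
      }
      if (applicable && score > best_score) {
        best_score = score;
        best = static_cast<int>(i);
      }
    }
    return best;
  }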
Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
corresponding HBASE table 'functional_hbase.date_tbl') was
introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
- since DATE type is supported for TEXT and HBASE fileformats
only, most DATE tests were implemented separately in
tests/query_test/test_date_queries.py.
Note that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.
Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is a fix for the following issue:
1. Some BE tests (e.g. ExprTest.TimestampFunctions) use the system's
local timezone but run against a test timezone db (instead of the
system's timezone db).
2. On some Linux installations /usr/share/zoneinfo contains symlinks
to files in the /usr/share/zoneinfo/SystemV directory
(e.g /usr/share/zoneinfo/America/Los_Angeles is a symlink to
../SystemV/PST8PDT).
3. The 'SystemV' directory is not part of the test timezone db, since
it is obsolete and excluded by default.
Consequently, if the system's local timezone is set to
America/Los_Angeles, BE tests won't find the corresponding timezone
file in the test timezone db. BE tests will default to UTC, which will
break some of them.
This change sets local timezone explicitly for failing BE tests, so
they don't depend on the system's local timezone.
It also adds 'SystemV' directory to the test timezone db to avoid
similar issues in the future.
Change-Id: I9288cd24c8af0c059e55d47c86bd92eaf0075681
Reviewed-on: http://gerrit.cloudera.org:8080/12199
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The code mimics the code written for other min-max filters. Decimal data
can be stored using 4 bytes, 8 bytes or 16 bytes, and the code handles
these three storage configurations. The column definition states the
precision, and the precision determines the storage size.
The minimum and maximum values are stored in a union. The precision of
the column comes in as an input; based on the precision the storage
size is found, and depending on the size the appropriate union member
is used.
The code in min-max-filter* follows the general convention of the file,
hence uses macros.
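A bare-bones sketch of the storage idea (illustrative names; __int128
is a GCC/Clang extension standing in for the 16-byte case):
  #include <cstdint>

  // Min/max are kept in a union; the active member is chosen from the
  // column's precision.
  struct DecimalMinMax {
    union Value { int32_t i4; int64_t i8; __int128 i16; } min, max;
  };

  int DecimalByteSize(int precision) {
    if (precision <= 9) return 4;    // DECIMAL(1..9)   -> 4-byte storage
    if (precision <= 18) return 8;   // DECIMAL(10..18) -> 8-byte storage
    return 16;                       // DECIMAL(19..38) -> 16-byte storage
  }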
The test includes 24 decimal columns (as listed below) with the following joins:
1. Inner Join with broadcast (2 tables)
1a. 1 predicate
1b. 4 predicates - all result in a decimal min-max filter
1c. 4 predicates - 3 result in a decimal min-max filter; 1 doesn't
2. Inner Join with Shuffle (3 tables)
3. Right outer join (2 tables)
4. Left Semi join (2 tables)
5. Right Semi join (2 tables)
Decimal Columns:
4bytes:
(5,0), (5,1), (5,3), (5,5)
(9,0), (9,1), (9,5), (9,9)
8 bytes:
(14,0), (14,1), (14,7), (14,14)
(18,0), (18,1), (18,9), (18,18)
16 bytes:
(28,0), (28,1), (28,14), (28,28)
(38,0), (38,1), (38,19), (38,38)
The test aggregates the count of probe rows. This shows that the min-max
filter is exercised, because the number of probe rows is less than the
total number of rows in the probe side table. The count of probe rows is
considered to be deterministic, but it will be beneficial to look out
for changes in Kudu that can change the way data is partitioned. Such a
change could change the probe row count, and in that case the test will
have to be updated.
impala_test_suite.py and test_result_verifier.py are enhanced to support
saving aggregated results using update_results.
Change-Id: Ib7e7278e902160d7060f8097290bc172d9031f94
Reviewed-on: http://gerrit.cloudera.org:8080/12113
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
PARQUET-1387 added int64 timestamps with nanosecond precision that
store timestamps as nanoseconds since the Unix epoch.
As 64 bits are not enough to represent the whole 1400..9999 range
of Impala timestamps, this new type works with a limited range:
1677-09-21 00:12:43.145224192 .. 2262-04-11 23:47:16.854775807 UTC
The benefit of the reduced range is that no validation is necessary
during scanning, as every possible 64 bit value represents a valid
timestamp in Impala. This means it has the potential to be
the fastest way to store timestamps in Impala + Parquet.
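A quick back-of-the-envelope check of the quoted range (standalone
arithmetic only):
  #include <cstdint>
  #include <cstdio>

  int main() {
    // int64 nanoseconds since the Unix epoch span roughly +/-292 years,
    // which is where the 1677..2262 bounds above come from.
    double years = INT64_MAX / (1e9 * 86400 * 365.25);
    std::printf("about +/- %.1f years around 1970\n", years);  // ~292.3
    return 0;
  }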
Another way NANO differs from MICRO and MILLI is that NANO can
only be described with the new logical types in Parquet; it has no
converted type equivalent. This made implementing CREATE TABLE
LIKE PARQUET less trivial than it was for MICRO/MILLI: the type
conversion logic in ParquetHelper.java had to be rewritten to
use LogicalTypeAnnotation instead of ConvertedType.
The changes on Java side also made bumping CDH_BUILD_NUMBER
necessary.
Testing:
- added a new testfile with int64 nano timestamps
- ran core tests
Change-Id: I932396d8646f43c0b9ca4a6359f164c4d8349d8f
Reviewed-on: http://gerrit.cloudera.org:8080/11984
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The idea is to optimise the common case where there are long runs of
NULL or non-NULL values (i.e. the def level is repeated). We can
detect this cheaply by keying the decoding loop in the column reader
off the state of the def level RLE decoder - if there's a long run
of repeated levels, we can skip checking the def level for every
value. We still fall back to decoding, caching and reading
value-by-value a batch of def levels whenever the next def level is not
in a repeated run. We still use the old approach for decoding rep
levels. There might be some benefit to using the same approach for rep
levels *if* repeated def and rep level runs line up.
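A schematic sketch of the idea; the decoder interface below is a
hypothetical stand-in, not Impala's actual RleBatchDecoder API:
  #include <algorithm>
  #include <cstdint>

  struct RleRun { bool is_repeated; int length; uint8_t level; };

  template <class Decoder, class BatchFn, class OneFn>
  void DecodeColumn(Decoder& def_levels, uint8_t max_def_level,
                    int num_values, BatchFn decode_batch, OneFn decode_one) {
    while (num_values > 0) {
      RleRun run = def_levels.PeekRun();
      if (run.is_repeated && run.length >= 32) {
        // Long run of identical def levels: every value in the run is
        // either NULL or non-NULL, so skip the per-value level check.
        int n = std::min(run.length, num_values);
        decode_batch(/*non_null=*/run.level == max_def_level, n);
        def_levels.SkipValues(n);
        num_values -= n;
      } else {
        // Short or literal run: fall back to value-by-value decoding
        // from a cached batch of def levels.
        decode_one(/*non_null=*/def_levels.Next() == max_def_level);
        --num_values;
      }
    }
  }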
These changes should unlock further optimizations because more time is
spent in simple kernel functions, e.g. UnpackAndDecode32Values() for
dictionary decompression, which is very optimisable using SIMD etc.
Snappy decompression now seems to be the main CPU bottleneck for
decoding snappy-compressed Parquet.
Perf:
Running TPC-H scale factor 60 on uncompressed and snappy parquet
both showed a ~4% speedup overall.
Microbenchmarks show that scans doing only dictionary decoding
on uncompressed Parquet are ~75% faster:
set mt_dop=1;
select min(l_returnflag) from lineitem;
Testing:
We have alltypesagg with a mix of NULL and non-NULL values.
Many tables have long runs of non-NULL values.
Added new test data and coverage:
* a test table manynulls with long runs of null values.
* a large CHAR test table
* missing coverage for materialising pos slot in flattened nested types
scan.
* Extended dict test to test longer runs.
* A larger version of complextypestbl with interesting collection
shapes - NULL collections, empty collections, etc., particularly runs
of collections with the same shape.
* Test interaction of timestamp validation with conversion
* Ran code coverage build to confirm all code paths are tested
* ASAN and exhaustive runs.
Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Reviewed-on: http://gerrit.cloudera.org:8080/8319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Changes:
- parquet.thrift is updated to a newer version which contains the
timestamp logical type.
- INT64 columns with converted types TIMESTAMP_MILLIS and
TIMESTAMP_MICROS can be read as TIMESTAMP.
- If the logical type is timestamp, then the type will contain the
information whether the UTC->local conversion is necessary. This
feature is only supported for the new timestamp types, so INT96
timestamps must still use flag
convert_legacy_hive_parquet_utc_timestamps.
- Min/max stat filtering is enabled again for columns that need
UTC->local conversion. This was disabled in IMPALA-7559 because
it could incorrectly drop column chunks.
- CREATE TABLE LIKE PARQUET converts these columns to
TIMESTAMP - before the change, an error was returned instead.
- The bulk of the Parquet column stat logic was moved to a new class
called "ColumnStatsReader".
Testing:
- Added unit tests for timezone conversion (this needed a new public
function in timezone_db.h and adding CET to tzdb_tiny).
- Added parquet files (created with parquet-mr) with int64 timestamp
columns.
Change-Id: I4c7c01fffa31b3d2ca3480adf6ff851137dadac3
Reviewed-on: http://gerrit.cloudera.org:8080/11057
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This fixes a class of bugs where the planner incorrectly uses the raw
string from the parser instead of the unescaped string. This occurs in
several places that push predicates down to the storage layer:
* Kudu scans
* HBase scans
* Data source scans
There are some more complex issues with escapes and the LIKE predicate
that are tracked separately by IMPALA-2422.
This also uncovered a different issue with RCFiles that is tracked by
IMPALA-7778 and is worked around by the tests added.
In order to make bugs like this more obvious in the future, I renamed
getValue() to getValueWithOriginalEscapes().
Testing:
Added regression test that tests handling of backslash escapes on all
file formats. I did not add a regression test for the data source bug
since it seems to require some major modification of the data source
test infrastructure.
Change-Id: I53d6e20dd48ab6837ddd325db8a9d49ee04fed28
Reviewed-on: http://gerrit.cloudera.org:8080/11814
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00
Different parts of Impala treat these timestamps differently:
- string conversion leads to invalid representation that cannot be
converted back to timestamp
- timezone conversions handle the overflowing time part and give
a valid timestamp result (at least since CCTZ, I did not check
older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
resulting Parquet file will also contain corrupt timestamps
The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamps outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part are corrupt, then the error about the corrupt time
part is returned.
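The added check boils down to a range test on the nanoseconds-of-day
field of the INT96 representation (illustrative sketch):
  #include <cstdint>

  // An INT96 Parquet timestamp stores nanoseconds within the day plus
  // a Julian day number; a valid time-of-day must lie in [0, 24 hours).
  constexpr int64_t kNanosPerDay = 24LL * 60 * 60 * 1000 * 1000 * 1000;

  bool IsValidTimeOfDay(int64_t nanos_of_day) {
    return nanos_of_day >= 0 && nanos_of_day < kNanosPerDay;
  }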
Testing:
- added a new scanner test that reads a corrupted Parquet file
with edge values
Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Reviewed-on: http://gerrit.cloudera.org:8080/11521
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If convert_legacy_hive_parquet_utc_timestamps=true and the Parquet
file was written by parquet-mr (also used by Hive), then timestamps are
converted from UTC to local time during scanning. Stat filtering
did not handle this case correctly and compared UTC min/max values
from stats with local min/max values from predicates. This could
lead to skipping row groups incorrectly.
Note that parquet-mr only writes stats if min and max are equal,
because it cannot order timestamps correctly, so the only case
affected here is when every value is the same in the column chunk.
It would be possible to implement stat filtering correctly, but
this is non-trivial because of DST and historical timezone rule
changes.
Testing:
- added a Hive generated parquet file + custom cluster test
that could reproduce this issue
Change-Id: Id4c02230993f2390c03d513f08bae2e9d3d538fa
Reviewed-on: http://gerrit.cloudera.org:8080/11431
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The error message described in IMPALA-6442 incorrectly reported the
file offset where the Parquet footer starts, as if the offset were
counted from the end of the file instead of from the beginning. The
fix changes the reported file offset to be counted from the beginning
of the Parquet file.
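For context, a Parquet file ends with [footer metadata][4-byte footer
length]["PAR1" magic], so the offset counted from the beginning is
simply (sketch):
  #include <cstdint>

  // 4-byte footer length + 4-byte "PAR1" magic follow the footer.
  int64_t FooterStartOffset(int64_t file_size, uint32_t footer_len) {
    return file_size - 8 - static_cast<int64_t>(footer_len);
  }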
Testing:
Created a small table that contains one row of data with a single
BIGINT column and stored it as Parquet. Manually changed the footer
size field to be:
1) smaller than the original footer size by 1, to trigger the error
message fixed by this JIRA and verify that the fix works correctly;
2) bigger than the file size, to trigger another related error message.
Change-Id: I35235e99ea9ceb0d31961dd3b8069f7194f5a2de
Reviewed-on: http://gerrit.cloudera.org:8080/11379
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements support for primitive type widening on Parquet
tables. Only conversions that do not lose any precision are supported:
- tinyint (INT32) -> smallint (INT32), int (INT32), bigint (INT64),
double (DOUBLE)
- smallint (INT32) -> int (INT32), bigint (INT64), double (DOUBLE)
- int (INT32) -> bigint (INT64), double (DOUBLE)
- float (FLOAT) -> double (DOUBLE)
Testing:
- Added BE test
- Added E2E test
- Ran core tests
Change-Id: If93394b035c64cf6fc5f37b54d29c034cc1f86e4
Reviewed-on: http://gerrit.cloudera.org:8080/11268
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated with the DECIMAL tag to make it behave like
a decimal.
The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.
Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64
I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.
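Conceptually the decoding is simple: an INT32/INT64-encoded decimal
already holds the unscaled value, and the scale comes from the column
metadata (illustrative sketch; __int128 stands in for a wide decimal):
  #include <cstdint>

  __int128 UnscaledFromInt32(int32_t v) { return v; }  // DECIMAL(1..9)
  __int128 UnscaledFromInt64(int64_t v) { return v; }  // DECIMAL(10..18)

  // Example: DECIMAL(9,2) value 1234567.89 is stored as INT32 123456789.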
Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
and updated test_scanners.py
Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Reviewed-on: http://gerrit.cloudera.org:8080/11000
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala currently uses two different libraries for timestamp
manipulations: boost and glibc.
Issues with boost:
- Time-zone database is currently hard coded in timezone_db.cc.
Impala admins cannot update it without upgrading Impala.
- Time-zone database is flat, therefore can’t track year-to-year
changes.
- Time-zone database is not updated on a regular basis.
Issues with glibc:
- Uses /usr/share/zoneinfo/ database which could be out of sync on
some of the nodes in the Impala cluster.
- Uses the host system’s local time-zone. Different nodes in the
Impala cluster might use a different local time-zone.
- Conversion functions take a global lock, which causes severe
performance degradation.
In addition to the issues above, the fact that /usr/share/zoneinfo/
and the hard-coded boost time-zone database are both in use is a
source of inconsistency in itself.
This patch makes the following changes:
- Instead of boost and glibc, impalad uses Google's CCTZ to implement
time-zone conversions.
- Introduces a new startup flag (--hdfs_zone_info_zip) to impalad to
specify an HDFS/S3/ADLS path to a zip archive that contains the
shared compiled IANA time-zone database. If the startup flag is set,
impalad will use the specified time-zone database. Otherwise,
impalad will use the default /usr/share/zoneinfo time-zone database.
- Introduces a new startup flag (--hdfs_zone_alias_conf) to impalad to
specify an HDFS/S3/ADLS path to a shared config file that contains
definitions for non-standard time-zone aliases.
- impalad reads the entire time-zone database into an in-memory
map on startup for fast lookups.
- The name of the coordinator node’s local time-zone is saved to the
query context when preparing query execution. This time-zone is used
whenever the current time-zone is referred afterwards in an
execution node.
- Adds a new ZipUtil class to extract files from a zip archive. The
implementation is not vulnerable to Zip Slip.
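For reference, minimal CCTZ usage (a standalone sketch, not Impala's
timezone wrapper): load a zone by IANA name and convert a UTC time
point to civil time.
  #include <chrono>
  #include <cstdio>

  #include "cctz/civil_time.h"
  #include "cctz/time_zone.h"

  int main() {
    cctz::time_zone tz;
    if (!cctz::load_time_zone("America/Los_Angeles", &tz)) return 1;
    auto now = std::chrono::system_clock::now();
    cctz::civil_second cs = cctz::convert(now, tz);
    std::printf("%04d-%02d-%02d %02d:%02d:%02d\n",
                static_cast<int>(cs.year()), cs.month(), cs.day(),
                cs.hour(), cs.minute(), cs.second());
    return 0;
  }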
Cherry-picks: not for 2.x.
Change-Id: I93c1fbffe81f067919706e30db0a34d0e58e7e77
Reviewed-on: http://gerrit.cloudera.org:8080/9986
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reading dictionary encoded Parquet data pages where the bit width is
larger than the encoded type's size (e.g. coding 8 bit TINYINT with
16 bit dictionary indices) led to a DCHECK error in debug builds.
Impala does not create such Parquet files (an N bit type can have at
most 2^N distinct values, so N bit dictionary indices are enough
for a dictionary that contains every possible value), but the Parquet
standard does not forbid it.
These DCHECKs were probably introduced by a copy-paste error (similar
checks exist in the non-dictionary encoded bit reader functions,
where they are valid).
Testing:
- a new test is added to check that these data pages can be decoded
correctly
Change-Id: I9ff3b00cbcab09dec11b3607d7d9a9c2c0025e1a
Reviewed-on: http://gerrit.cloudera.org:8080/10683
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing to ORC
tables is not supported yet either.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust for corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.
A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.
There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".
Testing:
As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet.
tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.
Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Reviewed-on: http://gerrit.cloudera.org:8080/9403
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The DCHECK was only valid if the Parquet file metadata is internally
consistent, with the number of values reported by the metadata
matching the number of encoded levels.
The DCHECK was intended to directly detect misuse of the RleBatchDecoder
interface, which would lead to incorrect results. However, our other
test coverage for reading Parquet files is sufficient to test the
correctness of level decoding.
Testing:
Added a minimal corrupt test file that reproduces the issue.
Change-Id: Idd6e09f8c8cca8991be5b5b379f6420adaa97daa
Reviewed-on: http://gerrit.cloudera.org:8080/9556
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This change allows casting of a string in 'lazy' date/time
format to timestamp. The supported lazy date formats are:
yyyy-[M]M-[d]d
yyyy-[M]M-[d]d [H]H:[m]m:[s]s[.SSSSSSSSS]
[H]H:[m]m:[s]s[.SSSSSSSSS]
We will incur a SCAN performance penalty (approximately 1/2
TotalReadThroughput) when the string is in one of these
lazy date/time formats.
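A toy illustration of the 'lazy' date part (standalone, not the actual
parser): month and day may be one or two digits.
  #include <cstdio>

  // Accepts e.g. "2017-5-9" as well as "2017-05-09".
  bool ParseLazyDate(const char* s, int* year, int* month, int* day) {
    int consumed = 0;
    return std::sscanf(s, "%4d-%2d-%2d%n", year, month, day, &consumed) == 3
        && s[consumed] == '\0';
  }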
Testing:
Benchmarked the performance consequence by executing this SQL on
a private build over 3.8 billion rows:
select min(cast (time_string as timestamp)) from private.impala_5315
Added tests for valid and invalid date/time format strings
in expr-test.cc, in line with the existing tests for the CAST() function.
Added end-to-end tests into exprs.test and
select-lazy-timestamp.test to exercise the new function within
the context of a query.
Added tests to exercise the leading and trailing white space trimming
behaviour in default and lazy date/time string format (IMPALA-6630).
Change-Id: Ib9a184a09d7e7783f04d47588537612c2ecec28f
Reviewed-on: http://gerrit.cloudera.org:8080/7009
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
IMPALA-6592 revealed a gap in test coverage for files with
invalid/unsupported Parquet codecs. This adds a test that reproduces the
bug that was present in my IMPALA-4835 patch. master is unaffected by
this bug.
I also hid the conversion tables and made the conversion go through
functions that validate the enum values, to make it easier to track down
problems like this in the future.
Testing:
Ran exhaustive tests.
Change-Id: I1502ea7b7f39aa09f0ed2677e84219b37c64c416
Reviewed-on: http://gerrit.cloudera.org:8080/9500
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This will result in incorrect results when filtering the
data.
This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other still can be used.
I added some tests to QueryTest/parquet-stats.test
Change-Id: If3897fc1426541239223670812f59e2bed32f455
Reviewed-on: http://gerrit.cloudera.org:8080/9358
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
SnappyDecompressor::MaxOutputLen assumes the input pointer to be
non-null. That is not true when the Parquet file is corrupted and the
compressed_page_size field in a page header is 0. This patch handles
this error instead of failing a DCHECK.
Testing: A bad Parquet file with 0 compressed_page_size is added. It
crashes Impala without this patch.
Change-Id: I0d42937aab92a74f8e104d2f7fcd64dc24f6a500
Reviewed-on: http://gerrit.cloudera.org:8080/8977
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This patch maps signed integer logical types in Parquet to supported
Impala column types. This change introduces the following mapping -
INT_8 -> TINYINT
INT_16 -> SMALLINT
INT_32 -> INT
INT_64 -> BIGINT
Also, added a parquet file with the following schema for testing -
schema {
optional int32 id;
optional int32 tinyint_col (INT_8);
optional int32 smallint_col (INT_16);
optional int32 int_col;
optional int64 bigint_col;
}
Change-Id: I47a8371858c9597c6a440808cf6f933532468927
Reviewed-on: http://gerrit.cloudera.org:8080/8548
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins
Switch the decoders to using more batch-oriented interfaces. As an
intermediate step this doesn't make the interfaces of LevelDecoder
or DictDecoder batch-oriented, only the lower-level utility classes.
The next step would be to change those interfaces to be batch-oriented
and make corresponding optimisations in Parquet. This could deliver much
larger perf improvements than the current patch.
The high-level changes are:
* BitReader -> BatchedBitReader, which is built to unpack runs of 32
bit-packed values efficiently.
* RleDecoder -> RleBatchDecoder, which exposes the repeated and literal
runs to the caller and uses BatchedBitReader to unpack literal runs
efficiently.
* Dict decoding uses RleBatchDecoder to decode repeated runs efficiently
and uses the BitPacking utilities to unpack and encode in a single
step.
Also removes an older benchmark that isn't too interesting (since
the batch-oriented approach to encoding and decoding is so much
faster than the value-by-value approach).
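For reference, the RLE/bit-packing hybrid run framing that the batched
decoder surfaces to callers (illustrative sketch of the Parquet
encoding, no error handling):
  #include <cstdint>

  // A run starts with a ULEB128 varint header:
  //   low bit 0 -> repeated run of (header >> 1) copies of one value
  //   low bit 1 -> literal run of (header >> 1) * 8 bit-packed values
  struct RunHeader { bool is_literal; uint32_t num_values; };

  RunHeader ReadRunHeader(const uint8_t*& p) {
    uint32_t header = 0;
    int shift = 0;
    uint8_t byte;
    do {
      byte = *p++;
      header |= static_cast<uint32_t>(byte & 0x7f) << shift;
      shift += 7;
    } while (byte & 0x80);
    bool literal = (header & 1) != 0;
    uint32_t count = header >> 1;
    return {literal, literal ? count * 8 : count};
  }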
Testing:
* Ran core tests.
* Updated unit tests to exercise new code.
* Added test coverage for the deprecated bit-packed level encoding to
verify that it still works (there was no coverage previously).
Perf:
Single-node benchmarks showed a few % performance gain. 16 node cluster
benchmarks only showed a gain for TPC-H nested.
Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6
Reviewed-on: http://gerrit.cloudera.org:8080/8267
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Extends the Parquet column reader and associated classes to allow more
than one possible physical type for a given logical type. This patch
only adds support for variable-sized byte-array encoded decimals; more
will be added in upcoming commits.
Also, column-level metadata verification, which was previously being
done per row group, will now only be done once per column per file.
Testing:
Added backend test for verifying newly added decimal types are decoded
correctly.
Added a query test that decodes both plain and dictionary-encoded
decimals using binary encoding.
Performance:
Initial perf testing using tpcds_1000 shows no regression.
Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Reviewed-on: http://gerrit.cloudera.org:8080/7822
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
Having the repetition level set to REPEATED on the root schema
caused a scan to fail with an error when Impala tried to parse that
table.
As a solution, the 'REPEATED' repetition level is ignored when the
root schema is processed. The reasoning is that the Parquet
format description says that the repetition level of the root schema
should not be set to REPEATED anyway, so it's safe to ignore it in
case it is set to this value for some reason.
Change-Id: I7ea84589e1d122ad9d43adde46893ec0ecc5f9c4
Reviewed-on: http://gerrit.cloudera.org:8080/7870
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds functionality to write and read parquet::Statistics for
Decimal, String, and Timestamp values. As an exception, we don't read
statistics for CHAR columns, since CHAR support is broken in Impala
(IMPALA-1652).
This change also switches from using the deprecated fields 'min' and
'max' to populate the new fields 'min_value' and 'max_value' in
parquet::Statistics, which were added in parquet-format pull request #46.
The HdfsParquetScanner will preferably read the new fields if they are
populated and if the column order 'TypeDefinedOrder' has been used to
compute the statistics. For columns without a column order set or with
only the deprecated fields populated, the scanner will read them only if
they are of simple numeric type, i.e. boolean, integer, or floating
point.
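The fallback described above boils down to a choice like this
(illustrative sketch with simplified types):
  #include <optional>
  #include <string>

  struct ColumnChunkStats {
    std::optional<std::string> min_value, max_value;  // new fields
    std::optional<std::string> min, max;              // deprecated fields
  };

  enum class StatsSource { NEW_FIELDS, DEPRECATED_FIELDS };

  // Prefer the new fields when the column order is TypeDefinedOrder;
  // otherwise only trust stats for simple numeric types, using whichever
  // pair of fields is populated. Returns nullopt if stats are ignored.
  std::optional<StatsSource> ChooseStats(const ColumnChunkStats& s,
                                         bool has_type_defined_order,
                                         bool is_simple_numeric_type) {
    if (s.min_value && s.max_value && has_type_defined_order) {
      return StatsSource::NEW_FIELDS;
    }
    if (!is_simple_numeric_type) return std::nullopt;
    if (s.min_value && s.max_value) return StatsSource::NEW_FIELDS;
    if (s.min && s.max) return StatsSource::DEPRECATED_FIELDS;
    return std::nullopt;
  }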
This change removes the validation of the Parquet Statistics we write to
Hive from the tests, since Hive does not write the new fields. Instead
it adds a parquet file written by Hive that uses the deprecated fields
for its statistics. It uses that file to exercise the fallback logic for
supported types in a test.
This change also cleans up the interface of ParquetPlainEncoder in
parquet-common.h.
Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Reviewed-on: http://gerrit.cloudera.org:8080/6563
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
This change fixed IMPALA-4873 by adding the capability to supply a dict
'test_file_vars' to run_test_case(). Keys in this dict will be replaced
with their values inside test queries before they are executed.
Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70
Reviewed-on: http://gerrit.cloudera.org:8080/6817
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
Zero-slot scans of Parquet files that have num_rows > MAX_INT32
in the footer metadata used to run forever due to an overflow when
calculating the remaining number of rows to process.
Testing:
- Added a regression test using a file with num_rows = 2*MAX_INT32.
- Locally ran test_scanners.py which succeeded.
- Private core/hdfs run succeeded
Change-Id: Ib9f8a6b83f8f621451d5977423ef81a6e4b124bd
Reviewed-on: http://gerrit.cloudera.org:8080/6286
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The string parsing code already errors if the decimal column either
overflows or underflows (i.e. loses scale). Let's just add a test
case.
Change-Id: Idd66c0fb5a4d201919d39f73dea08b87339d6469
Reviewed-on: http://gerrit.cloudera.org:8080/6150
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Before this patch, we would simply read the INT96 Parquet timestamp
representation and assume that it's valid. However, not all bit
permutations represent a valid timestamp. One of the boost functions
raised an exception (that we didn't catch) when passed an invalid
boost date object, which resulted in a crash. This patch fixes the
problem by validating that the date falls into the 1400..9999 year
range as we are scanning Parquet.
Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac
Reviewed-on: http://gerrit.cloudera.org:8080/5343
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
For Parquet files with no row groups but with num_rows=0 in the
file footer, the Parquet scanner returned an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions, which used to accept such files.
This patch restores the previous behavior and adds tests.
Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Reviewed-on: http://gerrit.cloudera.org:8080/4693
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*
A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:
find . | grep "\.java\|\.py\|\.h\|\.cc" | \
xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g
along with some manual fixes.
After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/
Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.
The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.
There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.
Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Reviewed-on: http://gerrit.cloudera.org:8080/3299
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
HIVE-5795 introduced a parameter skip.header.line.count to skip header
lines from input files. This change introduces the capability to skip
an arbitrary number of header lines from CSV input files on HDFS. The
size of the total file header must be smaller than
max_scan_range_length, otherwise an error will be reported. This is
necessary because scan ranges are not read in disk order, so there is
no way of identifying header lines except by counting from the start
of the first scan range.
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1  | 1  |
| 2  | 2  |
| 3  | 3  |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1   | c2   |
+------+------+
| NULL | NULL |
| 1    | 1    |
| 2    | 2    |
| 3    | 3    |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2
Fetched 4 row(s) in 0.41s
Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.
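For the gzip side, the change amounts to continuing after Z_STREAM_END
instead of stopping; a standalone zlib sketch (not Impala's
decompressor code):
  #include <zlib.h>

  #include <stdexcept>
  #include <string>

  std::string GunzipAllStreams(const unsigned char* in, size_t in_len) {
    z_stream strm = {};
    // 16 + MAX_WBITS selects the gzip wrapper.
    if (inflateInit2(&strm, 16 + MAX_WBITS) != Z_OK) {
      throw std::runtime_error("inflateInit2 failed");
    }
    strm.next_in = const_cast<unsigned char*>(in);
    strm.avail_in = static_cast<uInt>(in_len);
    std::string out;
    unsigned char buf[64 * 1024];
    while (strm.avail_in > 0) {
      strm.next_out = buf;
      strm.avail_out = sizeof(buf);
      int ret = inflate(&strm, Z_NO_FLUSH);
      out.append(reinterpret_cast<char*>(buf), sizeof(buf) - strm.avail_out);
      if (ret == Z_STREAM_END) {
        inflateReset(&strm);  // another gzip stream may follow
      } else if (ret != Z_OK) {
        inflateEnd(&strm);
        throw std::runtime_error("inflate failed");
      }
    }
    inflateEnd(&strm);
    return out;
  }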
Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
There was an incorrect DCHECK in the parquet scanner. If abort_on_error
is false, the intended behaviour is to skip to the next row group, but
the DCHECK assumed that execution should have aborted if a parse error
was encountered.
This also:
- Fixes a DCHECK after an empty row group. InitColumns() would try to
create empty scan ranges for the column readers.
- Uses metadata_range_->file() instead of stream_->filename() in the
scanner. InitColumns() was using stream_->filename() in error
messages, which used to work but now stream_ is set to NULL before
calling InitColumns().
Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701
Reviewed-on: http://gerrit.cloudera.org:8080/1257
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Add support for creating a table based on a Parquet file which contains
arrays, structs and/or maps.
Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins