Commit Graph

1630 Commits

Author SHA1 Message Date
Daniel Becker
87e0077255 IMPALA-11734: TestIcebergTable.test_compute_stats fails in RELEASE builds
If the Impala version is set to a release build as described in point 8
in the "How to Release" document
(https://cwiki.apache.org/confluence/display/IMPALA/How+to+Release#HowtoRelease-HowtoVoteonaReleaseCandidate),
TestIcebergTable.test_compute_stats fails:

Stacktrace
query_test/test_iceberg.py:852: in test_compute_stats
    self.run_test_case('QueryTest/iceberg-compute-stats', vector, unique_database)
common/impala_test_suite.py:742: in run_test_case
    self.__verify_results_and_errors(vector, test_section, result, use_db)
common/impala_test_suite.py:578: in __verify_results_and_errors
    replace_filenames_with_placeholder)
common/test_result_verifier.py:469: in verify_raw_results
    VERIFIER_MAP[verifier](expected, actual)
common/test_result_verifier.py:278: in verify_query_result_is_equal
    assert expected_results == actual_results
E   assert Comparing QueryTestResults (expected vs actual):
E   2,1,'2.33KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/test_compute_stats_74dbc105.db/ice_alltypes'
E   != 2,1,'2.32KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/test_compute_stats_74dbc105.db/ice_alltypes'

The problem is the file size, which is 2.32KB instead of 2.33KB. This is
because the version is written into the file, and "x.y.z-RELEASE" is one
byte shorter than "x.y.z-SNAPSHOT". The size of the file in this test is
on the boundary between 2.32KB and 2.33KB, so this one byte can change
the value.

This change fixes the problem by using a regex to accept both values so
it works for both snapshot and release versions.
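
A minimal sketch of the fix in the .test file, assuming the row_regex
matcher of Impala's test framework (row contents abbreviated for
illustration):

row_regex: 2,1,'2.3[23]KB','NOT CACHED','NOT CACHED','PARQUET','false','$NAMENODE/.*'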

Change-Id: Ia1fa12eebf936ec2f4cc1d5f68ece2c96d1256fb
Reviewed-on: http://gerrit.cloudera.org:8080/19260
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-21 16:15:00 +00:00
Daniel Becker
03c465dac1 IMPALA-11719: Inconsistency in printing NULL values
NULL values are printed as "NULL" if they are top level or in
collections, but as "null" in structs. We should print collections and
structs in JSON form, so it should be "null" in collections, too. Hive
also follows the latter (correct) approach.

This commit changes the printing of NULL values to "null" in
collections.
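
A sketch of the behaviour change (table and column are hypothetical):

select int_array from tbl;
-- before: [1,NULL,3]
-- after:  [1,null,3]   (a top-level NULL is still printed as "NULL")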

Testing:
 - Modified the tests to expect "null" instead of "NULL" in collections.

Change-Id: Ie5e7f98df4014ea417ddf73ac0fb8ec01ef655ba
Reviewed-on: http://gerrit.cloudera.org:8080/19236
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
2022-11-15 09:24:03 +00:00
Peter Rozsa
97a506c656 IMPALA-11537: Query option validation for numeric types
This change adds a more generic approach to validate numeric
query options and report parse and validation errors.
Supported types: integers, floats, memory specifications.
Range and bound validator helper functions are added to unify
validation at call sites.
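
For illustration, invalid settings like these are now rejected with the
unified parse/range errors (a sketch; exact messages may differ):

SET MEM_LIMIT='10potatoes';  -- parse error: not a valid memory specification
SET BATCH_SIZE=-1;           -- validation error: value out of range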

Testing:
 - Error messages became more generic, so the existing tests
   around query options were updated to match them

Change-Id: Ia7757b52393c094d2c661918d73cbfad7214f855
Reviewed-on: http://gerrit.cloudera.org:8080/19096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-11 22:02:16 +00:00
Michael Smith
15b07ff1fb IMPALA-11704: (Addendum) fix crash on open for HDFS cache
When trying to read from HDFS cache, ReadFromCache calls
FileReader::Open(false) to force the file to open. The prior commit for
IMPALA-11704 didn't allow for that case when using a data cache, as the
data cache check would always happen. This resulted in a crash calling
CachedFile as exclusive_hdfs_fh_ was nullptr. Tests only catch this when
reading from HDFS cache with data cache enabled.

Replaces explicit arguments to override FileReader behavior with a flag
to communicate whether FileReader supports delayed open. Then the caller
can choose whether to call Open before read. Also simplifies calls to
ReadFromPos as it already has a pointer to ScanRange and can check
whether file handle caching is enabled directly. The Open call in
DoInternalRead casts a slightly wider net by only checking UseDataCache;
if the data cache is unavailable or the lookup is a miss, the file will
then be opened.

Adds a select from tpch.nation to the query for test_data_cache.py as
something that triggers checking the HDFS cache.

Change-Id: I741488d6195e586917de220a39090895886a2dc5
Reviewed-on: http://gerrit.cloudera.org:8080/19228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-11 17:53:58 +00:00
LPL
f617e36487 IMPALA-11711: Virtual columns should be skipped in 'FileMetadataUtils::AddIcebergColumns'
In the 'FileMetadataUtils::AddIcebergColumns' method, virtual columns
should be skipped directly. Otherwise, when we query an Iceberg v2
table whose first column is a partition column of BOOLEAN type, a wrong
position-delete result may be given.
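
A sketch of the affected table shape (names are hypothetical, and the
identity-partitioning syntax below is an assumption):

CREATE TABLE ice_v2 (b BOOLEAN, i INT)
PARTITIONED BY SPEC (b)
STORED AS ICEBERG
TBLPROPERTIES ('format-version'='2');
-- Position-delete results against such a table could be wrong before
-- the fix.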

Testing:
 - Add e2e tests
 - Locally tested the results on position-delete Iceberg tables

Change-Id: I58faf3df6ae8a5bcabb1d2ac9f11a6fbcd74bc24
Reviewed-on: http://gerrit.cloudera.org:8080/19223
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-10 00:51:50 +00:00
Fang-Yu Rao
4e6692b024 IMPALA-11686: Fix test_corrupt_stat after IMPALA-11666
IMPALA-11666 revised the message in the query plans when there are
potentially corrupt statistics, which broke test_corrupt_stat, an E2E
test only run in the exhaustive tests. This patch fixes the test file
accordingly.

Testing:
 - Verified locally that the patch passes test_corrupt_stat.

Change-Id: I817c7807a07bb89b93d795bce958b9872eff2eef
Reviewed-on: http://gerrit.cloudera.org:8080/19224
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-09 05:15:14 +00:00
Daniel Becker
77dc20264c IMPALA-11687: Select * with EXPAND_COMPLEX_TYPES=1 and explicit complex types fails
If EXPAND_COMPLEX_TYPES is set to true, some queries that combine star
expressions and explicitly given complex columns fail:

select outer_struct, * from
functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: Illegal reference to non-materialized
slot: tid=1 sid=1

select *, outer_struct.str from
functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: null

Having two stars in a query on a table with complex columns also fails.

select *, * from functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: Illegal reference to non-materialized
slot: tid=6 sid=13

The error is because of this line in 'SelectStmt.addStarResultExpr()':
8e350d0a8a/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java (L811)
What we want to do is create 'SlotRef's for the struct children
(recursively) but 'reExpandStruct()' also creates new 'SlotDescriptor's
for the children. The new 'SlotDescriptor's are redundant and are not
inserted into the tree which can leave them unmaterialised or without a
correct memory layout.

The solution is to only create the 'SlotRef's for the struct children
without creating new 'SlotDescriptor's. This leads us to another
problem:
 - for structs, it is 'SlotRef.analyzeImpl()' that creates the child
   'SlotRef's
 - the constructor 'SlotRef(SlotDescriptor desc)' sets 'isAnalyzed_' to
   true.
Before structs were allowed, this was correct but now struct-typed
'SlotRef's created with the above constructor are counted as analysed
but lack child expressions, which would have been added if 'analyze()'
had been called on them. This essentially violates the contract of this
constructor.

This commit modifies 'SlotRef(SlotDescriptor desc)' so that child
expressions are generated for structs, restoring the correct semantics
of this constructor. After this, it is no longer necessary to call
'reExpandStruct()' in 'SelectStmt.addStarResultExpr()'.

Testing:
 - Added the failing test cases and a few variations of them to
   nested-types-star-expansion.test

Change-Id: Ia8cf53b0a7409faca668713228bfef275f3833f9
Reviewed-on: http://gerrit.cloudera.org:8080/19171
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-04 04:33:02 +00:00
Fang-Yu Rao
20c0de1017 IMPALA-10436: Support storage handler privileges for external Kudu table creation
This patch lowers the privilege requirement for external Kudu table
creation. Before this patch, a user was required to have the ALL
privilege on SERVER if the user wanted to create an external Kudu table.
In this patch we introduce a new type of resources called storage
handler URI and a new access type called RWSTORAGE that will be
supported by Apache Ranger once RANGER-3281 is resolved, which in turn
depends on the release of Apache Hive 4.0 that consists of HIVE-24705.

Specifically, after this patch, a user will be allowed to create an
external Kudu table as long as the user is granted the RWSTORAGE
privilege on the resource specified by a storage handler URI that points
to an existing Kudu table.

For instance, in order for a user 'non_owner' to create an external Kudu
table based on an existing Kudu table 'impala::tpch_kudu.nation', it
suffices to execute the following command as an administrator to grant
the necessary privilege to the requesting user, where "localhost" is the
default address of the Kudu master host, assuming there is only a single
master host in this example.

GRANT RWSTORAGE ON
STORAGEHANDLER_URI 'kudu://localhost/impala::tpch_kudu.nation'
TO USER non_owner

One may wonder why we do not simply drop the privilege check that
required the ALL privilege on SERVER for external Kudu table creation.
One scenario in which such relaxation is not secure is when
the owner or the creator of the existing Kudu table is different from
the requesting user who wants to create an external Kudu table in
Impala. Not requiring any additional privilege check would allow a user
without any privilege to retrieve the contents of the existing Kudu
table.

On the other hand, after this patch we still require a user to have the
ALL privilege on SERVER when the table property of
'kudu.master_addresses' is specified in a query that tries to create a
Kudu table, whether or not the table is external. To be more specific,
the user 'non_owner' would be able to create an external Kudu table
using the following statement once granted the RWSTORAGE privilege on
the storage handler URI specified above.

CREATE EXTERNAL TABLE default.kudu_tbl STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::tpch_kudu.nation')

However, the following query submitted by the same user would be
rejected due to the user 'non_owner' not being granted the ALL privilege
on SERVER.

CREATE EXTERNAL TABLE default.kudu_tbl STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::tpch_kudu.nation',
'kudu.master_addresses'='localhost')

We do not relax this requirement because specifying the addresses of
the Kudu master hosts to connect to should still be considered an
administrative operation.

Testing:
 - Added various FE and E2E tests to verify Impala's behavior after this
   patch with respect to external Kudu table creation.
 - Verified that this patch passes the core tests in the DEBUG build.

Change-Id: I7936e1d8c48696169f7ad7ad92abe44a26eea3c4
Reviewed-on: http://gerrit.cloudera.org:8080/17640
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-03 05:48:17 +00:00
Csaba Ringhofer
a983a347a7 IMPALA-11682: Add tests for minor compacted insert only ACID tables
Only test changes. Minor compacted delta dirs have been supported in
Impala since IMPALA-9512, but at that time Hive supported minor
compaction only on full ACID tables. Since then, Hive has added
support for minor compaction of insert-only/MM tables (HIVE-22610).
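
For illustration, the new tests cover tables compacted on the Hive side
roughly like this (table name hypothetical):

-- In Hive, against an insert-only (MM) transactional table:
ALTER TABLE mm_tbl COMPACT 'minor';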

Change-Id: I7159283f3658f2119d38bd3393729535edd0a76f
Reviewed-on: http://gerrit.cloudera.org:8080/19164
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-03 00:52:08 +00:00
Zoltan Borok-Nagy
301c3cebad IMPALA-11591: Avoid calling planFiles() on Iceberg tables
Iceberg's planFiles() API is very expensive as it needs to read all
the relevant manifest files. It's especially expensive on object
stores like S3.

When there are no predicates on the table and we are not doing
time travel, it's possible to avoid calling planFiles() and do the
scan planning from cached metadata. When none of the predicates are
on partition columns there's little benefit in pushing down predicates
to Iceberg. So with this patch we only push down predicates (and
hence invoke planFiles()) when at least one of the predicates is
on a partition column.
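
For illustration (table and column names are hypothetical):

-- No predicate on a partition column: planned from cached metadata,
-- planFiles() can be skipped:
SELECT count(*) FROM ice_tbl WHERE col = 10;
-- Predicate on a partition column: pushed down to Iceberg, which
-- invokes planFiles():
SELECT count(*) FROM ice_tbl WHERE part_col = 'a';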

This patch introduces a new class to store content files:
IcebergContentFileStore. It separates data, delete, and "old" content
files. "Old" content files are the ones that are not part of the current
snapshot. We add such data files during time travel. Storing "old"
content files in a separate concurrent hash map also fixes a concurrency
bug in the current code.

Testing:
 * executed current e2e tests
 * updated predicate push down tests

Change-Id: Iadb883a28602bb68cf4f61e57cdd691605045ac5
Reviewed-on: http://gerrit.cloudera.org:8080/19043
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-02 22:14:47 +00:00
LPL
3973fc6d09 IMPALA-11608: Fix SHOW TABLE STATS iceberg_tbl shows wrong number of files
Impala's SHOW TABLE STATS outputs a wrong value for the number of files
for Iceberg tables. It should only count the data files and delete
files, but it counts all files under the table directory, including
metadata files, orphaned files, and old data files not belonging to the
current snapshot.
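
For illustration (table name hypothetical):

SHOW TABLE STATS ice_tbl;
-- The #Files column now counts only the data and delete files of the
-- current snapshot.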

Testing:
 - add e2e tests

Change-Id: I110e5e13cec3aa898f115e1ed795ce98e68ef06c
Reviewed-on: http://gerrit.cloudera.org:8080/19150
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-21 22:06:07 +00:00
TheOmid
e327a28757 IMPALA-6684: Fix untracked memory in KRPC
During serialization of a row batch header, a tuple_data_ is created
which holds the compressed tuple data for an outbound row batch.
We would like this tuple data to be trackable, as it is responsible for
a significant portion of the untracked memory from the KRPC data stream
sender. By using MemTrackerAllocator, we can allocate tuple data and
compression scratch and account for them in the memory tracker of the
KrpcDataStreamSender. This solution changes the type of tuple data
and compression scratch from std::string to TrackedString, an
std::basic_string with MemTrackerAllocator as the custom allocator.

This patch adds memory estimation in DataStreamSink.java to account
for OutboundRowBatch memory allocation. This patch also removes the
thrift-based serialization because the thrift RPC has been removed
in the prior commit.

Testing:
 - Passed core tests.
 - Ran a single node benchmark which shows no regression.
 - Updated row-batch-serialize-test and row-batch-serialize-benchmark
   to test the row-batch serialization used by KRPC.
 - Manually collected query profiles, heap growth, and memory usage logs
   showing untracked memory decreased by half.
 - Added test_datastream_sender.py to verify the peak memory of the
   EXCHANGE SENDER node.
 - Raised mem_limit in two of the test_spilling_large_rows test cases.
 - Printed the test line number in PlannerTestBase.java.

New row-batch serialization benchmark:

Machine Info: Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
serialize:            10%   50%   90%     10%     50%     90%
                                        (rel)   (rel)   (rel)
-------------------------------------------------------------
   ser_no_dups_base  18.6  18.8  18.9      1X      1X      1X
        ser_no_dups  18.5  18.5  18.8  0.998X  0.988X  0.991X
   ser_no_dups_full  14.7  14.8  14.8  0.793X   0.79X  0.783X

  ser_adj_dups_base  28.2  28.4  28.8      1X      1X      1X
       ser_adj_dups  68.9  69.1  69.8   2.44X   2.43X   2.43X
  ser_adj_dups_full  56.2  56.7  57.1   1.99X      2X   1.99X

      ser_dups_base  20.7  20.9  20.9      1X      1X      1X
           ser_dups  20.6  20.8  20.9  0.994X  0.995X      1X
      ser_dups_full  39.8    40  40.5   1.93X   1.92X   1.94X

deserialize:          10%   50%   90%     10%     50%     90%
                                        (rel)   (rel)   (rel)
-------------------------------------------------------------
 deser_no_dups_base  75.9  76.6    77      1X      1X      1X
      deser_no_dups  74.9  75.6    76  0.987X  0.987X  0.987X

deser_adj_dups_base   127   128   129      1X      1X      1X
     deser_adj_dups   179   193   195   1.41X   1.51X   1.51X

    deser_dups_base   128   128   129      1X      1X      1X
         deser_dups   165   190   193   1.29X   1.48X   1.49X

Change-Id: I2ba2b907ce4f275a7a1fb8cf75453c7003eb4b82
Reviewed-on: http://gerrit.cloudera.org:8080/18798
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-17 21:59:57 +00:00
Zoltan Borok-Nagy
dc912f016e IMPALA-11655: Impala should set write mode "merge-on-read" by default
Similarly to HIVE-26596 Impala should set merge-on-read write mode
for V2 tables, unless otherwise specified:

* during table creation with 'format-version'='2'
* during alter table set tblproperties 'format-version'='2'

We do so because in the foreseeable future Impala will only support
merge-on-read on the write side (on the read side, copy-on-write is
also supported). Also, Hive currently only supports merge-on-read.
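
A sketch (table name hypothetical; the exact property names set are an
assumption based on the standard Iceberg write modes):

CREATE TABLE ice_v2 (i INT)
STORED AS ICEBERG
TBLPROPERTIES ('format-version'='2');
-- 'write.delete.mode'='merge-on-read' is now set implicitly unless the
-- user specifies otherwise.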

Testing:
 * e2e tests added

Change-Id: Icaa32472cde98e21fb23f5461175db1bf401db3d
Reviewed-on: http://gerrit.cloudera.org:8080/19138
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-10-17 19:37:15 +00:00
Joe McDonnell
11e66523d6 IMPALA-11526: Install en_US.UTF-8 locale into docker images
In IMPALA-11492, ExprTest.Utf8MaskTest was failing on some
configurations because the en_US.UTF-8 locale was missing. Since the
Docker images don't contain en_US.UTF-8, they are subject
to the same bug. This was confirmed by adding test cases
to the test_utf8_strings.py end-to-end test and running it
in the dockerized tests.

This adds the appropriate language pack to the list of packages
installed for the Docker build.

Testing:
 - This adds end-to-end tests to test_utf8_strings.py covering the
   same cases that were failing in ExprTest.Utf8MaskTest. They
   failed without the added language packs, and now succeed.

Change-Id: I353f257b3cb6d45f7d0a28f7d5319fdb457e6e3d
Reviewed-on: http://gerrit.cloudera.org:8080/19080
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
2022-10-11 20:30:50 +00:00
Michael Smith
3577030df6 IMPALA-11562: Revert support for o3fs as default filesystem
Reverts support for o3fs as a default filesystem added in IMPALA-9442.
Updates test setup to use ofs instead.

Munges absolute paths in Iceberg metadata to match the new location
required for ofs. Ozone has strict requirements on volume and bucket
names, so all tables must be created within a bucket (e.g. inside
/impala/test-warehouse/).

Change-Id: I45e90d30b2e68876dec0db3c43ac15ee510b17bd
Reviewed-on: http://gerrit.cloudera.org:8080/19001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-28 22:35:48 +00:00
xqhe
d47d305bf4 IMPALA-11418: A statement that returns at most one row need not spool results
A query that returns at most one row can run more efficiently without
result spooling. If result spooling is enabled, it sets a minimum
memory reservation in PlanRootSink, e.g. for 'select 1' the minimum
memory reservation is 4MB.

This optimization can reduce the statement's resource reservation and
prevent the exception 'Failed to get minimum memory reservation' when
that much memory is not available on the host.
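
Statements like these are now planned without result spooling (table
name hypothetical):

SELECT 1;
SELECT count(*) FROM tbl;  -- ungrouped aggregate: at most one row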

Testing:
- Add tests in result-spooling.test

Change-Id: Icd4d73c21106048df68a270cf03d4abd56bd3aac
Reviewed-on: http://gerrit.cloudera.org:8080/18711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-28 05:46:03 +00:00
xiabaike
a377662e94 IMPALA-11420: Support CREATE/ALTER VIEW SET/UNSET TBLPROPERTIES
Add TBLPROPERTIES support to views; here are some examples:
  CREATE VIEW [IF NOT EXISTS] [database_name.]view_name
    [(column_name [COMMENT 'column_comment'][, ...])]
    [COMMENT 'view_comment']
    [TBLPROPERTIES (property_name = property_value, ...)]
    AS select_statement;

  ALTER VIEW [database_name.]view_name SET TBLPROPERTIES
    (property_name = property_value, ...);

  ALTER VIEW [database_name.]view_name UNSET TBLPROPERTIES
    (property_name, ...);

Change-Id: I8d05bb4ec1f70f5387bb21fbe23f62c05941af18
Reviewed-on: http://gerrit.cloudera.org:8080/18940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-27 13:27:41 +00:00
Zoltan Borok-Nagy
b91aa06537 IMPALA-11582: Implement table sampling for Iceberg tables
This patch adds table sampling functionality for Iceberg tables.
From now on it's possible to execute SELECT and COMPUTE STATS
statements with table sampling.

Predicates in the WHERE clause affect the results of table sampling
similarly to how legacy tables work (sampling is applied after static
partition and file pruning).

Sampling is repeatable via the REPEATABLE clause.
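
For illustration (table name hypothetical):

SELECT count(*) FROM ice_tbl TABLESAMPLE SYSTEM(10) REPEATABLE(1234);
COMPUTE STATS ice_tbl TABLESAMPLE SYSTEM(20) REPEATABLE(1234);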

Testing
 * planner tests
 * e2e tests for V1 and V2 tables

Change-Id: I5de151747c0e9d9379a4051252175fccf42efd7d
Reviewed-on: http://gerrit.cloudera.org:8080/18989
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-26 15:49:22 +00:00
Zoltan Borok-Nagy
3f382b7ebb IMPALA-11583: Use Iceberg API to update stats
Before this patch we used HMS API alter_table() to update an Iceberg
table's statistics. 'alter_table()' API calls are unsafe for Iceberg
tables as they overwrite the whole HMS table, including the table
property 'metadata_location' which must always point to the latest
snapshot. Hence concurrent modification to the same table could be
reverted by COMPUTE STATS.

In this patch we use the Iceberg API to update Iceberg tables.
Also, table-level stats (e.g. numRows, totalSize, totalFiles) are not
set, as Iceberg keeps them up-to-date.

COMPUTE INCREMENTAL STATS without partition clause is the same as
plain COMPUTE STATS for Iceberg tables. This behavior is aligned
with current behavior on non-partitioned tables:
https://impala.apache.org/docs/build/html/topics/impala_compute_stats.html

COMPUTE INCREMENTAL STATS .. PARTITION raises an error.

DROP STATS has also been modified to not drop table-level stats for
HMS-integrated Iceberg tables.

Testing:
 * added e2e tests for COMPUTE STATS
 * added e2e tests for DROP STATS
 * manually tested concurrent Hive INSERT and Impala COMPUTE STATS
   using latest Hive
 * opened IMPALA-11590 to add automated interop tests with Hive

Change-Id: I46b6e0a5a65e18e5aaf2a007ec0242b28e0fed92
Reviewed-on: http://gerrit.cloudera.org:8080/18995
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-22 18:33:25 +00:00
Peter Rozsa
04b5319e6e IMPALA-9499: Display support for all complex types in a SELECT * query
This change adds the EXPAND_COMPLEX_TYPES query option to support the
display of complex types in SELECT statements where a star (*)
expression is in the select list. By default, the query option is
disabled. When it's enabled, it changes the behaviour of star expansion
to list all top-level columns including ones with complex types,
instead of listing the scalar columns only. Nested complex type
expansion is also supported, e.g. struct.* will enumerate the members
of the struct. Array, map and struct types are supported.
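
For illustration:

SET EXPAND_COMPLEX_TYPES=1;
-- Star expansion now also lists struct, array and map columns:
SELECT * FROM functional_orc_def.complextypes_nested_structs;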

Testing:
 - Analyzer tests check select statements when the query option is
   enabled or disabled.
 - EE tests check the proper complex type deserialization when the query
   option is enabled, and the original behaviour when the option is
   disabled.

Change-Id: I84b5e5703f9e0ce0f4f8bff83941677dd7489974
Reviewed-on: http://gerrit.cloudera.org:8080/18863
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-09 22:43:33 +00:00
Csaba Ringhofer
2e1ce445b2 IMPALA-11567: Fix left outer join if the right side is subquery with complex type
Non-matching rows from the left side will null out all slots from the
right side in left outer joins. If the right side is a subquery, it
is possible that some returned expressions will be non-NULL even if all
slots are NULL (e.g. constants) - these expressions are wrapped as
IF(TupleIsNull(tids), NULL, expr) to null them in the non-matching
case.

The logic above used to hit a precondition for complex types. We can
safely ignore complex types for now, as currently the only possible
expression that returns a complex type is SlotRef, which doesn't
need to be wrapped. We will have to revisit this once functions are
added that return complex types.
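
For illustration, a query shape that used to hit the precondition
(names are hypothetical):

SELECT t.id, v.int_arr
FROM small_tbl t
LEFT OUTER JOIN (SELECT id, int_array_col AS int_arr FROM big_tbl) v
  ON t.id = v.id;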

Testing:
- added a regression test and ran it

Change-Id: Iaa8991cd4448d5c7ef7f44f73ee07e2a2b6f37ce
Reviewed-on: http://gerrit.cloudera.org:8080/18954
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-09 14:42:41 +00:00
Gergely Fürnstáhl
f598b2ad68 IMPALA-10610: Support multiple file formats in a single Iceberg Table
Added support for multiple file formats. Previously Impala created a
Scanner class based on the partition's file format; now, in the case
of an Iceberg table, it reads the file format from the file-level
metadata instead.

IcebergScanNode aggregates file formats as well, instead of relying
on partitions, so it can be used for planning.

Testing:

Created a mixed file format table with Hive and added a test for it.

Change-Id: Ifc816595724e8fd2c885c6664f790af61ddf5c07
Reviewed-on: http://gerrit.cloudera.org:8080/18935
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-08 18:03:37 +00:00
Daniel Becker
37f44a58f3 IMPALA-10918: Allow map type in SELECT list
Adding support for MAP types in the select list.
An example of how maps are printed:
{"k1":2,"k2":null}

Nested collection types (maps and arrays) are supported in any
combination. However, structs in collections and collections in structs
are not supported.

Limitations (other than map support) as described in the commit for
IMPALA-9498 still apply, the following are to be implemented later:
- Unify HS2 / Beeswax logic with the way STRUCTs are handled.
  This could be done in a "final" logic that can handle
  STRUCTS/ARRAYS nested to each other
- Implement "deep copy" and "deep serialize" for collections in BE.
  This would enable all operators, e.g. ORDER BY and UNION.

Testing:
 - modified the FE tests that checked that maps were not allowed in the
   select list - now the tests expect maps to be allowed there
 - added FE and EE tests involving maps based on the array tests

Change-Id: I921c647f1779add36e7f5df4ce6ca237dcfaf001
Reviewed-on: http://gerrit.cloudera.org:8080/18736
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-07 19:55:43 +00:00
LPL
cc26f345a4 IMPALA-11507: Use absolute_path when Iceberg data files are outside of the table location
For Iceberg tables, when one of the following properties is used, the
table is considered as possibly having data outside of the table
location directory:
- 'write.object-storage.enabled' is true
- 'write.data.path' is not empty
- 'write.location-provider.impl' is configured
- 'write.object-storage.path'(Deprecated) is not empty
- 'write.folder-storage.path'(Deprecated) is not empty

We should tolerate the situation where the relative path of a data
file cannot be derived from the table location path, and use the
absolute path in that case. E.g. an ETL program may place the metadata
of an Iceberg table in
'hdfs://nameservice_meta/warehouse/hadoop_catalog/ice_tbl/metadata',
the recent data files in
'hdfs://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data', and
the data files from half a year ago in
's3a://nameservice_data/warehouse/hadoop_catalog/ice_tbl/data'; such a
table should still be queryable normally by Impala.

Testing:
 - added e2e tests

Change-Id: I666bed21d20d5895f4332e92eb30a94fa24250be
Reviewed-on: http://gerrit.cloudera.org:8080/18894
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-06 18:35:30 +00:00
Zoltan Borok-Nagy
73da4d7ddf IMPALA-11484: Create SCAN plan for Iceberg V2 position delete tables
This patch adds support for reading Iceberg V2 tables that use
position deletes. Equality deletes are still not supported. Position
delete files store the file path and file position of the deleted rows.

When an Iceberg table has position delete files we need to do an
ANTI JOIN between data files and delete files. From the data files
we need to query the virtual columns INPUT__FILE__NAME and
FILE__POSITION, while from the delete files we need the data columns
'file_path' and 'pos'. The latter data columns are not part of the
table schema, so we create a virtual table instance of
'IcebergPositionDeleteTable' that has a table schema corresponding
to the delete files ('file_path', 'pos').

This patch introduces a new class 'IcebergScanPlanner' which has
the responsibility of doing a plan for Iceberg table scans. It creates
the aforementioned ANTI JOIN. Also, if there are data files without
corresponding delete files, we can have a separate SCAN node and its
results would be UNIONed to the rows coming from the ANTI JOIN:

              UNION
             /     \
    SCAN data       ANTI JOIN
                     /      \
              SCAN data    SCAN deletes

Some refactorings in the context of this CR:
Predicate pushdown and time travel logic is transferred from
IcebergScanNode to IcebergScanPlanner. Iceberg snapshot summary
retrieval is moved from FeFsTable to FeIcebergTable.

Testing:
 * added planner test
 * added e2e tests

TODO in follow-up Jiras:
 * better cardinality estimates (IMPALA-11516)
 * support unrelative collection columns (select item from t.int_array)
   (IMPALA-11517)
   Currently such queries return error during analysis

Change-Id: I672cfee18d8e131772d90378d5b12ad4d0f7dd48
Reviewed-on: http://gerrit.cloudera.org:8080/18847
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-01 16:51:17 +00:00
Gabor Kaszab
fec7a79c50 IMPALA-11529: FILE__POSITION virtual column for ORC tables
IMPALA-11350 implemented the FILE__POSITION virtual column for Parquet
files. This ticket does the same but for ORC files. Note that for full
ACID ORC tables there has already been an implementation of row__id
that could simply be re-used for this ticket.
Testing:
 - TestScannersVirtualColumns.test_virtual_column_file_position_generic
   is changed to now run on ORC as well. I don't think further testing
   is required, as this functionality has already been there for
   row__id; we just re-used it for FILE__POSITION.

Change-Id: Ie8e951f73ceb910d64cd149192853a4a2131f79b
Reviewed-on: http://gerrit.cloudera.org:8080/18909
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-30 16:45:24 +00:00
Tamas Mate
423b087762 IMPALA-11520: Remove functional.unsupported_types misc test
IMPALA-9482 added support for the remaining Hive types and removed the
functional.unsupported_types table. There was a reference remaining in a
misc test. test_misc is not marked as exhaustive, but it only runs in
exhaustive builds.

Change-Id: I65b6ea5ac742fbcc427ad41741d347558cb7d110
Reviewed-on: http://gerrit.cloudera.org:8080/18896
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-25 16:24:41 +00:00
Tamas Mate
40b9b9cd75 IMPALA-11521: Fix test_binary_type
Fix a test_binary_type typo; it should not reference an HBase table.

Change-Id: Id41049094b632af6326f6ee9f3886577d1fc5ee6
Reviewed-on: http://gerrit.cloudera.org:8080/18895
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-25 15:12:38 +00:00
LPL
d7ecc11149 IMPALA-11500: Fix Impalad crash in ParquetBoolDecoder::SkipValues when num_values is 0
Fix an Impalad crash in the method ParquetBoolDecoder::SkipValues when
the parameter 'num_values' is 0. The function should tolerate
'num_values' being 0.

Testing:
 - Add e2e tests

Change-Id: I8c4c5a4dff9e9e75913c7b524b4ae70967febb37
Reviewed-on: http://gerrit.cloudera.org:8080/18854
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 19:11:58 +00:00
Csaba Ringhofer
7ca11dfc7f IMPALA-9482: Support for BINARY columns
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows adding special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can only be used by casting to STRING first.
Adding support for more of these functions is very easy: the BINARY
type simply has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY they always work as if utf8_mode=0 (for
example, length() is mapped to bytes(), as length() counts UTF-8
characters if utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in the case of legacy Hive UDFs it is only supported if the
argument and return types are set explicitly, to ensure backward
compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
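
For illustration (table and column names are hypothetical):

-- Explicit casts are required between STRING and BINARY:
INSERT INTO binary_tbl VALUES (CAST('abc' AS BINARY));
SELECT length(bin_col), CAST(bin_col AS STRING) FROM binary_tbl;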

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added for the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 13:55:42 +00:00
Zoltan Borok-Nagy
522ee1fcc0 IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables
Virtual column FILE__POSITION returns the ordinal position of the row
in the data file. It will be useful to add support for Iceberg's
position-based delete files.

This patch only adds FILE__POSITION to Parquet tables. It works
similarly to the handling of collection position slots. I.e. we
add the responsibility of dealing with the file position slot to
an existing column reader. Because of page-filtering and late
materialization we already tracked the file position in member
'current_row_' during scanning.

Querying the FILE__POSITION in other file formats raises an error.
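
For illustration (table name hypothetical):

SELECT file__position, id FROM parquet_tbl;
-- file__position is the row's ordinal position within its data file.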

Testing:
 * added e2e tests

Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Reviewed-on: http://gerrit.cloudera.org:8080/18704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-12 19:21:55 +00:00
LPL
0e3e4b57a1 IMPALA-11408: Fill missing partition columns when INSERT INTO iceberg_tbl (col_list)
In the case of INSERT INTO iceberg_tbl (col_a, col_b, ...), if the
partition columns of the Iceberg table are not in the column
permutation, we fill the missing partition columns with NullLiteral so
that the data is written to the default partition
'__HIVE_DEFAULT_PARTITION__'.
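
For illustration (schema hypothetical; 'p' is a partition column
omitted from the column list):

INSERT INTO ice_tbl (i, s) VALUES (1, 'x');
-- 'p' is filled with NULL; the row lands in __HIVE_DEFAULT_PARTITION__.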

Testing:
 - add e2e tests

Change-Id: I40c733755d65e5c81a12ffe09b6d16ed5d115368
Reviewed-on: http://gerrit.cloudera.org:8080/18790
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-10 14:34:04 +00:00
Riza Suminto
05c3a8e09c IMPALA-11465, IMPALA-11466: Bump CDP_BUILD_NUMBER to 30010248
This patch bumps up CDP_BUILD_NUMBER to pick up Hive version
3.1.3000.7.2.16.0-127, which contains:
- the thrift-0.16.0 upgrade from HIVE-25635.
- a backport of ORC-517.
This patch also contains fix for IMPALA-11466 by adding jetty-server as
an allowed dependency.

Testing:
- Built locally and confirmed that the CDP components are downloaded.

Change-Id: Iff5297a48865fb2444e8ef7b9881536dc1bbf63c
Reviewed-on: http://gerrit.cloudera.org:8080/18803
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2022-08-09 06:45:52 +00:00
Tamas Mate
bb5ec85134 IMPALA-10453: Support file pruning via runtime filters on Iceberg
Iceberg tables store partition information in manifest files and not in
the file path. This metadata has already been pushed down to the
scanners and this commit uses this metadata to evaluate runtime filters
on Iceberg files.

Performance measurement:
Used TPC-DS Q10 [1] with a scale of 10 to measure the query performance.
Min/Max filters were disabled and the wait time for runtime filters was
increased to 5 seconds. After pre-warming the Catalog I executed Q10 5
times on my local machine. The fastest execution times were:
Baseline Parquet tables: 1.08s
Baseline Iceberg tables without this patch: 1.43s
Iceberg tables with this patch: 1.09s

Testing:
  * Added e2e tests.
  * Initial performance test with TPC-DS Q10.

Ref:
[1] TPC-DS Q10:
select cd_gender, cd_marital_status, cd_education_status, count(*) cnt1,
  cd_purchase_estimate, count(*) cnt2, cd_credit_rating, count(*) cnt3,
  cd_dep_count, count(*) cnt4, cd_dep_employed_count, count(*) cnt5,
  cd_dep_college_count, count(*) cnt6
 from customer c, customer_address ca, customer_demographics
 where c.c_current_addr_sk = ca.ca_address_sk and
  ca_county in ('Walker County','Richland County','Gaines County',
  'Douglas County','Dona Ana County') and
  cd_demo_sk = c.c_current_cdemo_sk and
  exists (select *
          from store_sales, date_dim
          where c.c_customer_sk = ss_customer_sk and
                ss_sold_date_sk = d_date_sk and
                d_year = 2002 and
                d_moy between 4 and 4+3) and
   exists (select *
          from (select ws_bill_customer_sk as customer_sk, d_year,d_moy
             from web_sales, date_dim where ws_sold_date_sk = d_date_sk
              and d_year = 2002 and
                  d_moy between 4 and 4+3
             union all
             select cs_ship_customer_sk as customer_sk, d_year, d_moy
             from catalog_sales, date_dim
             where cs_sold_date_sk = d_date_sk and d_year = 2002 and
                  d_moy between 4 and 4+3
          ) x
            where c.c_customer_sk = customer_sk)
 group by cd_gender, cd_marital_status, cd_education_status,
  cd_purchase_estimate, cd_credit_rating, cd_dep_count,
  cd_dep_employed_count, cd_dep_college_count
 order by cd_gender, cd_marital_status, cd_education_status,
  cd_purchase_estimate, cd_credit_rating, cd_dep_count,
  cd_dep_employed_count, cd_dep_college_count
limit 100;

Change-Id: I7762e1238bdf236b85d2728881a402a2bb41f36a
Reviewed-on: http://gerrit.cloudera.org:8080/18531
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-04 20:18:04 +00:00
Tamas Mate
c0b0875bda IMPALA-11378: Allow INSERT OVERWRITE for bucket tranforms in some cases
This change has been considered only for Iceberg tables mainly for table
maintenance reasons. Iceberg table writes create new snapshots and these
can accumulate over time. This commit allows a simple form of compaction
of these snapshots.

INSERT OVERWRITE has been blocked when partition evolution is in
place, because it would be possible to overwrite a data file with a
newer schema that has fewer columns. This could cause unexpected data
loss.

For bucketed tables, the following syntax is allowed to be executed:
  INSERT OVERWRITE ice_tbl SELECT * FROM ice_tbl;
The source and target tables have to be the same and specified, and
only SELECT * queries are allowed. These requirements are also in place
to avoid unexpected data loss.
 - VALUES clauses are not allowed, because inserting a single record
   could overwrite a whole file in a bucket.
 - Only a source table is allowed, because at the time of the insert it
   is unknown which files will be modified, similar to VALUES.

Testing:
 - Added e2e tests.

Change-Id: Ibd1bc19d839297246eadeb754cdeeec1e306098a
Reviewed-on: http://gerrit.cloudera.org:8080/18649
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-01 13:36:51 +00:00
Gergely Fürnstáhl
cbd3fab8c4 IMPALA-11268: Allow STORED BY and STORED AS as well
Extended the parser to support the 'STORED BY' keyword for storage
engines, namely Kudu and Iceberg at the moment, to match Hive's
syntax. Furthermore, this lays the groundwork for separating
storage engines from file formats, to be able to specify both with
"STORED BY ... STORED AS ...".

Testing:

Added front-end Parser and Analyzer tests and query tests for table
creation.

Change-Id: Ib677bea8e582bbc01c5fb8c81df57eb60b0ed961
Reviewed-on: http://gerrit.cloudera.org:8080/18743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-26 18:21:48 +00:00
LPL
614dc54dcc IMPALA-11446: Push down NOT_IN predicate to Iceberg
Because the column value bounds in the Iceberg metadata are not
necessarily exact min or max values, NOT_IN generally cannot be
answered using them: for NOT_IN(col, {X, ...}), bounds of (X, Y) don't
guarantee that X is an actual value in col. But the push-down does work
when the column is a partition column, where it's still very helpful.
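
For illustration (names are hypothetical):

-- part_col is a partition column, so the NOT IN predicate can be
-- pushed down to Iceberg to prune whole partitions:
SELECT * FROM ice_tbl WHERE part_col NOT IN ('a', 'b');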

Testing:
  - add e2e tests

Change-Id: Ib8bdaf6f31a4438e11c4eb27485bb413fe6df9a3
Reviewed-on: http://gerrit.cloudera.org:8080/18760
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-26 18:11:25 +00:00
Csaba Ringhofer
efc303b71a IMPALA-11434: Fix analysis of multiple more than 1d arrays in select list
Arrays of more than one dimension in the select list tried to register
a CollectionTableRef with the name "item" for the inner arrays, leading
to a name collision if there was more than one such array.

The logic is changed to always use the full path as implicit alias
in CollectionTableRefs backing arrays in select list.
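
For illustration, a query shape that used to fail (table name
hypothetical):

-- Both 2d arrays used to register their inner array as "item",
-- colliding with each other:
SELECT arr_a, arr_b FROM tbl_with_2d_arrays;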

As a side effect this leads to using the fully qualified names
in expressions in the explain plans of queries that use arrays
from views. This is not an intended change, but I don't consider
it to be critical. Created IMPALA-11452 to deal with more
sophisticated alias handling in collections.

Testing:
- added a new table to testdata and a regression test

Change-Id: I6f2b6cad51fa25a6f6932420eccf1b0a964d5e4e
Reviewed-on: http://gerrit.cloudera.org:8080/18734
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-22 22:59:19 +00:00
wzhou-code
02043744be IMPALA-11445: Fix bug in firing insert event of partitions located in different FS
When adding a partition whose location is in a file system different
from the file system of the table location, Impala accepts it. But
when inserting values into the table, catalogd threw an exception.

This patch fixes the issue by using the right FileSystem object.
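
For illustration (paths and schema are hypothetical):

-- The table lives on HDFS; the partition is added on S3:
ALTER TABLE tbl ADD PARTITION (p=1) LOCATION 's3a://bucket/tbl/p=1';
-- This used to make catalogd throw when firing the insert event:
INSERT INTO tbl PARTITION (p=1) VALUES (42);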

Testing:
 - Added new test case with partitions on different file systems.
   Ran the test on S3.
 - Did manual tests in cluster with partitions on HDFS and Ozone.
 - Passed core test.

Change-Id: I0491ee1bf40c3d5240f9124cef3f3169c44a8267
Reviewed-on: http://gerrit.cloudera.org:8080/18759
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-22 04:23:30 +00:00
Csaba Ringhofer
06e8e7bba7 IMPALA-886: Support displaying HBase cols in the order from HMS
Before this patch catalogd always ordered HBase columns
lexicographically by family/qualifier. This is incompatible with other
table formats and the way Hive handles HBase tables, where the order
comes from HMS as defined during CREATE TABLE.

I don't know of any valid reason behind this old behavior; it probably
just made the implementation a bit easier by doing the ordering in the
FE instead of the BE. The BE actually needs this ordering during
scanning, as the HBase API returns results in this order, but this
should have no effect on other parts of Impala.

Added flag use_hms_column_order_for_hbase_tables (used by catalogd)
to decide whether to do this reordering:
- true: keep HMS order
- false: reorder by family/qualifier [default]

The old way is kept as default to avoid breaking existing workloads,
but it would make sense to change it in the next major release.

Note that a query option would be more convenient to use, but it
would be much harder to implement it as the order is decided during
loading in catalogd.

Testing:
- added custom cluster test for
  use_hms_column_order_for_hbase_tables = true

Change-Id: Ibc5df8b803f2ae3b93951765326cdaea706e3563
Reviewed-on: http://gerrit.cloudera.org:8080/18635
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-18 19:02:34 +00:00
LPL
e1fea843ea IMPALA-11433: Remove misleading bucketing info from DESCRIBE FORMATTED output for Iceberg tables
The DESCRIBE FORMATTED output shows this even for bucketed Iceberg
tables:

| Num Buckets:	    | 0	    | NULL |
| Bucket Columns:   | []    | NULL |

We should remove them, and the user should rely on the information in
the '# Partition Transform Information' block instead.

Testing:
 - add e2e tests
 - tested in a real cluster

Change-Id: Idc156c932780f0f12c935a1a60ff6606d59bb1da
Reviewed-on: http://gerrit.cloudera.org:8080/18735
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-16 17:11:24 +00:00
Kurt Deschler
7764830216 IMPALA-11430: Support custom hash schema for Kudu range tables
KUDU-2671 added support for custom hash partition specification at the
range partition level. This patch adds CREATE TABLE and ALTER TABLE
syntax to allow Kudu custom hash schema to be specified through Impala.
In addition, a new SHOW HASH SCHEMA statement has been added to allow
display of the hash schema information for each partition.

HASH syntax within a partition is similar to the table-level syntax
except that HASH clauses must follow the PARTITION clause and commas are
not allowed within a partition. These differences were required to keep
the grammar unambiguous and due to limitations of the Java Cup Parser.
To make the grammar more consistent, commas in the table-level
partition spec and between PARTITION clauses are now optional but still
allowed for backward compatibility.

Example:

CREATE TABLE t1 (id int, c2 int, PRIMARY KEY(id, c2))
PARTITION BY HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4
RANGE (c2)
(
  PARTITION 0 <= VALUES < 10
  PARTITION 10 <= VALUES < 20
      HASH(id) PARTITIONS 2 HASH(c2) PARTITIONS 3
  PARTITION 20 <= VALUES < 30
)
STORED AS KUDU;
ALTER TABLE t1 ADD RANGE PARTITION 30 <= VALUES < 40
      HASH(id) PARTITIONS 3 HASH(c2) PARTITIONS 4;

This bumps the toolchain to Kudu githash 43ee785b2d to get
the needed Kudu-side changes.

Testing:
Tests added to kudu_partition_ddl.test

Change-Id: I981056e0827f4957580706d6e73742e4e6743c1c
Reviewed-on: http://gerrit.cloudera.org:8080/18676
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-07-16 06:06:46 +00:00
Gergely Fürnstáhl
8d034a2f7c IMPALA-11034: Resolve schema of old data files in migrated Iceberg tables
When external tables are converted to Iceberg, the data files remain
intact, thus missing field IDs. Previously, Impala used name based
column resolution in this case.

Added a feature to traverse the data files before column resolution
and assign field IDs the same way as Iceberg would, to be able to use
field-ID-based column resolution.

Testing:

The default resolution method was changed to field ID for migrated
tables; existing tests use that from now on.

Added new tests to cover edge cases with complex types and schema
evolution.

Change-Id: I77570bbfc2fcc60c2756812d7210110e8cc11ccc
Reviewed-on: http://gerrit.cloudera.org:8080/18639
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-14 13:06:04 +00:00
Zoltan Borok-Nagy
26438d8e3e IMPALA-11414: Off-by-one error in Parquet late materialization
With PARQUET_LATE_MATERIALIZATION we can set the minimum number of
consecutive rows that, if filtered out, let us avoid materializing the
corresponding rows in the other Parquet columns.

E.g. if PARQUET_LATE_MATERIALIZATION is 10, and in a filtered column we
find at least 10 consecutive rows that don't pass the predicates, we
avoid materializing the corresponding rows in the other columns.

But due to an off-by-one error we actually only needed
(PARQUET_LATE_MATERIALIZATION - 1) consecutive elements. This means
that if we set PARQUET_LATE_MATERIALIZATION to one, we need zero
consecutive filtered-out elements, which leads to a crash/DCHECK. The
bug is in the GetMicroBatches() algorithm, where we produce the micro
batches based on the selected rows.

Setting PARQUET_LATE_MATERIALIZATION to 0 doesn't make sense so it
shouldn't be allowed.
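
For illustration:

SET PARQUET_LATE_MATERIALIZATION=1;  -- valid now; used to hit the DCHECK
SET PARQUET_LATE_MATERIALIZATION=0;  -- rejected as invalid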

Testing
 * e2e test with PARQUET_LATE_MATERIALIZATION=1
 * e2e test for checking SET PARQUET_LATE_MATERIALIZATION=N

Change-Id: I38f95ad48c4ac8c1e06651565ab5c496283b29fa
Reviewed-on: http://gerrit.cloudera.org:8080/18700
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-11 19:25:02 +00:00
LPL
78e45a7cea IMPALA-11287 (part 2): Implement CREATE TABLE LIKE for Iceberg tables
This commit implements cloning between Iceberg tables. Cloning Iceberg
tables from other types of tables is not implemented, because the data
types of Iceberg and Impala do not correspond one-to-one.
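
For illustration (table names hypothetical):

CREATE TABLE ice_clone LIKE ice_src;
-- Both tables are Iceberg; cloning from non-Iceberg tables is rejected.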

Testing:
 - e2e tests

Change-Id: I1284b926f51158e221277b18b2e73707e29f86ac
Reviewed-on: http://gerrit.cloudera.org:8080/18658
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2022-07-08 13:52:26 +00:00
gaoxq
4174ff5ea6 IMPALA-11320: SHOW PARTITIONS on Iceberg table doesn't list the partitions
Currently, SHOW PARTITIONS on Iceberg tables only outputs the partition
spec, which is not too useful.

Instead it should output the concrete partitions and the number of
files and number of rows in each partition. E.g.:

SHOW PARTITIONS ice_ctas_hadoop_tables_part;

'{"d_month":"613"}',4,2
'{"d_month":"614"}',3,1
'{"d_month":"615"}',2,1

Testing:
 - Added end-to-end test

Change-Id: I3b4399ae924dadb89875735b12a2f92453b6754c
Reviewed-on: http://gerrit.cloudera.org:8080/18641
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-05 20:32:17 +00:00
LPL
9c96855146 IMPALA-11368: Iceberg time-travel error message should show timestamp in local timezone
In the FOR SYSTEM_TIME AS OF clause we expect timestamps in the local
timezone, while the error message shows the timestamp in the UTC
timezone. The error message should show the timestamp in the local
timezone.
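
For illustration (table name and timestamp are hypothetical):

SELECT * FROM ice_tbl FOR SYSTEM_TIME AS OF '2022-01-01 10:00:00';
-- If no snapshot exists before this time, the error now reports the
-- timestamp in the local timezone.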

Testing:
 - Add e2e test

Change-Id: Iba5d5eb65133f11cc4eb2fc15a19f7b25c14cc46
Reviewed-on: http://gerrit.cloudera.org:8080/18675
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-30 18:31:18 +00:00
LPL
f38c53235f IMPALA-11279: Optimize plain count(*) queries for Iceberg tables
This commit optimizes plain count(*) queries for Iceberg tables.
When the `org.apache.iceberg.SnapshotSummary#TOTAL_RECORDS_PROP` can be
retrieved from the current `org.apache.iceberg.BaseSnapshot#summary` of
the Iceberg table, this kind of query can be very fast. If this
property cannot be retrieved, the query aggregates the `num_rows` of
the Parquet `file_metadata_` as usual.

Queries that can be optimized need to meet the following requirements:
 - SelectStmt does not have WHERE clause
 - SelectStmt does not have GROUP BY clause
 - SelectStmt does not have HAVING clause
 - The TableRefs of FROM clause contains only one BaseTableRef
 - Only for the Iceberg table
 - SelectList must contain 'count(*)' or 'count(constant)'
 - SelectList can contain other agg functions, e.g. min, sum, etc
 - SelectList can contain constant
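
For illustration, queries of this shape can be answered from the
snapshot summary (table name hypothetical):

SELECT count(*) FROM ice_tbl;
SELECT count(*), min(i), 42 FROM ice_tbl;  -- other aggs and constants allowed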

Testing:
 - Added end-to-end test
 - Existing tests
 - Test it in a real cluster

Change-Id: I8e9c48bbba7ab2320fa80915e7001ce54f1ef6d9
Reviewed-on: http://gerrit.cloudera.org:8080/18574
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-27 15:29:11 +00:00
Gabor Kaszab
5d021ce5a7 IMPALA-9496: Allow struct type in the select list for Parquet tables
This patch extends the support of Struct columns in the select list to
Parquet files as well.

There are some limitations with this patch:
  - Dictionary filtering could work when we have conjuncts on a member
    of a struct, however, if this struct is given in the select list
    then the dictionary filtering is disabled. The reason is that in
    this case there would be a mismatch between the slot/tuple IDs in
    the conjunct between the ones in the select list due to expr
    substitution logic when a struct is in the select list. Solving
    this puzzle would be a nice future performance enhancement. See
    IMPALA-11361.
  - When structs are read in a batched manner it delegates the actual
    reading of the data to the column readers of its children, however,
    would use the simple ReadValue() on these readers instead of the
    batched version. The reason is that calling the batched reader in
    the member column readers would in fact read in batches, but it
    won't handle the case when the parent struct is NULL and would set
    only itself to NULL but not the parent struct. This might also be a
    future performance enhancement. See IMPALA-11363.
  - If there is a struct in the select list then late materialization
    is turned off. The reason is that LM expects the column readers to
    be used through the batched reading interface, however, as said in
    the above bulletpoint currently struct column readers use the
    non-batched reading interface of its children. As a result after
    reading the column readers are not in a state as SkipRows() of LM
    expects and then results in a query failure because it's not able
    to skip the rows for non-filter readers.
    Once IMPALA-11363 is implemented and the struct will also use the
    ReadValueBatch() interface of its children then late
    materialization could be turned on even if structs are in the
    select list. See IMPALA-11364.

Testing:
  - There were already a lot of tests exercising this functionality,
    but they were only run on ORC tables. I changed these to cover
    Parquet tables too.

Change-Id: I3e8b4cbc2c4d1dd5fbefb7c87dad8d4e6ac2f452
Reviewed-on: http://gerrit.cloudera.org:8080/18596
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-22 17:55:07 +00:00
LPL
75ccdc6aec IMPALA-11367: Fix some formatting bugs in DESCRIBE FORMATTED for Iceberg tables
Fix some formatting bugs in the DESCRIBE FORMATTED output for Iceberg
tables:
 - 'LINE_DELIM' is missing from '# Partition Transform Information'
 - The partition transform columns header should be
   'col_name,transform_type,NULL'

Testing:
 - Existing tests

Change-Id: I991644cefb34decc843a5542b47eaec11d7b6e42
Reviewed-on: http://gerrit.cloudera.org:8080/18634
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-20 14:06:56 +00:00