Commit Graph

1688 Commits

Author SHA1 Message Date
Zoltan Borok-Nagy
ca5de24d6a IMPALA-12153: Parquet STRUCT reader should fill position slots
Before this patch the Parquet STRUCT reader didn't fill the
position slots: collection position and file position. When users
queried these virtual columns, Impala crashed or returned
incorrect results.
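
For example (an illustrative sketch; the table and column names are
hypothetical), queries like the following exercise these position
slots:

  SELECT t.file__position, a.pos, a.item
  FROM tbl t, t.struct_col.int_arr a;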

The ORC scanner already worked correctly, but there were no tests
written for it.

Test:
 * e2e tests for both ORC / Parquet

Change-Id: I32a808a11f4543cd404ed9f3958e9b4e971ca1f4
Reviewed-on: http://gerrit.cloudera.org:8080/19911
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-24 01:18:56 +00:00
Daniel Becker
8785270451 IMPALA-12147: Allow collections of fixed length types as non-passthrough children of unions
IMPALA-12019 implemented support for collections of fixed length types
in the sorting tuple. This was made possible by implementing the
materialisation of these collections.

Building on this, this change allows such collections as non-passthrough
children of UNION ALL operations. Note that plain UNIONs are not
supported for any collections for other reasons and this patch does not
affect them or any other set operation.
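
As an illustration (hypothetical tables; whether a child is
passthrough depends on the plan), a UNION ALL over collections of
fixed length types like the following is now accepted:

  SELECT id, int_arr FROM tbl1
  UNION ALL
  SELECT id + 1, int_arr FROM tbl2;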

Testing:
Tests in nested-array-in-select-list.test and
nested-map-in-select-list.test check that
 - the newly allowed cases work correctly and
 - the correct error message is given for collections of variable length
   types.

Change-Id: I14c13323d587e5eb8a2617ecaab831c059a0fae3
Reviewed-on: http://gerrit.cloudera.org:8080/19903
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-19 22:11:03 +00:00
Daniel Becker
ff3d0c7984 IMPALA-12019: Support ORDER BY for arrays of fixed length types in select list
As a first stage of IMPALA-10939, this change implements support for
including in the sorting tuple top-level collections that only contain
fixed length types (including fixed length structs). For these types the
implementation is almost the same as the existing handling of strings.
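
For example (illustrative; using the functional test schema), an array
of a fixed length type can now appear in the select list of a sorting
query:

  SELECT id, int_array FROM complextypestbl ORDER BY id;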

Another limitation is that structs that contain any type of collection
are not yet allowed in the sorting tuple.

Also refactored the RawValue::Write*() functions to have a clearer
interface.

Testing:
 - Added a new test table that contains many rows with arrays. This is
   queried in a new test added in test_sort.py, to ensure that we handle
   spilling correctly.
 - Added tests that have arrays and/or maps in the sorting tuple in
   test_queries.py::TestQueries::{test_sort,
       test_top_n,test_partitioned_top_n}.

Change-Id: Ic7974ef392c1412e8c60231e3420367bd189677a
Reviewed-on: http://gerrit.cloudera.org:8080/19660
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-18 09:56:55 +00:00
Csaba Ringhofer
4261225f65 IMPALA-6433: Add read support for PageHeaderV2
Parquet v2 brings several changes in Parquet files compared to v1:

1. file version = 2 instead of 1

c185faf0c4/src/main/thrift/parquet.thrift (L1016)
Before this patch Impala rejected Parquet files with version!=1.

2. possible use of DataPageHeaderV2 instead of DataPageHeader

c185faf0c4/src/main/thrift/parquet.thrift (L561)

The main differences compared to V1 DataPageHeader:
a. rep/def levels are not compressed, so the compressed part contains
   only the actual encoded values
b. rep/def levels must be RLE encoded (Impala only supports RLE encoded
   levels even for V1 pages)
c. compression can be turned on/off per page (member is_compressed)
d. number of nulls (member num_nulls) is required - in v1 it was
   included in statistics which is optional
e. number of rows is required (member num_rows) which can help with
   matching collection items with the top level collection

The patch adds support for understanding v2 data pages but does not
implement some potential optimizations:

a. would allow an optimization for queries that need only the nullness
of a column but not the actual value: as the values are not needed the
decompression of the page data can be skipped. This optimization is not
implemented - currently Impala materializes both the null bit and the
value for all columns regardless of whether the value is actually
needed.

d. could also be used for optimizations / additional validity checks
but it is not used currently

e. could make skipping rows easier but is not used, as the existing
scanner has to be able to skip rows efficiently in v1 files as well,
so it can't rely on num_rows

3. possible use of new encodings (e.g. DELTA_BINARY_PACKED)

No new encoding is added - when an unsupported encoding is encountered
Impala returns an error.

parquet-mr uses new encodings (DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY)
for most types if the file version is 2, so with this patch Impala is
not yet able to read all v2 Parquet tables written by Hive.

4. Encoding PLAIN_DICTIONARY is deprecated and RLE_DICTIONARY is used
instead. The semantics of the two encodings are exactly the same.

Additional changes:
Some responsibilities are moved from ParquetColumnReader to
ParquetColumnChunkReader:
- ParquetColumnChunkReader decodes rep/def level sizes to hide v1/v2
  differences (see 2.a.)
- ParquetColumnChunkReader skips empty data pages in
  ReadNextDataPageHeader
- the state machine of ParquetColumnChunkReader is simplified by
  separating data page header reading from reading the rest of the page

Testing:
- added 4 v2 Parquet test tables (written by Hive) to cover
  compressed / uncompressed and scalar/complex cases
- added EE and fuzz tests for the test tables above
- manually tested v2 Parquet files written by pyarrow
- ran core tests

Note that no test is added where some pages are compressed while
some are not. It would be tricky to create such files with existing
writers. The code should handle this case and it is very unlikely that
files like this will be encountered.

Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca
Reviewed-on: http://gerrit.cloudera.org:8080/19793
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-12 18:31:03 +00:00
Riza Suminto
bd4a817893 IMPALA-11123: Restore NumFileMetadataRead counter
The NumFileMetadataRead counter was lost with the revert of commit
f932d78ad0. This patch restores the
NumFileMetadataRead counter and the assertions in the impacted Iceberg
test files. Other impacted test files will be gradually restored with
the reimplementation of optimized count star for ORC.

Testing:
- Pass core tests.

Change-Id: Ib14576245d978a127f688e265cab2f4ff519600c
Reviewed-on: http://gerrit.cloudera.org:8080/19854
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-09 00:05:36 +00:00
Riza Suminto
7ca20b3c94 Revert "IMPALA-11123: Optimize count(star) for ORC scans"
This reverts commit f932d78ad0.

The commit is reverted because it caused a significant regression for
non-optimized count star queries in Parquet format.

There were several conflicts that needed to be resolved manually:
- Removed the assertion against the 'NumFileMetadataRead' counter that
  is lost with the revert.
- Adjusted the assertions in test_plain_count_star_optimization,
  test_in_predicate_push_down, and test_partitioned_insert of
  test_iceberg.py due to the missing improvement in the Parquet
  optimized count star code path.
- Kept the "override" specifier in hdfs-parquet-scanner.h to pass
  clang-tidy.
- Kept the python3 style of RuntimeError instantiation in
  test_file_parser.py to pass check-python-syntax.sh.

Change-Id: Iefd8fd0838638f9db146f7b706e541fe2aaf01c1
Reviewed-on: http://gerrit.cloudera.org:8080/19843
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-05-06 22:55:05 +00:00
Riza Suminto
69da2ff86e IMPALA-12106: Fix overparallelization of Union fragment by 1
IMPALA-10973 has a bug where a union fragment without a scan node can be
over-parallelized by the backend scheduler by 1. It is reproducible by
running TPC-DS Q11 with MT_DOP=1. This patch additionally checks that
such a fragment does not have an input fragment before randomizing the
host assignment.

Testing:
Add TPC-DS Q11 to test_mt_dop.py::TestMtDopScheduling::test_scheduling
and verify the number of fragment instances scheduled in the
ExecSummary.

Change-Id: Ic69e7c8c0cadb4b07ee398aff362fbc6513eb08d
Reviewed-on: http://gerrit.cloudera.org:8080/19816
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-03 03:16:29 +00:00
wzhou-code
296224a6fb IMPALA-12110: Create Kudu table in CTAS without specifying primary key
IMPALA-11809 added support for non unique primary keys for Kudu
tables. It allows creating a Kudu table without specifying a primary
key, since partition columns can be promoted to a non unique primary
key. But when creating a Kudu table in CTAS without specifying a
primary key, Impala returned a parsing error.

This patch fixes the parsing issue for creating a Kudu table in CTAS
without specifying a primary key.
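
For example, a CTAS like the following (patterned after the
IMPALA-11809 examples) now parses, with the partition column promoted
to a non unique primary key:

  CREATE TABLE ctas_kudu
  PARTITION BY HASH (id) PARTITIONS 3
  STORED AS KUDU
  AS SELECT id, string_col FROM functional.alltypes;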

Testing:
 - Added new test cases in parsing unit-test and end-to-end unit-test.
 - Passed core tests.

Change-Id: Ia7bb0cf1954e0a4c3d864a800e929a88de272dd5
Reviewed-on: http://gerrit.cloudera.org:8080/19825
Reviewed-by: Abhishek Chennaka <achennaka@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-02 02:47:46 +00:00
LPL
8b6f9273ce IMPALA-12097: WITH CLAUSE should be skipped when optimizing COUNT(*) query on Iceberg table
When optimizing a simple count star query on an Iceberg table, the
WITH clause should be skipped. That doesn't mean such SQL can't be
optimized: once the WITH clause is inlined, the final statement is
still optimized by the CountStarToConstRule.
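
For example (an illustrative sketch with a hypothetical table):

  WITH v AS (SELECT * FROM iceberg_tbl)
  SELECT count(*) FROM v;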

Testing:
 * Add e2e tests

Change-Id: I7b21cbea79be77f2ea8490bd7f7b2f62063eb0e4
Reviewed-on: http://gerrit.cloudera.org:8080/19811
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-29 02:35:45 +00:00
hexianqing
4f1d8d4d39 IMPALA-11536: fix invalid predicates propagate for outer join simplification
When ENABLE_OUTER_JOIN_TO_INNER_TRANSFORMATION is set to true, the
planner simplifies outer joins if the WHERE clause contains at least
one null-rejecting condition, and then removes the outer-joined tuple
id from the map GlobalState#outerJoinedTupleIds.
However, there may be false removals for right join simplification or
full join simplification. This may lead to incorrect results, since it
is incorrect to propagate a non null-rejecting predicate into a plan
subtree that is on the nullable side of an outer join.
GlobalState#outerJoinedTupleIds indicates whether a table is on the
nullable side of an outer join.

E.g.
SELECT COUNT(*)
FROM functional.nullrows t1
  FULL JOIN functional.nullrows t2 ON t1.id = t2.id
  FULL JOIN functional.nullrows t3 ON coalesce(t1.id, t2.id) = t3.id
WHERE t1.group_str = 'a'
  AND coalesce(t2.group_str, 'f') = 'f'
The predicate coalesce(t2.group_str, 'f') = 'f' will propagate into t2
if we remove t2 from GlobalState#outerJoinedTupleIds.

Testing:
- Add new plan tests in outer-to-inner-joins.test
- Add new query tests to verify the correctness on transformation

Change-Id: I6565c5bff0d2f24f30118ba47a2583383e83fff7
Reviewed-on: http://gerrit.cloudera.org:8080/19116
Reviewed-by: Qifan Chen <qfchen@hotmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-22 23:24:10 +00:00
Gabor Kaszab
7e0feb4a8e IMPALA-11701 Part1: Don't push down predicates to scanner if already applied by Iceberg
We push down predicates to Iceberg, which uses them to filter out
files when getting the results of planFiles(). Using the
FileScanTask.residual() function we can find out whether we have to
use the predicates to further filter the rows of the given files, or
whether Iceberg has already performed all the filtering.
Basically, if we only filter on IDENTITY-partition columns, then
Iceberg can filter the files, and using these filters in Impala
wouldn't filter any more rows from the output (assuming that no
partition evolution was performed on the table).

An additional benefit of not pushing down no-op predicates to the
scanner is that we can potentially materialize fewer slots.
For example:

SELECT count(1) from iceberg_tbl where part_col = 10;

In the above query Iceberg filters the files using the predicate on
a partition column, so there is no need to materialize
'part_col' in Impala, nor to push down the 'part_col = 10' predicate.

Another additional benefit comes with count(*) queries. If all the
predicates are skipped from being pushed to Impala's scanner for a
count(*) query, then the Parquet scanner can go down an optimized path
where it uses stats instead of reading actual data to answer the query.

Note, this is an all-or-nothing approach: given N predicates, we
either push down all of them to the scanner or none of them. There is
room for improvement in identifying a subset of the predicates that we
still have to push down to the scanner. However, for this we'd need a
mapping between Impala predicates and the predicates returned by
Iceberg's FileScanTask.residual() function, which would significantly
increase the complexity of the relevant code.

Testing:
  - Some existing tests needed extra care as they were checking
    for predicates being pushed down to the scanner, but with this
    patch not all of them are pushed down. For these tests I added
    extra predicates so that all of the predicates are pushed down
    to the scanner.
  - Added a new planner test suite for checking how predicate push down
    works with Iceberg tables.

Change-Id: Icfa80ce469cecfcfbcd0dcb595a6b04b7027285b
Reviewed-on: http://gerrit.cloudera.org:8080/19534
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-21 15:22:17 +00:00
Daniel Becker
b73847f178 IMPALA-10851: Codegen for structs
IMPALA-9495 added support for struct types in SELECT lists but only with
codegen turned off. This commit implements codegen for struct types.

To facilitate this, code generation for reading and writing 'AnyVal's
has been refactored. A new class, 'CodegenAnyValReadWriteInfo' is
introduced. This class is an interface between sources and destinations,
one of which is an 'AnyVal' object: sources generate an instance of this
class and destinations take that instance and use it to write the value.

The other side can for example be tuples from which we read (in the case
of 'SlotRef') or tuples we write into (in case of materialisation, see
Tuple::CodegenMaterializeExprs()). The main advantage is that sources do
not have to know how to write their destinations, only how to read the
values (and vice versa).

Before this change, many tests that involve structs ran only with
codegen turned off. Now that codegen is supported in these cases, these
tests are also run with codegen on.

Testing:
  - enabled tests for structs in the select list with codegen on in
    tests/query_test/test_nested_types.py
  - enabled codegen in other tests where it used to be disabled because
    it was not supported.

Change-Id: I5272c3f095fd9f07877104ee03c8e43d0c4ec0b6
Reviewed-on: http://gerrit.cloudera.org:8080/18526
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-14 13:46:59 +00:00
Gabor Kaszab
826b113fd7 IMPALA-11954: Fix for URL encoded partition columns for Iceberg tables
There was a bug when an Iceberg table has a string partition column
and Impala inserts special characters into this column that need to be
URL encoded. In this case the partition name is URL encoded so as not
to confuse the file paths for that partition. E.g. the value 'b=1/2'
is converted to 'b=1%2F2'.
This is fine for path creation; however, for Iceberg tables
the same URL encoded partition name was saved into the catalog as the
partition name that is also used for Iceberg column stats. This leads
to incorrect results when querying the table, as the URL encoded
values are returned in a SELECT * query instead of what the user
inserted. Additionally, when adding a filter to the query, Iceberg
filters out all the rows because it compares the non-encoded values
to the URL encoded values.
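
For example (an illustrative sketch; 'ice_tbl' is partitioned by the
string column 'b'):

  INSERT INTO ice_tbl VALUES (1, '1/2');
  -- before the fix this returned '1%2F2' instead of '1/2':
  SELECT b FROM ice_tbl;
  -- and this filtered out all rows:
  SELECT * FROM ice_tbl WHERE b = '1/2';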

Testing:
  - Added new tests to iceberg-partitioned-insert.test to cover this
    scenario.
  - Re-run the existing test suite.

Change-Id: I67edc3d04738306fed0d4ebc5312f3d8d4f14254
Reviewed-on: http://gerrit.cloudera.org:8080/19654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-04-03 12:05:57 +00:00
Tamas Mate
3b3acd5c08 IMPALA-11908: Parser change for Iceberg metadata querying
This change extends parsing table references with Iceberg metadata
tables. The TableName class has been extended with an extra vTbl field
which is filled when a virtual table reference is suspected. This
additional field helps to keep the real table in the statement table
cache next to the virtual table, which should be loaded so Iceberg
metadata tables can be created.

Iceberg provides a rich API to query metadata; these Iceberg API
tables are accessible through the MetadataTableUtils class. Using
these table schemas it is possible to create an Impala table that can
be queried later on.

Querying a metadata table at this point is expected to throw a
NotImplementedException.
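
For example (an illustrative sketch; 'history' is one of the Iceberg
metadata tables exposed by MetadataTableUtils), a reference like the
following now parses, although executing it still throws
NotImplementedException:

  SELECT * FROM functional_parquet.iceberg_tbl.history;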

Testing:
 - Added E2E test to test it for some tables.

Change-Id: I0b5db884b5f3fecbd132fcb2c2cbd6c622ff965b
Reviewed-on: http://gerrit.cloudera.org:8080/19483
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-29 20:53:18 +00:00
Andrew Sherman
2c779939dc IMPALA-11509: Prevent queries hanging when Iceberg metadata is missing.
Traditionally table metadata is loaded by the catalog and sent as thrift
to the Impala daemons. With Iceberg tables, some metadata, for example
the org.apache.iceberg.Table, is loaded in the Coordinator at the same
time as the thrift description is being deserialized. If the loading of
the org.apache.iceberg.Table fails, perhaps because of missing Iceberg
metadata, then the loading of the table fails. This can cause an
infinite loop as StmtMetadataLoader.loadTables() waits hopefully for
the catalog to send a new version of the table.

Change some Iceberg table loading methods to throw
IcebergTableLoadingException when a failure occurs. Prevent the hang by
substituting in an IncompleteTable if an IcebergTableLoadingException
occurs.

The test test_drop_incomplete_table had previously been disabled because
of IMPALA-11509. To re-enable this required a second change. The way
that DROP TABLE is executed on an Iceberg table depends on which
Iceberg catalog is being used. If this Iceberg catalog is not a Hive
catalog then the execution happens in two parts: first the Iceberg
table is dropped, then the table is dropped in HMS. In this case, if
the drop fails in Iceberg, we should still continue on to perform the
drop in HMS.

TESTING

- Add a new test, originally developed for IMPALA-11330, which tests
  failures after deleting Iceberg metadata.
- Re-enable test_drop_incomplete_table().

Change-Id: I695559e21c510615918a51a4b5057bc616ee5421
Reviewed-on: http://gerrit.cloudera.org:8080/19509
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-27 21:42:49 +00:00
Zoltan Borok-Nagy
f2cb2c9ceb IMPALA-11964: Make sure Impala returns error for Iceberg tables with equality deletes
Impala only supports position deletes currently. It should raise an
error when equality deletes are encountered.

We already had a check for this when the query was planned by Iceberg.
But when we were using cached metadata the check was missing. This means
that Impala could return bogus results in the presence of equality
delete files. This patch adds a check for the latter case as well.

Tables with equality delete files are still loadable by Impala, and
users can still query snapshots of such tables if those snapshots
don't have equality deletes.
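
For example (an illustrative sketch, assuming snapshot 123 predates
the equality deletes):

  SELECT * FROM ice_tbl FOR SYSTEM_VERSION AS OF 123;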

Testing:
 * added e2e tests

Change-Id: I14d7116692c0e47d0799be650dc323811e2ee0fb
Reviewed-on: http://gerrit.cloudera.org:8080/19601
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-22 16:44:05 +00:00
Peter Rozsa
04f7e8c0f9 IMPALA-8054: Fix BetweenToCompound rewrite rule binary predicate creation
This patch fixes the BetweenToCompound rewrite rule's binary predicate
creation. When the BETWEEN expression gets split up, the same
reference to the first operand was assigned to both the upper and
lower binary predicates; in the case of differently typed second and
third operands, the first operand must be cloned to make type casting
unique for both binary predicates.
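
For example (an illustrative sketch; 'c' is an INT column while the
bounds have different types):

  SELECT * FROM t WHERE c BETWEEN 1 AND 2.5;
  -- rewritten by BetweenToCompound to: c >= 1 AND c <= 2.5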

Testing:
  - test cases added to exprs.test

Change-Id: Iaff4199f6d0875c38fa7e91033385c9290c57bf5
Reviewed-on: http://gerrit.cloudera.org:8080/19618
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-21 20:01:14 +00:00
Andrew Sherman
29586d6631 IMPALA-11482: Alter Table Execute Rollback for Iceberg tables.
Iceberg table modifications cause new table snapshots to be created;
these snapshots represent earlier versions of the table. The Iceberg
API provides a way to roll back the table to a previous snapshot.

This change adds the ability to execute a rollback on Iceberg tables
using the following statements:

- ALTER TABLE <tbl> EXECUTE ROLLBACK(<snapshot id>)
- ALTER TABLE <tbl> EXECUTE ROLLBACK('<timestamp>')

The latter form of the command rolls back to the most recent snapshot
that has a creation timestamp that is older than the specified
timestamp.

Note that when a table is rolled back to a snapshot, a new snapshot is
created with the same snapshot id, but with a new creation timestamp.

Testing:
 - Added analysis unit tests.
 - Added e2e tests.
 - Converted test_time_travel to use get_snapshots() from iceberg_util.
 - Add a utility class to allow pytests to create tables with various
   iceberg catalogs.

Change-Id: Ic74913d3b81103949ffb5eef7cc936303494f8b9
Reviewed-on: http://gerrit.cloudera.org:8080/19002
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-09 22:00:08 +00:00
Peter Rozsa
afe59f7f0d IMPALA-11854: ImpalaStringWritable's underlying array can't be changed in UDFs
This change fixes the behavior of BytesWritable's and TextWritable's
getBytes() method. Now the returned byte array can be handled as
the underlying buffer: it gets loaded before the UDF's evaluation
and tracks changes like a regular Java byte array; the resizing
operation still resets the reference. The operations that wrote back
to the native heap were also removed, as these operations are now
handled in the byte array. The ImpalaStringWritable class is also
removed; writables that used it before now store the data directly.

Tests:
 - Test UDFs added as BufferAlteringUdf and GenericBufferAlteringUdf
 - E2E test ran for UDFs

Change-Id: Ifb28bd0dce7b0482c7abe1f61f245691fcbfe212
Reviewed-on: http://gerrit.cloudera.org:8080/19507
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-08 19:54:38 +00:00
Daniel Becker
a28da34a21 IMPALA-9551: (Addendum) disable sorting if select list contains struct containing collection
Sorting is not supported if the select list contains collection columns
(see IMPALA-10939). IMPALA-9551 added support for mixed complex types
(collections in structs and structs in collections). However, the case
of having structs containing collections in the select list when sorting
was not handled explicitly. The query

  select id, struct_contains_arr from collection_struct_mix order by id;

resulted in
  ERROR: IllegalStateException: null

After this change, a meaningful error message is given (the same as in
the case of pure collection columns):

  ERROR: IllegalStateException: Sorting is not supported if the select
  list contains collection columns.

The check for collections in the sorting tuple was moved to an earlier
stage of analysis from SingleNodePlanner to QueryStmt, as otherwise we
would hit another precondition check first in the case of structs
containing collections.

Testing:
 - Added tests in mixed-collections-and-structs.test that test sorting
   when a struct in the select list contains an array and a map
   respectively.

Change-Id: I09ac27cba34ee7c6325a7e7895f3a3c9e1a088e5
Reviewed-on: http://gerrit.cloudera.org:8080/19597
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-07 19:36:58 +00:00
Daniel Becker
2d47306987 IMPALA-9551: Allow mixed complex types in select list
Currently collections and structs are supported in the select list, also
when they are nested (structs in structs and collections in
collections), but mixing different kinds of complex types, i.e. having
structs in collections or vice versa, is not supported.

This patch adds support for mixed complex types in the select list.

Limitation: zipping unnest is not supported for mixed complex types, for
example the following query:

  use functional_parquet;
  select unnest(struct_contains_nested_arr.arr) from
  collection_struct_mix;

Testing:
 - Created a new test table, 'collection_struct_mix', that contains
   mixed complex types.
 - Added tests in mixed-collections-and-structs.test that test having
   mixed complex types in the select list. These tests are called from
   test_nested_types.py::TestMixedCollectionsAndStructsInSelectList.
 - Ran existing tests that test collections and structs in the select
   list; test queries that expected a failure in case of mixed complex
   types have been moved to mixed-collections-and-structs.test and now
   expect success.

Change-Id: I476d98884b5fd192dfcd4feeec7947526aebe993
Reviewed-on: http://gerrit.cloudera.org:8080/19322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-07 13:22:33 +00:00
Csaba Ringhofer
67bb870aa3 IMPALA-11911: Fix NULL argument handling in Hive GenericUDFs
Before this patch, if an argument of a GenericUDF was NULL, then
Impala passed it as null instead of a DeferredObject. This was
incorrect, as a DeferredObject whose get() function returns null is
expected.
See the Jira for more details and GenericUDF examples in Hive.

TestGenericUdf's NULL handling was further broken in IMPALA-11549,
leading to throwing null pointer exceptions when the UDF's result is
NULL. This test bug was not detected, because Hive udf tests were
running with default abort_java_udf_on_exception=false, which means
that exceptions from Hive UDFs only led to warnings and returning NULL,
which was the expected result in all affected test queries.

This patch fixes the behavior in HiveUdfExecutorGeneric and improves
FE/EE tests to catch null handling related issues. Most Hive UDF tests
are run with abort_java_udf_on_exception=true after this patch to treat
exceptions in UDFs as errors. The ones where the test checks that NULL
is returned if an exception is thrown while abort_java_udf_on_exception
is false are moved to new .test files.
TestGenericUdf is also fixed (and simplified) to handle NULL return
values correctly.

Change-Id: I53238612f4037572abb6d2cc913dd74ee830a9c9
Reviewed-on: http://gerrit.cloudera.org:8080/19499
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-06 13:45:56 +00:00
Csaba Ringhofer
3573db68c8 IMPALA-11960: Fix constant propagation from TIMESTAMP to DATE
The constant propagation introduced in IMPALA-10064 handled conversion
of < and > predicates from timestamps to dates incorrectly.

Example:
select * from functional.alltypes_date_partition
  where date_col = cast(timestamp_col as date)
    and timestamp_col > '2009-01-01 01:00:00'
    and timestamp_col < '2009-02-01 01:00:00';

Before this change query rewrites added the following predicates:
date_col > DATE '2009-01-01' AND date_col < DATE '2009-02-01'
This incorrectly rejected all timestamps on the days of the
lower / upper bounds.

The fix is to rewrite < and > to <= and >= in the date predicates.
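
With the fix the added predicates become:
date_col >= DATE '2009-01-01' AND date_col <= DATE '2009-02-01'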

< could be kept if the upper bound is a constant with no time-of-day
part, e.g. timestamp_col < "2009-01-01" could be rewritten to
date_col < "2009-01-01", but this optimization is not added in this
patch to make it simpler.

Testing:
- added planner + EE regression tests

Change-Id: I1938bf5e91057b220daf8a1892940f674aac3d68
Reviewed-on: http://gerrit.cloudera.org:8080/19572
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-03-05 22:41:11 +00:00
xiabaike
8b375a66a2 IMPALA-11565: Support IF NOT EXISTS in alter table add columns for kudu/iceberg table
Impala has supported IF NOT EXISTS in ALTER TABLE ADD COLUMNS for
regular Hive tables since IMPALA-7832, but not for Kudu/Iceberg
tables. This patch adds the same semantics for Kudu/Iceberg tables.
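
For example (an illustrative sketch, following the existing syntax
from IMPALA-7832):

  ALTER TABLE kudu_tbl ADD IF NOT EXISTS COLUMNS (new_col INT);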

Testing:
- Updated E2E DDL tests
- Added fe tests

Change-Id: I82590e5372e881f2e81d4ed3dd0d32a2d3ddb517
Reviewed-on: http://gerrit.cloudera.org:8080/18953
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-03-02 16:51:12 +00:00
LPL
3153490545 IMPALA-11802: Optimize count(*) queries for Iceberg V2 position delete tables
The SCAN plan of a count star query for Iceberg V2 position delete
tables is as follows:

    AGGREGATE
    COUNT(*)
        |
    UNION ALL
   /         \
  /           \
 /             \
SCAN all    ANTI JOIN
datafiles  /         \
without   /           \
deletes  SCAN         SCAN
         datafiles    deletes

Since Iceberg provides the number of records in a file (record_count),
we can use this to optimize a simple count star query for Iceberg V2
position delete tables. First, the number of records in all DataFiles
without corresponding DeleteFiles can be calculated from Iceberg meta
files. The query is then rewritten as follows:

      ArithmeticExpr(ADD)
      /             \
     /               \
    /                 \
record_count       AGGREGATE
of all             COUNT(*)
datafiles              |
without            ANTI JOIN
deletes           /         \
                 /           \
                SCAN        SCAN
                datafiles   deletes

Testing:
 * Existing tests
 * Added e2e tests

Change-Id: I8172c805121bf91d23fe063f806493afe2f03d41
Reviewed-on: http://gerrit.cloudera.org:8080/19494
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2023-02-21 14:51:18 +00:00
gaoxq
89cc20717e IMPALA-4052: CREATE TABLE LIKE for Kudu tables
This commit implements cloning between Kudu tables, including cloning
the schema and hash partitions. But there is one limitation: cloning
Kudu tables with range partitions is not supported; this is tracked
by IMPALA-11912.
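
For example (an illustrative sketch):

  CREATE TABLE kudu_clone LIKE kudu_src;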

Cloning Kudu tables from other types of tables is not implemented,
because the table creation statements are different.

Testing:
 - e2e tests
 - AnalyzeDDLTest tests

Change-Id: Ia3d276a6465301dbcfed17bb713aca06367d9a42
Reviewed-on: http://gerrit.cloudera.org:8080/18729
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-20 16:38:16 +00:00
Peter Rozsa
1d05381b7b IMPALA-11745: Add Hive's ESRI geospatial functions as builtins
This change adds geospatial functions from Hive's ESRI library
as builtin UDFs. Plain Hive UDFs are imported without changes,
but the generic and varargs functions are handled differently;
generic functions are added with all of the combinations of
their parameters (the cartesian product of the parameters), and
varargs functions are unfolded into simple functions taking n
parameters. The varargs function wrappers are generated at build
time and they can be configured in
gen_geospatial_udf_wrappers.py. These additional steps are
required because of the limitations in Impala's UDF Executor
(lack of varargs support and only partial generics support)
which could be further improved; in this case, the additional
wrapping/mapping steps could be removed.

Changes regarding function handling/creating are sourced from
https://gerrit.cloudera.org/c/19177

A new backend flag was added to turn this feature on/off
as "geospatial_library". The default value is "NONE" which
means no geospatial function gets registered
as builtin, "HIVE_ESRI" value enables this implementation.

The ESRI geospatial implementation for Hive is currently only
available in Hive 4, but CDP Hive backported it to Hive 3;
therefore for Apache Hive this feature is disabled
regardless of the "geospatial_library" flag.

Known limitations:
 - ST_MultiLineString, ST_MultiPolygon only work
   with the WKT overload
 - ST_Polygon supports a maximum of 6 pairs of coordinates
 - ST_MultiPoint, ST_LineString support a maximum of 7
   pairs of coordinates
 - ST_ConvexHull, ST_Union support a maximum of 6 geoms

These limits can be increased in gen_geospatial_udf_wrappers.py

Tests:
 - test_geospatial_udfs.py added based on
   https://github.com/Esri/spatial-framework-for-hadoop

Co-Authored-by: Csaba Ringhofer <csringhofer@cloudera.com>

Change-Id: If0ca02a70b4ba244778c9db6d14df4423072b225
Reviewed-on: http://gerrit.cloudera.org:8080/19425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-07 20:18:47 +00:00
stiga-huang
32536ba258 IMPALA-11845: (Addendum) Don't specify db name in the new struct tests
Some new tests are added for STAR expansion on struct types when the
table is masked by Ranger masking policies. They are tested on both
Parquet and ORC tables. However, some tests explicitly use
'functional_parquet' as the db name, which loses the coverage on ORC
tables. This patch removes the explicit db names.

Change-Id: I8efea5cc2e10d8ae50ee6c1201e325932cb27fbf
Reviewed-on: http://gerrit.cloudera.org:8080/19470
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-06 14:47:04 +00:00
wzhou-code
40da36414f IMPALA-11809: Support non unique primary key for Kudu
The Kudu engine recently enabled the auto-incrementing column feature
(KUDU-1945). The feature works by appending a system generated
auto-incrementing column to the primary key columns to guarantee the
uniqueness of the primary key when the primary key columns can be non
unique. The non unique primary key columns and the auto-incrementing
column form the effective unique composite primary key.

This auto-incrementing column is named 'auto_incrementing_id' and has
BIGINT type. The assignment to it during insertion is automatic, so
insertion statements should not specify values for the
auto-incrementing column. In the current Kudu implementation, there is
no central key provider for auto-incrementing columns. Kudu uses a per
tablet-server global counter to assign values for auto-incrementing
columns. So the values of auto-incrementing columns are not unique in
a Kudu table, but are unique within a continuous region of the table
served by a tablet-server.

This patch also upgrades the Kudu version to 345fd44ca3 to pick up
Kudu changes needed for supporting non unique primary keys. It adds
syntactic support for creating Kudu tables with a non unique primary
key. When creating a Kudu table, specifying PRIMARY KEY is optional.
If there is no primary key attribute specified, the partition key
columns will be promoted to a non unique primary key if those columns
are the leading columns of the table.
A new column "key_unique" is added to the output of the 'describe'
command for Kudu tables.

Examples of CREATE TABLE statement with non unique primary key:
  CREATE TABLE tbl (i INT NON UNIQUE PRIMARY KEY, s STRING)
  PARTITION BY HASH (i) PARTITIONS 3
  STORED as KUDU;

  CREATE TABLE tbl (i INT, s STRING, NON UNIQUE PRIMARY KEY(i))
  PARTITION BY HASH (i) PARTITIONS 3
  STORED as KUDU;

  CREATE TABLE tbl NON UNIQUE PRIMARY KEY(id)
  PARTITION BY HASH (id) PARTITIONS 3
  STORED as KUDU
  AS SELECT id, string_col FROM functional.alltypes WHERE id = 10;

  CREATE TABLE tbl NON UNIQUE PRIMARY KEY(id)
  PARTITION BY RANGE (id)
  (PARTITION VALUES <= 1000,
   PARTITION 1000 < VALUES <= 2000,
   PARTITION 2000 < VALUES <= 3000,
   PARTITION 3000 < VALUES)
  STORED as KUDU
  AS SELECT id, int_col FROM functional.alltypestiny ORDER BY id ASC
   LIMIT 4000;

  CREATE TABLE tbl (id INT, name STRING, NON UNIQUE PRIMARY KEY(id))
  STORED as KUDU;

  CREATE TABLE tbl (a INT, b STRING, c FLOAT)
  PARTITION BY HASH (a, b) PARTITIONS 3
  STORED as KUDU;

A SELECT statement does not show the system generated auto-incrementing
column unless the column is explicitly specified in the select list.
The auto-incrementing column cannot be added, removed or renamed with
ALTER TABLE statements.
The UPSERT operation is not supported yet for Kudu tables with an
auto-incrementing column due to a limitation in the Kudu engine.

Testing:
 - Ran manual test in impala-shell with queries to create Kudu tables
   with non unique primary key, and tested insert/update/delete
   operations for these tables with non unique primary key.
 - Added front end tests, and end to end unit tests for Kudu tables
   with non unique primary key.
 - Passed exhaustive test.

Change-Id: I4d7882bf3d01a3492cc9827c072d1f3200d9eebd
Reviewed-on: http://gerrit.cloudera.org:8080/19383
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-04 07:34:56 +00:00
LPL
47c71bbb32 IMPALA-11798: Property 'external.table.purge' should not be ignored when CREATE Iceberg tables
The table property 'external.table.purge' should not be ignored when
creating Iceberg tables, except when 'iceberg.catalog' is not the
Hive Catalog for managed tables, because in that case we need to call
'org.apache.hadoop.hive.metastore.IMetaStoreClient#createTable' and HMS
will override 'external.table.purge' to 'TRUE'.
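
For example (an illustrative sketch), the property given at create
time is now respected:

  CREATE EXTERNAL TABLE ice_t (i INT)
  STORED AS ICEBERG
  TBLPROPERTIES ('external.table.purge'='FALSE');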

Testing:
 * existing tests
 * add e2e tests

Change-Id: I2649dd38fbe050044817d6c425ef447245aa2829
Reviewed-on: http://gerrit.cloudera.org:8080/19416
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-03 17:00:44 +00:00
stiga-huang
0c1bd9eff3 IMPALA-11845: Fix incorrect check of struct STAR path in resolvePathWithMasking
resolvePathWithMasking() is a wrapper on resolvePath() to further
resolve nested columns inside the table masking view. When it was
added, complex types in the select list hadn't been supported yet. So
the table masking view can't expose complex type columns directly in the
select list. Any paths in nested types will be further resolved inside
the table masking view in resolvePathWithMasking().

Take the following query as an example:
  select id, nested_struct.* from complextypestbl;
If Ranger column-masking/row-filter policies applied on the table, the
query is rewritten as
  select id, nested_struct.* from (
    select mask(id) from complextypestbl
    where row-filtering-condition
  ) t;
Table masking view "t" can't expose the nested column "nested_struct".
So we further resolve "nested_struct" inside the inlineView to use the
masked table "complextypestbl". The underlying TableRef is expected to
be a BaseTableRef.

Paths that don't reference nested columns should be resolved and
returned directly (just like the original resolvePath() does). E.g.
  select v.* from masked_view v
is rewritten to
  select v.* from (
    select mask(c1), mask(c2), ..., mask(cn)
    from masked_view
    where row-filtering-condition
  ) v;

The STAR path "v.*" should be resolved directly. However, it's treated
as a nested column unexpectedly. The code then tries to resolve it
inside the table "masked_view", finds that "masked_view" is not a
table, and so throws the IllegalStateException.

These are the current conditions for identifying nested STAR paths:
 - The destType is STRUCT
 - And the resolved path is rooted at a valid tuple descriptor

They don't really recognize the nested struct columns because STAR paths
on table/view also match these conditions. When the STAR path is an
expansion on a catalog table/view, the root tuple descriptor is
exactly the output tuple of the table/view. The destType is the type of
the tuple descriptor which is always a StructType.

Note that STAR paths on other nested types, i.e. array/map, are invalid.
So the first condition matches for all valid cases. The second condition
also matches all valid cases since both the table/view and struct STAR
expansion have the path rooted at a valid tuple descriptor.

This patch fixes the check for nested struct STAR path by checking
the matched types instead. Note that if "v.*" is a table/view expansion,
the matched type list is empty. If "v.*" is a struct column expansion,
the matched type list contains the STRUCT column type.

Tests:
 - Add missing coverage on STAR paths (v.*) on masked views.

Change-Id: I8f1e78e325baafbe23101909d47e82bf140a2d77
Reviewed-on: http://gerrit.cloudera.org:8080/19429
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-01 04:24:00 +00:00
Tamas Mate
8292e4afdd IMPALA-11864: Iceberg LOAD DATA should not load S3 hidden files
Loading data from S3 did not skip hidden files because the
FileSystemUtil.listFiles() call was returning a RemoteIterator, which,
unlike RecursingIterator, does not filter the hidden files. This
would make a load fail because the hidden files likely have an invalid
magic string.

This commit adds an extra condition to skip hidden files when creating
the CREATE subquery.

Testing:
 - Added E2E test
 - Ran E2E test on S3 build

Change-Id: Iffd179383c2bb2529f6f9b5f8bf5cba5f3553652
Reviewed-on: http://gerrit.cloudera.org:8080/19441
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2023-01-27 12:35:52 +00:00
Daniel Becker
ed59690b44 IMPALA-11840: Error with joining unnest with views
Queries fail in the following situation involving collections and views:
 1. A view returns an array
 2. A second view unnests the array returned from the first view
 3. The unnested view is queried in an outer query

For example:
  use functional_parquet;
  with sub as (
    select id, arr1.item unnested_arr
    from complextypes_arrays_only_view,
    complextypes_arrays_only_view.int_array arr1)
  select id, unnested_arr from sub;
  ERROR: IllegalStateException: null

The problem is that in CollectionTableRef.analyze(), if
 - there is a source view and
 - the collection ref is within a WITH clause and
 - it is not in the select list
then 'desc_' is not set, but it has to be set in order for
TableRef.analyzeJoin() to succeed.

This commit solves the problem by assigning a value to 'desc_' also in
the above case.

Testing:
 - Added regression tests in nested-types-runtime.test.

Change-Id: Ic52655631944913553a7e7d9e9169b93da46dde3
Reviewed-on: http://gerrit.cloudera.org:8080/19426
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-18 17:46:54 +00:00
stiga-huang
b0009db40b IMPALA-11843: Fix IndexOutOfBoundsException in analytic limit pushdown
When finding analytic conjuncts for analytic limit pushdown, the
following conditions are checked:
 - The conjunct should be a binary predicate
 - Left hand side is a SlotRef referencing the analytic expression, e.g.
   "rn" of "row_number() as rn"
 - The underlying analytic function is rank(), dense_rank() or row_number()
 - The window frame is UNBOUNDED PRECEDING to CURRENT ROW
 - Right hand side is a valid numeric limit
 - The op is =, <, or <=
See more details in AnalyticPlanner.inferPartitionLimits().
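
A query shape that matches these conditions (an illustrative sketch):

  SELECT * FROM (
    SELECT c1, row_number() OVER (ORDER BY c2) rn FROM t
  ) v
  WHERE v.rn <= 10;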

While checking the 2nd and 3rd condition, we get the source exprs of the
SlotRef. The source exprs could be empty if the SlotRef is actually
referencing a column of the table, i.e. a column materialized by the
scan node. Currently, we check the first source expr directly
regardless of whether the list is empty, which causes the
IndexOutOfBoundsException.

This patch fixes it by augmenting the check to consider an empty list.
Also fixes a similar code in AnalyticEvalNode.

Tests:
 - Add FE and e2e regression tests

Change-Id: I26d6bd58be58d09a29b8b81972e76665f41cf103
Reviewed-on: http://gerrit.cloudera.org:8080/19422
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-17 02:38:27 +00:00
Daniel Becker
53ce60457b IMPALA-11778: Printing maps may produce invalid json
Impala allows non-string types, for example numbers, to be keys in maps.
We print maps as json objects, but json objects only allow string keys.
If the Impala map has for example an INT key, the printed json is
invalid.

For example, in Impala the following two maps are not the same:
{1: "a", 2: "b"}
{"1": "a", "2": "b"}

The first map has INT keys, the second has STRING keys. Only the second
one is valid json.

Hive has the same behaviour as Impala, i.e. it produces invalid json if
the map keys have a non-string type.

This change introduces the STRINGIFY_MAP_KEYS query option that, when
set to true, converts non-string keys to strings. The default value of
the new query option is false because
  - conversion to string causes loss of information and
  - setting it to true would be a breaking change.
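
Example usage (an illustrative sketch with a hypothetical table):

  SET STRINGIFY_MAP_KEYS=true;
  -- map keys are now printed as strings, e.g. {"1": "a", "2": "b"}
  SELECT int_map FROM map_tbl;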

Testing:
  - Added tests in nested-map-in-select-list.test and map_null_keys.test
    that check the behaviour when STRINGIFY_MAP_KEYS is set to true.

Change-Id: I1820036a1c614c34ae5d70ac4fe79a992c9bce3a
Reviewed-on: http://gerrit.cloudera.org:8080/19364
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-10 20:50:50 +00:00
noemi
0549d9562b IMPALA-11620: Enable setting 'write.format.default'
Enable setting 'write.format.default' to a different file format
than what the table already contains.

Before IMPALA-10610 Iceberg tables with mixed-format data files were not
supported.
We used 'write.format.default' to determine the file format of the
table, which was only a temporary workaround. Because of this we did not
allow changing this table property if the table already contained
different table formats. E.g. we did not allow modifying
'write.format.default' to PARQUET if the table already contained ORC
files, because it would have made the table unreadable for Impala.
Since IMPALA-10610 'write.format.default' is not used to determine the
Iceberg table's format anymore, so we can allow changing it.
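
For example (an illustrative sketch), this is now allowed even if the
table already contains files of another format:

  ALTER TABLE ice_tbl SET TBLPROPERTIES ('write.format.default'='orc');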

This table property change is not synchronized between HMS and Iceberg
metadata files in case of true external Hive Catalog tables.
See IMPALA-11710.

Testing:
- E2E test in iceberg-alter.test

Change-Id: I22d0a8a18fce99015fcfe1fd15cb4a4d4c2deaec
Reviewed-on: http://gerrit.cloudera.org:8080/19221
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-09 19:03:51 +00:00
Andrew Sherman
11068d9aeb IMPALA-11811: Avoid storing unregistered predicate objects in a Map
Within the extractIcebergConjuncts() method we are tracking conjuncts
which are identity conjuncts by storing them in a temporary Map. The
conjuncts are Expr objects which have a hashCode() method based on
their id_ field, which is only present when they are registered. If the
id_ field is null, then the hashCode() will throw, and hence
unregistered predicates cannot be stored in a Map. Some predicates
produced by getBoundPredicates() are explicitly not registered.

Change extractIcebergConjuncts() to track the identity conjuncts using
a boolean array, which marks the indexes of the identity conjuncts in
the conjuncts_ list.

Print the name of the Class in the Expr.hashCode() error to aid future
debugging.

TESTING:

Add a query which causes an unregistered predicate Expr to be seen
during Iceberg scan planning.

Change-Id: I103e3b8b06b5a1d12214241fd5907e5192d682ce
Reviewed-on: http://gerrit.cloudera.org:8080/19390
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-03 19:01:34 +00:00
pengdou1990
c38afdf50f IMPALA-11390: Describe formatted statement on materialized view should show the view definition
A DESCRIBE FORMATTED/EXTENDED statement on a materialized view now
shows the view definition, like Hive.
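
For example (illustrative):

  DESCRIBE FORMATTED mat_view;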

Change-Id: Ie62be1ca6d74370bad4b32bb046df93027e6f651
Reviewed-on: http://gerrit.cloudera.org:8080/19109
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2023-01-02 01:12:53 +00:00
Tamas Mate
6ff99431a6 IMPALA-11806: Fix TestIcebergTable.test_load E2E test
The test had a flaky part: it referred to a randomly generated
directory. Removed the reference to this directory.

The test was also failing with filesystems other than HDFS due to the
hdfs_client dependency; replaced the hdfs_client calls with
filesystem_client calls instead.

Testing:
 - Executed the test locally (HDFS/Minicluster)
 - Triggered an Ozone build to verify it with different FS

Change-Id: Id95523949aab7dc2417a3d06cf780d3de2e44ee3
Reviewed-on: http://gerrit.cloudera.org:8080/19385
Reviewed-by: Tamas Mate <tmater@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-21 23:08:52 +00:00
noemi
390a932064 IMPALA-11708: Add support for mixed Iceberg tables with AVRO file format
This patch extends the support for Iceberg tables containing multiple
file formats. Now AVRO data files can also be read in a mixed table
besides Parquet and ORC.

Impala uses its Avro scanner to read AVRO files; therefore all the
Avro-related limitations apply here as well: writes/metadata
changes are not supported.

testing:
- E2E testing: extending 'iceberg-mixed-file-format.test' to include
  AVRO files as well, in order to test reading all three currently
  supported file formats: avro+orc+parquet

Change-Id: I941adfb659218283eb5fec1b394bb3003f8072a6
Reviewed-on: http://gerrit.cloudera.org:8080/19353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-16 17:37:35 +00:00
Daniel Becker
25b5058ef5 IMPALA-11717: Use rapidjson for printing collections
We have been using rapidjson to print structs but didn't use it to print
collections (arrays and maps).

This change introduces the usage of rapidjson to print collections for
both the HS2 and the Beeswax protocol.

The old code handling the printing of collections in raw-value.{h,cc} is
removed.

Testing:
 - Ran existing EE tests
 - Added EE tests with non-string and NULL map keys in
   nested-map-in-select-list.test and map_null_keys.test.

Change-Id: I08a2d596a498fbbaf1419b18284846b992f49165
Reviewed-on: http://gerrit.cloudera.org:8080/19309
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
2022-12-15 15:04:07 +00:00
Tamas Mate
05a4b778d3 IMPALA-11339: Add Iceberg LOAD DATA INPATH statement
Extend the LOAD DATA INPATH statement to support Iceberg tables.
Native Parquet tables need Iceberg field ids, therefore to add files
this change uses child queries to load and rewrite the data. The child
queries create > insert > drop a temporary table over the specified
directory.
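
Example usage (an illustrative sketch with a hypothetical path):

  LOAD DATA INPATH '/tmp/parquet_staged' INTO TABLE ice_tbl;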

The create part depends on LIKE PARQUET/ORC clauses to infer the file
format. This requires identifying a file in the directory and using that
to create the temporary table.

The target file or directory is moved to a staging directory before
ingestion similar to native file formats. In case of a query failure the
files are moved back to the original location. The child query
executor will return the error message of the failing query, and the
child query profiles will be available through the WebUI.

At this point the PARTITION clause is not supported because it would
require analysis of the PartitionSpec (IMPALA-11750).

Testing:
 - Added e2e tests
 - Added fe unit tests

Change-Id: I8499945fa57ea0499f65b455976141dcd6d789eb
Reviewed-on: http://gerrit.cloudera.org:8080/19145
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-15 13:34:51 +00:00
Michael Smith
8b83726b8e IMPALA-9487: Add erasure coding policy to SHOW, DESCRIBE
Adds erasure coding policy to introspection commands SHOW FILES, SHOW
PARTITIONS, SHOW TABLE STATS, and DESCRIBE EXTENDED.
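
For example, each of these now reports the erasure coding policy
(illustrative table name):

  SHOW FILES IN tbl;
  SHOW PARTITIONS tbl;
  SHOW TABLE STATS tbl;
  DESCRIBE EXTENDED tbl;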

Removes `throws IOException` for methods that don't throw. Removes the
null check for getSd because getStorageDescriptorInfo - which is called
right after getTableMetaDataInformation - uses it without checking for
null.

Adds '$ERASURECODE_POLICY' for runtime test substitution. The test suite
replaces this with the current erasure code policy - from
HDFS_ERASURECODE_POLICY - or NONE to match expected output.

Testing:
- ran backend, end-to-end, and custom cluster tests with erasure coding
- ran backend, end-to-end, and custom cluster tests with exhaustive
  strategy

Change-Id: Idd95f2d18b3980581788c92993b6d2f53504b5e0
Reviewed-on: http://gerrit.cloudera.org:8080/19268
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2022-12-14 22:37:14 +00:00
stiga-huang
64efb7695c IMPALA-11779: Fix crash in TopNNode due to slots in null type
The BE can't codegen or evaluate exprs of NULL type, so when the FE
transfers exprs to the BE (via thrift), it converts exprs of NULL type
into NullLiteral with Boolean type, e.g. see the code in
Expr#treeToThrift(). The type doesn't matter since
ScalarExprEvaluator::GetValue() in the BE returns nullptr for null
values of all types, and nullptr is treated as a null value.

Most of the exprs in BE are generated from thrift TExprs transferred
from FE, which guarantees they are not NULL type exprs. However, in
TopNPlanNode::Init(), we create SlotRefs directly based on the sort
tuple descriptor. If there are NULL type slots in the tuple descriptor,
we get SlotRefs in NULL type, which will crash codegen or evaluation (if
codegen is disabled) on them.

This patch adds a type-safe create method for SlotRef which uses
TYPE_BOOLEAN for TYPE_NULL. BE code that created SlotRefs directly
from SlotDescriptors is replaced by calls to this create method, which
guarantees no TYPE_NULL exprs are used in the corresponding evaluators.

Tests:
 - Added new tests in partitioned-top-n.test
 - Ran exhaustive tests

Change-Id: I6aaf80c5129eaf788c70c8f041021eaf73087f94
Reviewed-on: http://gerrit.cloudera.org:8080/19336
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
2022-12-13 04:12:19 +00:00
Zoltan Borok-Nagy
c56cd7b214 IMPALA-11780: Wrong FILE__POSITION values for multi row group Parquet files when page filtering is used
Impala generated wrong values for the FILE__POSITION column when the
Parquet file contained multiple row groups and page filtering was
used as well.

We are using the value of 'current_row_' in the Parquet column readers
to populate the file position slot. The problem is that 'current_row_'
denotes the index of the row within the row group and not within the
file. We cannot change 'current_row_' because page filtering depends
on its value: the page index also uses the row group-based indexes of
the rows, not the file-based indexes.

In the meantime it turned out FILE__POSITION was also not set correctly
in the Parquet late materialization code, as
BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
in some code paths.

The value of FILE__POSITION is critical for Iceberg V2 tables as
position delete files store file positions of the deleted rows.
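
A self-contained sketch of the required bookkeeping (hypothetical
names, not the actual reader code): the file position is the row
group-relative row index offset by the row group's first row in the
file.

  #include <cassert>
  #include <cstdint>

  struct RowGroupInfo {
    int64_t first_row_in_file;  // file index of the row group's first row
  };

  int64_t FilePosition(const RowGroupInfo& rg, int64_t current_row) {
    // 'current_row' alone would repeat positions in every row group;
    // adding the row group's start yields the position within the file.
    return rg.first_row_in_file + current_row;
  }

  int main() {
    RowGroupInfo second_group{1000};  // second row group starts at row 1000
    assert(FilePosition(second_group, 5) == 1005);
    return 0;
  }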

Testing:
 * added e2e tests
 * the tests are now running w/o PARQUET_READ_STATISTICS to exercise
   more code paths

Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Reviewed-on: http://gerrit.cloudera.org:8080/19328
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-08 23:07:08 +00:00
Michael Smith
f8819ac7cc IMPALA-11751: (Addendum) fix test for Ozone
Use FILESYSTEM_PREFIX to make test relocatable, as required by the Ozone
test environment.

Testing: ran test with Ozone.

Change-Id: Ic16322d90bd4039ec5ce2a54be79c748ee822978
Reviewed-on: http://gerrit.cloudera.org:8080/19330
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-08 23:06:46 +00:00
Csaba Ringhofer
86740a7d35 IMPALA-11549: Support Hive GenericUdfs that return primitive java types
Before this patch only the Writable* types were accepted as return
types in GenericUdfs, while some GenericUdfs in the wild return
primitive Java types (e.g. Integer instead of IntWritable). For legacy
Hive UDFs these return types were already handled, so the only change
needed was to map the ObjectInspector subclasses
(e.g. JavaIntObjectInspector) to the correct JavaUdfDataType in Impala.

Testing:
- Added a subclass for TestGenericUdf (TestGenericUdfWithJavaReturnTypes)
  that returns primitive java types (probably inheriting in the opposite
  direction would be more logical, but the diff is smaller this way).
- Changed EE tests to also use TestGenericUdfWithJavaReturnTypes.
- Changed FE tests (UdfExecutorTest) to check both
  TestGenericUdfWithJavaReturnTypes and TestGenericUdf.
- Also added a test with BINARY type to UdfExecutorTest as this was
  forgotten during the original BINARY patch.

Change-Id: I30679045d6693ebd35718b6f1a22aaa4963c1e63
Reviewed-on: http://gerrit.cloudera.org:8080/19304
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-08 17:51:00 +00:00
noemi
80fc49abe6 IMPALA-11158: Add support for Iceberg tables with AVRO data files
Iceberg tables containing only AVRO files, or no AVRO files at all,
can now be read by Impala. Mixed file format tables that include AVRO
are currently unsupported.
Impala uses its Avro scanner to read AVRO files, so all Avro-related
limitations apply here as well: writes and metadata changes are not
supported.

Testing:
- created test tables: 'iceberg_avro_only' contains only AVRO files;
  'iceberg_avro_mixed' contains all file formats: avro+orc+parquet
- added E2E test that reads Avro-only table
- added test case to iceberg-negative.test that tries to read
  mixed file format table

Change-Id: I827e5707e54bebabc614e127daa48255f86f4c4f
Reviewed-on: http://gerrit.cloudera.org:8080/19084
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-08 03:03:13 +00:00
Yida Wu
4bdb99938a IMPALA-11470: Add Cache For Codegen Functions
This patch adds a cache for codegen'd functions to improve the
performance of sub-second queries.

The main idea is to store the codegen'd functions in a cache and reuse
them when appropriate, avoiding repeated LLVM optimization that can
take hundreds of milliseconds.

The cache is a singleton instance per daemon and contains multiple
entries. Each entry is at the fragment level, i.e. it stores all the
codegen'd functions of a fragment; if the exact same fragment comes
again, it can find all the functions it needs in that entry, saving
the optimization time.

The module bitcode, which is generated before module optimization and
final compilation, is used as the cache key. If codegen_cache_mode is
NORMAL (the default), the full bitcode string is stored as the key.
If it is set to OPTIMAL, only the hash code and the total length of
the full key are stored, to reduce memory consumption.
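
A self-contained sketch of the two key modes (hypothetical types and
names, not the actual cache implementation):

  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <string>

  struct CacheKey {
    std::string full_bitcode;  // populated only in NORMAL mode
    std::uint64_t hash;        // hash of the module bitcode
    std::size_t length;        // length of the full key
  };

  CacheKey MakeKey(const std::string& bitcode, bool optimal) {
    CacheKey key;
    key.hash = std::hash<std::string>{}(bitcode);
    key.length = bitcode.size();
    // NORMAL keeps the full bitcode so lookups cannot collide; OPTIMAL
    // keeps only hash + length, trading memory for a collision risk.
    if (!optimal) key.full_bitcode = bitcode;
    return key;
  }

  int main() {
    const std::string bitcode = "...module bitcode...";
    return MakeKey(bitcode, false).hash == MakeKey(bitcode, true).hash
        ? 0 : 1;
  }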

Also, KrpcDataStreamSenderConfig::CodegenHashRow() is changed to take
the hash seed as an argument, because a dynamic hash seed embedded in
the codegen'd function prevents the fragment from ever hitting the
cache.
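
A sketch of why the seed must be an argument (illustrative C++ with
hypothetical names, not the actual codegen'd code): a seed captured
when the function is built changes the function body, and thus the
bitcode used as the cache key, on every query, while a seed passed at
call time keeps the generated code identical across queries.

  #include <cstdint>
  #include <functional>

  // Before: per-query seed baked into the function; bodies differ
  // across queries, so their bitcode (the cache key) differs too.
  std::function<uint64_t(uint64_t)> MakeSeededHash(uint64_t seed) {
    return [seed](uint64_t v) { return std::hash<uint64_t>{}(v ^ seed); };
  }

  // After: one seed-independent function; the seed flows in as data,
  // so the same code (and cache key) serves every query.
  uint64_t HashRow(uint64_t v, uint64_t seed) {
    return std::hash<uint64_t>{}(v ^ seed);
  }

  int main() { return HashRow(42, 7) == MakeSeededHash(7)(42) ? 0 : 1; }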

The codegen cache is disabled automatically for a fragment that uses a
native UDF, because it could lead to a crash. The reason is that the
UDF is loaded into the LLVM execution engine's global mapping instead
of the LLVM module, while the current cache key is the LLVM module
bitcode, which cannot reflect a change of the UDF's address when the
UDF is reloaded at runtime, for example after a database recreation.
Using a stale UDF address from the cache could then crash. The cache
stays disabled for such fragments until there is a better solution;
filed IMPALA-11771 to follow up.

The patch also introduces the following new startup flag and query
options for configuring and operating the feature.
Startup option for configuration:
  - codegen_cache_capacity: The capacity of the cache; if set to 0,
    the codegen cache is disabled.

Query options for operations:
  - disable_codegen_cache: The codegen cache is disabled when this is
    set to true.

  - codegen_cache_mode: Defined by a new enum type TCodeGenCacheMode
    with four values: NORMAL, OPTIMAL, and their debug counterparts
    NORMAL_DEBUG and OPTIMAL_DEBUG.
    With NORMAL, the full key is stored in the cache; this costs more
    memory per entry because the key is the bitcode of the LLVM module,
    which can be large.
    With OPTIMAL, the cache stores only the hash code and length of the
    key, which greatly reduces memory consumption but makes hash
    collisions possible.
    The debug modes behave the same as their non-debug counterparts but
    allow more logs and statistics, so they can be slower.
    Only valid when disable_codegen_cache is set to false.

New impalad metrics:
  - impala.codegen-cache.misses
  - impala.codegen-cache.entries-in-use
  - impala.codegen-cache.entries-in-use-bytes
  - impala.codegen-cache.entries-evicted
  - impala.codegen-cache.hits
  - impala.codegen-cache.entry-sizes

New profile metrics:
  - CodegenCacheLookupTime
  - CodegenCacheSaveTime
  - ModuleBitcodeGenTime
  - NumCachedFunctions

TPCH-1 performance evaluation (8 iterations) on AWS m5a.4xlarge; the
first iteration is excluded from the results to show the benefit of a
warm cache. Each Delta(Avg) column compares the cached runtime against
the baseline column to its left:
Query     Cached(s) NoCache(s) Delta(Avg) NoCodegen(s)  Delta(Avg)
TPCH-Q1    0.39      1.02       -61.76%     5.59         -93.02%
TPCH-Q2    0.56      1.21       -53.72%     0.47         19.15%
TPCH-Q3    0.37      0.77       -51.95%     0.43         -13.95%
TPCH-Q4    0.36      0.51       -29.41%     0.33         9.09%
TPCH-Q5    0.39      1.1        -64.55%     0.39         0%
TPCH-Q6    0.24      0.27       -11.11%     0.77         -68.83%
TPCH-Q7    0.39      1.2        -67.5%      0.39         0%
TPCH-Q8    0.58      1.46       -60.27%     0.45         28.89%
TPCH-Q9    0.8       1.38       -42.03%     1            -20%
TPCH-Q10   0.6       1.03       -41.75%     0.85         -29.41%
TPCH-Q11   0.3       0.93       -67.74%     0.2          50%
TPCH-Q12   0.28      0.48       -41.67%     0.38         -26.32%
TPCH-Q13   1.11      1.22       -9.02%      1.16         -4.31%
TPCH-Q14   0.55      0.78       -29.49%     0.45         22.22%
TPCH-Q15   0.33      0.73       -54.79%     0.44         -25%
TPCH-Q16   0.32      0.78       -58.97%     0.41         -21.95%
TPCH-Q17   0.56      0.84       -33.33%     0.89         -37.08%
TPCH-Q18   0.54      0.92       -41.3%      0.89         -39.33%
TPCH-Q19   0.35      2.34       -85.04%     0.35         0%
TPCH-Q20   0.34      0.98       -65.31%     0.31         9.68%
TPCH-Q21   0.83      1.14       -27.19%     0.86         -3.49%
TPCH-Q22   0.26      0.52       -50%        0.25         4%

The results show good performance compared to codegen without the
cache (the default setting). However, compared to disabling codegen,
the codegen cache is, as expected, not always faster for short
queries. It still needs time to prepare the codegen'd functions and to
generate the module bitcode used as the key; if that preparation costs
more than the codegen'd functions save, especially for extremely short
queries, the result can be slower than not using codegen at all. There
could be room to improve this in the future.

We also measured the total codegen cache size used by each TPCH query,
shown in the data below. OPTIMAL mode is very effective at reducing
the size of the cache; the reason is the much smaller key mentioned
above, since the key is the only difference between the two modes.

Query     Normal(KB)  Optimal(KB)
TPCH-Q1     604.1       50.9
TPCH-Q2     973.4       135.5
TPCH-Q3     561.1       36.5
TPCH-Q4     423.3       41.1
TPCH-Q5     866.9       93.3
TPCH-Q6     295.9       4.9
TPCH-Q7     1105.4      124.5
TPCH-Q8     1382.6      211
TPCH-Q9     1041.4      119.5
TPCH-Q10    738.4       65.4
TPCH-Q11    1201.6      136.3
TPCH-Q12    452.8       46.7
TPCH-Q13    541.3       48.1
TPCH-Q14    696.8       102.8
TPCH-Q15    1148.1      95.2
TPCH-Q16    740.6       77.4
TPCH-Q17    990.1       133.4
TPCH-Q18    376         70.8
TPCH-Q19    1280.1      179.5
TPCH-Q20    1260.9      180.7
TPCH-Q21    722.5       66.8
TPCH-Q22    713.1       49.8

Tests:
Ran exhaustive tests.
Added e2e test TestCodegenCache.
Added unit test LlvmCodeGenCacheTest.

Change-Id: If42c78a7f51fd582e5fe331fead494dadf544eb1
Reviewed-on: http://gerrit.cloudera.org:8080/19181
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-07 21:57:46 +00:00
Michael Smith
8cd4a1e4e5 IMPALA-11584: Enable minicluster tests for Ozone
Enables tests guarded by SkipIfNotHdfsMinicluster to run on Ozone as
well as HDFS. Plans are still skipped for Ozone because there's
Ozone-specific text in the plan output.

Updates explain output to allow for Ozone, which has a block size of
256MB instead of 128MB. One of the partitions read in test_explain is
~180MB, straddling the difference between Ozone and HDFS.

Testing: ran affected tests with Ozone.

Change-Id: I6b06ceacf951dbc966aa409cf24a310c9676fe7f
Reviewed-on: http://gerrit.cloudera.org:8080/19250
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-12-06 21:18:33 +00:00