Iceberg table modifications cause new table snapshots to be created;
the pre-existing snapshots continue to represent earlier versions of
the table. The Iceberg API provides a way to roll back the table to a
previous snapshot.
This change adds the ability to execute a rollback on Iceberg tables
using the following statements:
- ALTER TABLE <tbl> EXECUTE ROLLBACK(<snapshot id>)
- ALTER TABLE <tbl> EXECUTE ROLLBACK('<timestamp>')
The latter form of the command rolls back to the most recent snapshot
whose creation timestamp is older than the specified timestamp.
Note that when a table is rolled back to a snapshot, a new snapshot is
created with the same snapshot id, but with a new creation timestamp.
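For example (hypothetical table name, snapshot id, and timestamp):
ALTER TABLE ice_tbl EXECUTE ROLLBACK(123456789);
ALTER TABLE ice_tbl EXECUTE ROLLBACK('2022-01-04 10:00:00');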
Testing:
- Added analysis unit tests.
- Added e2e tests.
- Converted test_time_travel to use get_snapshots() from iceberg_util.
- Added a utility class to allow pytests to create tables with various
  Iceberg catalogs.
Change-Id: Ic74913d3b81103949ffb5eef7cc936303494f8b9
Reviewed-on: http://gerrit.cloudera.org:8080/19002
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change fixes the behavior of BytesWritable and TextWritable's
getBytes() method. The returned byte array can now be treated as the
underlying buffer: it is loaded before the UDF's evaluation and tracks
changes like a regular Java byte array; the resizing operation still
resets the reference. The operations that wrote back to the native
heap were also removed, as these changes are now handled in the byte
array itself. The ImpalaStringWritable class is also removed;
writables that previously used it now store the data directly.
Tests:
- Test UDFs added as BufferAlteringUdf and GenericBufferAlteringUdf
- E2E test ran for UDFs
Change-Id: Ifb28bd0dce7b0482c7abe1f61f245691fcbfe212
Reviewed-on: http://gerrit.cloudera.org:8080/19507
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Sorting is not supported if the select list contains collection columns
(see IMPALA-10939). IMPALA-9551 added support for mixed complex types
(collections in structs and structs in collections). However, the case
of structs containing collections in the select list combined with
sorting was not handled explicitly. The query
select id, struct_contains_arr from collection_struct_mix order by id;
resulted in
ERROR: IllegalStateException: null
After this change, a meaningful error message is given (the same as in
the case of pure collection columns):
ERROR: IllegalStateException: Sorting is not supported if the select
list contains collection columns.
The check for collections in the sorting tuple was moved to an earlier
stage of analysis from SingleNodePlanner to QueryStmt, as otherwise we
would hit another precondition check first in the case of structs
containing collections.
Testing:
- Added tests in mixed-collections-and-structs.test that test sorting
when a struct in the select list contains an array and a map
respectively.
Change-Id: I09ac27cba34ee7c6325a7e7895f3a3c9e1a088e5
Reviewed-on: http://gerrit.cloudera.org:8080/19597
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently collections and structs are supported in the select list, also
when they are nested (structs in structs and collections in
collections), but mixing different kinds of complex types, i.e. having
structs in collections or vice versa, is not supported.
This patch adds support for mixed complex types in the select list.
Limitation: zipping unnest is not supported for mixed complex types;
for example the following query fails:
use functional_parquet;
select unnest(struct_contains_nested_arr.arr) from
collection_struct_mix;
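Queries on mixed complex types that don't use zipping unnest now work,
for example (column name taken from the new test table):
select struct_contains_nested_arr from collection_struct_mix;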
Testing:
- Created a new test table, 'collection_struct_mix', that contains
mixed complex types.
- Added tests in mixed-collections-and-structs.test that test having
mixed complex types in the select list. These tests are called from
test_nested_types.py::TestMixedCollectionsAndStructsInSelectList.
- Ran existing tests that test collections and structs in the select
list; test queries that expected a failure in case of mixed complex
types have been moved to mixed-collections-and-structs.test and now
expect success.
Change-Id: I476d98884b5fd192dfcd4feeec7947526aebe993
Reviewed-on: http://gerrit.cloudera.org:8080/19322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch, if an argument of a GenericUDF was NULL, then
Impala passed it as null instead of as a DeferredObject. This was
incorrect, as Hive expects a DeferredObject whose get() function
returns null. See the Jira for more details and GenericUDF examples in
Hive.
TestGenericUdf's NULL handling was further broken in IMPALA-11549,
leading to null pointer exceptions when the UDF's result is NULL. This
test bug was not detected because the Hive UDF tests were running with
the default abort_java_udf_on_exception=false, which means that
exceptions from Hive UDFs only led to warnings and returning NULL,
which was the expected result in all affected test queries.
This patch fixes the behavior in HiveUdfExecutorGeneric and improves
FE/EE tests to catch null handling related issues. Most Hive UDF tests
are run with abort_java_udf_on_exception=true after this patch to treat
exceptions in UDFs as errors. The ones where the test checks that NULL
is returned if an exception is thrown while abort_java_udf_on_exception
is false are moved to new .test files.
TestGenericUdf is also fixed (and simplified) to handle NULL return
values correctly.
Change-Id: I53238612f4037572abb6d2cc913dd74ee830a9c9
Reviewed-on: http://gerrit.cloudera.org:8080/19499
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The constant propagation introduced in IMPALA-10064 handled conversion
of < and > predicates from timestamps to dates incorrectly.
Example:
select * from functional.alltypes_date_partition
where date_col = cast(timestamp_col as date)
and timestamp_col > '2009-01-01 01:00:00'
and timestamp_col < '2009-02-01 01:00:00';
Before this change query rewrites added the following predicates:
date_col > DATE '2009-01-01' AND date_col < DATE '2009-02-01'
This incorrectly rejected all timestamps on the days of the
lower / upper bounds.
The fix is to rewrite < and > to <= and >= in the date predicates.
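Applied to the example above, the rewritten predicates become:
date_col >= DATE '2009-01-01' AND date_col <= DATE '2009-02-01'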
< could be kept if the upper bound is a constant with no time-of-day
part, e.g. timestamp_col < "2009-01-01" could be rewritten to
date_col < "2009-01-01", but this optimization is not added in this
patch to keep it simple.
Testing:
- added planner + EE regression tests
Change-Id: I1938bf5e91057b220daf8a1892940f674aac3d68
Reviewed-on: http://gerrit.cloudera.org:8080/19572
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala already supports IF NOT EXISTS in ALTER TABLE ADD COLUMNS for
regular Hive tables (IMPALA-7832), but not for Kudu/Iceberg tables.
This patch adds the same semantics for Kudu/Iceberg tables.
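For example (hypothetical table and column names):
ALTER TABLE kudu_tbl ADD IF NOT EXISTS COLUMNS (new_col INT);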
Testing:
- Updated E2E DDL tests
- Added fe tests
Change-Id: I82590e5372e881f2e81d4ed3dd0d32a2d3ddb517
Reviewed-on: http://gerrit.cloudera.org:8080/18953
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
The SCAN plan of a count star query for Iceberg V2 position delete
tables is as follows:
AGGREGATE
COUNT(*)
|
UNION ALL
/ \
/ \
/ \
SCAN all ANTI JOIN
datafiles / \
without / \
deletes SCAN SCAN
datafiles deletes
Since Iceberg provides the number of records in a file (record_count),
we can use this to optimize a simple count star query for Iceberg V2
position delete tables. First, the number of records in all DataFiles
without corresponding DeleteFiles can be calculated from the Iceberg
metadata files. Then the query is rewritten as follows:
ArithmeticExpr(ADD)
/ \
/ \
/ \
record_count AGGREGATE
of all COUNT(*)
datafiles |
without ANTI JOIN
deletes / \
/ \
SCAN SCAN
datafiles deletes
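The optimization applies to simple count star queries such as
(hypothetical table name):
select count(*) from ice_v2_tbl;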
Testing:
* Existing tests
* Added e2e tests
Change-Id: I8172c805121bf91d23fe063f806493afe2f03d41
Reviewed-on: http://gerrit.cloudera.org:8080/19494
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
This commit implements cloning between Kudu tables, including cloning
the schema and hash partitions. One limitation: cloning Kudu tables
with range partitions is not supported; this is tracked by
IMPALA-11912. Cloning Kudu tables from other types of tables is not
implemented, because the table creation statements are different.
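A sketch of the statement this enables (hypothetical table names,
assuming the CREATE TABLE ... LIKE syntax):
CREATE TABLE kudu_clone LIKE kudu_src;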
Testing:
- e2e tests
- AnalyzeDDLTest tests
Change-Id: Ia3d276a6465301dbcfed17bb713aca06367d9a42
Reviewed-on: http://gerrit.cloudera.org:8080/18729
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds geospatial functions from Hive's ESRI library
as builtin UDFs. Plain Hive UDFs are imported without changes,
but the generic and varargs functions are handled differently:
generic functions are added with all combinations of their parameters
(the cartesian product of the parameter types), and varargs functions
are unfolded into simple functions with a fixed number of parameters.
The varargs function wrappers are generated at build time and can be
configured in
gen_geospatial_udf_wrappers.py. These additional steps are
required because of the limitations in Impala's UDF Executor
(lack of varargs support and only partial generics support)
which could be further improved; in that case, the additional
wrapping/mapping steps could be removed.
Changes regarding function handling/creating are sourced from
https://gerrit.cloudera.org/c/19177
A new backend flag, "geospatial_library", was added to turn this
feature on/off. The default value is "NONE", which means no
geospatial functions get registered as builtins; the "HIVE_ESRI"
value enables this implementation.
The ESRI geospatial implementation for Hive is currently only
available in Hive 4, but CDP Hive backported it to Hive 3; therefore,
for Apache Hive this feature is disabled regardless of the
"geospatial_library" flag.
Known limitations:
- ST_MultiLineString and ST_MultiPolygon only work
  with the WKT overload
- ST_Polygon supports a maximum of 6 pairs of coordinates
- ST_MultiPoint and ST_LineString support a maximum of 7
  pairs of coordinates
- ST_ConvexHull and ST_Union support a maximum of 6 geoms
These limits can be increased in gen_geospatial_udf_wrappers.py
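A hedged usage sketch (assuming geospatial_library=HIVE_ESRI and the
standard ESRI point/WKT overloads):
SELECT ST_AsText(ST_Point(1, 2));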
Tests:
- test_geospatial_udfs.py added based on
https://github.com/Esri/spatial-framework-for-hadoop
Co-Authored-by: Csaba Ringhofer <csringhofer@cloudera.com>
Change-Id: If0ca02a70b4ba244778c9db6d14df4423072b225
Reviewed-on: http://gerrit.cloudera.org:8080/19425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some new tests are added for STAR expansion on struct types when the
table is masked by Ranger masking policies. They are tested on both
Parquet and ORC tables. However, some tests explicitly use
'functional_parquet' as the db name, which loses the coverage on ORC
tables. This patch removes the explicit db names.
Change-Id: I8efea5cc2e10d8ae50ee6c1201e325932cb27fbf
Reviewed-on: http://gerrit.cloudera.org:8080/19470
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Kudu engine recently enabled the auto-incrementing column feature
(KUDU-1945). The feature works by appending a system-generated
auto-incrementing column to the primary key columns to guarantee the
uniqueness of the primary key when the primary key columns may be
non-unique. The non-unique primary key columns and the
auto-incrementing column form the effective unique composite primary
key.
This auto-incrementing column is named 'auto_incrementing_id' and has
BIGINT type. It is assigned automatically during insertion, so insert
statements must not specify values for the auto-incrementing column.
In the current Kudu implementation, there is no central key provider
for auto-incrementing columns; Kudu uses a per-tablet-server global
counter to assign their values. So the values of the auto-incrementing
column are not unique within a Kudu table, only within a continuous
region of the table served by one tablet server.
This patch also upgrades Kudu to version 345fd44ca3 to pick up the
Kudu changes needed for supporting non-unique primary keys. It adds
syntactic support for creating Kudu tables with a non-unique primary
key. When creating a Kudu table, specifying PRIMARY KEY is optional.
If no primary key attribute is specified, the partition key columns
are promoted to a non-unique primary key if those columns are the
beginning columns of the table.
New column "key_unique" is added to the output of 'describe' table
command for Kudu table.
Examples of CREATE TABLE statement with non unique primary key:
CREATE TABLE tbl (i INT NON UNIQUE PRIMARY KEY, s STRING)
PARTITION BY HASH (i) PARTITIONS 3
STORED as KUDU;
CREATE TABLE tbl (i INT, s STRING, NON UNIQUE PRIMARY KEY(i))
PARTITION BY HASH (i) PARTITIONS 3
STORED as KUDU;
CREATE TABLE tbl NON UNIQUE PRIMARY KEY(id)
PARTITION BY HASH (id) PARTITIONS 3
STORED as KUDU
AS SELECT id, string_col FROM functional.alltypes WHERE id = 10;
CREATE TABLE tbl NON UNIQUE PRIMARY KEY(id)
PARTITION BY RANGE (id)
(PARTITION VALUES <= 1000,
PARTITION 1000 < VALUES <= 2000,
PARTITION 2000 < VALUES <= 3000,
PARTITION 3000 < VALUES)
STORED as KUDU
AS SELECT id, int_col FROM functional.alltypestiny ORDER BY id ASC
LIMIT 4000;
CREATE TABLE tbl (id INT, name STRING, NON UNIQUE PRIMARY KEY(id))
STORED as KUDU;
CREATE TABLE tbl (a INT, b STRING, c FLOAT)
PARTITION BY HASH (a, b) PARTITIONS 3
STORED as KUDU;
SELECT statements do not show the system-generated auto-incrementing
column unless it is explicitly specified in the select list. The
auto-incrementing column cannot be added, removed or renamed with
ALTER TABLE statements. The UPSERT operation is currently not
supported for Kudu tables with an auto-incrementing column due to a
limitation in the Kudu engine.
Testing:
- Ran manual tests in impala-shell with queries to create Kudu tables
with non unique primary key, and tested insert/update/delete
operations for these tables with non unique primary key.
- Added front end tests, and end to end unit tests for Kudu tables
with non unique primary key.
- Passed exhaustive tests.
Change-Id: I4d7882bf3d01a3492cc9827c072d1f3200d9eebd
Reviewed-on: http://gerrit.cloudera.org:8080/19383
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Table property 'external.table.purge' should not be ignored when
creating Iceberg tables, except for managed tables whose
'iceberg.catalog' is not the Hive Catalog: in that case we need to
call 'org.apache.hadoop.hive.metastore.IMetaStoreClient#createTable'
and HMS will override 'external.table.purge' to 'TRUE'.
Testing:
* existing tests
* add e2e tests
Change-Id: I2649dd38fbe050044817d6c425ef447245aa2829
Reviewed-on: http://gerrit.cloudera.org:8080/19416
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
resolvePathWithMasking() is a wrapper around resolvePath() that
further resolves nested columns inside the table masking view. When it
was
added, complex types in the select list hadn't been supported yet. So
the table masking view can't expose complex type columns directly in the
select list. Any paths in nested types will be further resolved inside
the table masking view in resolvePathWithMasking().
Take the following query as an example:
select id, nested_struct.* from complextypestbl;
If Ranger column-masking/row-filter policies applied on the table, the
query is rewritten as
select id, nested_struct.* from (
select mask(id) from complextypestbl
where row-filtering-condition
) t;
Table masking view "t" can't expose the nested column "nested_struct".
So we further resolve "nested_struct" inside the inlineView to use the
masked table "complextypestbl". The underlying TableRef is expected to
be a BaseTableRef.
Paths that don't reference nested columns should be resolved and
returned directly (just like the original resolvePath() does). E.g.
select v.* from masked_view v
is rewritten to
select v.* from (
select mask(c1), mask(c2), ..., mask(cn)
from masked_view
where row-filtering-condition
) v;
The STAR path "v.*" should be resolved directly. However, it is
unexpectedly treated as a nested column. The code then tries to
resolve it inside the table "masked_view", finds that "masked_view"
is not a table, and throws the IllegalStateException.
These are the current conditions for identifying nested STAR paths:
- The destType is STRUCT
- And the resolved path is rooted at a valid tuple descriptor
They don't really recognize the nested struct columns because STAR paths
on table/view also match these conditions. When the STAR path is an
expansion on a catalog table/view, the root tuple descriptor is
exactly the output tuple of the table/view. The destType is the type of
the tuple descriptor which is always a StructType.
Note that STAR paths on other nested types, i.e. array/map, are invalid.
So the first condition matches for all valid cases. The second condition
also matches all valid cases since both the table/view and struct STAR
expansion have the path rooted at a valid tuple descriptor.
This patch fixes the check for nested struct STAR path by checking
the matched types instead. Note that if "v.*" is a table/view expansion,
the matched type list is empty. If "v.*" is a struct column expansion,
the matched type list contains the STRUCT column type.
Tests:
- Add missing coverage on STAR paths (v.*) on masked views.
Change-Id: I8f1e78e325baafbe23101909d47e82bf140a2d77
Reviewed-on: http://gerrit.cloudera.org:8080/19429
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Loading data from S3 did not skip hidden files because the
FileSystemUtil.listFiles() call returned a RemoteIterator, which,
unlike RecursingIterator, does not filter out hidden files. This could
make a load fail because hidden files likely have an invalid magic
string.
This commit adds an extra condition to skip hidden files when creating
the CREATE subquery.
Testing:
- Added E2E test
- Ran E2E test on S3 build
Change-Id: Iffd179383c2bb2529f6f9b5f8bf5cba5f3553652
Reviewed-on: http://gerrit.cloudera.org:8080/19441
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Queries fail in the following situation involving collections and views:
1. A view returns an array
2. A second view unnests the array returned from the first view
3. The unnested view is queried in an outer query
For example:
use functional_parquet;
with sub as (
select id, arr1.item unnested_arr
from complextypes_arrays_only_view,
complextypes_arrays_only_view.int_array arr1)
select id, unnested_arr from sub;
ERROR: IllegalStateException: null
The problem is that in CollectionTableRef.analyze(), if
- there is a source view and
- the collection ref is within a WITH clause and
- it is not in the select list
then 'desc_' is not set, but it has to be set in order for
TableRef.analyzeJoin() to succeed.
This commit solves the problem by assigning a value to 'desc_' also in
the above case.
Testing:
- Added regression tests in nested-types-runtime.test.
Change-Id: Ic52655631944913553a7e7d9e9169b93da46dde3
Reviewed-on: http://gerrit.cloudera.org:8080/19426
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When finding analytic conjuncts for analytic limit pushdown, the
following conditions are checked:
- The conjunct should be a binary predicate
- Left hand side is a SlotRef referencing the analytic expression, e.g.
"rn" of "row_number() as rn"
- The underlying analytic function is rank(), dense_rank() or row_number()
- The window frame is UNBOUNDED PRECEDING to CURRENT ROW
- Right hand side is a valid numeric limit
- The op is =, <, or <=
See more details in AnalyticPlanner.inferPartitionLimits().
While checking the 2nd and 3rd conditions, we get the source exprs of
the SlotRef. The source exprs could be empty if the SlotRef actually
references a column of the table, i.e. a column materialized by the
scan node. Previously, we checked the first source expr directly
regardless of whether the list is empty, which caused the
IndexOutOfBoundsException. This patch fixes it by augmenting the check
to consider an empty list.
Also fixes a similar code in AnalyticEvalNode.
Tests:
- Add FE and e2e regression tests
Change-Id: I26d6bd58be58d09a29b8b81972e76665f41cf103
Reviewed-on: http://gerrit.cloudera.org:8080/19422
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Impala allows non-string types, for example numbers, to be keys in maps.
We print maps as json objects, but json objects only allow string keys.
If the Impala map has for example an INT key, the printed json is
invalid.
For example, in Impala the following two maps are not the same:
{1: "a", 2: "b"}
{"1": "a", "2": "b"}
The first map has INT keys, the second has STRING keys. Only the second
one is valid json.
Hive has the same behaviour as Impala, i.e. it produces invalid json if
the map keys have a non-string type.
This change introduces the STRINGIFY_MAP_KEYS query option that, when
set to true, converts non-string keys to strings. The default value of
the new query option is false because
- conversion to string causes loss of information and
- setting it to true would be a breaking change.
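Usage sketch (hypothetical table and column names):
set STRINGIFY_MAP_KEYS=true;
select int_key_map from map_tbl;
With the option enabled, a map like {1: "a", 2: "b"} is printed as
{"1": "a", "2": "b"}.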
Testing:
- Added tests in nested-map-in-select-list.test and map_null_keys.test
that check the behaviour when STRINGIFY_MAP_KEYS is set to true.
Change-Id: I1820036a1c614c34ae5d70ac4fe79a992c9bce3a
Reviewed-on: http://gerrit.cloudera.org:8080/19364
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enable setting 'write.format.default' to a different file format
than what the table already contains.
Before IMPALA-10610, Iceberg tables with mixed-format data files were
not supported. We used 'write.format.default' to determine the file
format of the table, which was only a temporary workaround. Because of
this, we did not allow changing this table property if the table
already contained data files in a different format. E.g. we did not
allow modifying 'write.format.default' to PARQUET if the table already
contained ORC files, because it would have made the table unreadable
for Impala.
Since IMPALA-10610 'write.format.default' is not used to determine the
Iceberg table's format anymore, so we can allow changing it.
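For example, the following is now allowed (hypothetical table name):
ALTER TABLE ice_tbl SET TBLPROPERTIES('write.format.default'='orc');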
This table property change is not synchronized between HMS and Iceberg
metadata files in case of true external Hive Catalog tables.
See IMPALA-11710.
Testing:
- E2E test in iceberg-alter.test
Change-Id: I22d0a8a18fce99015fcfe1fd15cb4a4d4c2deaec
Reviewed-on: http://gerrit.cloudera.org:8080/19221
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Within the extractIcebergConjuncts() method we track which conjuncts
are identity conjuncts by storing them in a temporary Map. The
conjuncts are Expr objects whose hashCode() method is based on their
id_ field, which is only present when they are registered. If the id_
field is null, then hashCode() will throw, and hence unregistered
predicates cannot be stored in a Map. Some predicates produced by
getBoundPredicates() are explicitly not registered.
Change extractIcebergConjuncts() to track the identity conjuncts using
a boolean array indexed by the position of each conjunct in the
conjuncts_ list.
Print the name of the Class in the Expr.hashCode() error to aid future
debugging.
TESTING:
Add a query which causes an unregistered predicate Expr to be seen
during Iceberg scan planning.
Change-Id: I103e3b8b06b5a1d12214241fd5907e5192d682ce
Reviewed-on: http://gerrit.cloudera.org:8080/19390
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test had a flaky part: it referred to a randomly generated
directory. Removed the reference to this directory.
The test was also failing on filesystems other than HDFS due to the
hdfs_client dependency; replaced the hdfs_client calls with
filesystem_client calls.
Testing:
- Executed the test locally (HDFS/Minicluster)
- Triggered an Ozone build to verify it with different FS
Change-Id: Id95523949aab7dc2417a3d06cf780d3de2e44ee3
Reviewed-on: http://gerrit.cloudera.org:8080/19385
Reviewed-by: Tamas Mate <tmater@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch extends support for Iceberg tables containing multiple
file formats: AVRO data files can now also be read in a mixed table,
besides Parquet and ORC.
Impala uses its avro scanner to read AVRO files, therefore all the
avro related limitations apply here as well: writes/metadata
changes are not supported.
Testing:
- E2E testing: extending 'iceberg-mixed-file-format.test' to include
AVRO files as well, in order to test reading all three currently
supported file formats: avro+orc+parquet
Change-Id: I941adfb659218283eb5fec1b394bb3003f8072a6
Reviewed-on: http://gerrit.cloudera.org:8080/19353
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We have been using rapidjson to print structs but didn't use it to print
collections (arrays and maps).
This change introduces the usage of rapidjson to print collections for
both the HS2 and the Beeswax protocol.
The old code handling the printing of collections in raw-value.{h,cc} is
removed.
Testing:
- Ran existing EE tests
- Added EE tests with non-string and NULL map keys in
nested-map-in-select-list.test and map_null_keys.test.
Change-Id: I08a2d596a498fbbaf1419b18284846b992f49165
Reviewed-on: http://gerrit.cloudera.org:8080/19309
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Extend the LOAD DATA INPATH statement to support Iceberg tables.
Iceberg data files need Iceberg field ids, which plain Parquet files
lack; therefore, to add files this change uses child queries to load
and rewrite the data. The child queries create > insert > drop a
temporary table over the specified directory.
The create part depends on the LIKE PARQUET/ORC clauses to infer the
file format. This requires identifying a file in the directory and
using that file to create the temporary table.
The target file or directory is moved to a staging directory before
ingestion similar to native file formats. In case of a query failure the
files are moved back to the original location. Child query executor will
return the error message of the failing query and the child query
profiles will be available through the WebUI.
At this point the PARTITION clause is not supported because it would
require analysis of the PartitionSpec (IMPALA-11750).
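A hedged usage sketch (hypothetical path and table name):
LOAD DATA INPATH '/tmp/ice_load_dir' INTO TABLE ice_tbl;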
Testing:
- Added e2e tests
- Added fe unit tests
Change-Id: I8499945fa57ea0499f65b455976141dcd6d789eb
Reviewed-on: http://gerrit.cloudera.org:8080/19145
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds the erasure coding policy to the introspection commands SHOW
FILES, SHOW PARTITIONS, SHOW TABLE STATS, and DESCRIBE EXTENDED.
Removes `throws IOException` from methods that don't throw. Removes
the null check for getSd because getStorageDescriptorInfo - which is
called right after getTableMetaDataInformation - uses it without
checking for null.
Adds '$ERASURECODE_POLICY' for runtime test substitution. The test suite
replaces this with the current erasure code policy - from
HDFS_ERASURECODE_POLICY - or NONE to match expected output.
Testing:
- ran backend, end-to-end, and custom cluster tests with erasure coding
- ran backend, end-to-end, and custom cluster tests with exhaustive
strategy
Change-Id: Idd95f2d18b3980581788c92993b6d2f53504b5e0
Reviewed-on: http://gerrit.cloudera.org:8080/19268
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
The BE can't codegen or evaluate exprs of NULL type. So when the FE
transfers exprs to the BE (via thrift), it converts exprs of NULL type
into NullLiterals with Boolean type, e.g. see the code in
Expr#treeToThrift(). The type doesn't matter since
ScalarExprEvaluator::GetValue() in the BE returns nullptr for null
values of all types, and nullptr is treated as a null value.
Most of the exprs in BE are generated from thrift TExprs transferred
from FE, which guarantees they are not NULL type exprs. However, in
TopNPlanNode::Init(), we create SlotRefs directly based on the sort
tuple descriptor. If there are NULL type slots in the tuple descriptor,
we get SlotRefs in NULL type, which will crash codegen or evaluation (if
codegen is disabled) on them.
This patch adds a type-safe create method for SlotRef which uses
TYPE_BOOLEAN for TYPE_NULL. BE code that creates SlotRefs directly
from SlotDescriptors is replaced by calls to this create method, which
guarantees no TYPE_NULL exprs are used in the corresponding
evaluators.
Tests:
- Added new tests in partitioned-top-n.test
- Ran exhaustive tests
Change-Id: I6aaf80c5129eaf788c70c8f041021eaf73087f94
Reviewed-on: http://gerrit.cloudera.org:8080/19336
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Quanlong Huang <huangquanlong@gmail.com>
Impala generated wrong values for the FILE__POSITION column when the
Parquet file contained multiple row groups and page filtering was
used as well.
We use the value of 'current_row_' in the Parquet column readers to
populate the file position slot. The problem is that 'current_row_'
denotes the index of the row within the row group, not within the
file. We cannot change 'current_row_' because page filtering depends
on its value: the page index also uses row indexes relative to the row
group, not to the file.
In the meantime it turned out FILE__POSITION was also not set correctly
in the Parquet late materialization code, as
BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
in some code paths.
The value of FILE__POSITION is critical for Iceberg V2 tables as
position delete files store file positions of the deleted rows.
Testing:
* added e2e tests
* the tests are now running w/o PARQUET_READ_STATISTICS to exercise
more code paths
Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Reviewed-on: http://gerrit.cloudera.org:8080/19328
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch only the Writable* types were accepted in GenericUdfs
as return types, while some GenericUdfs in the wild return primitive java
types (e.g. Integer instead of IntWritable). For legacy Hive UDFs these
return types were already handled, so the only change needed was to
map the ObjectInspector subclasses (e.g. JavaIntObjectInspector) to the
correct JavaUdfDataType in Impala.
Testing:
- Added a subclass for TestGenericUdf (TestGenericUdfWithJavaReturnTypes)
that returns primitive java types (probably inheriting in the opposite
direction would be more logical, but the diff is smaller this way).
- Changed EE tests to also use TestGenericUdfWithJavaReturnTypes.
- Changed FE tests (UdfExecutorTest) to check both
TestGenericUdfWithJavaReturnTypes and TestGenericUdf.
- Also added a test with BINARY type to UdfExecutorTest as this was
forgotten during the original BINARY patch.
Change-Id: I30679045d6693ebd35718b6f1a22aaa4963c1e63
Reviewed-on: http://gerrit.cloudera.org:8080/19304
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg tables containing only AVRO files or no AVRO files at all
can now be read by Impala. Mixed file format tables with AVRO are
currently unsupported.
Impala uses its avro scanner to read AVRO files, therefore all the
avro related limitations apply here as well: writes/metadata
changes are not supported.
Testing:
- created test tables: 'iceberg_avro_only' contains only AVRO files;
'iceberg_avro_mixed' contains all file formats: avro+orc+parquet
- added E2E test that reads Avro-only table
- added test case to iceberg-negative.test that tries to read
mixed file format table
Change-Id: I827e5707e54bebabc614e127daa48255f86f4c4f
Reviewed-on: http://gerrit.cloudera.org:8080/19084
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for caching codegen functions to improve the
performance of sub-second queries.
The main idea is to store the codegen functions in a cache and reuse
them when appropriate, avoiding the repeated LLVM optimization time,
which can take hundreds of milliseconds.
In this patch, we implement a cache to store codegen functions. The
cache is a singleton instance in each daemon and contains multiple
cache entries. Each cache entry is at the fragment level, i.e. it
stores all the codegen functions of a fragment; if an identical
fragment comes again, it can find all the codegen functions it needs
in that entry, saving the optimization time.
The module bitcode is used as the cache key; it is generated before
the module optimization and final compilation. If codegen_cache_mode
is NORMAL, which is the default, we store the full bitcode string as
the key. Otherwise, if codegen_cache_mode is set to OPTIMAL, we store
a key that only contains the hash code and the total length of the
full key, to reduce memory consumption.
Also, KrpcDataStreamSenderConfig::CodegenHashRow() is changed to
pass the hash seed as an argument because it can't hit the cache
for the fragment if using a dynamic hash seed within the codegen
function.
The codegen cache is disabled automatically for a fragment that uses
a native UDF, because it could lead to a crash. The reason is that the
UDF is loaded into the LLVM execution engine's global mapping instead
of the LLVM module, while the current cache key is the LLVM module
bitcode, which can't reflect a change of the UDF address if the UDF is
reloaded during runtime, for example on database recreation. Using a
stale UDF address from the cache could then lead to a crash, so the
cache is disabled for such fragments until there is a better solution;
IMPALA-11771 is filed to follow up.
The patch also introduces the following new startup flag and query
options for configuring and operating the feature.
Startup flag:
- codegen_cache_capacity: The capacity of the cache; if set to 0, the
  codegen cache is disabled.
Query options:
- disable_codegen_cache: The codegen cache is disabled when this is
  set to true.
- codegen_cache_mode: Defined by a new enum type, TCodeGenCacheMode.
  There are four modes: NORMAL and OPTIMAL, plus NORMAL_DEBUG and
  OPTIMAL_DEBUG, which are the debug variants of the first two.
  With NORMAL, the full key is stored in the cache; this costs more
  memory per entry because the key is the bitcode of the LLVM module,
  which can be large.
  With OPTIMAL, the cache only stores the hash code and length of the
  key, which greatly reduces memory consumption but makes hash
  collisions possible.
  The debug modes behave the same as the non-debug modes but allow
  more logging and statistics, which can make them slower.
  Only valid when disable_codegen_cache is set to false.
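A hedged usage sketch:
set codegen_cache_mode=OPTIMAL;
select count(*) from tpch.lineitem; -- repeated runs can hit the cache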
New impalad metrics:
- impala.codegen-cache.misses
- impala.codegen-cache.entries-in-use
- impala.codegen-cache.entries-in-use-bytes
- impala.codegen-cache.entries-evicted
- impala.codegen-cache.hits
- impala.codegen-cache.entry-sizes
New profile Metrics:
- CodegenCacheLookupTime
- CodegenCacheSaveTime
- ModuleBitcodeGenTime
- NumCachedFunctions
TPCH-1 performance evaluation (8 iterations) on AWS m5a.4xlarge;
the results exclude the first iteration to show the benefit of the
cache:
Query Cached(s) NoCache(s) Delta(Avg) NoCodegen(s) Delta(Avg)
TPCH-Q1 0.39 1.02 -61.76% 5.59 -93.02%
TPCH-Q2 0.56 1.21 -53.72% 0.47 19.15%
TPCH-Q3 0.37 0.77 -51.95% 0.43 -13.95%
TPCH-Q4 0.36 0.51 -29.41% 0.33 9.09%
TPCH-Q5 0.39 1.1 -64.55% 0.39 0%
TPCH-Q6 0.24 0.27 -11.11% 0.77 -68.83%
TPCH-Q7 0.39 1.2 -67.5% 0.39 0%
TPCH-Q8 0.58 1.46 -60.27% 0.45 28.89%
TPCH-Q9 0.8 1.38 -42.03% 1 -20%
TPCH-Q10 0.6 1.03 -41.75% 0.85 -29.41%
TPCH-Q11 0.3 0.93 -67.74% 0.2 50%
TPCH-Q12 0.28 0.48 -41.67% 0.38 -26.32%
TPCH-Q13 1.11 1.22 -9.02% 1.16 -4.31%
TPCH-Q14 0.55 0.78 -29.49% 0.45 22.22%
TPCH-Q15 0.33 0.73 -54.79% 0.44 -25%
TPCH-Q16 0.32 0.78 -58.97% 0.41 -21.95%
TPCH-Q17 0.56 0.84 -33.33% 0.89 -37.08%
TPCH-Q18 0.54 0.92 -41.3% 0.89 -39.33%
TPCH-Q19 0.35 2.34 -85.04% 0.35 0%
TPCH-Q20 0.34 0.98 -65.31% 0.31 9.68%
TPCH-Q21 0.83 1.14 -27.19% 0.86 -3.49%
TPCH-Q22 0.26 0.52 -50% 0.25 4%
From the results, the codegen cache shows a pretty good improvement
compared to codegen without the cache (the default setting). However,
compared to disabling codegen, as expected, the codegen cache is not
always faster for short queries. It still needs some time to prepare
the codegen functions and generate the module bitcode used as the key;
if that preparation takes longer than the benefit from the codegen
functions, especially for extremely short queries, the result can be
slower than not using codegen. There could be room to improve this in
the future.
We also measured the total cache entry size for the TPC-H queries.
The data below shows the total codegen cache memory used by each TPC-H
query. The OPTIMAL mode is very helpful in reducing the size of the
cache; the reason is the much smaller key mentioned above, since the
key is the only difference between the two modes.
Query Normal(KB) Optimal(KB)
TPCH-Q1 604.1 50.9
TPCH-Q2 973.4 135.5
TPCH-Q3 561.1 36.5
TPCH-Q4 423.3 41.1
TPCH-Q5 866.9 93.3
TPCH-Q6 295.9 4.9
TPCH-Q7 1105.4 124.5
TPCH-Q8 1382.6 211
TPCH-Q9 1041.4 119.5
TPCH-Q10 738.4 65.4
TPCH-Q11 1201.6 136.3
TPCH-Q12 452.8 46.7
TPCH-Q13 541.3 48.1
TPCH-Q14 696.8 102.8
TPCH-Q15 1148.1 95.2
TPCH-Q16 740.6 77.4
TPCH-Q17 990.1 133.4
TPCH-Q18 376 70.8
TPCH-Q19 1280.1 179.5
TPCH-Q20 1260.9 180.7
TPCH-Q21 722.5 66.8
TPCH-Q22 713.1 49.8
Tests:
Ran exhaustive tests.
Added E2E test case TestCodegenCache.
Added unit test case LlvmCodeGenCacheTest.
Change-Id: If42c78a7f51fd582e5fe331fead494dadf544eb1
Reviewed-on: http://gerrit.cloudera.org:8080/19181
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enables tests guarded by SkipIfNotHdfsMinicluster to run on Ozone as
well as HDFS. Plans are still skipped for Ozone because there's
Ozone-specific text in the plan output.
Updates explain output to allow for Ozone, which has a block size of
256MB instead of 128MB. One of the partitions read in test_explain is
~180MB, straddling the difference between Ozone and HDFS.
Testing: ran affected tests with Ozone.
Change-Id: I6b06ceacf951dbc966aa409cf24a310c9676fe7f
Reviewed-on: http://gerrit.cloudera.org:8080/19250
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Ranger provides column masking and row filtering policies to mask
sensitive data for specific users/groups. When a table should be masked
in a query, Impala replaces it with a table mask view that exposes the
columns with masked expressions.
After IMPALA-9661, only selected columns are exposed in the table mask
view. However, the columns of the view are exposed in the order that
they are registered. If the registering order differs from the column
order in the table, STAR expansions will mismatch the columns.
To be specific, let's say table 'tbl' with 3 columns a, b, c should be
masked in the following query:
select b, * from tbl;
Ideally Impala should replace the TableRef of 'tbl' with a table mask
view as:
select b, * from (
select mask(a) a, mask(b) b, mask(c) c from tbl
) t;
Currently, the rewritten query is
select b, * from (
select mask(b) b, mask(a) a, mask(c) c from tbl
) t;
This incorrectly expands the STAR as "b, a, c" in the re-analyze phase.
The cause is that column 'b' is registered earlier than all other
columns. This patch fixes it by sorting the selected columns based on
their original order in the table.
Tests:
- Add tests for selecting STAR with normal columns on table and view.
Change-Id: Ic83d78312b19fa2c5ab88ac4f359bfabaeaabce6
Reviewed-on: http://gerrit.cloudera.org:8080/19279
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Sequence container based file formats (SequenceFile, RCFile, Avro) have
a file header in each file that describes the metadata of the file, e.g.
codec, default values, etc. The header should be decoded before reading
the file content. The initial scanners will read the header and then
issue follow-up scan ranges for the file content. The decoded header
will be referenced by follow-up scanners.
Since IMPALA-9655, when MT_DOP > 1, the issued scan ranges could be
scheduled to other scan node instances. So the header resource should
live until all scan node instances close. Header objects are owned by
the object pool of the RuntimeState, which meets the requirement.
AvroFileHeader is special compared to other headers in that it
references a template tuple which contains the partition values and
default values
for missing fields. The template tuple is initially owned by the header
scanner, then transferred to the scan node before the scanner closes.
However, when the scan node instance closes, the template tuple is
freed. Scanners of other scan node instances might still depend on it.
This could cause wrong results or crash the impalad.
When partition columns are used in the query, or when the underlying
avro files have missing fields and the table schema has default values
for them, the AvroFileHeader will have a non-null template tuple, which
could hit this bug when MT_DOP>1.
This patch fixes the bug by transferring the template tuple to
ScanRangeSharedState directly. The scan_node_pool of HdfsScanNodeBase is
also removed since it's only used to hold the template tuple (and
related buffers) of the avro header. Also no need to override
TransferToScanNodePool in HdfsScanNode since the original purpose is to
protect the pool by a lock, and now the method in ScanRangeSharedState
already has a lock.
Tests
- Add missing test coverage for compute stats on avro tables. Note that
MT_DOP=4 is set by default for compute stats.
- Add the MT_DOP dimension for TestScannersAllTableFormats. Also add
some queries that can reveal the bug in scanners.test. The ASAN build
can easily crash by heap-use-after-free error without this fix.
- Ran exhaustive tests.
Change-Id: Iafa43fce7c2ffdc867004d11e5873327c3d8cb42
Reviewed-on: http://gerrit.cloudera.org:8080/19289
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_resource_limits_kudu assumed that scan_bytes_limit is not
enforced for Kudu. The commit for IMPALA-11702 added a bytes-read
counter for the Kudu scanner, so scan_bytes_limit is now enforced for
Kudu.
This patch fixes test_resource_limits_kudu accordingly.
Testing:
- Executed test_resource_limits_kudu in a loop on local machine.
- Passed core run.
Change-Id: I77e960bb9fd539442e12866be4a1c5f91dd8aca4
Reviewed-on: http://gerrit.cloudera.org:8080/19288
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If an Iceberg V2 table is partitioned and contains delete files, then
queries that involve runtime filters on the partition columns return
an empty result set.
E.g.:
select count(*)
from store_sales, date_dim
where d_date_sk = ss_sold_date_sk and d_moy=2 and d_year=1998;
In the above query store_sales is partitioned by ss_sold_date_sk,
which will be filtered by runtime filters created by the JOIN. If
store_sales has delete files then the above query returns an empty
result set.
The problem is that we are invoking PartitionPassesFilters() on these
Iceberg tables. It is usually a no-op for Iceberg tables, as the
template tuple is NULL. But when we have virtual columns, a template
tuple has been created in HdfsScanPlanNode::InitTemplateTuple. For
Iceberg tables this template tuple is incomplete, i.e. it doesn't have
the partition values set. This means the filters evaluate to false and
the files get filtered out, hence the query produces an empty result
set.
With this patch we don't invoke PartitionPassesFilters() on Iceberg
tables, only the Iceberg-specific IcebergPartitionPassesFilters()
gets invoked. Also added DCHECKs to ensure this.
Testing:
* e2e tests added
Change-Id: I43f3e0a4df7c1ba6d8ea61410b570d8cf7b31ad3
Reviewed-on: http://gerrit.cloudera.org:8080/19274
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add syntactic support for creating bucketed tables.
The specific syntax in the CREATE TABLE statement is as follows:
[CLUSTERED BY (column[, column ...]) [SORT BY (column[, column ...])]
INTO <num_buckets> BUCKETS]
Example:
CREATE TABLE tbl (i int COMMENT 'hello', s string)
CLUSTERED BY (i) INTO 24 BUCKETS;
CREATE TABLE tbl (i int COMMENT 'hello', s string)
CLUSTERED BY (i) SORT BY (s) INTO 24 BUCKETS;
Instructions:
1. The bucket partitioning algorithm is the hash function used in
   Hive's bucketed tables;
2. CREATE bucketed table statements currently don't support Kudu and
   Iceberg tables;
3. In the current version, alter operations (add/drop/change/replace
   columns) on bucketed tables are not supported;
4. Dropping bucketed tables is supported.
This commit is the first subtask of IMPALA-3118.
Change-Id: I919b4d4139bc3a7784fa6fdb6f064e25666d548e
Reviewed-on: http://gerrit.cloudera.org:8080/19055
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When an Iceberg partitioning column is filtered with a 'not equal
NULL' or 'equal NULL' predicate, Impala throws a ClassCastException.
For example: "ERROR: ClassCastException:
org.apache.impala.analysis.NullLiteral cannot be cast to
org.apache.impala.analysis.StringLiteral".
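Queries of the following form triggered the exception (hypothetical
table and column names):
select * from ice_part_tbl where part_col = NULL;
select * from ice_part_tbl where part_col != NULL;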
Testing:
- Add 'col=NULL' and 'col!=NULL' queries in iceberg-query.test.
Change-Id: Id6c50978ebac2590622027a239db03f56b082de3
Reviewed-on: http://gerrit.cloudera.org:8080/19270
Reviewed-by: <lipenglin@sensorsdata.cn>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Global runtime filters are published to the coordinator and then
distributed to all executors that need it. The filter is serialized and
deserialized using protobuf. While deserializing a global runtime filter
of numeric type from protobuf, the InsertBatch() method forgot to update
the total_entries_ counter. The filter is then considered as an empty
list, which will reject any files/rows.
This patch adds the missing update of total_entries_. Some DCHECKs are
added to make sure total_entries_ is consistent with the actual size of
the value set. This patch also fixes a type error (long_val -> int_val)
in ToProtobuf() of Date type IN-list filter.
Tests:
- Added BE tests to verify the filter cloned from protobuf has the same
behavior as the original one.
- Added e2e regression tests
- Ran TestInListFilters 200 times.
Change-Id: Ie90b2bce5e5ec6f6906ce9d2090b0ab19d48cc78
Reviewed-on: http://gerrit.cloudera.org:8080/19220
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Qifan Chen <qfchen@hotmail.com>
Convert SkipIf.not_hdfs to SkipIf.not_dfs for tests that require
filesystem semantics, adding more feature test coverage with Ozone.
Creates a separate not_scratch_fs flag for scratch dir tests as they're
not supported with Ozone yet. Filed IMPALA-11730 to address this.
Preserves not_hdfs for a specific test that uses the dfsadmin CLI to put
it in safemode.
Adds sfs_ofs_unsupported for SmallFileSystem tests. This should work for
many of our filesystems based on
ebb1e2fa99/ql/src/java/org/apache/hadoop/hive/ql/io/SingleFileSystem.java (L62-L87). Makes sfs tests work on S3.
Adds hardcoded_uris for IcebergV2 tests where deletes are implemented as
hardcoded URIs in parquet files. Adding a parquet read/write library for
Python is beyond the scope of this patch.
Change-Id: Iafc1dac52d013e74a459fdc4336c26891a256ef1
Reviewed-on: http://gerrit.cloudera.org:8080/19254
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
If the Impala version is set to a release build as described in point 8
in the "How to Release" document
(https://cwiki.apache.org/confluence/display/IMPALA/How+to+Release#HowtoRelease-HowtoVoteonaReleaseCandidate),
TestIcebergTable.test_compute_stats fails:
Stacktrace:
query_test/test_iceberg.py:852: in test_compute_stats
    self.run_test_case('QueryTest/iceberg-compute-stats', vector,
      unique_database)
common/impala_test_suite.py:742: in run_test_case
    self.__verify_results_and_errors(vector, test_section, result, use_db)
common/impala_test_suite.py:578: in __verify_results_and_errors
    replace_filenames_with_placeholder)
common/test_result_verifier.py:469: in verify_raw_results
    VERIFIER_MAP[verifier](expected, actual)
common/test_result_verifier.py:278: in verify_query_result_is_equal
    assert expected_results == actual_results
E   assert Comparing QueryTestResults (expected vs actual):
E   2,1,'2.33KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/test_compute_stats_74dbc105.db/ice_alltypes'
E   != 2,1,'2.32KB','NOT CACHED','NOT CACHED','PARQUET','false','hdfs://localhost:20500/test-warehouse/test_compute_stats_74dbc105.db/ice_alltypes'
The problem is the file size which is 2.32KB instead of 2.33KB. This is
because the version is written into the file, and "x.y.z-RELEASE" is one
byte shorter than "x.y.z-SNAPSHOT". The size of the file in this test is
on the boundary between 2.32KB and 2.33KB, so this one byte can change
the value.
This change fixes the problem by using a regex to accept both values so
it works for both snapshot and release versions.
Change-Id: Ia1fa12eebf936ec2f4cc1d5f68ece2c96d1256fb
Reviewed-on: http://gerrit.cloudera.org:8080/19260
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
NULL values are printed as "NULL" if they are top level or in
collections, but as "null" in structs. We should print collections and
structs in JSON form, so it should be "null" in collections, too. Hive
also follows the latter (correct) approach.
This commit changes the printing of NULL values to "null" in
collections.
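For example, assuming an ARRAY<INT> value containing a NULL, a row is
now printed as [1,null,3] instead of [1,NULL,3].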
Testing:
- Modified the tests to expect "null" instead of "NULL" in collections.
Change-Id: Ie5e7f98df4014ea417ddf73ac0fb8ec01ef655ba
Reviewed-on: http://gerrit.cloudera.org:8080/19236
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
This change adds a more generic approach to validate numeric
query options and report parse and validation errors.
Supported types: integers, floats, memory specifications.
Range and bound validator helper functions are added to unify
validation at call sites.
Testing:
- Error messages got more generic, therefore the existing tests
around query options are aligned to match them
Change-Id: Ia7757b52393c094d2c661918d73cbfad7214f855
Reviewed-on: http://gerrit.cloudera.org:8080/19096
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When trying to read from the HDFS cache, ReadFromCache calls
FileReader::Open(false) to force the file to open. The prior commit
for IMPALA-11704 didn't allow for that case when using a data cache,
as the data cache check would always happen. This resulted in a crash
when calling CachedFile, as exclusive_hdfs_fh_ was nullptr. Tests only
catch this when reading from the HDFS cache with the data cache
enabled.
Replaces explicit arguments to override FileReader behavior with a flag
to communicate whether FileReader supports delayed open. Then the caller
can choose whether to call Open before read. Also simplifies calls to
ReadFromPos as it already has a pointer to ScanRange and can check
whether file handle caching is enabled directly. The Open call in
DoInternalRead uses a slightly wider net by only checking UseDataCache.
If the data cache is unavailable or a miss the file will then be opened.
Adds a select from tpch.nation to the query for test_data_cache.py as
something that triggers checking the HDFS cache.
Change-Id: I741488d6195e586917de220a39090895886a2dc5
Reviewed-on: http://gerrit.cloudera.org:8080/19228
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In the 'FileMetadataUtils::AddIcebergColumns' method, when the slot
is a virtual column, it should be skipped directly. Without this,
querying an Iceberg V2 table whose first column is a partition column
of BOOLEAN type could give a wrong position-delete result.
Testing:
- Add e2e tests
- Locally tested the results on position-delete-based Iceberg tables
Change-Id: I58faf3df6ae8a5bcabb1d2ac9f11a6fbcd74bc24
Reviewed-on: http://gerrit.cloudera.org:8080/19223
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-11666 revised the message in the query plans when there are
potentially corrupt statistics, which broke test_corrupt_stat, an E2E
test only run in the exhaustive tests. This patch fixes the test file
accordingly.
Testing:
- Verified locally that the patch passes test_corrupt_stat.
Change-Id: I817c7807a07bb89b93d795bce958b9872eff2eef
Reviewed-on: http://gerrit.cloudera.org:8080/19224
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If EXPAND_COMPLEX_TYPES is set to true, some queries that combine star
expressions and explicitly given complex columns fail:
select outer_struct, * from
functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: Illegal reference to non-materialized
slot: tid=1 sid=1
select *, outer_struct.str from
functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: null
Having two stars in a table with complex columns also fails.
select *, * from functional_orc_def.complextypes_nested_structs;
ERROR: IllegalStateException: Illegal reference to non-materialized
slot: tid=6 sid=13
The error is because of this line in 'SelectStmt.addStarResultExpr()':
8e350d0a8a/fe/src/main/java/org/apache/impala/analysis/SelectStmt.java (L811)
What we want to do is create 'SlotRef's for the struct children
(recursively) but 'reExpandStruct()' also creates new 'SlotDescriptor's
for the children. The new 'SlotDescriptor's are redundant and are not
inserted into the tree which can leave them unmaterialised or without a
correct memory layout.
The solution is to only create the 'SlotRef's for the struct children
without creating new 'SlotDescriptor's. This leads us to another
problem:
- for structs, it is 'SlotRef.analyzeImpl()' that creates the child
'SlotRef's
- the constructor 'SlotRef(SlotDescriptor desc)' sets 'isAnalyzed_' to
true.
Before structs were allowed, this was correct but now struct-typed
'SlotRef's created with the above constructor are counted as analysed
but lack child expressions, which would have been added if 'analyze()'
had been called on them. This essentially violates the contract of this
constructor.
This commit modifies 'SlotRef(SlotDescriptor desc)' so that child
expressions are generated for structs, restoring the correct semantics
of this constructor. After this, it is no longer necessary to call
'reExpandStruct()' in 'SelectStmt.addStarResultExpr()'.
Testing:
- Added the failing test cases and a few variations of them to
nested-types-star-expansion.test
Change-Id: Ia8cf53b0a7409faca668713228bfef275f3833f9
Reviewed-on: http://gerrit.cloudera.org:8080/19171
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch lowers the privilege requirement for external Kudu table
creation. Before this patch, a user was required to have the ALL
privilege on SERVER if the user wanted to create an external Kudu table.
In this patch we introduce a new type of resources called storage
handler URI and a new access type called RWSTORAGE that will be
supported by Apache Ranger once RANGER-3281 is resolved, which in turn
depends on the release of Apache Hive 4.0 that consists of HIVE-24705.
Specifically, after this patch, a user will be allowed to create an
external Kudu table as long as the user is granted the RWSTORAGE
privilege on the resource specified by a storage handler URI that points
to an existing Kudu table.
For instance, in order for a user 'non_owner' to create an external Kudu
table based on an existing Kudu table 'impala::tpch_kudu.nation', it
suffices to execute the following command as an administrator to grant
the necessary privilege to the requesting user, where "localhost" is the
default address of the Kudu master host, assuming a single master
host in this example.
GRANT RWSTORAGE ON
STORAGEHANDLER_URI 'kudu://localhost/impala::tpch_kudu.nation'
TO USER non_owner
One may wonder why we do not simply remove the privilege check
that required the ALL privilege on SERVER for external Kudu table
creation. One scenario in which such relaxation is not secure is when
the owner or the creator of the existing Kudu table is different from
the requesting user who wants to create an external Kudu table in
Impala. Not requiring any additional privilege check would allow a user
without any privilege to retrieve the contents of the existing Kudu
table.
On the other hand, after this patch we still require a user to have the
ALL privilege on SERVER when the table property of
'kudu.master_addresses' is specified in a query that tries to create a
Kudu table whether or not the table is external. To be more specific,
the user 'non_owner' would be able to create an external Kudu table
using the following statement once being granted the RWSTORAGE privilege
on the specified storage handler URI above.
CREATE EXTERNAL TABLE default.kudu_tbl STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::tpch_kudu.nation')
However, the following query submitted by the same user would be
rejected due to the user 'non_owner' not being granted the ALL privilege
on SERVER.
CREATE EXTERNAL TABLE default.kudu_tbl STORED AS KUDU
TBLPROPERTIES ('kudu.table_name'='impala::tpch_kudu.nation',
'kudu.master_addresses'='localhost')
We do not relax this requirement because specifying the addresses of
the Kudu master hosts to connect to should still be considered an
administrative operation.
Testing:
- Added various FE and E2E tests to verify Impala's behavior after this
patch with respect to external Kudu table creation.
- Verified that this patch passes the core tests in the DEBUG build.
Change-Id: I7936e1d8c48696169f7ad7ad92abe44a26eea3c4
Reviewed-on: http://gerrit.cloudera.org:8080/17640
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Only test changes. Minor-compacted delta dirs have been supported in
Impala since IMPALA-9512, but at that time Hive supported minor
compaction only on full ACID tables. Since then, Hive has added
support for minor compacting insert-only/MM tables (HIVE-22610).
Change-Id: I7159283f3658f2119d38bd3393729535edd0a76f
Reviewed-on: http://gerrit.cloudera.org:8080/19164
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>