This patch contains 2 parts:
1. When both conditions below are true, push down limit to
pre-aggregation
a) aggregation node has no aggregate function
b) aggregation node has no predicate
2. Finish the aggregation when the number of unique keys in the hash
table has exceeded the limit.
Sample queries:
SELECT DISTINCT f FROM t LIMIT n
The LIMIT can be passed all the way down to the pre-aggregation, which
leads to a nearly unbounded speedup on such queries against large
tables when n is low.
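As a toy illustration of part 2 (hypothetical code, not Impala's
GroupingAggregator): since conditions a) and b) guarantee that rows
beyond the first n distinct keys cannot change the result, the
pre-aggregation can stop consuming input once its hash table holds n
unique keys.

#include <cstdint>
#include <unordered_set>
#include <vector>

// Collect at most 'limit' distinct keys; stop early once the limit is hit.
std::vector<int64_t> DistinctWithLimit(const std::vector<int64_t>& keys,
                                       int64_t limit) {
  std::unordered_set<int64_t> seen;
  for (int64_t k : keys) {
    seen.insert(k);
    // Safe only because there is no aggregate function and no predicate.
    if (static_cast<int64_t>(seen.size()) >= limit) break;
  }
  return std::vector<int64_t>(seen.begin(), seen.end());
}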
Testing:
Add test targeted-perf/queries/aggregation.test
Pass core test
Change-Id: I930a6cb203615acfc03f23118d1bc1f0ea360995
Reviewed-on: http://gerrit.cloudera.org:8080/17821
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In the planning phase, the planner collects and generates min-max
predicates that can be evaluated on Parquet file statistics. We can
easily extend this to ORC tables.
This commit implements min/max predicate pushdown for the ORC scanner
leveraging the external ORC library's search arguments. We build
the search arguments when we open the scanner since we don't need to
modify them later.
Also added a new query option, orc_read_statistics, similar to
parquet_read_statistics. If the option is set to true (it is by default),
predicate pushdown will take effect; otherwise it will be skipped. The
predicates are evaluated at the ORC row group level, i.e. by default for
every 10,000 rows.
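As a rough sketch of what building such a search argument looks like
(assuming the ORC C++ library's SearchArgumentFactory/
SearchArgumentBuilder interface from orc/sargs/SearchArgument.hh; the
column name and literal are made up, and this is not the actual
scanner code):

#include <memory>
#include "orc/sargs/SearchArgument.hh"

std::unique_ptr<orc::SearchArgument> BuildExampleSarg() {
  std::unique_ptr<orc::SearchArgumentBuilder> builder =
      orc::SearchArgumentFactory::newBuilder();
  // Push "col <= 10" so the ORC reader can skip row groups whose min > 10.
  builder->startAnd()
      .lessThanEquals("col", orc::PredicateDataType::LONG,
                      orc::Literal(static_cast<int64_t>(10)))
      .end();
  return builder->build();
}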
Limitations:
- Min-max predicates on CHAR/VARCHAR types are not pushed down due to
inconsistent behaviors on padding/truncating between Hive and Impala.
(IMPALA-10882)
- Min-max predicates on TIMESTAMP are not pushed down (IMPALA-10915).
- Min-max predicates having different arg types are not pushed down
(IMPALA-10916).
- Min-max predicates with non-literal const exprs are not pushed down
since SearchArgument interfaces only accept literals. This only
happens when expr rewrites are disabled thus constant folding is
disabled.
Tests:
- Add e2e tests similar to test_parquet_stats to verify that
predicates are pushed down.
- Run CORE tests
- Run TPCH benchmark; there is neither improvement nor regression.
  On the other hand, certain selective queries gained a significant
  speed-up, e.g. select count(*) from lineitem where l_orderkey = 1.
Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Reviewed-on: http://gerrit.cloudera.org:8080/15403
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example, by default, it will mask uppercase to 'X', lowercase to 'x',
digits to 'n' and leave other characters unmasked. For masking all
characters to '*', we can use
mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte by byte, which gives
results inconsistent with Hive when the string contains multi-byte
unicode characters:
mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded into 3 bytes in UTF-8 so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.
This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.
Implementation
- Extends the existing MaskTransform function to deal with unicode code
  points (represented by uint32_t).
- Extends the existing GetFirstChar function to get the code point of
  the given masked characters in UTF-8 mode.
- Implements a MaskSubStrUtf8 method as the core functionality.
- Switches to MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
- For better testing, this patch also adds an overload for all mask
  functions that only masks other chars while keeping the
  upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').
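A minimal sketch of the per-code-point idea for the all-'*' case (not
the actual MaskSubStrUtf8 implementation): one output character is
emitted per code point by skipping UTF-8 continuation bytes.

#include <string>

// Masks every code point of a UTF-8 string with 'masked_char'.
// Continuation bytes (10xxxxxx) do not start a code point and are
// skipped, so masking '中国' produces "**" rather than "******".
std::string MaskAllUtf8(const std::string& s, char masked_char) {
  std::string out;
  for (unsigned char c : s) {
    if ((c & 0xC0) == 0x80) continue;
    out.push_back(masked_char);
  }
  return out;
}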
Tests
- Add BE tests in expr-test
- Add e2e tests in utf8-string-functions.test
Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements the functionality to allow structs in the select
list of inline views and topmost blocks. When displaying the value of a
struct it is formatted into a JSON value and returned as a string. An
example of such a value:
SELECT struct_col FROM some_table;
'{"int_struct_member":12,"string_struct_member":"string value"}'
Another example where we query a nested struct:
SELECT outer_struct_col FROM some_table;
'{"inner_struct":{"string_member":"string value","int_member":12}}'
Note, the conversion from struct to JSON happens on the server side
before sending out the value to the client over HS2. However, HS2 is
capable of handling struct values as well, so in a later change we might
want to add functionality to send the struct in Thrift to the client
so that the client can use the struct directly.
-- Internal representation of a struct:
When scanning a struct, the row batch will hold the values of the
struct's children as if they were queried one by one directly in the
select list.
E.g. Taking the following table:
CREATE TABLE tbl (id int, s struct<a:int,b:string>) STORED AS ORC
And running the following query:
SELECT id, s FROM tbl;
After scanning, a row in a row batch will hold the following values
(note, the biggest size comes first):
1: The pointer for the string in s.b
2: The length for the string in s.b
3: The int value for s.a
4: The int value of id
5: A single null byte for all the slots: id, s, s.a, s.b
The size of a struct has an effect on the order of the memory layout of
a row batch. The struct size is calculated by summing the size of its
fields and then the struct gets a place in the row batch to precede all
smaller slots by size. Note, all the fields of a struct are consecutive
to each other in the row batch. Inside a struct the order of the fields
is also based on their size, as in the regular case for primitives.
When evaluating a struct as a SlotRef a newly introduced StructVal will
be used to refer to the actual values of a struct in the row batch.
This StructVal holds a vector of pointers where each pointer represents
a member of the struct. Following the above example the StructVal would
keep two pointers, one to point to an IntVal and one to point to a
StringVal.
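A minimal sketch of such a value type (hypothetical, not the actual
Impala StructVal definition):

#include <cstdint>
#include <vector>

// One entry per struct member; each pointer refers to the member's
// value (e.g. an IntVal or a StringVal) stored in the row batch.
struct StructValSketch {
  bool is_null = false;
  std::vector<uint8_t*> ptrs;
};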
-- Changes related to tuple and slot descriptors:
When providing a struct in the select list there is going to be a
SlotDescriptor for the struct slot in the topmost TupleDescriptor.
Additionally, another TupleDescriptor is created to hold SlotDescriptors
for each of the struct's children. The struct SlotDescriptor points to
the newly introduced TupleDescriptor using 'itemTupleId'.
The offsets for the children of the struct are calculated from the
beginning of the topmost TupleDescriptor and not from the
TupleDescriptor that directly holds the struct's children. The null
indicator bytes as well are stored on the level of the topmost
TupleDescriptor.
-- Changes related to scalar expressions:
A struct in the select list is translated into an expression tree where
the top of this tree is a SlotRef for the struct itself and its
children in the tree are SlotRefs for the members of the struct. When
evaluating a struct SlotRef, after the null checks the evaluation is
delegated to the children SlotRefs.
-- Restrictions:
- Codegen support is not included in this patch.
- Only ORC file format is supported by this patch.
- Only HS2 client supports returning structs. Beeswax support is not
implemented as it is going to be deprecated anyway. Currently we
receive an error when trying to query a struct through Beeswax.
-- Tests added:
- The ORC and Parquet functional databases are extended with 3 new
tables:
1: A small table with one-level structs, holding different
   kinds of primitive types as members.
2: A small table with 2 and 3 level nested structs.
3: A bigger, partitioned table constructed from alltypes where all
the columns except the 'id' column are put into a struct.
- struct-in-select-list.test and nested-struct-in-select-list.test
  use these new tables to query structs directly or through an
  inline view.
Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a
Reviewed-on: http://gerrit.cloudera.org:8080/17638
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In earlier versions of Impala we had a bug that affected
insertions to Iceberg tables. When Impala wrote multiple
files during a single INSERT statement it could crash, or
even worse, it could silently omit data files from the
Iceberg metadata.
The current master doesn't have this bug, but we don't
really have tests for this case.
This patch adds tests that write many files during inserts
to an Iceberg table. Both non-partitioned and partitioned
Iceberg tables are tested.
We achieve writing lots of files by setting 'parquet_file_size'
to 8 megabytes.
Testing:
* added e2e test that writes many data files
* added exhaustive e2e test that writes even more data files
Change-Id: Ia2dbc2c5f9574153842af308a61f9d91994d067b
Reviewed-on: http://gerrit.cloudera.org:8080/17831
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds parquet stats to iceberg manifest as per-datafile
metrics.
The following metrics are supported:
- column_sizes :
Map from column id to the total size on disk of all regions that
store the column. Does not include bytes necessary to read other
columns, like footers.
- null_value_counts :
Map from column id to number of null values in the column.
- lower_bounds :
Map from column id to lower bound in the column serialized as
binary. Each value must be less than or equal to all non-null,
non-NaN values in the column for the file.
- upper_bounds :
Map from column id to upper bound in the column serialized as
binary. Each value must be greater than or equal to all non-null,
non-NaN values in the column for the file.
The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).
Testing:
- New e2e test was added to verify that the metrics are written to the
Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.
Relevant Iceberg documentation:
- Manifest:
https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
serialized to binary:
https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
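For illustration, per Appendix D an int bound is stored as its 4-byte
little-endian representation; a minimal sketch (not Impala's writer
code, and assuming a little-endian host):

#include <cstdint>
#include <cstring>
#include <string>

// Serializes an INT lower/upper bound the way Iceberg's single-value
// serialization describes for the 'int' type: 4 bytes, little-endian.
std::string SerializeIntBound(int32_t bound) {
  char buf[sizeof(bound)];
  std::memcpy(buf, &bound, sizeof(bound));
  return std::string(buf, sizeof(buf));
}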
Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
This patch adds support "FOR SYSTEM_TIME AS OF" and
"FOR SYSTEM_VERSION AS OF" clauses for Iceberg tables. The new
clauses are part of the table ref. FOR SYSTEM_TIME AS OF conforms to the
SQL2011 standard:
https://cs.ulb.ac.be/public/_media/teaching/infoh415/tempfeaturessql2011.pdf
With FOR SYSTEM_TIME AS OF we can query a table at a specific time
point, e.g. we can retrieve what the table content was 1 day ago.
The timestamp given to "FOR SYSTEM_TIME AS OF" is interpreted in the
local timezone. The local timezone can be set via the query option
TIMEZONE. By default the timezone being used is the coordinator node's
local timezone. The timestamp is translated to UTC because table
snapshots are tagged with UTC timestamps.
"FOR SYSTEM_VERSION AS OF" is a non-standard extension. It works
similarly to FOR SYSTEM_TIME AS OF, but with this clause we can query
a table via a snapshot ID instead of a timestamp.
HIVE-25344 also added support for these clauses to Hive.
Table snapshot IDs and timestamp information can be queried with the
help of the DESCRIBE HISTORY command.
Sample queries:
SELECT * FROM t FOR SYSTEM_TIME AS OF now();
SELECT * FROM t FOR SYSTEM_TIME AS OF '2021-08-10 11:02:34';
SELECT * FROM t FOR SYSTEM_TIME AS OF now() - interval 10 days + interval 3 hours;
SELECT * FROM t FOR SYSTEM_VERSION AS OF 7080861547601448759;
SELECT * FROM t FOR SYSTEM_TIME AS OF now()
MINUS
SELECT * FROM t FOR SYSTEM_TIME AS OF now() - interval 1 days;
This patch uses some parts of the in-progress
IMPALA-9773 (https://gerrit.cloudera.org/#/c/13342/) developed by
Todd Lipcon and Grant Henke. This patch also resolves some TODOs of
IMPALA-9773, i.e. after this patch it'll be easier to add
time travel for Kudu tables as well.
Testing:
* added parser tests (ParserTest.java)
* added analyzer tests (AnalyzeStmtsTest.java)
* added e2e tests (test_iceberg.py)
Change-Id: Ib523c5e47b8d9c377bea39a82fe20249177cf824
Reviewed-on: http://gerrit.cloudera.org:8080/17765
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The HashTable implementation in Impala comprises a contiguous array
of Buckets, and each Bucket contains either data or a pointer to a
linked list of duplicate entries named DuplicateNodes.
These are the structures of Bucket and DuplicateNode:
struct DuplicateNode {
  bool matched;
  DuplicateNode* next;
  HtData htdata;
};

struct Bucket {
  bool filled;
  bool matched;
  bool hasDuplicates;
  uint32_t hash;
  union {
    HtData htdata;
    DuplicateNode* duplicates;
  } bucketData;
};
The size of Bucket is currently 16 bytes and the size of DuplicateNode
is 24 bytes. If we can remove the booleans from both structs, the size
of Bucket would reduce to 12 bytes and DuplicateNode to 16 bytes.
One of the ways we can remove the booleans is to fold them into
pointers already part of the structs. Pointers store addresses, and on
architectures like x86 and ARM the linear address is only 48 bits
long. With level 5 paging Intel is planning to expand it to 57 bits,
which means we can use the most significant 7 bits, i.e., bits 58 to
64, to store these booleans. This patch reduces the size of Bucket
and DuplicateNode by implementing this folding. However, there is
another requirement that the size of Bucket be a power of 2 and
also that the number of buckets in the hash table be a power of 2.
These requirements exist for the following reasons:
1. The memory allocator allocates memory in powers of 2 to avoid
   internal fragmentation. Hence, num buckets * sizeof(Bucket)
   should be a power of 2.
2. The number of buckets being a power of 2 enables a faster modulo
   operation, i.e., instead of the slow modulo (hash % N), the faster
   (hash & (N-1)) can be used.
Due to this, the 4-byte 'hash' field is removed from Bucket and
stored separately in a new array hash_array_ in HashTable.
This ensures sizeof(Bucket) is 8, which is a power of 2.
New Classes:
------------
As a part of this patch, TaggedPointer is introduced, which is a
template class that stores a pointer and a 7-bit tag together in a
64-bit integer. This class has ownership of the pointer and takes care
of allocation and deallocation of the object being pointed to.
However, derived classes can opt out of ownership of the object
and let the client manage it. Its derived classes for Bucket and
DuplicateNode do the same. These classes are TaggedBucketData and
TaggedDuplicateNode.
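A minimal sketch of the packing idea (not Impala's actual
TaggedPointer, which also deals with ownership as described above):

#include <cstdint>

// Packs a pointer and a 7-bit tag into one 64-bit word, relying on the
// upper bits of current 48/57-bit canonical addresses being unused.
template <typename T>
class TaggedPtrSketch {
 public:
  static constexpr int kTagShift = 57;  // bits 58..64 (1-based) hold the tag
  static constexpr uint64_t kPtrMask = (1ULL << kTagShift) - 1;

  void Set(T* ptr, uint8_t tag) {
    data_ = (reinterpret_cast<uint64_t>(ptr) & kPtrMask)
        | (static_cast<uint64_t>(tag & 0x7F) << kTagShift);
  }
  T* ptr() const { return reinterpret_cast<T*>(data_ & kPtrMask); }
  uint8_t tag() const { return static_cast<uint8_t>(data_ >> kTagShift); }

 private:
  uint64_t data_ = 0;
};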
Benchmark:
----------
As a part of this patch a new micro benchmark for HashTable has
been introduced, which helps in measuring:
1. Runtime for building the hash table and probing it.
2. Memory consumed after building the table.
This helps measure the impact of changes to the HashTable's
data structure and algorithm.
Saw 25-30% reduction in memory consumed and no significant
difference in performance (0.91X-1.2X).
Other Benchmarks:
1. Billion row Synthetic benchmark on single node, single daemon:
a. 2-3% improvement in Join GEOMEAN for Probe benchmark.
b. 17% and 21% reduction in PeakMemoryUsage and
CumulativeBytes allocated respectively
2. TPCH-42: 0-1.5% improvement in GEOMEAN runtime
Change-Id: I72912ae9353b0d567a976ca712d2d193e035df9b
Reviewed-on: http://gerrit.cloudera.org:8080/17592
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A LIKE predicate is generally evaluated by converting it into a regex
that is evaluated at execution time. If the predicate of a LIKE clause
is a constant (which is the common case when you say "row
like 'start%'") then there are optimizations where some cases that are
simpler than a regex are spotted, and a simpler function than a regex
evaluator is used. One example is that a predicate such as 'start%' is
evaluated by looking for strings that begin with "start". Amusingly,
the code that spots the potential optimizations uses regexes to look
for patterns in the LIKE predicate. The code that looks for the
optimization where a simple prefix can be searched for does not deal
with the case where the '%' wildcard at the end of the predicate is
escaped. To fix this we add a check for the case where the
predicate ends in an escaped '%'.
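A sketch of the kind of check involved, using RE2 (the pattern below is
illustrative, not the one in the Impala source): the prefix
optimization should only fire when the trailing '%' is not escaped,
which can be expressed by letting the prefix part contain only
non-wildcard characters or escaped characters.

#include <re2/re2.h>
#include <string>

// Returns true (and captures the raw, still-escaped prefix) only when
// the pattern is "prefix%" with an unescaped trailing '%', e.g.
// "start%". Patterns like "start\%" or "sta%rt%" do not match.
bool IsSimplePrefixLike(const std::string& pattern, std::string* prefix) {
  static const re2::RE2 kPrefixRe("((?:[^%_\\\\]|\\\\.)*)%");
  return re2::RE2::FullMatch(pattern, kPrefixRe, prefix);
}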
There are some other problems with escaped wildcards discussed in
IMPALA-2422. This change does not fix these problems, which are hard.
New tests for escaped wildcards are added to exprs.test - note that
these tests cannot be part of the LikeTbl tests as the like predicate
optimizations are only applied when the like predicate is a string
literal.
Exhaustive tests ran clean.
Change-Id: I30356c19f4f169d99f7cc6268937653af6b41b70
Reviewed-on: http://gerrit.cloudera.org:8080/17798
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch enables min/max filtering for non-correlated subqueries
that return one value. In this case, the filters are built from the
results of the subqueries and the filtering target is the scan node to
be qualified by one of the subqueries. Shown below is one such query
that normally gets compiled into a nested loop join. The filtering
limits the values from column store_sales.ss_sales_price to be within
[-infinity, min(ss_wholesale_cost)].
select count(*) from store_sales
where ss_sales_price <=
(select min(ss_wholesale_cost) from store_sales);
In FE, the fact that the above scalar subquery exists is recorded
in a flag in InlineViewRef in analyzer and later on transferred to
AggregationNode in planner.
In BE, the min/max filtering infrastructure is integrated with the
nested loop join as follows.
1. NljBuilderConfig is populated with filter descriptors from nested
join plan node via NljBuilder::CreateEmbeddedBuilder() (similar
to hash join), or in NljBuilderConfig::Init() when the sink config
is created (for separate builder case);
2. NljBuilder is populated with filter contexts utilizing the filter
descriptors in NljBuilderConfig. Filter contexts are the interface
to actual min/max filters;
3. New insertion methods InsertFor<op>(), where <op> is LE, LT, GE and
   GT, are added to the MinMaxFilter class hierarchy (a sketch follows
   this list). They are used for join predicates of the form
   target <op> src_expr;
4. RuntimeContext::InsertPerCompareOp() calls one of the new
insertion methods above based on the comparison op saved in the
filter descriptor;
5. NljBuilder::InsertRuntimeFilters() calls the new methods.
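A minimal sketch of the per-op insertion semantics referenced in step 3
(hypothetical filter struct, not Impala's MinMaxFilter hierarchy):

#include <algorithm>
#include <cstdint>
#include <limits>

struct MinMaxFilterSketch {
  bool empty = true;
  int64_t min = 0;
  int64_t max = 0;

  // Join predicate "target <= src": a probe value can match if it is <=
  // the largest build-side value, so only the upper bound is constrained.
  void InsertForLE(int64_t src) {
    max = empty ? src : std::max(max, src);
    min = std::numeric_limits<int64_t>::min();
    empty = false;
  }
  // Join predicate "target >= src": symmetric, only the lower bound matters.
  void InsertForGE(int64_t src) {
    min = empty ? src : std::min(min, src);
    max = std::numeric_limits<int64_t>::max();
    empty = false;
  }
};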
By default, the feature is turned on only for sorted or partitioned
join columns.
Testing:
1. Add single range insertion tests in min-max-filter-test.cc;
2. Add positive and negative plan tests in
overlap_min_max_filters.test;
3. Add tests in overlap_min_max_filters_on_partition_columns.test;
4. Add tests in overlap_min_max_filters_on_sorted_columns.test;
5. Run core tests.
TODO in follow-up patches:
1. Extend min/max filter for inequality subquery for other use cases
(IMPALA-10869).
Change-Id: I7c2bb5baad622051d1002c9c162c672d428e5446
Reviewed-on: http://gerrit.cloudera.org:8080/17706
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
StringToFloatInternal is used to parse strings into floats. It had
logic to ensure it is faster than standard functions like strtod in
many cases, but it was not as accurate. We are replacing it with a
third party library named fast_double_parser, which is both fast and
doesn't sacrifice accuracy for speed. Benchmarking on more than
1 million rows where a string is cast to double found that the new
patch is on par with the earlier algorithm.
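A minimal usage sketch of the library (assuming its parse_number entry
point as documented in the project README; this is not the Impala
integration code):

#include <cstdio>
#include "fast_double_parser.h"

int main() {
  double value = 0.0;
  // parse_number reports failure via a null/false return value.
  if (fast_double_parser::parse_number("3.14159", &value)) {
    std::printf("parsed %.17g\n", value);
  }
  return 0;
}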
Results:
W/O library: Fetched 1222386 row(s) in 32.10s
With library: Fetched 1222386 row(s) in 31.71s
Testing:
1. Added test to check for accuracy improvement.
2. Ran existing Backend tests for correctness.
Change-Id: Ic105ad38a2fcbf2fb4e8ae8af6d9a8e251a9c141
Reviewed-on: http://gerrit.cloudera.org:8080/17389
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change allows usage of commands that do not require reading the
JSON file, like:
- Create Table <Table> stored as JSONFILE
- Show Create Table <Table>
- Describe <Table>
Changes:
- Added JSON as FileFormat to thrift and HdfsFileFormat.
- Allowing Sql keyword 'jsonfile' and mapping it to JSON format.
- Adding JSON serDe.
- JSON files have the same input format as TextFile, so we need to use
  the SerDe library in use to differentiate between the two formats.
  Overloaded the functions querying the file format based on input
  format to consider the SerDe library too.
- Added tests for 'Create Table' and 'Show Create Table' commands
Pending Changes:
- test for Describe command - to be added with backend changes.
Change-Id: I5b8cb2f59df3af09902b49d3bdac16c19954b305
Reviewed-on: http://gerrit.cloudera.org:8080/17727
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive relies on the engine.hive.enabled=true table property being set
for Iceberg tables. Without it, Hive overwrites the table metadata with
a different storage handler and SerDe/Input/OutputFormatter when it
writes the table, making it unusable.
With this patch Impala sets this table property during table creation.
Testing:
* updated show-create-table.test
* tested Impala/Hive interop manually
Change-Id: I6aa0240829697a27f48d0defcce48920a5d6f49b
Reviewed-on: http://gerrit.cloudera.org:8080/17750
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The previous patch added checks on illegal decimal schemas of Parquet
files. However, it didn't return a non-OK status in
ParquetMetadataUtils::ValidateColumn if abort_on_error is set to false,
so we continued to use the illegal file schema and hit a DCHECK.
This patch fixes this and adds test coverage for illegal decimal
schemas.
Tests:
- Add a bad parquet file with illegal decimal schemas.
- Add e2e tests on the file.
- Ran test_fuzz_decimal_tbl 100 times. Saw the errors are caught as
expected.
Change-Id: I623f255a7f40be57bfa4ade98827842cee6f1fee
Reviewed-on: http://gerrit.cloudera.org:8080/17748
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
With this patch Impala will support partition evolution for
Iceberg tables.
The DDL statement to change the default partition spec is:
ALTER TABLE <tbl> SET PARTITION SPEC(<partition-spec>)
Hive uses the same SQL syntax.
Testing:
- Added FE test to exercise parsing various well-formed and ill-formed
ALTER TABLE SET PARTITION SPEC statements.
- Added e2e tests for:
- ALTER TABLE SET PARTITION SPEC works for tables with HadoopTables
and HadoopCatalog Catalog.
- When evolving partition spec, the old data written with an earlier
spec remains unchanged. New data is written using the new spec in
a new layout. Data written with earlier spec and new spec can be
fetched in a single query.
- Invalid ALTER TABLE SET PARTITION SPEC statements yield the
expected analysis error messages.
Change-Id: I9bd935b8a82e977df9ee90d464b5fe2a7acc83f2
Reviewed-on: http://gerrit.cloudera.org:8080/17723
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for the following standard Iceberg properties:
write.parquet.compression-codec:
Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
(default value), LZ4, ZSTD. The table property will be ignored if
COMPRESSION_CODEC query option is set.
write.parquet.compression-level:
Parquet compression level. Used with ZSTD compression only.
Supported range is [1, 22]. Default value is 3. The table property
will be ignored if COMPRESSION_CODEC query option is set.
write.parquet.row-group-size-bytes :
Parquet row group size in bytes. Supported range is [8388608,
2146435072] (8MB - 2047MB). The table property will be ignored if
PARQUET_FILE_SIZE query option is set.
If neither the table property nor the PARQUET_FILE_SIZE query option
is set, the way Impala calculates row group size will remain
unchanged.
write.parquet.page-size-bytes:
Parquet page size in bytes. Used for PLAIN encoding. Supported range
is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates page size
will remain unchanged.
write.parquet.dict-size-bytes:
Parquet dictionary page size in bytes. Used for dictionary encoding.
Supported range is [65536, 1073741824] (64KB - 1GB).
If the table property is unset, the way Impala calculates dictionary
page size will remain unchanged.
This patch also renames 'iceberg.file_format' table property to
'write.format.default' which is the standard Iceberg name for the
table property.
Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch addresses a failure in the ubuntu-16.04 dockerised test. The
test involved is found in overlap_min_max_filters_on_sorted_columns.test,
as follows.
set minmax_filter_fast_code_path=on;
set MINMAX_FILTER_THRESHOLD=0.0;
SET RUNTIME_FILTER_WAIT_TIME_MS=$RUNTIME_FILTER_WAIT_TIME_MS;
select straight_join count(a.timestamp_col) from
alltypes_timestamp_col_only a join [SHUFFLE] alltypes_limited b
where a.timestamp_col = b.timestamp_col and b.tinyint_col = 4;
---- RUNTIME_PROFILE
aggregation(SUM, NumRuntimeFilteredPages)> 57
The patch reduces the threshold from 58 to 50.
Testing:
Ran the unit test successfully.
Change-Id: Icb4cc7d533139c4a2b46a872234a47d46cb8a17c
Reviewed-on: http://gerrit.cloudera.org:8080/17696
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Similar to the previous patch, this patch adds UTF-8 support in the
instr() and locate() builtin functions so they have behavior consistent
with Hive's. These two string functions both have optional position
arguments:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.
In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.
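Conceptually (a sketch, not the actual implementation), the character
position of a match found at some byte offset can be computed by
counting the bytes that start a code point:

#include <cstddef>
#include <cstdint>
#include <string>

// Maps a 0-based byte offset in a UTF-8 string to a 1-based character
// position by counting non-continuation bytes before the offset.
int Utf8CharPosition(const std::string& s, size_t byte_offset) {
  int pos = 0;
  for (size_t i = 0; i < byte_offset && i < s.size(); ++i) {
    if ((static_cast<uint8_t>(s[i]) & 0xC0) != 0x80) ++pos;
  }
  return pos + 1;
}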
Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes to U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.
Tests:
- Add BE unit tests and e2e tests
- Add random tests to make sure malformed UTF-8 characters won't crash
us.
Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Two Iceberg commits got into master branch in parallel. One of
them modified the DDL syntax, the other one added some tests.
They were correct on their own, but mixing the two causes
test failures.
The affected tests have been updated.
Change-Id: Id3cf6ff04b8da5782df2b84a580cdbd4a4a16d06
Reviewed-on: http://gerrit.cloudera.org:8080/17689
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch enables min/max filtering on any Z-order sort-by columns
by default.
Since the column stats for a row group or a page are computed from the
column values stored in the row group or the page, the current
infrastructure for min/max filtering works for Z-order out of the box.
The fact that these column values are ordered by Z-order is
orthogonal to the work of min/max filtering.
By default, the new feature is enabled. Set the existing control knob
minmax_filter_sorted_columns to false to turn it off.
Testing
1. Added new z-order related sort column tests in
overlap_min_max_filters_on_sorted_columns.test;
2. Ran core-test.
Change-Id: I2a528ffbd0e333721ef38b4be7d4ddcdbf188adf
Reviewed-on: http://gerrit.cloudera.org:8080/17635
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Iceberg recently switched to use its Catalogs class to define
catalog and table properties. Catalog information is stored in
a configuration file such as hive-site.xml, and the table properties
contain information about which catalog is being used and what the
Iceberg table id is.
E.g. in the Hive conf we can have the following properties to define
catalogs:
iceberg.catalog.<catalog_name>.type = hadoop
iceberg.catalog.<catalog_name>.warehouse = somelocation
or
iceberg.catalog.<catalog_name>.type = hive
And at the table level we can have the following:
iceberg.catalog = <catalog_name>
name = <table_identifier>
Table property 'iceberg.catalog' refers to a Catalog defined in the
configuration file. This is in contradiction with Impala's current
behavior where we are already using 'iceberg.catalog', and it can
have the following values:
* hive.catalog for HiveCatalog
* hadoop.catalog for HadoopCatalog
* hadoop.tables for HadoopTables
To be backward compatible and also support the new Catalogs properties,
Impala still recognizes the above special values. But from now on,
Impala doesn't define 'iceberg.catalog' by default. 'iceberg.catalog' being
NULL means HiveCatalog for both Impala and Iceberg's Catalogs API,
hence for Hive and Spark as well.
If 'iceberg.catalog' has a different value than the special values it
indicates that Iceberg's Catalogs API is being used, so Impala will
try to look up the catalog configuration from the Hive config file.
Testing:
* added SHOW CREATE TABLE tests
* added e2e tests that create/insert/drop Iceberg tables with Catalogs
* manually tested interop behavior with Hive
Change-Id: I5dfa150986117fc55b28034c4eda38a736460ead
Reviewed-on: http://gerrit.cloudera.org:8080/17466
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently we have a DDL syntax for defining Iceberg partitions that
differs from SparkSQL:
https://iceberg.apache.org/spark-ddl/#partitioned-by
E.g. Impala is using the following syntax:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
STORED AS ICEBERG;
The same in Spark is:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
USING ICEBERG
PARTITIONED BY (bucket(5, i), months(ts), years(d))
HIVE-25179 added the following syntax for Hive:
CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED BY ICEBERG;
I.e. the same syntax as Spark, but adding the keyword "SPEC".
This patch makes Impala use Hive's syntax, i.e. we will also
use the PARTITIONED BY SPEC clause + the unified partition
transform syntax.
Testing:
* existing tests have been rewritten with the new syntax
Change-Id: Ib72ae445fd68fb0ab75d87b34779dbab922bbc62
Reviewed-on: http://gerrit.cloudera.org:8080/17575
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.
Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/
A new utility class, ParquetDataConverter, has been introduced which
does the data conversion. It also helps to decide whether data
conversion is needed or not.
NULL values are returned in case of overflows. This behavior is
consistent with Hive.
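A simplified sketch of such a conversion (not the actual
ParquetDataConverter; it truncates rather than rounds, uses int64
storage only, and assumes a target precision of at most 18 digits):

#include <cstdint>

static int64_t Pow10(int n) {
  int64_t p = 1;
  while (n-- > 0) p *= 10;
  return p;
}

// Rescales an unscaled decimal value from 'src_scale' to 'dst_scale'
// and checks that it fits into 'dst_precision' digits. Returns false on
// overflow, in which case the caller would produce NULL.
bool ConvertDecimal(int64_t val, int src_scale, int dst_precision,
                    int dst_scale, int64_t* out) {
  if (dst_scale >= src_scale) {
    int64_t factor = Pow10(dst_scale - src_scale);
    if (val > INT64_MAX / factor || val < INT64_MIN / factor) return false;
    *out = val * factor;
  } else {
    *out = val / Pow10(src_scale - dst_scale);  // truncates extra digits
  }
  int64_t max_abs = Pow10(dst_precision) - 1;
  return *out <= max_abs && *out >= -max_abs;
}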
Parquet column stats reader is also updated to convert the decimal
values. The stats reader is used to evaluate min/max conjuncts. It
works well because later we also evaluate the conjuncts on the
converted values anyway.
The status of different filterings:
* dictionary filtering: disabled for columns that need conversion
* runtime bloom filters: work on the converted values
* runtime min/max filters: work on the converted values
This patch also enables schema evolution of decimal columns of Iceberg
tables.
Testing:
* added e2e tests
Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Reviewed-on: http://gerrit.cloudera.org:8080/17678
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
This patch enables min/max filters for partitioned columns to take
advantage of the min/max filter infrastructure already built and to
provide coverage for certain equi-joins in which the stats filters
are not feasible.
The new feature is turned on by default and to turn off the feature,
set the new query option minmax_filter_partition_column to false.
In the patch, the existing query option enabled_runtime_filter_types
is enforced in specifying the types of the filters generated. The
default value ALL generates both the bloom and min/max filters. The
alternative value BLOOM generates only the bloom filters and another
alternative value MIN_MAX generates only the min/max filters.
The normal control knobs minmax_filter_threshold (for threshold) and
minmax_filtering_level (for filtering level) still work. When the
threshold is 0, the patch automatically assigns a reasonable value
for the threshold.
Testing:
1). Added new tests in
overlap_min_max_filters_on_partition_columns.test;
2). Core tests
Change-Id: I89e135ef48b4bb36d70075287b03d1c12496b042
Reviewed-on: http://gerrit.cloudera.org:8080/17568
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch bumps up the GBN to 14842939. This build
includes HIVE-23995 and HIVE-24175, and some of the tests
were modified to take that into account.
Also fixes a minor bug in environ.py.
Testing done:
1. Core tests.
Change-Id: I78f167c1c0d8e90808e387aba0e86b697067ed8f
Reviewed-on: http://gerrit.cloudera.org:8080/17628
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
This patch addresses the following failure in ubuntu 16.04 dockerised
test:
select straight_join count(a.timestamp_col) from
alltypes_timestamp_col_only a join [SHUFFLE] alltypes_limited b
where a.timestamp_col = b.timestamp_col and b.tinyint_col = 4
aggregation(SUM, NumRuntimeFilteredPages): 58
EXPECTED VALUE:
58
ACTUAL VALUE:
59
OP:
:
In the patch, the result expectation is altered from "==58" to ">57".
Testing:
1). Ran test_overlap_min_max_filters_on_sorted_columns multiple
times under regular and local catalog mode.
Change-Id: I4f9eb198dc4e4b0ad1a17696a1d74ff05ac0a436
Reviewed-on: http://gerrit.cloudera.org:8080/17618
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-10166 (part 1) already added the necessary code for
DROP and CHANGE COLUMN, but disabled those stmts because to correctly
support schema evolution we had to wait for column resolution
by Iceberg field id.
Since then IMPALA-10361 and IMPALA-10485 added support for field-id
based column resolution for Parquet and ORC as well.
Hence this patch enables DROP and CHANGE column ALTER TABLE
statements. We still disallow REPLACE COLUMNS because it doesn't
really make sense for Iceberg tables as it basically makes all
existing data inaccessible.
Changing DECIMAL columns is still disabled due to IMPALA-7087.
Testing:
* added e2e tests
Change-Id: I9b0d1a55bf0ed718724a69b51392ed53680ffa90
Reviewed-on: http://gerrit.cloudera.org:8080/17593
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
This patch enables min/max filters for equi-joins on lexical sort-by
columns in a Parquet table created by Impala by default. This is to
take advantage of Impala sorting the min/max values in column index
in each data file for the table. The control knob is query option
minmax_filter_sorted_columns, default to true.
When minmax_filter_sorted_columns is true, the patch will generate
min/max filters only for the leading sort columns. The normal control
knobs minmax_filter_threshold (for threshold) and
minmax_filtering_level (for filtering level) still work. When the
threshold is 0, the patch automatically assigns a reasonable value
for the threshold, and selects PAGE to be the filtering level.
In the backend, the skipped pages are quickly found by taking a
fast code path to identify the corresponding lower and the upper
bounds in the sorted min and max value arrays, given a range in the
filter. The skipped pages are expressed as page ranges which are
translated into row ranges later on.
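The fast code path boils down to binary searches over the sorted
per-page min/max arrays; a minimal sketch (not the actual
implementation):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// For a column sorted ascending, both the per-page mins and maxes are
// non-decreasing, so the pages that can overlap [filter_min, filter_max]
// form one contiguous range [first, last).
std::pair<int, int> OverlappingPageRange(const std::vector<int64_t>& page_mins,
                                         const std::vector<int64_t>& page_maxs,
                                         int64_t filter_min,
                                         int64_t filter_max) {
  int first = std::lower_bound(page_maxs.begin(), page_maxs.end(), filter_min) -
              page_maxs.begin();  // first page whose max >= filter_min
  int last = std::upper_bound(page_mins.begin(), page_mins.end(), filter_max) -
             page_mins.begin();   // one past the last page whose min <= filter_max
  return {first, last};
}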
A new query option minmax_filter_fast_code_path is added to control
the work of the fast code path. It can take three values: ON (default),
OFF, or VERIFICATION. The last helps verify that the results
from the fast and the regular code paths are the same.
Preliminary performance testing (joining into a simplified TPCH
lineitem table of 2 sorted BIGINT columns and a total of 6001215
rows) confirms that min/max filtering on leading sort-by columns
improves the performance of scan operators greatly. The best result
is seen with pages containing no more than 24000 rows: 84.62ms
(page-level filtering) vs. 115.27ms (row-group-level filtering)
vs. 137.14ms (no filtering). The query utilized is as follows.
select straight_join a.l_orderkey from
simpflified_lineitem a join [SHUFFLE] tpch_parquet.lineitem b
where a.l_orderkey = b.l_orderkey and b.l_receiptdate = "1998-12-31"
Also fixed in the patch are the abnormal min/max display in the "Final
filter table" section of a profile for the DECIMAL, TIMESTAMP and DATE
data types, and reading the DATE column index in batch without
validation.
Testing:
1). Added a new test overlap_min_max_filters_on_sorted_columns.test
to verify
a) Min/max filters are only created for leading sort by column;
b) Query option minmax_filter_sorted_columns works;
c) Query option minmax_filter_fast_code_path works.
2). Added new tests in parquet-page-index-test.cc to test fast
code path under various conditions;
3). Ran core tests successfully.
Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Reviewed-on: http://gerrit.cloudera.org:8080/17478
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch restores the previous testing environment, in which exchange
node 5 in the test query shown below reliably reports a memory limit
exceeded error.
set mem_limit=171520k;
set num_scanner_threads=1;
select *
from tpch_parquet.lineitem l1
join tpch_parquet.lineitem l2 on l1.l_orderkey = l2.l_orderkey and
l1.l_partkey = l2.l_partkey and l1.l_suppkey = l2.l_suppkey
and l1.l_linenumber = l2.l_linenumber
order by l1.l_orderkey desc, l1.l_partkey, l1.l_suppkey, l1.l_linenumber
limit 5
In the patch, the memory limit for the entire query is lowered to 164MB,
less than 171520k, the minimum for the query to be accepted after
result spooling is enabled by default. This is achieved by
disabling result spooling, i.e. setting spool_query_results to false.
Testing:
1). Ran exchange-mem-scaling.test
Change-Id: Ia4ad4508028645b67de419cfdfa2327d2847cfc4
Reviewed-on: http://gerrit.cloudera.org:8080/17586
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A built-in function has been added to compute the MD5 128-bit checksum
of a non-null string. If the input string is null, then the output of
the function is null too. In FIPS mode, MD5 is disabled and the
function will throw an error on invocation.
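For reference, the digest itself is the standard MD5 digest; a minimal
sketch (not the Impala built-in's implementation) of producing the hex
digest with OpenSSL, which is assumed here as the crypto library:

#include <openssl/md5.h>
#include <cstdio>
#include <string>

// Returns the 128-bit MD5 digest of 'input' as a 32-character hex string.
std::string Md5Hex(const std::string& input) {
  unsigned char digest[MD5_DIGEST_LENGTH];
  MD5(reinterpret_cast<const unsigned char*>(input.data()), input.size(),
      digest);
  char hex[2 * MD5_DIGEST_LENGTH + 1];
  for (int i = 0; i < MD5_DIGEST_LENGTH; ++i) {
    std::snprintf(hex + 2 * i, 3, "%02x", digest[i]);
  }
  return std::string(hex, 2 * MD5_DIGEST_LENGTH);
}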
Testing:
1. Added expression unit tests.
2. Added end-to-end tests for MD5.
Change-Id: Id406d30a7cc6573212b302fbfec43eb848352ff2
Reviewed-on: http://gerrit.cloudera.org:8080/17567
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the ability to unset or delete table properties or serde
properties for a table. It supports an 'IF EXISTS' clause in case users
are not sure if the property being unset exists. Without 'IF EXISTS',
trying to unset a property that doesn't exist will fail.
Tests:
1. Added Unit tests and end-to-end tests
2. Covered tables of different storage types like Kudu,
   Iceberg and HDFS tables.
Change-Id: Ife4f6561dcdcd20c76eb299c6661c778e342509d
Reviewed-on: http://gerrit.cloudera.org:8080/17530
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives two strings that are serialized Apache
DataSketches CPC sketches. It unions the two sketches and returns the
resulting sketch of the union.
Example:
select ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2))
from sketch_tbl;
+---------------------------------------------------+
| ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) |
+---------------------------------------------------+
| 15 |
+---------------------------------------------------+
Change-Id: Ib5c616316bf2bf2ff437678e9a44a15339920150
Reviewed-on: http://gerrit.cloudera.org:8080/17440
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:
 -----------------------------------------
| Parquet type | Impala type              |
|-----------------------------------------|
| INT32        | TINYINT, SMALLINT, INT   |
| INT64        | BIGINT                   |
| FLOAT        | FLOAT                    |
| DOUBLE       | DOUBLE                   |
| BYTE_ARRAY   | STRING                   |
 -----------------------------------------
The following types are not supported for the given reasons:
 -----------------------------------------------------------------
| Impala type | Problem                                           |
|-----------------------------------------------------------------|
| VARCHAR(N)  | truncation can change hash                        |
| CHAR(N)     | padding / truncation can change hash              |
| DECIMAL     | multiple encodings supported                      |
| TIMESTAMP   | multiple encodings supported, timezone conversion |
| DATE        | not considered yet                                |
 -----------------------------------------------------------------
Support may be added for these types later, see IMPALA-10641.
If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.
Testing:
- Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
- Added unit tests for ParquetBloomFilter in file
be/src/util/parquet-bloom-filter-test.cc
- A minor, unrelated change was done in
be/src/util/bloom-filter-test.cc: the MakeRandom() function had
return type uint64_t, the documentation claimed it returned a 64 bit
random number, but the actual number of random bits is 32, which is
what is intended in the tests. The return type and documentation
have been corrected to use 32 bits.
Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
When a truncate table command is issued, in the case of
non-transactional tables the table and column statistics for the table
are also deleted by default. This can be an expensive operation,
especially when many truncate table commands are running concurrently.
As the concurrency increases, the response time from the Hive metastore
for the delete table and column statistics RPC calls degrades.
In cases where truncate operation is used to remove the existing
data and then reload new data, it is likely that users will compute
stats again as soon as the new data is reloaded. This would overwrite
the existing statistics and hence the additional time spent by
the truncate operation to delete column and table statistics becomes
unnecessary.
To improve this, this change introduces a new query option:
DELETE_STATS_IN_TRUNCATE. The default value of this option is 1 or
true, which means stats will be deleted as part of the truncate
operation. As the name suggests, when this query option is set to false
or 0, a truncate operation will not delete the table and column
statistics for the table.
This change also makes an improvement to the truncate operation on
tables which are replicated. Previously, if the table was being
replicated, the statistics were not deleted after truncate.
Now the statistics will be deleted after truncate.
Testing:
Modified truncate-table.test to include variations of this query
option, making sure that the statistics are deleted or skipped
from deletion after the truncate operation.
Change-Id: I9400c3586b4bdf46d9b4056ea1023aabae8cc519
Reviewed-on: http://gerrit.cloudera.org:8080/17521
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently the ORC scanner only supports position-based column
resolution. This patch adds Iceberg field-id based column resolution
which will be the default for Iceberg tables. It is needed to support
schema evolution in the future, i.e. ALTER TABLE DROP/RENAME COLUMNS.
(The Parquet scanner already supports Iceberg field-id based column
resolution)
Testing
* added e2e test 'iceberg-orc-field-id.test' by copying the contents of
nested-types-scanner-basic,
nested-types-scanner-array-materialization,
nested-types-scanner-position,
nested-types-scanner-maps,
and executing the queries on an Iceberg table with ORC data files
Change-Id: Ia2b1abcc25ad2268aa96dff032328e8951dbfb9d
Reviewed-on: http://gerrit.cloudera.org:8080/17398
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Because of an Iceberg bug Impala didn't push predicates to
Iceberg for dates/timestamps when the predicate referred to a
value before the UNIX epoch.
https://github.com/apache/iceberg/pull/1981 fixed the Iceberg
bug, and lately Impala switched to an Iceberg version that has
the fix, therefore this patch enables predicate pushdown for all
timestamp/date values.
The above Iceberg patch maintains backward compatibility with the
old, wrong behavior. Therefore sometimes we need to read one more
Iceberg partition than necessary.
Testing:
* Updated current e2e tests
Change-Id: Ie67f41a53f21c7bdb8449ca0d27746158be7675a
Reviewed-on: http://gerrit.cloudera.org:8080/17417
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Built-in functions to compute the SHA-1 digest and the SHA-2 family of
digests have been added. Support for SHA-2 digests includes SHA224,
SHA256, SHA384 and SHA512. In FIPS mode SHA1, SHA224 and SHA256 have
been disabled and will throw an error. SHA2 functions will also throw
an error for unsupported bit lengths, i.e., bit lengths apart from 224,
256, 384, 512.
Testing:
1. Added Unit test for expressions.
2. Added end-to-end test for new functions.
Change-Id: If163b7abda17cca3074c86519d59bcfc6ace21be
Reviewed-on: http://gerrit.cloudera.org:8080/17464
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Table alltypes has no statistics, so the cardinality of alltypes
will be estimated based on the HDFS files and the average row size.
In PrintUtils.printMetric a double is divided by a long, which leads
to accuracy problems. In most cases the number of rows calculated is
17.91K, but due to the accuracy problem here the calculated value is
17.90K.
I modified line 221 of stats-extrapolation.test to use row_regex
matching, referring to the cardinality matching method in line 224;
in this case their values are the same.
Testing:
metadata/test_stats_extrapolation.py
Change-Id: I0a1a3809508c90217517705b2b188b2ccba6f23f
Reviewed-on: http://gerrit.cloudera.org:8080/17411
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jim Apple <jbapple@apache.org>
AVG used to contain a back and forth timezone conversion if
use_local_tz_for_unix_timestamp_conversions is true. This could
affect the results if there were values from different DST rules.
Note that AVG on timestamps has other issues besides this, see
IMPALA-7472 for details.
Testing:
- added a regression test
Change-Id: I999099de8e07269b96b75d473f5753be4479cecd
Reviewed-on: http://gerrit.cloudera.org:8080/17412
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This function receives a set of serialized Apache DataSketches CPC
sketches produced by ds_cpc_sketch() and merges them into a single
sketch.
An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table, and then, based on which
partitions the user is interested in, union the relevant sketches
together to get an estimate. E.g.:
SELECT
ds_cpc_estimate(ds_cpc_union(sketch_col))
FROM sketch_tbl
WHERE partition_col=1 OR partition_col=5;
Testing:
- Apart from the automated tests I added to this patch I also
  tested ds_cpc_union() on a bigger dataset to check that the
  serialization, deserialization and merging steps work well. I
  took TPCH25.lineitem, created a number of sketches grouped
  by l_shipdate and called ds_cpc_union() on those sketches.
Change-Id: Ib94b45ae79efcc11adc077dd9df9b9868ae82cb6
Reviewed-on: http://gerrit.cloudera.org:8080/17372
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
These functions can be used to get cardinality estimates of data
using CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.
Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.
For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
Figures-of-Merit Comparison of the HLL and CPC Sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html
Testing:
- Added some tests running estimates for small datasets where the
amount of data is small enough to get the correct results.
- Ran manual tests on tpch_parquet.lineitem to compare performance
  with ndv(). Depending on data characteristics ndv() appears 2x-3x
  faster. CPC gives a closer estimate than the current ndv(), and CPC
  is more accurate than HLL in some cases.
Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Reviewed-on: http://gerrit.cloudera.org:8080/16656
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
ORC-189 and ORC-666 added support for a new timestamp type,
'TIMESTAMP WITH LOCAL TIMEZONE', to the ORC library.
This patch adds support for reading such timestamps with Impala.
These are UTC-normalized timestamps, therefore we convert them
to local timezone during scanning.
Testing:
* added test for CREATE TABLE LIKE ORC
* added scanner tests to test_scanners.py
Change-Id: Icb0c6a43ebea21f1cba5b8f304db7c4bd43967d9
Reviewed-on: http://gerrit.cloudera.org:8080/17347
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-10482: SELECT * query on unrelative collection column of
transactional ORC table will hit IllegalStateException.
The AcidRewriter will rewrite queries like
"select item from my_complex_orc.int_array" to
"select item from my_complex_orc t, t.int_array"
This causes trouble in star expansion. Because the original query
"select * from my_complex_orc.int_array" is analyzed as
"select item from my_complex_orc.int_array"
But the rewritten query "select * from my_complex_orc t, t.int_array" is
analyzed as "select id, item from my_complex_orc t, t.int_array".
Hidden table refs can also cause issues during regular column
resolution. E.g. when the table has top-level 'pos'/'item'/'key'/'value'
columns.
The workaround is to keep track of the automatically added table refs
during query rewrite. So when we analyze the rewritten query we can
ignore these auxiliary table refs.
IMPALA-10493: Using JOIN ON syntax to join two full ACID collections
produces wrong results.
When AcidRewriter.splitCollectionRef() creates a new collection ref
it doesn't copy all the information needed to correctly execute the
query. E.g. it dropped the ON clause, turning INNER joins into CROSS
joins.
Testing:
* added e2e tests
Change-Id: I8fc758d3c1e75c7066936d590aec8bff8d2b00b0
Reviewed-on: http://gerrit.cloudera.org:8080/17038
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The issue occurs when a CastFormatExpr is shared among threads and
multiple threads call its OpenEvaluator(). Later calls delete the
DateTimeFormatContext created by earlier calls, which makes
fn_ctx->GetFunctionState() return a dangling pointer.
This only happens when CastFormatExpr is shared among
FragmentInstances - in case of scanner threads OpenEvaluator() is
called with THREAD_LOCAL and returns early without modifying anything.
Testing:
- added a regression test
Change-Id: I501c8a184591b1c836b2ca4cada1f2117f9f5c99
Reviewed-on: http://gerrit.cloudera.org:8080/17374
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The original approach to convert DecimalValue (the internal
representation of decimals) to double was not accurate.
It was:
static_cast<double>(value_) / pow(10.0, scale).
However, only integers from -2^53 to 2^53 can be represented
exactly in double precision without any loss.
Hence, it did not work for numbers like -0.43149576573887316.
For the DecimalValue representing -0.43149576573887316, value_ would be
-43149576573887316 and scale would be 17. As value_ < -2^53, the
result would not be accurate. The new approach uses the third-party
library https://github.com/lemire/fast_double_parser, which
handles the above scenario in a performant manner.
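A small standalone program that reproduces the precision issue (not
Impala's code; strtod stands in here for a correctly rounded parse, while
the patch itself relies on fast_double_parser):
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
  // DecimalValue for -0.43149576573887316: value_ = -43149576573887316, scale = 17.
  int64_t value = -43149576573887316LL;
  int scale = 17;

  // Old approach: the cast already rounds because |value_| > 2^53.
  double naive = static_cast<double>(value) / std::pow(10.0, scale);

  // Correctly rounded reference obtained by parsing the decimal string.
  double parsed = strtod("-0.43149576573887316", nullptr);

  // The last digits of the two results may differ on typical platforms.
  printf("naive  = %.17g\nparsed = %.17g\n", naive, parsed);
  return 0;
}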
Testing:
1. Added end-to-end tests covering the following scenarios:
a. Test to show the precision limitation of 16 in the write path
b. DecimalValue's value_ between -2^53 and 2^53
c. value_ outside the above range but abs(value_) < UINT64_MAX
d. abs(value_) > UINT64_MAX - covers DecimalValue<__int128_t>
2. Ran existing backend and end-to-end tests completely
Change-Id: I56f0652cb8f81a491b87d9b108a94c00ae6c99a1
Reviewed-on: http://gerrit.cloudera.org:8080/17303
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In EE tests, HS2 returned results with lower precision than Beeswax for
FLOAT/DOUBLE/TIMESTAMP types. These differences are not inherent to the
HS2 protocol - the results are returned with full precision in Thrift
and lose precision during conversion in client code.
This patch changes the conversion in HS2 to match Beeswax and removes
the test section DBAPI_RESULTS that was used to handle the differences:
- float/double: print method is changed from str() to ":.16".format()
- timestamp: impyla's cursor is created with convert_types=False to
avoid conversion to datetime.datetime (which has only
microsec precision)
Note that FLOAT/DOUBLE still differ in impala-shell; this change
only deals with EE tests.
Testing:
- ran the changed tests
Change-Id: If69ae90c6333ff245c2b951af5689e3071f85cb2
Reviewed-on: http://gerrit.cloudera.org:8080/17325
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The change improves how a coordinator behaves when a newly
arrived min/max filter is always true. A new member
'always_true_filter_received_' is introduced to record this
fact. Similarly, the new member 'always_false_flipped_to_false_'
is added to indicate that the always-false flag has flipped from
'true' to 'false'. These two members only influence how the min
and max columns in the "Filter routing table" and "Final filter
table" sections of the profile are displayed, as follows:
1. 'PartialUpdates' - The min and the max are partially updated;
2. 'AlwaysTrue' - One received filter is AlwaysTrue;
3. 'AlwaysFalse' - No filter is received or all received
filters are empty;
4. 'Real values' - The final accumulated min/max from all
received filters.
A second change is to record, in the scan node, the arrival time
of min/max filters (as a timestamp relative to system boot,
obtained by calling MonotonicMillis()). A timestamp of a similar
nature is recorded in HDFS Parquet scanners when a row group is
processed. By comparing these two timestamps, one can easily
diagnose issues related to the late arrival of min/max filters.
This change also addresses a flaw where rows were unexpectedly
filtered out because the always_true_ flag in a min/max filter,
when set, was ignored in the eval code path in
RuntimeFilter::Eval().
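A conceptual sketch of the corrected evaluation order (illustrative only,
not the actual RuntimeFilter::Eval() implementation):
#include <cstdint>

struct MinMaxFilter {
  bool always_true = false;
  bool always_false = false;
  int64_t min = 0;
  int64_t max = 0;
};

// An always-true filter must accept every row; ignoring the flag (as the
// old code path did) can drop rows that should survive.
bool EvalMinMax(const MinMaxFilter& f, int64_t value) {
  if (f.always_true) return true;
  if (f.always_false) return false;
  return value >= f.min && value <= f.max;
}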
Testing:
1. Added three new tests in overlap_min_max_filters.test to
verify that the min/max are displayed correctly when the
min/max filter in hash join builder is set to always true,
always false, or a pair of meaningful min and max values.
2. Ran unit tests;
3. Ran runtime-filter-test;
4. Ran core tests successfully.
Change-Id: I326317833979efcbe02ce6c95ad80133dd5c7964
Reviewed-on: http://gerrit.cloudera.org:8080/17252
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the functionality to compute the minimal and the maximal
value for columns of integer, float/double, date, or decimal type in
Parquet tables, and to use the new stats to discard min/max filters,
in both hash join builders and Parquet scanners, when their coverage
is too close to the actual range defined by the column min and max.
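A sketch of the kind of check this enables; the overlap ratio, threshold,
and function name are assumptions for illustration, not the exact
planner/scanner logic:
#include <algorithm>

// Discard a min/max filter when its [filter_min, filter_max] range covers
// almost all of the column's [col_min, col_max] range known from the new
// stats: such a filter would eliminate very few rows.
bool ShouldDiscardMinMaxFilter(double filter_min, double filter_max,
                               double col_min, double col_max,
                               double threshold /* e.g. 0.9, assumed */) {
  double col_range = col_max - col_min;
  if (col_range <= 0) return true;
  double overlap = std::min(filter_max, col_max) - std::max(filter_min, col_min);
  return overlap / col_range >= threshold;
}
With a threshold near 1.0 only filters covering essentially the whole column
range are dropped; lower thresholds discard filters more aggressively.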
The computation and display of the new column min/max stats can be
controlled by two new Boolean query options (both defaulting to false):
1. compute_column_minmax_stats
2. show_column_minmax_stats
Usage examples:
set compute_column_minmax_stats=true;
compute stats tpcds_parquet.store_sales;
set show_column_minmax_stats=true;
show column stats tpcds_parquet.store_sales;
+-----------------------+--------------+-...-------+---------+---------+
| Column | Type | #Falses | Min | Max |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk | INT | -1 | 28800 | 75599 |
| ss_item_sk | BIGINT | -1 | 1 | 18000 |
| ss_customer_sk | INT | -1 | 1 | 100000 |
| ss_cdemo_sk | INT | -1 | 15 | 1920797 |
| ss_hdemo_sk | INT | -1 | 1 | 7200 |
| ss_addr_sk | INT | -1 | 1 | 50000 |
| ss_store_sk | INT | -1 | 1 | 10 |
| ss_promo_sk | INT | -1 | 1 | 300 |
| ss_ticket_number | BIGINT | -1 | 1 | 240000 |
| ss_quantity | INT | -1 | 1 | 100 |
| ss_wholesale_cost | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_list_price | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_sales_price | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_ext_discount_amt | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_ext_sales_price | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_ext_wholesale_cost | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_ext_list_price | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_ext_tax | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_coupon_amt | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_net_paid | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_net_paid_inc_tax | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_net_profit | DECIMAL(7,2) | -1 | -1 | -1 |
| ss_sold_date_sk | INT | -1 | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+
Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the coordinator.
The min-max filters, in both their C++ class and protobuf forms, are
augmented to handle the always-true state better. Once always true is
set, the actual min and max values in the filter are no longer populated.
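A minimal sketch of that behavior on the build side (field and method names
are illustrative, not the actual class or protobuf definitions):
#include <algorithm>
#include <cstdint>
#include <limits>

struct MinMaxFilterState {
  bool always_true = false;
  int64_t min = std::numeric_limits<int64_t>::max();
  int64_t max = std::numeric_limits<int64_t>::min();

  // Once the filter becomes always true, min/max stop being maintained
  // (and would not be serialized), matching the description above.
  void Insert(int64_t v) {
    if (always_true) return;
    min = std::min(min, v);
    max = std::max(max, v);
  }
};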
Testing:
- Added new compute/show stats tests in
compute-stats-column-minmax.test;
- Added new tests in overlap_min_max_filters.test to demonstrate the
usefulness of column stats to quickly disable useless filters in
both hash join builder and Parquet scanner;
- Added tests in min-max-filter-test.cc to demonstrate that the Or()
  method, ToProtobuf(), and the constructor handle the always-true
  flag correctly;
- Tested with TPCDS 3TB to demonstrate the usefulness of the min
and max column stats in disabling min/max filters that are not
useful.
- core tests.
TODO:
1. IMPALA-10602: Intersection of multiple min/max filters when
applying to common equi-join columns;
2. IMPALA-10601: Creating lineitem_orderkey_only table in
tpch_parquet database;
3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
tables with Parquet data files;
4. IMPALA-10617: Compute min/max column stats beyond parquet tables.
Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>