Commit Graph

1508 Commits

Author SHA1 Message Date
liuyao
39cc4b6bf4 IMPALA-2581: LIMIT can be propagated down into some aggregations
This patch contains 2 parts:
1. When both conditions below are true, push the limit down to the
pre-aggregation:
     a) the aggregation node has no aggregate function
     b) the aggregation node has no predicate
2. Finish the aggregation early once the number of unique keys in the
hash table exceeds the limit.

Sample queries such as
SELECT DISTINCT f FROM t LIMIT n
can pass the LIMIT all the way down to the pre-aggregation, which
leads to a nearly unbounded speedup on such queries over large tables
when n is low.
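
For illustration, a hedged sketch of which queries qualify (the table
and column names are hypothetical):

  -- Eligible: distinct aggregation, no aggregate function, no predicate.
  SELECT DISTINCT f FROM t LIMIT 10;
  -- Not eligible: the HAVING clause adds both an aggregate function and
  -- a predicate on the aggregation node.
  SELECT f FROM t GROUP BY f HAVING count(*) > 1 LIMIT 10;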

Testing:
Add test targeted-perf/queries/aggregation.test
Pass core test

Change-Id: I930a6cb203615acfc03f23118d1bc1f0ea360995
Reviewed-on: http://gerrit.cloudera.org:8080/17821
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-22 20:42:10 +00:00
norbert.luksa
35b21083b1 IMPALA-6505: Min-Max predicate push down in ORC scanner
In the planning phase, the planner collects and generates min-max
predicates that can be evaluated on Parquet file statistics. We can
easily extend this to ORC tables.

This commit implements min/max predicate pushdown for the ORC scanner,
leveraging the external ORC library's search arguments. We build the
search arguments when we open the scanner, as we don't need to modify
them later.

Also added a new query option, orc_read_statistics, similar to
parquet_read_statistics. If the option is set to true (as it is by
default), predicate pushdown takes effect; otherwise it is skipped. The
predicates are evaluated at ORC row-group level, i.e. by default for
every 10,000 rows.
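
As a rough illustration of the new option (the table and column names
below are hypothetical):

  -- Enabled by default; setting it to false skips ORC min/max pushdown.
  SET orc_read_statistics=false;
  SELECT count(*) FROM orc_tbl WHERE id = 1;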

Limitations:
 - Min-max predicates on CHAR/VARCHAR types are not pushed down due to
   inconsistent behaviors on padding/truncating between Hive and Impala.
   (IMPALA-10882)
 - Min-max predicates on TIMESTAMP are not pushed down (IMPALA-10915).
 - Min-max predicates having different arg types are not pushed down
   (IMPALA-10916).
 - Min-max predicates with non-literal const exprs are not pushed down
   since SearchArgument interfaces only accept literals. This only
   happens when expr rewrites are disabled and thus constant folding is
   disabled.

Tests:
 - Add e2e tests similar to test_parquet_stats to verify that
   predicates are pushed down.
 - Run CORE tests
 - Ran the TPCH benchmark; there is no improvement nor regression.
   On the other hand, certain selective queries gained a significant
   speed-up, e.g. select count(*) from lineitem where l_orderkey = 1.

Change-Id: I136622413db21e0941d238ab6aeea901a6464845
Reviewed-on: http://gerrit.cloudera.org:8080/15403
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-17 00:44:15 +00:00
AlexanderSaydakov
c925807b1a IMPALA-10901 cleaner and faster operations with datasketches
- serialize using bytes instead of stream
- avoid unnecessary constructor during deserialization
- simplified code slightly
- added original exception message to re-thrown generic message

Change-Id: I306a2489dac0f4d2d475e8f9987cd58bf95474bb
Reviewed-on: http://gerrit.cloudera.org:8080/17818
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-15 13:58:23 +00:00
stiga-huang
3850d49711 IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example: by default it masks uppercase letters to 'X', lowercase
letters to 'x', digits to 'n', and leaves other characters unmasked. To
mask all characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte by byte, which gives
results inconsistent with Hive when the string contains Unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded as 3 bytes in UTF-8, so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.
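
For example, under the behavior described above (a sketch; the exact
output formatting is illustrative):

  SET UTF8_MODE=true;
  SELECT mask('中国', '*', '*', '*', '*');  -- '**', matching Hive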

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points (represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   the given masked characters in UTF-8 mode.
 - Implements a MaskSubStrUtf8 method as the core functionality.
 - Switches to MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions that masks only other characters while keeping the
   upper/lower/digit characters unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-15 05:04:07 +00:00
Gabor Kaszab
1e21aa6b96 IMPALA-9495: Support struct in select list for ORC tables
This patch implements the functionality to allow structs in the select
list of inline views and topmost blocks. When displaying the value of a
struct, it is formatted into a JSON value and returned as a string. An
example of such a value:

SELECT struct_col FROM some_table;
'{"int_struct_member":12,"string_struct_member":"string value"}'

Another example where we query a nested struct:
SELECT outer_struct_col FROM some_table;
'{"inner_struct":{"string_member":"string value","int_member":12}}'

Note, the conversion from struct to JSON happens on the server side
before sending the value to the client over HS2. However, HS2 is
capable of handling struct values as well, so in a later change we might
want to add functionality to send the struct in Thrift to the client
so that the client can use the struct directly.

-- Internal representation of a struct:
When scanning a struct, the row batch will hold the values of the
struct's children as if they were queried one by one directly in the
select list.

E.g. Taking the following table:
CREATE TABLE tbl (id int, s struct<a:int,b:string>) STORED AS ORC

And running the following query:
SELECT id, s FROM tbl;

After scanning, a row in a row batch will hold the following values
(note that the biggest size comes first):
 1: The pointer for the string in s.b
 2: The length for the string in s.b
 3: The int value for s.a
 4: The int value of id
 5: A single null byte for all the slots: id, s, s.a, s.b

The size of a struct affects the ordering of the memory layout of a row
batch. The struct size is calculated by summing the sizes of its
fields, and the struct then gets a place in the row batch that precedes
all smaller slots by size. Note, all the fields of a struct are
consecutive to each other in the row batch. Inside a struct the order
of the fields is also based on their size, as in the regular case for
primitives.

When evaluating a struct as a SlotRef a newly introduced StructVal will
be used to refer to the actual values of a struct in the row batch.
This StructVal holds a vector of pointers where each pointer represents
a member of the struct. Following the above example the StructVal would
keep two pointers, one to point to an IntVal and one to point to a
StringVal.

-- Changes related to tuple and slot descriptors:
When providing a struct in the select list there is going to be a
SlotDescriptor for the struct slot in the topmost TupleDescriptor.
Additionally, another TupleDesriptor is created to hold SlotDescriptors
for each of the struct's children. The struct SlotDescriptor points to
the newly introduced TupleDescriptor using 'itemTupleId'.
The offsets for the children of the struct is calculated from the
beginning of the topmost TupleDescriptor and not from the
TupleDescriptor that directly holds the struct's children. The null
indicator bytes as well are stored on the level of the topmost
TupleDescriptor.

-- Changes related to scalar expressions:
A struct in the select list is translated into an expression tree where
the top of this tree is a SlotRef for the struct itself and its
children in the tree are SlotRefs for the members of the struct. When
evaluating a struct SlotRef after the null checks the evaluation is
delegated to the children SlotRefs.

-- Restrictions:
  - Codegen support is not included in this patch.
  - Only ORC file format is supported by this patch.
  - Only HS2 client supports returning structs. Beeswax support is not
    implemented as it is going to be deprecated anyway. Currently we
    receive an error when trying to query a struct through Beeswax.

-- Tests added:
  - The ORC and Parquet functional databases are extended with 3 new
    tables:
    1: A small table with one-level structs, holding different
    kinds of primitive types as members.
    2: A small table with 2- and 3-level nested structs.
    3: A bigger, partitioned table constructed from alltypes where all
    the columns except the 'id' column are put into a struct.
  - struct-in-select-list.test and nested-struct-in-select-list.test
    use these new tables to query structs directly or through an
    inline view.

Change-Id: I0fbe56bdcd372b72e99c0195d87a818e7fa4bc3a
Reviewed-on: http://gerrit.cloudera.org:8080/17638
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-14 21:21:47 +00:00
Zoltan Borok-Nagy
6b4693ddbf IMPALA-10900: Add Iceberg tests that write many files
In earlier versions of Impala we had a bug that affected
insertions to Iceberg tables. When Impala wrote multiple
files during a single INSERT statement it could crash, or
even worse, it could silently omit data files from the
Iceberg metadata.

The current master doesn't have this bug, but we don't
really have tests for this case.

This patch adds tests that write many files during inserts
to an Iceberg table. Both non-partitioned and partitioned
Iceberg tables are tested.

We achieve writing lots of files by setting 'parquet_file_size'
to 8 megabytes.
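
A sketch of how the tests force many small files (the table names are
hypothetical):

  SET parquet_file_size=8388608;  -- 8 megabyte target file size
  INSERT INTO iceberg_tbl SELECT * FROM large_tbl;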

Testing:
 * added e2e test that writes many data files
 * added exhaustive e2e test that writes even more data files

Change-Id: Ia2dbc2c5f9574153842af308a61f9d91994d067b
Reviewed-on: http://gerrit.cloudera.org:8080/17831
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-08 18:38:37 +00:00
Attila Jeges
c8aa5796d9 IMPALA-10879: Add parquet stats to iceberg manifest
This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-NaN values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files when querying Iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
2021-09-02 21:34:41 +00:00
Zoltan Borok-Nagy
4f9f8c33ca IMPALA-10840: Add support for "FOR SYSTEM_TIME AS OF" and "FOR SYSTEM_VERSION AS OF" for Iceberg tables
This patch adds support for the "FOR SYSTEM_TIME AS OF" and
"FOR SYSTEM_VERSION AS OF" clauses for Iceberg tables. The new
clauses are part of the table ref. FOR SYSTEM_TIME AS OF conforms to the
SQL2011 standard:
https://cs.ulb.ac.be/public/_media/teaching/infoh415/tempfeaturessql2011.pdf

With FOR SYSTEM_TIME AS OF we can query a table at a specific point in
time, e.g. we can retrieve what the table content was 1 day ago.

The timestamp given to "FOR SYSTEM_TIME AS OF" is interpreted in the
local timezone. The local timezone can be set via the query option
TIMEZONE. By default the timezone being used is the coordinator node's
local timezone. The timestamp is translated to UTC because table
snapshots are tagged with UTC timestamps.

"FOR SYSTEM_VERSION AS OF" is a non-standard extension. It works
similarly to FOR SYSTEM_TIME AS OF, but with this clause we can query
a table via a snapshot ID instead of a timestamp.

HIVE-25344 also added support for these clauses to Hive.

Table snapshot IDs and timestamp information can be queried with the
help of the DESCRIBE HISTORY command.
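
For instance, a hedged sketch (the table name is hypothetical):

  -- List the table's snapshots, their IDs and creation times, then use
  -- one of them with the clauses below:
  DESCRIBE HISTORY ice_t;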

Sample queries:

 SELECT * FROM t FOR SYSTEM_TIME AS OF now();
 SELECT * FROM t FOR SYSTEM_TIME AS OF '2021-08-10 11:02:34';
 SELECT * FROM t FOR SYSTEM_TIME AS OF now() - interval 10 days + interval 3 hours;

 SELECT * FROM t FOR SYSTEM_VERSION AS OF 7080861547601448759;

 SELECT * FROM t FOR SYSTEM_TIME AS OF now()
 MINUS
 SELECT * FROM t FOR SYSTEM_TIME AS OF now() - interval 1 days;

This patch uses some parts of the in-progress
IMPALA-9773 (https://gerrit.cloudera.org/#/c/13342/) developed by
Todd Lipcon and Grant Henke. This patch also resolves some TODOs of
IMPALA-9773, i.e. after this patch it'll be easier to add
time travel for Kudu tables as well.

Testing:
 * added parser tests (ParserTest.java)
 * added analyzer tests (AnalyzeStmtsTest.java)
 * added e2e tests (test_iceberg.py)

Change-Id: Ib523c5e47b8d9c377bea39a82fe20249177cf824
Reviewed-on: http://gerrit.cloudera.org:8080/17765
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-30 23:09:42 +00:00
Amogh Margoor
2040b2621f IMPALA-7635: Reducing HashTable size by packing its buckets efficiently.
The HashTable implementation in Impala comprises a contiguous array
of Buckets, and each Bucket contains either data or a pointer to a
linked list of duplicate entries named DuplicateNodes.
These are the structures of Bucket and DuplicateNode:

  struct DuplicateNode {
    bool matched;
    DuplicateNode* next;
    HtData htdata;
  };

  struct Bucket {
    bool filled;
    bool matched;
    bool hasDuplicates;
    uint32_t hash;
    union {
      HtData htdata;
      DuplicateNode* duplicates;
    } bucketData;
  };

The size of Bucket is currently 16 bytes and the size of DuplicateNode
is 24 bytes. If we can remove the booleans from both structs, the size
of Bucket would reduce to 12 bytes and DuplicateNode to 16 bytes.
One of the ways we can remove the booleans is to fold them into the
pointers already part of the structs. Pointers store addresses, and on
architectures like x86 and ARM the linear address is only 48 bits
long. With level 5 paging Intel is planning to expand it to 57 bits,
which means we can use the most significant 7 bits, i.e., bits 58 to
64, to store these booleans. This patch reduces the size of Bucket
and DuplicateNode by implementing this folding. However, there is
another requirement that the size of Bucket be a power of 2 and
also that the number of buckets in the hash table be a power of 2.
These requirements exist for the following reasons:
1. The memory allocator allocates memory in powers of 2 to avoid
   internal fragmentation. Hence, num of buckets * sizeof(Bucket)
   should be a power of 2.
2. The number of buckets being a power of 2 enables a faster modulo
   operation, i.e., instead of the slow modulo (hash % N), the faster
   (hash & (N-1)) can be used.

Due to this, the 4-byte 'hash' field is removed from Bucket and
stored separately in a new array hash_array_ in HashTable.
This ensures sizeof(Bucket) is 8, which is a power of 2.

New Classes:
------------
As a part of the patch, TaggedPointer is introduced, which is a template
class to store a pointer and a 7-bit tag together in a 64-bit integer.
This structure owns the pointer and will take care of allocation and
deallocation of the object being pointed to. However, derived classes
can opt out of the ownership of the object and let the client manage
it. Its derived classes for Bucket and DuplicateNode do the same. These
classes are TaggedBucketData and TaggedDuplicateNode.

Benchmark:
----------
As a part of this patch a new micro benchmark for HashTable has
been introduced, which will help measure:
1. Runtime for building the hash table and probing it.
2. Memory consumed after building the table.
This helps measure the impact of changes to the HashTable's
data structure and algorithm.
Saw a 25-30% reduction in memory consumed and no significant
difference in performance (0.91X-1.2X).

Other Benchmarks:
1. Billion row Synthetic benchmark on single node, single daemon:
   a. 2-3% improvement in Join GEOMEAN for Probe benchmark.
   b. 17% and 21% reduction in PeakMemoryUsage and
      CumulativeBytes allocated respectively
2. TPCH-42: 0-1.5% improvement in GEOMEAN runtime

Change-Id: I72912ae9353b0d567a976ca712d2d193e035df9b
Reviewed-on: http://gerrit.cloudera.org:8080/17592
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-25 20:05:47 +00:00
Andrew Sherman
b54d0c35ff IMPALA-10849: Ignore escaped wildcards that terminate like predicates.
A like predicate is generally evaluated by converting it into a regex
that is evaluated at execution time. If the predicate of a like clause
is a constant (which is the common case when you say "row
like 'start%'") then there are optimizations where some cases that are
simpler than a regex are spotted, and a simpler function than a regex
evaluator is used. One example is that a predicate such as 'start%' is
evaluated by looking for strings that begin with "start". Amusingly the
code that spots the potential optimizations uses regexes to look for
patterns in the like predicate. The code that looks for the
optimization where a simple prefix can be searched for does not deal
with the case where the '%' wildcard at the end of the predicate is
escaped. To fix this we add a test that deals with the case where the
predicate ends in an escaped '%'.
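
A hedged sketch of the two cases (the table and column are hypothetical;
backslash is assumed to be the escape character, and string-literal
escaping details are glossed over):

  -- Prefix optimization applies: matches values starting with 'start'.
  SELECT * FROM t WHERE s LIKE 'start%';
  -- Trailing '%' is escaped, so it is a literal character and the
  -- prefix optimization must not be used.
  SELECT * FROM t WHERE s LIKE 'start\%';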

There are some other problems with escaped wildcards discussed in
IMPALA-2422. This change does not fix these problems, which are hard.

New tests for escaped wildcards are added to exprs.test - note that
these tests cannot be part of the LikeTbl tests as the like predicate
optimizations are only applied when the like predicate is a string
literal.

Exhaustive tests ran clean.

Change-Id: I30356c19f4f169d99f7cc6268937653af6b41b70
Reviewed-on: http://gerrit.cloudera.org:8080/17798
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-25 05:31:42 +00:00
Qifan Chen
cd902d8c22 IMPALA-3430: Runtime filter : Extend runtime filter to support Min/Max values for HDFS scans
This patch enables min/max filtering for non-correlated subqueries
that return one value. In this case, the filters are built from the
results of the subqueries and the filtering target is the scan node to
be qualified by one of the subqueries. Shown below is one such query
that normally gets compiled into a nested loop join. The filtering
limits the values from column store_sales.ss_sales_price to be within
[-infinity, min(ss_wholesale_cost)].

  select count(*) from store_sales
    where ss_sales_price <=
      (select min(ss_wholesale_cost) from store_sales);

In the FE, the fact that the above scalar subquery exists is recorded
in a flag in InlineViewRef in the analyzer and later on transferred to
the AggregationNode in the planner.

In BE, the min/max filtering infrastructure is integrated with the
nested loop join as follows.

 1. NljBuilderConfig is populated with filter descriptors from nested
    join plan node via NljBuilder::CreateEmbeddedBuilder() (similar
    to hash join), or in NljBuilderConfig::Init() when the sink config
    is created (for separate builder case);
 2. NljBuilder is populated with filter contexts utilizing the filter
    descriptors in NljBuilderConfig. Filter contexts are the interface
    to actual min/max filters;
 3. New insertion methods InsertFor<op>(), where <op> is LE, LT, GE and
    GT, are added to the MinMaxFilter class hierarchy. They are used for
    join predicate target <op> src_expr;
 4. RuntimeContext::InsertPerCompareOp() calls one of the new
    insertion methods above based on the comparison op saved in the
    filter descriptor;
 5. NljBuilder::InsertRuntimeFilters() calls the new methods.

By default, the feature is turned on only for sorted or partitioned
join columns.

Testing:
 1. Add single range insertion tests in min-max-filter-test.cc;
 2. Add positive and negative plan tests in
    overlap_min_max_filters.test;
 3. Add tests in overlap_min_max_filters_on_partition_columns.test;
 4. Add tests in overlap_min_max_filters_on_sorted_columns.test;
 5. Run core tests.

TODO in follow-up patches:
 1. Extend min/max filter for inequality subquery for other use cases
    (IMPALA-10869).

Change-Id: I7c2bb5baad622051d1002c9c162c672d428e5446
Reviewed-on: http://gerrit.cloudera.org:8080/17706
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-21 14:46:51 +00:00
Amogh Margoor
0bde6b443c IMPALA-10680: Replace StringToFloatInternal using fast_double_parser library
StringToFloatInternal is used to parse strings into floats. It had
logic to ensure it is faster than standard functions like strtod in
many cases, but it was not as accurate. We are replacing it with a
third-party library named fast_double_parser which is both fast and
doesn't sacrifice accuracy for speed. Benchmarking on more than
1 million rows where a string is cast to double found that the new
patch is on par with the earlier algorithm.

Results:
W/O library: Fetched 1222386 row(s) in 32.10s
With library: Fetched 1222386 row(s) in 31.71s

Testing:
1. Added test to check for accuracy improvement.
2. Ran existing Backend tests for correctness.

Change-Id: Ic105ad38a2fcbf2fb4e8ae8af6d9a8e251a9c141
Reviewed-on: http://gerrit.cloudera.org:8080/17389
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-15 20:40:39 +00:00
ShikhaAsrani
b1ca089446 IMPALA-10797: Frontend changes to enable 'stored as JSONFILE'
This change will allow usage of commands that do not require reading the
JSON file, like:
- Create Table <Table> stored as JSONFILE
- Show Create Table <Table>
- Describe <Table>
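
A hedged sketch of these commands (the table and columns are
hypothetical):

  CREATE TABLE json_tbl (id INT, name STRING) STORED AS JSONFILE;
  SHOW CREATE TABLE json_tbl;
  DESCRIBE json_tbl;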

Changes:
- Added JSON as a FileFormat to thrift and HdfsFileFormat.
- Allowed the SQL keyword 'jsonfile' and mapped it to the JSON format.
- Added the JSON SerDe.
- JSON files have the same input format as TextFile, so we need to use
the SerDe library in use to differentiate between the two formats.
Overloaded the functions that determine the file format based on input
format to also consider the SerDe library.
- Added tests for 'Create Table' and 'Show Create Table' commands

Pending Changes:
- test for Describe command - to be added with backend changes.

Change-Id: I5b8cb2f59df3af09902b49d3bdac16c19954b305
Reviewed-on: http://gerrit.cloudera.org:8080/17727
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-13 16:04:19 +00:00
Zoltan Borok-Nagy
a1d5891c57 IMPALA-10741: Set engine.hive.enabled=true table property for Iceberg tables
Hive relies on the engine.hive.enabled=true table property being set
for Iceberg tables. Without it, Hive overwrites the table metadata with
a different storage handler and SerDe/Input/OutputFormatter when it
writes the table, making it unusable.

With this patch Impala sets this table property during table creation.

Testing:
 * updated show-create-table.test
 * tested Impala/Hive interop manually

Change-Id: I6aa0240829697a27f48d0defcce48920a5d6f49b
Reviewed-on: http://gerrit.cloudera.org:8080/17750
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-05 15:45:24 +00:00
stiga-huang
599c84b4dd IMPALA-10808: (addendum) Abort on illegal decimal parquet schemas
The previous patch added checks on illegal decimal schemas of parquet
files. However, it doesn't return a non-ok status in
ParquetMetadataUtils::ValidateColumn if abort_on_error is set to false.
So we continue to use the illegal file schema and hit the DCHECK.

This patch fixes this and adds test coverage for illegal decimal
schemas.

Tests:
 - Add a bad parquet file with illegal decimal schemas.
 - Add e2e tests on the file.
 - Ran test_fuzz_decimal_tbl 100 times. Saw the errors are caught as
   expected.

Change-Id: I623f255a7f40be57bfa4ade98827842cee6f1fee
Reviewed-on: http://gerrit.cloudera.org:8080/17748
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-05 07:55:26 +00:00
Attila Jeges
4fc42c379e IMPALA-10739: Support setting new partition spec for Iceberg tables
With this patch Impala will support partition evolution for
Iceberg tables.

The DDL statement to change the default partition spec is:
ALTER TABLE <tbl> SET PARTITION SPEC(<partition-spec>)

Hive uses the same SQL syntax.
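
A hedged example using the unified partition-transform syntax (the
table and column names are hypothetical):

  ALTER TABLE ice_t SET PARTITION SPEC (bucket(5, i), months(ts));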

Testing:
- Added FE test to exercise parsing various well-formed and ill-formed
  ALTER TABLE SET PARTITION SPEC statements.

- Added e2e tests for:
  - ALTER TABLE SET PARTITION SPEC works for tables with HadoopTables
    and HadoopCatalog Catalog.
  - When evolving partition spec, the old data written with an earlier
    spec remains unchanged. New data is written using the new spec in
    a new layout. Data written with earlier spec and new spec can be
    fetched in a single query.
  - Invalid ALTER TABLE SET PARTITION SPEC statements yield the
    expected analysis error messages.

Change-Id: I9bd935b8a82e977df9ee90d464b5fe2a7acc83f2
Reviewed-on: http://gerrit.cloudera.org:8080/17723
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-08-03 16:27:07 +00:00
Attila Jeges
fabe994d1f IMPALA-10627: Use standard parquet-related Iceberg table properties
This patch adds support for the following standard Iceberg properties:

write.parquet.compression-codec:
  Parquet compression codec. Supported values are: NONE, GZIP, SNAPPY
  (default value), LZ4, ZSTD. The table property will be ignored if
  COMPRESSION_CODEC query option is set.

write.parquet.compression-level:
  Parquet compression level. Used with ZSTD compression only.
  Supported range is [1, 22]. Default value is 3. The table property
  will be ignored if COMPRESSION_CODEC query option is set.

write.parquet.row-group-size-bytes:
  Parquet row group size in bytes. Supported range is [8388608,
  2146435072] (8MB - 2047MB). The table property will be ignored if
  PARQUET_FILE_SIZE query option is set.
  If neither the table property nor the PARQUET_FILE_SIZE query option
  is set, the way Impala calculates row group size will remain
  unchanged.

write.parquet.page-size-bytes:
  Parquet page size in bytes. Used for PLAIN encoding. Supported range
  is [65536, 1073741824] (64KB - 1GB).
  If the table property is unset, the way Impala calculates page size
  will remain unchanged.

write.parquet.dict-size-bytes:
  Parquet dictionary page size in bytes. Used for dictionary encoding.
  Supported range is [65536, 1073741824] (64KB - 1GB).
  If the table property is unset, the way Impala calculates dictionary
  page size will remain unchanged.

This patch also renames 'iceberg.file_format' table property to
'write.format.default' which is the standard Iceberg name for the
table property.
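
A hedged sketch of setting these properties at table creation (the
table, columns and chosen values are illustrative):

  CREATE TABLE ice_t (i INT, s STRING)
  STORED AS ICEBERG
  TBLPROPERTIES ('write.format.default'='parquet',
                 'write.parquet.compression-codec'='ZSTD',
                 'write.parquet.compression-level'='5');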

Change-Id: I3b8aa9a52c13c41b48310d2f7c9c7426e1ff5f23
Reviewed-on: http://gerrit.cloudera.org:8080/17654
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-20 23:58:06 +00:00
Qifan Chen
147b4b9e58 IMPALA-10754: test_overlap_min_max_filters_on_sorted_columns failed during GVO
This patch addresses a failure in the ubuntu-16.04 dockerised test. The
test involved is found in overlap_min_max_filters_on_sorted_columns.test as
follows.

  set minmax_filter_fast_code_path=on;
  set MINMAX_FILTER_THRESHOLD=0.0;
  SET RUNTIME_FILTER_WAIT_TIME_MS=$RUNTIME_FILTER_WAIT_TIME_MS;
  select straight_join count(a.timestamp_col) from
  alltypes_timestamp_col_only a join [SHUFFLE] alltypes_limited b
  where a.timestamp_col = b.timestamp_col and b.tinyint_col = 4;
  ---- RUNTIME_PROFILE
  aggregation(SUM, NumRuntimeFilteredPages)> 57

The patch reduces the threshold from 58 to 50.

Testing:
   Ran the unit test successfully.

Change-Id: Icb4cc7d533139c4a2b46a872234a47d46cb8a17c
Reviewed-on: http://gerrit.cloudera.org:8080/17696
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-20 22:56:48 +00:00
stiga-huang
4df03a31ec IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
Similar to the previous patch, this patch adds UTF-8 support to the
instr() and locate() builtin functions so they behave consistently
with Hive. These two string functions both take an optional position
argument:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring.

In UTF-8 mode (turned on by set UTF8_MODE=true), these positions are
counted by UTF-8 characters instead of bytes.
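
A hedged sketch of the difference (the results assume each Chinese
character is encoded as 3 bytes in UTF-8):

  SET UTF8_MODE=true;
  SELECT instr('地图abc', 'abc');   -- 3, counted in characters
  SET UTF8_MODE=false;
  SELECT instr('地图abc', 'abc');   -- 7, counted in bytes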

Error handling:
Malformed UTF-8 characters are counted as one byte per character. This
is consistent with Hive since Hive replaces those bytes with U+FFFD
(REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. We can provide more behaviors on error
handling like ignoring them or reporting errors. IMPALA-10761 will focus
on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-20 13:28:30 +00:00
Zoltan Borok-Nagy
62028d00e6 IMPALA-10802: test_show_create_table and test_catalogs fails with Iceberg syntax error
Two Iceberg commits got into master branch in parallel. One of
them modified the DDL syntax, the other one added some tests.
They were correct on their own, but mixing the two causes
test failures.

The affected tests have been updated.

Change-Id: Id3cf6ff04b8da5782df2b84a580cdbd4a4a16d06
Reviewed-on: http://gerrit.cloudera.org:8080/17689
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-19 22:16:18 +00:00
Qifan Chen
36d8e6766e IMPALA-10763: Min/max filters should be enabled on Z-order sorted columns
This patch enables min/max filtering on any Z-order sort-by columns
by default.

Since the column stats for a row group or a page are computed from the
column values stored in the row group or the page, the current
infrastructure for min/max filtering works for Z-order out of the box.
The fact that these column values are ordered by Z-order is
orthogonal to the work of min/max filtering.

By default, the new feature is enabled. Set the existing control knob
minmax_filter_sorted_columns to false to turn it off.

Testing
  1. Added new z-order related sort column tests in
     overlap_min_max_filters_on_sorted_columns.test;
  2. Ran core-test.

Change-Id: I2a528ffbd0e333721ef38b4be7d4ddcdbf188adf
Reviewed-on: http://gerrit.cloudera.org:8080/17635
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-15 19:07:25 +00:00
Zoltan Borok-Nagy
474e022fda IMPALA-10626: Add support for Iceberg's Catalogs API
Iceberg recently switched to using its Catalogs class to define
catalog and table properties. Catalog information is stored in
a configuration file such as hive-site.xml, and the table properties
contain information about which catalog is being used and what
the Iceberg table id is.

E.g. in the Hive conf we can have the following properties to define
catalogs:

 iceberg.catalog.<catalog_name>.type = hadoop
 iceberg.catalog.<catalog_name>.warehouse = somelocation

 or

 iceberg.catalog.<catalog_name>.type = hive

And at the table level we can have the following:

iceberg.catalog = <catalog_name>
name = <table_identifier>

Table property 'iceberg.catalog' refers to a Catalog defined in the
configuration file. This is in contradiction with Impala's current
behavior where we are already using 'iceberg.catalog', and it can
have the following values:

 * hive.catalog for HiveCatalog
 * hadoop.catalog for HadoopCatalog
 * hadoop.tables for HadoopTables

To be backward-compatible and also support the new Catalogs properties,
Impala still recognizes the above special values. But from now on, Impala
doesn't define 'iceberg.catalog' by default. 'iceberg.catalog' being
NULL means HiveCatalog for both Impala and Iceberg's Catalogs API,
hence for Hive and Spark as well.

If 'iceberg.catalog' has a different value than the special values it
indicates that Iceberg's Catalogs API is being used, so Impala will
try to look up the catalog configuration from the Hive config file.
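
A hedged sketch of creating a table against such a catalog (the catalog
name is hypothetical and is assumed to match an
iceberg.catalog.<catalog_name>.* entry in the Hive conf):

  CREATE TABLE ice_t (i INT)
  STORED AS ICEBERG
  TBLPROPERTIES ('iceberg.catalog'='my_catalog');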

Testing:
 * added SHOW CREATE TABLE tests
 * added e2e tests that create/insert/drop Iceberg tables with Catalogs
 * manually tested interop behavior with Hive

Change-Id: I5dfa150986117fc55b28034c4eda38a736460ead
Reviewed-on: http://gerrit.cloudera.org:8080/17466
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-15 16:47:35 +00:00
Zoltan Borok-Nagy
d0749d59de IMPALA-10732: Use consistent DDL for specifying Iceberg partitions
Currently we have a DDL syntax for defining Iceberg partitions that
differs from SparkSQL:
https://iceberg.apache.org/spark-ddl/#partitioned-by

E.g. Impala is using the following syntax:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITION BY SPEC (i BUCKET 5, ts MONTH, d YEAR)
STORED AS ICEBERG;

The same in Spark is:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
USING ICEBERG
PARTITIONED BY (bucket(5, i), months(ts), years(d))

HIVE-25179 added the following syntax for Hive:

CREATE TABLE ice_t (i int, s string, ts timestamp, d date)
PARTITIONED BY SPEC (bucket(5, i), months(ts), years(d))
STORED BY ICEBERG;

I.e. the same syntax as Spark, but adding the keyword "SPEC".

This patch makes Impala use Hive's syntax, i.e. we will also
use the PARTITIONED BY SPEC clause + the unified partition
transform syntax.

Testing:
 * existing tests have been rewritten with the new syntax

Change-Id: Ib72ae445fd68fb0ab75d87b34779dbab922bbc62
Reviewed-on: http://gerrit.cloudera.org:8080/17575
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-15 15:15:07 +00:00
Zoltan Borok-Nagy
9d46255739 IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.

Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/

A new utility class, ParquetDataConverter, has been introduced to do
the data conversion. It also helps to decide whether data conversion
is needed or not.

NULL values are returned in case of overflows. This behavior is
consistent with Hive.

Parquet column stats reader is also updated to convert the decimal
values. The stats reader is used to evaluate min/max conjuncts. It
works well because later we also evaluate the conjuncts on the
converted values anyway.

The status of different filterings:
 * dictionary filtering: disabled for columns that need conversion
 * runtime bloom filters: work on the converted values
 * runtime min/max filters: work on the converted values

This patch also enables schema evolution of decimal columns of Iceberg
tables.

Testing:
 * added e2e tests

Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Reviewed-on: http://gerrit.cloudera.org:8080/17678
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2021-07-14 12:51:09 +00:00
Qifan Chen
84d784351c IMPALA-10738: Min/max filters should be enabled for partition columns
This patch enables min/max filters for partition columns to take
advantage of the min/max filter infrastructure already built and to
provide coverage for certain equi-joins in which the stats filters
are not feasible.

The new feature is turned on by default and to turn off the feature,
set the new query option minmax_filter_partition_column to false.

In the patch, the existing query option enabled_runtime_filter_types
is enforced in specifying the types of the filters generated. The
default value ALL generates both the bloom and min/max filters. The
alternative value BLOOM generates only the bloom filters and another
alternative value MIN_MAX generates only the min/max filters.

The normal control knobs minmax_filter_threshold (for threshold) and
minmax_filtering_level (for filtering level) still work. When the
threshold is 0, the patch automatically assigns a reasonable value
for the threshold.
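
A hedged sketch of the relevant knobs (the values shown are purely
illustrative):

  SET minmax_filter_partition_column=true;    -- new option, on by default
  SET enabled_runtime_filter_types=MIN_MAX;   -- generate only min/max filters
  SET minmax_filter_threshold=0.5;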

Testing:
  1). Added new tests in
      overlap_min_max_filters_on_partition_columns.test;
  2). Core tests

Change-Id: I89e135ef48b4bb36d70075287b03d1c12496b042
Reviewed-on: http://gerrit.cloudera.org:8080/17568
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-07-09 13:40:40 +00:00
Vihang Karajgaonkar
46ae99a36b Bump up the GBN to 14842939
This patch bumps up the GBN to 14842939. This build
includes HIVE-23995 and HIVE-24175 and some of the tests
were modified to take that into account.

Also, fixes a minor bug in environ.py

Testing done:
1. Core tests.

Change-Id: I78f167c1c0d8e90808e387aba0e86b697067ed8f
Reviewed-on: http://gerrit.cloudera.org:8080/17628
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2021-07-06 18:35:30 +00:00
Qifan Chen
d99caa1f3a IMPALA-10754: test_overlap_min_max_filters_on_sorted_columns failed during GVO
This patch addresses the following failure in ubuntu 16.04 dockerised
test:

  select straight_join count(a.timestamp_col) from
  alltypes_timestamp_col_only a join [SHUFFLE] alltypes_limited b
  where a.timestamp_col = b.timestamp_col and b.tinyint_col = 4

  aggregation(SUM, NumRuntimeFilteredPages): 58

   EXPECTED VALUE:
   58

   ACTUAL VALUE:
   59

   OP:
   :

In the patch, the result expectation is altered from "==58" to ">57".

Testing:
  1). Ran test_overlap_min_max_filters_on_sorted_columns multiple
      times under regular and local catalog mode.

Change-Id: I4f9eb198dc4e4b0ad1a17696a1d74ff05ac0a436
Reviewed-on: http://gerrit.cloudera.org:8080/17618
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-23 00:20:31 +00:00
Zoltan Borok-Nagy
c591b846c7 IMPALA-10166 (part 2): Enable DROP and CHANGE column
IMPALA-10166 (part 1) already added the necessary code for
DROP and CHANGE COLUMN, but disabled those stmts because to correctly
support schema evolution we had to wait for column resolution
by Iceberg field id.

Since then IMPALA-10361 and IMPALA-10485 added support for field-id
based column resolution for Parquet and ORC as well.

Hence this patch enables DROP and CHANGE column ALTER TABLE
statements. We still disallow REPLACE COLUMNS because it doesn't
really make sense for Iceberg tables as it basically makes all
existing data inaccessible.

Changing DECIMAL columns is still disabled due to IMPALA-7087.

Testing:
 * added e2e tests

Change-Id: I9b0d1a55bf0ed718724a69b51392ed53680ffa90
Reviewed-on: http://gerrit.cloudera.org:8080/17593
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2021-06-22 10:41:07 +00:00
Qifan Chen
40c3074e79 IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables
This patch enables min/max filters for equi-joins on lexical sort-by
columns in a Parquet table created by Impala by default. This is to
take advantage of Impala sorting the min/max values in column index
in each data file for the table. The control knob is query option
minmax_filter_sorted_columns, default to true.

When minmax_filter_sorted_columns is true, the patch will generate
min/max filters only for the leading sort columns. The normal control
knobs minmax_filter_threshold (for threshold) and
minmax_filtering_level (for filtering level) still work. When the
threshold is 0, the patch automatically assigns a reasonable value
for the threshold, and selects PAGE to be the filtering level.

In the backend, the skipped pages are quickly found by taking a
fast code path to identify the corresponding lower and the upper
bounds in the sorted min and max value arrays, given a range in the
filter.  The skipped pages are expressed as page ranges which are
translated into row ranges later on.

A new query option minmax_filter_fast_code_path is added to control
the work of the fast code path. It can take three values: ON (default),
OFF, or VERIFICATION. The last helps verify that the results
from both the fast and the regular code paths are the same.

Preliminary performance testing (joining into a simplified TPCH
lineitem table of 2 sorted BIGINT columns and a total of 6001215
rows) confirms that min/max filtering on leading sort-by columns
improves the performance of scan operators greatly. The best result
is seen with pages containing no more than 24000 rows: 84.62ms
(page-level filtering) vs. 115.27ms (row-group-level filtering)
vs. 137.14ms (no filtering). The query utilized is as follows.

  select straight_join a.l_orderkey from
  simpflified_lineitem a join [SHUFFLE] tpch_parquet.lineitem b
  where a.l_orderkey = b.l_orderkey and b.l_receiptdate = "1998-12-31"

Also fixed in the patch are the abnormal min/max display in the "Final
filter table" section of a profile for DECIMAL, TIMESTAMP and DATE
data types, and reading the DATE column index in batch without validation.

Testing:
  1). Added a new test overlap_min_max_filters_on_sorted_columns.test
      to verify
      a) Min/max filters are only created for the leading sort-by columns;
      b) Query option minmax_filter_sorted_columns works;
      c) Query option minmax_filter_fast_code_path works.
  2). Added new tests in parquet-page-index-test.cc to test fast
      code path under various conditions;
  3). Ran core tests successfully.

Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Reviewed-on: http://gerrit.cloudera.org:8080/17478
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-15 18:33:43 +00:00
Qifan Chen
4290c5e297 IMPALA-9355: TestExchangeMemUsage.test_exchange_mem_usage_scaling doesn't hit the memory limit
This patch restores the previous testing environment in which exchange
node 5 in the test query shown below reliably reports a memory limit
exceeded error.

  set mem_limit=171520k;
  set num_scanner_threads=1;
  select *
  from tpch_parquet.lineitem l1
    join tpch_parquet.lineitem l2 on l1.l_orderkey = l2.l_orderkey and
        l1.l_partkey = l2.l_partkey and l1.l_suppkey = l2.l_suppkey
        and l1.l_linenumber = l2.l_linenumber
  order by l1.l_orderkey desc, l1.l_partkey, l1.l_suppkey, l1.l_linenumber
  limit 5

In the patch, the memory limit for the entire query is lowered to 164MB,
less than 171520k, the minimum for the query to be accepted after
result spooling is enabled by default. This is achieved by
disabling result spooling, i.e. setting spool_query_results to false.

Testing:
  1). Ran exchange-mem-scaling.test

Change-Id: Ia4ad4508028645b67de419cfdfa2327d2847cfc4
Reviewed-on: http://gerrit.cloudera.org:8080/17586
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-14 17:13:25 +00:00
Amogh Margoor
8e7a6227a4 IMPALA-10730: Add MD5 function to compute 128-bit checksum for a string.
A built-in function has been added to compute the MD5 128-bit checksum
of a non-null string. If the input string is null, then the output of
the function is null too. In FIPS mode, MD5 is disabled and the function
will throw an error on invocation.
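
A minimal usage sketch based on the description above (the input
literal is arbitrary):

  SELECT md5('some string');            -- 128-bit checksum of the input
  SELECT md5(cast(NULL AS STRING));     -- NULL input yields NULL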

Testing:
1. Added expression unit tests.
2. Added end-to-end tests for MD5.

Change-Id: Id406d30a7cc6573212b302fbfec43eb848352ff2
Reviewed-on: http://gerrit.cloudera.org:8080/17567
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-10 22:24:16 +00:00
Amogh Margoor
a4373983df IMPALA-5569: Add statement ALTER TABLE UNSET TBLPROPERTIES/SERDEPROPERTIES
This patch adds the ability to unset or delete table properties or
serde properties for a table. It supports an 'IF EXISTS' clause in case
users are not sure whether the property being unset exists. Without
'IF EXISTS', trying to unset a property that doesn't exist will fail.
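
A hedged sketch of the new statements (the table and property names are
hypothetical, and the IF EXISTS placement follows the description
above):

  ALTER TABLE t UNSET TBLPROPERTIES ('prop1', 'prop2');
  ALTER TABLE t UNSET TBLPROPERTIES IF EXISTS ('maybe_missing_prop');
  ALTER TABLE t UNSET SERDEPROPERTIES IF EXISTS ('serde_prop');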

Tests:
1. Added Unit tests and end-to-end tests
2. Covered tables of different storage types like Kudu,
   Iceberg, and HDFS tables.
Change-Id: Ife4f6561dcdcd20c76eb299c6661c778e342509d
Reviewed-on: http://gerrit.cloudera.org:8080/17530
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-10 15:46:56 +00:00
Fucun Chu
ce21fe74b8 IMPALA-10689: Implement ds_cpc_union_f() function.
This function receives two strings that are serialized Apache
DataSketches CPC sketches. It unions the two sketches and returns the
resulting sketch of the union.

Example:
select ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2))
from sketch_tbl;
+---------------------------------------------------+
| ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) |
+---------------------------------------------------+
| 15                                                |
+---------------------------------------------------+

Change-Id: Ib5c616316bf2bf2ff437678e9a44a15339920150
Reviewed-on: http://gerrit.cloudera.org:8080/17440
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-07 12:53:23 +00:00
Daniel Becker
817ca5920d IMPALA-10640: Support reading Parquet Bloom filters - most common types
This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---------------------------------------
|Parquet type |  Impala type            |
|---------------------------------------|
|INT32        |  TINYINT, SMALLINT, INT |
|INT64        |  BIGINT                 |
|FLOAT        |  FLOAT                  |
|DOUBLE       |  DOUBLE                 |
|BYTE_ARRAY   |  STRING                 |
 ---------------------------------------

The following types are not supported for the given reasons:

 ----------------------------------------------------------------
|Impala type |  Problem                                          |
|----------------------------------------------------------------|
|VARCHAR(N)  | truncation can change hash                        |
|CHAR(N)     | padding / truncation can change hash              |
|DECIMAL     | multiple encodings supported                      |
|TIMESTAMP   | multiple encodings supported, timezone conversion |
|DATE        | not considered yet                                |
 ----------------------------------------------------------------

Support may be added for these types later, see IMPALA-10641.

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
    whether Parquet Bloom filtering works for the supported types and
    that we do not incorrectly discard row groups for the unsupported
    type VARCHAR. The Parquet file used in the test was generated with
    an external tool.
  - Added unit tests for ParquetBloomFilter in file
    be/src/util/parquet-bloom-filter-test.cc
  - A minor, unrelated change was done in
    be/src/util/bloom-filter-test.cc: the MakeRandom() function had
    return type uint64_t, the documentation claimed it returned a 64 bit
    random number, but the actual number of random bits is 32, which is
    what is intended in the tests. The return type and documentation
    have been corrected to use 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Reviewed-on: http://gerrit.cloudera.org:8080/17026
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2021-06-03 06:32:45 +00:00
Vihang Karajgaonkar
e24bdd2175 IMPALA-10700: Add query options to skip deleting stats
When a truncate table command is issued, in the case of
non-transactional tables, the table and column statistics for the table
are also deleted by default. This can be an expensive operation,
especially when many truncate table commands are running concurrently.
As the concurrency increases, the response time from the Hive metastore
for the delete table and column statistics RPC calls slows down.

In cases where truncate operation is used to remove the existing
data and then reload new data, it is likely that users will compute
stats again as soon as the new data is reloaded. This would overwrite
the existing statistics and hence the additional time spent by
the truncate operation to delete column and table statistics becomes
unnecessary.

To improve this, this change introduces a new query option:
DELETE_STATS_IN_TRUNCATE. The default value of this option is 1 or true,
which means stats will be deleted as part of the truncate operation.

As the name suggests, when this query option is set to false or 0,
a truncate operation will not delete the table and column statistics
for the table.
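
A hedged sketch of the reload workflow the option is aimed at (the
table name is hypothetical):

  SET DELETE_STATS_IN_TRUNCATE=0;   -- keep the existing stats
  TRUNCATE TABLE t;
  -- reload new data, then recompute stats:
  COMPUTE STATS t;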

This change also makes an improvement to the truncate operation on
tables which are replicated. If the table is being replicated,
previously the statistics were not getting deleted after truncate.
Now the statistics will get deleted after truncate.

Testing:
Modified truncate-table.test to include variations of this query
option and to make sure that the statistics are deleted or skipped
from deletion after the truncate operation.

Change-Id: I9400c3586b4bdf46d9b4056ea1023aabae8cc519
Reviewed-on: http://gerrit.cloudera.org:8080/17521
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-28 04:02:10 +00:00
Fucun Chu
67de4a48b0 IMPALA-10688: Implement ds_cpc_stringify() function
This function receives a string that is a serialized Apache
DataSketches CPC sketch and returns its stringified format.

The stringified format looks like the following and contains this data:

select ds_cpc_stringify(ds_cpc_sketch(float_col)) from
functional_parquet.alltypestiny;
+--------------------------------------------+
| ds_cpc_stringify(ds_cpc_sketch(float_col)) |
+--------------------------------------------+
| ### CPC sketch summary:                    |
|    lg_k           : 11                     |
|    seed hash      : 93cc                   |
|    C              : 2                      |
|    flavor         : 1                      |
|    merged         : true                   |
|    intresting col : 0                      |
|    table entries  : 2                      |
|    window         : not allocated          |
| ### End sketch summary                     |
|                                            |
+--------------------------------------------+

Change-Id: I8c9d089bfada6bebd078d8f388d2e146c79e5285
Reviewed-on: http://gerrit.cloudera.org:8080/17373
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2021-05-26 06:42:02 +00:00
Zoltan Borok-Nagy
ced7b7d221 IMPALA-10485: Support Iceberg field-id based column resolution in the ORC scanner
Currently the ORC scanner only supports position-based column
resolution. This patch adds Iceberg field-id based column resolution
which will be the default for Iceberg tables. It is needed to support
schema evolution in the future, i.e. ALTER TABLE DROP/RENAME COLUMNS.
(The Parquet scanner already supports Iceberg field-id based column
resolution)

Testing
 * added e2e test 'iceberg-orc-field-id.test' by copying the contents of
   nested-types-scanner-basic,
   nested-types-scanner-array-materialization,
   nested-types-scanner-position,
   nested-types-scanner-maps,
   and executing the queries on an Iceberg table with ORC data files

Change-Id: Ia2b1abcc25ad2268aa96dff032328e8951dbfb9d
Reviewed-on: http://gerrit.cloudera.org:8080/17398
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-20 19:19:50 +00:00
Zoltan Borok-Nagy
824b39e829 IMPALA-10433: Use Iceberg's fixed partition transforms
Because of an Iceberg bug Impala didn't push predicates to
Iceberg for dates/timestamps when the predicate referred to a
value before the UNIX epoch.

https://github.com/apache/iceberg/pull/1981 fixed the Iceberg
bug, and lately Impala switched to an Iceberg version that has
the fix, therefore this patch enables predicate pushdown for all
timestamp/date values.

The above Iceberg patch maintains backward compatibility with the
old, wrong behavior. Therefore we sometimes need to read one more
Iceberg partition than necessary.

Testing:
 * Updated current e2e tests

Change-Id: Ie67f41a53f21c7bdb8449ca0d27746158be7675a
Reviewed-on: http://gerrit.cloudera.org:8080/17417
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-19 00:59:35 +00:00
Amogh Margoor
86beb2f9dd IMPALA-10679: Add builtin functions to comptute SHA-1 and SHA-2 digest.
Built-in functions to compute the SHA-1 digest and the SHA-2 family of
digests have been added. Support for SHA-2 digests includes SHA224,
SHA256, SHA384 and SHA512. In FIPS mode SHA1, SHA224 and SHA256 are
disabled and will throw an error. SHA2 functions will also throw an
error for unsupported bit lengths, i.e. bit lengths other than 224,
256, 384 and 512.
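
A hedged usage sketch, assuming the functions are exposed as sha1() and
sha2(expr, bit_length):

  SELECT sha1('impala');        -- SHA-1 digest (disabled in FIPS mode)
  SELECT sha2('impala', 256);   -- SHA-256 digest
  SELECT sha2('impala', 512);   -- SHA-512 digest
  SELECT sha2('impala', 300);   -- error: unsupported bit length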

Testing:
1. Added Unit test for expressions.
2. Added end-to-end test for new functions.

Change-Id: If163b7abda17cca3074c86519d59bcfc6ace21be
Reviewed-on: http://gerrit.cloudera.org:8080/17464
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-18 18:48:10 +00:00
liuyao
9c38568657 IMPALA-10696: fix accuracy problem
Table alltypes has no statistics, so the cardinality of alltypes
is estimated based on the HDFS files and the average row size.
When PrintUtils.printMetric is called, a double is divided by a long,
which can introduce precision errors. In most cases the calculated
row count prints as 17.91K, but due to the precision problem here
the calculated value is 17.90K.

I modified line 221 of stats-extrapolation.test to use row_regex
matching, following the way cardinality is matched on line 224; in
this case their values are the same.

Testing:
metadata/test_stats_extrapolation.py

Change-Id: I0a1a3809508c90217517705b2b188b2ccba6f23f
Reviewed-on: http://gerrit.cloudera.org:8080/17411
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Jim Apple <jbapple@apache.org>
2021-05-15 22:20:58 +00:00
Csaba Ringhofer
4697db0214 IMPALA-5121: Fix AVG() on timestamp col with use_local_tz_for_unix_timestamp_conversions
AVG used to perform a back-and-forth timezone conversion when
use_local_tz_for_unix_timestamp_conversions is true. This could
affect the results if the values fell under different DST rules.
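
A hedged sketch of the affected pattern (table and column are
hypothetical; the flag is an impalad startup flag, not a query option):

  -- impalad started with --use_local_tz_for_unix_timestamp_conversions=true
  -- Averaging timestamp values that fall under different DST rules
  -- used to give a slightly shifted result:
  SELECT avg(ts_col) FROM events_tbl;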

Note that AVG on timestamps has other issues besides this, see
IMPALA-7472 for details.

Testing:
- added a regression test

Change-Id: I999099de8e07269b96b75d473f5753be4479cecd
Reviewed-on: http://gerrit.cloudera.org:8080/17412
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-14 13:58:04 +00:00
Fucun Chu
b1326f7eff IMPALA-10687: Implement ds_cpc_union() function
This function receives a set of serialized Apache DataSketches CPC
sketches produced by ds_cpc_sketch() and merges them into a single
sketch.

An example usage is to create a sketch for each partition of a table
and write these sketches to a separate table; depending on which
partitions the user is interested in, the relevant sketches can be
unioned together to get an estimate. E.g.:
  SELECT
      ds_cpc_estimate(ds_cpc_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;

Testing:
  - Apart from the automated tests added with this patch, I also
    tested ds_cpc_union() on a bigger dataset to check that the
    serialization, deserialization and merging steps work well. I
    took TPCH25.lineitem, created a number of sketches grouped by
    l_shipdate and called ds_cpc_union() on those sketches.

Change-Id: Ib94b45ae79efcc11adc077dd9df9b9868ae82cb6
Reviewed-on: http://gerrit.cloudera.org:8080/17372
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-13 12:18:55 +00:00
Fucun Chu
e39c30b3cd IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
These functions can be used to get cardinality estimates of data
using the CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.
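
A hedged usage sketch on a hypothetical table:

  -- One-shot estimate:
  SELECT ds_cpc_estimate(ds_cpc_sketch(user_id)) FROM visits;

  -- Or persist the sketch for later use:
  CREATE TABLE sketch_tbl STORED AS PARQUET AS
  SELECT ds_cpc_sketch(user_id) AS sketch_col FROM visits;
  SELECT ds_cpc_estimate(sketch_col) FROM sketch_tbl;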

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
For a Figures-of-Merit comparison of the HLL and CPC sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare performance
   with ndv(). Depending on data characteristics, ndv() appears 2x-3x
   faster. CPC gives a closer estimate than the current ndv(), and is
   more accurate than HLL in some cases.

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Reviewed-on: http://gerrit.cloudera.org:8080/16656
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-11 18:07:40 +00:00
Zoltan Borok-Nagy
e26543426c IMPALA-9967: Add support for reading ORC's TIMESTAMP WITH LOCAL TIMEZONE
ORC-189 and ORC-666 added support for a new timestamp type
'TIMESTAMP WITH LOCAL TIMEZONE' to the ORC library.

This patch adds support for reading such timestamps with Impala.
These are UTC-normalized timestamps, therefore we convert them
to the local timezone during scanning.
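
A hedged illustration (table and column are hypothetical): since the
stored values are UTC-normalized, the same row can read back differently
depending on the session timezone:

  SET TIMEZONE='UTC';
  SELECT ts_ltz_col FROM orc_ltz_tbl;   -- e.g. 2021-05-07 14:00:00

  SET TIMEZONE='Europe/Budapest';
  SELECT ts_ltz_col FROM orc_ltz_tbl;   -- e.g. 2021-05-07 16:00:00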

Testing:
 * added test for CREATE TABLE LIKE ORC
 * added scanner tests to test_scanners.py

Change-Id: Icb0c6a43ebea21f1cba5b8f304db7c4bd43967d9
Reviewed-on: http://gerrit.cloudera.org:8080/17347
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-07 14:11:08 +00:00
Zoltan Borok-Nagy
f0f083e45e IMPALA-10482, IMPALA-10493: Fix bugs in full ACID collection query rewrites
IMPALA-10482: A SELECT * query on an unrelative collection column of a
transactional ORC table hits an IllegalStateException.

The AcidRewriter will rewrite queries like
"select item from my_complex_orc.int_array" to
"select item from my_complex_orc t, t.int_array"

This causes trouble in star expansion, because the original query
"select * from my_complex_orc.int_array" is analyzed as
"select item from my_complex_orc.int_array"

But the rewritten query "select * from my_complex_orc t, t.int_array" is
analyzed as "select id, item from my_complex_orc t, t.int_array".

Hidden table refs can also cause issues during regular column
resolution. E.g. when the table has top-level 'pos'/'item'/'key'/'value'
columns.

The workaround is to keep track of the automatically added table refs
during query rewrite, so that when we analyze the rewritten query we
can ignore these auxiliary table refs.

IMPALA-10493: Using JOIN ON syntax to join two full ACID collections
produces wrong results.

When AcidRewriter.splitCollectionRef() creates a new collection ref
it doesn't copy all the information needed to correctly execute the
query, e.g. it dropped the ON clause, turning INNER JOINs into CROSS
JOINs.
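
A hedged sketch of the affected query shape (table and columns are
hypothetical): joining two collections of a full ACID table with
explicit JOIN ON syntax, where losing the ON clause during the rewrite
silently turned the INNER JOIN into a CROSS JOIN:

  SELECT a.item, b.item
  FROM full_acid_orc_tbl t,
       t.int_array_1 a JOIN t.int_array_2 b ON a.item = b.item;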

Testing:
 * added e2e tests

Change-Id: I8fc758d3c1e75c7066936d590aec8bff8d2b00b0
Reviewed-on: http://gerrit.cloudera.org:8080/17038
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-03 20:42:30 +00:00
Csaba Ringhofer
676f79aa81 IMPALA-10691: Fix multithreading related crash in CAST FORMAT
The issue occurs when a CastFormatExpr is shared among threads and
multiple threads call its OpenEvaluator(). Later calls delete the
DateTimeFormatContext created by earlier calls, which makes
fn_ctx->GetFunctionState() a dangling pointer.

This only happens when CastFormatExpr is shared among
FragmentInstances; in the case of scanner threads, OpenEvaluator() is
called with THREAD_LOCAL and returns early without modifying anything.
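
For reference, a hedged sketch of the kind of expression that goes
through CastFormatExpr (the crash itself also required the expression
to be shared across fragment instances; table and column names are
hypothetical):

  SELECT CAST('2021-04-30' AS TIMESTAMP FORMAT 'YYYY-MM-DD');
  SELECT CAST(ts_col AS STRING FORMAT 'YYYY-MM-DD HH24:MI:SS') FROM tbl;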

Testing:
- added a regression test

Change-Id: I501c8a184591b1c836b2ca4cada1f2117f9f5c99
Reviewed-on: http://gerrit.cloudera.org:8080/17374
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-30 23:36:31 +00:00
Amogh Margoor
af6adf7618 IMPALA-10654: Fix precision loss in DecimalValue to double conversion.
The original approach to convert DecimalValue (the internal
representation of decimals) to double was not accurate.
It was:
           static_cast<double>(value_) / pow(10.0, scale).
However, only integers from -2^53 to 2^53 can be represented
accurately by double precision without any loss.
Hence, it would not work for numbers like -0.43149576573887316.
For the DecimalValue representing -0.43149576573887316, value_ would be
-43149576573887316 and scale would be 17. As value_ < -2^53, the
result would not be accurate. The new approach uses the third-party
library https://github.com/lemire/fast_double_parser, which handles
the above scenario in a performant manner.
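
A hedged SQL illustration of the conversion in question:

  -- value_ = -43149576573887316, scale = 17; the cast to DOUBLE used to
  -- lose precision in the last digits:
  SELECT CAST(CAST(-0.43149576573887316 AS DECIMAL(17,17)) AS DOUBLE);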

Testing:
1. Added end-to-end tests covering the following scenarios:
    a. Test to show the precision limitation of 16 in the write path
    b. DecimalValue's value_ between -2^53 and 2^53
    c. value_ outside the above range but abs(value_) < UINT64_MAX
    d. abs(value_) > UINT64_MAX - covers DecimalValue<__int128_t>
2. Ran existing backend and end-to-end tests completely

Change-Id: I56f0652cb8f81a491b87d9b108a94c00ae6c99a1
Reviewed-on: http://gerrit.cloudera.org:8080/17303
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-27 14:52:43 +00:00
Csaba Ringhofer
9355b25e11 IMPALA-10662: Change EE tests to return the same results for HS2 as Beeswax
In EE tests HS2 returned results with smaller precision than Beeswax for
FLOAT/DOUBLE/TIMESTAMP types. These differences are not inherent to the
HS2 protocol - the results are returned with full precision in Thrift
and lose precision during conversion in client code.

This patch changes the conversion in HS2 to match Beeswax and removes
the test section DBAPI_RESULTS that was used to handle the differences:
- float/double: print method is changed from str() to ":.16".format()
- timestamp: impyla's cursor is created with convert_types=False to
             avoid conversion to datetime.datetime (which has only
             microsec precision)

Note that FLOAT/DOUBLE are still different in impala-shell, this change
only deals with EE tests.

Testing:
- ran the changed tests

Change-Id: If69ae90c6333ff245c2b951af5689e3071f85cb2
Reviewed-on: http://gerrit.cloudera.org:8080/17325
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-20 22:21:32 +00:00
Qifan Chen
a985e1134e IMPALA-10647: Improve always-true min/max filter handling in coordinator
The change improves how a coordinator behaves when a newly
arriving min/max filter is always true. A new member
'always_true_filter_received_' is introduced to record this
fact. Similarly, the new member always_false_flipped_to_false_
is added to indicate that the always-false flag was flipped from
'true' to 'false'. These two members only influence how the min
and max columns in the "Filter routing table" and "Final filter
table" in the profile are displayed, as follows:

  1. 'PartialUpdates' - The min and the max are partially updated;
  2. 'AlwaysTrue'     - One received filter is AlwaysTrue;
  3. 'AlwaysFalse'    - No filter is received or all received
                        filters are empty;
  4. 'Real values'    - The final accumulated min/max from all
                        received filters.

A second change is to record, in the scan node, the arrival
time of min/max filters (as a timestamp since system boot,
obtained by calling MonotonicMillis()). A similar timestamp is
recorded for HDFS Parquet scanners when a row group is
processed. By comparing these two timestamps, one can easily
diagnose issues related to the late arrival of min/max filters.

This change also addresses a flaw where rows were unexpectedly
filtered out because the always_true_ flag in a min/max filter,
when set, was ignored in the eval code path in
RuntimeFilter::Eval().

Testing:
  1. Added three new tests in overlap_min_max_filters.test to
     verify that the min/max are displayed correctly when the
     min/max filter in hash join builder is set to always true,
     always false, or a pair of meaningful min and max values.
  2. Ran unit tests;
  3. Ran runtime-filter-test;
  4. Ran core tests successfully.

Change-Id: I326317833979efcbe02ce6c95ad80133dd5c7964
Reviewed-on: http://gerrit.cloudera.org:8080/17252
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-20 00:44:48 +00:00
Qifan Chen
1231208da7 IMPALA-10494: Making use of the min/max column stats to improve min/max filters
This patch adds the functionality to compute the minimal and the maximal
value for column types of integer, float/double, date, or decimal for
Parquet tables, and to make use of the new stats to discard min/max
filters, in both hash join builders and Parquet scanners, when their
coverage is too close to the actual range defined by the column min
and max.

The computation and display of the new column min/max stats can be
controlled by two new Boolean query options (both default to false):
  1. compute_column_minmax_stats
  2. show_column_minmax_stats

Usage examples.

  set compute_column_minmax_stats=true;
  compute stats tpcds_parquet.store_sales;

  set show_column_minmax_stats=true;
  show column stats tpcds_parquet.store_sales;

+-----------------------+--------------+-...-------+---------+---------+
| Column                | Type         |   #Falses | Min     | Max     |
+-----------------------+--------------+-...-------+---------+---------+
| ss_sold_time_sk       | INT          |   -1      | 28800   | 75599   |
| ss_item_sk            | BIGINT       |   -1      | 1       | 18000   |
| ss_customer_sk        | INT          |   -1      | 1       | 100000  |
| ss_cdemo_sk           | INT          |   -1      | 15      | 1920797 |
| ss_hdemo_sk           | INT          |   -1      | 1       | 7200    |
| ss_addr_sk            | INT          |   -1      | 1       | 50000   |
| ss_store_sk           | INT          |   -1      | 1       | 10      |
| ss_promo_sk           | INT          |   -1      | 1       | 300     |
| ss_ticket_number      | BIGINT       |   -1      | 1       | 240000  |
| ss_quantity           | INT          |   -1      | 1       | 100     |
| ss_wholesale_cost     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_list_price         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sales_price        | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_discount_amt   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_sales_price    | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_wholesale_cost | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_list_price     | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_ext_tax            | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_coupon_amt         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid           | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_paid_inc_tax   | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_net_profit         | DECIMAL(7,2) |   -1      | -1      | -1      |
| ss_sold_date_sk       | INT          |   -1      | 2450816 | 2452642 |
+-----------------------+--------------+-...-------+---------+---------+

Only the min/max values for non-partition columns are stored in HMS.
The min/max values for partition columns are computed in the coordinator.

The min-max filters, in C++ class or protobuf form, are augmented to
deal with the always true state better. Once always true is set, the
actual min and max values in the filter are no longer populated.

Testing:
 - Added new compute/show stats tests in
   compute-stats-column-minmax.test;
 - Added new tests in overlap_min_max_filters.test to demonstrate the
   usefulness of column stats to quickly disable useless filters in
   both hash join builder and Parquet scanner;
 - Added tests in min-max-filter-test.cc to demonstrate method Or(),
   ToProtobuf() and constructor can deal with always true flag well;
 - Tested with TPCDS 3TB to demonstrate the usefulness of the min
   and max column stats in disabling min/max filters that are not
   useful.
 - core tests.

TODO:
 1. IMPALA-10602: Intersection of multiple min/max filters when
    applying to common equi-join columns;
 2. IMPALA-10601: Creating lineitem_orderkey_only table in
    tpch_parquet database;
 3. IMPALA-10603: Enable min/max overlap filter feature for Iceberg
    tables with Parquet data files;
 4. IMPALA-10617: Compute min/max column stats beyond parquet tables.

Change-Id: I08581b44419bb8da5940cbf98502132acd1c86df
Reviewed-on: http://gerrit.cloudera.org:8080/17075
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-02 21:50:17 +00:00