impala

mirror of https://github.com/apache/impala.git synced 2026-02-03 09:00:39 -05:00

Zoltan Borok-Nagy 522ee1fcc0 IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables

Virtual column FILE__POSITION returns the ordinal position of the row
in the data file. It will be useful for adding support for Iceberg's
position-based delete files.

This patch only adds FILE__POSITION to Parquet tables. It works
similarly to the handling of collection position slots, i.e. an
existing column reader takes on the responsibility of filling the
file position slot. Because of page filtering and late
materialization, the file position was already tracked in member
'current_row_' during scanning.

Querying FILE__POSITION in other file formats raises an error.

Testing:
 * added e2e tests

Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7
Reviewed-on: http://gerrit.cloudera.org:8080/18704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2022-08-12 19:21:55 +00:00

000000_0

IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.

2015-10-05 11:30:39 -07:00

lineitem_multiblock_variable_num_rows.parquet

IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables

2022-08-12 19:21:55 +00:00

lineitem_one_row_group.parquet

IMPALA-2466: Add more tests for the HDFS parquet scanner.

2016-03-25 13:10:15 +00:00

lineitem_orc_multiblock_one_stripe.orc

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

lineitem_sixblocks.orc

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

lineitem_sixblocks.parquet

IMPALA-2466: Add more tests for the HDFS parquet scanner.

2016-03-25 13:10:15 +00:00

lineitem_threeblocks.orc

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

README.dox

IMPALA-11350: Add virtual column FILE__POSITION for Parquet tables

2022-08-12 19:21:55 +00:00

README.dox

This file was created for:
IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.
IMPALA-2466: Add more tests to the HDFS parquet scanner.
IMPALA-5717: Add tests for HDFS orc scanner.

The table lineitem_multiblock is a single Parquet file with:
- Row groups of approximately 12 KB each.
- 200 row groups in total.

Assuming a 1 MB HDFS block size, it has:
- 3 blocks of up to 1 MB each.
- Multiple row groups per block.
- Some row groups that span block boundaries and live on 2 blocks.
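
The layout described above can be checked with a little arithmetic. The sketch below is only a model: it uses the approximate figures from this README (12 KB row groups, 200 of them, 1 MB blocks), packs the row groups back to back, and ignores the Parquet header and footer. The names `layout`, `spanning` and `owner` are made up for illustration, and the midpoint rule for picking one block per row group is an assumption, not a statement of how the scanner assigns row groups.

```python
# Illustrative model only: the sizes are the approximate figures quoted in
# the README, not values read from the actual data file.
KB = 1024
BLOCK_SIZE = 1024 * KB     # assumed 1 MB HDFS block size
ROW_GROUP_SIZE = 12 * KB   # approximate row group size
NUM_ROW_GROUPS = 200


def layout(num_row_groups, row_group_size, block_size):
    """Return (num_blocks, spanning, owner) for back-to-back row groups.

    spanning: indexes of row groups whose bytes fall on two blocks.
    owner:    for each row group, the block holding its midpoint
              (hypothetical assignment rule, used here for illustration).
    """
    file_size = num_row_groups * row_group_size
    num_blocks = -(-file_size // block_size)  # ceiling division
    spanning = []
    owner = []
    for i in range(num_row_groups):
        start = i * row_group_size
        end = start + row_group_size - 1      # last byte, inclusive
        if start // block_size != end // block_size:
            spanning.append(i)
        owner.append((start + row_group_size // 2) // block_size)
    return num_blocks, spanning, owner


num_blocks, spanning, owner = layout(NUM_ROW_GROUPS, ROW_GROUP_SIZE, BLOCK_SIZE)
print(num_blocks)  # 3
print(spanning)    # [85, 170]
```

With these assumed sizes the model gives 3 blocks and two row groups that cross a block boundary. The real file will differ in detail because row group sizes are only approximate.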

----

This table was created using Hive and has the same table structure as 'tpch.lineitem'
and a subset of its data.

The following commands were used:

create table functional_parquet.lineitem_multiblock like tpch.lineitem
stored as parquet;

set parquet.block.size=4086; # This is to set the row group size

insert into functional_parquet.lineitem_multiblock select * from
tpch.lineitem limit 20000; # We limit to 20000 to keep the size of the table small

'lineitem_sixblocks' was created the same way but with more rows, so that the file spans
more blocks.

'lineitem_multiblock_one_row_group' was created similarly but with a much higher
'parquet.block.size' so that everything fit in one row group.

'lineitem_multiblock_variable_num_rows' was created similarly, but with
'parquet.block.size'=80000, so it has somewhat fewer row groups. The real point is that
the row groups do not all contain the same number of rows. Also, the source table was
lineitem_multiblock, so the resulting table has the same rows in the same order in
the data file.
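
The file listing associates this file with IMPALA-11350, which describes FILE__POSITION as the ordinal position of a row in the data file. The sketch below models why unequal row counts matter: a row's position is the number of rows in all earlier row groups plus its index within its own row group, so a reader cannot assume a fixed stride per row group. The helper names and row counts are hypothetical, positions are assumed to start at 0, and this is not Impala's implementation.

```python
def row_group_start_positions(rows_per_group):
    """File position of the first row of each row group."""
    starts = []
    total = 0
    for num_rows in rows_per_group:
        starts.append(total)
        total += num_rows
    return starts


def file_position(rows_per_group, group_idx, row_in_group):
    """Ordinal position in the file of one row (0-based, by assumption)."""
    if not 0 <= row_in_group < rows_per_group[group_idx]:
        raise IndexError("row index outside row group")
    return row_group_start_positions(rows_per_group)[group_idx] + row_in_group


# Hypothetical row counts for three row groups of different sizes.
rows_per_group = [5, 3, 7]
starts = row_group_start_positions(rows_per_group)
print(starts)                                # [0, 5, 8]
print(file_position(rows_per_group, 1, 2))   # 7
print(file_position(rows_per_group, 2, 0))   # 8
```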

----

The ORC files were created with the following Hive queries:

use functional_orc_def;

set orc.stripe.size=1024;
set orc.compress=ZLIB;
create table lineitem_threeblocks like tpch.lineitem stored as orc;
create table lineitem_sixblocks like tpch.lineitem stored as orc;
insert overwrite table lineitem_threeblocks select * from tpch.lineitem limit 16000;
insert overwrite table lineitem_sixblocks select * from tpch.lineitem limit 30000;

set orc.stripe.size=67108864;
create table lineitem_orc_multiblock_one_stripe like tpch.lineitem stored as orc;
insert overwrite table lineitem_orc_multiblock_one_stripe select * from
tpch.lineitem limit 16000;