mirror of
https://github.com/apache/impala.git
synced 2025-12-19 09:58:28 -05:00
Virtual column FILE__POSITION returns the ordinal position of the row in the data file. It will be useful to add support for Iceberg's position-based delete files This patch only adds FILE__POSITION to Parquet tables. It works similarly to the handling of collection position slots. I.e. we add the responsibility of dealing with the file position slot to an existing column reader. Because of page-filtering and late materialization we already tracked the file position in member 'current_row_' during scanning. Querying the FILE__POSITION in other file formats raises an error. Testing: * added e2e tests Change-Id: I4ef72c683d0d5ae2898bca36fa87e74b663671f7 Reviewed-on: http://gerrit.cloudera.org:8080/18704 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This file was created for: IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups. IMPALA-2466: Add more tests to the HDFS parquet scanner. IMPALA-5717: Add tests for HDFS orc scanner. The table lineitem_multiblock is a single parquet file with: - A row group size of approximately 12 KB each. - 200 row groups in total. Assuming a 1 MB HDFS block size, it has: - 3 blocks of up to 1 MB each. - Multiple row groups per block - Some row groups that span across block boundaries and live on 2 blocks. ---- This table was created using hive and has the same table structure and some of the data of 'tpch.lineitem'. The following commands were used: create table functional_parquet.lineitem_multiblock like tpch.lineitem stored as parquet; set parquet.block.size=4086; # This is to set the row group size insert into functional_parquet.lineitem_multiblock select * from tpch.lineitem limit 20000; # We limit to 20000 to keep the size of the table small 'lineitem_sixblocks' was created the same way but with more rows, so that we got more blocks. 'lineitem_multiblock_one_row_group' was created similarly but with a much higher 'parquet.block.size' so that everything fit in one row group. 'lineitem_multiblock_variable_num_rows' was created similarly, but with 'parquet.block.size'=80000 so we have a bit fewer row groups, but the real point is that the number of rows in the row groups are not the same. Also, the source table was lineitem_multiblock so the resulting table will have the same rows in the same order in the data file. ---- The orc files are created by the following hive queries: use functional_orc_def; set orc.stripe.size=1024; set orc.compress=ZLIB; create table lineitem_threeblocks like tpch.lineitem stored as orc; create table lineitem_sixblocks like tpch.lineitem stored as orc; insert overwrite table lineitem_threeblocks select * from tpch.lineitem limit 16000; insert overwrite table lineitem_sixblocks select * from tpch.lineitem limit 30000; set orc.stripe.size=67108864; create table lineitem_orc_multiblock_one_stripe like tpch.lineitem stored as orc; insert overwrite table lineitem_orc_multiblock_one_stripe select * from tpch.lineitem limit 16000;