mirror of
https://github.com/apache/impala.git
synced 2025-12-30 12:02:10 -05:00
These tests functionally test whether the following types of files can be scanned properly:
1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks.
2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case.

Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368
Reviewed-on: http://gerrit.cloudera.org:8080/1500
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This file was created for:
IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.
IMPALA-2466: Add more tests to the HDFS parquet scanner.

The table lineitem_multiblock is a single parquet file with:
- Row groups of approximately 12 KB each.
- 200 row groups in total.

Assuming a 1 MB HDFS block size, it has:
- 3 blocks of up to 1 MB each.
- Multiple row groups per block.
- Some row groups that span block boundaries and live on 2 blocks.
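The block layout described above follows from simple arithmetic, sketched here in Python. This is an illustration only: it assumes every row group is exactly 12 KB and ignores the Parquet header and footer, so the real file's boundaries will differ somewhat. The function name `layout` and the constants are hypothetical and not part of Impala.

```python
# Sketch of the layout arithmetic. All sizes are the approximate figures
# from the description above, not measurements of the real file.
ROW_GROUP_SIZE = 12 * 1024   # ~12 KB per row group (assumed uniform)
NUM_ROW_GROUPS = 200
BLOCK_SIZE = 1024 * 1024     # assumed 1 MB HDFS block size


def layout(row_group_size, num_row_groups, block_size):
    """Return (block count, indexes of row groups that straddle a block boundary)."""
    file_size = row_group_size * num_row_groups
    num_blocks = -(-file_size // block_size)  # ceiling division
    straddling = []
    for i in range(num_row_groups):
        first_byte = i * row_group_size
        last_byte = first_byte + row_group_size - 1
        # A row group straddles a boundary when its first and last bytes
        # fall into different blocks.
        if first_byte // block_size != last_byte // block_size:
            straddling.append(i)
    return num_blocks, straddling


num_blocks, straddling = layout(ROW_GROUP_SIZE, NUM_ROW_GROUPS, BLOCK_SIZE)
print(num_blocks, straddling)  # prints: 3 [85, 170]
```

Under these assumptions the file is about 2.4 MB, so it occupies 3 blocks, and one row group crosses each of the two internal block boundaries.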
----

This table was created using Hive and has the same table structure and some of the data of
'tpch.lineitem'.

The following commands were used:

create table functional_parquet.lineitem_multiblock like tpch.lineitem
stored as parquet;

set parquet.block.size=4086; # This is to set the row group size

insert into functional_parquet.lineitem_multiblock select * from
tpch.lineitem limit 20000; # We limit to 20000 to keep the size of the table small

'lineitem_sixblocks' was created the same way but with more rows, so that we got more
blocks.

'lineitem_multiblock_one_row_group' was created similarly but with a much higher
'parquet.block.size' so that everything fit in one row group.
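For the one-row-group table, the commit description says only one scan range should do any work. The Python sketch below shows one way that can come about. It assumes there is one scan range per HDFS block and that a row group is processed by the range containing its midpoint. Both are assumptions made for illustration, not statements taken from this file, and `assign_by_midpoint` and the sizes used are hypothetical.

```python
def assign_by_midpoint(row_groups, block_size, num_blocks):
    """Map block index -> indexes of row groups whose midpoint lies in that block.

    row_groups is a list of (start_offset, length) pairs in bytes.
    Assumes one scan range per block (an illustrative simplification).
    """
    work = {block: [] for block in range(num_blocks)}
    for idx, (start, length) in enumerate(row_groups):
        midpoint = start + length // 2
        work[midpoint // block_size].append(idx)
    return work


BLOCK_SIZE = 1024 * 1024  # assumed 1 MB HDFS block size

# Hypothetical file: a single row group of 2,400,000 bytes spanning 3 blocks.
one_row_group = [(0, 2_400_000)]
work = assign_by_midpoint(one_row_group, BLOCK_SIZE, num_blocks=3)

# Blocks (scan ranges) that actually have a row group to process.
busy = [block for block, groups in work.items() if groups]
print(busy)  # prints: [1]
```

With a single row group covering the whole file, only the block that holds the midpoint gets any work, and the other two ranges have nothing to scan.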