These tests functionally verify that the following types of files can be scanned properly (example queries are sketched below):

1) A parquet file with multiple blocks, such that each node has to scan multiple blocks.

2) A parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case.
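The actual test queries are not part of this file. As a hedged illustration only, a full-table scan is the kind of statement that exercises both cases, since it forces every scan range of the file to be read (assuming the single-row-group table lives in the same functional_parquet database as the others):

select count(*) from functional_parquet.lineitem_multiblock; # touches every block, so each node has to read several blocks and row groups

select count(*) from functional_parquet.lineitem_multiblock_one_row_group; # same scan, but only one scan range should do any work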
This file was created for:

IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.
IMPALA-2466: Add more tests to the HDFS parquet scanner.

The table lineitem_multiblock is a single parquet file with:
- Row groups of approximately 12 KB each.
- 200 row groups in total.

Assuming a 1 MB HDFS block size, it has:
- 3 blocks of up to 1 MB each.
- Multiple row groups per block.
- Some row groups that span across block boundaries and live on 2 blocks.

----

This table was created using Hive and has the same table structure and some of the data of 'tpch.lineitem'. The following commands were used:

create table functional_parquet.lineitem_multiblock like tpch.lineitem stored as parquet;

set parquet.block.size=4086; # This is to set the row group size

insert into functional_parquet.lineitem_multiblock select * from tpch.lineitem limit 20000; # We limit to 20000 to keep the size of the table small

'lineitem_sixblocks' was created the same way but with more rows, so that we got more blocks.

'lineitem_multiblock_one_row_group' was created similarly but with a much higher 'parquet.block.size' so that everything fit in one row group.
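The exact statements for these last two tables are not recorded here. The following is a minimal sketch of how they could be reproduced with the same pattern; the LIMIT values and the larger 'parquet.block.size' are hypothetical placeholders, not the values that were actually used:

create table functional_parquet.lineitem_sixblocks like tpch.lineitem stored as parquet;

set parquet.block.size=4086; # same small row group size as above

insert into functional_parquet.lineitem_sixblocks select * from tpch.lineitem limit 60000; # hypothetical limit; just "more rows" so that more blocks are produced

create table functional_parquet.lineitem_multiblock_one_row_group like tpch.lineitem stored as parquet;

set parquet.block.size=268435456; # hypothetical much higher value so everything fits in one row group

insert into functional_parquet.lineitem_multiblock_one_row_group select * from tpch.lineitem limit 20000; # hypothetical limit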