IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups

Impala supports reading Parquet files with multiple row groups, but performance can degrade because of remote reads. This patch maximizes scan locality by letting multiple impalads scan the row groups in their local splits: each impalad starts a new scan range for every split local to it if that split contains row groups that need to be scanned.

Change-Id: Iaecc5fb8e89364780bc59dbfa9ae51d0d124d16e
Reviewed-on: http://gerrit.cloudera.org:8080/908
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
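The per-split scan-range decision described above can be sketched roughly as follows. This is a minimal, hypothetical C++ sketch, not Impala's actual HdfsParquetScanner code: the `ByteRange` struct and `SplitsToScan` function are illustrative names, and the rule of assigning each row group to whichever split contains its midpoint (so a row group straddling a block boundary is scanned by exactly one node) is an assumption made to keep the example self-contained.

#include <cstdint>
#include <iostream>
#include <vector>

// A contiguous byte range in the file: either a row group or an HDFS split.
struct ByteRange { int64_t offset; int64_t len; };

// Returns the splits (from 'local_splits') that should start a scan range on
// this node, i.e. those containing the midpoint of at least one row group.
std::vector<ByteRange> SplitsToScan(const std::vector<ByteRange>& row_groups,
                                    const std::vector<ByteRange>& local_splits) {
  std::vector<ByteRange> to_scan;
  for (const ByteRange& split : local_splits) {
    bool owns_row_group = false;
    for (const ByteRange& rg : row_groups) {
      int64_t mid = rg.offset + rg.len / 2;
      if (mid >= split.offset && mid < split.offset + split.len) {
        owns_row_group = true;
        break;
      }
    }
    if (owns_row_group) to_scan.push_back(split);
  }
  return to_scan;
}

int main() {
  // Hypothetical layout matching the test table below: 200 row groups of
  // ~12 KB each, packed into 1 MB HDFS blocks.
  const int64_t kRowGroupLen = 12 * 1024;
  const int64_t kBlockLen = 1 << 20;
  std::vector<ByteRange> row_groups;
  for (int64_t i = 0; i < 200; ++i) {
    row_groups.push_back({4 + i * kRowGroupLen, kRowGroupLen});
  }
  // Suppose only the first and third of the file's three blocks are local.
  std::vector<ByteRange> local_splits = {{0, kBlockLen}, {2 * kBlockLen, kBlockLen}};
  for (const ByteRange& s : SplitsToScan(row_groups, local_splits)) {
    std::cout << "start scan range at offset " << s.offset << "\n";
  }
  return 0;
}

With this layout the sketch starts one scan range per local split, while the remote middle block is left for whichever impalad holds it locally.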
This file was created for:
IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.

The table lineitem_multiblock is a single Parquet file with:
 - Row groups of approximately 12 KB each.
 - 200 row groups in total.

Assuming a 1 MB HDFS block size (200 x ~12 KB is roughly 2.4 MB of data), it has:
 - 3 blocks of up to 1 MB each.
 - Multiple row groups per block.
 - Some row groups that span block boundaries and live on 2 blocks.

----
This table was created using Hive and has the same table structure and some of the data of 'tpch.lineitem'. The following commands were used:

create table functional_parquet.lineitem_multiblock like tpch.lineitem stored as parquet;

set parquet.block.size=4086; # This sets the row group size

insert into functional_parquet.lineitem_multiblock select * from tpch.lineitem limit 20000; # We limit to 20000 rows to keep the table small