IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups

Impala supports reading Parquet files with multiple row groups, but performance can degrade because of remote reads. This patch maximizes scan locality by letting multiple impalads scan the row groups in their local splits: each impalad starts a new scan range for every split local to it if that split contains row groups that need to be scanned.

Change-Id: Iaecc5fb8e89364780bc59dbfa9ae51d0d124d16e
Reviewed-on: http://gerrit.cloudera.org:8080/908
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
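The per-split scan-range decision described above can be sketched roughly as follows. This is a minimal, hypothetical C++ sketch, not Impala's actual HdfsParquetScanner code: the `ByteRange` struct and `SplitsToScan` function are illustrative names, and the rule of assigning each row group to whichever split contains its midpoint (so a row group straddling a block boundary is scanned by exactly one node) is an assumption made to keep the example self-contained.

#include <cstdint>
#include <iostream>
#include <vector>

// A contiguous byte range in the file: either a row group or an HDFS split.
struct ByteRange { int64_t offset; int64_t len; };

// Returns the splits (from 'local_splits') that should start a scan range on
// this node, i.e. those containing the midpoint of at least one row group.
std::vector<ByteRange> SplitsToScan(const std::vector<ByteRange>& row_groups,
                                    const std::vector<ByteRange>& local_splits) {
  std::vector<ByteRange> to_scan;
  for (const ByteRange& split : local_splits) {
    bool owns_row_group = false;
    for (const ByteRange& rg : row_groups) {
      int64_t mid = rg.offset + rg.len / 2;
      if (mid >= split.offset && mid < split.offset + split.len) {
        owns_row_group = true;
        break;
      }
    }
    if (owns_row_group) to_scan.push_back(split);
  }
  return to_scan;
}

int main() {
  // Hypothetical layout matching the test table below: 200 row groups of
  // ~12 KB each, packed into 1 MB HDFS blocks.
  const int64_t kRowGroupLen = 12 * 1024;
  const int64_t kBlockLen = 1 << 20;
  std::vector<ByteRange> row_groups;
  for (int64_t i = 0; i < 200; ++i) {
    row_groups.push_back({4 + i * kRowGroupLen, kRowGroupLen});
  }
  // Suppose only the first and third of the file's three blocks are local.
  std::vector<ByteRange> local_splits = {{0, kBlockLen}, {2 * kBlockLen, kBlockLen}};
  for (const ByteRange& s : SplitsToScan(row_groups, local_splits)) {
    std::cout << "start scan range at offset " << s.offset << "\n";
  }
  return 0;
}

With this layout the sketch starts one scan range per local split, while the remote middle block is left for whichever impalad holds it locally.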
This file was created for:
IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.

The table lineitem_multiblock is a single Parquet file with:
 - Row groups of approximately 12 KB each.
 - 200 row groups in total.

Assuming a 1 MB HDFS block size (200 x ~12 KB is roughly 2.4 MB of data), it has:
 - 3 blocks of up to 1 MB each.
 - Multiple row groups per block.
 - Some row groups that span block boundaries and live on 2 blocks.

----
This table was created using Hive and has the same table structure and some of the data of 'tpch.lineitem'. The following commands were used:

create table functional_parquet.lineitem_multiblock like tpch.lineitem stored as parquet;

set parquet.block.size=4086; # This sets the row group size

insert into functional_parquet.lineitem_multiblock select * from tpch.lineitem limit 20000; # We limit to 20000 rows to keep the table small