impala

mirror of https://github.com/apache/impala.git synced 2025-12-26 14:02:53 -05:00


stiga-huang 818cd8fa27 IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It is
set to true by default. Setting it to false causes queries on ORC
tables to fail.

Currently, we only support reading primitive types. Writing to ORC
tables is not supported either.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - test_scanner_fuzz is not yet enabled for ORC, since the ORC library
   is not robust against corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>

2018-04-11 05:13:02 +00:00

000000_0

IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.

2015-10-05 11:30:39 -07:00

lineitem_one_row_group.parquet

IMPALA-2466: Add more tests for the HDFS parquet scanner.

2016-03-25 13:10:15 +00:00

lineitem_orc_multiblock_one_stripe.orc

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

lineitem_sixblocks.orc

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

lineitem_sixblocks.parquet

IMPALA-2466: Add more tests for the HDFS parquet scanner.

2016-03-25 13:10:15 +00:00

lineitem_threeblocks.orc

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

README.dox

IMPALA-5717: Support for reading ORC data files

2018-04-11 05:13:02 +00:00

README.dox

This file was created for:
IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.
IMPALA-2466: Add more tests to the HDFS parquet scanner.
IMPALA-5717: Add tests for HDFS orc scanner.

The table lineitem_multiblock is a single Parquet file with:
- Row groups of approximately 12 KB each.
- 200 row groups in total.

Assuming a 1 MB HDFS block size, it has:
- 3 blocks of up to 1 MB each.
- Multiple row groups per block.
- Some row groups that span block boundaries and live on 2 blocks.
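
The block layout described above can be checked with a little arithmetic. The sketch below is illustrative only: it assumes every row group is exactly 12 KB (the file's row groups are only approximately that size) and that row groups are packed back to back with no header or footer bytes. The function name block_layout is hypothetical and is not part of Impala.

```python
KB = 1024
MB = 1024 * KB


def block_layout(num_row_groups, row_group_size, block_size):
    """Return (num_blocks, spanning) for row groups packed back to back.

    spanning lists the indexes of row groups whose bytes fall in more
    than one block. Assumes uniform row group sizes (a simplification).
    """
    total = num_row_groups * row_group_size
    num_blocks = -(-total // block_size)  # ceiling division
    spanning = []
    for i in range(num_row_groups):
        first_block = (i * row_group_size) // block_size
        last_block = ((i + 1) * row_group_size - 1) // block_size
        if first_block != last_block:
            spanning.append(i)
    return num_blocks, spanning


# 200 row groups of 12 KB on 1 MB blocks, as assumed in this README.
num_blocks, spanning = block_layout(200, 12 * KB, 1 * MB)
print(num_blocks, spanning)
```

Under these assumptions the file occupies 3 blocks and one row group straddles each of the two block boundaries. The real file may differ slightly because actual row group sizes vary.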

----

This table was created using Hive and has the same table structure as 'tpch.lineitem'
and a subset of its data.

The following commands were used:

create table functional_parquet.lineitem_multiblock like tpch.lineitem
stored as parquet;

-- This is to set the row group size
set parquet.block.size=4086;

-- We limit to 20000 to keep the size of the table small
insert into functional_parquet.lineitem_multiblock select * from
tpch.lineitem limit 20000;

'lineitem_sixblocks' was created the same way but with more rows, so that we got more
blocks.

'lineitem_multiblock_one_row_group' was created similarly but with a much higher
'parquet.block.size' so that everything fit in one row group.

----

The ORC files were created with the following Hive queries:

use functional_orc_def;

set orc.stripe.size=1024;
set orc.compress=ZLIB;
create table lineitem_threeblocks like tpch.lineitem stored as orc;
create table lineitem_sixblocks like tpch.lineitem stored as orc;
insert overwrite table lineitem_threeblocks select * from tpch.lineitem limit 16000;
insert overwrite table lineitem_sixblocks select * from tpch.lineitem limit 30000;

set orc.stripe.size=67108864;
create table lineitem_orc_multiblock_one_stripe like tpch.lineitem stored as orc;
insert overwrite table lineitem_orc_multiblock_one_stripe select * from
tpch.lineitem limit 16000;