This patch integrates the ORC library into Impala and implements HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner supplies the input the ORC reader needs, tracks the reader's memory consumption, and transfers the reader's output (orc::ColumnVectorBatch) into impala::RowBatch. The ORC version used is release-1.4.3.

A startup option, --enable_orc_scanner, is added for this feature. It is set to true by default. Setting it to false causes queries on ORC tables to fail.

Currently, only reading primitive types is supported. Writing into ORC tables is not supported either.

Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This file was created for:
IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups.
IMPALA-2466: Add more tests to the HDFS parquet scanner.
IMPALA-5717: Add tests for HDFS orc scanner.

The table lineitem_multiblock is a single parquet file with:
 - A row group size of approximately 12 KB each.
 - 200 row groups in total.

Assuming a 1 MB HDFS block size, it has:
 - 3 blocks of up to 1 MB each.
 - Multiple row groups per block.
 - Some row groups that span across block boundaries and live on 2 blocks.

----

This table was created using Hive and has the same table structure and some of the data of 'tpch.lineitem'.

The following commands were used:

create table functional_parquet.lineitem_multiblock like tpch.lineitem stored as parquet;

set parquet.block.size=4086; # This is to set the row group size

insert into functional_parquet.lineitem_multiblock select * from tpch.lineitem limit 20000; # We limit to 20000 to keep the size of the table small

'lineitem_sixblocks' was created the same way but with more rows, so that we got more blocks.

'lineitem_multiblock_one_row_group' was created similarly but with a much higher 'parquet.block.size' so that everything fit in one row group.

----

The orc files are created by the following Hive queries:

use functional_orc_def;

set orc.stripe.size=1024;
set orc.compress=ZLIB;

create table lineitem_threeblocks like tpch.lineitem stored as orc;
create table lineitem_sixblocks like tpch.lineitem stored as orc;

insert overwrite table lineitem_threeblocks select * from tpch.lineitem limit 16000;
insert overwrite table lineitem_sixblocks select * from tpch.lineitem limit 30000;

set orc.stripe.size=67108864;

create table lineitem_orc_multiblock_one_stripe like tpch.lineitem stored as orc;

insert overwrite table lineitem_orc_multiblock_one_stripe select * from tpch.lineitem limit 16000;
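
----

The block layout described for lineitem_multiblock (200 row groups of about 12 KB on 1 MB HDFS blocks) can be checked with a little arithmetic. The Python sketch below is only an illustration and is not part of the Impala test suite. It assumes that every row group is exactly 12 KB, that row groups are laid out back to back from offset 0, and that file header and footer bytes can be ignored. The function name block_layout is hypothetical.

```python
# Sketch: how many HDFS blocks does the file need, and which row groups
# straddle a block boundary?
# Assumptions (not from the README): every row group is exactly 12 KB,
# row groups are contiguous starting at offset 0, and Parquet header and
# footer bytes are ignored.

KB = 1024
ROW_GROUP_SIZE = 12 * KB      # "approximately 12 KB" per the README
NUM_ROW_GROUPS = 200          # per the README
HDFS_BLOCK_SIZE = 1024 * KB   # the 1 MB block size the README assumes


def block_layout(num_groups, group_size, block_size):
    """Return (number of blocks, indices of row groups spanning two blocks)."""
    total = num_groups * group_size
    num_blocks = -(-total // block_size)  # ceiling division
    spanning = []
    for i in range(num_groups):
        first_block = (i * group_size) // block_size
        last_block = ((i + 1) * group_size - 1) // block_size
        if first_block != last_block:
            spanning.append(i)
    return num_blocks, spanning


num_blocks, spanning = block_layout(NUM_ROW_GROUPS, ROW_GROUP_SIZE, HDFS_BLOCK_SIZE)
print(num_blocks, spanning)  # prints: 3 [85, 170]
```

Under these simplified assumptions the file occupies 3 blocks, and one row group crosses each of the two block boundaries, which matches the properties listed for the table. The real file will differ in the exact indices because actual row group sizes vary.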