impala/testdata/datasets/functional/schema_constraints.csv
stiga-huang 818cd8fa27 IMPALA-5717: Support for reading ORC data files
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
tables is not supported either.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 05:13:02 +00:00

# Table level constraints:
# Allows for defining constraints on which file formats to generate for an individual
# table. The table name should match the base table name defined in the schema template
# file.
table_name:stringids, constraint:restrict_to, table_format:hbase/none/none
table_name:hbasecolumnfamilies, constraint:restrict_to, table_format:hbase/none/none
table_name:insertalltypesagg, constraint:restrict_to, table_format:hbase/none/none
table_name:alltypessmallbinary, constraint:restrict_to, table_format:hbase/none/none
table_name:insertalltypesaggbinary, constraint:restrict_to, table_format:hbase/none/none
table_name:hbasealltypeserror, constraint:restrict_to, table_format:hbase/none/none
table_name:hbasealltypeserrornonulls, constraint:restrict_to, table_format:hbase/none/none
table_name:alltypesinsert, constraint:restrict_to, table_format:text/none/none
table_name:stringpartitionkey, constraint:restrict_to, table_format:text/none/none
table_name:alltypesnopart_insert, constraint:restrict_to, table_format:text/none/none
table_name:insert_overwrite_nopart, constraint:restrict_to, table_format:text/none/none
table_name:insert_overwrite_partitioned, constraint:restrict_to, table_format:text/none/none
table_name:insert_string_partitioned, constraint:restrict_to, table_format:text/none/none
table_name:alltypesinsert, constraint:restrict_to, table_format:parquet/none/none
table_name:alltypesnopart_insert, constraint:restrict_to, table_format:parquet/none/none
table_name:alltypesinsert, constraint:restrict_to, table_format:text/none/none
table_name:alltypesnopart_insert, constraint:restrict_to, table_format:text/none/none
table_name:insert_overwrite_nopart, constraint:restrict_to, table_format:text/none/none
table_name:insert_overwrite_partitioned, constraint:restrict_to, table_format:text/none/none
table_name:insert_string_partitioned, constraint:restrict_to, table_format:text/none/none
table_name:alltypesinsert, constraint:restrict_to, table_format:parquet/none/none
table_name:alltypesnopart_insert, constraint:restrict_to, table_format:parquet/none/none
table_name:insert_overwrite_nopart, constraint:restrict_to, table_format:parquet/none/none
table_name:insert_overwrite_partitioned, constraint:restrict_to, table_format:parquet/none/none
table_name:insert_string_partitioned, constraint:restrict_to, table_format:parquet/none/none
table_name:old_rcfile_table, constraint:restrict_to, table_format:rc/none/none
table_name:bad_text_lzo, constraint:restrict_to, table_format:text/lzo/block
table_name:bad_text_gzip, constraint:restrict_to, table_format:text/gzip/block
table_name:bad_seq_snap, constraint:restrict_to, table_format:seq/snap/block
table_name:bad_avro_snap_strings, constraint:restrict_to, table_format:avro/snap/block
table_name:bad_avro_snap_floats, constraint:restrict_to, table_format:avro/snap/block
table_name:bad_avro_decimal_schema, constraint:restrict_to, table_format:avro/snap/block
table_name:bad_parquet, constraint:restrict_to, table_format:parquet/none/none
table_name:bad_parquet_strings_negative_len, constraint:restrict_to, table_format:parquet/none/none
table_name:bad_parquet_strings_out_of_bounds, constraint:restrict_to, table_format:parquet/none/none
table_name:bad_magic_number, constraint:restrict_to, table_format:parquet/none/none
table_name:bad_metadata_len, constraint:restrict_to, table_format:parquet/none/none
table_name:bad_dict_page_offset, constraint:restrict_to, table_format:parquet/none/none
table_name:bad_compressed_size, constraint:restrict_to, table_format:parquet/none/none
table_name:alltypesagg_hive_13_1, constraint:restrict_to, table_format:parquet/none/none
table_name:kite_required_fields, constraint:restrict_to, table_format:parquet/none/none
table_name:bad_column_metadata, constraint:restrict_to, table_format:parquet/none/none
table_name:lineitem_multiblock, constraint:restrict_to, table_format:parquet/none/none
table_name:lineitem_sixblocks, constraint:restrict_to, table_format:parquet/none/none
table_name:lineitem_multiblock_one_row_group, constraint:restrict_to, table_format:parquet/none/none
table_name:customer_multiblock, constraint:restrict_to, table_format:parquet/none/none
# TODO: Support Avro. Data loading currently fails for Avro because complex types
# cannot be converted to the corresponding Avro types yet.
table_name:allcomplextypes, constraint:restrict_to, table_format:text/none/none
table_name:allcomplextypes, constraint:restrict_to, table_format:parquet/none/none
table_name:allcomplextypes, constraint:restrict_to, table_format:hbase/none/none
table_name:functional, constraint:restrict_to, table_format:text/none/none
table_name:complextypes_fileformat, constraint:restrict_to, table_format:text/none/none
table_name:complextypes_fileformat, constraint:restrict_to, table_format:parquet/none/none
table_name:complextypes_fileformat, constraint:restrict_to, table_format:avro/snap/block
table_name:complextypes_fileformat, constraint:restrict_to, table_format:rc/snap/block
table_name:complextypes_fileformat, constraint:restrict_to, table_format:seq/snap/block
table_name:complextypes_fileformat, constraint:restrict_to, table_format:orc/def/block
table_name:complextypes_multifileformat, constraint:restrict_to, table_format:text/none/none
# TODO: Avro
table_name:complextypestbl, constraint:restrict_to, table_format:parquet/none/none
table_name:alltypeserror, constraint:exclude, table_format:parquet/none/none
table_name:alltypeserrornonulls, constraint:exclude, table_format:parquet/none/none
table_name:unsupported_types, constraint:exclude, table_format:parquet/none/none
table_name:escapechartesttable, constraint:exclude, table_format:parquet/none/none
table_name:TblWithRaggedColumns, constraint:exclude, table_format:parquet/none/none
# the text_ tables are for testing test delimiters and escape chars in text files
table_name:text_comma_backslash_newline, constraint:restrict_to, table_format:text/none/none
table_name:text_dollar_hash_pipe, constraint:restrict_to, table_format:text/none/none
table_name:text_thorn_ecirc_newline, constraint:restrict_to, table_format:text/none/none
table_name:bad_serde, constraint:restrict_to, table_format:text/none/none
table_name:rcfile_lazy_binary_serde, constraint:restrict_to, table_format:rc/none/none
table_name:unsupported_partition_types, constraint:restrict_to, table_format:text/none/none
table_name:nullformat_custom, constraint:exclude, table_format:parquet/none/none
table_name:alltypes_view, constraint:restrict_to, table_format:text/none/none
table_name:allcomplextypes_view, constraint:restrict_to, table_format:text/none/none
table_name:alltypes_view, constraint:restrict_to, table_format:seq/snap/block
table_name:alltypes_hive_view, constraint:restrict_to, table_format:text/none/none
table_name:alltypes_view_sub, constraint:restrict_to, table_format:text/none/none
table_name:alltypes_view_sub, constraint:restrict_to, table_format:seq/snap/block
table_name:alltypes_parens, constraint:restrict_to, table_format:text/none/none
table_name:complex_view, constraint:restrict_to, table_format:text/none/none
table_name:complex_view, constraint:restrict_to, table_format:seq/snap/block
table_name:view_view, constraint:restrict_to, table_format:text/none/none
table_name:view_view, constraint:restrict_to, table_format:seq/snap/block
table_name:subquery_view, constraint:restrict_to, table_format:seq/snap/block
table_name:subquery_view, constraint:restrict_to, table_format:rc/none/none
# liketbl and tblwithraggedcolumns all have
# NULLs in primary key columns. hbase does not support
# writing NULLs to primary key columns.
table_name:liketbl, constraint:exclude, table_format:hbase/none/none
table_name:tblwithraggedcolumns, constraint:exclude, table_format:hbase/none/none
# Tables with only one column are not supported in hbase.
table_name:greptiny, constraint:exclude, table_format:hbase/none/none
table_name:tinyinttable, constraint:exclude, table_format:hbase/none/none
# overflow uses a manually constructed text file which doesn't make sense to write to
# other table formats since the values that would be written are different (e.g. already
# truncated.)
table_name:overflow, constraint:restrict_to, table_format:text/none/none
# widerow has a single column with a single row containing a 10MB string. hbase doesn't
# seem to like this.
table_name:widerow, constraint:exclude, table_format:hbase/none/none
# nullformat_custom is used in null-insert tests, which use insert overwrite,
# which is not supported in hbase. The schema is also specified in HIVE_CREATE
# with no corresponding LOAD statement.
table_name:nullformat_custom, constraint:exclude, table_format:hbase/none/none
table_name:unsupported_types, constraint:exclude, table_format:hbase/none/none
# Decimal can only be tested on formats Impala can write to (text and parquet).
# TODO: add Avro once Hive or Impala can write Avro decimals
table_name:decimal_tbl, constraint:restrict_to, table_format:text/none/none
table_name:decimal_tiny, constraint:restrict_to, table_format:text/none/none
table_name:decimal_tbl, constraint:restrict_to, table_format:parquet/none/none
table_name:decimal_tiny, constraint:restrict_to, table_format:parquet/none/none
table_name:decimal_tbl, constraint:restrict_to, table_format:kudu/none/none
table_name:decimal_tiny, constraint:restrict_to, table_format:kudu/none/none
table_name:decimal_tbl, constraint:restrict_to, table_format:orc/def/block
table_name:decimal_tiny, constraint:restrict_to, table_format:orc/def/block
table_name:avro_decimal_tbl, constraint:restrict_to, table_format:avro/snap/block
# TODO first set of tests are for text/none/none
table_name:chars_tiny, constraint:restrict_to, table_format:text/none/none
# invalid_decimal_part_tbl[1,2,3] tables are used for testing invalid decimal
# partition key values (see IMPALA-1040)
table_name:invalid_decimal_part_tbl1, constraint:restrict_to, table_format:text/none/none
table_name:invalid_decimal_part_tbl2, constraint:restrict_to, table_format:text/none/none
table_name:invalid_decimal_part_tbl3, constraint:restrict_to, table_format:text/none/none
table_name:avro_decimal_tbl, constraint:restrict_to, table_format:avro/snap/block
# testescape tables are used for testing text scanner delimiter handling
table_name:table_no_newline, constraint:restrict_to, table_format:text/none/none
table_name:table_no_newline_part, constraint:restrict_to, table_format:text/none/none
table_name:testescape_16_lf, constraint:restrict_to, table_format:text/none/none
table_name:testescape_16_crlf, constraint:restrict_to, table_format:text/none/none
table_name:testescape_17_lf, constraint:restrict_to, table_format:text/none/none
table_name:testescape_17_crlf, constraint:restrict_to, table_format:text/none/none
table_name:testescape_32_lf, constraint:restrict_to, table_format:text/none/none
table_name:testescape_32_crlf, constraint:restrict_to, table_format:text/none/none
# alltimezones is used to verify that impala properly deals with timezones
table_name:alltimezones, constraint:restrict_to, table_format:text/none/none
# Avro schema is inferred from the column definitions (IMPALA-1136)
table_name:no_avro_schema, constraint:restrict_to, table_format:avro/snap/block
table_name:avro_unicode_nulls, constraint:restrict_to, table_format:avro/snap/block
# test single and multi stream bz2 files
table_name:bzip2_tbl, constraint:restrict_to, table_format:text/bzip/block
table_name:large_bzip2_tbl, constraint:restrict_to, table_format:text/bzip/block
table_name:multistream_bzip2_tbl, constraint:restrict_to, table_format:text/bzip/block
table_name:large_multistream_bzip2_tbl, constraint:restrict_to, table_format:text/bzip/block
# Kudu can't handle certain types such as timestamp so we pick and choose the tables
# we actually use for Kudu related tests.
table_name:alltypes, constraint:only, table_format:kudu/none/none
table_name:alltypessmall, constraint:only, table_format:kudu/none/none
table_name:alltypestiny, constraint:only, table_format:kudu/none/none
table_name:alltypesagg, constraint:only, table_format:kudu/none/none
table_name:alltypesaggnonulls, constraint:only, table_format:kudu/none/none
table_name:testtbl, constraint:only, table_format:kudu/none/none
table_name:jointbl, constraint:only, table_format:kudu/none/none
table_name:emptytable, constraint:only, table_format:kudu/none/none
table_name:dimtbl, constraint:only, table_format:kudu/none/none
table_name:tinytable, constraint:only, table_format:kudu/none/none
table_name:tinyinttable, constraint:only, table_format:kudu/none/none
table_name:zipcode_incomes, constraint:only, table_format:kudu/none/none
table_name:nulltable, constraint:only, table_format:kudu/none/none
table_name:nullescapedtable, constraint:only, table_format:kudu/none/none
table_name:decimal_tbl, constraint:only, table_format:kudu/none/none
table_name:decimal_tiny, constraint:only, table_format:kudu/none/none
# Skipping header lines is only effective with text tables
table_name:table_with_header, constraint:restrict_to, table_format:text/none/none
table_name:table_with_header_2, constraint:restrict_to, table_format:text/none/none
table_name:table_with_header_insert, constraint:restrict_to, table_format:text/none/none
# We also test that skipping header lines works on compressed tables (IMPALA-5287)
table_name:table_with_header, constraint:restrict_to, table_format:text/gzip/block
table_name:table_with_header_2, constraint:restrict_to, table_format:text/gzip/block
table_name:table_with_header_insert, constraint:restrict_to, table_format:text/gzip/block
# Inserting into parquet tables should not be affected by the 'skip.header.line.count'
# property, so we test parquet format as well.
table_name:table_with_header_insert, constraint:restrict_to, table_format:parquet/none/none
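
The rows above follow a "key:value, key:value, key:value" layout with three constraint kinds (restrict_to, exclude, only). As a rough illustration of how such a file might be read and applied, here is a minimal Python sketch. It is not Impala's own loader: the function names parse_constraints and should_generate are hypothetical, and the semantics are assumptions inferred from the comments in the file (restrict_to limits a table to the listed formats, exclude drops a table from a format, and only limits a format to the listed tables). Lower-casing table names is also an assumption, made because the file mixes TblWithRaggedColumns and tblwithraggedcolumns.

```python
from collections import defaultdict


def parse_constraints(lines):
    """Parse constraint rows into three lookup tables (hypothetical helper)."""
    restrict_to = defaultdict(set)  # table name -> formats it may be generated for
    exclude = defaultdict(set)      # table name -> formats it must not be generated for
    only = defaultdict(set)         # format -> the only tables generated for that format
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        # Each row looks like "table_name:x, constraint:y, table_format:z".
        fields = dict(part.strip().split(':', 1) for part in line.split(','))
        table = fields['table_name'].lower()  # assumption: names compare case-insensitively
        kind = fields['constraint']
        fmt = fields['table_format']
        if kind == 'restrict_to':
            restrict_to[table].add(fmt)
        elif kind == 'exclude':
            exclude[table].add(fmt)
        elif kind == 'only':
            only[fmt].add(table)
        else:
            raise ValueError('unknown constraint: %r' % kind)
    return restrict_to, exclude, only


def should_generate(table, fmt, restrict_to, exclude, only):
    """Return True if `table` would be generated for `fmt` under the assumed semantics."""
    table = table.lower()
    if fmt in only and table not in only[fmt]:
        return False  # the format accepts only an explicit list of tables
    if table in restrict_to and fmt not in restrict_to[table]:
        return False  # the table is limited to other formats
    if fmt in exclude.get(table, ()):
        return False  # the table is explicitly excluded from this format
    return True
```

Under these assumptions, a table with no rows at all is generated for every format except those that carry an "only" list, which matches the comment about picking and choosing the tables used for Kudu.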