impala

mirror of https://github.com/apache/impala.git synced 2026-01-02 03:00:32 -05:00

Files

Lars Volker b5570da405 IMPALA-1740: Add support for skip.header.line.count.

HIVE-5795 introduced the table property skip.header.line.count to skip
header lines in input files. This change adds the capability to skip
an arbitrary number of header lines in CSV input files on HDFS. The
total size of the file header must be smaller than
max_scan_range_length; otherwise an error is reported. This
restriction is necessary because scan ranges are not read in disk
order, so the only way to identify header lines is to count them from
the start of the first scan range.
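
The counting rule described above can be illustrated with a minimal
Python sketch. It is not Impala's scanner code: split_scan_ranges and
skip_header are hypothetical helpers, the file is modelled as an
in-memory byte string, and the check is simplified to "the header must
end inside the first scan range".

```python
from typing import List


def split_scan_ranges(data: bytes, max_scan_range_length: int) -> List[bytes]:
    """Cut a file into consecutive ranges of at most max_scan_range_length bytes."""
    return [data[i:i + max_scan_range_length]
            for i in range(0, len(data), max_scan_range_length)]


def skip_header(first_range: bytes, header_line_count: int) -> bytes:
    """Drop header_line_count lines, counting from the start of the first range.

    Raises ValueError if the header does not end inside the first range,
    because later ranges cannot tell whether they still hold header lines.
    """
    offset = 0
    for _ in range(header_line_count):
        newline = first_range.find(b"\n", offset)
        if newline == -1:
            raise ValueError("file header does not fit in the first scan range")
        offset = newline + 1
    return first_range[offset:]


if __name__ == "__main__":
    data = b"num1,num2\n1,1\n2,2\n3,3\n"
    ranges = split_scan_ranges(data, 16)
    # Only the first range is trimmed; the remaining ranges are used as-is.
    body = skip_header(ranges[0], 1) + b"".join(ranges[1:])
    print(body.decode())
```

With a range length of 4 the first range holds only "num1", so the
sketch raises an error instead of guessing where the header ends.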

[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1  | 1  |
| 2  | 2  |
| 3  | 3  |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1   | c2   |
+------+------+
| NULL | NULL |
| 1    | 1    |
| 2    | 2    |
| 3    | 3    |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2

Fetched 4 row(s) in 0.41s

Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins

2016-05-12 14:17:46 -07:00

local_tbl

Enable local filesystem tables

2015-02-27 18:48:56 +00:00

mstr

- Move thrift out of FE src and into impala/common

2011-12-30 19:35:20 -08:00

schemas

Add nested types support to Create Table Like File

2015-08-22 01:46:26 +00:00

alltypesagg_hive_13_1.parquet

IMPALA-1658: Add compatibility flag for Hive-Parquet-Timestamps

2015-02-11 13:28:17 +00:00

avro_decimal_tbl.avro

Decimal: read from Avro

2014-05-16 22:26:11 -07:00

bad_column_metadata.parquet

IMPALA-2558: DCHECK in parquet scanner after block read error

2015-10-30 22:35:57 +00:00

bad_compressed_size.parquet

S3: Don't seek/read past file end

2015-01-08 16:19:35 -08:00

bad_dict_page_offset.parquet

S3: Don't seek/read past file end

2015-01-08 16:19:35 -08:00

bad_magic_number.parquet

IMPALA-2130: Wrong verification of Parquet file version

2015-07-14 02:52:02 +00:00

bad_metadata_len.parquet

S3: Don't seek/read past file end

2015-01-08 16:19:35 -08:00

bad_parquet_data.parquet

IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8

2014-01-08 10:54:27 -08:00

chars-formats.avro

Char PARQUET, AVRO, and TEXT tests

2014-09-26 12:24:07 -07:00

chars-formats.parquet

Char PARQUET, AVRO, and TEXT tests

2014-09-26 12:24:07 -07:00

chars-formats.txt

Char PARQUET, AVRO, and TEXT tests

2014-09-26 12:24:07 -07:00

chars-tiny.txt

Bugfix and tests for CHAR(N) and VARCHAR(N)

2014-09-23 07:30:07 -07:00

data-bzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

data-pbzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

decimal_tbl.txt

Decimal implementation.

2014-04-14 21:07:32 -07:00

decimal-tiny.txt

Decimal implementation.

2014-04-14 21:07:32 -07:00

kite_required_fields.parquet

Parquet: Fix value def level when max def level is 0

2015-05-15 06:41:02 +00:00

large_bzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

large_pbzip2.bz2

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

long_page_header.parquet

IMPALA-1401: raise MAX_PAGE_HEADER_SIZE and use scanner context to

2014-10-27 16:30:56 -07:00

multiple_rowgroups.parquet

IMPALA-729: fix resource management in Parquet scanner for multiple row groups

2014-01-08 10:56:26 -08:00

oldrcfile.rc

Fix pre-hive 9 rc file scanner.

2014-01-08 10:48:41 -08:00

overflow.txt

- Cleaned up some TODOs.

2012-01-18 23:08:29 -08:00

README

IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.

2016-02-28 21:31:37 -08:00

repeated_values.parquet

Allow zero bit width dict/RLE decoders.

2014-01-08 10:54:27 -08:00

table_missing_columns.csv

IMPALA-1973: Fixing crash when uninitialized, empty row is added in HdfsTextScanner

2015-05-05 00:19:12 +00:00

table_no_newline.csv

IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line.

2015-03-20 19:58:50 -07:00

table_with_header_2.csv

IMPALA-1740: Add support for skip.header.line.count.

2016-05-12 14:17:46 -07:00

table_with_header.csv

IMPALA-1740: Add support for skip.header.line.count.

2016-05-12 14:17:46 -07:00

text-comma-backslash-newline.txt

IMPALA-496: Fix escaping of field delimiter and escape character in inserts

2014-01-08 10:52:09 -08:00

text-dollar-hash-pipe.txt

IMPALA-496: Fix escaping of field delimiter and escape character in inserts

2014-01-08 10:52:09 -08:00

text-thorn-ecirc-newline.txt

IMP-1291: Support "extended" ASCII characters as delimiters in text files

2014-03-13 13:00:15 -07:00

timezoneverification.csv

IMPALA-1381: Expand set of supported timezones.

2015-05-22 01:32:54 +00:00

widerow.txt

IMPALA-525: Adjust IO buffer size based on read length and other memory fixes

2014-01-08 10:54:01 -08:00

README

bad_parquet_data.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"is"
"fun"

repeated_values.parquet:
Generated with parquet-mr 1.2.5
Contains 3 single-column rows:
"parquet"
"parquet"
"parquet"

multiple_rowgroups.parquet:
Generated with parquet-mr 1.2.5
Populated with:
hive> set parquet.block.size=500;
hive> INSERT INTO TABLE tbl
      SELECT l_comment FROM tpch.lineitem LIMIT 1000;

alltypesagg_hive_13_1.parquet:
Generated with parquet-mr version 1.5.0-cdh5.4.0-SNAPSHOT
hive> create table alltypesagg_hive_13_1 stored as parquet as select * from alltypesagg;

bad_column_metadata.parquet:
Generated with hacked version of parquet-mr 1.8.2-SNAPSHOT
Schema:
 {"type": "record",
  "namespace": "com.cloudera.impala",
  "name": "bad_column_metadata",
  "fields": [
      {"name": "id", "type": ["null", "long"]},
      {"name": "int_array", "type": ["null", {"type": "array", "items": ["null", "int"]}]}
  ]
 }
Contains 3 row groups, each with 10 rows, and each array contains 10 elements. In the
first row group, the column metadata for 'int_array' incorrectly states there are 50
values (instead of 100). In the second row group, the column metadata for 'id'
incorrectly states there are 11 values (instead of 10). The third row group has the
correct metadata.
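
The expected counts follow from 10 rows per row group and 10 elements per
array. The sketch below checks stated counts against them. It does not
read the Parquet file: STATED is a hand-written stand-in for the footer
metadata, and the entries this README does not describe as wrong are
assumed to be correct.

```python
ROWS_PER_GROUP = 10
ELEMENTS_PER_ARRAY = 10

# Values each column chunk should report, derived from the description above.
EXPECTED = {
    "id": ROWS_PER_GROUP,
    "int_array": ROWS_PER_GROUP * ELEMENTS_PER_ARRAY,
}

# Hypothetical transcription of the per-row-group metadata (not read from disk).
STATED = [
    {"id": 10, "int_array": 50},   # first row group: int_array is wrong
    {"id": 11, "int_array": 100},  # second row group: id is wrong
    {"id": 10, "int_array": 100},  # third row group: correct
]


def find_mismatches(stated_groups, expected):
    """Return (row_group_index, column, stated, expected) for every bad count."""
    return [
        (index, column, stated, expected[column])
        for index, group in enumerate(stated_groups)
        for column, stated in sorted(group.items())
        if stated != expected[column]
    ]


if __name__ == "__main__":
    for mismatch in find_mismatches(STATED, EXPECTED):
        print(mismatch)
```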

data-bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size < 8M

large_bzip2.bz2:
Generated with bzip2, contains single bzip2 stream
Contains 1 column, uncompressed data size > 8M

data-pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size < 8M

large_pbzip2.bz2:
Generated with pbzip2, contains multiple bzip2 streams
Contains 1 column, uncompressed data size > 8M
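
The difference between a single-stream and a multi-stream file can be
checked with Python's standard bz2 module. This is a minimal sketch:
count_bz2_streams is a hypothetical helper, and the multi-stream input is
emulated by concatenating independently compressed chunks rather than by
running pbzip2 or reading the files listed above.

```python
import bz2


def count_bz2_streams(data: bytes) -> int:
    """Count the concatenated bzip2 streams in data."""
    streams = 0
    while data:
        # A BZ2Decompressor handles exactly one stream; bytes that follow
        # the end of that stream are exposed through unused_data.
        decompressor = bz2.BZ2Decompressor()
        decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        data = decompressor.unused_data
    return streams


if __name__ == "__main__":
    single = bz2.compress(b"a\nb\n")
    # Stand-in for pbzip2 output: each chunk is compressed on its own.
    multi = bz2.compress(b"a\n") + bz2.compress(b"b\n")
    print(count_bz2_streams(single), count_bz2_streams(multi))
    # bz2.decompress reads across stream boundaries, so both inputs
    # decode to the same bytes.
    print(bz2.decompress(single) == bz2.decompress(multi))
```

A reader that stops at the end of the first stream would return only part
of a pbzip2 file, which is the case the multi-stream test files exercise.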