Commit Graph

53 Commits

Author SHA1 Message Date
stiga-huang
818cd8fa27 IMPALA-5717: Support for reading ORC data files
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc reader, tracks the reader's memory
consumption, and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing to ORC
tables is not supported yet either.
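
For context, reading a batch with the upstream ORC C++ library looks
roughly like this (a minimal sketch against the public orc:: API; the
file path is hypothetical, and the conversion into impala::RowBatch
that HdfsOrcScanner performs is omitted):

  #include <orc/OrcFile.hh>
  #include <iostream>
  #include <memory>

  int main() {
    // Open a local ORC file and create the metadata-level reader.
    std::unique_ptr<orc::Reader> reader = orc::createReader(
        orc::readLocalFile("/tmp/example.orc"), orc::ReaderOptions());
    // The row-level reader produces orc::ColumnVectorBatch instances,
    // which HdfsOrcScanner transfers into impala::RowBatch.
    std::unique_ptr<orc::RowReader> row_reader = reader->createRowReader();
    std::unique_ptr<orc::ColumnVectorBatch> batch =
        row_reader->createRowBatch(1024);
    while (row_reader->next(*batch)) {
      std::cout << batch->numElements << " rows in this batch\n";
    }
    return 0;
  }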

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust against corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-11 05:13:02 +00:00
Tim Armstrong
588e1d46e9 IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner
Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc, is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet.

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Reviewed-on: http://gerrit.cloudera.org:8080/9403
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-22 02:47:33 +00:00
Tim Armstrong
e148c1a7c3 IMPALA-6589: remove invalid DCHECK in parquet reader
The DCHECK was only valid if the Parquet file metadata is internally
consistent, with the number of values reported by the metadata
matching the number of encoded levels.

The DCHECK was intended to directly detect misuse of the RleBatchDecoder
interface, which would lead to incorrect results. However, our other
test coverage for reading Parquet files is sufficient to test the
correctness of level decoding.

Testing:
Added a minimal corrupt test file that reproduces the issue.

Change-Id: Idd6e09f8c8cca8991be5b5b379f6420adaa97daa
Reviewed-on: http://gerrit.cloudera.org:8080/9556
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-17 02:52:19 +00:00
Vincent Tran
0d7787fe4d IMPALA-5315: Cast to timestamp fails for YYYY-M-D format
This change allows casting of a string in 'lazy' date/time
format to timestamp. The supported lazy date formats are:
  yyyy-[M]M-[d]d
  yyyy-[M]M-[d]d [H]H:[m]m:[s]s[.SSSSSSSSS]
  [H]H:[m]m:[s]s[.SSSSSSSSS]

We will incur a SCAN performance penalty (approximately 1/2
TotalReadThroughput) when the string is in one of these
lazy date/time formats.
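
To illustrate what 'lazy' means here, a parsing sketch that accepts one
or two digits per field instead of a fixed-width pattern (illustrative
only, not Impala's actual parser):

  #include <cstdio>
  #include <cstring>

  // Accepts yyyy-[M]M-[d]d, e.g. both "2018-3-5" and "2018-03-05".
  bool ParseLazyDate(const char* s, int* y, int* m, int* d) {
    int consumed = 0;
    if (sscanf(s, "%4d-%2d-%2d%n", y, m, d, &consumed) != 3) return false;
    if (consumed != (int)strlen(s)) return false;  // reject trailing bytes
    return *m >= 1 && *m <= 12 && *d >= 1 && *d <= 31;
  }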

Testing:
Benchmarked the performance consequence by executing this SQL on
a private build over 3.8 billion rows:
select min(cast (time_string as timestamp)) from private.impala_5315

Added tests for valid and invalid date/time format strings
in expr-test.cc to be in line with existing tests for the CAST() function.

Added end-to-end tests into exprs.test and
select-lazy-timestamp.test to exercise the new function within
the context of a query.

Added tests to exercise the leading and trailing white space trimming
behaviour in default and lazy date/time string format (IMPALA-6630).

Change-Id: Ib9a184a09d7e7783f04d47588537612c2ecec28f
Reviewed-on: http://gerrit.cloudera.org:8080/7009
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-13 22:10:18 +00:00
Tim Armstrong
73e90d237e IMPALA-6592: add test for invalid parquet codecs
IMPALA-6592 revealed a gap in test coverage for files with
invalid/unsupported Parquet codecs. This adds a test that reproduces the
bug that was present in my IMPALA-4835 patch. master is unaffected by
this bug.

I also hid the conversion tables and made the conversion go through
functions that validate the enum values, to make it easier to track down
problems like this in the future.
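
A sketch of the validate-then-convert pattern described above (names
hypothetical; codec numbers follow parquet.thrift's CompressionCodec):

  #include <cstdint>

  enum class Codec { NONE, SNAPPY, GZIP };

  // Reject out-of-range thrift enum values from the file metadata
  // instead of indexing a conversion table blindly.
  bool ValidateAndConvertCodec(int32_t thrift_codec, Codec* out) {
    switch (thrift_codec) {
      case 0: *out = Codec::NONE;   return true;  // UNCOMPRESSED
      case 1: *out = Codec::SNAPPY; return true;
      case 2: *out = Codec::GZIP;   return true;
      default: return false;  // invalid/unsupported codec
    }
  }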

Testing:
Ran exhaustive tests.

Change-Id: I1502ea7b7f39aa09f0ed2677e84219b37c64c416
Reviewed-on: http://gerrit.cloudera.org:8080/9500
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-08 04:48:36 +00:00
Zoltan Borok-Nagy
881e00a8bf IMPALA-6538: Fix read path when Parquet min/max statistics contain NaN
If the first number in a row group written by Impala is NaN,
then Impala writes incorrect statistics in the metadata.
This can lead to incorrect results when filtering the data.

This commit fixes the read path when encountering NaNs in
Parquet min/max statistics. If min and max are both NaN, we
can't use the statistics at all. If only one of them is NaN,
the other can still be used.
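
A sketch of the resulting read-path rule, for a predicate like
'col < v' on a float column (names hypothetical):

  #include <cmath>

  // Decide whether a row group can be skipped based on its min/max stats.
  bool CanSkipRowGroup(float stats_min, float stats_max, float v) {
    // Both bounds NaN: the statistics carry no information; read the data.
    if (std::isnan(stats_min) && std::isnan(stats_max)) return false;
    // A NaN bound is unusable on its own; only the non-NaN bound counts.
    if (std::isnan(stats_min)) return false;
    // Every value is >= stats_min, so 'col < v' matches nothing.
    return stats_min >= v;
  }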

I added some tests to QueryTest/parquet-stats.test.

Change-Id: If3897fc1426541239223670812f59e2bed32f455
Reviewed-on: http://gerrit.cloudera.org:8080/9358
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-22 00:57:46 +00:00
Tianyi Wang
6cc76d7201 IMPALA-6353: Fix crash in snappy decompressor
SnappyDecompressor::MaxOutputLen assumes the input pointer to be
non-null. This does not hold when the Parquet file is corrupted and the
compressed_page_size field in a page header is 0. This patch handles
this error instead of failing a DCHECK.
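
A sketch of the guard (snappy::GetUncompressedLength is the real snappy
API; the wrapper name and error handling are simplified):

  #include <snappy.h>
  #include <cstddef>

  // Return false instead of DCHECKing when a corrupt page header yields
  // a zero-length (and therefore null) compressed buffer.
  bool SafeMaxOutputLen(const char* input, size_t input_len, size_t* result) {
    if (input == nullptr || input_len == 0) return false;  // corrupt page
    return snappy::GetUncompressedLength(input, input_len, result);
  }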

Testing: A bad Parquet file with 0 compressed_page_size is added. It
crashes Impala without this patch.

Change-Id: I0d42937aab92a74f8e104d2f7fcd64dc24f6a500
Reviewed-on: http://gerrit.cloudera.org:8080/8977
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-17 04:18:24 +00:00
aphadke
38461c524f IMPALA-5052: Read and write signed integer logical types in Parquet
This patch maps a signed integer logical type in parquet to a supported
Impala column type. This change introduces the following mapping -

  INT_8  -> TINYINT
  INT_16 -> SMALLINT
  INT_32 -> INT
  INT_64 -> BIGINT
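
Expressed as code, the mapping is a switch over the logical type
annotation (sketch; the numeric values follow parquet.thrift's
ConvertedType enum):

  enum class ColType { TINYINT, SMALLINT, INT, BIGINT };

  // Map a Parquet signed-integer logical type to an Impala column type.
  bool MapSignedIntType(int converted_type, ColType* out) {
    switch (converted_type) {
      case 15: *out = ColType::TINYINT;  return true;  // INT_8
      case 16: *out = ColType::SMALLINT; return true;  // INT_16
      case 17: *out = ColType::INT;      return true;  // INT_32
      case 18: *out = ColType::BIGINT;   return true;  // INT_64
      default: return false;
    }
  }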

Also, added a parquet file with the following schema for testing -

  schema {
    optional int32 id;
    optional int32 tinyint_col (INT_8);
    optional int32 smallint_col (INT_16);
    optional int32 int_col;
    optional int64 bigint_col;
  }

Change-Id: I47a8371858c9597c6a440808cf6f933532468927
Reviewed-on: http://gerrit.cloudera.org:8080/8548
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Tianyi Wang <twang@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-09 04:55:59 +00:00
Tim Armstrong
ae116b5bf7 IMPALA-4177,IMPALA-6039: batched bit reading and rle decoding
Switch the decoders to using more batch-oriented interfaces. As an
intermediate step this doesn't make the interfaces of LevelDecoder
or DictDecoder batch-oriented, only the lower-level utility classes.

The next step would be to change those interfaces to be batch-oriented
and make corresponding optimisations in Parquet. This could deliver much
larger perf improvements than the current patch.

The high-level changes are:
* BitReader -> BatchedBitReader, which is built to unpack runs of 32
  bit-packed values efficiently.
* RleDecoder -> RleBatchDecoder, which exposes the repeated and literal
  runs to the caller and uses BatchedBitReader to unpack literal runs
  efficiently.
* Dict decoding uses RleBatchDecoder to decode repeated runs efficiently
  and uses the BitPacking utilities to unpack and encode in a single
  step.

Also removes an older benchmark that isn't too interesting (since
the batch-oriented approach to encoding and decoding is so much
faster than the value-by-value approach).
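
The core idea of batched unpacking, as a scalar sketch: pull whole
32-value groups out of the bit-packed stream with an accumulator instead
of tracking a bit offset per value (simplified; assumes bit_width <= 32):

  #include <cstdint>

  // Unpack 32 values of 'bit_width' bits (LSB-first, as in Parquet)
  // from 'in' into 'out'. Returns the number of bytes consumed,
  // which is always 4 * bit_width for a full group of 32.
  int Unpack32(const uint8_t* in, int bit_width, uint32_t* out) {
    uint64_t buffer = 0;  // bit accumulator
    int bits_in_buffer = 0;
    int bytes = 0;
    const uint64_t mask = (1ULL << bit_width) - 1;
    for (int i = 0; i < 32; ++i) {
      while (bits_in_buffer < bit_width) {
        buffer |= (uint64_t)in[bytes++] << bits_in_buffer;
        bits_in_buffer += 8;
      }
      out[i] = (uint32_t)(buffer & mask);
      buffer >>= bit_width;
      bits_in_buffer -= bit_width;
    }
    return bytes;
  }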

Testing:
* Ran core tests.
* Updated unit tests to exercise new code.
* Added test coverage for the deprecated bit-packed level encoding to
  verify that it still works (there was no coverage previously).

Perf:
Single-node benchmarks showed a few % performance gain. 16 node cluster
benchmarks only showed a gain for TPC-H nested.

Change-Id: I35de0cf80c86f501c4a39270afc8fb8111552ac6
Reviewed-on: http://gerrit.cloudera.org:8080/8267
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-16 21:23:09 +00:00
Bikramjeet Vig
94236ff2ff IMPALA-2494: Support for byte array encoded decimals in Parquet scanner
Extends the Parquet column reader and associated classes to allow for more
than one possible physical type for a given logical type. This patch
only adds support for variable-sized byte array encoded decimals; more
will be added in upcoming commits.
Also, column-level metadata verification, which was previously done per
row group, is now done only once per column per file.
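
For context, Parquet stores these decimals as big-endian
two's-complement byte arrays, so decoding is a byte-by-byte rebuild with
sign extension (sketch; assumes the value fits in int64, i.e. precision
<= 18):

  #include <cstddef>
  #include <cstdint>

  // Decode a big-endian two's-complement byte array into an unscaled
  // decimal value.
  int64_t DecodeDecimalBytes(const uint8_t* bytes, size_t len) {
    // Sign-extend from the most significant bit of the first byte.
    int64_t value = (bytes[0] & 0x80) ? -1 : 0;
    for (size_t i = 0; i < len; ++i) {
      value = (int64_t)(((uint64_t)value << 8) | bytes[i]);
    }
    return value;
  }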

Testing:
Added a backend test verifying that the newly added decimal types are
decoded correctly.
Added a query test that decodes both plain and dictionary-encoded
decimals using binary encoding.

Performance:
Initial perf testing using tpcds_1000 shows no regression.

Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Reviewed-on: http://gerrit.cloudera.org:8080/7822
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-07 04:34:26 +00:00
Gabor Kaszab
545eab6d62 IMPALA-4826: Fix error during a scan on repeated root schema in Parquet.
Having the repetition level set to REPEATED on the root schema
caused a scan to fail with an error when Impala tried to parse the
table.

As a solution, the 'REPEATED' repetition level is ignored when the
root schema is processed. The reasoning behind this is that the Parquet
format description says that the repetition level of the root schema
should not be set to REPEATED anyway, so it's safe to ignore it in
case it is set to this value for some reason.

Change-Id: I7ea84589e1d122ad9d43adde46893ec0ecc5f9c4
Reviewed-on: http://gerrit.cloudera.org:8080/7870
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-06 20:07:56 +00:00
Jakub Kukul
0992a6afda IMPALA-2525: Treat parquet ENUMs as STRINGs when creating impala tables.
Change-Id: Ia7a2e20c3ab83eb3fac422c3b33c117856fec475
Reviewed-on: http://gerrit.cloudera.org:8080/6550
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-06-07 02:51:54 +00:00
Lars Volker
9270346825 IMPALA-4815, IMPALA-4817, IMPALA-4819: Write and Read Parquet Statistics for remaining types
This change adds functionality to write and read parquet::Statistics for
Decimal, String, and Timestamp values. As an exception, we don't read
statistics for CHAR columns, since CHAR support is broken in Impala
(IMPALA-1652).

This change also switches from using the deprecated fields 'min' and
'max' to populating the new fields 'min_value' and 'max_value' in
parquet::Statistics, which were added in parquet-format pull request #46.

The HdfsParquetScanner will prefer the new fields if they are
populated and if the column order 'TypeDefinedOrder' has been used to
compute the statistics. For columns without a column order set, or with
only the deprecated fields populated, the scanner will read them only if
they are of a simple numeric type, i.e. boolean, integer, or floating
point.
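
A sketch of that selection logic (flags stand in for the real metadata
checks; names hypothetical):

  // Returns true if usable statistics exist; *use_new says which fields.
  //  - new fields (min_value/max_value): require TypeDefinedOrder
  //  - deprecated fields (min/max): only trusted for simple numeric types
  bool HasUsableStats(bool has_new_fields, bool type_defined_order,
                      bool has_deprecated_fields, bool is_simple_numeric,
                      bool* use_new) {
    if (has_new_fields && type_defined_order) {
      *use_new = true;
      return true;
    }
    if (has_deprecated_fields && is_simple_numeric) {
      *use_new = false;
      return true;
    }
    return false;
  }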

This change removes the validation of the Parquet Statistics we write to
Hive from the tests, since Hive does not write the new fields. Instead
it adds a parquet file written by Hive that uses the deprecated fields
for its statistics. It uses that file to exercise the fallback logic for
supported types in a test.

This change also cleans up the interface of ParquetPlainEncoder in
parquet-common.h.

Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Reviewed-on: http://gerrit.cloudera.org:8080/6563
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
2017-05-09 15:47:21 +00:00
Lars Volker
12f3ecceab IMPALA-5287: Test skip.header.line.count on gzip
This change fixes IMPALA-4873 by adding the capability to supply a dict
'test_file_vars' to run_test_case(). Keys in this dict will be replaced
with their values inside test queries before they are executed.

Change-Id: Ie3f3c29a42501cfb2751f7ad0af166eb88f63b70
Reviewed-on: http://gerrit.cloudera.org:8080/6817
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-09 01:36:46 +00:00
Alex Behm
d3cc23e569 IMPALA-5021: Fix count(*) remaining rows overflow in Parquet.
Zero-slot scans of Parquet files that have num_rows > MAX_INT32
in the footer metadata used to run forever due to an overflow when
calculating the remaining number of rows to process.
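
The shape of the bug, as a sketch: the remaining-row counter must be
64-bit, since the footer's num_rows may exceed MAX_INT32 (illustrative
names):

  #include <cstdint>

  // With 'remaining' declared int32_t, num_rows > 2^31-1 wraps negative
  // or never reaches zero; int64_t makes the countdown terminate.
  int64_t CountRows(int64_t num_rows, int64_t batch_size) {
    int64_t returned = 0;
    for (int64_t remaining = num_rows; remaining > 0; remaining -= batch_size) {
      returned += remaining < batch_size ? remaining : batch_size;
    }
    return returned;
  }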

Testing:
- Added a regression test using a file with num_rows = 2*MAX_INT32.
- Locally ran test_scanners.py which succeeded.
- Private core/hdfs run succeeded

Change-Id: Ib9f8a6b83f8f621451d5977423ef81a6e4b124bd
Reviewed-on: http://gerrit.cloudera.org:8080/6286
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-08 02:00:30 +00:00
Dan Hecht
bf2e897209 IMPALA-4810: add DECIMAL test case to strict_mode tests
The string parsing code already errors if the decimal column either
overflows or underflows (i.e. loses scale). Let's just add a test
case.

Change-Id: Idd66c0fb5a4d201919d39f73dea08b87339d6469
Reviewed-on: http://gerrit.cloudera.org:8080/6150
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-03 01:43:42 +00:00
Taras Bobrovytsky
858f5c2197 IMPALA-4363: Add Parquet timestamp validation
Before this patch, we would simply read the INT96 Parquet timestamp
representation and assume that it's valid. However, not all bit
permutations represent a valid timestamp. One of the boost functions
raised an exception (that we didn't catch) when passed an invalid
boost date object, which resulted in a crash. This patch fixes the
problem by validating that the date falls into the 1400..9999 year
range as we are scanning Parquet.
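
A sketch of the added bound check on the Julian-day component of the
INT96 timestamp (the constants here are approximate and illustrative,
not Impala's exact values):

  #include <cstdint>

  // Reject dates outside the 1400..9999 year range before handing the
  // value to boost, which throws on invalid date objects.
  bool IsDateInSupportedRange(int32_t julian_day) {
    const int32_t kMinJulianDay = 2232400;  // ~1400-01-01
    const int32_t kMaxJulianDay = 5373484;  // ~9999-12-31
    return julian_day >= kMinJulianDay && julian_day <= kMaxJulianDay;
  }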

Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac
Reviewed-on: http://gerrit.cloudera.org:8080/5343
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2016-12-03 06:41:07 +00:00
Alex Behm
0449b5beab IMPALA-3943: Do not throw scan errors for empty Parquet files.
For Parquet files with no row groups but with num_rows=0 in the
file footer, the Parquet scanner returned an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions, which used to accept such files.

This patch restores the previous behavior and adds tests.

Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Reviewed-on: http://gerrit.cloudera.org:8080/4693
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-12 09:22:57 +00:00
Thomas Tauber-Marshall
b2c2fe7813 IMPALA-3786: Replace "cloudera" with "apache" (part 2)
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*

A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:

find . | grep "\.java\|\.py\|\.h\|\.cc" | \
  xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g

along with some manual fixes.

After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
  eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/

Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-09-29 21:14:13 +00:00
Jim Apple
bd2947329e IMPALA-4110: Clean up issues found by Apache RAT.
Change-Id: I5bfe77f9a871018e7a67553ed270e2df53006962
Reviewed-on: http://gerrit.cloudera.org:8080/4361
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:09:24 +00:00
Alex Behm
025fd3bd7f IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.

The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.

There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.
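
The handled case, as a condition sketch (run-header layout per the
Parquet RLE/bit-packed hybrid encoding):

  #include <cstdint>

  // The run header encodes the literal or repeat count in its upper
  // bits; a decoded count of 0 can only come from corruption, so fail
  // the scan instead of spinning on an empty run.
  bool IsValidRunHeader(uint32_t header) {
    return (header >> 1) != 0;
  }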

Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Reviewed-on: http://gerrit.cloudera.org:8080/3299
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-07 17:29:59 -07:00
Lars Volker
b5570da405 IMPALA-1740: Add support for skip.header.line.count.
HIVE-5795 introduced a parameter skip.header.line.count to skip header
lines from input files. This change introduces the capability to skip
an arbitrary number of header lines from CSV input files on HDFS. The
size of the total file header must be smaller than
max_scan_range_length, otherwise an error will be reported. This is
necessary because scan ranges are not read in disk order, so there is
no way of identifying header lines except by counting from the start
of the first scan range.

[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1  | 1  |
| 2  | 2  |
| 3  | 3  |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1   | c2   |
+------+------+
| NULL | NULL |
| 1    | 1    |
| 2    | 2    |
| 3    | 3    |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2

Fetched 4 row(s) in 0.41s

Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:46 -07:00
Juan Yu
c9b33ddf63 IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files.
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.
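
A sketch of multi-stream handling with zlib's real API: on
Z_STREAM_END, reset the stream and keep inflating instead of stopping at
the first gzip member (buffer management simplified; assumes 'out' is
large enough):

  #include <zlib.h>
  #include <cstddef>

  // Decompress all members of a concatenated gzip buffer.
  bool InflateAllMembers(const unsigned char* in, size_t in_len,
                         unsigned char* out, size_t out_len) {
    z_stream zs = {};
    if (inflateInit2(&zs, 16 + MAX_WBITS) != Z_OK) return false;  // gzip mode
    zs.next_in = const_cast<unsigned char*>(in);
    zs.avail_in = (uInt)in_len;
    zs.next_out = out;
    zs.avail_out = (uInt)out_len;
    while (zs.avail_in > 0 && zs.avail_out > 0) {
      int ret = inflate(&zs, Z_NO_FLUSH);
      if (ret == Z_STREAM_END) {
        // End of one gzip member: reset and continue with the next.
        if (inflateReset(&zs) != Z_OK) break;
      } else if (ret != Z_OK) {
        inflateEnd(&zs);
        return false;
      }
    }
    inflateEnd(&zs);
    return true;
  }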

Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2016-02-28 21:31:37 -08:00
Skye Wanderman-Milne
dd2eb951d7 IMPALA-2558: DCHECK in parquet scanner after block read error
There was an incorrect DCHECK in the parquet scanner. If abort_on_error
is false, the intended behaviour is to skip to the next row group, but
the DCHECK assumed that execution should have aborted if a parse error
was encountered.

This also:
- Fixes a DCHECK after an empty row group. InitColumns() would try to
  create empty scan ranges for the column readers.
- Uses metadata_range_->file() instead of stream_->filename() in the
  scanner. InitColumns() was using stream_->filename() in error
  messages, which used to work but now stream_ is set to NULL before
  calling InitColumns().

Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701
Reviewed-on: http://gerrit.cloudera.org:8080/1257
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-10-30 22:35:57 +00:00
Taras Bobrovytsky
b8b7930377 Add nested types support to Create Table Like File
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.

Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 01:46:26 +00:00
Taras Bobrovytsky
3c9ceb1a2b Add Parquet nested schemas to testdata
A script is added that generates two parquet files with nested data.
One file has modern nested types encoding and the other one has
legacy encoding. This data will be used for testing nested types
support for the "create table like file" statement.

Change-Id: I8a4f64c9f7b3228583f3cb0af5507a9dd4d152ef
Reviewed-on: http://gerrit.cloudera.org:8080/610
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 10:25:39 +00:00
Ippokratis Pandis
e99c68fe52 IMPALA-2130: Wrong verification of Parquet file version
This patch corrects a mistake in the Parquet magic number verification
and adds a test for it. Note that with this patch Impala may fail to read
Parquet files with a wrong magic number that it used to read before.

Change-Id: Iff31accda1e1d541946ef1f750e38886ce4cb8d5
Reviewed-on: http://gerrit.cloudera.org:8080/515
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-07-14 02:52:02 +00:00
Juan Yu
934b28fe5e IMPALA-1381: Expand set of supported timezones.
The hardcoded timezone information is from Java version 1.7.0_76.

Change-Id: I32c40d0036473079e5bfd4d0252a648cbb0e7c23
Reviewed-on: http://gerrit.cloudera.org:8080/393
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-05-22 01:32:54 +00:00
Casey Ching
ac0c075997 Parquet: Fix value def level when max def level is 0
When running with a release build, NULL would be returned when
reading values from required fields in parquet files (with a debug
build a DCHECK would be hit).

Previously when the max definition level for a field was 0 (which
happens if a field is required), the definition level for value was
incorrectly set to 1. The max definition level is related to nested
data and is defined to be the number of nullable fields that will be
encountered when traversing a path to reach the desired end field.
For example, if a nested schema has a path a.b.c.d where b and d are
nullable then the max def level is 2. A def level is attached to each
value to indicate the number of optional values that are present (in
the previous example an def level of 2 means both b and d are not
null). So having a def level for a value that is greater than the max
def level for a field should never happen.
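
The two quantities from the example, as a sketch (schema walk reduced
to a flag per path component):

  #include <vector>

  // Max def level for a path = number of nullable fields along it.
  // For a.b.c.d with only b and d nullable: {false, true, false, true} -> 2.
  int MaxDefLevel(const std::vector<bool>& is_optional_on_path) {
    int level = 0;
    for (bool optional_field : is_optional_on_path) {
      if (optional_field) ++level;
    }
    return level;
  }

  // A value is non-NULL only when its def level equals the max; with a
  // max def level of 0 (all fields required) every value is defined.
  bool IsValueNull(int def_level, int max_def_level) {
    return def_level < max_def_level;
  }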

Change-Id: Ia91a97cf79e672c420d10416c6817f0930dcc920
(cherry picked from commit cdd67e4c7fd62d5b08adfaa303d7bb2382e6932c)
Reviewed-on: http://gerrit.cloudera.org:8080/386
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-05-15 06:41:02 +00:00
Juan Yu
d1c263402e IMPALA-1973: Fixing crash when uninitialized, empty row is added in HdfsTextScanner
This patch fixes an issue where an uninitialized, empty row is
incorrectly added to the row batch. The uninitialized data inside this
row later leads to a crash when the null byte is checked together with
the offsets (which contain garbage).

The fix is to check not only the number of materialized columns, but
also the number of materialized partition key columns. Only if both are
empty and the parser has an unfinished tuple is the empty row added.

To accommodate the last row, check in FinishScanRange() whether there
is an unfinished tuple with materialized slots or materialized partition
keys. Write the fields if necessary.
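
The fix, as a condition sketch (names hypothetical):

  // Add a row with no materialized data only when the tuple is truly
  // empty: no materialized slots AND no materialized partition keys.
  bool ShouldAddEmptyRow(int num_materialized_slots,
                         int num_materialized_partition_keys,
                         bool has_unfinished_tuple) {
    return num_materialized_slots == 0 &&
           num_materialized_partition_keys == 0 &&
           has_unfinished_tuple;
  }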

Change-Id: I2808cc228e62d048d917d3a6352d869d117597ab
(cherry picked from commit c1795a8b40d10fbb32d9051a0e7de5ebffc8a6bd)
Reviewed-on: http://gerrit.cloudera.org:8080/364
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-05-05 00:19:12 +00:00
Juan Yu
e121bc9b0a IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line.
I did a local benchmark and there's minimal performance impact (<1%).

Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-03-20 19:58:50 -07:00
Dan Hecht
99d3caacb7 Enable local filesystem tables
The S3 work really enabled any Hadoop FileSystem to work with Impala,
but a small tweak is needed for LocalFileSystem due to how the Hadoop
Path code deals with URIs that don't have an authority component.

While we aren't claiming support for arbitrary FileSystems at this
time, it is useful to test this. Since the S3 testing is done as a
nightly test rather than pre-checkin, we can use the LocalFileSystem to
regression test that:

1) Impala can access table data living on a secondary filesystem,
   i.e. not the filesystem specified by fs.defaultFS.
2) Impala does not make assumptions that the filesystem has type
   DistributedFileSystem.

Change-Id: Ie9b858ea440c9b3b332602e034c8052b168c57da
Reviewed-on: http://gerrit.cloudera.org:8080/121
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-02-27 18:48:56 +00:00
casey
87b9fac2ad IMPALA-1658: Add compatibility flag for Hive-Parquet-Timestamps
No changes to writing were made. No changes to reading Impala written
files were made.

Hive writes TIMESTAMP values to parquet files differently than Impala
does. Hive converts the value from local time to UTC before writing;
Impala does not. This change adds a startup flag that will convert UTC
to local when reading files written by Hive.

The Hive-file detection actually checks for "parquet-mr" (which is the
library Hive uses) in the file metadata. A slight possibility exists
that TIMESTAMP values written by something other than Hive but also
using parquet-mr may become incorrect. The possibility should be very
small because TIMESTAMP values are stored and encoded in a non-standard
way that other applications are unlikely to be aware of.

Flags from be/src/exec/hdfs-parquet-scanner.cc:
  -convert_legacy_hive_parquet_utc_timestamps (When true, TIMESTAMPs
    read from files written by Parquet-MR (used by Hive) will be
    converted from UTC to local time. Writes are unaffected.) type: bool
    default: false
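
A sketch of the flag-gated conversion using POSIX/glibc time functions
(Impala's real conversion goes through its own timezone handling):

  #include <ctime>

  // Reinterpret a zoneless timestamp that was stored as UTC wall-clock
  // time in the server's local timezone.
  time_t UtcToLocalWallClock(time_t utc) {
    struct tm fields;
    localtime_r(&utc, &fields);  // local wall-clock fields for the instant
    return timegm(&fields);      // re-encode the fields as a zoneless value
  }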

Change-Id: I79a499fe24049b7025ee2dd76c9c3e07010d346a
Reviewed-on: http://gerrit.cloudera.org:8080/35
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-02-11 13:28:17 +00:00
Dan Hecht
3735ea94a0 S3: Don't seek/read past file end
DistributedFileSystem is lenient about seeking past the end of the file.
Other FileSystem implementations, such as NativeS3FileSystem, return an
error on this condition.  That leads to a scary looking message in the
query warnings.

So, when creating scan ranges, let's require that the ranges fall within
the file bounds (at least according to what the HdfsFileDesc indicates
is the length). There were a couple of kinds of AllocateScanRange()
callsites that needed to be fixed up:

1) When a stream wants to read past a scan range, be careful not to read
past the end of the file.

2) When Impala needs to "guess" at the length of a range, use the
file_length as an upper bound on the guess. We were already doing this
in some places but not everywhere.

3) When the scan range is derived from parquet metadata, validate the
metadata against file_length and issue appropriate errors.  This will
give better diagnostics for corrupt files.

Note that we can't rely on this for safety (HdfsFileDesc file_length may
be stale), but it does mean that when metadata is up-to-date Impala will
no longer try to access beyond the end of files (and so we'll no longer
get false positive errors from the filesystem).
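
For cases (1) and (2), the clamping rule is simple (sketch; names
hypothetical):

  #include <algorithm>
  #include <cstdint>

  // Clamp a requested scan range so it never extends past the file end
  // recorded in the file descriptor (which may itself be stale).
  int64_t ClampScanRangeLen(int64_t offset, int64_t requested_len,
                            int64_t file_length) {
    return std::max<int64_t>(0, std::min(requested_len, file_length - offset));
  }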

Additionally, this change revealed a pre-existing problem with files
that have multiple row-groups.  The first time through InitColumns(),
stream_ was set to NULL.  But, stream_->filename could potentially be
accessed when constructing error statuses for subsequent row-groups.

Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
2015-01-08 16:19:35 -08:00
Skye Wanderman-Milne
4a722980e5 IMPALA-1401: raise MAX_PAGE_HEADER_SIZE and use scanner context to
stitch together header buffer

Change-Id: I4f33b90e845e9bef1ac929bf4ebb8e98eaff985c
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4961
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c3a90183b2f03434a9604f3aa2ef6dd08c9ba97c)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4981
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-10-27 16:30:56 -07:00
Victor Bittorf
af4b2086dc Char PARQUET, AVRO, and TEXT tests
Adds fixes and tests for Hive CHAR & VARCHAR compatibility.
Also fixes a bug in tuple materialization for VARCHAR and non in-lined CHAR.

Change-Id: I400b089cb8ddba2e264ef9f2e37956b2ceaaf9fb
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4054
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-09-26 12:24:07 -07:00
Victor Bittorf
9939c9d009 Bugfix and tests for CHAR(N) and VARCHAR(N)
Fixed a bug when setting the length while reading/writing text files for CHAR(N).
Also added chars_tiny table for testing CHAR(N) and VARCHAR(N).

Change-Id: If5d5db30afa4b00cf03c68c6a845f182970329f4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4415
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-09-23 07:30:07 -07:00
Victor Bittorf
2d7f2e19b2 IMPALA 938: Infer schema from Parquet file
Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
Run testdata/bin/create-load-data.sh after pulling this patch.

Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
2014-06-20 17:38:01 -07:00
Skye Wanderman-Milne
edbbe6035e Decimal: read from Avro
Allows reading decimal columns with or without codegen. Includes tests
based on a data file posted on HIVE-5823.

Change-Id: Ie541c6b98bd24543691850cb45a434af60b5a5a6
(cherry picked from commit 6983dcefdf70cce14724e17d03bc061ffb8f671c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2596
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-05-16 22:26:11 -07:00
Nong Li
87295a4e06 Decimal implementation.
This patch implements decimal support for text-based formats.

Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238
Tested-by: jenkins
2014-04-14 21:07:32 -07:00
Lenni Kuff
cc1c0c61fd IMP-1291: Support "extended" ASCII characters as delimiters in text files
This fixes how we validate delimiters to be in line with Hive. A delimiter must
fit in a single byte and can be specified in the following formats, as far as I can
tell (there is no documentation):
- A single ASCII or unicode character (ex. '|')
- An escape character in octal format (ex. \001. Stored in the metastore as a
  unicode character: \u0001).
- A signed decimal integer in the range [-128:127]. Used to support delimiters
  for ASCII character values between 128-255 (-2 maps to ASCII 254).

Previously, we were not handling the "signed integer" case so there was no way
to specify a delimiter in the "extended" ASCII range of 128-255.
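
A sketch of the added signed-integer handling (simplified; octal escape
handling omitted):

  #include <cstdlib>
  #include <cstring>

  // Interpret a delimiter spec as a single byte. A signed decimal in
  // [-128, 127] maps onto the extended ASCII range: "-2" -> 254.
  bool ParseDelimiter(const char* spec, unsigned char* out) {
    if (strlen(spec) == 1) { *out = spec[0]; return true; }  // literal char
    char* end = nullptr;
    long v = strtol(spec, &end, 10);
    if (end == spec || *end != '\0' || v < -128 || v > 127) return false;
    *out = (unsigned char)v;  // two's complement: -2 becomes 254
    return true;
  }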

To support result validation, the test infrastructure had to be updated to support
reading/writing different character encodings.

Change-Id: Ie3c4d444dc9c6e60192093ed0c0f6f151eab16bc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1848
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1888
2014-03-13 13:00:15 -07:00
Skye Wanderman-Milne
561da008c7 IMPALA-729: fix resource management in Parquet scanner for multiple row groups
We weren't attaching resources to the row batch when starting a new
row group, so it was possible for string data to be overwritten. This
patch removes CloseStreams() and merges its functionality with
AttachCompletedResources() so it's not possible to destroy streams
without transferring the resources first. It also merges and removes
ScannerContext::Close().

Also adds test cases for IMPALA-720.

Change-Id: Ia8f40c7d39d8702716f1d337fe797e2696bd0fcb
2014-01-08 10:56:26 -08:00
Skye Wanderman-Milne
9e17042185 Allow zero bit width dict/RLE decoders.
This allows us to read single-value dictionary-encoded columns
generated by parquet-mr.

Change-Id: I80903d910d0cc3a3e4ebf02e34212d868e94feb4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1098
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
de531e15bd IMPALA-694: Allow Impala to read files produced by parquet-mr version <= 1.2.8
parquet-mr had a bug where it didn't include the dictionary page's
header in the total column size. We now compensate for this by
detecting these files and padding the scan range length. This required
changing how the scanner detects when it's finished: it now counts the
number of rows rather than checking eosr (since the scan range may be
longer than the column).

Change-Id: Id9933808b965003c0c3b3aa78c32fe29a0c4bcbe
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1097
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:27 -08:00
Skye Wanderman-Milne
9147cd7518 IMPALA-525: Adjust IO buffer size based on read length and other memory fixes
We were previously wasting memory by always reading into 8MB IO
buffers, even when the data read was much less than 8MB. With this
patch, the IO manager picks a buffer size closer to the actual amount
being read (we don't use the exact size so we can continue to recycle
buffers). The minimum IO buffer size is determined via the
--min_buffer_size flag, and the max IO buffer size via the --read_size
flag.
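
One way to express the sizing policy (sketch; power-of-two rounding
keeps buffers recyclable, and the names are hypothetical):

  #include <algorithm>
  #include <cstdint>

  // Pick an IO buffer size for a read of 'len' bytes: round up to a
  // power-of-two multiple of min_size, clamped to [min_size, max_size].
  // Assumes min_size > 0.
  int64_t PickBufferSize(int64_t len, int64_t min_size, int64_t max_size) {
    int64_t size = min_size;
    while (size < len && size < max_size) size *= 2;
    return std::min(size, max_size);
  }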

This technique also helps with IMPALA-652, since short columns will
not use as much memory as before (we will not use considerably more
memory than the size of the table).

This patch also changes StringBuffer to use a doubling strategy so it
doesn't end up allocating many large unused buffers, and has the
scanner context use the requested length as the sync read size if it's
larger than the size produced by read_past_size_cb(). These changes
help prevent the boundary buffer in the scanner context from
allocating excess memory.

Change-Id: I0efb3b023ddfddb08bca22d5cb5f9511fb4d6c50
Reviewed-on: http://gerrit.ent.cloudera.com:8080/938
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:01 -08:00
Alex Behm
9a201645cd IMPALA-496: Fix escaping of field delimiter and escape character in inserts
Change-Id: I49c36ae9823b35dcb9e92d1a13bef270657e36f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/163
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:09 -08:00
Nong Li
0385d14d69 Fix pre-hive 9 rc file scanner. 2014-01-08 10:48:41 -08:00
Nong Li
783480d6bf - Cleaned up some TODOs.
- Fix tuple template. Fixed strcmp
- atoi/atof handle overflows.
- added likely/unlikely compiler directive
- Runquery now reports mean/stddev for profile runs
- removed quoted char
2012-01-18 23:08:29 -08:00
Nong Li
c84fec38d3 - Move thrift out of FE src and into impala/common
- Thrift files now build using cmake instead of mvn
- Added cmake build to impala/ which drives the build process
2011-12-30 19:35:20 -08:00
Nong Li
2880f54d35 Perf Work:
- Added perf counter utility
  - Added google perf tools
  - Added html data set
  - Added escape char test
  - Initial perf tuning
2011-12-30 00:26:27 -08:00