impala

mirror of https://github.com/apache/impala.git synced 2025-12-30 21:02:41 -05:00

Author	SHA1	Message	Date
Dimitris Tsirogiannis	3db5ced4ce	IMPALA-3726: Add support for Kudu-specific column options This commit adds support for Kudu-specific column options in CREATE TABLE statements. The syntax is: CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [, ....]) where option is: \| NULL \| NOT NULL \| ENCODING encoding_val \| COMPRESSION compression_algorithm \| DEFAULT expr \| BLOCK_SIZE num The output of the SHOW CREATE TABLE statement was altered to include all the specified column options for Kudu tables. Change-Id: I727b9ae1b7b2387db752b58081398dd3f3449c02 Reviewed-on: http://gerrit.cloudera.org:8080/5026 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-18 11:41:01 +00:00
Taras Bobrovytsky	eb8120d218	IMPALA-3812: Fix error message for unsupported types Before this patch an unclear error message was returned if DATE or DATETIME appeared in the select list after a star expansion. This was because DATE and DATETIME PrimitiveType was serialized as INVALID_TYPE. This is fixed by serializing correctly. Change-Id: I9019b4bfd219f94e554c795befd3ff5e39706ea9 Reviewed-on: http://gerrit.cloudera.org:8080/4859 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-17 05:31:34 +00:00
Dimitris Tsirogiannis	d802f321b2	IMPALA-3724: Support Kudu non-covering range partitions This commit adds support for non-covering range partitions in Kudu tables. The SPLIT ROWS clause is now deprecated and no longer supported. The following new syntax provides more flexibility in creating range partitions and it supports bounded and unbounded ranges as well as single value partitions; multi-column range partitions are supported as well. The new syntax is: DISTRIBUTE BY RANGE (col_list) ( PARTITION lower_1 <[=] VALUES <[=] upper_1, PARTITION lower_2 <[=] VALUES <[=] upper_2, .... PARTITION lower_n <[=] VALUES <[=] upper_n, PARTITION VALUE = val_1, .... PARTITION VALUE = val_n ) Multi-column range partitions are specified as follows: DISTRIBUTE BY RANGE (col1, col2,..., coln) ( PARTITION VALUE = (col1_val, col2_val, ..., coln_val), .... PARTITION VALUE = (col1_val, col2_val, ..., coln_val) ) Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06 Reviewed-on: http://gerrit.cloudera.org:8080/4856 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 22:02:22 +00:00
Tim Armstrong	6587c08f70	IMPALA-4387: validate decimal type in Avro file schema This patch prevents an invalid decimal type in an Avro file schema from crashing Impala. Most invalid Avro schemas are caught by the frontend, but file schemas still need to be validated by the backend. After this patch files with bad schemas are skipped. Testing: This was hit very rarely by the scanner fuzzing. Added a regression test that scans a file with a bad schema. Change-Id: I25a326ee2220bc14d3b5f887dc288b4adf859cfc Reviewed-on: http://gerrit.cloudera.org:8080/4876 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-30 00:12:58 +00:00
Dimitris Tsirogiannis	8a49ceaae5	IMPALA-3739: Enable stress tests on Kudu This commit modifies the stress test framework to run TPC-H and TPC-DS workloads against Kudu. The follwing changes are included in this commit: 1. Created template files with DDL and DML statements for loading TPC-H and TPC-DS data in Kudu 2. Created a script (load-tpc-kudu.py) to load data in Kudu. The script is invoked by the stress test runner to load test data in an existing Impala/Kudu cluster (both local and CM-managed clusters are supported). 3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL files with TPC-H queries for Kudu were added in a previous patch. 4. Modified the stress test runner to take additional parameters specific to Kudu (e.g. kudu master addr) The stress test runner for Kudu was tested on EC2 clusters for both TPC-H and TPC-DS workloads. Missing functionality: * No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu. * Not all supported TPC-DS queries are included. Currently, only the TPC-DS queries from the testdata/workloads/tpcds/queries directory were modified to run against Kudu. Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Reviewed-on: http://gerrit.cloudera.org:8080/4327 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 11:01:37 +00:00
Dimitris Tsirogiannis	041fa6d946	IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables With this commit we simplify the syntax and handling of CREATE TABLE statements for both managed and external Kudu tables. Syntax example: CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b)) DISTRIBUTE BY HASH (a) INTO 3 BUCKETS, RANGE (b) SPLIT ROWS (('abc', 'def')) STORED AS KUDU Changes: 1) Remove the requirement to specify table properties such as key columns in tblproperties. 2) Read table schema (column definitions, primary keys, and distribution schemes) from Kudu instead of the HMS. 3) For external tables, the Kudu table is now required to exist at the time of creation in Impala. 4) Disallow table properties that could conflict with an existing table. Ex: key_columns cannot be specified. 5) Add KUDU as a file format. 6) Add a startup flag to impalad to specify the default Kudu master addresses. The flag is used as the default value for the table property kudu_master_addresses but it can still be overriden using TBLPROPERTIES. 7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE wasn't implemented for Kudu tables and silently ignored. The Kudu tables wouldn't be removed in Kudu. 8) Remove DDL delegates. There was only one functional delegate (for Kudu) the existence of the other delegate and the use of delegates in general has led to confusion. The Kudu delegate only exists to provide functionality missing from Hive. 9) Add PRIMARY KEY at the column and table level. This syntax is fairly standard. When used at the column level, only one column can be marked as a key. When used at the table level, multiple columns can be used as a key. Only Kudu tables are allowed to use PRIMARY KEY. The old "kudu.key_columns" table property is no longer accepted though it is still used internally. "PRIMARY" is now a keyword. The ident style declaration is used for "KEY" because it is also used for nested map types. 10) For managed tables, infer a Kudu table name if none was given. The table property "kudu.table_name" is optional for managed tables and is required for external tables. If for a managed table a Kudu table name is not provided, a table name will be generated based on the HMS database and table name. 11) Use Kudu master as the source of truth for table metadata instead of HMS when a table is loaded or refreshed. Table/column metadata are cached in the catalog and are stored in HMS in order to be able to use table and column statistics. Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Reviewed-on: http://gerrit.cloudera.org:8080/4414 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 10:52:25 +00:00
Matthew Jacobs	c7fa03286b	IMPALA-3718: Support subset of functional-query for Kudu Adds initial support for the functional-query test workload for Kudu tables. There are a few issues that make loading the functional schema difficult on Kudu: 1) Kudu tables must have one or more columns that together constitute a unique primary key. a) Primary key columns must currently be the first columns in the table definition (KUDU-1271). b) Primary key columns cannot be nullable (KUDU-1570). 2) Kudu tables must be specified with distribution parameters. (1) limits the tables that can be loaded without ugly workarounds. This patch only includes important tables that are used for relevant tests, most notably the alltypes* family. In particular, alltypesagg is important but it does not have a set of columns that are non-nullable and form a unique primary key. As a result, that table is created in Kudu with a different name and an additional BIGINT column for a PK that is a unique index and is generated at data loading time using the ROW_NUMBER analytic function. A view is then wrapped around the underlying table that matches the alltypesagg schema exactly. When KUDU-1570 is resolved, this can be simplified. (2) requires some additional considerations and custom syntax. As a result, the DDL to create the tables is explicitly specified in CREATE_KUDU sections in the functional_schema_constraints.csv, and an additional DEPENDENT_LOAD_KUDU section was added to specify custom data loading DML that differs from the existing DEPENDENT_LOAD. TODO: IMPALA-4005: generate_schema_statements.py needs refactoring Tests that are not relevant or not yet supported have been marked with xfail and a skip where appropriate. TODO: Support remaining functional tables/tests when possible. Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc Reviewed-on: http://gerrit.cloudera.org:8080/4175 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-09-14 22:11:04 +00:00
Alex Behm	c77fb628f7	IMPALA-3915: Register privilege and audit requests when analyzing resolved table refs. The bug: We used to register privilege requests for table refs in TableRef.analyze() which only got called for unresolved TableRefs. As a result, a reference to a view that contains a subquery did not get properly authorized, explained as follows. 1. In the first analysis pass the view is replaced by a an InlineViewRef and we correctly register an authorizarion request. 2. We rewrite the subquery via the StmtRewriter and wipe the analysis state, but preserve the InlineViewRef that replaces the view reference. 3. The rewritten statement is analyzed again, but since an InlineViewRef is considered to be resolved, we never call TableRef.analyze(), and hence never register an authorization event for the view. The fix: We now register authorization and auditing events when calling analyze() on a resolved TableRef (BaseTableRef, InlineViewRef, CollectionTableRef). Change-Id: I18fa8af9a94ce190c5a3c29c3221c659a2ace659 Reviewed-on: http://gerrit.cloudera.org:8080/3783 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-07-29 05:16:23 +00:00
Tim Armstrong	547be27e77	IMPALA-3745: parquet invalid data handling Added checks/error handling: * Negative string lengths while decoding dictionary or data page. * Buffer overruns while decoding dictionary or data page. * Some metadata FILECHECKs were converted to statuses. Testing: Unit tests for: * decoding of strings with negative lengths * truncation of all parquet types * dictionary creation correctly handling error returns from Decode(). End-to-end tests for handling of negative string lengths in dictionary- and plain-encoded data in corrupt files, and for handling of buffer overruns for string data. The corrupted parquet files were generated by hacking Impala's parquet writer to write invalid lengths, and by hacking it to write plain-encoded data instead of dictionary-encoded data by default. Performance: set num_nodes=1; set num_scanner_threads=1; select * from biglineitem where l_orderkey = -1; I inspected MaterializeTupleTime. Before the average was 8.24s and after was 8.36s (a 1.4% slowdown, within the standard deviation of 1.8%). Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece Reviewed-on: http://gerrit.cloudera.org:8080/3387 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-06-15 21:33:39 -07:00
Skye Wanderman-Milne	01287a3ba9	IMPALA-3441, IMPALA-3659: check for malformed Avro data This patch adds error checking to the Avro scanner (both the codegen'd and interepted paths), including out-of-bounds checks and data validity checks. I ran a local benchmark using the following queries: set num_scanner_threads=1; select count(i) from default.avro_bigints_big; # file contains only longs select max(l_orderkey) from biglineitem_avro; # file has tpch.lineitem schema Both benchmark queries see negligable or no performance impact. This patch adds a new Avro scanner unit test and an end-to-end test that queries several corrupted files, as well as updates the zig-zag varlen int unit test. Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132 Reviewed-on: http://gerrit.cloudera.org:8080/3072 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-06-13 18:32:32 -07:00
Matthew Jacobs	f413e236a8	IMPALA-3579: Strict handling of numeric overflow in text parsing Adds a query option 'strict_mode' which treats integer and floating pt overflows as parse errors. In the past, overflows were ignored and the max value was returned. When this query option is set, overflowing values are treated as if they were completely invalid data, i.e. NULL is returned. When abort_on_error is enabled, this means the query is aborted. Notes: * DECIMAL overflow/underflow is already treated as an error. * The handling in text-converter treats underflows the same as overflows, so they would result in the same behavior. However, floating point parsing never returns an underflow today. * We may also want to handle numeric values that are truncated when parsing to integer types, e.g. 10.5 -> 10. Change-Id: I7409c31ec0cb6fe0b2d9842b9f58fe1670914836 Reviewed-on: http://gerrit.cloudera.org:8080/3150 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-05-23 08:40:20 -07:00
Lars Volker	a09c80a33e	IMPALA-3458: Fix table creation to test insert with header lines For IMPALA-1740 we added a test to insert.test, which creates a table and inserts data. The table was created on HDFS by default and thus inserts with compression enabled did not work. This change adds the required table to the functional schema in the same way we do it for the other insert tests. Change-Id: Ie68e7067b7a16218d27935820d5d1ce7035d2e6c Reviewed-on: http://gerrit.cloudera.org:8080/2919 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:47 -07:00
Lars Volker	b5570da405	IMPALA-1740: Add support for skip.header.line.count. HIVE-5795 introduced a parameter skip.header.line.count to skip header lines from input files. This change introduces the capability to skip an arbitrary number of header lines from csv input files on hdfs. The size of the total file header must be smaller than max_scan_range_length, otherwise an error will be reported. This is necessary because scan ranges are not read in disk order, so there is no way of identifying header lines except by counting from the start of the first scan range. [localhost:21000] > alter table t1 set tblproperties('skip.header.line.count'='1'); Query: alter table t1 set tblproperties('skip.header.line.count'='1') [localhost:21000] > select * from t1; Query: select * from t1 +----+----+ \| c1 \| c2 \| +----+----+ \| 1 \| 1 \| \| 2 \| 2 \| \| 3 \| 3 \| +----+----+ Fetched 3 row(s) in 0.32s [localhost:21000] > alter table t1 set tblproperties('skip.header.line.count'='0'); Query: alter table t1 set tblproperties('skip.header.line.count'='0') [localhost:21000] > select * from t1; Query: select * from t1 +------+------+ \| c1 \| c2 \| +------+------+ \| NULL \| NULL \| \| 1 \| 1 \| \| 2 \| 2 \| \| 3 \| 3 \| +------+------+ WARNINGS: Error converting column: 0 TO INT (Data is: num1) Error converting column: 1 TO DOUBLE (Data is: num2) file: hdfs://localhost:20500/test-warehouse/t1/test.txt record: num1,num2 Fetched 4 row(s) in 0.41s Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543 Reviewed-on: http://gerrit.cloudera.org:8080/2110 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:46 -07:00
Skye Wanderman-Milne	2cbd327d41	Regenerate complextypestbl files to include nested_struct.g field This field was included in the schema and data files, but the checked-in generated parquet files didn't include it. It's not referenced in any tests so we didn't catch it. Change-Id: I5d394f074e7082fa12fafb7e57a144a83b3099a6 Reviewed-on: http://gerrit.cloudera.org:8080/2562 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-04-01 05:06:38 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
David Alves	7381304a23	Merge branch 'feature/kudu' into cdh5-trunk This is the final merge commit that merges the 'feature/kudu' branch into cdh5-trunk. Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a	2016-03-13 19:28:43 -07:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Juan Yu	c9b33ddf63	IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files. Fix a bug in which Impala only reads the first stream of a multi-stream bz2/gzip file. Changes the bz2 decoder to read the file in a streaming fashion rather than reading the entire file into memory before it can be decompressed. Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8 (cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e) Reviewed-on: http://gerrit.cloudera.org:8080/2219 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-02-28 21:31:37 -08:00
Alex Behm	c6fd5a0fe4	IMPALA-2844: Allow count(*) on RC files with complex types. This patch also fixes the incorrect error message reported in the JIRA. Change-Id: I2c7b732767d154c36bc7189df5177d27a35d0d7b Reviewed-on: http://gerrit.cloudera.org:8080/2267 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
Alex Behm	db90f9954e	Fix OOM when data loading with the new Hive SNAPSHOT. This patch is required for upgrading thirdparty. After upgrading thirdparty, Hive OOMs in data loading TPCDS. This patch fixes the OOM, but I have no idea why exactly it works. I had tried several other approaches like increasing the mapred and JVM heap memory to no avail. Change-Id: I3c69e4c3bf0f24e0c3b6272c946f71c17e312e01 Reviewed-on: http://gerrit.cloudera.org:8080/2145 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-14 07:08:22 +00:00
Alex Behm	1a4a830a6d	IMPALA-2776: Remove escapechartesttable and associated tests. The original purpose of the escapechartesttable was to test Impala's behavior on text tables that have the same character as line terminator and escape character. Recent changes in Hive have made creating such a table impossible because 1) Only newline is allowed as the line terminator 2) Newline is forbidden as the escape character See HIVE-11785 for details on the Hive changes. This commit removes escapechartesttable and all associated tests, but does not add the same enforcement rules as Hive. These enforcement rules should be added in a follow-on change. Change-Id: I2bd9755f4c2cc3d7dfd8d67c3759885951550f08 Reviewed-on: http://gerrit.cloudera.org:8080/1690 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-01-05 06:04:41 +00:00
Skye Wanderman-Milne	dd2eb951d7	IMPALA-2558: DCHECK in parquet scanner after block read error There was an incorrect DCHECK in the parquet scanner. If abort_on_error is false, the intended behaviour is to skip to the next row group, but the DCHECK assumed that execution should have aborted if a parse error was encountered. This also: - Fixes a DCHECK after an empty row group. InitColumns() would try to create empty scan ranges for the column readers. - Uses metadata_range_->file() instead of stream_->filename() in the scanner. InitColumns() was using stream_->filename() in error messages, which used to work but now stream_ is set to NULL before calling InitColumns(). Change-Id: I8e29e4c0c268c119e1583f16bd6cf7cd59591701 Reviewed-on: http://gerrit.cloudera.org:8080/1257 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-10-30 22:35:57 +00:00
Matthew Jacobs	1cd95f79f0	Add partitions to table_no_newline_part IF NOT EXISTS Loading functional data may fail if table_no_newline_part partitions already exist. We should only ADD with IF NOT EXISTS to handle this case as we do with other tables. Change-Id: I5fe5c318d2cbbd5b5419394212b94a2fe7d386ce Reviewed-on: http://gerrit.cloudera.org:8080/1261 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-10-17 00:52:53 +00:00
Sailesh Mukil	277a92a14a	IMPALA-2479: Failure in TestParquet.test_verify_runtime_profile The test_verify_runtime_profile test failed during C5.5 builds and GVMs because this test relies on the table lineitem_multiblock to have 3 blocks. However, due to the rules to load the data not being followed in the functional_schema_template.sql file, the table ended up being stored with only one block. This change moves the data load to the end of create-load-data.sh file which would load the data even for snapshots. Change-Id: I78030dd390d2453230c4b7b581ae33004dbf71be Reviewed-on: http://gerrit.cloudera.org:8080/1153 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2015-10-08 15:16:35 -07:00
Sailesh Mukil	7778b3ded5	IMPALA-1881: Maximize data locality when scanning Parquet files with multiple row groups. Impala supports reading Parquet files with multiple row groups but with possible performance degradation due to remote reads. This patch maximizes scan locality by allowing multiple impalads to scan the rowgroups in their local splits. Each impalad starts a new scan range for each split local to it if that split contains row group(s) that need to be scanned. Change-Id: Iaecc5fb8e89364780bc59dbfa9ae51d0d124d16e Reviewed-on: http://gerrit.cloudera.org:8080/908 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2015-10-05 11:30:39 -07:00
Martin Grund	4b102fefd2	KUDU-1115: Deprecate 'kudu.split_keys' for range partitioning For range partitioning use DISTRIBUTE BY RANGE instead. Change-Id: I055b605312a6be76a5439526f9c10ab4c4b432ce Reviewed-on: http://gerrit.sjc.cloudera.com:8080/8063 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-09-16 16:43:47 -07:00
ishaan	3278ab9ef7	IMPALA-2292: Change the type of timestamp_col to string in the table no_avro_schema. Hive does not allow DDL when an avro table has a timestamp column. This breaks data loading. This change simply uses a string type to denote the previous timestamp column. Change-Id: Id4eb42da3ac8a805d12b1c3ff35b200298269084 Reviewed-on: http://gerrit.cloudera.org:8080/745 Tested-by: Internal Jenkins Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2015-09-04 06:46:35 +00:00
Skye Wanderman-Milne	bcc73a36da	Nested types: read and materialize nested types in Parquet scanner This patch modifies the Parquet scanner to resolve nested schemas, and read and materialize collection types. The high-level modification is to create a CollectionColumnReader that recursively materializes map- and array-type slots. This patch also adds many tests, most of which query a new table called complextypestbl. This table contains hand-generated data that is meant to expose edge cases in the scanner. The tests mostly test the scanner, with a few tests of other functionality (e.g. array serialization). I ran a local benchmark comparing this scanner code to the original scanner code on an expanded version of tpch_parquet.lineitem with 48009720 rows. My benchmark involved selecting different numbers of columns with a single scanner thread, and I looked at the HDFS scan node time in the query profiles. This code introduces a 10%-20% regression in single-threaded scan time. Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a Reviewed-on: http://gerrit.cloudera.org:8080/576 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-02 19:23:54 +00:00
Martin Grund	f927a285c6	IMPALA-1136, IMPALA-2161: Skip \u0000 characters when dealing Avro schemas The limitation of the Avro JSON library not to handle \u0000 characters is to avoid problems with builtin functions like strlen() that would report wrong length when encountering such a character. Now, in the case if Impala, for now, we don't support any Unicode characters. This allows us to actually skip the \u0000 character instead of interpreting it. It is important to say that even the most recent versions of Avro do not support parsing \u0000 characters. Change-Id: I56dfa7f0f12979fe9705c51c751513aebce4beca Reviewed-on: http://gerrit.cloudera.org:8080/712 Tested-by: Internal Jenkins Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2015-09-02 00:37:28 +00:00
Alex Behm	9d46853fbc	Nested Types: Check un/supported file formats for complex types. Before this patch, we used to accept any query referencing complex types, regardless of the table/partition's file format being scanned. We would ultimately hit a DCHECK in the BE when attempting to scan complex types of a table/partition with an unsupported format. This patch makes queries fail gracefully during planning if a scan would access a table/partition in a format for which we do not support complex types. For mixed-format partitioned Hdfs tables we perform this check at the partition granularity, so such a table can be scanned as long as only partitions with supported formats are accessed. HBase tables with complex-typed columns can be scanned as long as no complex-typed columns are accessed in the query. Change-Id: I2fd2e386c9755faf2cfe326541698a7094fa0ffc Reviewed-on: http://gerrit.cloudera.org:8080/705 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-09-01 03:26:53 +00:00
Alex Behm	6f0b255c5a	Address several shortcomings with respect to the usability of Avro tables. Addressed JIRAs: IMPALA-1947 and IMPALA-1813 New Feature: Adds support for creating an Avro table without an explicit Avro schema with the following syntax. CREATE TABLE <table_name> column_defs STORED AS AVRO Fixes and Improvements: This patch fixes and unifies the logic for reconciling differences between an Avro table's Avro Schema and its column definitions. This reconciliation logic is executed during Impala's CREATE TABLE and when loading a table's metadata. Impala generally performs the schema reconciliation during table creation, but Hive does not. In many cases, Hive's CREATE TABLE stores the original column definitions in the HMS (in the StorageDescriptor) instead of the reconciled column definitions. The reconciliation logic considers the field/column names and follows this conflict resolution policy which is similar to Hive's: Mismatched number of columns -> Prefer Avro columns. Mismatched name/type -> Prefer Avro column, except: A CHAR/VARCHAR column definition maps to an Avro STRING, and is preserved as a CHAR/VARCHAR in the reconciled schema. Behavior for TIMESTAMP: A TIMESTAMP column definition maps to an Avro STRING and is presented as a STRING in the reconciled schema, because Avro has no binary TIMESTAMP representation. As a result, no Avro table may have a TIMESTAMP column (existing behavior). Change-Id: I8457354568b6049b2dd2794b65fadc06e619d648 Reviewed-on: http://gerrit.cloudera.org:8080/550 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-08-25 09:52:18 +00:00
Alex Behm	c908ba1b7e	IMPALA-1136: Support loading Avro tables without an explicit Avro schema Hive allows creating Avro tables without an explicit Avro schema since 0.14.0. For such tables, the Avro schema is inferred from the column definitions, and not stored in the metadata at all (no Avro schema literal or Avro schema file). This patch adds support for loading the metadata of such tables, although Impala currently cannot create such tables (expect a follow-on patch). Change-Id: I9e66921ffbeff7ce6db9619bcfb30278b571cd95 Reviewed-on: http://gerrit.cloudera.org:8080/538 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-07-31 12:13:37 +00:00
Ippokratis Pandis	e99c68fe52	IMPALA-2130: Wrong verification of Parquet file version This patch corrects a mistake in the Parquet magic file number verification and adds a test about it. Note that with this patch Impala may fail to read Parquet files with wrong magic number that it used to read before. Change-Id: Iff31accda1e1d541946ef1f750e38886ce4cb8d5 Reviewed-on: http://gerrit.cloudera.org:8080/515 Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com> Tested-by: Internal Jenkins	2015-07-14 02:52:02 +00:00
David Alves	8a91b6b60f	Allow to specify split keys in json format in Kudu's CREATE TABLE statement This allows to specify split keys in Kudu's CREATE TABLE statement as part of the key/value pairs in TBLPROPERTIES. Splits are expected to be specified in as json arrays of arrays: [[key1], [key2], ...] 'key1', 'key2 might be single values or comma separated lists of values, depending on whether the table has a simple of compound primary key. This also adds a series of test tables to be created for the kutu table format when load-data.py is executed. Change-Id: I1824199fda14abb2d7352800789f2b9c2f2124ae Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6974 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-06-19 10:40:16 -07:00
David Alves	cf2618df1e	Allow to specify a TABLE_PROPERTIES section in functional_schema_template.sql This adds a new TABLE_PROPERTIES section that will be included in the CREATE TABLE statement as TBLPROPERTIES. Each line in this new section is expected to be in the form: <file_format>:<key>=<value> Properties are only added to create table statements of the file format they specify. Change-Id: I89ef7ced3351ecf2c727050ca426f6616f3e5bcd Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6945 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-06-18 14:23:28 -07:00
Juan Yu	934b28fe5e	IMPALA-1381: Expand set of supported timezones. The hardcoded timezone information is from Java version 1.7.0_76. Change-Id: I32c40d0036473079e5bfd4d0252a648cbb0e7c23 Reviewed-on: http://gerrit.cloudera.org:8080/393 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-05-22 01:32:54 +00:00
Casey Ching	ac0c075997	Parquet: Fix value def level when max def level is 0 When running with a release build, NULL would be returned when reading values from required fields in parquet files (with a debug build a DCHECK would be hit). Previously when the max definition level for a field was 0 (which happens if a field is required), the definition level for value was incorrectly set to 1. The max definition level is related to nested data and is defined to be the number of nullable fields that will be encountered when traversing a path to reach the desired end field. For example, if a nested schema has a path a.b.c.d where b and d are nullable then the max def level is 2. A def level is attached to each value to indicate the number of optional values that are present (in the previous example an def level of 2 means both b and d are not null). So having a def level for a value that is greater than the max def level for a field should never happen. Change-Id: Ia91a97cf79e672c420d10416c6817f0930dcc920 (cherry picked from commit cdd67e4c7fd62d5b08adfaa303d7bb2382e6932c) Reviewed-on: http://gerrit.cloudera.org:8080/386 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-05-15 06:41:02 +00:00
Juan Yu	d1c263402e	IMPALA-1973: Fixing crash when uninitialized, empty row is added in HdfsTextScanner This patch fixes an issue when an uninitialized, empty row is falsely added to the rowbatch. The uninitialized data inside this row leads later on to a crash when the null byte is checked together with the offsets (that contains garbage). The fix is to not only check for the number of materialized columns, but as well for the number of materialized partition key columns. Only if both are empty and the parser has an unfinished tuple, add the empty row. To accommodate for the last row, check in FinishScanRange() if there is an unfinished tuple with materialized slots or materialized partition key. Write the fields if necessary. Change-Id: I2808cc228e62d048d917d3a6352d869d117597ab (cherry picked from commit c1795a8b40d10fbb32d9051a0e7de5ebffc8a6bd) Reviewed-on: http://gerrit.cloudera.org:8080/364 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-05-05 00:19:12 +00:00
Alex Behm	35e669f2ee	Nested Types: Analysis of struct fields and basic collection table refs. This patch adds analysis support for the following new constructs. Struct-field references: Dot-separated paths such as "tblalias.col.field1.field2..." can now be used to access fields of a struct. Such struct-field SlotRefs show up as new slot in the tuple descriptor of the parent table. Collection table references: A collection-typed column or field can be referenced as a table in the FROM clause using a dot-separated path, e.g., "FROM db.tbl.array_col". This exposes the collection's fields as a table whose columns can be referenced as usual. Collections referenced this way are flattened during execution. The following subsections explain the analysis behavior in more detail: Aliases: Collection table references can either have an explicit alias a single implicit alias. Examples: SELECT ... FROM db.tbl.array_col AS a -> explicit alias 'a' SELECT ... FROM db.tbl.array_col -> implicit alias 'array_col' Flattening of collections: Collection table refs can refer to table aliases defined to their left in the same select block. References linked in this way indicate an implicit parent/child join. Example: SELECT parent.id, child.item FROM db.tbl parent, parent.array_col child The aliases 'parent' and 'child' can be used to access the columns of 'tbl' and the fields of 'array_col', respectively. In the flattened result set, collection items implicitly join with parent rows iff the collection row is nested within the parent. The following stmt is not an implicit join, but a scan of a nested collection. SELECT db.tbl.map_col.value FROM db.tbl.map_col The following stmt is not a parent/child join but a cartesian product because the collection does not reference an existing table alias. SELECT a.id, map_col.value FROM db.tbl a, db.tbl.map_col Collection 'table' field names: Structs have explicit field names, but collections generally do not. This patch uses fixed implicit field names such as "item" for arrays and "key"/"value" for maps. If an array or map has a struct element/value then the struct's fields are also accessible without the "item"/"value" indirection. Analysis prevents complex-typed exprs from appearing in the select list of any query statement because it's not yet clear how to present them to a consumer of the query results. To keep things simple, I've deliberately left these work items for follow-on patches: 1. analysis of correlated inline views to handle queries like: SELECT cnt FROM customers c (SELECT COUNT(1) FROM c.orders) v 2. generalized expression substitution to handle cases like: SELECT x.c.d FROM (SELECT a.b x FROM tbl) Change-Id: I5f52f083bcf7056d5bfd21cc784133edb7e82ef2 Reviewed-on: http://gerrit.cloudera.org:8080/43 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-04-29 09:04:25 +00:00
ishaan	90bc481be7	Fix bad auto-merge to get cdh-trunk in shape. While cherry-picking commit c289974 from cdh5-2.2.0_5.4.x, a line that causes a dataset change was somehow missed by git's cherry-pick. This patch adds that line back in. Change-Id: Id163643c55d44fe6a9503b54397f1f370b198554	2015-03-23 14:15:53 -07:00
Juan Yu	e121bc9b0a	IMPALA-1476: Impala incorrectly handles text data missing a newline on the last line. I did a local benchmark and there's minimal performance impact(<1%) Change-Id: I8d84a145acad886c52587258b27d33cff96ea399 (cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0) Reviewed-on: http://gerrit.cloudera.org:8080/189 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-03-20 19:58:50 -07:00
ishaan	d87f83083c	Revert "Fix to enable metadata load for functional tables." This reverts commit 587fedaa983d0b99e79025782d8c3d1891fbb0f5. Change-Id: I6cf34f6f6fc0ee8af6ec7da407569033605b31e4	2015-03-20 19:58:50 -07:00
ishaan	daf9dfd6e1	Fix to enable metadata load for functional tables. We used mkdir without a -p to load a hive13 parquet table. This failed when trying to load the metadata, which is necessary for changes that mutate the warehouse snapshot. With this change, it now passes. Change-Id: Id163643c55d44fe6a9503b54397f1f370b198554	2015-03-20 14:37:21 -07:00
casey	87b9fac2ad	IMPALA-1658: Add compatibility flag for Hive-Parquet-Timestamps No changes to writing were made. No changes to reading Impala written files were made. Hive writes TIMESTAMP values to parquet files differently than Impala does. Hive converts the value from local time to UTC before writing; Impala does not. This change adds a startup flag that will convert UTC to local when reading files written by Hive. The Hive-file detection actually checks for "parquet-mr" (which is the library Hive uses) in the file metadata. A slight possibility exists that TIMESTAMP values written by something other than Hive but also using parquet-mr may become incorrect. The possibility should be very small because TIMESTAMP values are stored and encoded in a non-standard way other applications are unlikely to be aware of. Flags from be/src/exec/hdfs-parquet-scanner.cc: -convert_legacy_hive_parquet_utc_timestamps (When true, TIMESTAMPs read from files written by Parquet-MR (used by Hive) will be converted from UTC to local time. Writes are unaffected.) type: bool default: false Change-Id: I79a499fe24049b7025ee2dd76c9c3e07010d346a Reviewed-on: http://gerrit.cloudera.org:8080/35 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2015-02-11 13:28:17 +00:00
Dan Hecht	3735ea94a0	S3: Don't seek/read past file end DistributedFileSystem is lenient about seeking past the end of the file. Other FileSystem implementations, such as NativeS3FileSystem, return an error on this condition. That leads to a scary looking message in the query warnings. So, when creating scan ranges, let's require that the ranges fall within the file bounds (at least according to what the HdfsFileDesc indicates is the length). There were a couple of kinds of AllocateScanRange() callsites that needed to be fixed up: 1) When a stream wants to read past a scan range, be careful not to read past the end of the file. 2) When Impala needs to "guess" at the length of a range, use the file_length as an upper bound on the guess. We were already doing this someplaces but not everywhere. 3) When the scan range is derived from parquet metadata, validate the metadata against file_length and issue appropriate errors. This will give better diagnostics for corrupt files. Note that we can't rely on this for safety (HdfsFileDesc file_length may be stale), but it does mean that when metadata is up-to-date Impala will no longer try to access beyond the end of files (and so we'll no longer get false positive errors from the filesystem). Additionally, this change revealed a pre-existing problem with files that have multiple row-groups. The first time through InitColumns(), stream_ was set to NULL. But, stream_->filename could potentially be accessed when constructing error statuses for subsequent row-groups. Change-Id: Ia668fa8c261547f85a18a96422846edcea57043e Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5424 Reviewed-by: Daniel Hecht <dhecht@cloudera.com> Tested-by: jenkins	2015-01-08 16:19:35 -08:00
Matthew Jacobs	25428fdb21	Add support for streaming decompression of gzip text Compressed text formats currently require entire compressed files be read into memory to be decompressed in a single call to the decompression codec. This changes the HdfsTextScanner to drive gzip in a streaming mode, i.e. produce partial output as input is consumed. Change-Id: Id5c0805e18cf6b606bcf27a5df4b5f58895809fd Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5233 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: jenkins (cherry picked from commit 05c3cc55e7a601d97adc4eebe03f878c68a33e56) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5385	2014-11-23 01:55:55 -08:00
Victor Bittorf	9939c9d009	Bugfix and tests for CHAR(N) and VARCHAR(N) Fixed a bug when setting the length in reading/write text files for CHAR(N). Also added chars_tiny table for testing CHAR(N) and VARCHAR(N). Change-Id: If5d5db30afa4b00cf03c68c6a845f182970329f4 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4415 Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com> Tested-by: jenkins	2014-09-23 07:30:07 -07:00
Alex Behm	68592a82a3	IMPALA-1021: Fix loading of views with decimal and complex-typed columns. Change-Id: I8b63c31be47dd64f1e13fb29be3105b0f7e245dc Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3820 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-08-17 12:26:54 -07:00
Lenni Kuff	286e312460	[CDH5] Minor code changes for Hive .13 support Changes include: * Fix compile errors due to new column stats API and other stats related fixes. * Temporarily disable JDBC tests due to new serialization format in Hive .13 * Disable view compatibility tests until we can get them to work in Hive .13 * Test fixes due to Hive's type checking for partition column values Change-Id: I05cc6a95976e0e037be79d91bc330a06d2fdc46c	2014-08-11 09:53:02 -07:00
Alex Behm	19bab59854	Create/alter/describe tables with complex types. This patch adds parsing of complex types and tests for using complex types in various exprs and create/alter/describe stmts. Change-Id: Ibc211a560c889f5ccfb616813700b923c89d8245 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3577 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3594	2014-07-23 17:26:14 -07:00

1 2 3

112 Commits