impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 09:02:19 -05:00

Author	SHA1	Message	Date
Dimitris Tsirogiannis	d802f321b2	IMPALA-3724: Support Kudu non-covering range partitions This commit adds support for non-covering range partitions in Kudu tables. The SPLIT ROWS clause is now deprecated and no longer supported. The following new syntax provides more flexibility in creating range partitions and it supports bounded and unbounded ranges as well as single value partitions; multi-column range partitions are supported as well. The new syntax is: DISTRIBUTE BY RANGE (col_list) ( PARTITION lower_1 <[=] VALUES <[=] upper_1, PARTITION lower_2 <[=] VALUES <[=] upper_2, .... PARTITION lower_n <[=] VALUES <[=] upper_n, PARTITION VALUE = val_1, .... PARTITION VALUE = val_n ) Multi-column range partitions are specified as follows: DISTRIBUTE BY RANGE (col1, col2,..., coln) ( PARTITION VALUE = (col1_val, col2_val, ..., coln_val), .... PARTITION VALUE = (col1_val, col2_val, ..., coln_val) ) Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06 Reviewed-on: http://gerrit.cloudera.org:8080/4856 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 22:02:22 +00:00
Matthew Jacobs	32294220c4	IMPALA-4379: Fix and test Kudu table type checking, follow up The first fix for IMPALA-4379 went in before all comments were addressed. First commit: `9b507b6`. This addresses some follow-up comments about how to handle ALTER TABLE setting the storage_handler table property, which doesn't really make sense to ever allow. Change-Id: I93d04a04483af598b392c28874363e3b0202e1f3 Reviewed-on: http://gerrit.cloudera.org:8080/4894 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 06:54:18 +00:00
Alex Behm	c5f49ec9bb	IMPALA-4423: Correct but conservative implementation of Subquery.equals(). The underlying problem was for trivial/constant [NOT] EXISTS subqueries we substituted out Subqueries with bool literals using an ExprSubstitutionMap, but the Subquery.equals() function was not implemented properly, so we ended up matching Subqueries to the wrong entry in the ExprSubstitutionMap. This could ultimately lead to wrong plans and results. Testing: Corrected an existing test and modified an existing test for extra coverage. Change-Id: I5562d98ce36507aa5e253323e184fd42b54f27ed Reviewed-on: http://gerrit.cloudera.org:8080/4923 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-04 00:19:33 +00:00
Laszlo Gaal	9af59bfe2b	IMPALA-4153: Fix count(*) on all blank('') columns - test This change adds test coverage for the fixes committed for IMPALA-2399 in commit `9ed3b685a1`. It uses the table nulltable in the workload functional-query to verify the materialization and counting of NULL and empty-valued columns. The test can be run on any supported storage and compression combination. Change-Id: I23923f95f43d67977ee1520a1fc09ce297548b3f Reviewed-on: http://gerrit.cloudera.org:8080/4755 Tested-by: Internal Jenkins Reviewed-by: Jim Apple <jbapple@cloudera.com>	2016-11-03 23:08:56 +00:00
Alex Behm	1d8cdb02c6	IMPALA-4309: Introduce Expr rewrite phase and supporting classes. Introduces a new phase for rewriting Exprs after analysis and before subquery rewriting. The transformed Exprs replace the original ones in analyzed statements. If Exprs were changed, the whole statement is reset() and re-analyzed, similar to how subqueries are rewritten. If both Exprs and subqueries are rewritten there is only one re-analysis of the changed statement. The following new classes work together to perform transformations: 1. ExprRewriteRule - base class for Expr transformation rules 2. ExprRewriter - drives the transformation of Exprs using a list of ExprRewriteRules Statements that have exprs to be rewritten need to implement a new method rewriteExprs() that accepts an ExprRewriter. As an example, this patch adds a rule for converting BetweenPredicates into their equivalent CompoundPredicates. The BetweenPredicate has been notoriously buggy due to a lack of such a separate rewrite phase and is now cleaned up. Testing: 1. Added a new test for checking that the rewrite framework covers all relevant statements, clauses and can properly handle nested statements and subqueries. 2. Added a new test for ExprRewriteRules and implemented tests for the BetweenPredicate rewrite. 3. There are many existing tests for BetweenPredicates and they all exercise the new rewrite rule/phase. 4. Ran a private core/hdfs run and it passed. Change-Id: I2279dc984bcf7742db4fa3b1aa67283ecbb05e6e Reviewed-on: http://gerrit.cloudera.org:8080/4746 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 20:08:37 +00:00
Alex Behm	6a16e44a2a	Add functional tests for compute stats with mt_dop > 0. Change-Id: Icd4e7e44f9f23f66e59ad1fb298e13da76ad817a Reviewed-on: http://gerrit.cloudera.org:8080/4879 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 11:59:07 +00:00
Alex Behm	795c085fa3	IMPALA-4336: Cast exprs after unnesting union operands. The bug was that we cast the result exprs of operands before unnesting them. If we unnested an operand, casts were missing on those unnested operands' result exprs. The fix is to first unnest operands and then cast result exprs. Also clarifies the use of resultExprs vs. baseTblResultExprs. Change-Id: I5e3ab7349df7d67d0d9c2baf4a56342d3f04e76d Reviewed-on: http://gerrit.cloudera.org:8080/4815 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-11-03 08:59:45 +00:00
Alex Behm	4918b20ac0	IMPALA-4408: Omit null bytes for Kudu scans with no nullable slots. Kudu does not allocate null bytes if all projected columns are non-nullable. Otherwise, Kudu allocates a null bit for all columns, even the non-nullable ones. The bug was that Impala's memory layout did not match the first requirement. Change-Id: I762ad9d5cc4198922ea4b5218c504fde355c49a5 Reviewed-on: http://gerrit.cloudera.org:8080/4892 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-01 01:47:30 +00:00
Matthew Jacobs	9b507b6ed6	IMPALA-4379: Fix and test Kudu table type checking Creating Kudu tables shouldn't allow types not supported by Kudu (e.g. VARCHAR/CHAR, DECIMAL, TIMESTAMP, collection types). The behavior is inconsistent: for some types it throws in the catalog, for VARCHAR/CHAR these become strings. This changes behavior so that all fail during analysis. Analysis tests were added. Similarly, external tables cannot contain Kudu types that Impala doesn't support (e.g. UNIXTIME_MICROS, BINARY). Tests were added to validate this behavior. Note that this required upgrading the python Kudu client. This also fixes a small corner case with ALTER TABLE: ALTER TABLE shouldn't allow Kudu tables to change the storage descriptor tblproperty, otherwise the table metadata gets in an inconsistent state. Tests were added for all of the above. Change-Id: I475273cbbf4110db8d0f78ddf9a56abfc6221e3e Reviewed-on: http://gerrit.cloudera.org:8080/4857 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-10-31 16:03:54 +00:00
Marcel Kornacker	0d857237a8	IMPALA-4314: Standardize on MT-related data structures This removes the data structures that were "superceded" in IMPALA-3903 and changes all control flow to utilize the new data structures. The new data structures are renamed to remove the "Mt" prefix. Change-Id: I465d0e15e2cf17cafe4c747d34c8f595d3645151 Reviewed-on: http://gerrit.cloudera.org:8080/4853 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com>	2016-10-31 16:03:32 +00:00
Tim Armstrong	6587c08f70	IMPALA-4387: validate decimal type in Avro file schema This patch prevents an invalid decimal type in an Avro file schema from crashing Impala. Most invalid Avro schemas are caught by the frontend, but file schemas still need to be validated by the backend. After this patch files with bad schemas are skipped. Testing: This was hit very rarely by the scanner fuzzing. Added a regression test that scans a file with a bad schema. Change-Id: I25a326ee2220bc14d3b5f887dc288b4adf859cfc Reviewed-on: http://gerrit.cloudera.org:8080/4876 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-30 00:12:58 +00:00
Alex Behm	f7d71950e3	IMPALA-4369: Avoid DCHECK in Parquet scanner with MT_DOP > 0. When HdfsParquetScanner::Open() failed we used to hit a DCHECK when trying to access HdfsParquetScanner::batch() which is only valid to call for non-MT scan nodes. Change-Id: Ifbfdde505dbbd2742e7ab79a2415ff317a9bfa2f Reviewed-on: http://gerrit.cloudera.org:8080/4851 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-26 22:21:19 +00:00
Lars Volker	c24e9da914	IMPALA-2521: Add clustered hint to insert statements This change introduces a clustered/noclustered hint for insert statements. Specifying this hint adds an additional sort node to the plan, just before the table sink. This has the effect that data will be clustered by its partition prior to writing partitions, which therefore can be written sequentially. Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14 Reviewed-on: http://gerrit.cloudera.org:8080/4745 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-26 04:56:14 +00:00
Michael Ho	13455b5a24	IMPALA-3884: Support TYPE_TIMESTAMP for HashTableCtx::CodegenAssignNullValue() This change implements support for TYPE_TIMESTAMP for HashTableCtx::CodegenAssignNullValue(). TimestampValue itself is 16 bytes in size. To match RawValue::Write() in the interpreted path, CodegenAssignNullValue() emits code to assign HashUtil::FNV_SEED to both the upper and lower 64-bit of the destination value. This change also fixes the handling of 128-bit Decimal16Value in CodegenAssignNullValue() so the emitted code matches the behavior of the interpreted path. Change-Id: I0211d38cbef46331e0006fa5ed0680e6e0867bc8 Reviewed-on: http://gerrit.cloudera.org:8080/4794 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Michael Ho <kwho@cloudera.com>	2016-10-25 05:52:33 +00:00
Matthew Jacobs	99ed6dc67a	IMPALA-4134,IMPALA-3704: Kudu INSERT improvements 1.) IMPALA-4134: Use Kudu AUTO FLUSH Improves performance of writes to Kudu up to 4.2x in bulk data loading tests (load 200 million rows from lineitem). 2.) IMPALA-3704: Improve errors on PK conflicts The Kudu client reports an error for every PK conflict, and all errors were being returned in the error status. As a result, inserts/updates/deletes could return errors with thousands of errors reported. This changes the error handling to log all reported errors as warnings and return only the first error in the query error status. 3.) Improve the DataSink reporting of the insert stats. The per-partition stats returned by the data sink weren't useful for Kudu sinks. Firstly, the number of appended rows was not being displayed in the profile. Secondly, the 'stats' field isn't populated for Kudu tables and thus was confusing in the profile, so it is no longer printed if it is not set in the thrift struct. Testing: Ran local tests, including new tests to verify the query profile insert stats. Manual cluster testing was conducted of the AUTO FLUSH functionality, and that testing informed the default mutation buffer value of 100MB which was found to provide good results. Change-Id: I5542b9a061b01c543a139e8722560b1365f06595 Reviewed-on: http://gerrit.cloudera.org:8080/4728 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 02:06:10 +00:00
Alex Behm	ff6b450ad3	IMPALA-4285/IMPALA-4286: Fixes for Parquet scanner with MT_DOP > 0. IMPALA-4285: The problem was that there was a reference to HdfsScanner::batch_ hidden inside WriteEmptyTuples(). The batch_ reference is NULL when the scanner is run with MT_DOP > 1. IMPALA-4286: When there are no scan ranges HdfsScanNodeBase::Open() exits early without initializing the reader context. This led to a DCHECK in IoMgr::GetNextRange() called from HdfsScanNodeMt. The fix is to remove that unnecessary short-circuit Open(). I combined these two bugfixes because the new basic test covers both cases. Testing: Added a new test_mt_dop.py test. A private code/hdfs run passed. Change-Id: I79c0f6fd2aeb4bc6fa5f87219a485194fef2db1b Reviewed-on: http://gerrit.cloudera.org:8080/4767 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 10:24:24 +00:00
Michael Ho	51268c053f	IMPALA-4120: Incorrect results with LEAD() analytic function This change fixes a memory management problem with LEAD()/LAG() analytic functions which led to incorrect results. In particular, the update functions specified for these analytic functions only make a shallow copy of StringVal (i.e. copying only the pointer and the length of the string) without copying the string itself. This may lead to problems if the string is created from some UDFs which do local allocations whose buffer may be freed and reused before the result tuple is copied out. This change fixes the problem above by allocating a buffer at the Init() functions of these analytic functions to track the intermediate value. In addition, when the value is copied out in GetValue(), it will be copied into the MemPool belonging to the AnalyticEvalNode and attached to the outgoing row batches. This change also fixes a missing free of local allocations in QueryMaintenance(). Change-Id: I85bb1745232d8dd383a6047c86019c6378ab571f Reviewed-on: http://gerrit.cloudera.org:8080/4740 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 07:39:37 +00:00
Dimitris Tsirogiannis	8a49ceaae5	IMPALA-3739: Enable stress tests on Kudu This commit modifies the stress test framework to run TPC-H and TPC-DS workloads against Kudu. The following changes are included in this commit: 1. Created template files with DDL and DML statements for loading TPC-H and TPC-DS data in Kudu 2. Created a script (load-tpc-kudu.py) to load data in Kudu. The script is invoked by the stress test runner to load test data in an existing Impala/Kudu cluster (both local and CM-managed clusters are supported). 3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL files with TPC-H queries for Kudu were added in a previous patch. 4. Modified the stress test runner to take additional parameters specific to Kudu (e.g. kudu master addr) The stress test runner for Kudu was tested on EC2 clusters for both TPC-H and TPC-DS workloads. Missing functionality: * No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu. * Not all supported TPC-DS queries are included. Currently, only the TPC-DS queries from the testdata/workloads/tpcds/queries directory were modified to run against Kudu. Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Reviewed-on: http://gerrit.cloudera.org:8080/4327 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 11:01:37 +00:00
Dimitris Tsirogiannis	041fa6d946	IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables With this commit we simplify the syntax and handling of CREATE TABLE statements for both managed and external Kudu tables. Syntax example: CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b)) DISTRIBUTE BY HASH (a) INTO 3 BUCKETS, RANGE (b) SPLIT ROWS (('abc', 'def')) STORED AS KUDU Changes: 1) Remove the requirement to specify table properties such as key columns in tblproperties. 2) Read table schema (column definitions, primary keys, and distribution schemes) from Kudu instead of the HMS. 3) For external tables, the Kudu table is now required to exist at the time of creation in Impala. 4) Disallow table properties that could conflict with an existing table. Ex: key_columns cannot be specified. 5) Add KUDU as a file format. 6) Add a startup flag to impalad to specify the default Kudu master addresses. The flag is used as the default value for the table property kudu_master_addresses but it can still be overridden using TBLPROPERTIES. 7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE wasn't implemented for Kudu tables and silently ignored. The Kudu tables wouldn't be removed in Kudu. 8) Remove DDL delegates. There was only one functional delegate (for Kudu) the existence of the other delegate and the use of delegates in general has led to confusion. The Kudu delegate only exists to provide functionality missing from Hive. 9) Add PRIMARY KEY at the column and table level. This syntax is fairly standard. When used at the column level, only one column can be marked as a key. When used at the table level, multiple columns can be used as a key. Only Kudu tables are allowed to use PRIMARY KEY. The old "kudu.key_columns" table property is no longer accepted though it is still used internally. "PRIMARY" is now a keyword. The ident style declaration is used for "KEY" because it is also used for nested map types. 
10) For managed tables, infer a Kudu table name if none was given. The table property "kudu.table_name" is optional for managed tables and is required for external tables. If for a managed table a Kudu table name is not provided, a table name will be generated based on the HMS database and table name. 11) Use Kudu master as the source of truth for table metadata instead of HMS when a table is loaded or refreshed. Table/column metadata are cached in the catalog and are stored in HMS in order to be able to use table and column statistics. Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Reviewed-on: http://gerrit.cloudera.org:8080/4414 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 10:52:25 +00:00
Yuanhao Luo	f8d48b8582	IMPALA-4325: StmtRewrite lost parentheses of CompoundPredicate StmtRewrite lost the parentheses of a CompoundPredicate in pushNegationToOperands(), which led to an incorrect toSql() result. Although this issue does not lead to incorrect query results, it confuses users about the logical operator precedence of the predicates shown in the EXPLAIN statement. Change-Id: I79bfc67605206e0e026293bf7032a88227a95623 Reviewed-on: http://gerrit.cloudera.org:8080/4753 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 07:45:53 +00:00
Taras Bobrovytsky	bf1d9677fc	IMPALA-4155: Update default partition when table is altered If the table format is changed by the Alter Table statement, the default partition in partitioned tables used to not get updated. This caused a problem because Insert picks up the file format for new partitions from the default partition. This patch fixes the problem by calling addDefaultPartition(). Also removed "drop table if not exists" in tests in alter-table.test because we already have the unique_database fixture. Change-Id: I59bf21caa5c5e7867d07d87cda0c0a5b4b994859 Reviewed-on: http://gerrit.cloudera.org:8080/4750 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-20 23:47:52 +00:00
Michael Ho	b15d992abe	IMPALA-4080, IMPALA-3638: Introduce ExecNode::Codegen() This patch is mostly mechanical move of codegen related logic from each exec node's Prepare() to its Codegen() function. After this change, code generation will no longer happen in Prepare(). Instead, it will happen after Prepare() completes in PlanFragmentExecutor. This is an intermediate step towards the final goal of sharing compiled code among fragment instances in multi-threading. As part of the clean up, this change also removes the logic for lazy codegen object creation. In other words, if codegen is enabled, the codegen object will always be created. This simplifies some of the logic in ScalarFnCall::Prepare() and various Codegen() functions by reducing error checking needed. This change also removes the logic added for tackling IMPALA-1755 as it's not needed anymore after the clean up. The clean up also rectifies a not so well documented situation. Previously, even if a user explicitly sets DISABLE_CODEGEN to true, we may still codegen a UDF if it was written in LLVM IR or if it has more than 8 arguments. This patch enforces the query option by failing the query in both cases. To run the query, the user must enable codegen. This change also extends the number of arguments supported in the interpretation path of ScalarFn to 20. Change-Id: I207566bc9f4c6a159271ecdbc4bbdba3d78c6651 Reviewed-on: http://gerrit.cloudera.org:8080/4651 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-10-19 08:18:37 +00:00
Alex Behm	0480253566	IMPALA-4270: Gracefully fail unsupported queries with mt_dop > 0. MT_DOP > 0 is only supported for plans without distributed joins or table sinks. Adds validation to fail unsupported queries gracefully in planning. For scans in queries that are executable with MT_DOP > 0 we either use the optimized MT scan node BE implementation (only Parquet), or we use the conventional scan node with num_scanner_threads=1. TODO: Still need to add end-to-end tests. Change-Id: I91a60ea7b6e3ae4ee44be856615ddd3cd0af476d Reviewed-on: http://gerrit.cloudera.org:8080/4677 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-17 09:22:57 +00:00
Henry Robinson	9f61397fc4	IMPALA-2905: Handle coordinator fragment lifecycle like all others The plan-root fragment instance that runs on the coordinator should be handled like all others: started via RPC and run asynchronously. Without this, the fragment requires special-case code throughout the coordinator, and does not show up in system metrics etc. This patch adds a new sink type, PlanRootSink, to the root fragment instance so that the coordinator can pull row batches that are pushed by the root instance. The coordinator signals completion to the fragment instance via closing the consumer side of the sink, whereupon the instance is free to complete. Since the root instance now runs asynchronously with respect to the coordinator, we add several coordination methods to allow the coordinator to wait for a point in the instance's execution to be hit - e.g. to wait until the instance has been opened. Done in this patch: * Add PlanRootSink * Add coordination to PFE to allow coordinator to observe lifecycle * Make FragmentMgr a singleton * Removed dead code from Coordinator::Wait() and elsewhere. * Moved result output exprs out of QES and into PlanRootSink. * Remove special-case limit-based teardown of coordinator fragment, and supporting functions in PlanFragmentExecutor. * Simplified lifecycle of PlanFragmentExecutor by separating Open() into Open() and Exec(), the latter of which drives the sink by reading rows from the plan tree. * Add child profile to PlanFragmentExecutor to measure time spent in each lifecycle phase. * Removed dependency between InitExecProfiles() and starting root fragment. * Removed mostly dead-code handling of LIMIT 0 queries. * Ensured that SET returns a result set in all cases. * Fix test_get_log() HS2 test. Errors are only guaranteed to be visible after fetch calls return EOS, but test was assuming this would happen after first fetch. 
Change-Id: Ibb0064ec2f085fa3a5598ea80894fb489a01e4df Reviewed-on: http://gerrit.cloudera.org:8080/4402 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-10-16 15:55:29 +00:00
Lars Volker	1a5c43ef5e	IMPALA-3644 Make predicate order deterministic This adds a tie-break to make sure that we sort predicates in a deterministic order on Java 7 and 8. This was suggested by Alex in IMPALA-3644. There are still three broken tests when run in Java 8, but it seems best to address them in a subsequent change. Change-Id: Id11010bfeaff368869e6d430eeb4773ddf41faff Reviewed-on: http://gerrit.cloudera.org:8080/4671 Reviewed-by: Jim Apple <jbapple@cloudera.com> Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 22:04:30 +00:00
Alex Behm	2a04b0e21a	IMPALA-3943: Address post-merge comments. Adds code comments and issues a warning for Parquet files with num_rows=0 but at least one non-empty row group. Change-Id: I72ccf00191afddb8583ac961f1eaf11e5eb28791 Reviewed-on: http://gerrit.cloudera.org:8080/4696 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 05:41:22 +00:00
Lars Volker	ef4c9958d0	IMPALA-4047: Remove occurrences of 'CDH'/'cdh' from repo This change removes some of the occurrences of the strings 'CDH'/'cdh' from the Impala repository. References to Cloudera-internal Jiras have been replaced with upstream Jira issues on issues.cloudera.org. For several categories of occurrences (e.g. pom.xml files, DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to remove the occurrences left after this change. Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8 Reviewed-on: http://gerrit.cloudera.org:8080/4187 Reviewed-by: Jim Apple <jbapple@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-13 00:40:41 +00:00
Alex Behm	0449b5beab	IMPALA-3943: Do not throw scan errors for empty Parquet files. For Parquet files with no row groups but with num_rows=0 in the file footer the Parquet scanner returns an error indicating that the file is invalid. This behavior is a regression from previous Impala versions which used to accept such files. This patch restores the previous behavior and adds tests. Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e Reviewed-on: http://gerrit.cloudera.org:8080/4693 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-12 09:22:57 +00:00
Taras Bobrovytsky	acb25a6d16	IMPALA-4076: Fix runtime filter sort compare method Fixed 2 issues: - The getSelectivity() method sometimes returned NaN double values which could not be sorted properly. - The compare method for sorting runtime filters was switched to use the builtin Double comparison method. Change-Id: Iad433f2ece423ea29e79e81b68fa53cb0af18378 Reviewed-on: http://gerrit.cloudera.org:8080/4652 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-07 05:59:50 +00:00
Alex Behm	d9dc9090ea	IMPALA-4237: Fix materialization of 4 byte decimals in data source scan node. There was a missing break in a switch statement leading to bad fallthrough. An existing test already expected incorrect results. The bug is covered by expecting correct results. Change-Id: I5340c2eda813afc032ba72203bd59eb3f2c4f482 Reviewed-on: http://gerrit.cloudera.org:8080/4585 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-07 03:36:43 +00:00
Matthew Jacobs	2b5d1344c9	IMPALA-4213: Fix Kudu predicates that need constant folding Folding const exprs where there were implicit casts on the slot resulted in the predicate not being pushed to Kudu. Change-Id: I3bab22d90ee00a054c847de6c734b4f24a3f5a85 Reviewed-on: http://gerrit.cloudera.org:8080/4613 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-06 04:06:38 +00:00
Yonghyun Hwang	112ff68edd	IMPALA-4042: Preserve root types when substituting grouping exprs In case of count(distinct), FunctionCallExpr.analyze() changes type for "NULL" into "BOOLEAN" to make sure that BE doesn't see any "NULL_TYPE" exprs. In the meantime, Expr substitution, happening in Expr.substituteImpl() reverts this change back to original type, "NULL_TYPE". This causes an issue when AggregateInfo.checkConsistency() performs precondition check where slot types from AggregateInfo.outputTupleDesc_ should be matched with the types from AggregateInfo.groupingExpr_. The slot type shows "BOOLEAN" while type from groupingExpr_ is "NULL_TYPE", which makes the precondition fail and throws an exception. To resolve the issue, preserveRootType is set to true when Expr.substituteList() gets called in AggregateInfo.substitute() Change-Id: Icf3b4511234e473e5b9548fbf3e97f333c9980f1 (cherry picked from commit b17785b4890bedd1c825140ce3c48cd7d9734295) Reviewed-on: http://gerrit.cloudera.org:8080/4600 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-05 03:04:17 +00:00
Bharath Vissapragada	64c394827a	IMPALA-4196: Cross compile bit-byte-functions Change-Id: I5a1291bfd202b500405a884e4a62f0ca2447244a Reviewed-on: http://gerrit.cloudera.org:8080/4557 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-10-01 01:42:21 +00:00
Michael Ho	2a31fbdbfa	IMPALA-4180: Synchronize accesses to RuntimeState::reader_contexts_ HdfsScanNodeBase::Close() may add its outstanding DiskIO context to RuntimeState::reader_contexts_ to be unregistered later when the fragment is closed. In a plan fragment with multiple HDFS scan nodes, it's possible for HdfsScanNodeBase::Close() to be called concurrently. To allow safe concurrent accesses, this change adds a SpinLock to synchronize accesses to 'reader_contexts_' in RuntimeState. Change-Id: I911fda526a99514b12f88a3e9fb5952ea4fe1973 Reviewed-on: http://gerrit.cloudera.org:8080/4558 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-09-30 01:21:05 +00:00
Thomas Tauber-Marshall	b2c2fe7813	IMPALA-3786: Replace "cloudera" with "apache" (part 2) As part of the ASF transition, we need to replace references to Cloudera in Impala with references to Apache. This primarily means changing Java package names from com.cloudera.impala.* to org.apache.impala.* A prior patch renamed all the files as necessary, and this patch performs the actual code changes. Most of the changes in this patch were generated with some commands of the form: find . | grep "\.java\|\.py\|\.h\|\.cc" | xargs sed -i s/'com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g along with some manual fixes. After this patch, the remaining references to Cloudera in the repo mostly fall into the categories: - External components that have cloudera in their own package names, eg. com.cloudera.kudu/llama - URLs, eg. https://repository.cloudera.com/ Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2 Reviewed-on: http://gerrit.cloudera.org:8080/3937 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-09-29 21:14:13 +00:00
Alex Behm	a5e84ac014	IMPALA-4206: Add column lineage regression test. The underlying issue was already fixed in IMPALA-3940. This patch adds a new regression test to cover the IMPALA-4206. Change-Id: I5b164000c7b0ce7e2f296d168d75a6860f5963d8 Reviewed-on: http://gerrit.cloudera.org:8080/4556 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-29 07:45:19 +00:00
Alex Behm	3aa4351625	IMPALA-4170: Fix identifier quoting in COMPUTE INCREMENTAL STATS. The SQL statements generated from COMPUTE INCREMENTAL STATS did not properly quote identifiers when incrementally updating the stats for newly added partitions. Our existing tests did not catch this case because the code paths for doing the initial stats computation and the incremental stats computation are different, in particular, the code for generating the SQL statements. Change-Id: I63adcc45dc964ce769107bf4139fc4566937bb96 Reviewed-on: http://gerrit.cloudera.org:8080/4479 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-09-21 01:24:53 +00:00
Matthew Jacobs	c7fa03286b	IMPALA-3718: Support subset of functional-query for Kudu Adds initial support for the functional-query test workload for Kudu tables. There are a few issues that make loading the functional schema difficult on Kudu: 1) Kudu tables must have one or more columns that together constitute a unique primary key. a) Primary key columns must currently be the first columns in the table definition (KUDU-1271). b) Primary key columns cannot be nullable (KUDU-1570). 2) Kudu tables must be specified with distribution parameters. (1) limits the tables that can be loaded without ugly workarounds. This patch only includes important tables that are used for relevant tests, most notably the alltypes* family. In particular, alltypesagg is important but it does not have a set of columns that are non-nullable and form a unique primary key. As a result, that table is created in Kudu with a different name and an additional BIGINT column for a PK that is a unique index and is generated at data loading time using the ROW_NUMBER analytic function. A view is then wrapped around the underlying table that matches the alltypesagg schema exactly. When KUDU-1570 is resolved, this can be simplified. (2) requires some additional considerations and custom syntax. As a result, the DDL to create the tables is explicitly specified in CREATE_KUDU sections in the functional_schema_constraints.csv, and an additional DEPENDENT_LOAD_KUDU section was added to specify custom data loading DML that differs from the existing DEPENDENT_LOAD. TODO: IMPALA-4005: generate_schema_statements.py needs refactoring Tests that are not relevant or not yet supported have been marked with xfail and a skip where appropriate. TODO: Support remaining functional tables/tests when possible. Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc Reviewed-on: http://gerrit.cloudera.org:8080/4175 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-09-14 22:11:04 +00:00
Alex Behm	d379875806	IMPALA-3491: Use unique db in test_scanners.py and test_aggregation.py Testing: Ran the tests locally in a loop on exhaustive. Did a private debug/exhaustive run. Change-Id: Ided0848c138bdc1d43694a12222010c48e23ee1c Reviewed-on: http://gerrit.cloudera.org:8080/4339 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-13 21:57:36 +00:00
Alex Behm	8c37bf3543	IMPALA-3491: Use unique database fixture in test_partitioning.py Testing: Ran the test locally in a loop on exhaustive. Did a private debug/exhaustive/hdfs test run. Change-Id: Ib1b33d9977a98894288662a711805e9a54329ec8 Reviewed-on: http://gerrit.cloudera.org:8080/4316 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-08 04:31:27 +00:00
Alex Behm	f0ffbca2c3	IMPALA-3491: Use unique database fixture in test_insert_parquet.py Testing: Ran the test locally in a loop. Did a private debug/core/hdfs build. Change-Id: I790b2ed5236640c7263826d1d2a74b64d43ac6f7 Reviewed-on: http://gerrit.cloudera.org:8080/4317 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-08 03:25:29 +00:00
Matthew Jacobs	157c80056c	IMPALA-3481: Use Kudu ScanToken API for scan ranges Switches the planner and KuduScanNode to use Kudu's new ScanToken API instead of explicitly constructing scan ranges for all tablets of a table, regardless of whether they were needed. The ScanToken API allows Impala to specify the projected columns and predicates during planning, and Kudu returns a set of 'scan tokens' that represent a scanner for each tablet that needs to be scanned. The scan tokens can be serialized and distributed to the scan nodes, which can then deserialize them into Kudu scanner objects. Upon deserialization, the scan token has all scan parameters already, including the 'pushed down' predicates. Impala no longer needs to send the Kudu predicates to the BE and convert them at the scan node. This change also fixes: 1) IMPALA-4016: Avoid materializing slots only referenced by Kudu conjuncts 2) IMPALA-3874: Predicates are not always pushed to Kudu TODO: Consider additional planning improvements. Testing: Updated the existing tests, verified everything works as expected. Some BE tests no longer make sense and they were removed. TODO: When KUDU-1065 is resolved, add tests that demonstrate pruning. Change-Id: I160e5849d372755748ff5ba3c90a4651c804b220 Reviewed-on: http://gerrit.cloudera.org:8080/4120 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-09-08 01:50:51 +00:00
Alex Behm	c8f3d40efc	IMPALA-3491: Use unique database fixture in test_nested_types.py Testing: Ran the tests locally in a loop. Did a core/debug/hdfs private build. Change-Id: I0c56df0c6a5f771222dedb69353f8bebe01d5a90 Reviewed-on: http://gerrit.cloudera.org:8080/4302 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-03 00:39:07 +00:00
Yuanhao Luo	052d3cc8dd	IMPALA-4056: Fix toSql() of DistributeParam This commit fixes two issues in toSql() of DistributeParam: 1. string literals were not quoted 2. range partition split rows were not printed. Besides, this commit fixes a small issue in run-hive-server.sh Change-Id: I984a63a24f02670347b0e1efceb864d265d1f931 Reviewed-on: http://gerrit.cloudera.org:8080/4195 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-02 20:11:27 +00:00
Alex Behm	ab9e54bc42	IMPALA-3491: Use unique database fixture in test_ddl.py. Adds new parametrization to the unique database fixture: - num_dbs: allows creating multiple unique databases at once; the 2nd, 3rd, etc. database name is generated by appending "2", "3", etc., to the first database name - sync_ddl: allows creating the database(s) with sync_ddl which is needed by most tests in test_ddl.py Testing: I ran debug/core and debug/exhaustive on HDFS and core/debug on S3. Also ran the test locally in a loop on exhaustive. Change-Id: Idf667dd5e960768879c019e2037cf48ad4e4241b Reviewed-on: http://gerrit.cloudera.org:8080/4155 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-09-02 02:47:02 +00:00
Alex Behm	16f1c8d8de	IMPALA-4054: Remove serial test workarounds for IMPALA-2479. The underlying issue IMPALA-2479 has been fixed, so it should be safe to execute these tests in parallel again: - test_runtime_filters.py (all tests) - test_scanners.py::TestParquet::test_multiple_blocks - test_scanners.py::testParquet::test_multiple_blocks_one_row_group Testing: Ran the tests locally in a loop. Did a private core/hdfs run. Change-Id: I8f046e67eb1de1c6ff87980f906870ec9816f551 Reviewed-on: http://gerrit.cloudera.org:8080/4291 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-09-02 02:19:52 +00:00
Jim Apple	20ef3b016e	IMPALA-4058: ByteSwap256 assumed memory was 16-byte aligned. This changes the code to use the lddqu and movdqu instructions (via Intel intrinsics) to allow unaligned memory access. Change-Id: I39b2b47bb717d5ac9727512a24fcf8a8a6a8dcc6 Reviewed-on: http://gerrit.cloudera.org:8080/4205 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-09-02 01:47:08 +00:00
Henry Robinson	24869d40fd	IMPALA-3610: Account for memory used by filters in the coordinator Before this patch, Impala would not account for the memory used to aggregate runtime filters together in the coordinator. Impala's memory could therefore be silently overcommitted. This patch accounts for aggregated filter memory in a new filter memtracker that is attached to the coordinator's query_mem_tracker(). If the query memory limit is exceeded when a filter update arrives, that update is discarded. If the filter is from a partitioned join, the entire filter can therefore be discarded immediately (to alleviate memory pressure) and a dummy 'always true' filter is sent to backends to unblock them. If the filter is from a broadcast join, no aggregation is done, so there is no tracking. The Thrift input and output filter data structures are not tracked (as we generally don't track RPC objects, but plan to in the future). The filter payload is moved from the input request structure to the output broadcast structure without copying. Memory that is added to a memtracker must always be released. To do this, we need to signal to the coordinator that it is finished, and that there is no point trying to process any future updates that might arrive concurrently. This patch adds Coordinator::Done() which is called from QueryExecState::Done(), and which releases memory from all in-process runtime filters. Finally, this patch increases the upper limit for runtime filters to 512MB. This allows testing on very large datasets. The default maximum is still 16MB, per RUNTIME_FILTER_MAX_SIZE. Testing: Added a new test that triggers the OOM condition on the coordinator. All existing runtime filter tests pass. Change-Id: I3c52c8a1c2e79ef370c77bf264885fc859678d1b Reviewed-on: http://gerrit.cloudera.org:8080/4066 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-09-01 02:35:41 +00:00
Tim Armstrong	1350c34763	IMPALA-4049: fix empty batch handling NLJ build side Memory from the build side of a nested loop join is referenced by its output batches, so accumulated build-side memory resources must be transferred to the caller. Special-cased handling of empty batches did not transfer the memory. The fix is to accumulate empty batches and transfer their resources in the same way as non-empty batches. The iterator required changes to handle empty batches in the list. Testing: Added a unit test that exercises the bug in RowBatchList. Added a query test that causes a crash in the ASAN build and incorrect results in the debug build. Change-Id: I3cb19e536b87bbb4d4ae82d1636ba1463a422789 Reviewed-on: http://gerrit.cloudera.org:8080/4182 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-31 21:20:29 +00:00
Alex Behm	5adedc6a1a	IMPALA-3930,IMPALA-2570: Fix shuffle insert hint with constant partition exprs. Fixes inserts into partitioned tables that have a shuffle hint and only constant partition exprs. The rows to be inserted are merged at the coordinator where the table sink is executed. There is no need to hash exchange rows. Now accepts insert hints when inserting into unpartitioned tables. The shuffle hint leads to a plan where all rows are merged at the coordinator where the table sink is executed. Change-Id: I1084d49c95b7d867eeac3297fd2016daff0ab687 Reviewed-on: http://gerrit.cloudera.org:8080/4162 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-08-31 09:59:00 +00:00

1109 Commits