impala

mirror of https://github.com/apache/impala.git synced 2025-12-23 21:08:39 -05:00

Author	SHA1	Message	Date
Matthew Jacobs	99ed6dc67a	IMPALA-4134,IMPALA-3704: Kudu INSERT improvements 1.) IMPALA-4134: Use Kudu AUTO FLUSH Improves performance of writes to Kudu up to 4.2x in bulk data loading tests (load 200 million rows from lineitem). 2.) IMPALA-3704: Improve errors on PK conflicts The Kudu client reports an error for every PK conflict, and all errors were being returned in the error status. As a result, inserts/updates/deletes could return errors with thousands of errors reported. This changes the error handling to log all reported errors as warnings and return only the first error in the query error status. 3.) Improve the DataSink reporting of the insert stats. The per-partition stats returned by the data sink weren't useful for Kudu sinks. Firstly, the number of appended rows was not being displayed in the profile. Secondly, the 'stats' field isn't populated for Kudu tables and thus was confusing in the profile, so it is no longer printed if it is not set in the thrift struct. Testing: Ran local tests, including new tests to verify the query profile insert stats. Manual cluster testing was conducted of the AUTO FLUSH functionality, and that testing informed the default mutation buffer value of 100MB which was found to provide good results. Change-Id: I5542b9a061b01c543a139e8722560b1365f06595 Reviewed-on: http://gerrit.cloudera.org:8080/4728 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 02:06:10 +00:00
Jim Apple	0eaff805e2	Add distcc infrastructure. This has been working for several months, and it was written mainly by Casey Ching while he was at Cloudera working on Impala. Change-Id: Ia4bc78ad46dda13e4533183195af632f46377cae Reviewed-on: http://gerrit.cloudera.org:8080/4820 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 01:15:50 +00:00
Henry Robinson	e0a3272129	Minor compute stats script fixes * Change run-step to output full log path * Change text to say "Computing table stats" rather than "Computing HBase stats" when running compute-table-stats.sh Change-Id: I326f4c370fda8d5e388af8e2395623185c06bc07 Reviewed-on: http://gerrit.cloudera.org:8080/4825 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-25 00:13:54 +00:00
Jim Apple	61fcb48974	IMPALA-4300: Speed up BloomFilter::Or with SIMD The previous code was not written in a way that GCC could auto-vectorize it. Manually vectorizing speeds up BloomFilter::Or by up to 184x. Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Reviewed-on: http://gerrit.cloudera.org:8080/4813 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-10-24 18:07:25 +00:00
Jim Apple	0fbb5b7e71	Remove unused Bitmap code. These methods and code paths have been made obsolete by the switch to Bloom filters. Change-Id: I95fcaaa40243999800c2ec2ead5b3479d66a63e7 Reviewed-on: http://gerrit.cloudera.org:8080/4801 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-10-24 17:53:33 +00:00
Alex Behm	ff6b450ad3	IMPALA-4285/IMPALA-4286: Fixes for Parquet scanner with MT_DOP > 0. IMPALA-4285: The problem was that there was a reference to HdfsScanner::batch_ hidden inside WriteEmptyTuples(). The batch_ reference is NULL when the scanner is run with MT_DOP > 1. IMPALA-4286: When there are no scan ranges HdfsScanNodeBase::Open() exits early without initializing the reader context. This led to a DCHECK in IoMgr::GetNextRange() called from HdfsScanNodeMt. The fix is to remove that unnecessary short-circuit Open(). I combined these two bugfixes because the new basic test covers both cases. Testing: Added a new test_mt_dop.py test. A private code/hdfs run passed. Change-Id: I79c0f6fd2aeb4bc6fa5f87219a485194fef2db1b Reviewed-on: http://gerrit.cloudera.org:8080/4767 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 10:24:24 +00:00
Michael Ho	51268c053f	IMPALA-4120: Incorrect results with LEAD() analytic function This change fixes a memory management problem with LEAD()/LAG() analytic functions which led to incorrect results. In particular, the update functions specified for these analytic functions only make a shallow copy of StringVal (i.e. copying only the pointer and the length of the string) without copying the string itself. This may lead to a problem if the string is created from some UDFs which do local allocations whose buffer may be freed and reused before the result tuple is copied out. This change fixes the problem above by allocating a buffer at the Init() functions of these analytic functions to track the intermediate value. In addition, when the value is copied out in GetValue(), it will be copied into the MemPool belonging to the AnalyticEvalNode and attached to the outgoing row batches. This change also fixes a missing free of local allocations in QueryMaintenance(). Change-Id: I85bb1745232d8dd383a6047c86019c6378ab571f Reviewed-on: http://gerrit.cloudera.org:8080/4740 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 07:39:37 +00:00
Henry Robinson	48085274fa	IMPALA-4335: Don't send 0-row batches to clients This patch restores some behaviour from pre-IMPALA-2905 where we would not send 0-row batches to the client. Although 0-row batches are legal, they're not very useful for clients to receive (and clients may not correctly process them). No query was found which reliably produced 0-row batches, so no test is added. Change-Id: I7d339c1f9a55d9d75fb0e97d16b3176cc34f2171 Reviewed-on: http://gerrit.cloudera.org:8080/4787 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 04:49:39 +00:00
Jim Apple	e39f1676e1	IMPALA-4295: XFAIL wildcard SSL test commit `9f61397fc4` exposed a bug (one that was latent before the commit). I am XFAILing this now just to green the build; IMPALA-4295 can be resolved when this issue is fixed and not just XFAILed. Change-Id: Ie809c6c6c967447d527927ebbc6b110095e7320a Reviewed-on: http://gerrit.cloudera.org:8080/4784 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 02:51:25 +00:00
Tim Armstrong	d1d88aaccd	IMPALA-4241: remove spurious child queries event "IMPALA-4037,IMPALA-4038: fix locking during query cancellation" accidentally added the "Child queries finished" event unconditionally. We should only do this if there are actually child queries. Change-Id: I3881d032622750444d750f161ad6843bdbd16c30 Reviewed-on: http://gerrit.cloudera.org:8080/4768 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-22 01:27:39 +00:00
Matthew Jacobs	8d7b01faea	IMPALA-3718: Add test_cancellation tests for Kudu Additional functional tests for Kudu. Change-Id: Icf3d3853e7075991f6d12f125407ebdbe6a287e2 Reviewed-on: http://gerrit.cloudera.org:8080/4700 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 23:32:58 +00:00
Dimitris Tsirogiannis	8a49ceaae5	IMPALA-3739: Enable stress tests on Kudu This commit modifies the stress test framework to run TPC-H and TPC-DS workloads against Kudu. The following changes are included in this commit: 1. Created template files with DDL and DML statements for loading TPC-H and TPC-DS data in Kudu 2. Created a script (load-tpc-kudu.py) to load data in Kudu. The script is invoked by the stress test runner to load test data in an existing Impala/Kudu cluster (both local and CM-managed clusters are supported). 3. Created SQL files with TPC-DS queries to be executed in Kudu. SQL files with TPC-H queries for Kudu were added in a previous patch. 4. Modified the stress test runner to take additional parameters specific to Kudu (e.g. kudu master addr) The stress test runner for Kudu was tested on EC2 clusters for both TPC-H and TPC-DS workloads. Missing functionality: * No CRUD operations in the existing TPC-H/TPC-DS workloads for Kudu. * Not all supported TPC-DS queries are included. Currently, only the TPC-DS queries from the testdata/workloads/tpcds/queries directory were modified to run against Kudu. Change-Id: I3c9fc3dae24b761f031ee8e014bd611a49029d34 Reviewed-on: http://gerrit.cloudera.org:8080/4327 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 11:01:37 +00:00
Dimitris Tsirogiannis	041fa6d946	IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables With this commit we simplify the syntax and handling of CREATE TABLE statements for both managed and external Kudu tables. Syntax example: CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b)) DISTRIBUTE BY HASH (a) INTO 3 BUCKETS, RANGE (b) SPLIT ROWS (('abc', 'def')) STORED AS KUDU Changes: 1) Remove the requirement to specify table properties such as key columns in tblproperties. 2) Read table schema (column definitions, primary keys, and distribution schemes) from Kudu instead of the HMS. 3) For external tables, the Kudu table is now required to exist at the time of creation in Impala. 4) Disallow table properties that could conflict with an existing table. Ex: key_columns cannot be specified. 5) Add KUDU as a file format. 6) Add a startup flag to impalad to specify the default Kudu master addresses. The flag is used as the default value for the table property kudu_master_addresses but it can still be overridden using TBLPROPERTIES. 7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE wasn't implemented for Kudu tables and silently ignored. The Kudu tables wouldn't be removed in Kudu. 8) Remove DDL delegates. There was only one functional delegate (for Kudu); the existence of the other delegate and the use of delegates in general has led to confusion. The Kudu delegate only exists to provide functionality missing from Hive. 9) Add PRIMARY KEY at the column and table level. This syntax is fairly standard. When used at the column level, only one column can be marked as a key. When used at the table level, multiple columns can be used as a key. Only Kudu tables are allowed to use PRIMARY KEY. The old "kudu.key_columns" table property is no longer accepted though it is still used internally. "PRIMARY" is now a keyword. The ident style declaration is used for "KEY" because it is also used for nested map types.
10) For managed tables, infer a Kudu table name if none was given. The table property "kudu.table_name" is optional for managed tables and is required for external tables. If for a managed table a Kudu table name is not provided, a table name will be generated based on the HMS database and table name. 11) Use Kudu master as the source of truth for table metadata instead of HMS when a table is loaded or refreshed. Table/column metadata are cached in the catalog and are stored in HMS in order to be able to use table and column statistics. Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1 Reviewed-on: http://gerrit.cloudera.org:8080/4414 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 10:52:25 +00:00
Yuanhao Luo	f8d48b8582	IMPALA-4325: StmtRewrite lost parentheses of CompoundPredicate StmtRewrite lost the parentheses of a CompoundPredicate in pushNegationToOperands(), which led to an incorrect toSql() result. Although this issue does not lead to incorrect query results, it confuses users about the logical operator precedence of the predicates shown in the EXPLAIN statement. Change-Id: I79bfc67605206e0e026293bf7032a88227a95623 Reviewed-on: http://gerrit.cloudera.org:8080/4753 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 07:45:53 +00:00
Tim Armstrong	9ef9512e5b	IMPALA-4277: remove references for unsupported s3/s3n connectors We only support s3a://. Support will be removed for s3:// in Hadoop 3.0 by HADOOP-12709 Change-Id: Ibfadd2bc91c7dbcb6f2bc962c404caea30f9b776 Reviewed-on: http://gerrit.cloudera.org:8080/4748 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Tim Armstrong <tarmstrong@cloudera.com> (cherry picked from commit `3cb3f34d6d`) Reviewed-on: http://gerrit.cloudera.org:8080/4778 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 05:45:18 +00:00
Lars Volker	2fa1633e40	IMPALA-4329: Prevent crash in scheduler when no backends are registered The scheduler crashed with a segmentation fault when there were no backends registered: After not being able to find a local backend (none are configured at all) in ComputeScanRangeAssignment(), the previous code would eventually try to return the top of assignment_ctx.assignment_heap in SelectRemoteBackendHost(), but that heap would be empty. Subsequently, when using the IP address of that heap node, a segmentation fault would occur. This change adds a check and aborts scheduling with an error. It also contains a test. Change-Id: I6d93158f34841ea66dc3682290266262c87ea7ff Reviewed-on: http://gerrit.cloudera.org:8080/4776 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-10-21 03:16:30 +00:00
Taras Bobrovytsky	bf1d9677fc	IMPALA-4155: Update default partition when table is altered If the table format is changed by the Alter Table statement, the default partition in partitioned tables used to not get updated. This caused a problem because Insert picks up the file format for new partitions from the default partition. This patch fixes the problem by calling addDefaultPartition(). Also removed "drop table if not exists" in tests in alter-table.test because we already have the unique_database fixture. Change-Id: I59bf21caa5c5e7867d07d87cda0c0a5b4b994859 Reviewed-on: http://gerrit.cloudera.org:8080/4750 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-20 23:47:52 +00:00
Alex Behm	c6fc899134	IMPALA-4301: Fix IGNORE NULLS with subquery rewriting. AnalyticExpr.analyze() replaces the original FIRST/LAST_VALUE function with a FIRST/LAST_VALUE_IGNORE_NULLS function if the IGNORE NULLS clause is specified. The bug was that several places in AnalyticExpr.analyze() assumed and asserted that only the original FIRST/LAST_VALUE function could be encountered during analysis. However, with subquery rewriting the IGNORE NULLS version of the function may also be seen because the whole statement is re-analyzed after rewriting. The fix is to unset the IGNORE NULLS flag of the function params after changing the analytic function name. Change-Id: I708de7925fe6aeef582fd7510da93d24c71229d9 Reviewed-on: http://gerrit.cloudera.org:8080/4732 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-20 09:14:19 +00:00
Michael Ho	502220c69d	IMPALA-4269: Codegen merging exchange node This change enables codegen for the tuple row comparator used in merging-exchange node. With this change, merging-exchange operator improves by 40% and 50% respectively for primitive_orderby_bigint and primitive_orderby_all on TPCH-300, speeding up the query by 6% and 11% respectively. Change-Id: I944b8d52ea63ede58e4dc6fbe6e6953756394d41 Reviewed-on: http://gerrit.cloudera.org:8080/4759 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-20 04:42:09 +00:00
Tim Armstrong	f1f54fe65d	IMPALA-3420: use gold by default Also pass the flag that enables ld.gold directly to the compiler. This is understood by both gcc and clang (if prefixed with -Wl, clang just forwards the flag to ld, where it is ignored). Testing: Did ASAN and debug private builds to validate it works. Tested shared library, release, ninja and distcc builds locally as part of my normal workflow. Change-Id: Ib05c944ced9cdfe54941f4b690574e45a25110a2 Reviewed-on: http://gerrit.cloudera.org:8080/4751 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-10-20 03:09:59 +00:00
Jim Apple	5a2c50a163	IMPALA-4230: ASF policy issues from 2.7.0 rc3. In our IPMC vote to release 2.7.0 rc3, Justin Mclean pointed out a number of issues of compliance with ASF policy. He asked: 1. "Please place build instruction and supported platforms in the README. The wiki may change over time and that may make it difficult to build older versions." 2. Remove binary file llvm-ir/test-loop.bc 3. Add be/src/gutil/valgrind.h, shell/ext-py/sqlparse-0.1.14/sqlparse/pipeline.py and cmake_modules/FindJNI.cmake, normalize.css (embedded in bootstrap.css) to LICENSE.txt 4. Fix be/src/thirdparty/squeasel/squeasel* in LICENSE.txt 5. Remove outdated copyright lines from HBase (see https://issues.apache.org/jira/browse/HBASE-3870) 6. Remove duplicate jquery notice from LICENSE.txt Change-Id: I30ff77d7ac28ce67511c200764fba19ae69922e0 Reviewed-on: http://gerrit.cloudera.org:8080/4582 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-10-19 23:59:02 +00:00
aphadke	ef5d0c36aa	IMPALA-3920: TotalStorageWaitTime counter not populated for fragments with Kudu scan node Currently we do not start the TotalStorageWaitTime timer in the kudu-scanner. This patch replaces the kudu_read_timer with the TotalStorageWaitTime which measures the intended time. Change-Id: If0c793930799fdcaff53e705f94b52cadac2f53a Reviewed-on: http://gerrit.cloudera.org:8080/4639 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-19 23:07:47 +00:00
Michael Ho	b15d992abe	IMPALA-4080, IMPALA-3638: Introduce ExecNode::Codegen() This patch is mostly mechanical move of codegen related logic from each exec node's Prepare() to its Codegen() function. After this change, code generation will no longer happen in Prepare(). Instead, it will happen after Prepare() completes in PlanFragmentExecutor. This is an intermediate step towards the final goal of sharing compiled code among fragment instances in multi-threading. As part of the clean up, this change also removes the logic for lazy codegen object creation. In other words, if codegen is enabled, the codegen object will always be created. This simplifies some of the logic in ScalarFnCall::Prepare() and various Codegen() functions by reducing error checking needed. This change also removes the logic added for tackling IMPALA-1755 as it's not needed anymore after the clean up. The clean up also rectifies a not so well documented situation. Previously, even if a user explicitly sets DISABLE_CODEGEN to true, we may still codegen a UDF if it was written in LLVM IR or if it has more than 8 arguments. This patch enforces the query option by failing the query in both cases. To run the query, the user must enable codegen. This change also extends the number of arguments supported in the interpretation path of ScalarFn to 20. Change-Id: I207566bc9f4c6a159271ecdbc4bbdba3d78c6651 Reviewed-on: http://gerrit.cloudera.org:8080/4651 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-10-19 08:18:37 +00:00
Tim Armstrong	ee2a06d827	Remove Llama dependency This change prevents us from depending on LLAMA to build. Note that the LLAMA MiniKDC is left in - it is a test utility that does not depend on LLAMA itself. IMPALA-4292 tracks cleaning this up. Testing: Ran a private build to verify that all tests pass. Change-Id: If2e5e21d8047097d56062ded11b0832a1d397fe0 Reviewed-on: http://gerrit.cloudera.org:8080/4739 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-10-18 16:35:58 +00:00
Henry Robinson	3f5380dc73	IMPALA-2905: Move QueryResultSet implementations into separate module This mostly mechanical change moves the definition and implementation of the Beeswax and HS2-specific result sets into their own module. Result sets are now uniformly created by one of two factory methods, so the implementation is decoupled from the client. Change-Id: I6ab883b62d3ec7012240edf8d56889349e7c0e32 Reviewed-on: http://gerrit.cloudera.org:8080/4736 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-10-18 09:30:09 +00:00
Juan Yu	080a67848b	IMPALA-4253: impala-server.backends.client-cache.total-clients shows negative value Fixed double decrement in case a cached connection is broken and cannot be re-created. Change-Id: Ic9e28055cb232cdb543c4c9f05a558ab0f73f777 Reviewed-on: http://gerrit.cloudera.org:8080/4668 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-10-18 09:29:31 +00:00
Henry Robinson	5a91964893	IMPALA-4310: Make push_to_asf.py respect --apache_remote Change-Id: I03e15753e685b1b8cf953e8009fb473c9c12aa93 Reviewed-on: http://gerrit.cloudera.org:8080/4747 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Henry Robinson <henry@cloudera.com>	2016-10-18 06:34:22 +00:00
Lars Volker	0686cc4e1f	IMPALA-2916: Add warning to query profile if debug build Change-Id: I85ce4d4a5624382203e6b2c8f5b96d04c4482f37 Reviewed-on: http://gerrit.cloudera.org:8080/4588 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-10-18 06:12:51 +00:00
Tim Armstrong	df680cfe3a	IMPALA-4277: allow overriding of Hive/Hadoop versions/locations This is to help with IMPALA-4277 to make it easier to build against Hadoop/Hive distributions where the directory layout doesn't exactly match our current CDH dependencies, or where we may want to temporarily override a version without making a source change. Change-Id: I7da10e38f9c4309f2d193dc25f14a6ea308c9639 Reviewed-on: http://gerrit.cloudera.org:8080/4720 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-10-18 05:54:09 +00:00
Henry Robinson	d0a2d1d43d	Add search / sort to HTML tables for metrics and threads Change-Id: If069ce6a9eae00bacaa30605d23bea72f29e5c4f Reviewed-on: http://gerrit.cloudera.org:8080/4743 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-10-18 05:08:21 +00:00
Tim Armstrong	e3a0891445	Buffer pool: Add basic counters to buffer pool client Change-Id: I9a5a57b7cfccf67ee498e68964f1e077075ee325 Reviewed-on: http://gerrit.cloudera.org:8080/4714 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-18 04:48:43 +00:00
Tim Armstrong	07da7679d1	IMPALA-4123: Fast bit unpacking Adds utility functions for fast unpacking of batches of bit-packed values. These support reading batches of any number of values provided that the start of the batch is aligned to a byte boundary. Callers that want to read smaller batches that don't align to byte boundaries will need to implement their own buffering. The unpacking code uses only portable C++ and no SIMD intrinsics, but is fairly efficient because unpacking a full batch of 32 values compiles down to 32-bit loads, shifts by constants, masks by constants, bitwise ors when a value straddles 32-bit words and stores. Further speedups should be possible using SIMD intrinsics. Testing: Added unit tests for unpacking, exhaustively covering different bitwidths with additional test dimensions (memory alignment, various input sizes, etc). Tested under ASAN to ensure the bit unpacking doesn't read past the end of buffers. Perf: Added microbenchmark that shows on average an 8-9x speedup over the existing BitReader code. Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Reviewed-on: http://gerrit.cloudera.org:8080/4494 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-18 02:53:16 +00:00
Tim Armstrong	ef762b73a1	IMPALA-4299: add buildall.sh option to start test cluster A previous commit "IMPALA-4259: build Impala without any test cluster setup" altered some undocumented side-effects of buildall.sh. Previously the following commands reconfigured and restarted the test cluster. It worked because buildall.sh unconditionally regenerated the test cluster configs. ./buildall.sh -notests && ./testdata/bin/run-all.sh ./buildall.sh -noclean -notests && ./testdata/bin/run-all.sh Instead of restoring the old behaviour and continuing to encourage mixing use of low and high level scripts like testdata/bin/run-all.sh as part of the "standard" workflow, this commit adds another high-level option to buildall.sh, -start_minicluster, that accomplishes the high-level task of restarting a minicluster with fresh configs. The above commands can be replaced with: ./buildall.sh -notests -start_minicluster ./buildall.sh -notests -noclean -start_minicluster Change-Id: I0ab3461f8ff3de49b3f28a0dc22fa0a6d5569da5 Reviewed-on: http://gerrit.cloudera.org:8080/4734 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-17 22:19:06 +00:00
Alex Behm	0480253566	IMPALA-4270: Gracefully fail unsupported queries with mt_dop > 0. MT_DOP > 0 is only supported for plans without distributed joins or table sinks. Adds validation to fail unsupported queries gracefully in planning. For scans in queries that are executable with MT_DOP > 0 we either use the optimized MT scan node BE implementation (only Parquet), or we use the conventional scan node with num_scanner_threads=1. TODO: Still need to add end-to-end tests. Change-Id: I91a60ea7b6e3ae4ee44be856615ddd3cd0af476d Reviewed-on: http://gerrit.cloudera.org:8080/4677 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-17 09:22:57 +00:00
Alex Behm	b0e87c685d	IMPALA-2789: More compact mem layout with null bits at the end. There are two motivations for this change: 1. Reduce memory consumption. 2. Pave the way for full memory layout compatibility between Impala and Kudu to eventually enable zero-copy scans. This patch is only a first step towards that goal. New Memory Layout: Slots are placed in descending order by size with trailing bytes to store null flags. Null flags are omitted for non-nullable slots. There is no padding between tuples when stored back-to-back in a row batch. Example: select bool_col, int_col, string_col, smallint_col from functional.alltypes Slots: string_col|int_col|smallint_col|bool_col|null_byte Offsets: 0 16 20 22 23 The main change is to move the null indicators to the end of tuples. The new memory layout is fully packed with no padding in between slots or tuples. Performance: Our standard cluster perf tests showed no significant difference in query response times as well as consumed cycles, and a slight reduction in peak memory consumption. Testing: An exhaustive test run passed. Ran a few select tests like TPC-H/DS with ASAN locally. These follow-on changes are planned: 1. Planner needs to mark slots non-nullable if they correspond to a non-nullable Kudu column. 2. Update Kudu scan node to copy tuples with memcpy. 3. Kudu client needs to support transferring ownership of the tuple memory (maybe do direct and indirect buffers separately). 4. Update Kudu scan node to use memory transfer instead of copy Change-Id: Ib6510c75d841bddafa6638f1bd2ac6731a7053f6 Reviewed-on: http://gerrit.cloudera.org:8080/4673 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-16 23:36:10 +00:00
Henry Robinson	9f61397fc4	IMPALA-2905: Handle coordinator fragment lifecycle like all others The plan-root fragment instance that runs on the coordinator should be handled like all others: started via RPC and run asynchronously. Without this, the fragment requires special-case code throughout the coordinator, and does not show up in system metrics etc. This patch adds a new sink type, PlanRootSink, to the root fragment instance so that the coordinator can pull row batches that are pushed by the root instance. The coordinator signals completion to the fragment instance via closing the consumer side of the sink, whereupon the instance is free to complete. Since the root instance now runs asynchronously with respect to the coordinator, we add several coordination methods to allow the coordinator to wait for a point in the instance's execution to be hit - e.g. to wait until the instance has been opened. Done in this patch: * Add PlanRootSink * Add coordination to PFE to allow coordinator to observe lifecycle * Make FragmentMgr a singleton * Removed dead code from Coordinator::Wait() and elsewhere. * Moved result output exprs out of QES and into PlanRootSink. * Remove special-case limit-based teardown of coordinator fragment, and supporting functions in PlanFragmentExecutor. * Simplified lifecycle of PlanFragmentExecutor by separating Open() into Open() and Exec(), the latter of which drives the sink by reading rows from the plan tree. * Add child profile to PlanFragmentExecutor to measure time spent in each lifecycle phase. * Removed dependency between InitExecProfiles() and starting root fragment. * Removed mostly dead-code handling of LIMIT 0 queries. * Ensured that SET returns a result set in all cases. * Fix test_get_log() HS2 test. Errors are only guaranteed to be visible after fetch calls return EOS, but test was assuming this would happen after first fetch.
Change-Id: Ibb0064ec2f085fa3a5598ea80894fb489a01e4df Reviewed-on: http://gerrit.cloudera.org:8080/4402 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-10-16 15:55:29 +00:00
David Knupp	05b91a973c	IMPALA-4294: Make check-schema-diff.sh executable from anywhere. Fixes a regression in the data load process that had been introduced by commit `75a857c`. To make check-schema-diff.sh work from anywhere, we need to specify the git-dir and work-tree arguments everywhere we call git. Change-Id: I32e0dce2c10c443763a038aa3b64b1c123ed62ad Reviewed-on: http://gerrit.cloudera.org:8080/4726 Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-10-15 04:05:04 +00:00
Matthew Jacobs	ca3fd401be	IMPALA-3348: Avoid per-slot check vector size in KuduScanner Fixes a small perf issue by avoiding extra calls to check a vector size on every slot. Testing: Ran EE tests. Change-Id: Ie76d33c3d00e3be6d238226d28c4100bb65aac58 Reviewed-on: http://gerrit.cloudera.org:8080/4688 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-15 02:49:58 +00:00
Thomas Tauber-Marshall	7fad3e5dc3	IMPALA-3002/IMPALA-1473: Cardinality observability cleanup IMPALA-3002: The shell prints an incorrect value for '#Rows' in the exec summary for broadcast nodes due to incorrect logic around whether to use max or agg stats. This patch makes the behavior consistent with the way the be treats exec summaries in summary-util.cc. This incorrect logic was also duplicated in the impala_beeswax test framework. IMPALA-1473: When there is a merging exchange with a limit, we may copy rows into the output batch beyond the limit. In this case, we currently update the output batch's size to reflect the limit, but we also need to update ExecNode::num_rows_returned_ or the exec summary may show that the exchange node returned more rows than it really did. Additionally, PlanFragmentExecutor::GetNext does not update rows_produced_counter_ in some cases, leading the runtime profile to display an incorrect value for 'RowsProduced'. Change-Id: I386719370386c9cff09b8b35d15dc712dc6480aa Reviewed-on: http://gerrit.cloudera.org:8080/4679 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-15 01:25:51 +00:00
Matthew Jacobs	a1c9cb3646	IMPALA-4102: Remote Kudu reads should be reported Adds a profile counter for the number of kudu scan tokens (ranges) that are "expected" to be remote. Testing: Manual; Have been running with this on the Kudu cluster. Cannot easily simulate this in the minicluster because the scheduler considers multiple impalads on the same host to be local for the purposes of determining locality. See BackendConfig::LookUpBackendIp(). Change-Id: I74fd5773c4ae10267de80b6572d93197a4131696 Reviewed-on: http://gerrit.cloudera.org:8080/4687 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-15 00:43:56 +00:00
Lars Volker	1a5c43ef5e	IMPALA-3644 Make predicate order deterministic This adds a tie-break to make sure that we sort predicates in a deterministic order on Java 7 and 8. This was suggested by Alex in IMPALA-3644. There are still three broken tests when run in Java 8, but it seems best to address them in a subsequent change. Change-Id: Id11010bfeaff368869e6d430eeb4773ddf41faff Reviewed-on: http://gerrit.cloudera.org:8080/4671 Reviewed-by: Jim Apple <jbapple@cloudera.com> Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 22:04:30 +00:00
Michael Brown	db5de41a80	IMPALA-4188: Leopard: support external Docker volumes To be able to run the Random Query Generator with Impala and Kudu, we need to mount an external Docker volume as a workaround to KUDU-1419. This patch introduces a series of environment variables a user may tweak in order to help with that purpose. The patch assumes a viable, reasonable Docker container based on a standard Linux distribution like Ubuntu 14. To assist users, I've updated the Leopard README with instructions on the environment variables' meanings. The gist here is that the container is the source of truth, which means to create an external volume, we need to copy the testdata off the container onto the host running Docker Engine. To do that we suggest a strategy using rsync via passwordless SSH key. Testing: I used a Cloudera Docker container that has Impala in /home/dev/Impala. Before, Kudu would fail to start due to KUDU-1419. Now, we load testdata into an external volume, build Impala, run the minicluster including Kudu, and can access the tpch_kudu data. I made flake8 fixes as well. flake8 on this file is now clean. Change-Id: Ia7d9d9253fcd7e3905e389ddeb1438cee3e24480 Reviewed-on: http://gerrit.cloudera.org:8080/4678 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 07:44:23 +00:00
Jim Apple	784716f776	IMPALA-3971, IMPALA-3229: Bootstrap an Impala dev environment This script bootstraps an Impala dev environment on Ubuntu 14.04. It is not hermetic -- it changes some config files for the user and for the OS. It is green on Jenkins, and it runs in about 6.5 hours. The intention is to have this script run in a CI tool for post-commit testing, with the hope that this will make it easier for new developers to get a working development environment. Previously, the new developer workflow lived on wiki pages and tended to bit-rot. Still left to do: migrating the install script into the official Impala repo. Change-Id: If166a8a286d7559af547da39f6cc09e723f34c7e Reviewed-on: http://gerrit.cloudera.org:8080/4674 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 06:07:46 +00:00
Alex Behm	2a04b0e21a	IMPALA-3943: Address post-merge comments. Adds code comments and issues a warning for Parquet files with num_rows=0 but at least one non-empty row group. Change-Id: I72ccf00191afddb8583ac961f1eaf11e5eb28791 Reviewed-on: http://gerrit.cloudera.org:8080/4696 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 05:41:22 +00:00
Michael Ho	47b8aa3a9e	IMPALA-4291: Reduce LLVM module's preparation time Previously, when creating a LlvmCodeGen object, we ran an O(mn) algorithm to map the IRFunction::Type to the actual LLVM::Function object in the module. m is the size of the IRFunction::Type enum and n is the total number of functions in the module. This is a waste of time if we only use a few functions from the module. This change reduces the preparation time of a simple query from 23ms to 10ms. select count(*) from tpch100_parquet.lineitem where l_orderkey > 20; Change-Id: I61ab9fa8cca5a0909bb716c3c62819da3e3b3041 Reviewed-on: http://gerrit.cloudera.org:8080/4691 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 03:02:35 +00:00
Tim Armstrong	c7fe4385d9	IMPALA-4231: fix codegen time regression The commit "IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder" slightly increased codegen time, which caused TPC-H Q2 to sometimes regress significantly because of races in runtime filter arrival. This patch attempts to fix the regression by improving codegen time in a few places. * Revert to using the old bool/Status return pattern. The regular Status return pattern results in significantly more complex IR because it has to emit code to copy and free statuses. I spent some time trying to convince it to optimise the extra code out, but didn't have much success. * Remove some code that cannot be specialized from cross-compilation. * Add noexcept to some functions that are used from the IR to ensure exception-handling IR is not emitted. This is less important after the first change but still should help produce cleaner IR. Performance: I was able to reproduce a regression locally, which is fixed by this patch. I'm in the process of trying to verify the fix on a cluster. Change-Id: Idf0fdedabd488550b6db90167a30c582949d608d Reviewed-on: http://gerrit.cloudera.org:8080/4623 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-14 02:53:59 +00:00
Jim Apple	89b41c68c1	Match .clang-format more closely to actual practice. In order to attempt to get code like double VeryLongFunctionNames(double x1, double x2, double x3, double x4) { return 1.0; } rather than double VeryLongFunctionNames( double x1, double x2, double x3, double x4) { return 1.0; } I wrote a small set of programs to infer which .clang-format params fit the current Impala codebase most closely; this patch is the result. This patch is the best the inferencer found (while maintaining certain enforced parameters, like 90-character lines). It is about 10% closer to Impala's current code base than the .clang-format that is checked in at the moment, as measured by number of lines in the diff. Change-Id: Iccaec6c1673c3e08d2c39200b0c84437af629aed Reviewed-on: http://gerrit.cloudera.org:8080/4590 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Jim Apple <jbapple@cloudera.com>	2016-10-14 00:08:17 +00:00
Tim Armstrong	67a0451e3b	Bump Bzip2 version This picks up the latest toolchain version. The only change is that some symlinks in the previous version were broken. Change-Id: I0c5e9ef10984fc8c6840acf285a04e472fc8b304 Reviewed-on: http://gerrit.cloudera.org:8080/4716 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-10-13 21:25:06 +00:00
Tim Armstrong	75a857c0ce	IMPALA-4259: build Impala without any test cluster setup. The main outcome of this change is to avoid making unnecessary modifications to the Impala or other source trees when we don't need the test cluster. To achieve that, this refactors the script to make the flow easier to understand and makes it more consistent which build steps are executed in which modes. Change-Id: I429da7bc6681b16c07fe58bb3efac6d1a8579137 Reviewed-on: http://gerrit.cloudera.org:8080/4685 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-10-13 05:45:47 +00:00
Alex Behm	0b3efb19cc	IMPALA-4289: Mark agg slots of NDV() functions as non-nullable. This change might give a minor speedup to COMPUTE STATS and COMPUTE INCREMENTAL STATS. In any case, marking the slots non-nullable seems strictly better than leaving them nullable. Testing: I ran our local testdata/compute-table-stats.sh and it succeeded. Change-Id: I1c05b8dfb797b2a42ee1a7bf14ad56bb83d2b1c5 Reviewed-on: http://gerrit.cloudera.org:8080/4707 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-10-13 04:31:42 +00:00
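
The IMPALA-3644 entry above describes adding a tie-break so that predicates sort in a deterministic order. The log does not show the actual change, so the following is only a rough Python sketch of the general idea, not Impala's code: the `Predicate` type, its `cost` field, and the choice of the SQL text as the secondary key are all hypothetical names picked for illustration.

```python
from typing import List, NamedTuple


class Predicate(NamedTuple):
    """Hypothetical stand-in for a planner predicate (not an Impala class)."""
    sql: str     # textual form of the predicate, assumed unique per predicate
    cost: float  # assumed primary sort key; equal costs cause ties


def order_predicates(preds: List[Predicate]) -> List[Predicate]:
    # Sort by cost first, then break ties on the SQL text. Without the
    # second key, predicates with equal cost keep whatever relative order
    # the input happened to have, which can differ between runs or runtimes.
    return sorted(preds, key=lambda p: (p.cost, p.sql))


forward = [
    Predicate("b > 1", 1.0),
    Predicate("a = 2", 1.0),
    Predicate("c < 3", 0.5),
]
backward = list(reversed(forward))

# The result is the same regardless of the input order.
assert order_predicates(forward) == order_predicates(backward)
```

Any total order on the secondary key would serve here; the SQL string is used in this sketch only because it is easy to compare.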

5120 Commits