impala

mirror of https://github.com/apache/impala.git synced 2026-01-08 03:02:48 -05:00

Author	SHA1	Message	Date
Bharath Vissapragada	5cd7ada727	IMPALA-3194: Allow queries materializing scalar type columns in RC/sequence files This commit unblocks queries materializing only scalar typed columns on tables backed by RC/sequence files containing complex typed columns. This worked prior to 2.3.0 release. Change-Id: I3a89b211bdc01f7e07497e293fafd75ccf0500fe Reviewed-on: http://gerrit.cloudera.org:8080/2580 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-31 12:06:57 +00:00
Casey Ching	39a28185e8	Re-enable Kudu in build using client stubs when needed The stubs in Impala broke during the merge commit. This commit removes the stubs in hopes of improving robustness of the build. The original problem (Kudu clients are only available for some OSs) is now addressed by moving the stubbing into a dummy Kudu client. The dummy client only allows linking to succeed; if any client method is called, Impala will crash. Before calling any such method, Kudu availability must be checked. Change-Id: I4bf1c964faf21722137adc4f7ba7f78654f0f712 Reviewed-on: http://gerrit.cloudera.org:8080/2585 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-03-29 23:57:54 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
Henry Robinson	0d1eab7a9e	IMPALA-3141: Send dummy filters when filter production is disabled The PHJ may disable runtime filter production for one of several reasons, including a predicted high false-positive rate. If the filters are not produced, any scans will wait for their entire timeout before continuing. This patch changes the filter logic to always send a filter, even if one wasn't actually produced by the PHJ. To preserve correctness, that filter must contain every element of the set. Such a filter is represented by (BloomFilter*)NULL. This allows us to make no changes to RuntimeFilter::Eval(), which already returns true if the member Bloom filter is NULL. In RPCs, a new field is added to TBloomFilter to identify filters that are always true. The HdfsParquetScanner checks to see if filters would always return true for any element, and disables them if so. There is some miscellaneous cleanup in this patch, particularly the removal of unused members in BloomFilter. This patch has been manually tested on queries that would otherwise take a long time to time-out. A unit test was added to ensure that queries do not wait. Change-Id: I04b3e6542651c1e7b77a9bab01d0e3d9506af42f Reviewed-on: http://gerrit.cloudera.org:8080/2475 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-03-24 23:17:50 +00:00
Henry Robinson	c06912ebb6	IMPALA-3226: Increase timeout for runtime filter tests When running with ASAN enabled, runtime filters may take a lot longer to be produced, triggering timeouts in the filter tests. This patch triples the timeout time. We still want the timeout to be reasonable as protection against excessive regressions in filter production time, which is why I've not set the timeout to a very large value, plus if the test fails and filters aren't produced we don't want to hang the build for a large timeout delay. Change-Id: Ife1d36a78d6ad587462fe112afda573f6e480441 Reviewed-on: http://gerrit.cloudera.org:8080/2609 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-24 07:59:53 +00:00
Henry Robinson	b3937295fb	Runtime filters tests This patch adds functional tests for runtime filters. It relies on setting RUNTIME_FILTER_WAIT_TIME_MS high enough to ensure that filters are received. To make the test files more readable, this patch also adds a new COMMENT section to the test syntax, and allows blank spaces between queries so that the separation of different test cases can be made more obvious. Currently missing is a test for disabling probe-side filters based on selectivity, as we lack suitable tables to trigger the disable condition. Change-Id: I94d617c6d23ffa394a6eb7ead56f1cfb701e0d90 Reviewed-on: http://gerrit.cloudera.org:8080/2603 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-03-23 04:07:14 +00:00
Thomas Tauber-Marshall	445c88339f	IMPALA-2738 Hive/Impala inconsistency in GRANT/REVOKE syntax Added the ability for the "GRANT/REVOKE ALL ON SERVER TO ROLE <role>" statement to optionally take a server name parameter as: "GRANT/REVOKE ALL ON SERVER <server> TO ROLE <role>" since Hive allows this. The specified server name is checked against the expected server name from the config during analysis, and an exception is thrown if they do not match. Change-Id: Id6c136d9a171ec062d4ff803682d026422497e8b Reviewed-on: http://gerrit.cloudera.org:8080/2296 Tested-by: Internal Jenkins Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>	2016-03-19 00:03:03 +00:00
Bharath Vissapragada	978d837758	IMPALA-3139: Fix drop table statement to not drop views and vice versa This commit fixes the following two issues - A drop table statement can drop a view with same name when "IF EXISTS" is specified. - A drop view statement can drop a table with same name when "IF EXISTS" is specified. This happens due to lack of checks in the Catalog before the drop executes. Change-Id: I0d35cd1f50d9b8d50223660f753c56529cbbc311 Reviewed-on: http://gerrit.cloudera.org:8080/2458 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-03-15 12:10:33 +00:00
Tim Armstrong	f5b7842414	IMPALA-2502: don't redundantly repartition grouping aggregations Grouping aggregations previously always repartitioned their input, even if preceding joins or aggs had already partitioned the data on the required key (or an equivalent key). This patch checks to see if data is already partitioned on the required exprs (or equivalent ones), and if so skips the preaggregation and only does a merge aggregation. The patch also does some refactoring of the aggregation planning in DistributedPlanner to make it easier to implement the change. Includes planner tests for the three cases that are affected: grouping aggregations, non-grouping distinct aggregations and grouping distinct aggregations. Change-Id: Iffdcfd3629b8a69bd23915e1adba3b8323cbbaef Reviewed-on: http://gerrit.cloudera.org:8080/2414 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-03-15 09:21:22 +00:00
David Alves	7381304a23	Merge branch 'feature/kudu' into cdh5-trunk This is the final merge commit that merges the 'feature/kudu' branch into cdh5-trunk. Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a	2016-03-13 19:28:43 -07:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Michael Ho	13007f9634	IMPALA-561: Allow multiple callbacks in a thread resource pool. Previously, the thread resource manager only supported a single callback for each resource pool. The callback is invoked when a thread token is available. This mostly works as the scan node is the only consumer and there is usually one scan node in a plan fragment. As shown in IMPALA-3064 and IMPALA-561, it's possible to generate a plan fragment with more than one scan node, in which case one of the scan nodes may be running with a single thread and, in debug builds, a DCHECK will be hit. This change fixes the problem by allowing more than one callback in a given resource pool. The thread resource manager will go through all the registered callbacks in round robin fashion. This change also adds a missing thread token release call in HdfsScanNode::ThreadTokenAvailableCb(). Change-Id: Iddfff1feef0b59d407994ad3bc560166acbfa623 Reviewed-on: http://gerrit.cloudera.org:8080/2430 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-10 23:16:29 +00:00
Alex Behm	4a25f87d5c	Improve the SQL for nested TPCH-Q18. Marcel spotted that nested TPCH-Q18 can be expressed with more efficient SQL. Results on nested TPCH-300: Before 160s After 100s Change-Id: I8b351b7f467e8bef0c256dc43cea325d7f177edf Reviewed-on: http://gerrit.cloudera.org:8080/2418 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-04 04:35:54 +00:00
Alex Behm	54a46e9459	IMPALA-3065/IMPALA-3062: Restrict !empty() predicates to scan nodes. The bug: Evaluating !empty() predicates at non-scan nodes interacts poorly with our BE projection of collection slots. For example, rows could incorrectly be filtered if a !empty() predicate is assigned to a plan node that comes after the unnest of the collection that also performs the projection. The fix: This patch reworks the generation of !empty() predicates introduced in IMPALA-2663 for correctness purposes. The predicates are generated in cases where we can ensure that they will be assigned only by the parent scan, and no other plan node. The conditions are as follows: - collection table ref is relative and non-correlated - collection table ref represents the rhs of an inner/cross/semi join - collection table ref's parent tuple is not outer joined Change-Id: Ie975ce139a103285c4e9f93c59ce1f1d2aa71767 Reviewed-on: http://gerrit.cloudera.org:8080/2399 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Silvius Rus <srus@cloudera.com> Tested-by: Internal Jenkins	2016-03-02 23:23:05 -08:00
Tim Armstrong	6cdcdb12ff	Test for IMPALA-2987 Add a custom cluster test that tests for delays in registering data stream receivers. We add a stress option to artificially delay this registration to ensure that it can be handled correctly. Change-Id: Id5f5746b6023c301bacfa305c525846cdde822c9 Reviewed-on: http://gerrit.cloudera.org:8080/2306 Tested-by: Internal Jenkins Reviewed-by: Silvius Rus <srus@cloudera.com>	2016-03-02 23:23:04 -08:00
Alex Behm	a303f25256	IMPALA-3071: Fix assignment of On-clause predicates belonging to an inner join. The bug: On-clause predicates belonging to an inner join were not always assigned correctly if they referenced an outer-joined tuple. Specifically, our logic for detecting whether a predicate can be assigned below an outer join if also left at the outer-join node was not correct, and so we assigned the predicate below the join, but did not also leave it at the outer join. The fix: Assign an inner join On-clause conjunct that references an outer-joined tuple to the join that the On-clause belongs to. Change-Id: Iffef7718679d48f866fa90fd3257f182cbb385ae Reviewed-on: http://gerrit.cloudera.org:8080/2309 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-29 22:22:41 -08:00
Juan Yu	c9b33ddf63	IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files. Fix a bug in which Impala only reads the first stream of a multi-stream bz2/gzip file. Changes the bz2 decoder to read the file in a streaming fashion rather than reading the entire file into memory before it can be decompressed. Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8 (cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e) Reviewed-on: http://gerrit.cloudera.org:8080/2219 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-02-28 21:31:37 -08:00
Dimitris Tsirogiannis	2c37d99fed	IMPALA-3089: Perform static partition pruning in the FE with disjunctive BETWEEN predicates This commit fixes an issue where the slow path is employed during static partition pruning for disjunctive BETWEEN predicates, introducing significant latency during planning, especially for tables with a large number of partitions. Change-Id: I66ef566fa176a859d126d49152921a176a491b0a Reviewed-on: http://gerrit.cloudera.org:8080/2320 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-26 15:37:24 -08:00
Alex Behm	5c0e1fa1e8	IMPALA-2974: Use Type.toSql() instead of toString() in ALTER TABLE CHANGE COLUMN. Change-Id: I140bdea755e44d3f2ceb4a8f5e288faaddaa963f Reviewed-on: http://gerrit.cloudera.org:8080/2285 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-26 15:37:24 -08:00
Dimitris Tsirogiannis	197eb43477	IMPALA-3074: AnalysisError when runtime filter has incompatible source and target exprs This commit fixes an issue where an AnalysisError is thrown when a runtime filter has incompatible source and target exprs. This is triggered when a runtime filter has multiple candidate target scan nodes not all of which produce a target expr which is cast-compatible with the source expr. Change-Id: I544c8fc66915f684ba24d20de525563638c4039d Reviewed-on: http://gerrit.cloudera.org:8080/2307 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 19:54:40 -08:00
Dimitris Tsirogiannis	d3b92b0d9f	IMPALA-3039: Restrict the number of runtime filters generated This commit adds a query option, MAX_NUM_RUNTIME_FILTERS, to restrict the number of runtime filters generated per query. If more than MAX_NUM_RUNTIME_FILTERS are generated, the runtime filters are sorted by the selectivity of the associated source join nodes and the MAX_NUM_RUNTIME_FILTERS most selective filters are applied. Also with this commit, non-selective filters are automatically discarded, irrespective of the value of MAX_NUM_RUNTIME_FILTERS. Change-Id: Ifd41ef6919a6d2b283a8801861a7179c96ed87c6 Reviewed-on: http://gerrit.cloudera.org:8080/2262 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 19:54:40 -08:00
Alex Behm	2c8f41b7d4	IMPALA-2832: Fix cloning of FunctionCallExpr. The bug was that we were not properly cloning the params of a FunctionCallExpr. In a CTAS we analyze the underlying query stmt twice, the first time on a clone of the original stmt. The problem was that the first analysis affected the second analysis due to an improper clone, leading to missing slots in a scan because the corresponding SlotRefs were already analyzed. Change-Id: I0025c0ee54b2f2cb3ba470b26a9de5aa5a3a3ade Reviewed-on: http://gerrit.cloudera.org:8080/2291 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 13:31:00 -08:00
Tim Armstrong	52362d4079	IMPALA-3047: separate create table test with nested types We need to skip queries that select from tables with nested types if running with the old aggs and joins. To achieve this, move the failing test to a separate test and use the skip decorator. Change-Id: Iaf1351c711b524be66a99084657926909425cbff Reviewed-on: http://gerrit.cloudera.org:8080/2272 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 13:31:00 -08:00
Alex Behm	a99e17457b	Fix a non-deterministic test in complex-types-file-formats.test. Change-Id: I98cc3045a6a6131dba8b0a475d5d51de7bdba455 Reviewed-on: http://gerrit.cloudera.org:8080/2268 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
Alex Behm	c6fd5a0fe4	IMPALA-2844: Allow count(*) on RC files with complex types. This patch also fixes the incorrect error message reported in the JIRA. Change-Id: I2c7b732767d154c36bc7189df5177d27a35d0d7b Reviewed-on: http://gerrit.cloudera.org:8080/2267 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
Alex Behm	8b32cbb904	IMPALA-2820: Support unquoted keywords as struct-field names. After this patch structs can be parsed/created with field names that are regular identifiers or keywords, even if unquoted. This fix is needed for parsing type strings stored in the Hive Metastore which could contain unquoted identifiers that correspond to Impala keywords. The parser changes required an upgrade of Cup and its Maven plugin. In the old version, the generated parser would not compile because of a giant method that exceeded the JVM maximum allowed size for a single method. Change-Id: Ic989c7afd034216f6db4c8f9f3901c025cceb524 Reviewed-on: http://gerrit.cloudera.org:8080/2249 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
David Alves	2591a6718a	Handle booleans in the Kudu scanner We were missing handling booleans in the Kudu scanner, though we handled them in the sink. This patch fixes this issue and adds some tests. Change-Id: If8edbe85ae257c6374eddf757845c1ec917b1693	2016-02-22 13:49:10 -08:00
David Alves	af69097f19	IMPALA-2674 - Add support for VARCHAR to the backend In the frontend we support creating Kudu tables with VARCHAR but in the backend we don't handle it. Moreover we were swallowing the error in release mode, causing inserts to just skip values when this type is used. This patch adds support for VARCHAR, along with the corresponding tests. Change-Id: Ic734890e1b3aae2eef1e0a3d45a7561d02eeb917	2016-02-22 13:46:34 -08:00
Bharath Vissapragada	ef0dac661c	IMPALA-2843: Persist hive udfs across catalog restarts This commit adds a new feature to persist hive/java udfs across catalog restarts. IMPALA-1748 already added this for non-java udfs by storing them in parameters map of the Db object and reading them back at catalog startup. However we follow a different approach for hive udfs by converting them to Hive's function format and adding them as hive functions to the metastore. This makes it possible to share udfs between hive and Impala as the udfs added from one service are accessible to other. This commit takes care of format conversions between hive and impala and user can just add function once in either of the services. Background: Hive and impala treat udfs differently. Hive resolves the evaluate function in the udf class at runtime depending on the data types of the input arguments. So user can add one function by name and can pass any arguments to it as long as there is a compatible evaluate function in the udf class. However Impala takes the input types of the udf as a part of function definition (that maps to only one evaluate function) and loads the function only for those set of input argument types. If we have multiple 'evaluate' methods, we need to add multiple functions one for each of them. This commit adds new variants of CREATE | DROP FUNCTIONS to Impala which lets the user to create and drop hive/java udfs without input argument types or return types. Catalog takes care of loading/dropping the udf signatures corresponding to each "evaluate" method in the udf symbol class. 
The syntax is as follows, CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts> DROP FUNCTION [IF EXISTS] <function name> Examples: CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf'; CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2'; DROP FUNCTION foo; DROP FUNCTION IF EXISTS bar; The older way of creating hive/java udfs with specific signature is still supported, however they are not persisted across restarts. So a restart of catalog can wipe them out. Additionally this commit also loads all the compatible java udfs added outside of Impala and they needn't be separately loaded. One thing to note here is that the functions added using the new CREATE FUNCTION can only be dropped using the new DROP FUNCTION syntax (without signature). The same rule applies for the java udfs added using the old CREATE FUNCTION syntax (with signature). Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d Reviewed-on: http://gerrit.cloudera.org:8080/2250 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 23:04:03 -08:00
Marcell Szabo	8135ef6eaa	IMPALA-2641: Add IF EXISTS clause to TRUNCATE TABLE statement Change-Id: I3169390b0e04f07fb4ea53d987d86a76482d7e9d Reviewed-on: http://gerrit.cloudera.org:8080/1905 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 14:08:58 +00:00
Skye Wanderman-Milne	5a81d2db88	IMPALA-2184: don't inline timestamp methods with try/catch blocks in IR We do not have exceptions enabled for codegen'd code, so exceptions thrown by functions called by codegen'd functions cannot be caught by the codegen'd functions. TimestampValue::UnixTimeToPtime() has a try/catch around boost::posix_time::ptime_from_tm(), but since it was inlined into the TimestampFunctions::FromUnix() IR the try/catch didn't work. This patch moves the UnixTimeToPtime() implementation to the .cc file so it doesn't get included in the IR. It does the same for TimestampParser::Parse() in case it gets inlined into IR code as well. Change-Id: Ic0af73629e1e3b6bf18cbf5d832973712b068527 Reviewed-on: http://gerrit.cloudera.org:8080/2210 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 00:03:23 -08:00
Bharath Vissapragada	1b40a83903	IMPALA-2382: Add support for Hive udfs returning primitive types Hive allows udfs with primitive data types as return values (along with Writables) and input arguments. This commit adds this support for Impala. Change-Id: I2ec24eab5a824772a8618d7fb97ae5c7ea2a0e39 Reviewed-on: http://gerrit.cloudera.org:8080/2207 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 00:03:22 -08:00
Lars Volker	6b566a2d35	IMPALA-3004: Fix QueryTest tests Test files in testdata/workloads/functional-query/queries/QueryTest are parsed by test_file_parser.py, which used to ignore everything before the first ==== line as a file header. This change fixes all affected files. This change also modifies the test file parser to forbid headers starting with what looks like a subsection title ('----'), which should prevent the reintroduction of similar errors in the future. Change-Id: Iaa1bc5ffd02782e24289c7843dcb35401c334519 Reviewed-on: http://gerrit.cloudera.org:8080/2220 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 00:03:15 -08:00
Michael Ho	f9232c98b0	IMPALA-3018: Fix AllocBuffer() and CopyStringVal() to handle empty strings. AllocBuffer() and CopyStringVal() are two helper functions used by various UDAs to allocate buffers for StringVal during their Init() and Update() functions. Previously, these functions assumed that the buffer length is always greater than 0. That turned out to be an invalid assumption. This change removes this assumption and handles zero-length StringVal by initializing its 'ptr' to NULL and 'len' to 0. A new test is also added to exercise this case. Change-Id: Ia1e4140376c65ca3c734c40ecc3cce15b8bf2d3f Reviewed-on: http://gerrit.cloudera.org:8080/2211 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-02-18 01:25:10 -08:00
Skye Wanderman-Milne	9aeb77023f	IMPALA-2993: don't check for "Failed to allocate buffer for collection" error This test query is supposed to check the error path for when a collection buffer cannot be allocated. However, it's flaky because the collection allocations are not very big (< 2KB), so it's possible for a different operator to trigger OOM. I think the correct solution is to create a test file that contains very large collections, so a large collection allocation will trigger OOM, rather than many small collection allocations. For now though, let's disable the specific collection allocation check to unblock the build, even though we risk losing coverage. Change-Id: Iab4c9b605186926c522cf692246a37882fbdfcdb Reviewed-on: http://gerrit.cloudera.org:8080/2208 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-02-18 01:25:10 -08:00
Matthew Jacobs	35ad46c1ce	IMPALA-2996: Mem limit too high for expected OOM test failure A regression test for IMPALA-2265, IMPALA-2559 expected a query to fail with an OOM but the mem limit is now too high. This reduces the mem limit of the test case to be as low as it can be without failing to set up the operators. Change-Id: I056c3ad4067e5466e3690c3b4d597b9815a7a234 Reviewed-on: http://gerrit.cloudera.org:8080/2186 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Internal Jenkins (cherry picked from commit 45ba3109e752dfdeefdf5627a5d57079f73b24c9)	2016-02-17 20:22:14 -08:00
Tim Armstrong	212bea529f	IMPALA-2994: Temporary workaround for flaky spilling test The test was recently reenabled in commit 71a0a7d998702781ae44270f8c742b10c34c0efc. Continue running the test but loosen the memory limit and don't check the runtime profile. The memory limits for this set of tests needs revisiting in any case. Change-Id: I195e8ad3b67c8ff85d5d15c2646a13f5feb57553 Reviewed-on: http://gerrit.cloudera.org:8080/2183 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins (cherry picked from commit 51632f39a45ba9deac9b86bbdb14ff10cbee35ac)	2016-02-17 20:21:57 -08:00
Henry Robinson	2212240106	IMPALA-2552: Runtime filter forwarding between joins and scans This patch adds the ability for operators to compute and forward bitmap filters from one operator to another, across fragment and machine boundaries. Filters are provided as part of the plan from the frontend. In this patch hash join nodes produce filters from their build side input, and propagate them, via the query's coordinator, to the scan nodes which provide the probe side for that join. The scan nodes may then filter their rows before they are sent to the join, reducing the amount of work the join has to do. Filters are attached to the local RuntimeState's RuntimeFilterBank by the join node. When complete, they are asynchronously sent to the coordinator via a new UpdateFilter() RPC. The coordinator maintains a routing table that maps incoming filters to their recipient backends. For partitioned joins, the filters must be aggregated from all providers. The coordinator performs this aggregation and transmits the completed filter only when all inputs have been received. In this patch, filtering can occur in up to four places in a scan: 1. Before initial scan ranges are issued (all file formats, partition columns only) 2. Before each scan range is processed (all file formats, partition columns only) 3. Before a row group is processed (Parquet, partition columns only) 4. During assembly of every row (Parquet, any column) This patch also replaces the existing bitmap-based filters with Bloom Filter based ones. The Bloom Filters are statically sized to have an expected false positive rate of 0.1 on 2^20 distinct items. This yields Bloom Filters of 1MB in size. This is configurable by setting --bloom_filter_size, and we will perform tests to determine a good default. The query option RUNTIME_BLOOM_FILTER_SIZE can override the command-line flag on a per-query basis. This patch also simplifies and improves the memory handling for allocated filters by the RuntimeFilterBank. 
New filters are tracked through the query memory tracker, and owned by the fragment instance's RuntimeState::obj_pool(). This patch also adds a simple heuristic to disable filter creation based on estimated false-positive rate for the Bloom Filter. By default the maximum FP rate is set to 75%. It can be controlled by setting --max_filter_error_rate. Finally, this patch adds short-circuit publication for filters that are broadcast, and does so always even when distributed runtime filter propagation is disabled. To avoid cross-compilation problems, bloom-filter.h was rewritten in C++98. Change-Id: Icea03a87cf1705c1b4aa46f86f13141c4b58da10 Reviewed-on: http://gerrit.cloudera.org:8080/1861 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Henry Robinson <henry@cloudera.com>	2016-02-13 16:19:41 +00:00
Tim Armstrong	1c102d9d8e	Reenable tests that were disabled for IMPALA-1305 A couple of tests were disabled because of IMPALA-1305. Now that the fix is in, those tests can be reenabled. I ran them in a loop to make sure that they weren't flaky. Also fix the spelling mistake in the file name. Change-Id: I1bfcc619911a92d93b871be3a14852aa11f78da9 Reviewed-on: http://gerrit.cloudera.org:8080/2150 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-02-13 10:08:13 +00:00
Alex Behm	d7ee6fa7a4	IMPALA-2663: Filter out tuples with empty collections in scan. We now generate predicates for filtering out empty collections directly in the parent scan that materializes the collections. This optimization is conservatively applied only for uncorrelated relative table references because that makes it safe/easy to determine the join type (the optimization is incorrect for outer and anti joins). The change provides a substantial improvement for queries that have selective predicates on nested collections, or for data sets that naturally have many empty collections. The performance improvement comes from: (1) The new predicates are assigned to a scan, so we get multi-threading. (2) We avoid expensive subplan iterations for collections that would yield an empty subplan result anyway. Performance measurements on 10-node using nested TPCH-300 on some of the queries originally mentioned in the JIRA: TPCH-Q12, 10x speedup Before: 111s After: 11s TPCH-Q7 Before: 205s After: 128s TPCH-Q5 Before: 48s After: 40s TPCH-Q3 Before: 18s After: 14s The following microbenchmark query designed to highlight the improvement also gets a ~10x speedup. select c_custkey, o_orderkey from customer c, c.c_orders where o_orderkey = 1884930 Before: 11.3s After: 1.8s Change-Id: I0d0dc90442a61d62cc8f7dad186560490b62441a Reviewed-on: http://gerrit.cloudera.org:8080/2118 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-12 00:23:43 +00:00
Dimitris Tsirogiannis	c943d6ab7d	IMPALA-2552: Add support for runtime filter propagation (FE) This commit adds support for runtime filter propagation in the frontend. During planning, the frontend computes a set of filters that are constructed by join operators and are applied at scan operators in order to filter scanned tuples or scan ranges. The filters are identified from equi-join predicates by traversing the single-node plan tree in a top-down fashion. A query option, termed enable_runtime_filter_propagation, is added to enable/disable runtime filter propagation (disabled by default). When runtime filter propagation is enabled, the output of EXPLAIN is modified to include information about the runtime filters that are constructed/applied. Also, an event is added to the query timeline to track the time spent in the planner while computing runtime filters. Testing: Functional planner tests are added. Change-Id: Id79a38313051d95da32c897b176a40d26b0dda1d Reviewed-on: http://gerrit.cloudera.org:8080/1532 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Henry Robinson <henry@cloudera.com>	2016-02-12 00:11:45 +00:00
Tim Armstrong	2c2670e389	IMPALA-1305: streaming pre-aggregations Aggregations are implemented as a distributed pre-aggregation, an exchange, then a final aggregation that produces the results of the aggregation. In many cases the pre-aggregation significantly reduces the amount of data to be exchanged. However, in other cases, the pre-aggregation does not greatly reduce the amount of data exchanged or can use a lot of memory and starve other operators that would benefit more from the additional memory. In these cases we would be better off "passing through" some input tuples by transforming them into intermediate tuples without aggregating them. This patch adds a streaming pre-aggregation mode to PartitionedAggregationNode that tries to aggregate input rows with a hash table, but can switch to passing through the input tuples (after transforming them into the appropriate tuple format). It does this if it hits a memory limit or if the aggregation is not sufficiently reducing the node's output (specifically, if the number of aggregated rows in the hash table is more than half the number of unaggregated rows consumed by the pre-aggregation). Pre-aggregations never need to spill because they can pass through rows when under memory pressure. This initial implementation is quite conservative: it retains the partitioning of the previous implementation because switching to a single partition proved to regress performance of some queries while improving others. It also always keeps hash tables around and updates them with matching input rows so that reduction statistics are updated and early decisions to pass through data can be reversed. Future work could explore different approaches within the new framework to get larger performance gains. Currently we see significant performance benefits for queries with a very low reduction factor, e.g. group by on a nearly unique column. Includes codegen support for the passthrough streaming. Adds a query option, disable_streaming_preaggregations, in case a user wants to revert to the old behaviour. Adds TPC-H tests to exercise the new passthrough code path and updates planner tests to include the new [STREAMING] detail added by the planner. Change-Id: Ia40525340cba89a8c4e70164ae11447e96494664 Reviewed-on: http://gerrit.cloudera.org:8080/1698 Tested-by: Internal Jenkins Reviewed-by: Dan Hecht <dhecht@cloudera.com>	2016-02-11 19:03:51 +00:00
Skye Wanderman-Milne	039bd44fdf	IMPALA-2688: decimal codegen support in aggregations This patch implements codegen support for aggregations with decimal input and intermediate type. For the following benchmark query: SELECT l_discount, count(*) AS cnt FROM biglineitem GROUP BY l_discount HAVING cnt > 9999999999999 Query time went from 8.85s to 3.74s (2.4x faster). Change-Id: I25934fcd6324e5bf1fa6859496107bf2ec68b8d3 Reviewed-on: http://gerrit.cloudera.org:8080/2050 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-02-11 02:32:22 +00:00
Anuj Phadke	d787e1e3a7	IMPALA-2425: Broadcast join hint not enforced when low memory limit is set. Broadcast joins are disabled if the size of the rhs hash table exceeds the per node mem_limit. This change forces a broadcast join if the broadcast join hint is enforced. Change-Id: Iff9bd4d01736c48e52306ac79f74ab6ef0938f2a Reviewed-on: http://gerrit.cloudera.org:8080/1967 Reviewed-by: Huaisi Xu <hxu@cloudera.com> Tested-by: Internal Jenkins	2016-02-10 11:30:19 +00:00
Alex Behm	9a9886ee37	IMPALA-2950: Fully resolve exprs before wrapping with TupleIsNullPredicates. The bug: In SingleNodePlanner.createInlineViewPlan() we need to wrap some exprs with TupleIsNullPredicates to preserve correctness if the inline view is outer joined. The bug was that we used to perform this wrapping on the rhs of the inline view's smap, and not the final output smap after those rhs exprs have been resolved against the physical output of the inline view's plan root. As a result, the TupleIsNullWrapping did not work correctly for deeply nested inline views with exprs that require wrapping at various nesting levels. The fix: Resolve the exprs against the physical output of the inline view's plan root before performing the TupleIsNullPredicate wrapping. Change-Id: I183bba6a36bf5e19a88687ed8c82977ae769ddf4 Reviewed-on: http://gerrit.cloudera.org:8080/2092 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-10 07:16:58 +00:00
Michael Ho	40f75fb1ba	IMPALA-2925: Fix flaky tests in test_alloc_fail_update() test_alloc_fail_update() aims to stress memory allocation failure in the Update(), Serialize() and/or Finalize() functions of UDAs. However, this test included some UDFs which allocated memory in their Init() functions and not during their Update() functions. This change removes those UDFs from the test. Change-Id: I1ecc7e838e34ebc9ea3c878fee8ea2497b5fa23e Reviewed-on: http://gerrit.cloudera.org:8080/2005 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-02-10 00:54:11 +00:00
Lars Volker	f9c718e4ea	IMPALA-2959: Fix S3 failure caused by broken regex IMPALA-2862 fixed parsing for regular expressions in the result verifier. This change fixes a test that had a broken regular expression, which was not caught by the exhaustive test suite. I searched for tests with a similar issue but couldn't find any: git grep "regex:[^,]\+'" Change-Id: I3aaca6bdfdc1eaab715929aa5fc6b64e6c969656 Reviewed-on: http://gerrit.cloudera.org:8080/2089 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-08 20:14:05 +00:00
Alex Behm	ecf46a5af8	IMPALA-976: Improvements to scan and join cardinality estimates. 1. Improved join cardinality estimation. For each equi join predicate we try to determine whether it is a foreign/primary key (FK/PK) join condition, and either use a special FK/PK estimation or a generic estimation method. We maintain the minimum cardinality for each method separately, and finally return in order of preference: - the FK/PK estimate, if there was at least one FK/PK predicate - the generic estimate, if there was at least one predicate with sufficient stats - otherwise, we optimistically assume a FK/PK join with a join selectivity of 1, and return the left-hand side cardinality 2. More robust handling of conjuncts with unknown selectivities, and conjuncts that are not independent. Uses exponential backoff. 3. More accurate broadcast vs. partitioned join cost estimation. We now account for the 4 byte per-tuple overhead when serializing rows over an exchange. This change is especially helpful in cases where one side of the join has no materialized slots, i.e., it has a row size of 0, and an exchange used to appear free. We are obviously not done with improving join cardinality estimates. This patch is merely a step in the right direction, in particular, the code and behavior are now more explicit and easier to reason about than before, and better reflects the original intent (i.e., fixes the IMPALA-976 bug). Change-Id: I00d8e8230e2844cb807d128d82b35ee78db7d774 Reviewed-on: http://gerrit.cloudera.org:8080/1668 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-06 09:26:46 +00:00
Michael Ho	3d7a4477ee	IMPALA-2948: Fix a bug in the planner when fast partition key scan is enabled When the query option OPTIMIZE_PARTITION_KEY_SCANS is true, we may acquire the partition key values from the metadata and generate a union node containing constant expressions only. There is a bug in the planner when generating the union node as it skips evaluating the constant expressions for unmaterialized slots but union node expects an entry in the constant expression lists for each slot in the tuple descriptor even if the slot is not materialized. This change fixes the problem by inserting dummy null values in the constant expression list for unmaterialized slots and lets the union node filter them out. A test is also added to verify the fix. Change-Id: I9ed49dca0101b96bd9b20e6d1e5b1d56f654e911 Reviewed-on: http://gerrit.cloudera.org:8080/2067 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-06 05:28:28 +00:00
Alex Behm	733d135212	IMPALA-852,IMPALA-2215: Analyze HAVING clause before aggregation. In SelectStmt.analyzeAggregation(), we need to analyze the HAVING clause first so we can check if it contains aggregates. Also, we need to analyze/register it even if we are not computing aggregates. Change-Id: Ieedfb64bf9a8f1390c0231a8b4aa25120ee5542b Reviewed-on: http://gerrit.cloudera.org:8080/2066 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-06 01:31:34 +00:00
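
The pass-through rule stated in the IMPALA-1305 message (commit 2c2670e389) can be sketched roughly as follows. This is only an illustration in Python, not Impala's C++ implementation; the function name `should_pass_through` and its parameters are hypothetical, and only the two conditions (memory limit hit, or more than half of the consumed rows remaining after aggregation) come from the commit message.

```python
def should_pass_through(rows_consumed: int,
                        rows_in_hash_table: int,
                        hit_mem_limit: bool = False) -> bool:
    """Decide whether a streaming pre-aggregation should stop aggregating
    and pass input rows through instead.

    Hypothetical helper: the names are illustrative, not Impala's API.
    """
    # Condition 1 from the commit message: a memory limit was hit.
    if hit_mem_limit:
        return True
    # Condition 2: the aggregation is not reducing its input enough, i.e.
    # the hash table holds more than half as many rows as were consumed.
    return rows_in_hash_table * 2 > rows_consumed


# A GROUP BY on a nearly unique column barely reduces the input.
print(should_pass_through(rows_consumed=1000, rows_in_hash_table=900))
# A GROUP BY on a low-cardinality column reduces it well.
print(should_pass_through(rows_consumed=1000, rows_in_hash_table=100))
```

Because the commit keeps updating the hash tables with matching rows, a check of this kind could be re-evaluated as more input arrives, which is how an early pass-through decision can later be reversed.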

970 Commits