impala

mirror of https://github.com/apache/impala.git synced 2026-01-08 21:03:01 -05:00

Author	SHA1	Message	Date
casey	cef87e39dc	Updates for new Kudu toolchain layout and upgrade Kudu The directory structure of the newer Kudu toolchain artifacts has changed. Now the root directory is split into /release and /debug. A few little updates are needed to the build and service scripts. Since the toolchain no longer provides stubs for platforms that Kudu doesn't support the stubs need to be generated. This will be done as part of the toolchain bootstrapping. Also this upgrades Kudu to 0.8 RC1. Developers will need to run bin/create-test-configuration.sh after pulling in this change. Otherwise the Kudu service will fail to start. Change-Id: I625903bd92afece0ad819a96fc275d5812b5eb2a Reviewed-on: http://gerrit.cloudera.org:8080/2720 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:35 -07:00
Henry Robinson	c14a6f11df	IMPALA-3077: Enable runtime filters when PHJ spills This patch changes when runtime filters are produced in the partitioned hash-join node to allow filters to be produced even when the PHJ spills. Filters are now produced during the level0 processing of the PHJ's build-side input in ProcessBuildBatch(). Since this function is codegen'ed, so now is filter production. We use constant-propagation via constant argument injection to disable filter production at no cost when it is not needed (including in level1+ repartitioning). I inspected the IR to confirm that the constant propagation works as expected. This change also allows us to send filters earlier during build-side processing. A tradeoff is that filters are still built even if the expected FP rate is too high, although any too-permissive filters are still not sent to the scan (see 'Performance impact' below). The restriction that prevented filters from being computed inside a sub-plan is removed as part of this cleanup (since the FE handles assigning filters correctly in subplans), and a test is added to confirm that one of the correct cases for filters in subplans works. This patch also fixes a bug where re-partitioning beyond level0 would not use the codegen'ed implementation of ProcessBuildBatch(). A new test is added to test_runtime_row_filters, for Parquet only, which spills and confirms that filtering still occurs. Finally, the legacy --enable_phj_probe_side_filtering / --enable_probe_side_filtering flags have been deprecated, as runtime filtering can be permanently disabled via setting RUNTIME_FILTER_MODE=OFF. The implementation that the old flags referred to has been removed. 
Performance impact ------------------ We benchmark the performance loss due to always computing runtime filters even when the FP-rate will turn out to be too high as follows: select STRAIGHT_JOIN count(*) from (select id from functional.alltypes LIMIT 1) a JOIN [BROADCAST] (select * FROM p LIMIT 100000000) b on a.id = -b.id and b.part_col > 0 ('p' is a two-column Parquet table with 1B rows). This builds a 100M row build table (benchmarks run on one node). When filtering is enabled, the filter is built but selects all rows from the probe side (so that there's no benefit to having the filter, to emphasise the cost of building the filter in the first place). RUNTIME_FILTER_MODE Avg. time (s) over 5 runs OFF 18.95 GLOBAL 19.55 ------------------------------- Change +3% Change-Id: I59a2d9ee03ccea6b674392584e4c7f272233571e Reviewed-on: http://gerrit.cloudera.org:8080/2783 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-05-12 14:17:34 -07:00
Casey Ching	5387636140	IMPALA-3373: Computing stats on Kudu table duplicates the columns Computing stats caused the Kudu table to be reloaded in the catalog and the column definitions ended up getting appended to the existing ones. There was already a method to clear the column state, so now that is called during load(). Change-Id: I9ad42338750e9d8873a3584bc22a7cd7bd465c5d Reviewed-on: http://gerrit.cloudera.org:8080/2813 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:34 -07:00
Michael Brown	6d8c075d9c	IMPALA-1996: Start HBase per directions in documentation; Implement HBase startup retry I. Start HBase per directions 1. https://hbase.apache.org/book.html#_configuration_files mentions a 'regionservers' file that points to a list of hosts on which to start HBase RegionServers. When HBase starts in our mini-cluster there are messages printed like this: cat: /home/mikeb/Impala/fe/src/test/resources/regionservers: No such file or directory The presence of this file now starts a single RegionServer and takes the place of RegionServer 1 in the "additional region servers" startup, a separate call. 2. The additional RegionServers are started but now we only start 2 from index 2. See https://hbase.apache.org/book.html#quickstart_pseudo There are still 3 total RegionServers using the same ports as before. We are simply configuring our settings as directed in the documentation. There were mentions in testdata/bin/run-hbase.sh of a "hbase race". One possible such bug is https://issues.apache.org/jira/browse/HBASE-5780 which has been fixed for a while. I've removed the check to wait for that Master, though I have not removed the Python script that does the waiting. We could remove that later after we let this patch bake. Also, https://issues.apache.org/jira/browse/HBASE-4467 has been marked "not a problem", so I've removed references to that. II. Implement HBase start retry If starting either HBase Master or additional RegionServers fails, kill all of HBase and try again. Do this for some number of attempts. In order to keep errexit ("set -e") happy, we expect the possibility of some of the startup attempts failing. We use control flow in those cases. In the last case, errexit can fail on our behalf. There is some code duplication here, but because Bash can't give us a stack trace on failure, and only a line number, I chose not to use functions to handle reuse. We don't really have functions anywhere else at the moment, either. 
Testing: It's pretty difficult to try to trigger a real "HBase fails to start" situation. I tested my changes by faking HBase failures, both when starting up the Master and first RegionServer, and also starting subsequent RegionServers. Multiple private builds have passed. Change-Id: Ib1d055a8a9098ce24e2f31b969501b6e090eab19 Reviewed-on: http://gerrit.cloudera.org:8080/2804 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:33 -07:00
Jim Apple	1c16dd0cf8	IMPALA-2107: Add Base64 encoder/decoder Change-Id: I911451c5d68e8ae9d352abfcf4d5ff36484f0bf3 Reviewed-on: http://gerrit.cloudera.org:8080/2633 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:32 -07:00
Michael Ho	cbcda93dfb	IMPALA-3334: Fix some bugs in query options' parsing. This change fixes two problems: 1. The query options OPTIMIZE_PARTITION_KEY_SCANS and DISABLE_STREAMING_PREAGGREGATIONS are both boolean so they should accept 'true' and '1' as input values. Previously, these two options were treated as ints, so a value such as 'true' did not work with them. 2. The break statement in the case statement of the option SCAN_NODE_CODEGEN_THRESHOLD was 'stolen' by the option DISABLE_STREAMING_PREAGGREGATIONS when it was added. This change adds the missing break statement back for SCAN_NODE_CODEGEN_THRESHOLD. Change-Id: I5c74a1e5c49e3bda15a91b40740fc7310303207b Reviewed-on: http://gerrit.cloudera.org:8080/2776 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:31 -07:00
Jim Apple	83a1434bc9	IMPALA-2840: Don't store table location in partition location For a table with location "ABC", most partitions will have locations like "ABC/DEF=2". The "ABC" part of the location does not need to be stored in Catalog for each partition; we can compress it down to one int in the common case. This is done by stripping from each partition location the last N directories (where N is the number of clustering columns) and storing the resulting string in a cache of partition location prefixes. In the cache, this location prefix string is mapped to an int. Partition locations are then stored as a tuple consisting of that int and a suffix string; the partition location can be reconstructed as the concatenation of the prefix string (from the cache) and the suffix. Though this scheme was designed in the expectation that most partitions will be stored in directories like "/part_col_1=1.23/part_col_2=234/", it works even when that is not the case. TODO: Since each partition stores the literal values for the partitioning columns, we could also elide the column names and values when partitions are placed in directories like "/part_col_1=1.23/part_col_2=234/" Change-Id: I8c67b6ce0f83de2f5277a528a9ce67e47d638adb Reviewed-on: http://gerrit.cloudera.org:8080/2355 Reviewed-by: Jim Apple <jbapple@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:29 -07:00
Sailesh Mukil	c083c79888	IMPALA-3256: TestUdfs.test_libs_with_same_filenames failure There was an observed race between TestUdfs.test_java_udfs and TestUdfs.test_libs_with_same_filenames, because they both used the same database name. This patch just changes the name of the database used by test_libs_with_same_filenames. Change-Id: Icc38cbe720a3b9d864935416eb10612171132e17 Reviewed-on: http://gerrit.cloudera.org:8080/2767 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:29 -07:00
Lars Volker	ee8c309187	Fix typo in load-test-warehouse-snapshot.sh Change-Id: I2ef9b32cbc56819f80db864a6590a9a7b2732c9c Reviewed-on: http://gerrit.cloudera.org:8080/2310 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Todd Lipcon	4bdd0b976d	IMPALA-3148. Fix selectivity computation for pushed Kudu predicates This follows up on a TODO from the Kudu merge and also fixes a bug: IMPALA-976 changed the computation of selectivities for a combined list of conjuncts to better handle expressions with no selectivity estimate. The Kudu implementation was forked from before this change and thus did not have an equivalent change. This refactors the algorithm to a new static method and calls it from both PlanNode and KuduScanNode so that the selectivity estimate behavior is the same regardless of whether Kudu can evaluate the predicate server-side. Todd tested this on TPCH 3TB and verified that the plans are reasonable now where they used to be nonsense. Change-Id: Id507077b577ed5804fc80517f33ea185f2bff41a Reviewed-on: http://gerrit.cloudera.org:8080/2628 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:44 -07:00
Sailesh Mukil	86fd262dc9	IMPALA-3324: Hive server does not start for S3 builds. The hive server does not start for S3 builds because HDFS is marked as an unsupported service in testdata/cluster/admin; and so HDFS is not started at all, and so the Hive server is unable to start as well. Due to this, all our S3 builds fail. Currently our S3 builds need HDFS to run correctly. (This has to be reverted once IMPALA-1850 goes in, because then S3 can run as a default FS without HDFS) Change-Id: Ibda9dc3ef895c2aa4d39eb5694ac5f2dbd83bee4 Reviewed-on: http://gerrit.cloudera.org:8080/2741 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:43 -07:00
Casey Ching	9bb1b8a366	Kudu: Disable fsync in the mini-cluster The Kudu team recommended disabling this for testing purposes. This should help with timeouts in cloud machines (ec2/gce). Disabling fsyncs could lead to data loss if the system crashed before the OS had a chance to write the data to disk. Our test setups don't need that level of reliability. Change-Id: I72fd85ce5c4bc71f071b854ea6a9ebe60fc1305f Reviewed-on: http://gerrit.cloudera.org:8080/2734 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:03:43 -07:00
Casey Ching	9d43aac6ce	IMPALA-3274: Always start Kudu for testing Previously Kudu would only be started when the test configuration was the standard mini-cluster. That led to failures during data loading when testing without the mini-cluster (ex: local file system). Kudu doesn't require any other services so now it'll be started for all test environments. Change-Id: I92643ca6ef1acdbf4d4cd2fa5faf9ac97a3f0865 Reviewed-on: http://gerrit.cloudera.org:8080/2690 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-04-12 14:02:35 -07:00
Skye Wanderman-Milne	9b51b2b6e6	IMPALA-2835: introduce PARQUET_FALLBACK_SCHEMA_RESOLUTION query option This patch introduces a new query option, PARQUET_FALLBACK_SCHEMA_RESOLUTION which allows Parquet files' schemas to be resolved by either name or position. It's "fallback" because eventually field IDs will be the primary schema resolution scheme, and we don't want to create an option that we will have to change the name of later. The default is still by position. I chose to do a query option because it will make testing easier and also be easier to diagnose resolution problems quickly in the field. If users want to switch the default behavior to be by name (like Hive), they can use the --default_query_options flag. This patch also introduces a new test section, SHELL, which can be used to execute shell commands in a .test file. This is useful for copying files into test tables. Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6 Reviewed-on: http://gerrit.cloudera.org:8080/2384 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-04-02 04:04:25 +00:00
Skye Wanderman-Milne	2cbd327d41	Regenerate complextypestbl files to include nested_struct.g field This field was included in the schema and data files, but the checked-in generated parquet files didn't include it. It's not referenced in any tests so we didn't catch it. Change-Id: I5d394f074e7082fa12fafb7e57a144a83b3099a6 Reviewed-on: http://gerrit.cloudera.org:8080/2562 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-04-01 05:06:38 +00:00
Bharath Vissapragada	5cd7ada727	IMPALA-3194: Allow queries materializing scalar type columns in RC/sequence files This commit unblocks queries materializing only scalar typed columns on tables backed by RC/sequence files containing complex typed columns. This worked prior to 2.3.0 release. Change-Id: I3a89b211bdc01f7e07497e293fafd75ccf0500fe Reviewed-on: http://gerrit.cloudera.org:8080/2580 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-31 12:06:57 +00:00
Sailesh Mukil	49a73cd598	IMPALA-3249: Failed to mkdirs on core-local-filesystem build. This failure happens on filesystems other than HDFS because as a part of IMPALA-2466, the $FILESYSTEM_PREFIX was not added to the new directories that the patch tries to create in create-load-data. Change-Id: I8de74db93893c5273ccc9c687f608959628f5004 Reviewed-on: http://gerrit.cloudera.org:8080/2644 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-30 00:03:45 +00:00
Casey Ching	39a28185e8	Re-enable Kudu in build using client stubs when needed The stubs in Impala broke during the merge commit. This commit removes the stubs in hopes of improving robustness of the build. The original problem (Kudu clients are only available for some OSs) is now addressed by moving the stubbing into a dummy Kudu client. The dummy client only allows linking to succeed, if any client method is called, Impala will crash. Before calling any such method, Kudu availability must be checked. Change-Id: I4bf1c964faf21722137adc4f7ba7f78654f0f712 Reviewed-on: http://gerrit.cloudera.org:8080/2585 Reviewed-by: Casey Ching <casey@cloudera.com> Tested-by: Internal Jenkins	2016-03-29 23:57:54 +00:00
Alex Behm	b2ccb17c21	Print last 50 lines of log if data loading fails. The 20 lines we dump currently are often not enough to diagnose a failure quickly. Increasing to 50 lines. Printing 50 lines is also consistent with our run-step script which also prints 50 lines. Change-Id: I353a2030be6fad1cd63879b4717e237344f85c73 Reviewed-on: http://gerrit.cloudera.org:8080/2632 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 20:22:18 +00:00
Alex Behm	7e76e92bef	Consolidate test and cluster logs under a single directory. All logs, test results and SQL files generated during data loading and testing are now consolidated under a single new directory $IMPALA_HOME/logs. The goal is to simplify archiving in Jenkins runs and debugging. The new structure is as follows: $IMPALA_HOME/logs/cluster - logs of Hadoop components and Impala $IMPALA_HOME/logs/data_loading - logs and SQL files produced in data loading $IMPALA_HOME/logs/fe_tests - logs and test output of Frontend unit tests $IMPALA_HOME/logs/be_tests - logs and test output of Backend unit tests $IMPALA_HOME/logs/ee_tests - logs and test output of end-to-end tests $IMPALA_HOME/logs/custom_cluster_tests - logs and test output of custom cluster tests I tested this change with a full data load which was successful. Change-Id: Ief1f58f3320ec39d31b3c6bc6ef87f58ff7dfdfa Reviewed-on: http://gerrit.cloudera.org:8080/2456 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-28 19:23:22 +00:00
Sailesh Mukil	76b674850f	IMPALA-2466: Add more tests for the HDFS parquet scanner. These tests functionally test whether the following type of files are able to be scanned properly: 1) Add a parquet file with multiple blocks such that each node has to scan multiple blocks. 2) Add a parquet file with multiple blocks but only one row group that spans the entire file. Only one scan range should do any work in this case. Change-Id: I4faccd9ce3fad42402652c8f17d4e7aa3d593368 Reviewed-on: http://gerrit.cloudera.org:8080/1500 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-03-25 13:10:15 +00:00
Henry Robinson	0d1eab7a9e	IMPALA-3141: Send dummy filters when filter production is disabled The PHJ may disable runtime filter production for one of several reasons, including a predicted high false-positive rate. If the filters are not produced, any scans will wait for their entire timeout before continuing. This patch changes the filter logic to always send a filter, even if one wasn't actually produced by the PHJ. To preserve correctness, that filter must contain every element of the set. Such a filter is represented by (BloomFilter*)NULL. This allows us to make no changes to RuntimeFilter::Eval(), which already returns true if the member Bloom filter is NULL. In RPCs, a new field is added to TBloomFilter to identify filters that are always true. The HdfsParquetScanner checks to see if filters would always return true for any element, and disables them if so. There is some miscellaneous cleanup in this patch, particularly the removal of unused members in BloomFilter. This patch has been manually tested on queries that would otherwise take a long time to time-out. A unit test was added to ensure that queries do not wait. Change-Id: I04b3e6542651c1e7b77a9bab01d0e3d9506af42f Reviewed-on: http://gerrit.cloudera.org:8080/2475 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2016-03-24 23:17:50 +00:00
Henry Robinson	c06912ebb6	IMPALA-3226: Increase timeout for runtime filter tests When running with ASAN enabled, runtime filters may take a lot longer to be produced, triggering timeouts in the filter tests. This patch triples the timeout time. We still want the timeout to be reasonable as protection against excessive regressions in filter production time, which is why I've not set the timeout to a very large value, plus if the test fails and filters aren't produced we don't want to hang the build for a large timeout delay. Change-Id: Ife1d36a78d6ad587462fe112afda573f6e480441 Reviewed-on: http://gerrit.cloudera.org:8080/2609 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-24 07:59:53 +00:00
Henry Robinson	b3937295fb	Runtime filters tests This patch adds functional tests for runtime filters. It relies on setting RUNTIME_FILTER_WAIT_TIME_MS high enough to ensure that filters are received. To make the test files more readable, this patch also adds a new COMMENT section to the test syntax, and allows blank spaces between queries so that the separation of different test cases can be made more obvious. Currently missing is a test for disabling probe-side filters based on selectivity, as we lack suitable tables to trigger the disable condition. Change-Id: I94d617c6d23ffa394a6eb7ead56f1cfb701e0d90 Reviewed-on: http://gerrit.cloudera.org:8080/2603 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-03-23 04:07:14 +00:00
Thomas Tauber-Marshall	445c88339f	IMPALA-2738 Hive/Impala inconsistency in GRANT/REVOKE syntax Added the ability for the "GRANT/REVOKE ALL ON SERVER TO ROLE <role>" statement to optionally take a server name parameter as: "GRANT/REVOKE ALL ON SERVER <server> TO ROLE <role>" since Hive allows this. The specified server name is checked against the expected server name from the config during analysis, and an exception is thrown if they do not match. Change-Id: Id6c136d9a171ec062d4ff803682d026422497e8b Reviewed-on: http://gerrit.cloudera.org:8080/2296 Tested-by: Internal Jenkins Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>	2016-03-19 00:03:03 +00:00
Casey Ching	432a76e4dd	Temporarily disable Kudu support Change-Id: I9aeb808a9898972788cb1d5d071619d8c64b514c Reviewed-on: http://gerrit.cloudera.org:8080/2551 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-16 00:15:34 +00:00
Bharath Vissapragada	978d837758	IMPALA-3139: Fix drop table statement to not drop views and vice versa This commit fixes the following two issues - A drop table statement can drop a view with same name when "IF EXISTS" is specified. - A drop view statement can drop a table with same name when "IF EXISTS" is specified. This happens due to lack of checks in the Catalog before the drop executes. Change-Id: I0d35cd1f50d9b8d50223660f753c56529cbbc311 Reviewed-on: http://gerrit.cloudera.org:8080/2458 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-03-15 12:10:33 +00:00
Tim Armstrong	f5b7842414	IMPALA-2502: don't redundantly repartition grouping aggregations Grouping aggregations previously always repartitioned their input, even if preceding joins or aggs had already partitioned the data on the required key (or an equivalent key). This patch checks to see if data is already partitioned on the required exprs (or equivalent ones), and if so skips the preaggregation and only does a merge aggregation. The patch also does some refactoring of the aggregation planning in DistributedPlanner to make it easier to implement the change. Includes planner tests for the three cases that are affected: grouping aggregations, non-grouping distinct aggregations and grouping distinct aggregations. Change-Id: Iffdcfd3629b8a69bd23915e1adba3b8323cbbaef Reviewed-on: http://gerrit.cloudera.org:8080/2414 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-03-15 09:21:22 +00:00
David Alves	7381304a23	Merge branch 'feature/kudu' into cdh5-trunk This is the final merge commit that merges the 'feature/kudu' branch into cdh5-trunk. Change-Id: Ib3dfb4fc7a69c5cb1c5789422ee52fa192ed677a	2016-03-13 19:28:43 -07:00
casey	804cfbdd64	Get and use Kudu from the toolchain by default This is for review purposes only. This patch will be merged with David's big merge patch. Changes: 1) Make Kudu compilation dependent on the OS since not all OSs support Kudu. 2) Only run Kudu related tests when Kudu is supported (see #1). 3) Look for Kudu locally, but in a different location. To use a local build of Kudu, set KUDU_BUILD_DIR to the path Kudu was built in and set KUDU_CLIENT_DIR to the path Kudu was installed in. Example: git clone https://github.com/cloudera/kudu.git ...build 3rd party etc... mkdir -p $KUDU_BUILD_DIR cd $KUDU_BUILD_DIR cmake <path to Kudu source dir> make DESTDIR=$KUDU_CLIENT_DIR make install 4) Look for Kudu in the toolchain if not using a local Kudu build. 5) Add Kudu service startup scripts. The Kudu in the toolchain is actually a parcel that has been renamed (the contents were not modified in any way), which means the Kudu service binaries are there. Those binaries are now used to run the Kudu service. Change-Id: I3db88cbd27f2ea2394f011bc8d1face37411ed58	2016-03-11 11:38:05 -08:00
David Alves	82222abaf5	Merge branch 'feature/kudu' into cdh5-trunk This merges the 'feature/kudu' branch with cdh5-trunk as of commit: 055500cc753f87f6d1c70627321fcc825044e183 This patch is not a pure merge patch in the sense that it goes beyond conflict resolution to also address reviews to the 'feature/kudu' branch as a whole. The review items and their resolution can be inspected at: http://gerrit.cloudera.org:8080/#/c/1403/ Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92	2016-03-11 11:37:58 -08:00
Michael Ho	13007f9634	IMPALA-561: Allow multiple callbacks in a thread resource pool. Previously, the thread resource manager supported only a single callback for each resource pool. The callback is invoked when a thread token is available. This mostly works as the scan node is the only consumer and there is usually one scan node in a plan fragment. As shown in IMPALA-3064 and IMPALA-561, it's possible to generate a plan fragment with more than one scan node. In that case, one of the scan nodes may be running with a single thread and, in debug builds, a DCHECK will be hit. This change fixes the problem by allowing more than one callback in a given resource pool. The thread resource manager will go through all the registered callbacks in round robin fashion. This change also adds a missing thread token release call in HdfsScanNode::ThreadTokenAvailableCb(). Change-Id: Iddfff1feef0b59d407994ad3bc560166acbfa623 Reviewed-on: http://gerrit.cloudera.org:8080/2430 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-03-10 23:16:29 +00:00
Alex Behm	4a25f87d5c	Improve the SQL for nested TPCH-Q18. Marcel spotted that nested TPCH-Q18 can be expressed with more efficient SQL. Results on nested TPCH-300: Before 160s After 100s Change-Id: I8b351b7f467e8bef0c256dc43cea325d7f177edf Reviewed-on: http://gerrit.cloudera.org:8080/2418 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-03-04 04:35:54 +00:00
Alex Behm	54a46e9459	IMPALA-3065/IMPALA-3062: Restrict !empty() predicates to scan nodes. The bug: Evaluating !empty() predicates at non-scan nodes interacts poorly with our BE projection of collection slots. For example, rows could incorrectly be filtered if a !empty() predicate is assigned to a plan node that comes after the unnest of the collection that also performs the projection. The fix: This patch reworks the generation of !empty() predicates introduced in IMPALA-2663 for correctness purposes. The predicates are generated in cases where we can ensure that they will be assigned only by the parent scan, and no other plan node. The conditions are as follows: - collection table ref is relative and non-correlated - collection table ref represents the rhs of an inner/cross/semi join - collection table ref's parent tuple is not outer joined Change-Id: Ie975ce139a103285c4e9f93c59ce1f1d2aa71767 Reviewed-on: http://gerrit.cloudera.org:8080/2399 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Silvius Rus <srus@cloudera.com> Tested-by: Internal Jenkins	2016-03-02 23:23:05 -08:00
Tim Armstrong	6cdcdb12ff	Test for IMPALA-2987 Add a custom cluster test that tests for delays in registering data stream receivers. We add a stress option to artificially delay this registration to ensure that it can be handled correctly. Change-Id: Id5f5746b6023c301bacfa305c525846cdde822c9 Reviewed-on: http://gerrit.cloudera.org:8080/2306 Tested-by: Internal Jenkins Reviewed-by: Silvius Rus <srus@cloudera.com>	2016-03-02 23:23:04 -08:00
Alex Behm	a303f25256	IMPALA-3071: Fix assignment of On-clause predicates belonging to an inner join. The bug: On-clause predicates belonging to an inner join were not always assigned correctly if they referenced an outer-joined tuple. Specifically, our logic for detecting whether a predicate can be assigned below an outer join if also left at the outer-join node was not correct, and so we assigned the predicate below the join, but did not also leave it at the outer join. The fix: Assign an inner join On-clause conjunct that references an outer-joined tuple to the join that the On-clause belongs to. Change-Id: Iffef7718679d48f866fa90fd3257f182cbb385ae Reviewed-on: http://gerrit.cloudera.org:8080/2309 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-29 22:22:41 -08:00
Juan Yu	c9b33ddf63	IMPALA-1886/IMPALA-2154: Add support for multi-stream bz2/gzip compressed files. Fix a bug in which Impala only reads the first stream of a multi-stream bz2/gzip file. Changes the bz2 decoder to read the file in a streaming fashion rather than reading the entire file into memory before it can be decompressed. Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8 (cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e) Reviewed-on: http://gerrit.cloudera.org:8080/2219 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2016-02-28 21:31:37 -08:00
Dimitris Tsirogiannis	2c37d99fed	IMPALA-3089: Perform static partition pruning in the FE with disjunctive BETWEEN predicates This commit fixes an issue where the slow path is employed during static partition pruning for disjunctive BETWEEN predicates, introducing significant latency during planning, especially for tables with a large number of partitions. Change-Id: I66ef566fa176a859d126d49152921a176a491b0a Reviewed-on: http://gerrit.cloudera.org:8080/2320 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-26 15:37:24 -08:00
Alex Behm	5c0e1fa1e8	IMPALA-2974: Use Type.toSql() instead of toString() in ALTER TABLE CHANGE COLUMN. Change-Id: I140bdea755e44d3f2ceb4a8f5e288faaddaa963f Reviewed-on: http://gerrit.cloudera.org:8080/2285 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-26 15:37:24 -08:00
Dimitris Tsirogiannis	197eb43477	IMPALA-3074: AnalysisError when runtime filter has incompatible source and target exprs This commit fixes an issue where an AnalysisError is thrown when a runtime filter has incompatible source and target exprs. This is triggered when a runtime filter has multiple candidate target scan nodes not all of which produce a target expr which is cast-compatible with the source expr. Change-Id: I544c8fc66915f684ba24d20de525563638c4039d Reviewed-on: http://gerrit.cloudera.org:8080/2307 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 19:54:40 -08:00
Dimitris Tsirogiannis	d3b92b0d9f	IMPALA-3039: Restrict the number of runtime filters generated This commit adds a query option, MAX_NUM_RUNTIME_FILTERS, to restrict the number of runtime filters generated per query. If more than MAX_NUM_RUNTIME_FILTERS are generated, the runtime filters are sorted by the selectivity of the associated source join nodes and the MAX_NUM_RUNTIME_FILTERS most selective filters are applied. Also with this commit, non-selective filters are automatically discarded, irrespective of the value of MAX_NUM_RUNTIME_FILTERS. Change-Id: Ifd41ef6919a6d2b283a8801861a7179c96ed87c6 Reviewed-on: http://gerrit.cloudera.org:8080/2262 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 19:54:40 -08:00
Alex Behm	2c8f41b7d4	IMPALA-2832: Fix cloning of FunctionCallExpr. The bug was that we were not properly cloning the params of a FunctionCallExpr. In a CTAS we analyze the underlying query stmt twice, the first time on a clone of the original stmt. The problem was that the first analysis affected the second analysis due to an improper clone, leading to missing slots in a scan because the corresponding SlotRefs were already analyzed. Change-Id: I0025c0ee54b2f2cb3ba470b26a9de5aa5a3a3ade Reviewed-on: http://gerrit.cloudera.org:8080/2291 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 13:31:00 -08:00
Tim Armstrong	52362d4079	IMPALA-3047: separate create table test with nested types We need to skip queries that select from tables with nested types when running with the old aggs and joins. To achieve this, move the failing test to a separate test and use the skip decorator. Change-Id: Iaf1351c711b524be66a99084657926909425cbff Reviewed-on: http://gerrit.cloudera.org:8080/2272 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-02-24 13:31:00 -08:00
Alex Behm	a99e17457b	Fix a non-deterministic test in complex-types-file-formats.test. Change-Id: I98cc3045a6a6131dba8b0a475d5d51de7bdba455 Reviewed-on: http://gerrit.cloudera.org:8080/2268 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
Alex Behm	c6fd5a0fe4	IMPALA-2844: Allow count(*) on RC files with complex types. This patch also fixes the incorrect error message reported in the JIRA. Change-Id: I2c7b732767d154c36bc7189df5177d27a35d0d7b Reviewed-on: http://gerrit.cloudera.org:8080/2267 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
Alex Behm	8b32cbb904	IMPALA-2820: Support unquoted keywords as struct-field names. After this patch structs can be parsed/created with field names that are regular identifiers or keywords, even if unquoted. This fix is needed for parsing type strings stored in the Hive Metastore which could contain unquoted identifiers that correspond to Impala keywords. The parser changes required an upgrade of Cup and its Maven plugin. In the old version, the generated parser would not compile because of a giant method that exceeded the JVM maximum allowed size for a single method. Change-Id: Ic989c7afd034216f6db4c8f9f3901c025cceb524 Reviewed-on: http://gerrit.cloudera.org:8080/2249 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-02-22 20:16:24 -08:00
David Alves	2591a6718a	Handle booleans in the Kudu scanner We were missing handling booleans in the Kudu scanner, though we handled them in the sink. This patch fixes this issue and adds some tests. Change-Id: If8edbe85ae257c6374eddf757845c1ec917b1693	2016-02-22 13:49:10 -08:00
David Alves	af69097f19	IMPALA-2674 - Add support for VARCHAR to the backend In the frontend we support creating Kudu tables with VARCHAR but in the backend we don't handle it. Moreover we were swallowing the error in release mode, causing inserts to just skip values when this type is used. This patch adds support for VARCHAR, along with the corresponding tests. Change-Id: Ic734890e1b3aae2eef1e0a3d45a7561d02eeb917	2016-02-22 13:46:34 -08:00
Bharath Vissapragada	ef0dac661c	IMPALA-2843: Persist hive udfs across catalog restarts This commit adds a new feature to persist hive/java udfs across catalog restarts. IMPALA-1748 already added this for non-java udfs by storing them in the parameters map of the Db object and reading them back at catalog startup. However, we follow a different approach for hive udfs by converting them to Hive's function format and adding them as hive functions to the metastore. This makes it possible to share udfs between Hive and Impala, as the udfs added from one service are accessible to the other. This commit takes care of format conversions between Hive and Impala, so the user only needs to add a function once in either of the services. Background: Hive and Impala treat udfs differently. Hive resolves the evaluate function in the udf class at runtime depending on the data types of the input arguments. So the user can add one function by name and can pass any arguments to it as long as there is a compatible evaluate function in the udf class. However, Impala takes the input types of the udf as a part of the function definition (which maps to only one evaluate function) and loads the function only for that set of input argument types. If we have multiple 'evaluate' methods, we need to add multiple functions, one for each of them. This commit adds new variants of CREATE | DROP FUNCTION to Impala which let the user create and drop hive/java udfs without input argument types or return types. Catalog takes care of loading/dropping the udf signatures corresponding to each "evaluate" method in the udf symbol class. 
The syntax is as follows, CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts> DROP FUNCTION [IF EXISTS] <function name> Examples: CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf'; CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2'; DROP FUNCTION foo; DROP FUNCTION IF EXISTS bar; The older way of creating hive/java udfs with specific signature is still supported, however they are not persisted across restarts. So a restart of catalog can wipe them out. Additionally this commit also loads all the compatible java udfs added outside of Impala and they needn't be separately loaded. One thing to note here is that the functions added using the new CREATE FUNCTION can only be dropped using the new DROP FUNCTION syntax (without signature). The same rule applies for the java udfs added using the old CREATE FUNCTION syntax (with signature). Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d Reviewed-on: http://gerrit.cloudera.org:8080/2250 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 23:04:03 -08:00
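The commit message for IMPALA-2843 above says the catalog loads one signature per "evaluate" method in the udf symbol class. A rough way to picture that step is to enumerate the overloads with Java reflection. This is only a sketch under stated assumptions: `EvaluateSignatures`, `TestUdf`, and `signatures` are hypothetical names, the class does not extend Hive's real UDF base class so that the example stays self-contained, and the actual catalog code may work differently.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EvaluateSignatures {
    // Hypothetical udf class with two overloaded evaluate methods.
    // A real Hive udf would extend Hive's UDF base class instead.
    public static class TestUdf {
        public String evaluate(String s) {
            return s;
        }

        public Integer evaluate(Integer a, Integer b) {
            return a + b;
        }
    }

    // Builds one signature string per public "evaluate" overload, which is
    // the granularity at which Impala registers a function.
    public static List<String> signatures(String fnName, Class<?> udfClass) {
        List<String> result = new ArrayList<>();
        for (Method m : udfClass.getMethods()) {
            if (!m.getName().equals("evaluate")) {
                continue;
            }
            StringBuilder sb = new StringBuilder(fnName).append('(');
            Class<?>[] params = m.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) {
                    sb.append(", ");
                }
                sb.append(params[i].getSimpleName());
            }
            sb.append(") -> ").append(m.getReturnType().getSimpleName());
            result.add(sb.toString());
        }
        // getMethods() makes no ordering guarantee, so sort for stable output.
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        for (String sig : signatures("foo", TestUdf.class)) {
            System.out.println(sig);
        }
    }
}
```

Under these assumptions, a single `CREATE FUNCTION foo ... SYMBOL='TestUdf'` would correspond to two registered signatures, one for each overload, which is why the new `DROP FUNCTION foo` needs no signature to remove them all.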
Marcell Szabo	8135ef6eaa	IMPALA-2641: Add IF EXISTS clause to TRUNCATE TABLE statement Change-Id: I3169390b0e04f07fb4ea53d987d86a76482d7e9d Reviewed-on: http://gerrit.cloudera.org:8080/1905 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 14:08:58 +00:00

1260 Commits