Commit Graph

692 Commits

Skye Wanderman-Milne
68fef6a5bf IMPALA-2213: make Parquet scanner fail query if the file size metadata is stale
This patch changes the Parquet scanner to check whether it can read the
full footer scan range; a short read indicates that the file has been
overwritten by a shorter file without refreshing the table metadata.
Previously this case hit a DCHECK. This patch adds a test for this case,
as well as for the case where the new file is longer than the metadata
states (which already fails with an existing error).
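A minimal sketch of the kind of check described above (illustrative only; the
function and variable names are hypothetical, not the actual HdfsParquetScanner
code, which returns a Status rather than throwing):

  #include <cstdint>
  #include <stdexcept>
  #include <string>

  // Sketch: if we could not read the full footer scan range, the file on disk
  // is shorter than the catalog metadata claims, so fail the query with a
  // clear error instead of hitting a DCHECK.
  void ValidateFooterRead(int64_t footer_bytes_read,
                          int64_t footer_bytes_expected,
                          const std::string& filename) {
    if (footer_bytes_read < footer_bytes_expected) {
      throw std::runtime_error(
          "File '" + filename + "' is shorter than expected from the table "
          "metadata; the file may have been overwritten. Run REFRESH and "
          "retry the query.");
    }
  }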

Change-Id: Ie2031ac2dc90e4f2573bd3ca8a3709db60424f07
Reviewed-on: http://gerrit.cloudera.org:8080/1084
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-10-01 13:58:39 -07:00
Juan Yu
6bac14a283 IMPALA-2005: Cleanup the newly created table if CTAS fails.
If a CTAS query fails during the DML part, Impala should drop the newly
created table.

Change-Id: I39e04a6923a36afa48f3252addd50ddda83d1706
(cherry picked from commit e03ce43585f68590a95038341e74db458f34bf32)
Reviewed-on: http://gerrit.cloudera.org:8080/870
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:38 -07:00
Skye Wanderman-Milne
0c5e6a804f IMPALA-2443: add support for more Parquet array encodings
This patch adds full support for the various Parquet array encodings,
as well as tests that use files from
https://github.com/apache/hive/tree/master/data/files. This should
allow us to read any existing array data.

Change-Id: I3d22ae237b1dc82ee75a83c1d4890d76316fadee
Reviewed-on: http://gerrit.cloudera.org:8080/826
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:37 -07:00
Dimitris Tsirogiannis
6c9b93973a IMPALA-2441: CREATE DATABASE IF NOT EXISTS may cause NPE
This commit fixes an issue where a CREATE DATABASE IF NOT EXISTS
statement would cause an NPE if the database already exists in the
Hive MetaStore but not in the Impala catalog. With this fix, no error is
thrown if the database exists in the HMS, and the database is added to the
catalog.

Change-Id: If1d15bb50869ce8084e0443f119a596b365004c7
Reviewed-on: http://gerrit.cloudera.org:8080/1091
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-01 13:58:36 -07:00
Tim Armstrong
96d93083fe Disable nested TPCH tests for old aggs and joins
We do not support nested types combined with the old aggs and joins.

Change-Id: I81401dd4d482d46e678091989ac9d178ac771d01
Reviewed-on: http://gerrit.cloudera.org:8080/1078
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:48 -07:00
Matthew Jacobs
851056489d IMPALA-2440: Fix old HJ full outer join with no rows
When performing a full outer join with the old (non-partitioned)
HashJoinNode, any join fragment with 0 build rows and 0 probe rows
would produce an extra null row.

Change-Id: I75373edc4f6b3b0c23afba3c1fa363c613f23507
Reviewed-on: http://gerrit.cloudera.org:8080/1068
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:47 -07:00
Juan Yu
7c498627f6 IMPALA-2249: Avoid allocating StringBuffer > 1GB in ScannerContext::Stream::GetBytesInternal()
Due to IMPALA-1619, allocating a StringBuffer larger than 1GB could
cause Impala to crash. Check the requested buffer size in advance and
fail the request if it is larger than 1GB. Once IMPALA-1619 is
fixed, we should revert this change.
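A sketch of the guard being described (hypothetical names; the real check lives
in ScannerContext::Stream::GetBytesInternal() and returns an error Status):

  #include <cstdint>
  #include <stdexcept>
  #include <string>

  // Sketch: refuse any string-buffer allocation request above 1GB until
  // IMPALA-1619 is fixed.
  constexpr int64_t kMaxStringBufferSize = 1LL << 30;  // 1GB

  void CheckRequestedBufferSize(int64_t requested_bytes) {
    if (requested_bytes > kMaxStringBufferSize) {
      throw std::runtime_error("Requested buffer size " +
                               std::to_string(requested_bytes) +
                               " bytes exceeds the 1GB limit.");
    }
  }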

Change-Id: Iffd1e701614b520ce58922ada2400386661eedb1
(cherry picked from commit 74ba16770eeade36ab77c86ed99d9248c60b0131)
Reviewed-on: http://gerrit.cloudera.org:8080/869
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:46 -07:00
Tim Armstrong
fbec3f65a0 Skip nested types tests with old aggs and joins
We don't support nested types combined with the old aggs and joins. This
patch disables the nested type query tests when the old aggs or joins
are enabled with the TEST_START_CLUSTER_ARGS environment variable.

Change-Id: I6579a0a245359d4d2ff955c399d1296580c9676e
Reviewed-on: http://gerrit.cloudera.org:8080/1046
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-30 17:17:39 -07:00
Matthew Jacobs
2478f05cb3 IMPALA-2375: Unblock old hj/agg test runs
Move a very expensive semi-join test case to run only on
exhaustive so that it is not run as part of the old hj/agg
jenkins runs where it fails.

Change-Id: I4a0f915e894ceac91d86b366876e47e9cc87255a
Reviewed-on: http://gerrit.cloudera.org:8080/930
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-27 15:13:32 -07:00
Martin Grund
579be1c542 IMPALA-2284: Disallow long (1<<30) strings in group_concat()
This is the first step to fix issues with large memory allocations. In
this patch, the built-in `group_concat` is no longer allowed to allocate
arbitrarily large strings and crash Impala, but is limited to the upper
bound of possible allocations in Impala.

This patch does not perform any functional change, but rather avoids
unnecessary crashes. However, it changes the parameter type of
FindChunk() in MemPool to a signed 64-bit integer. This change allows
the MemPool to internally allocate more than 1GB of memory, but the
public interface of Allocate() is not changed, so the general limitation
remains. The reason for this change is as follows:

  1) In a UDF, FunctionContext::Reallocate() would allocate slightly more
  than 512MB from the FreePool.
  2) The free pool tries to double this size to allocate 1GB from the
  MemPool.
  3) The MemPool doubles the size again and overflows the signed 32-bit
  integer in the FindChunk() method. This will then only allocate 1GB
  instead of the expected 2GB.

The result is that a caller expected a larger allocation than actually
happened, which in turn leads to memory corruption as soon as the memory
is accessed.
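The overflow can be seen in a small standalone example (an illustration of the
32-bit wraparound only, not the actual MemPool code):

  #include <cstdint>
  #include <iostream>

  int main() {
    // A UDF reallocation asks for slightly more than 512MB.
    int64_t udf_request = (1LL << 29) + 16;
    // The FreePool doubles the request when growing: ~1GB.
    int64_t free_pool_request = udf_request * 2;
    // The MemPool doubles again: ~2GB, which no longer fits in a signed
    // 32-bit parameter (INT32_MAX is 2^31 - 1). The narrowed value wraps,
    // so the pool hands back far less memory than the caller expects and
    // the memory is later accessed out of bounds.
    int64_t chunk_size_64 = free_pool_request * 2;
    int32_t chunk_size_32 = static_cast<int32_t>(chunk_size_64);
    std::cout << "64-bit chunk size: " << chunk_size_64 << "\n"
              << "32-bit chunk size: " << chunk_size_32 << "\n";
    return 0;
  }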

Change-Id: I068835dfa0ac8f7538253d9fa5cfc3fb9d352f6a
Reviewed-on: http://gerrit.cloudera.org:8080/858
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
2015-09-23 15:15:55 -07:00
Ippokratis Pandis
48699de6e3 IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ
PAGG and PHJ were using an all-or-nothing approach to spilling. In
particular, they were trying to switch to IO-sized buffers for both
streams (aggregated and unaggregated in PAGG; build and probe in PHJ)
of every partition (currently 16 partitions for a total of 32
streams), even if some of the streams had very few rows, were empty,
or simply would not spill, so there was no need to allocate
IO buffers for them. That increased the min mem needed by those
operators in many queries.

This patch decouples the decision to switch to IO buffers for each
stream of each partition. Streams will switch to IO-sized buffers
whenever the rows they contain do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both of its streams to IO buffers.

With this change, many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the min mem requirement. For example,
below is the min mem needed (in MBs) for some of the TPC-H queries.
Some need half or less of the memory they needed before:

  TPC-H Q3: 645 -> 240
  TPC-H Q5: 375 -> 245
  TPC-H Q7: 685 -> 265
  TPC-H Q8: 740 -> 250
  TPC-H Q9: 650 -> 400
  TPC-H Q18: 1100 -> 425
  TPC-H Q20: 420 -> 250
  TPC-H Q21: 975 -> 620

To make this small buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call of
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. Previously we
would treat that as an indication that there is no memory left, start
spilling a partition and switch all streams to IO buffers. Now we make
a best effort: we first try SwitchToIoBuffers() and, if that is
successful, we re-attempt the AllocateRow() call. See IMPALA-2352 for
more details.
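In sketch form, the best-effort path looks roughly like this (simplified
placeholder types and methods, not the actual PartitionedAggregationNode code):

  // Sketch: if AllocateRow() fails while the stream is still on small
  // buffers, try upgrading the stream to IO-sized buffers and retry once
  // before treating the failure as real memory pressure (i.e. spilling).
  struct StreamSketch {
    bool using_small_buffers = true;
    bool SwitchToIoBuffers() { using_small_buffers = false; return true; }
    void* AllocateRow(int size) { return nullptr; }  // placeholder
  };

  void* AllocateIntermediateTuple(StreamSketch* stream, int size,
                                  bool* needs_spill) {
    void* row = stream->AllocateRow(size);
    if (row == nullptr && stream->using_small_buffers) {
      if (stream->SwitchToIoBuffers()) row = stream->AllocateRow(size);
    }
    // Only if the retry also fails do we fall back to spilling a partition.
    *needs_spill = (row == nullptr);
    return row;
  }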

Another change is that now SwitchToIoBuffers() will reset the flag
using_small_buffers_ back to false, in case we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.

With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271, which are essentially stream::using_small_buffers_-related
DCHECKs.

This patch adds all 22 TPC-H queries to the test_mem_usage_scaling test
and updates the per-query min mem limits in it. Additionally, it adds
a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some
dead test code.

Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:

	tests/query_test/test_aggregation.py
2015-09-23 11:07:42 -07:00
Ippokratis Pandis
4d5ee2b3a2 IMPALA-2364: Wrong DCHECK in PHJ::ProcessProbeBatch
There was a DCHECK in PHJ::ProcessProbeBatch() that expected the
state of the PHJ to be PROCESSING_PROBE. It turns out we can hit the
same DCHECK when we are in the REPARTITIONING phase.
This patch fixes the DCHECK. It also adds TPC-DS Q53 to the
test_mem_usage_scaling test (along with the needed refactoring in this
test) because TPC-DS Q53 hit this DCHECK in an endurance test.

Change-Id: I37f06e1bfe07c45e4a6eac543934b4d83a205d28
Reviewed-on: http://gerrit.cloudera.org:8080/893
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-09-23 10:38:58 -07:00
aacalfa
57dd4d1502 IMPALA-1309: Add support for distinct in group_concat function.
Change-Id: I2790f1d2a7bfd0ecc7ef66cc5d91dafe3414e111
Reviewed-on: http://gerrit.cloudera.org:8080/892
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-09-23 09:42:17 +00:00
Szehon Ho
0c574394d6 Data_generator: add a new mode "populate_existing" to populate existing tables.
Example use case: generate data for any given DDL to reproduce issues.

The general strategy is to read the table's metadata, create a temp table with
that schema, insert text data, and then insert into the target table, similar
to the Parquet code path.

Change-Id: I4d512a80bc0accf4c243587f6246d9f63fda9149
Reviewed-on: http://gerrit.cloudera.org:8080/877
Reviewed-by: Szehon Ho <szehon@cloudera.com>
Tested-by: Internal Jenkins
2015-09-22 10:58:33 -07:00
Tim Armstrong
db7519df24 IMPALA-2207: memory corruption on build side of NLJ
The NLJ node did not follow the expected protocol when need_to_return
is set on a row batch, which means that memory referenced by a row batch
can be freed or reused the next time GetNext() is called on the child.

This patch changes the NLJ node to follow the protocol by deep copying
all build side row batches when the need_to_return_ flag is set on the
row batches. This prevents the row batches from referencing memory that
may be freed or reused.
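A rough sketch of the protocol fix (simplified stand-in types; the real code
uses Impala's RowBatch and its deep-copy machinery):

  #include <memory>
  #include <utility>
  #include <vector>

  // Sketch: while accumulating build-side batches, honor the need_to_return
  // protocol by deep copying any batch whose memory the child may free or
  // reuse on its next GetNext() call.
  struct RowBatchSketch {
    bool need_to_return = false;
    std::unique_ptr<RowBatchSketch> DeepCopy() const {
      // Copy tuple data into memory owned by the join node so later
      // GetNext() calls on the child cannot invalidate it.
      return std::make_unique<RowBatchSketch>(*this);
    }
  };

  void AddBuildBatch(std::vector<std::unique_ptr<RowBatchSketch>>* build_batches,
                     std::unique_ptr<RowBatchSketch> batch) {
    if (batch->need_to_return) {
      build_batches->push_back(batch->DeepCopy());
    } else {
      build_batches->push_back(std::move(batch));
    }
  }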

Re-enable the test that was disabled because of IMPALA-2332, since this
was the root cause.

Change-Id: Idcbb8df12c292b9e2b243e1cef5bdfc1366898d1
Reviewed-on: http://gerrit.cloudera.org:8080/810
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-22 10:58:32 -07:00
ishaan
e408560c56 Perf Framework: Move exec functions to a separate file and deprecate Hive execution.
This patch does the following:
  - Removes code that deals with executing queries through Hive.
  - Gives the user the option to specify only the hostname for the Impalads.
  - Moves the execution functions to their own .py file.
  - Removes some duplicate code (exec_shell_cmd -> exec_process)

Change-Id: If49951c7bb5423ef9343d4d211f6da13d397325a
Reviewed-on: http://gerrit.cloudera.org:8080/862
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-09-22 10:58:32 -07:00
Alex Behm
41ef3a216d Nested Types: Add functional tests.
This patch adds basic end-to-end functional tests for nested types:
1. For exercising the Reset() of exec nodes when inside a subplan.
2. For asserting correct behavior when row batches with collection-typed
   slots flow through exec nodes.

Most cases are covered, but there are a few known issues that prevent
full coverage. The remaining tests will be added as part of the fixes
for those existing JIRAs.

Change-Id: I0140c1a32cb5edd189f283c68a24de8484b3f434
Reviewed-on: http://gerrit.cloudera.org:8080/823
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-14 13:43:01 -07:00
Alex Behm
361da01152 Fail queries that require a SubplanNode when using legacy joins and aggs.
We will not provide full nested types support if any of these options
are set:

--enable_partitioned_aggregation=false
--enable_partitioned_hash_join=false

Change-Id: I0f8607914faf9691d5f7b1a4327609fefba22e56
Reviewed-on: http://gerrit.cloudera.org:8080/792
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-09-10 04:50:31 +00:00
Tim Armstrong
235a8d08da IMPALA-2295: deep copy arrays in BufferedTupleStream
This was unimplemented and is used on some code paths. Arrays were not
properly copied into the BufferedTupleStream, potentially leaving stray
pointers to invalid or reused memory. Arrays are now correctly deep
copied. Includes a unit test that copies rows containing arrays in and
out of a BufferedTupleStream.

Also implements a matching optimization for deep copy in RowBatch.

Change-Id: I75d91a6b439450c5b47646b73bc528cfb8f7109b
Reviewed-on: http://gerrit.cloudera.org:8080/751
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-09 02:39:14 +00:00
Alex Behm
dbb40c7922 Nested Types: Add end-to-end tests running nested TPCH on Parquet.
Change-Id: I2a3c46ea50e53479f2f91c175c45e2da3c1c7025
Reviewed-on: http://gerrit.cloudera.org:8080/740
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-06 20:49:42 +00:00
Henry Robinson
270f12b09a Fix race condition in test_statestore changes
KillableThreadedServer.port is required to be set for
KillableThreadedServer.wait_until_up() to execute. However, it was being
set during serve() which was run concurrently with wait_until_up() (see
StatestoreSubscriber.__init_server()), so sometimes it was not set. The
fix is to set KillableThreadedServer.port during construction (it is set
by the underlying socket as soon as it is constructed itself).

Change-Id: Ib9ca9e237bca96635f5ee5b5bbfb7fd678929ce4
Reviewed-on: http://gerrit.cloudera.org:8080/759
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-05 09:44:55 +00:00
Henry Robinson
956c9d74fa Ensure that test_statestore always uses free ports
test_statestore.py needs a lot of ports to run subscriber servers
on. Before this patch, we'd find a free port by binding to port 0,
finding what port the OS actually used, and closing the original socket
before passing that port to the actual server to start up. However, this
was obviously racy if some other process was also looking for a free
port at the same time.

This patch moves the bind logic into the Thrift server socket itself, so
that there's no close->open race window between binding to the port and
actually wanting to use it.

I wasn't able to reproduce the issue on my local machine, but this
diagnosis fits the problems we've seen.

Change-Id: Idfbbe71f596ff5a7c3f4ff33b5edd565648d8e59
Reviewed-on: http://gerrit.cloudera.org:8080/754
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-09-04 22:44:20 +00:00
Alex Behm
d48ec4b8b3 IMPALA-2289: Properly handle AtCapacity() in SubplanNode.
After this patch we get correct results for nested TPCH Q13.

The bug: Since we were not properly handling AtCapacity() of the output
batch in SubplanNode, we sometimes passed a row batch that was already
at capacity into GetNext() on the second child of the SubplanNode.
In this particular case, that batch was passed into the NestedLoopJoinNode
which may return incomplete results if the output batch is already
at capacity (e.g., ProcessUnmatchedBuildRows() was not called).

The fix is to return from SubplanNode::GetNext() if the output batch
is at capacity due to resources being transferred to it from the input
batch used to fetch from the first child.
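Sketched in simplified form (placeholder types and method names, not the
actual SubplanNode implementation):

  // Sketch: bail out of GetNext() early when the output batch is already at
  // capacity, instead of asking the second child (e.g. a NestedLoopJoinNode)
  // to append rows to a full batch and possibly return incomplete results.
  struct RowBatchSketch {
    bool at_capacity = false;
    bool AtCapacity() const { return at_capacity; }
  };

  struct SubplanNodeSketch {
    bool FetchFromFirstChild(RowBatchSketch* batch) { return true; }
    bool EvaluateSecondChild(RowBatchSketch* batch) { return true; }

    bool GetNext(RowBatchSketch* output_batch) {
      if (!FetchFromFirstChild(output_batch)) return false;
      // Resources from the input batch may have been transferred to
      // output_batch, filling it; return now and continue on the next call.
      if (output_batch->AtCapacity()) return true;
      return EvaluateSecondChild(output_batch);
    }
  };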

Change-Id: Ib97821e8457867dc0d00fd37149a3f0a75872297
Reviewed-on: http://gerrit.cloudera.org:8080/742
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-04 20:26:52 +00:00
Tim Armstrong
4ac7e5d15d Disable nested types tests affected by IMPALA-2295
IMPALA-2295 causes tests combining paggs/phjs with collection types to
intermittently fail because of memory corruption. This affects
non-scanner nested types tests.

Change-Id: I63893fbde87189485455cf95a7f63eb7e8aa95f3
Reviewed-on: http://gerrit.cloudera.org:8080/747
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-04 05:38:17 +00:00
Szehon Ho
5056431f29 Standardize all comparison test HS2 connections to use Impyla.
With Impyla PR#108, Impyla now supports Hive's default PLAIN auth mode.
This change gets rid of pyHS2 and standardizes all the connections to use Impyla.

Change-Id: Ifd3bd212595753ed5e0591105802ec094a41d8af
Reviewed-on: http://gerrit.cloudera.org:8080/739
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-09-03 23:35:26 +00:00
Casey Ching
8586ec5280 Python: Switch query generator back to psycopg2 driver
The pg8000 driver currently in use doesn't work well with autocommit
enabled. When enabled, the driver complains that only 100 rows can be
fetched because of a buffer/cache limitation. With autocommit disabled,
a bunch of code changes would be needed. For now the previous psycopg2
driver will be used again. (The psycopg2 driver was originally replaced
because dev postgres libraries are required to build it, so building the
virtualenv would fail. This patch doesn't try to build psycopg2 and
instead assumes it was already installed.)

Change-Id: I6901cb1fa109d6da907b1415601116d833d66656
Reviewed-on: http://gerrit.cloudera.org:8080/737
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-09-03 19:23:39 +00:00
Skye Wanderman-Milne
bcc73a36da Nested types: read and materialize nested types in Parquet scanner
This patch modifies the Parquet scanner to resolve nested schemas, and
read and materialize collection types. The high-level modification is
to create a CollectionColumnReader that recursively materializes map-
and array-type slots.

This patch also adds many tests, most of which query a new table
called complextypestbl. This table contains hand-generated data that
is meant to expose edge cases in the scanner. The tests mostly test
the scanner, with a few tests of other functionality (e.g. array
serialization).

I ran a local benchmark comparing this scanner code to the original
scanner code on an expanded version of tpch_parquet.lineitem with
48009720 rows. My benchmark involved selecting different numbers of
columns with a single scanner thread, and I looked at the HDFS scan
node time in the query profiles. This code introduces a 10%-20%
regression in single-threaded scan time.

Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a
Reviewed-on: http://gerrit.cloudera.org:8080/576
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-02 19:23:54 +00:00
Martin Grund
f927a285c6 IMPALA-1136, IMPALA-2161: Skip \u0000 characters when dealing with Avro schemas
The Avro JSON library's limitation of not handling \u0000 characters
exists to avoid problems with built-in functions like strlen() that would
report the wrong length when encountering such a character. In the case
of Impala, we currently don't support any Unicode characters, which
allows us to simply skip the \u0000 character instead of interpreting it.

It is important to say that even the most recent versions of Avro do
not support parsing \u0000 characters.
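Conceptually, the skipping amounts to something like this (a standalone sketch,
not the actual Avro or Impala code):

  #include <string>

  // Sketch: drop embedded NUL (\u0000) characters from an Avro JSON schema
  // string before handing it to strlen()-style helpers that cannot handle
  // them, rather than trying to interpret them.
  std::string StripNulChars(const std::string& schema_json) {
    std::string out;
    out.reserve(schema_json.size());
    for (char c : schema_json) {
      if (c != '\0') out.push_back(c);
    }
    return out;
  }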

Change-Id: I56dfa7f0f12979fe9705c51c751513aebce4beca
Reviewed-on: http://gerrit.cloudera.org:8080/712
Tested-by: Internal Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2015-09-02 00:37:28 +00:00
Martin Grund
fa5eca09c8 Disable HDFS file handle caching by default
This patch modifies the Impala command line flag:

     --max_cached_file_handles=VAL

to disable caching of HDFS file handles if VAL is 0.

In addition, it moves the existing functional tests to a custom cluster
test and keeps a sanity check for no caching in the original
place. Furthermore, it will check that no file handles are leaked.

Change-Id: Ic36168bba52346674f57639e1ac216fd531b0fad
Reviewed-on: http://gerrit.cloudera.org:8080/691
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-08-27 23:34:30 +00:00
Martin Grund
60c5140ea7 IMPALA-1983: Warn if table stats are potentially corrupt.
When the `numRows` parameter stored in the table properties is
erroneously set to 0 and a number of non-empty files are present,
the table statistics are considered to be corrupt.

To hint that there might be a problem, the explain statement will emit
an additional warning if it detects potentially corrupt table stats, as
in the following example:

  Estimated Per-Host Requirements: Memory=42.00MB VCores=1
  WARNING: The following tables have potentially corrupt table and/or
  column statistics.
  compute_stats_db.corrupted

  03:AGGREGATE [FINALIZE]
  |  output: count:merge(*)
  |
  02:EXCHANGE [UNPARTITIONED]
  |
  01:AGGREGATE
  |  output: count(*)
  |
  00:SCAN HDFS [compute_stats_db.corrupted]
     partitions=1/2 files=1 size=24B

In addition, the small query optimization is disabled for such queries.

Change-Id: I0fa911f5132aa62195b854248663a94dcd8b14de
Reviewed-on: http://gerrit.cloudera.org:8080/689
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-08-26 22:19:33 +00:00
Szehon Ho
dcad11c3af Support Parquet file format for the data generator when running against Hive
Also add DDL logging to facilitate debugging.

Change-Id: If7600677ed8c491b68468cae4ddf5394499576ca
Reviewed-on: http://gerrit.cloudera.org:8080/688
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-08-26 10:10:32 +00:00
Taras Bobrovytsky
b8b7930377 Add nested types support to Create Table Like File
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.

Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 01:46:26 +00:00
Szehon Ho
5787dc2cf3 First commit to run the random query generator on Hive.
With this change, random query generator can run continuously on Hive
and approximately half of its generated queries are able to run.

1. Connect timeout from Impyla to HS2 was too small,
increasing it to match Impala's.
2. Query timeout to wait for Hive queries was too short,
making it configurable so we can play with different values.
3. Hive does not support 'with' clause in subquery,
but interestingly supports it at the top-level.
Added a profile flag "use_nested_with" to disable nested with's.
4. Hive does not support 'having' without 'group by'.
Added a profile flag "use_having_without_groupby" to always
generate a group by with having.
5. Hive does not support "interval" keyword for timestamp.
Added a profile 'restrict' list to restrict certain functions,
and added 'dateAdd' to this list for Hive.
6. Hive's 'greatest' and 'least' UDFs do not do implicit type casting
like other databases do.  Modified the query generator to only choose args of
the same type for these, and HiveSqlWriter to add a cast, as there
were still some lingering issues like UDFs on int returning bigint.
7. Hive always orders NULLs first in ORDER BY ASC,
opposite to other databases,
and does not have any 'NULLS FIRST' or 'NULLS LAST' option.
Thus the only workaround is to add a "nulls_order_asc" flag
to the profile, and pass it in to the ref database's SqlWriter
to generate the 'NULLS FIRST' or 'NULLS LAST' statement on that end.
8. Hive strangely does not support multiple sort keys in a window
without frame specification.  The workaround is for HiveSqlWriter
to add 'rows unbounded preceding' to specify the default frame if
there are no existing frames.

Change-Id: I2a5b07e37378f695de1b50af49845283468b4f0f
Reviewed-on: http://gerrit.cloudera.org:8080/619
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-08-21 08:19:04 +00:00
Vlad Berindei
e4c42fa8bf IMPALA-595: Add CASCADE to DROP DATABASE and use it in cleanup_db
Change-Id: Idfa5b6943bc797e10d542487c31b8f1b527d8c97
Reviewed-on: http://gerrit.cloudera.org:8080/635
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-08-20 03:34:31 +00:00
Skye Wanderman-Milne
7906ed44ac IMPALA-2015: Add support for nested loop join
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.

Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.
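The core execution strategy is the classic nested loop (a generic sketch of the
inner-join case, not the Impala backend implementation):

  #include <functional>
  #include <utility>
  #include <vector>

  // Generic nested-loop join: for every probe row, scan all buffered build
  // rows and emit each matching pair. The outer, semi and anti modes differ
  // only in how matched/unmatched rows are emitted.
  template <typename Row>
  std::vector<std::pair<Row, Row>> NestedLoopInnerJoin(
      const std::vector<Row>& probe_rows, const std::vector<Row>& build_rows,
      const std::function<bool(const Row&, const Row&)>& matches) {
    std::vector<std::pair<Row, Row>> result;
    for (const Row& probe : probe_rows) {
      for (const Row& build : build_rows) {
        if (matches(probe, build)) result.emplace_back(probe, build);
      }
    }
    return result;
  }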

Change-Id: I238ec7dc0080f661847e5e1b84e30d61c3b0bb5c
Reviewed-on: http://gerrit.cloudera.org:8080/652
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-19 08:40:14 +00:00
Tim Armstrong
5350d49f8c IMPALA-1829: UDAs with different intermediate type
Previously the frontend rejected UDAs whose intermediate type differs
from their result type. The backend supports these, so this change
enables support in the frontend and adds tests.

This patch adds a test UDA function with different intermediate type and
a simple end-to-end test that exercises it. It modifies an existing
unused test UDA that used a currently unsupported intermediate type -
BufferVal.
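For illustration, here is a UDA in the style of the Impala UDF/UDA samples in
which the intermediate type (a StringVal wrapping a fixed-size struct) differs
from the result type (DoubleVal); the names and details are illustrative and
may differ from the test UDA added by this patch:

  #include <impala_udf/udf.h>

  using namespace impala_udf;

  // The intermediate value is a StringVal holding this struct; the final
  // result returned to the query is a DoubleVal.
  struct AvgState {
    double sum;
    int64_t count;
  };

  void AvgInit(FunctionContext* ctx, StringVal* dst) {
    dst->is_null = false;
    dst->len = sizeof(AvgState);
    dst->ptr = ctx->Allocate(dst->len);
    *reinterpret_cast<AvgState*>(dst->ptr) = {0.0, 0};
  }

  void AvgUpdate(FunctionContext* ctx, const DoubleVal& input, StringVal* dst) {
    if (input.is_null) return;
    AvgState* state = reinterpret_cast<AvgState*>(dst->ptr);
    state->sum += input.val;
    ++state->count;
  }

  void AvgMerge(FunctionContext* ctx, const StringVal& src, StringVal* dst) {
    const AvgState* src_state = reinterpret_cast<const AvgState*>(src.ptr);
    AvgState* dst_state = reinterpret_cast<AvgState*>(dst->ptr);
    dst_state->sum += src_state->sum;
    dst_state->count += src_state->count;
  }

  DoubleVal AvgFinalize(FunctionContext* ctx, const StringVal& src) {
    const AvgState* state = reinterpret_cast<const AvgState*>(src.ptr);
    double result = state->count == 0 ? 0.0 : state->sum / state->count;
    ctx->Free(src.ptr);
    return DoubleVal(result);
  }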

Change-Id: I5675ec7f275ea698c24ea8e92de7f469a950df83
Reviewed-on: http://gerrit.cloudera.org:8080/655
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-08-19 04:37:39 +00:00
Henry Robinson
4eb2754924 IMPALA-2007: Fix race in test_statestore.test_topic_persistence
test_topic_persistence simulates a failing subscriber by sending updates
for one persistent and one transient topic to the statestore, and then
having the subscriber kill itself by closing its connections to the
statestore. Another subscriber then registers and checks that the
persistent topic entries are there, but the transient ones are not.

This patch fixes a race in that test where the first subscriber may
forcibly close its connections before it has sent the topic updates,
leading to a failure when the second subscriber checks the topic
contents. This happens because the 'kill' thread is notified at the
point that the RPC thread is leaving the RPC implementation, but before
any network response has been sent, so the kill thread can race to close
the TCP connection before the response actually makes it to the
statestore.

The easy fix is to force the subscriber to wait for 2 updates from the
statestore rather than 1 before terminating - this ensures that the
original response completes before the connections are closed.

Before this patch, the test would fail within ten minutes. After, it has
yet to fail in an hour of continuous testing.

Change-Id: I5d464d5781ed0e27220f3e826609493893a052aa
Reviewed-on: http://gerrit.cloudera.org:8080/649
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-08-18 01:29:45 +00:00
Casey Ching
669290c513 Python: Switch postgres driver
The old driver (psycopg2) requires some development packages to be
installed. The new driver (pg8000) is pure Python, so it's much
easier to set up. The new driver is potentially slower, but we don't
need much performance from the postgres driver.

Change-Id: Iea743b53b20e9bdf405be595ab1cac35763f120b
Reviewed-on: http://gerrit.cloudera.org:8080/653
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-08-17 20:45:35 +00:00
Casey Ching
facedb2aa5 Add stress test for TPC queries running against a cluster
This will run concurrent TPC-DS/H queries against a CM managed cluster.

Stress test outline (and notes):
 1) Get a set of queries. TPCH and/or TPCDS queries will be used.
    TODO: Add randomly generated queries.
 2) For each query, run it individually to find:
     a) Minimum mem limit to avoid spilling
     b) Minimum mem limit to successfully run the query (spilling allowed)
     c) Runtime when no mem was spilled
     d) Runtime when mem was spilled
     e) A row order independent hash of the result set.
    This is a slow process so the results will be written to disk for reuse.
 3) Find the memory available to Impalad. This will be done by finding the
    minimum memory available across all impalads (-mem_limit startup option).
    Ideally, for maximum stress, all impalads will have the same memory
    configuration but this is not required.
 4) Optionally, set an amount of memory that can be overcommitted.
 5) Start submitting queries. There are two modes for throttling the number
    of concurrent queries:
     a) Submit queries until all available memory (as determined by items 3
        and 4) is used. Before running the query a query mem limit is set
        between 2a and 2b. (There is a runtime option to increase the
        likelihood that a query will be given the full 2a limit to avoid
        spilling.)
     b) TODO: Use admission control.
 6) Randomly cancel queries to test cancellation. There is a runtime option
    to control the likelihood that a query will be randomly canceled.
 7) Cancel long running queries. Queries that run longer than some expected
    time, determined by the number of queries currently running, will be
    canceled.
    TODO: Collect stacks of timed out queries and add reporting.
 8) If a query errored, verify that memory was overcommitted during execution
    and the error is a mem limit exceeded error. There is no other reason a
    query should error and any such error will cause the stress test to stop.
    TODO: Handle crashes -- collect core dumps and restart Impala
    TODO: Handle client connectivity timeouts -- retry a few times
 9) Verify the result set hash of successful queries.

Change-Id: I4bd7f8a7cc65d5ae910a33afba59135040a99061
Reviewed-on: http://gerrit.cloudera.org:8080/474
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-08-15 23:10:25 +00:00
Casey Ching
a4fe24c1b2 Python: Add more logging and CM options to common CLI parser
Example output of --help:

Options:
  --debug-log-file=DEBUG_LOG_FILE
                        Path to debug log file. [default:
                        /tmp/concurrent_select.py.log]
  --cm-host=host name   The host name of the CM server.
  --cm-port=port number
                        The port of the CM server. [default: 7180]
  --cm-user=user name   The name of the CM user. [default: admin]
  --cm-password=password
                        The password for the CM user. [default: admin]
  --cm-cluster-name=name
                        If CM manages multiple clusters, use this to
                        specify which cluster to use.

Change-Id: I614383f4a65e700348572204e3d8fd5670f5bcf7
Reviewed-on: http://gerrit.cloudera.org:8080/472
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
2015-08-15 23:10:10 +00:00
Alex Behm
dd88b3b465 IMPALA-2201: Unconditionally update the partition stats and row count.
Before this patch, we used to only send alterPartition() requests to
the Hive Metastore for partitions whose stats have changed during
COMPUTE [INCREMENTAL] STATS. However, there is other state associated
with the stats like the STATS_GENERATED_VIA_STATS_TASK that was not
properly handled. Not updating this additional partition metadata
was the root cause of IMPALA-2201.

This patch changes COMPUTE [INCREMENTAL] STATS to unconditionally update the
partition stats and row counts in the Hive Metastore, even if the partition
already has identical stats. This behavior results in possibly redundant work,
but it is predictable and easy to reason about because it does not depend on
the existing state of the metadata.

Note that in versions starting from CDH 5.4 it is not possible to reproduce
IMPALA-2201 because of a behavioral change in the Hive Metastore in the
alterPartition() code path.

Change-Id: I10105d8d6306d9ad9988b03abc23752d7bc98252
Reviewed-on: http://gerrit.cloudera.org:8080/640
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-14 23:33:20 +00:00
Tim Armstrong
1d2afcfec2 IMPALA-2079: Part 1: report non-writable scratch dirs at startup
Previously Impala could erroneously decide to use non-writable scratch
directories, e.g. if /tmp/impala-scratch already exists and is not
writable by the current user.

With this change, if we cannot remove and recreate a fresh scratch directory,
it is not used.  If we have no valid scratch directories, we log an
error and continue startup.

Add unit test for CreateDirectory to test behavior for success and
failure cases.

Add system tests to check logging and query execution in various
scenarios where we do not have scratch available.

Modify FilesystemUtil to use non-exception-throwing Boost functions to
avoid unhandled exceptions escaping into the rest of the Impala
codebase, which does not expect the use of exceptions.

Change-Id: Icaa8429051942424e1d811c54bde10102ac7f7b3
Reviewed-on: http://gerrit.cloudera.org:8080/565
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-08-14 00:38:22 +00:00
Dimitris Tsirogiannis
47c5ae405a Revert "IMPALA-2015: Add support for nested loop join"
This reverts commit 6837cdec7f6a7e1c7e8157e323f3ab68277689aa.

Change-Id: I2fd6424c553a701fcbfd425b4486af7280820b23
Reviewed-on: http://gerrit.cloudera.org:8080/636
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 02:20:07 +00:00
Skye Wanderman-Milne
f8134ff133 IMPALA-2187: Run py.test through impala python env.
The symptom of this bug was that we were seeing "ValueError: bad marshal data"
when trying to import from tests.hs2.test_hs2 during custom cluster tests.

The problem was that we were not running the custom cluster tests through the
new Impala Python virtualenv.

Some tests (properly running with the virtualenv) that run before the custom
cluster tests had caused the generation of pyc files for tests.hs2.test_hs2.
Those pyc files then appeared corrupted when executing the custom cluster
tests because the default python env runs a different version than the
virtualenv those pyc files were generated from in earlier tests.

Change-Id: Ie9d8f90c65921247dd885804165f9b7271ea807b
Reviewed-on: http://gerrit.cloudera.org:8080/618
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-08-09 06:17:48 +00:00
Skye Wanderman-Milne
f000758ca8 IMPALA-2015: Add support for nested loop join
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.

Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.

Change-Id: Id65a1aae84335bba53f06339bdfa64a1b0be079e
Reviewed-on: http://gerrit.cloudera.org:8080/457
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-07 02:47:32 +00:00
Casey Ching
d202d6a967 Use "impala-python" (virtualenv) instead of system python
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.

Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-06 02:09:09 +00:00
Ippokratis Pandis
adac8b79bc IMPALA-1933: Fixing an error check in parquet scanner
The HdfsParquetScanner would exit with the wrong error, stating that it read
fewer rows than stated in the metadata of the file, when the
ReadValue() call actually failed with a memory limit exceeded error.
One effect of this wrong error reporting was that tests
like test_mem_usage_scaling would sometimes fail, especially under
ASAN.

With this patch the Parquet scanner checks whether the memory limit was
exceeded before checking the difference between the number of rows
read and the number of expected rows according to the metadata.
This patch also adds another value to the test_mem_usage_scaling test;
that value (20MB) would normally trigger this false error.
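In sketch form, the reordered check looks roughly like this (hypothetical names
and simplified error handling, not the actual scanner code):

  #include <cstdint>
  #include <stdexcept>
  #include <string>

  // Sketch: surface the real parse error (e.g. memory limit exceeded from
  // ReadValue()) before comparing the rows actually read against the file
  // metadata, so we don't report a misleading "fewer rows than expected"
  // error.
  void CheckScanCompletion(bool parse_status_ok, const std::string& parse_error,
                           int64_t rows_read, int64_t metadata_num_rows) {
    if (!parse_status_ok) throw std::runtime_error(parse_error);
    if (rows_read != metadata_num_rows) {
      throw std::runtime_error("Read " + std::to_string(rows_read) +
                               " rows but file metadata states " +
                               std::to_string(metadata_num_rows) + ".");
    }
  }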

Change-Id: Iad008d7af1993b88ac4dc055f595cfdbc62a6b79
Reviewed-on: http://gerrit.cloudera.org:8080/557
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-05 12:33:52 +00:00
Casey Ching
074e5b4349 Remove hashbang from non-script python files
Many Python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes
determining which Python files are actually scripts very difficult.
A future patch will update the hashbang in real Python scripts so they
use $IMPALA_HOME/bin/impala-python.

Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-04 05:26:07 +00:00
Matthew Jacobs
891cdf1830 Close pytest beeswax queries with exceptions
The beeswax interface in the test infrastructure was not
closing queries that encountered exceptions. This was
problematic because failed queries would remain open, and
due to IMPALA-2060, resources wouldn't be released. If
admission control or RM is enabled, the test run may
eventually fail if resources continue to be held.
Regardless, failed queries should be closed.

Change-Id: I5077023b1d95d1ce45a92009666448fdc6e83542
Reviewed-on: http://gerrit.cloudera.org:8080/530
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2015-08-03 20:45:05 +00:00
Henry Robinson
621205ebbc IMPALA-2143: Avoid sending auth credentials over insecure connections
This patch changes the behaviour of the Impala shell to refuse to
attempt an LDAP-authenticated connection to Impala unless SSL/TLS is
configured.

A new flag --auth_creds_in_clear_ok is added to suppress this
behaviour. This is similar to Impala's --ldap_passwords_in_clear_ok
flag. The shell will also now print a warning if an insecure
configuration is used.

Change-Id: Ide25d8dd881a61b9f08900112466c430da64a038
Reviewed-on: http://gerrit.cloudera.org:8080/546
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-07-30 07:15:29 +00:00