If a CTAS query fails during the DML part, Impala
should drop the newly created table.
Change-Id: I39e04a6923a36afa48f3252addd50ddda83d1706
(cherry picked from commit e03ce43585f68590a95038341e74db458f34bf32)
Reviewed-on: http://gerrit.cloudera.org:8080/870
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
We do not support nested types combined with the old aggs and joins.
Change-Id: I81401dd4d482d46e678091989ac9d178ac771d01
Reviewed-on: http://gerrit.cloudera.org:8080/1078
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
When a full outer join runs on the old (non-partitioned)
HashJoinNode and any join fragment has 0 build rows and 0
probe rows, an extra null row will be produced.
Change-Id: I75373edc4f6b3b0c23afba3c1fa363c613f23507
Reviewed-on: http://gerrit.cloudera.org:8080/1068
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Due to IMPALA-1619, allocating a StringBuffer larger than 1GB could
cause Impala to crash. Check the requested buffer size in advance and
fail the request if it is larger than 1GB. Once IMPALA-1619 is
fixed, we should revert this change.
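The shape of the guard, as a minimal sketch (the class, constant, and method names here are illustrative, not the actual patch):
    // Fail early instead of letting an oversized allocation crash the process.
    static const int64_t MAX_STRING_BUFFER_BYTES = 1LL << 30;  // 1GB
    Status StringBuffer::GrowBuffer(int64_t new_size) {
      if (new_size > MAX_STRING_BUFFER_BYTES) {
        return Status("String buffer allocation would exceed the 1GB limit");
      }
      // ... perform the allocation as before ...
      return Status::OK();
    }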
Change-Id: Iffd1e701614b520ce58922ada2400386661eedb1
(cherry picked from commit 74ba16770eeade36ab77c86ed99d9248c60b0131)
Reviewed-on: http://gerrit.cloudera.org:8080/869
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
We don't support nested types combined with the old aggs and joins. This
patch disables the nested type query tests when the old aggs or joins
are enabled with the TEST_START_CLUSTER_ARGS environment variable.
Change-Id: I6579a0a245359d4d2ff955c399d1296580c9676e
Reviewed-on: http://gerrit.cloudera.org:8080/1046
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Move a very expensive semi-join test case to run only in
exhaustive mode so that it is not run as part of the old hash-join/agg
Jenkins runs, where it fails.
Change-Id: I4a0f915e894ceac91d86b366876e47e9cc87255a
Reviewed-on: http://gerrit.cloudera.org:8080/930
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This is the first step toward fixing issues with large memory allocations. In
this patch, the built-in `group_concat` is no longer allowed to allocate
arbitrarily large strings and crash Impala, but is limited to the upper
bound of possible allocations in Impala.
This patch does not change any functionality, but rather avoids
unnecessary crashes. However, it changes the parameter type of
FindChunk() in MemPool to a signed 64-bit integer. This change allows
the MemPool to internally allocate more than 1GB, but the
public interface of Allocate() is not changed, so the general limitation
remains. The reason for this change is as follows:
1) In a UDF, FunctionContext::Reallocate() would allocate slightly more
than 512MB from the FreePool.
2) The FreePool tries to double this size to allocate 1GB from the
MemPool.
3) The MemPool doubles the size again and overflows the signed 32-bit
integer in the FindChunk() method. This then allocates only 1GB
instead of the expected 2GB.
The result is that one of the callers expected a larger allocation
than actually happened, which will in turn lead to memory corruption as
soon as the memory is accessed.
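The overflow itself is easy to reproduce in isolation; a standalone sketch with values chosen to mirror the steps above:
    #include <cstdint>
    #include <iostream>
    int main() {
      int32_t chunk_size = 1 << 30;  // 1GB: the size the MemPool doubled to in step 2
      int64_t wanted = static_cast<int64_t>(chunk_size) * 2;  // 2GB: what the caller expects
      // 2^31 does not fit in a signed 32-bit integer, so a 32-bit FindChunk()
      // parameter cannot represent the 2GB request.
      std::cout << (wanted > INT32_MAX) << std::endl;  // prints 1
      return 0;
    }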
Change-Id: I068835dfa0ac8f7538253d9fa5cfc3fb9d352f6a
Reviewed-on: http://gerrit.cloudera.org:8080/858
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
PAGG and PHJ were using an all-or-nothing approach with respect to
spilling. In particular, they were trying to switch to IO-sized buffers
for both streams (aggregated and unaggregated in PAGG; build and probe
in PHJ) of every partition (currently 16 partitions, for a total of 32
streams), even if some of the streams had very few rows, were
empty, or simply would never spill, so there was no need to allocate
IO buffers for them. That increased the minimum memory needed by those
operators in many queries.
This patch decouples the decision to switch to IO buffers for each
stream of each partition. A stream switches to IO-sized buffers
only when the rows it contains do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both of its streams to IO buffers.
With this change many streams of PAGG and PHJ nodes no longer need
IO-sized buffers, reducing the minimum memory requirement. For example,
below is the minimum memory needed (in MB) for some of the TPC-H queries.
Some now need half or less of the memory they needed before:
TPC-H Q3: 645 -> 240
TPC-H Q5: 375 -> 245
TPC-H Q7: 685 -> 265
TPC-H Q8: 740 -> 250
TPC-H Q9: 650 -> 400
TPC-H Q18: 1100 -> 425
TPC-H Q20: 420 -> 250
TPC-H Q21: 975 -> 620
To make this small-buffer optimization work, we had to fix
IMPALA-2352: the AllocateRow() call of
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. Previously
we would treat that as an indication that there is no memory
left, start spilling a partition, and switch all streams to
IO buffers. Now we make a best effort: we first try
SwitchToIoBuffers() and, if that is successful, we re-attempt the
AllocateRow() call. See IMPALA-2352 for more details.
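In outline, the best-effort retry looks roughly like this (a sketch, not the literal patch; real signatures and error handling differ):
    uint8_t* row = stream->AllocateRow(size);
    if (row == NULL && stream->using_small_buffers()) {
      // Maybe we only exhausted the small buffers: try IO-sized buffers first.
      bool got_buffer;
      RETURN_IF_ERROR(stream->SwitchToIoBuffers(&got_buffer));
      if (got_buffer) row = stream->AllocateRow(size);  // re-attempt
    }
    if (row == NULL) {
      RETURN_IF_ERROR(SpillPartition());  // genuinely out of memory
    }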
Another change is that SwitchToIoBuffers() now resets the flag
using_small_buffers_ back to false when we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.
With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271, which are essentially stream::using_small_buffers_-related
DCHECKs.
This patch adds all 22 TPC-H queries to the test_mem_usage_scaling test
and updates its per-query minimum memory limits. Additionally, it adds
a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some
dead test code.
Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
Conflicts:
tests/query_test/test_aggregation.py
There was a DCHECK in PHJ::ProcessProbeBatch() that expected the
state of the PHJ to be PROCESSING_PROBE. It turns out we can hit the
same DCHECK while in the REPARTITIONING phase.
This patch fixes this DCHECK. It also adds TPC-DS q53 to the
test_mem_usage_scaling test (along with the needed refactoring in this
test) because TPC-DS q53 hit this DCHECK in an endurance test.
Change-Id: I37f06e1bfe07c45e4a6eac543934b4d83a205d28
Reviewed-on: http://gerrit.cloudera.org:8080/893
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
The NLJ node did not follow the expected protocol when need_to_return
is set on a row batch, which means that memory referenced by a row batch
can be freed or reused the next time GetNext() is called on the child.
This patch changes the NLJ node to follow the protocol by deep copying
all build-side row batches when the need_to_return_ flag is set on the
row batches. This prevents the row batches from referencing memory that
may be freed or reused.
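Conceptually, the build phase now does something like this (a sketch with illustrative names):
    // If the child needs the batch's memory back, deep-copy the rows into
    // memory owned by the join node before calling GetNext() on the child again.
    if (batch->need_to_return()) {
      RowBatch* copy = new RowBatch(row_desc(), batch->capacity(), mem_tracker());
      batch->DeepCopyTo(copy);  // illustrative: copies tuple data, not just pointers
      build_batches_.push_back(copy);
    } else {
      build_batches_.push_back(batch);
    }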
Re-enable the test that was disabled because of IMPALA-2332, since this
was the root cause.
Change-Id: Idcbb8df12c292b9e2b243e1cef5bdfc1366898d1
Reviewed-on: http://gerrit.cloudera.org:8080/810
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch adds basic end-to-end functional tests for nested types:
1. For exercising the Reset() of exec nodes when inside a subplan.
2. For asserting correct behavior when row batches with collection-typed
slots flow through exec nodes.
Most cases are covered, but there are a few known issues that prevent
full coverage. The remaining tests will be added as part of the fixes
for those existing JIRAs.
Change-Id: I0140c1a32cb5edd189f283c68a24de8484b3f434
Reviewed-on: http://gerrit.cloudera.org:8080/823
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This was unimplemented and is used on some code paths. Arrays were not
properly copied into the BufferedTupleStream, potentially leaving stray
pointers to invalid or reused memory. Arrays are now correctly deep
copied. Includes a unit test that copies rows containing arrays into and
out of a BufferedTupleStream.
Also implement a matching optimisation for deep copy in RowBatch.
Change-Id: I75d91a6b439450c5b47646b73bc528cfb8f7109b
Reviewed-on: http://gerrit.cloudera.org:8080/751
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
After this patch, we get correct results for nested TPC-H Q13.
The bug: Since we were not properly handling AtCapacity() of the output
batch in SubplanNode, we sometimes passed a row batch that was already
at capacity into GetNext() on the second child of the SubplanNode.
In this particular case, that batch was passed into the NestedLoopJoinNode,
which may return incomplete results if the output batch is already
at capacity (e.g., ProcessUnmatchedBuildRows() was not called).
The fix is to return from SubplanNode::GetNext() if the output batch
is at capacity due to resources being transferred to it from the input
batch used to fetch from the first child.
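The shape of the fix, as a sketch of the added early return in SubplanNode::GetNext():
    // Resources from the input batch may have been transferred to 'row_batch',
    // filling it up; don't fetch from the second child with a full batch.
    if (row_batch->AtCapacity()) return Status::OK();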
Change-Id: Ib97821e8457867dc0d00fd37149a3f0a75872297
Reviewed-on: http://gerrit.cloudera.org:8080/742
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch modifies the Parquet scanner to resolve nested schemas, and
read and materialize collection types. The high-level modification is
to create a CollectionColumnReader that recursively materializes map-
and array-type slots.
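The recursive structure can be modeled in a few lines; this is an illustrative toy, not the actual scanner classes:
    #include <memory>
    #include <vector>
    struct ColumnReader {
      virtual ~ColumnReader() {}
      virtual void MaterializeNext() = 0;
    };
    struct ScalarColumnReader : ColumnReader {
      void MaterializeNext() override { /* decode one scalar value */ }
    };
    struct CollectionColumnReader : ColumnReader {
      std::vector<std::unique_ptr<ColumnReader>> children;  // item/key/value readers
      void MaterializeNext() override {
        // Materialize one collection by recursing into each child reader;
        // nested collections simply recurse further.
        for (auto& child : children) child->MaterializeNext();
      }
    };
    int main() {
      auto inner = std::unique_ptr<CollectionColumnReader>(new CollectionColumnReader());
      inner->children.emplace_back(new ScalarColumnReader());  // array<int>
      CollectionColumnReader outer;  // array<array<int>>
      outer.children.push_back(std::move(inner));
      outer.MaterializeNext();
      return 0;
    }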
This patch also adds many tests, most of which query a new table
called complextypestbl. This table contains hand-generated data that
is meant to expose edge cases in the scanner. The tests mostly test
the scanner, with a few tests of other functionality (e.g. array
serialization).
I ran a local benchmark comparing this scanner code to the original
scanner code on an expanded version of tpch_parquet.lineitem with
48009720 rows. My benchmark involved selecting different numbers of
columns with a single scanner thread, and I looked at the HDFS scan
node time in the query profiles. This code introduces a 10%-20%
regression in single-threaded scan time.
Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a
Reviewed-on: http://gerrit.cloudera.org:8080/576
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The Avro JSON library's refusal to handle \u0000 characters exists
to avoid problems with built-in functions like strlen(), which would
report the wrong length when encountering such a character. In the
case of Impala, for now, we don't support any Unicode characters.
This allows us to simply skip the \u0000 character instead of
interpreting it.
It is worth noting that even the most recent versions of Avro do
not support parsing \u0000 characters.
Change-Id: I56dfa7f0f12979fe9705c51c751513aebce4beca
Reviewed-on: http://gerrit.cloudera.org:8080/712
Tested-by: Internal Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
This patch modifies the Impala command line flag:
--max_cached_file_handles=VAL
to disable caching of HDFS file handles if VAL is 0.
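In effect (a sketch; the flag name comes from the message, while the surrounding code, including the default value, is illustrative):
    #include <gflags/gflags.h>
    DEFINE_int32(max_cached_file_handles, 0,
        "Maximum number of cached HDFS file handles; 0 disables the cache.");
    static bool IsFileHandleCachingEnabled() {
      return FLAGS_max_cached_file_handles > 0;  // VAL == 0 now means no caching
    }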
In addition, it moves the existing functional tests to a custom cluster
test and keeps a sanity check for no caching in the original
place. Furthermore, it checks that no file handles are leaked.
Change-Id: Ic36168bba52346674f57639e1ac216fd531b0fad
Reviewed-on: http://gerrit.cloudera.org:8080/691
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.
Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.
Change-Id: I238ec7dc0080f661847e5e1b84e30d61c3b0bb5c
Reviewed-on: http://gerrit.cloudera.org:8080/652
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Previously the frontend rejected UDAs with different intermediate and
result types. The backend supports these, so this change enables support
in the frontend and adds tests.
This patch adds a test UDA with a different intermediate type and
a simple end-to-end test that exercises it. It modifies an existing
unused test UDA that used a currently unsupported intermediate type,
BufferVal.
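For reference, a UDA of this shape declares different intermediate and result types; a sketch against the public UDF/UDA headers (the function names are illustrative):
    #include <impala_udf/udf.h>
    using namespace impala_udf;
    // Intermediate state travels as a StringVal (e.g. a packed struct), while
    // the final result is a DoubleVal; the frontend previously rejected this.
    void AvgInit(FunctionContext* ctx, StringVal* intermediate);
    void AvgUpdate(FunctionContext* ctx, const DoubleVal& input, StringVal* intermediate);
    void AvgMerge(FunctionContext* ctx, const StringVal& src, StringVal* dst);
    DoubleVal AvgFinalize(FunctionContext* ctx, const StringVal& intermediate);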
Change-Id: I5675ec7f275ea698c24ea8e92de7f469a950df83
Reviewed-on: http://gerrit.cloudera.org:8080/655
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.
Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.
Change-Id: Id65a1aae84335bba53f06339bdfa64a1b0be079e
Reviewed-on: http://gerrit.cloudera.org:8080/457
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
The HdfsParquetScanner would exit with the wrong error, namely that it
read fewer rows than stated in the metadata of the file, when the
ReadValue() call had actually failed with a memory-limit-exceeded error.
One effect of this wrong error reporting was that tests
like test_mem_usage_scaling would sometimes fail, especially under
ASAN.
With this patch the Parquet scanner checks whether the memory limit was
exceeded before checking the difference between the number of rows
read and the number of rows expected according to the metadata.
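The reordered check, in outline (a sketch with illustrative names):
    // Surface the real error (e.g. memory limit exceeded) before complaining
    // about a row-count mismatch against the file metadata.
    RETURN_IF_ERROR(state_->CheckQueryState());
    if (rows_read != metadata_num_rows) {
      return Status("File metadata states a different number of rows than were read");
    }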
This patch also adds another value to the test_mem_usage_scaling test;
that value (20MB) would previously trigger this misleading error.
Change-Id: Iad008d7af1993b88ac4dc055f595cfdbc62a6b79
Reviewed-on: http://gerrit.cloudera.org:8080/557
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
The beeswax interface in the test infrastructure was not
closing queries that encountered exceptions. This was
problematic because failed queries would remain open, and
due to IMPALA-2060, resources wouldn't be released. If
admission control or RM is enabled, the test run may
eventually fail if resources continue to be held.
Regardless, failed queries should be closed.
Change-Id: I5077023b1d95d1ce45a92009666448fdc6e83542
Reviewed-on: http://gerrit.cloudera.org:8080/530
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This patch changes the ExecNode::Reset() to:
Status ExecNode::Reset(RuntimeState* state);
The new Reset() should only clear the internal state of an exec node
in preparation for another Open()/GetNext()*. Reset() should not clear
memory backing rows returned by a node in GetNext() because those rows
could still be in flight.
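For a concrete node, an implementation sketch of the contract (the members shown are illustrative):
    // Prepare for another Open()/GetNext()* cycle without touching memory
    // already transferred to row batches returned from GetNext().
    Status SelectNode::Reset(RuntimeState* state) {
      num_rows_returned_ = 0;  // illustrative per-execution state
      child_row_idx_ = 0;
      return ExecNode::Reset(state);  // resets the children recursively
    }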
Subplan Memory Management:
To ensure that the memory backing rows produced by the subplan tree of
a SubplanExecNode remains valid for the lifetime of a row batch, we intend
to use our conventional transfer mechanism. That is, the ownership of
memory that is no longer used by an exec node is transferred to an output
row batch in GetNext() at a "convenient" point, typically at eos or when
the memory usage exceeds some threshold. Note that exec nodes may choose
not to transfer memory at eos to amortize the cost of memory allocation
over multiple Reset()/Open()/GetNext()* cycles.
To show the main ideas, this patch fixes the transfer of tuple data ownership
in several places and implements Reset() for the following nodes:
- AnalyticEvalNode
- BlockingJoinNode
- CrossJoinNode
- SelectNode
- SortNode
- TopNNode
- UnionNode
To make the transfer of ownership work for the SortNode, a row batch can now
also own a list of BufferedBlockMgr::Block*.
Also included are basic query tests that are not meant to be exhaustive.
The tests are disabled for now because we cannot run them without several
other code changes. I have manually run the test queries on a branch
that has all necessary changes.
Change-Id: I3ac94b8dd7c7eb48f2e639ea297b447fbf443185
Reviewed-on: http://gerrit.cloudera.org:8080/454
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
A few tests execute Hive queries via the command line. These tests should
ideally not be run when the underlying filesystem is either S3 or Isilon.
This patch disables them.
Change-Id: Ieb968f4f109e02ee893a0478b0ffeb16e5b3ff4c
Reviewed-on: http://gerrit.cloudera.org:8080/446
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch enables caching of HDFS file handles to avoid opening the
same file over and over again. When a file is opened for the first time, an
HdfsCachedFileHandle object is created: a small wrapper around
the hdfsFile instance that associates the last-modified timestamp
with it.
When the file handle is no longer needed, it is returned to the
DiskIoMgr where it is cached under the given path. When the file is
opened again, a lookup is first performed to see if an existing handle
can be reused. If there is an existing handle, the last-modified time
of the cached handle is compared with the last-modified time of the file
to be opened. If they are equal the handle is reused; otherwise it
is closed and the file is opened regularly.
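The reuse-or-reopen decision, in outline (a sketch; method names are illustrative):
    // Reuse a cached handle only if the file is unmodified since it was opened.
    HdfsCachedFileHandle* handle = cache_.Lookup(path);
    if (handle != NULL && handle->mtime() == expected_mtime) {
      return handle;  // hit: reuse the already-open hdfsFile
    }
    if (handle != NULL) CloseHandle(handle);  // stale: the file has changed
    return OpenAndCacheNewHandle(path, expected_mtime);  // miss: open regularly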
The new flag `-max_cached_file_handles` controls the overall size of the
cache by defining an upper bound on the number of cached file handles.
Furthermore, five new metrics were added to report the number of currently
cached file handles in the DiskIoMgr and the hit ratio of the cache
(including hit and miss counts):
impala-server.io.mgr.num-cached-file-handles
impala-server.io.mgr.cached-file-handles-hit-ratio
impala-server.io.mgr.cached-file-handles-hit-count
impala-server.io.mgr.cached-file-handles-miss-count
impala-server.io.mgr.num-file-handles-outstanding
Due to the way Impala performs scan operations, the cache may
contain multiple entries for the same file. If the process's limit on
open files is smaller than `max_cached_file_handles`, the lower limit
is used as the cache capacity.
Performance and Memory Evaluation:
The patch was evaluated in three tests:
1) Throughput: parallel scans on a small table with 200 small files.
Throughput increased from ~50 QPS to ~150 QPS with FD caching.
2) Latency: a single table with 300k files. Running select count(*) on the
table took 2792.30s with FD caching and 2764.81s without
FD caching (based on the HEAD~1 commit). No overhead.
3) Memory consumption: for the above table, the delta in RSS memory
consumption after running the query is 30MB, which roughly equals the
expected 2-3kB per FD for 10k cached descriptors.
Change-Id: Ifa6560d141188c329d7bc73c2dabcc1352d69cd7
Reviewed-on: http://gerrit.cloudera.org:8080/366
Tested-by: Internal Jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This patch introduces changes to run tests against Isilon, combined with minor cleanup of
the test and client code.
For Isilon, it:
- Populates the SkipIfIsilon class with appropriate pytest markers.
- Introduces a new default for the hdfs client in order to connect to Isilon.
- Cleans up a few test files to take the underlying filesystem into account.
- Cleans up the interface for metadata/test_insert_behaviour, query_test/test_ddl
On the client side, we introduce a wrapper around a few of pywebhdfs's methods, specifically:
- delete_file_dir does not throw an error if the file does not exist.
- get_file_dir_status automatically strips the leading '/'
Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f
Reviewed-on: http://gerrit.cloudera.org:8080/370
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
The libhdfs hdfsListDirectory API documentation is wrong. It says the call
returns NULL when there is an error, but it also returns NULL when the
directory is empty. Impala needs to check errno to determine whether an
error actually happened. The HDFS side of the issue is tracked as HDFS-8407.
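The required calling pattern (hdfsListDirectory and its arguments are the real libhdfs API; the error handling around it is an illustrative sketch):
    #include <errno.h>
    #include <hdfs.h>
    // NULL is returned both on error and for an empty directory; only errno
    // distinguishes the two cases (see HDFS-8407).
    errno = 0;
    int num_entries = 0;
    hdfsFileInfo* files = hdfsListDirectory(fs, path, &num_entries);
    if (files == NULL && errno != 0) {
      return Status("hdfsListDirectory failed");  // a real error occurred
    }
    // files == NULL with errno == 0 simply means the directory is empty.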
Change-Id: I9574c321a56fe339d4ccc3bb5bea59bc41f48ac4
(cherry picked from commit 20da688af19ca41576c82fd7b7d49b4346dbae92)
Reviewed-on: http://gerrit.cloudera.org:8080/394
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
The hardcoded timezone information is from Java version 1.7.0_76.
Change-Id: I32c40d0036473079e5bfd4d0252a648cbb0e7c23
Reviewed-on: http://gerrit.cloudera.org:8080/393
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes an issue where an uninitialized, empty row was falsely
added to the row batch. The uninitialized data inside this row later
leads to a crash when the null byte is checked together with the
offsets (which contain garbage).
The fix is to check not only the number of materialized columns, but
also the number of materialized partition key columns. Only if both are
empty and the parser has an unfinished tuple do we add the empty row.
To accommodate the last row, FinishScanRange() checks whether there is an
unfinished tuple with materialized slots or materialized partition keys, and
writes the fields if necessary.
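The corrected condition, in outline (a sketch with illustrative names):
    // Add the empty row only when there is genuinely nothing to materialize
    // and the parser still holds an unfinished tuple.
    if (scan_node_->materialized_slots().empty() &&
        scan_node_->num_materialized_partition_keys() == 0 &&
        parser_->HasUnfinishedTuple()) {
      AddEmptyRow(row_batch);  // illustrative helper
    }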
Change-Id: I2808cc228e62d048d917d3a6352d869d117597ab
(cherry picked from commit c1795a8b40d10fbb32d9051a0e7de5ebffc8a6bd)
Reviewed-on: http://gerrit.cloudera.org:8080/364
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
PHJ::GetNext() of RIGHT_OUTER, RIGHT_ANTI and FULL_OUTER joins that
had repartitioned was not checking whether the output batch had reached
capacity at the OutputUnmatchedBuild() call. In repartitioned
joins where the list of build partitions was exhausted and the output
batch had already reached capacity, we would call ProcessProbeBatch()
with a full output batch, triggering a DCHECK. This patch adds the
missing AtCapacity() check.
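The added check, as a sketch:
    // After OutputUnmatchedBuild() the batch may already be full; returning
    // here avoids calling ProcessProbeBatch() with a full output batch.
    if (out_batch->AtCapacity()) return Status::OK();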
It also adds a new join test (tpch-out-joins) that uses the TPC-H
dataset and moves into it some of the join tests that were using that
dataset. Running join tests with the larger TPC-H dataset is needed,
for example, in order to trigger repartitions.
Change-Id: I4434ad0683e1b09f75a25b3eb870a817d4988370
Reviewed-on: http://gerrit.cloudera.org:8080/314
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
This patch encapsulates pytest's skipif markers in classes. This leads to the
following benefits:
- Provide context and grouping for tests being skipped.
- As we improve test reporting, annotations will give us a better idea of coverage.
Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
In the case where the BlockingJoinNode runs the build asynchronously in one
fragment instance but not another, a deadlock between the instances is
possible (see IMPALA-1863 for the details). To avoid this deadlock
potential, close the build-side child on error, which will deregister
any datastream receivers from that side, breaking the cycle that leads
to the deadlock.
Change-Id: I2de06615897b4bcaa5855449a98984f11c948dc4
Reviewed-on: http://gerrit.cloudera.org:8080/242
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Conflicts:
be/src/exec/blocking-join-node.cc
I did a local benchmark and there's minimal performance impact (<1%).
Change-Id: I8d84a145acad886c52587258b27d33cff96ea399
(cherry picked from commit 7e750ad5d90007cc85ebe493af4dce7a537ad7c0)
Reviewed-on: http://gerrit.cloudera.org:8080/189
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
This patch removes the logic from the python test file; it should really live
in the code that sets up the test-warehouse.
Change-Id: Id04dc90c7ab813af2f347ec79e9e43d76de794a2
Reviewed-on: http://gerrit.cloudera.org:8080/224
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
pytest parallelizes at the level of a test method plus its dimensions, so a
test method that operates on n dimensions will be parallelized n ways and
scheduled to run separately.
A few tests use setup_method/teardown_method in parallel but operate on the
same resources (a test db, a test table, etc.). This introduces a race when
they happen to run at the same time. This patch forces those tests to run
serially, thereby avoiding the race and removing flakiness.
Change-Id: I77f038e15b9a0616c8df4caaebd31733960ad78e
Reviewed-on: http://gerrit.cloudera.org:8080/207
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Otherwise, the S3 job always hangs at this test and we lose coverage
of everything downstream. I'm pretty sure IMPALA-1863 is not S3-related,
but we hit that bug on EC2/S3 for whatever reason.
Change-Id: I3f27413fdd53e57d11c08dbef1daac36a032f4a6
Reviewed-on: http://gerrit.cloudera.org:8080/210
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
We need to use paths that prepend $FILESYSTEM_PREFIX
so that paths are qualified when running the tests against the
non-default filesystem (e.g. S3).
Change-Id: I40b23d550aef67fcc5ebeb4640ed626d4d6361f8
Reviewed-on: http://gerrit.cloudera.org:8080/201
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
As needed, fix up file paths and other misc things to get
more test cases running against S3.
Change-Id: If4eaf9200f2abd17074080a37cd0225d977200ad
Reviewed-on: http://gerrit.cloudera.org:8080/167
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
If the symbol didn't exist, JniContext::output_anyval was never set and
wasn't initialized to NULL, so on Close() we were deleting a random pointer.
The fix is to initialize the pointer to NULL.
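The essence of the fix, as a sketch of the idea (not the actual struct):
    struct JniContext {
      AnyVal* output_anyval;
      // Without this initialization, Close() deleted whatever garbage happened
      // to be in output_anyval when the symbol lookup failed.
      JniContext() : output_anyval(NULL) {}
    };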
Change-Id: I3c6d6f7f997b37ff9bf066059a3ab4e4e9635253
Reviewed-on: http://gerrit.cloudera.org:8080/185
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Verify DDL and queries when a table spans multiple filesystems
and across tables that live on different filesystems.
Change-Id: I4258bebae4a5a2758666f5c2e283eb2d205c995e
Reviewed-on: http://gerrit.cloudera.org:8080/166
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Fix up more locations to allow the tests to run on a secondary filesystem.
In particular, database locations need to be located on the target
filesystem, or else any tables created without locations will be in HDFS
and not actually give coverage on S3.
Change-Id: Ifcc4a47ecaa235b23d305784b844788732d5fa05
Reviewed-on: http://gerrit.cloudera.org:8080/143
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
In the rare low-memory case where, during a hash table's initialization, we
cannot consume even 8KB to create the array of buckets, Init() would set
num_buckets==0 and return false. But then PAGG::Partition::Close() would try
to clean up the partition's hash table by iterating over its buckets, even
though it didn't have any buckets. That would result in a DCHECK at
HashTable::Begin().
This patch fixes the problem by not DCHECK'ing at HashTable::Begin(). That
function calls NextBucket(), which correctly returns End() if there are no
buckets.
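A sketch of the fixed Begin() (illustrative; the point is that the DCHECK on num_buckets_ is gone):
    HashTable::Iterator HashTable::Begin() {
      // num_buckets_ == 0 is legal when Init() failed under memory pressure.
      int64_t bucket_idx = -1;
      Bucket* bucket = NextBucket(&bucket_idx);  // NULL when there are no buckets
      if (bucket == NULL) return End();
      return Iterator(this, bucket_idx, bucket);  // illustrative constructor
    }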
Change-Id: I9c5984de79fb5ef8b7f31e082ac0d0bfbf242e77
Reviewed-on: http://gerrit.cloudera.org:8080/135
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3, to help see what coverage is missing. Soon
we'll be reworking some tests and/or adding new tests to close the
important gaps.
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this step fails because the metastore
db is in use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins