Commit Graph

56 Commits

Thomas Tauber-Marshall
82290d61ad IMPALA-4895: Memory limit exceeded in test_outer_joins
A recent change (IMPALA-3524) removed a 'CATCH' section for a
mem limit exceeded error because the other changes in the patch
reduced the memory requirements for that particular query and
the error was no longer being hit.

This seemed okay because the point of the test wasn't to trigger
the mem limit exceeded error, and I manually verified that the
situation the test was addressing was still covered even
without the error being hit.

It turns out, though, that the test still hits the error in some
situations (local-filesystem and non-partitioned-aggs-and-joins
builds).

The fix is to make the test more permissive by adding '__NO_ERROR__'
as one of the options in the 'CATCH: ANY_OF' section, so that the test
passes whether or not the mem limit is exceeded.
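
A minimal sketch of the resulting test section (the query, limit, and
first message are illustrative; only the 'CATCH: ANY_OF' and
'__NO_ERROR__' mechanics come from this patch and IMPALA-3608 below):

  ---- QUERY
  set mem_limit=150m;
  select count(*) from big_tbl a join big_tbl b on a.id = b.id
  ---- CATCH: ANY_OF
  Memory limit exceeded
  __NO_ERROR__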

Change-Id: I4731a3e83dd2142a1d83be963f83cd1847472295
Reviewed-on: http://gerrit.cloudera.org:8080/5941
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-09 00:50:15 +00:00
Thomas Tauber-Marshall
6a9df54096 IMPALA-3524: Don't process spilled partitions with 0 probe rows
In the partitioned hash join node, if a spilled partition has no probe
rows, building the hash table is unnecessary.

For some join types (right outer, right anti, and full outer), we still
need to process the build side to output unmatched rows (in this case,
all build rows, since there were no probe rows to match).
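
A hypothetical sketch of the decision, in Python pseudocode (the names
are illustrative, not Impala's C++ identifiers):

  NEEDS_UNMATCHED_BUILD_OUTPUT = {"RIGHT_OUTER", "RIGHT_ANTI", "FULL_OUTER"}

  def must_build_hash_table(probe_rows):
      # A hash table only matters when there are probe rows to match.
      return probe_rows > 0

  def must_scan_build_side(join_type, probe_rows):
      # Even with zero probe rows, these join types must still emit the
      # build rows, all of which are unmatched in that case.
      return probe_rows > 0 or join_type in NEEDS_UNMATCHED_BUILD_OUTPUT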

Testing: Added some cases to spilling.test. Manually tested these cases
for performance, and they all show around a 6% improvement.

Change-Id: I175b32dd9031e51218b38c37693ac3e31dfab47b
Reviewed-on: http://gerrit.cloudera.org:8080/5389
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-02-06 20:22:33 +00:00
Alex Behm
f0ffbca2c3 IMPALA-3491: Use unique database fixture in test_insert_parquet.py
Testing: Ran the test locally in a loop.
Did a private debug/core/hdfs build.

Change-Id: I790b2ed5236640c7263826d1d2a74b64d43ac6f7
Reviewed-on: http://gerrit.cloudera.org:8080/4317
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-08 03:25:29 +00:00
Bikramjeet Vig
36b4ea6f65 IMPALA-1683: Allow REFRESH on a single partition
Previously, the only way to refresh metadata for a partition was to refresh
the whole table. This is a relatively time-consuming process, especially if
there are many partitions and only one needs to be refreshed.
This patch allows the client to REFRESH a single partition by using the
following syntax:
REFRESH [database_name.]table_name PARTITION (partition_spec)
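
For example (the table and partition spec here are hypothetical):

  REFRESH logs_db.web_logs PARTITION (year=2016, month=7)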

Testing:
Added parsing and authorization tests in ParserTest.java and
AuthorizationTest.java respectively. A new test file
"test_refresh_partition.py" was added for testing functionality.

Performance:
For a table with 10000 partitions and 1 file per partition

                     execResetMetadata()       Total Execution Time
Refresh Table              3795 ms                   4630 ms
Refresh Partition            42 ms                    680 ms

The refresh time itself improves by a factor of about 90x, but due to
a fixed overhead of about 640 ms in this case, the effective improvement
is about 7x. As the size of the table and the number of partitions
increase, this improvement becomes more significant.

Change-Id: Ia9aa25d190ada367fbebaca47ae8b2cafbea16fb
Reviewed-on: http://gerrit.cloudera.org:8080/3813
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-29 23:57:50 +00:00
Dimitris Tsirogiannis
6fbd35fa87 Enable TPC-H workload for Kudu tables
With this commit we enable loading TPC-H data into Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries to use the round()
function and update the test results.

Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-07-28 04:35:11 +00:00
Michael Ho
3a4a77521e IMPALA-3608: Updates Impala E2E test framework to allow multiple exception messages
Some of our tests which are expected to fail due to low
query memory limits can fail non-deterministically with
different error messages. In addition, some tests may
throw different error messages when running with the legacy
join nodes. This change updates the test infrastructure to
allow multiple exception messages to be specified by
adding "ANY_OF" to the "CATCH" subsection.

Change-Id: Ie6d81fd3ae601f565b575edfeefff7c5a6c07974
Reviewed-on: http://gerrit.cloudera.org:8080/3205
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-05-31 23:32:10 -07:00
Michael Ho
0243a21da8 IMPALA-3242: Remove most usages of RuntimeState::SetMemLimitExceeded()
There are multiple places in the code which call
RuntimeState::SetMemLimitExceeded(). Most of them are
unnecessary, as the error status constructed will eventually
be propagated up the tree of exec nodes. There is no obvious
reason to treat exceeding the query memory limit differently.
In some cases, such as the scan node, calling SetMemLimitExceeded()
is actually confusing, as all scanner threads may pick up the error
status when any one thread exceeds the query memory limit, causing
a lot of noise in the log.

This change replaces most calls to RuntimeState::SetMemLimitExceeded()
with MemTracker::MemLimitExceeded(). The remaining callers are
the old hash table code, the UDF framework, and QueryMaintenance(),
which checks the memory limit periodically. The QueryMaintenance()
case will eventually be removed once IMPALA-2399 is fixed.

Change-Id: Ic0ca128c768d1e73713866e8c513a1b75e6b4b59
Reviewed-on: http://gerrit.cloudera.org:8080/3140
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-05-23 08:40:19 -07:00
Sailesh Mukil
ed7f5ebf53 IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Previously, Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without attempting
major changes to improve performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function no longer makes any
assumptions about the underlying filesystem and works across
all supported filesystems. This is done by adding a full URI field
for a partition's base directory to TInsertPartitionStatus.
Also, the HdfsOp class no longer assumes a single filesystem and
gets connections to filesystems based on the URI of the file it
is operating on.

Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which wraps the
boto3 functions with the same function signatures as PyWebHdfsClient,
by deriving from an abstract base class BaseFileSystem, so that the
two can be used interchangeably through a 'generic_client'.
test_load.py is refactored to use this generic client. The
ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.
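
A hypothetical sketch of the pattern (class and method names are
illustrative, not the patch's exact definitions; S3_BUCKET is an
assumed environment variable):

  import os
  from abc import ABC, abstractmethod

  class BaseFileSystem(ABC):
      """Common interface so tests stay filesystem-agnostic."""
      @abstractmethod
      def create_file(self, path, data): pass

      @abstractmethod
      def delete_file_dir(self, path, recursive=False): pass

  class HdfsClient(BaseFileSystem):
      """PyWebHdfsClient-backed implementation (details elided)."""
      def create_file(self, path, data): raise NotImplementedError
      def delete_file_dir(self, path, recursive=False): raise NotImplementedError

  class S3Client(BaseFileSystem):
      def __init__(self, bucket_name):
          import boto3
          self.bucket = boto3.resource('s3').Bucket(bucket_name)

      def create_file(self, path, data):
          self.bucket.put_object(Key=path.lstrip('/'), Body=data)

      def delete_file_dir(self, path, recursive=False):
          prefix = path.lstrip('/')
          if recursive:
              self.bucket.objects.filter(Prefix=prefix).delete()
          else:
              self.bucket.Object(prefix).delete()

  def create_generic_client():
      # Test setup picks the client based on TARGET_FILESYSTEM.
      if os.environ.get('TARGET_FILESYSTEM') == 's3':
          return S3Client(os.environ['S3_BUCKET'])
      return HdfsClient()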

P.S: Currently, test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is also slower than on HDFS, mainly because of an
extra copy that happens between a file's staging and final
locations. However, larger INSERTs come closer to HDFS performance
than smaller ones.

ACLs are not handled for S3 in this patch; they are something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:49 -07:00
Tim Armstrong
2c2670e389 IMPALA-1305: streaming pre-aggregations
Aggregations are implemented as a distributed pre-aggregation, an
exchange, then a final aggregation that produces the results of the
aggregation. In many cases the pre-aggregation significantly reduces the
amount of data to be exchanged. In other cases, however, the
pre-aggregation does not greatly reduce the amount of data exchanged, or
can use a lot of memory and starve other operators that would benefit
more from the additional memory.

In these cases we would be better off "passing through" some input tuples
by transforming them into intermediate tuples without aggregating them.

This patch adds a streaming pre-aggregation mode to
PartitionedAggregationNode that tries to aggregate input rows with a
hash table, but can switch to passing through the input tuples (after
transforming them into the appropriate tuple format). It does this if
it hits a memory limit or if the aggregation is not sufficiently
reducing the node's output (specifically, if the number of aggregated
rows in the hash table is more than half the number of unaggregated rows
consumed by the pre-aggregation). Pre-aggregations never need to spill
because they can pass through rows when under memory pressure.
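
A hypothetical sketch of the switch condition (illustrative names; the
real logic lives in the C++ PartitionedAggregationNode):

  def should_pass_through(hash_table_rows, input_rows_consumed, hit_mem_limit):
      # Pass rows through unaggregated when memory is tight or when the
      # pre-aggregation is reducing its input by less than a factor of two.
      if hit_mem_limit:
          return True
      return hash_table_rows > input_rows_consumed / 2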

This initial implementation is quite conservative: it retains the
partitioning of the previous implementation, because switching to a
single partition proved to regress the performance of some queries while
improving others. It also always keeps the hash tables around and updates
them with matching input rows, so that the reduction statistics stay
current and early decisions to pass through data can be reversed. Future
work could explore different approaches within the new framework to get
larger performance gains. Currently we see significant performance
benefits for queries with a very low reduction factor, e.g. a group by on
a nearly unique column.

Includes codegen support for the passthrough streaming.

Adds a query option, disable_streaming_preaggregations, in case a user
wants to revert to the old behaviour.

Adds TPC-H tests to exercise the new passthrough code path and updates
planner tests to include the new [STREAMING] detail added by the planner.

Change-Id: Ia40525340cba89a8c4e70164ae11447e96494664
Reviewed-on: http://gerrit.cloudera.org:8080/1698
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
2016-02-11 19:03:51 +00:00
Michael Ho
968c61c940 IMPALA-2824: Restore query options after each test.
A failed test case inside a test file leaves the rest of the
test cases in the file unexecuted. Some test cases modify
query options such as the memory limit, which are only restored
by subsequent test cases in the same file. When such a test
case fails, the query options stay modified, causing cascading
failures in other test cases which aren't expected to run with
the modified query options (e.g. a lowered memory limit). This
problem can lead to broken builds, as recorded in IMPALA-2724
and IMPALA-2824.

This change fixes the problem above by checking whether a test
case modified any query options and, if so, restoring the
modified options to their default values. It assumes that a
test should not modify an option specified in its test vector,
so it is safe to restore the modified query options to their
defaults.
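
A hypothetical sketch of the cleanup (illustrative names; this assumes
setting an Impala query option to the empty string restores its
default):

  def run_test_cases(client, test_cases):
      modified_options = set()
      try:
          for case in test_cases:
              modified_options.update(case.options_set)
              case.run(client)  # a failure here skips the remaining cases
      finally:
          # Always reset every touched option so later test files don't
          # inherit e.g. a lowered memory limit.
          for option in modified_options:
              client.execute('SET %s=""' % option)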

Change-Id: Ib88d1dcb6a65183e1afc8eef0c764179a9f6a8ce
Reviewed-on: http://gerrit.cloudera.org:8080/1774
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 03:13:05 +00:00
Michael Ho
ba0bd1d0da IMPALA-2612: Free local allocations once for every row batch when building hash tables.
When building hash tables for the build side in a partitioned
hash join or aggregation, we evaluate the build or probe
side expressions to compute the hash value for each TupleRow.
Evaluating certain expressions (e.g. CastToChar) requires
"local" memory allocation. "Local" memory allocations are supposed
to be freed after processing each row batch.

However, the calls to free local allocations are missing in
PartitionedHashJoinNode::BuildHashTableInternal() and
PartitionedAggregationNode::ProcessStream(). This causes all
"local" memory allocations to accumulate, potentially for the
entire duration of the query or until GetNext() is called.
This may lead to unnecessary memory allocation failures as the
memory limit is exceeded.

This patch calls ExecNode::FreeLocalAllocations() at least once
per row batch when building hash tables. It also adds the missing
checks for the query status in the loops building hash tables.
Please note that QueryMaintenance() isn't called, due to the
overhead of its memory limit checks.
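
A hypothetical sketch of the per-batch pattern (illustrative names
standing in for the C++ code):

  def build_hash_table(stream, hash_table, state):
      for batch in stream.batches():
          for row in batch:
              # Expression evaluation here may make "local" allocations.
              hash_table.insert(row)
          # Release them once per row batch, not once per query.
          state.free_local_allocations()
          if not state.query_ok():  # the missing status check
              return False
      return True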

Change-Id: Idbeab043a45b0aaf6b6a8c560882bd1474a1216d
Reviewed-on: http://gerrit.cloudera.org:8080/1448
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2015-11-26 03:21:46 +00:00
Ippokratis Pandis
f1ef5170cb IMPALA-2168: Do not try to access streams of repartitioned spilled partition in right-joins
In the case of right joins (right outer, right anti, and full outer), if
a spilled partition was repartitioned, we would try to access its
build rows stream even though it had already been set to NULL, leading
to a SEGV.

Change-Id: Ia570333c62a4da1152d8d47be9176ac024ba3f5f
Reviewed-on: http://gerrit.cloudera.org:8080/1209
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-09 16:33:14 -07:00
Ippokratis Pandis
48699de6e3 IMPALA-1621,2241,2271,2330,2352: Lazy switch to IO buffers to reduce min mem needed for PAGG/PHJ
PAGG and PHJ were using an all-or-nothing approach wrt spilling. In
particular, they were trying to switch to IO-sized buffers for both
streams (aggregated and unaggregated in PAGG; build and probe in PHJ)
of every partition (currently 16 partitions for a total of 32
streams), even if some of the streams had very few rows, were empty,
or simply would never spill, so there was no need to allocate
IO buffers for them. That increased the min mem needed by those
operators in many queries.

This patch decouples the decision to switch to IO buffers for each
stream of each partition. A stream switches to IO-sized buffers
only when the rows it contains do not fit in the first two small
buffers (64KB and 512KB respectively). When we decide to spill a
partition, we switch both of its streams to IO buffers.

With this change many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the min mem requirement. For example,
below is the min mem needed (in MB) for some of the TPC-H queries.
Some need half or less of the mem they needed before:

  TPC-H Q3: 645 -> 240
  TPC-H Q5: 375 -> 245
  TPC-H Q7: 685 -> 265
  TPC-H Q8: 740 -> 250
  TPC-H Q9: 650 -> 400
  TPC-H Q18: 1100 -> 425
  TPC-H Q20: 420 -> 250
  TPC-H Q21: 975 -> 620

To make this small-buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call in
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. Previously
we would treat that as an indication that there was no memory
left, start spilling a partition, and switch all streams to
IO buffers. Now we make a best effort, first trying
SwitchToIoBuffers() and, if that is successful, re-attempting the
AllocateRow() call. See IMPALA-2352 for more details.
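
A hypothetical sketch of that best-effort retry (the method names echo
the commit's C++ methods; the control flow here is illustrative):

  def construct_intermediate_tuple(stream):
      row = stream.allocate_row()
      if row is None and stream.using_small_buffers:
          # Exhausted small buffers are not real memory pressure: switch
          # this one stream to IO-sized buffers and retry before spilling.
          if stream.switch_to_io_buffers():
              row = stream.allocate_row()
      return row  # None here means genuine memory pressure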

Another change is that SwitchToIoBuffers() now resets the flag
using_small_buffers_ back to false if we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry calling SwitchToIoBuffers() once we free up some space. See
IMPALA-2330 for more details.

With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271, which are essentially DCHECKs related to
stream::using_small_buffers_.

This patch adds all 22 TPC-H queries to the test_mem_usage_scaling
test and updates its per-query min mem limits. Additionally, it adds
a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries). It also removes some
dead test code.

Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins

Conflicts:

	tests/query_test/test_aggregation.py
2015-09-23 11:07:42 -07:00
ishaan
dbc78aaa2c Enable isilon end to end tests for Impala.
This patch introduces changes to run tests against Isilon, combined with minor cleanup of
the test and client code.
For Isilon, it:
  - Populates the SkipIfIsilon class with appropriate pytest markers.
  - Introduces a new default for the hdfs client in order to connect to Isilon.
  - Cleans up a few test files to take the underlying filesystem into account.
  - Cleans up the interface for metadata/test_insert_behaviour and query_test/test_ddl.

On the client side, we introduce wrappers around a few pywebhdfs methods, specifically:
  - delete_file_dir does not throw an error if the file does not exist.
  - get_file_dir_status automatically strips the leading '/'.

Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f
Reviewed-on: http://gerrit.cloudera.org:8080/370
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-05-27 22:25:12 +00:00
Ippokratis Pandis
4d428440d8 IMPALA-1919: Avoid calling ProcessBatch with out_batch->AtCapacity in right joins
PHJ::GetNext() for RIGHT_OUTER, RIGHT_ANTI and FULL_OUTER joins that
had repartitioned was not checking whether the output batch reached
capacity at the OutputUnmatchedBuild() call. In repartitioned
joins where the list of build partitions was exhausted and the output
batch had already reached capacity, we would call ProcessProbeBatch()
with a full output batch, resulting in a DCHECK. This patch adds the
missing AtCapacity() check.
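
A hypothetical sketch of the added check (illustrative names):

  def get_next(node, out_batch):
      node.output_unmatched_build(out_batch)
      if out_batch.at_capacity():
          return  # hand the full batch up before probing any further
      node.process_probe_batch(out_batch)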

It also adds a new join test (tpch-out-joins) that uses the TPC-H
dataset, and moves into it some of the join tests that were using
that dataset. Running join tests with the larger TPC-H dataset is
needed, for example, in order to trigger repartitions.

Change-Id: I4434ad0683e1b09f75a25b3eb870a817d4988370
Reviewed-on: http://gerrit.cloudera.org:8080/314
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-05-04 19:49:56 +00:00
Matthew Jacobs
99219488d7 IMPALA-1705: Support writing values larger than 64KB to Parquet files
Allow values larger than 64KB to be written to Parquet files. This was
previously limited by a fixed data page size. This commit removes that
limitation by allowing the page size to grow when necessary. This occurs
when there are enough unique values to switch from dictionary encoding
to plain encoding, and then there are huge values larger than the
default 64KB page size. In this case, it may be possible to write files
larger than one HDFS block, but this is an edge case and not worth
introducing additional complexity to handle.
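
A hypothetical sketch of the growth rule (illustrative names; the real
writer is C++):

  DEFAULT_PAGE_SIZE = 64 * 1024

  def ensure_page_capacity(page, value_size):
      # Grow the data page rather than failing when a single encoded
      # value exceeds the space left in a default-sized page.
      needed = page.bytes_used + value_size
      if needed > page.capacity:
          page.resize(max(needed, DEFAULT_PAGE_SIZE))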

Change-Id: I165ef44ba48ff0c3c3203860157a61c45f77df8b
Reviewed-on: http://gerrit.cloudera.org:8080/120
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
2015-03-03 05:44:55 +00:00
ishaan
dee6911b20 Enable loading metadata from the hive metastore snapshot and clean up build scripts.
This patch contains the following changes:
  - Add a metastore_snapshot_file parameter to build.sh
  - Enable skipping loading the metadata.
  - create-load-data.sh is refactored into functions.
  - A lot of scripts source impala-config, which creates a lot of log spew. This has now
    been muted.
  - Unnecessary log spew from compute-table-stats has been muted.
  - build_thirdparty.sh determines its parallelism from the system; it was previously hard
    coded to 4.
  - Only force-load data for a particular dataset if a schema change is detected.

Change-Id: I909336451e5c1ca57d21f040eb94c0e831546837
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5540
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-12-19 13:41:00 -08:00
Matthew Jacobs
25428fdb21 Add support for streaming decompression of gzip text
Compressed text formats currently require entire compressed
files to be read into memory and decompressed in a single call
to the decompression codec. This changes the HdfsTextScanner
to drive gzip in a streaming mode, i.e. to produce partial output
as input is consumed.
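
An illustrative Python analogue of the streaming mode (not the
scanner's C++ code): feed compressed chunks in and yield partial
output as it becomes available.

  import zlib

  def stream_decompress(chunks):
      # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
      d = zlib.decompressobj(16 + zlib.MAX_WBITS)
      for chunk in chunks:
          out = d.decompress(chunk)
          if out:
              yield out  # partial output before the whole file is read
      tail = d.flush()
      if tail:
          yield tail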

Change-Id: Id5c0805e18cf6b606bcf27a5df4b5f58895809fd
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5233
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 05c3cc55e7a601d97adc4eebe03f878c68a33e56)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5385
2014-11-23 01:55:55 -08:00
Taras Bobrovytsky
e5e06c307b [CDH5] Modified TPCH queries to match the specification
Change-Id: Ife2c1fae4d774cd8fe188dfe9c98042ff7e45368
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4997
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-10-29 22:07:33 -07:00
ishaan
c4b4e010ff Buffered Tuple Stream fixes.
This patch fixes two issues:
  - Adds an API to the buffered block mgr that allows an atomic Unpin and GetNewBlock. This
    has the semantics of unpinning a block and giving its buffer to the new block, and it
    is necessary for the tuple stream to make sure another thread does not grab the
    unpinned block in between (see the sketch below).
  - Fixes buffer management when reading an unpinned stream. Before moving on to a new block
    (and unpinning the current one), we need to make sure all the tuples returned from the
    current block have been returned up the operator tree.
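
A hypothetical sketch of the atomic swap from the first fix
(illustrative names; the real API is on the C++ buffered block mgr):

  def unpin_and_get_new_block(block_mgr, old_block):
      with block_mgr.lock:
          # Holding the lock across both steps keeps another thread from
          # claiming the freed buffer between the unpin and the allocation.
          old_block.unpin()
          return block_mgr.get_new_block(reuse_buffer_from=old_block)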

Change-Id: I95ee58d1019dd971f6a7dc19ecafdfa54cdbf942
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4333
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-20 16:05:11 -07:00
Nong Li
a4e2f97845 Fix and add spilling test.
More tests coming.

Change-Id: I09e98adb6b011575572051eff1cd52e7be689fe8
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4311
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-09-13 00:19:21 -07:00
Ippokratis Pandis
8e193abff3 IMPALA-1157: Crash in HdfsParquetWriter when value size larger than data page size.
The HdfsParquetWriter was not checking whether the size needed to encode a single
value is larger than the data page size. In this (rare) case, it would get into an
infinite loop of allocating new pages, eventually crashing when it ran out of
memory.

Change-Id: I1375c047406ebd863fd8fb826ff1ecc0cb6335bc
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3870
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
2014-08-17 12:46:17 -07:00
Alex Behm
5798cf7c6f CDH-18432: Fix assignment of On-clause predicates of semi-/anti-joins.
Change-Id: Iad4b772f170ea70ded0747ce55972b0a5194f88a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3852
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-08-17 12:24:51 -07:00
ishaan
2b5df0c6ff [CDH5] Convert tpch schemas to decimal and change the queries where possible.
I used the following document for reference: http://www.tpc.org/tpch/spec/tpch2.1.0.pdf

Change-Id: Ic84db0628323c90e89552707f214bbb9fa2f2ae0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3132
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-07-08 14:51:43 -07:00
Victor Bittorf
808f9a661a IMPALA-939: Regex should match anywhere in string.
Change-Id: I8dcd337c3b06b632017270670a4f199ec7ada648
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2296
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c97f82eaaf0efe9bd4c3da3d005464f425696a62)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2371
2014-04-25 16:16:15 -07:00
Nong Li
d401f746d4 IMPALA-692: Fix data corruption with dictionary encoded values.
We weren't clearing the state in the dictionary when rolling over to a new
page. The memory for the dictionary (built from the first file) was cleared,
but the dictionary entries were not.

This also had a minor side effect: unused dictionary entries from the first
page were still being written out for subsequent pages, although in practice
this is unlikely to affect the file size much.
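
A hypothetical sketch of the fix (illustrative; the real encoder is
C++):

  class DictEncoder:
      def __init__(self):
          self.values = []  # dictionary values, in id order
          self.index = {}   # value -> dictionary id

      def clear(self):
          # The bug: releasing the value memory alone left stale index
          # entries pointing at ids from the previous dictionary.
          self.values.clear()
          self.index.clear()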

Change-Id: I8e11fc4723dc23d21c5de8a42def13d8238c137b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1072
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:24 -08:00
Lenni Kuff
a2cbd2820e Add Catalog Service and support for automatic metadata refresh
The Impala CatalogService manages the caching and dissemination of cluster-wide metadata.
The CatalogService combines the metadata from the Hive Metastore, the NameNode,
and potentially additional sources in the future. The CatalogService uses the
StateStore to broadcast metadata updates across the cluster.
The CatalogService also directly handles metadata update requests from
impalad servers (DDL requests). It exposes a Thrift interface that allows
impalads to connect directly and execute their DDL operations.
The CatalogService has two main components: a C++ server that implements the
StateStore integration, the Thrift service implementation, and the exporting of
the debug webpage/metrics; and the Java Catalog, which manages the caching and
updating of all the metadata. For each StateStore heartbeat, a delta of all
metadata updates is broadcast to the rest of the cluster.

Some Notes On the Changes
---
* The metadata is all sent as thrift structs. To do this, all catalog objects (Tables/Views,
Databases, UDFs) have a thrift struct to represent them. These are sent with each statestore
delta update.
* The existing Catalog class has been separated into two separate sub-classes, an
ImpaladCatalog and a CatalogServiceCatalog. See the comments on those classes for more
details.

What is working:
* New CatalogService created
* Working with statestore delta updates and latest UDF changes
* DDL performed on Node 1 is now visible on all other nodes without a "refresh".
* Each DDL operation against the Catalog Service will return the catalog version that
  contains the change. An impalad will wait for the statestore heartbeat that contains this
  version before returning from the DDL command.
* All table types (Hbase, Hdfs, Views) getting their metadata propagated properly
* Block location information included in CS updates and used by Impalads
* Column and table stats included in CS updates and used by Impalads
* Query tests are all passing

Still TODO:
* Directly return catalog object metadata from DDL requests
* Poll the Hive Metastore to detect new/dropped/modified tables
* Reorganize the FE code for the Catalog Service. I don't think we want everything in the
  same JAR.

Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda
Reviewed-on: http://gerrit.ent.cloudera.com:8080/601
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:11 -08:00
ishaan
2f7d24b35b Fix tpch-q18 to not use qualified table names. 2014-01-08 10:51:49 -08:00
ishaan
ece902d953 Fix tpch-q18 to insert into the database associated with its scale factor. 2014-01-08 10:51:45 -08:00
Nong Li
58631d9ce0 Fix parquet insert .test files. 2014-01-08 10:49:46 -08:00
Skye Wanderman-Milne
a7e15b1417 Update Parquet scanner to only scan a file if assigned the first split.
Also re-enable Parquet tests.
2014-01-08 10:49:25 -08:00
Nong Li
329763e5ab Disable parquet tests. 2014-01-08 10:49:20 -08:00
Nong Li
20fc700002 Fix precision issue in text table writer. 2014-01-08 10:49:19 -08:00
Lenni Kuff
5f81becd84 Create tables used by insert tests in a supported insert format 2014-01-08 10:49:00 -08:00
Nong Li
0df9476be1 Parquet data loading. 2014-01-08 10:48:48 -08:00
Skye Wanderman-Milne
461a48df2b Refactor testing framework to generate Avro tables. 2014-01-08 10:48:45 -08:00
Nong Li
6e293090e6 Parquet writer.
Change-Id: I7117b545e3d3a7803a219234ad992040a6c7c4ec
2014-01-08 10:48:44 -08:00
Lenni Kuff
328ceed4e7 Add support for generating lzo compressed text files and running tests against lzo 2014-01-08 10:48:38 -08:00
ishaan
5138a720bb IMP-768: Enable the python test framework to check for insert results. 2014-01-08 10:48:22 -08:00
ishaan
09d6d931f4 Change the way data is loaded 2014-01-08 10:48:09 -08:00
Nong Li
a0229cd12e Update tpch schema to use bigint for keys. 2014-01-08 10:47:54 -08:00
Nong Li
02c329b97a Update RC files to use io mgr and remove scanner support for non-io mgr. 2014-01-08 10:47:11 -08:00
Nong Li
f46c654e01 Enable tpch-q21 and tpch-q22 in tests. 2014-01-08 10:47:03 -08:00
Lenni Kuff
837f35eab3 Updated results for more query tests to reflect proper ordering + improved result updating 2014-01-08 10:46:53 -08:00
Lenni Kuff
a035cf4e73 Update results of a few TPC-H queries to reflect proper ordering
Change-Id: I41156b506155c846220cfb097f5e8120503f8da8
2014-01-08 10:46:52 -08:00
Marcel Kornacker
f6af9316d9 Fix for IMP-137: incorrect predicate placement for outer joins
Fixing predicate assignment for outer joins:
- On clause predicates for outer joins are now assigned to the join node
- the exception is On clause predicates that can be directly evaluated
  by the outer-joined tables themselves; those are "pushed down"
- Where clause predicates for outer-joined tables are assigned to the join node
  that materializes the outer join
2014-01-08 10:46:50 -08:00
Lenni Kuff
ef48f65e76 Add test framework for running Impala query tests via Python
This is the first set of changes required to start moving our functional test
infrastructure from JUnit to Python. After investigating a number of
options, I decided to go with a Python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.

As part of this change, we now use our "test vectors" for query test execution.
This is very nice because it means that if you load the "core" dataset, you
know you will be able to run the "core" query tests (specified by
--exploration_strategy when running the tests).

You will see that each combination of table format + query exec options is now
treated as an individual test case. This will make it much easier to debug
exactly where something failed.

These new tests can be run using the script at tests/run-tests.sh
2014-01-08 10:46:50 -08:00
Lenni Kuff
1451650055 Bring online all TPCH planner tests (updated for the new planner) and supported query tests 2014-01-08 10:46:21 -08:00
Lenni Kuff
9f91081183 Modify TPCH tests to always insert into a text table so the workload can run on all file formats 2014-01-08 10:46:21 -08:00
ishaan
42231b7d86 Annotate queries for better benchmark reporting. 2014-01-08 10:45:05 -08:00