Previously, Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes to improve performance on S3. It also enables both INSERT and
LOAD DATA between file systems.
S3 does not support a rename operation, so the staged files on S3 are
copied instead of renamed, which contributes to the slow performance
on S3.
The FinalizeSuccessfulInsert() function no longer makes any
assumptions about the underlying filesystem and works across all
supported filesystems. This is done by adding a full URI field for a
partition's base directory to TInsertPartitionStatus. Likewise, the
HdfsOp class no longer assumes a single filesystem; it obtains a
connection to the filesystem based on the URI of the file it is
operating on.
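To illustrate the idea, here is a minimal Python sketch of URI-based
connection lookup (the real code is C++; the names here are
illustrative, not the actual API):

    from urllib.parse import urlparse

    _cache = {}

    def connect(scheme, authority):
        # Hypothetical stand-in for creating a filesystem client.
        return '%s-connection-to-%s' % (scheme, authority)

    def get_connection(uri):
        # Key the connection on the URI's scheme and authority, e.g.
        # ('hdfs', 'nn:8020') or ('s3a', 'bucket'), instead of
        # assuming one default filesystem for every operation.
        parsed = urlparse(uri)
        key = (parsed.scheme, parsed.netloc)
        if key not in _cache:
            _cache[key] = connect(parsed.scheme, parsed.netloc)
        return _cache[key]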
Added the Python S3 library 'boto3' to access S3 from the Python
tests. A new class called S3Client wraps the boto3 functions with the
same function signatures as PyWebHdfsClient, deriving from an
abstract base class BaseFileSystem so that the two can be used
interchangeably through a 'generic_client'. test_load.py is
refactored to use this generic client. The ImpalaTestSuite setup
creates a client according to the TARGET_FILESYSTEM environment
variable and assigns it to the 'generic_client'.
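A minimal sketch of the shape of these classes (method names follow
PyWebHdfsClient; the real class covers more operations):

    import abc
    import boto3

    class BaseFileSystem(abc.ABC):
        @abc.abstractmethod
        def create_file(self, path, file_data):
            pass

        @abc.abstractmethod
        def delete_file_dir(self, path, recursive=False):
            pass

    class S3Client(BaseFileSystem):
        def __init__(self, bucket):
            self.bucket = bucket
            self.s3 = boto3.client('s3')

        def create_file(self, path, file_data):
            # 'path' is bucket-relative, mirroring an HDFS path.
            self.s3.put_object(Bucket=self.bucket, Key=path,
                               Body=file_data)

        def delete_file_dir(self, path, recursive=False):
            # Sketch only: a recursive delete would list and delete
            # every key under the prefix.
            self.s3.delete_object(Bucket=self.bucket, Key=path)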
P.S.: Currently, test_load.py runs 4x slower on S3 than on HDFS.
Performance needs to be improved in future patches. INSERT
performance is also slower than on HDFS, mainly because of an extra
copy that happens between the staging and final locations of a file.
However, larger INSERTs come closer to HDFS performance than smaller
ones.
ACLs are not handled for S3 in this patch; that still needs to be
discussed before implementing.
Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Some TimestampValue conversion functions assume that the caller
ensures the TimestampValue instance has a valid date or time, but
that is not always true. Change those functions to return the result
in an output parameter and return a boolean indicating whether the
conversion succeeded.
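A hedged Python analogue of the new signature pattern (the real code
is C++ and returns the result through an output parameter; Python's
closest idiom is a (success, value) pair):

    import datetime

    def to_unix_time(ts):
        # Before: callers assumed ts was valid and used the result
        # unconditionally. After: success is reported explicitly, so
        # callers must check it before using the output.
        if ts is None:
            return False, 0  # conversion failed; output undefined
        epoch = datetime.datetime(1970, 1, 1)
        return True, (ts - epoch).total_seconds()

    ok, unix_time = to_unix_time(datetime.datetime(2016, 3, 1))
    if ok:
        print(unix_time)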
Change-Id: I7a68a1e14d9c4ee5d83da760d4d76c20c36bc359
(cherry picked from commit 47d8977f5976b9be405f44add966820138fbda6f)
Reviewed-on: http://gerrit.cloudera.org:8080/2195
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
Aggregations are implemented as a distributed pre-aggregation, an
exchange, then a final aggregation that produces the results of the
aggregation. In many cases the pre-aggregation significantly reduces
the amount of data to be exchanged. In other cases, however, the
pre-aggregation does not greatly reduce the amount of data exchanged,
or it can use a lot of memory and starve other operators that would
benefit more from the additional memory.
In these cases we would be better off "passing through" some input tuples
by transforming them into intermediate tuples without aggregating them.
This patch adds a streaming pre-aggregation mode to
PartitionedAggregationNode that tries to aggregate input rows with a
hash table, but can switch to passing through the input tuples (after
transforming them into the appropriate tuple format). It does this if
it hits a memory limit or if the aggregation is not sufficiently
reducing the node's output (specifically, if the number of aggregated
rows in the hash table is more than half the number of unaggregated rows
consumed by the pre-aggregation). Pre-aggregations never need to spill
because they can pass through rows when under memory pressure.
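A minimal Python sketch of the passthrough decision (the actual logic
lives in the C++ PartitionedAggregationNode; the names and the memory
signal here are illustrative):

    def should_pass_through(aggregated_rows, input_rows,
                            hit_mem_limit):
        # Pass through when memory is tight, or when the hash table
        # is not reducing the input enough: more than half of the
        # consumed unaggregated rows survived as distinct aggregated
        # rows.
        if hit_mem_limit:
            return True
        return input_rows > 0 and aggregated_rows > 0.5 * input_rows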
This initial implementation is quite conservative: it retains the
partitioning of the previous implementation because switching to a
single partition proved to regress performance of some queries while
improving others. It also always keeps hash tables around and updates
them with matching input rows so that reduction statistics are updated
and early decisions to pass through data can be reversed. Future work
could explore different approaches within the new framework to get
larger performance gains. Currently we see significant performance
benefits for queries with a very low reduction factor, e.g. a group
by on a nearly unique column.
Includes codegen support for the passthrough streaming.
Adds a query option, disable_streaming_preaggregations, in case a user
wants to revert to the old behaviour.
Adds TPC-H tests to exercise the new passthrough code path and updates
planner tests to include the new [STREAMING] detail added by the planner.
Change-Id: Ia40525340cba89a8c4e70164ae11447e96494664
Reviewed-on: http://gerrit.cloudera.org:8080/1698
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
PAGG and PHJ were using an all-or-nothing approach with respect to
spilling. In particular, they were trying to switch to IO-sized
buffers for both streams (aggregated and unaggregated in PAGG; build
and probe in PHJ) of every partition (currently 16 partitions for a
total of 32 streams), even if some of the streams had very few rows,
were empty, or simply would not spill, so there was no need to
allocate IO buffers for them. That increased the minimum memory
needed by those operators in many queries.
This patch decouples the decision to switch to IO buffers for each
stream of each partition. A stream switches to IO-sized buffers only
when the rows it contains do not fit in the first two small buffers
(64KB and 512KB respectively). When we decide to spill a partition,
we switch both of its streams to IO buffers.
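As a rough Python sketch of the per-stream rule (the buffer sizes are
those quoted above; everything else is illustrative):

    SMALL_BUFFER_BYTES = (64 + 512) * 1024  # the two small buffers

    def needs_io_buffers(stream_bytes):
        # A stream graduates to IO-sized buffers only once its rows
        # no longer fit in the small buffers; tiny or empty streams
        # never pay for IO buffers.
        return stream_bytes > SMALL_BUFFER_BYTES

    def on_spill(streams):
        # Spilling a partition forces both of its streams
        # (aggregated and unaggregated) onto IO buffers.
        for stream in streams:
            stream['use_io_buffers'] = True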
With this change many streams of PAGG and PHJ nodes do not need to
use IO-sized buffers, reducing the minimum memory requirement. For
example, below is the minimum memory needed (in MB) for some of the
TPC-H queries. Some need half or less of the memory they needed
before:
TPC-H Q3: 645 -> 240
TPC-H Q5: 375 -> 245
TPC-H Q7: 685 -> 265
TPC-H Q8: 740 -> 250
TPC-H Q9: 650 -> 400
TPC-H Q18: 1100 -> 425
TPC-H Q20: 420 -> 250
TPC-H Q21: 975 -> 620
To make this small-buffer optimization work, we had to fix
IMPALA-2352. That is, the AllocateRow() call in
PAGG::ConstructIntermediateTuple() could return unsuccessfully just
because the small buffers of the stream were exhausted. Previously we
treated that as an indication that there was no memory left, started
spilling a partition, and switched all streams to IO buffers. Now we
make a best effort: we first try SwitchToIoBuffers() and, if that is
successful, we re-attempt the AllocateRow() call. See IMPALA-2352 for
more details.
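In sketch form (Python, with the two operations passed in as
callables; the names are illustrative):

    def allocate_with_retry(allocate_row, switch_to_io_buffers):
        # A failed AllocateRow() may only mean the stream's small
        # buffers are exhausted, not that the query is out of memory
        # (IMPALA-2352).
        row = allocate_row()
        if row is None and switch_to_io_buffers():
            row = allocate_row()  # re-attempt with IO-sized buffers
        # Still None here means genuine memory pressure: spill a
        # partition.
        return row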
Another change is that SwitchToIoBuffers() now resets the
using_small_buffers_ flag back to false if we are in a very low
memory situation and it fails to get a buffer. That allows us to
retry SwitchToIoBuffers() once we free up some space. See IMPALA-2330
for more details.
With the above fixes we should also have fixed IMPALA-2241 and
IMPALA-2271, which are essentially DCHECKs related to
stream::using_small_buffers_.
This patch adds all 22 TPC-H queries to the test_mem_usage_scaling
test and updates its per-query minimum memory limits. Additionally,
it adds a new aggregation test that uses the TPC-H dataset for larger
aggregations (TestTPCHAggregationQueries), and removes some dead test
code.
Change-Id: Ia8ccd0b76f6d37562be21fd4539aedbc2a864d38
Reviewed-on: http://gerrit.cloudera.org:8080/818
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
Conflicts:
tests/query_test/test_aggregation.py
Many Python files had a hashbang and the executable bit set even
though they were not intended to be run as standalone scripts. That
makes determining which Python files are actually scripts very
difficult.
A future patch will update the hashbang in real Python scripts so
they use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This patch encapsulates pytest's skipif markers in classes, as shown
in the sketch after this list. It leads to the following benefits:
- Provides context and grouping for the tests being skipped.
- As we improve test reporting, the annotations will give us a better
idea of coverage.
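A minimal sketch of the pattern (the class and marker names here are
illustrative):

    import os
    import pytest

    IS_S3 = os.environ.get('TARGET_FILESYSTEM') == 's3'

    class SkipIfS3:
        # Grouping the markers in a class documents why a test is
        # skipped and makes S3 coverage gaps easy to grep for.
        insert = pytest.mark.skipif(
            IS_S3, reason='INSERT not supported on S3')
        load_data = pytest.mark.skipif(
            IS_S3, reason='LOAD DATA not supported on S3')

    @SkipIfS3.insert
    def test_insert_overwrite():
        pass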
Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3, to help see what coverage is missing. Soon
we'll be reworking some tests and/or adding new tests to fill the
important gaps.
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
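A sketch of how such path variables might be expanded (the
$FILESYSTEM_PREFIX name and the expansion step are assumptions for
illustration):

    import string

    def expand_test_file_vars(text, filesystem_prefix):
        # .test files reference a variable instead of hardcoding an
        # hdfs:// path, so the same test runs against HDFS or S3.
        return string.Template(text).safe_substitute(
            FILESYSTEM_PREFIX=filesystem_prefix)

    sql = expand_test_file_vars(
        "LOAD DATA INPATH '$FILESYSTEM_PREFIX/tmp/data' INTO TABLE t",
        's3a://impala-test')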
Finally, fix buildall.sh to stop the minicluster before applying the
metastore snapshot. Otherwise this fails because the metastore db is
in use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces a new pytest marker that skips tests that
currently don't work when S3 is used as the underlying file system.
The set of blacklisted tests is a superset of the tests that cannot
run on S3; follow-up patches will remove some of the test files from
the blacklist.
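For a whole file, pytest's module-level marker can express a
blacklist entry; a sketch (the condition and reason are
illustrative):

    import os
    import pytest

    # Every test in this module is skipped when targeting S3.
    pytestmark = pytest.mark.skipif(
        os.environ.get('TARGET_FILESYSTEM') == 's3',
        reason='does not work on S3 yet')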
Change-Id: I39a58223d3435f0bd6496ffd00a2d483b751693d
Reviewed-on: http://gerrit.cloudera.org:8080/82
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
The distinct tests exercise the aggregation node code paths. This
patch moves the distinct tests into test_aggregation, so that running
that test covers most of the aggregation code.
Change-Id: Icbe04c51a91e3fda057439a83fca3e61a3890e71
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5868
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
This patch does a few things:
1) Move the metadata tests into their own folder under tests/. It's
useful to loosely categorize them so it's easier to run the subset of
tests most relevant to the changes you are making.
2) Reduce the test vectors for query_tests. We should have identical
coverage in the daily exhaustive runs, but the normal runs should be
much faster. In particular, deemphasize the scanner tests since that
code is more stable now.
3) Misc test cleanup, consolidating Python test files, etc.
Change-Id: I03c2f34877aed192c2a50665bd5e15fa85e12f1e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3831
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
The following changes are included in this commit:
1. Modified the alltypesagg table to include an additional partition key
that has nulls.
2. Added a number of tests in hdfs.test that exercise the partition
pruning logic (see IMPALA-887).
3. Modified all the tests that are affected by the change in alltypesagg.
Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236
I looked at the latest run from master and took the test suites that
had long execution times. This cleans up those suites to either
completely disable them on 'core' or add constraints to limit the
number of test vectors. It shouldn't impact nightly coverage since we
still run the same tests exhaustively.
Change-Id: I10c78c35155b00de0c36d9fc0923b2b1fc6b44de
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3119
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3125
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
This patch allows the text scanner to read 'inf' or 'Infinity' from a
row and correctly translate it into floating-point infinity. It also
adds the is_inf() and is_nan() builtins.
Finally, we change the text table writer to write Infinity and NaN
for compatibility with Hive.
In the future, we might consider adding nan/inf literals to our
grammar (Postgres has this; see
http://www.postgresql.org/docs/9.3/static/datatype-numeric.html).
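As a Python analogue of the behaviour (Python happens to accept the
same spellings the scanner now does; the new builtins play the role
of math.isinf/math.isnan):

    import math

    for token in ('inf', 'Infinity', 'NaN'):
        value = float(token)  # what the text scanner now produces
        print(token, math.isinf(value), math.isnan(value))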
Change-Id: I796f2852b3c6c3b72e9aae9dd5ad228d188a6ea3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2393
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 58091355142cadd2b74874d9aa7c8ab6bf3efe2f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2483
Some tests have constraints that were there only to help reduce
runtime, which reduces coverage when running in exhaustive mode. The
majority of the constraints exist because running a test across
additional dimensions adds no value (or is invalid). This updates the
tests that have legitimate constraints to use two new helper methods
for constraining the table format dimension:
create_uncompressed_text_dimension()
create_parquet_dimension()
These create a dimension that produces a single test vector, either
uncompressed text or Parquet respectively.
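A hypothetical minimal version of the helpers (the real ones take
additional arguments such as the workload; this sketch keeps only the
single-vector idea):

    def _single_format_dimension(table_format):
        # One value in the dimension => exactly one test vector
        # for it.
        return {'table_format': [table_format]}

    def create_uncompressed_text_dimension():
        return _single_format_dimension('text/none')

    def create_parquet_dimension():
        return _single_format_dimension('parquet/none')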
Change-Id: Id85387c1efd5d192f8059ef89934933389bfe247
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2149
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit e02acbd469bc48c684b2089405b4a20552802481)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2290
With this change the Python tests are now run as part of buildall and
the corresponding Java tests have been disabled. The new tests can
also be invoked by calling ./tests/run-tests.sh directly.
This includes a fix from Nong for a bug that caused wrong results for
LIMIT on non-IO-manager formats.
This is the first set of changes required to start moving our
functional test infrastructure from JUnit to Python. After
investigating a number of options, I decided to go with a Python test
executor named py.test (http://pytest.org/). It is very flexible,
open source (MIT licensed), and will enable us to do some cool things
like parallel test execution.
As part of this change, we now use our "test vectors" for query test
execution. This is very nice because it means that if you load the
"core" dataset, you know you will be able to run the "core" query
tests (specified by --exploration_strategy when running the tests).
Each combination of table format + query exec options is now treated
as an individual test case, which will make it much easier to debug
exactly where something failed.
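A sketch of the idea in pytest terms (the formats and options here
are illustrative; the real vectors come from the workload
definitions):

    import pytest

    TABLE_FORMATS = ['text/none', 'seq/snap', 'parquet/none']
    EXEC_OPTIONS = [{'batch_size': 0}, {'batch_size': 1}]

    # Each (table format, exec options) combination becomes its own
    # test case, so a failure pinpoints the exact vector.
    @pytest.mark.parametrize('table_format', TABLE_FORMATS)
    @pytest.mark.parametrize('exec_options', EXEC_OPTIONS)
    def test_query(table_format, exec_options):
        pass  # run the query file against this vector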
These new tests can be run using the script at tests/run-tests.sh.