Commit Graph

68 Commits

Author SHA1 Message Date
Adam Tamas
7295edcc26 IMPALA-9680: Fixed compressed inserts failing
Modified the insert test files so that they determine the database to
use for 'CREATE TABLE LIKE' dynamically.

Tests:
Ran targeted exhaustive test runs for test_insert.py and
test_mt_dop.py, plus a full exhaustive run.

Change-Id: Ib3c7ba02190f57a7ed40311c95a3dd9eca9b474d
Reviewed-on: http://gerrit.cloudera.org:8080/15816
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2020-05-11 19:32:08 +00:00
Adam Tamas
02d84dcf50 IMPALA-9665: Fixed database not found errors in query_test.test_insert
Fixed the usage of the unique_database fixture in test_insert.py so the
tests wait until the database is synced.
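
The waiting pattern is roughly the following hedged sketch; the client
API and polling interval are illustrative, not the exact fix:

  import time

  def wait_for_db(client, db_name, timeout_s=60):
      # Poll until the database is visible before running tests in it.
      deadline = time.time() + timeout_s
      while time.time() < deadline:
          if db_name in client.execute("show databases").get_data():
              return
          time.sleep(0.2)
      raise TimeoutError("database %s never became visible" % db_name)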

Testing:
-tests/run-tests.py query_test/test_insert.py --exploration_strategy=exhaustive

Change-Id: I9b7aa3775dd4375f536d76f2e236ce126f8c78cd
Reviewed-on: http://gerrit.cloudera.org:8080/15766
Reviewed-by: Andrew Sherman <asherman@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-21 06:12:56 +00:00
Adam Tamas
c32849a391 IMPALA-8980: Remove functional*.alltypesinsert from EE tests
-Modified 'test_insert.py' so the tests can run in parallel (see the
sketch below).
  -Every test creates its own temporary tables for insert testing.
-Replaced the SETUP tags with TRUNCATE TABLE statements in the QUERY
  sections.
  -Because the SETUP tag is not used anymore, the corresponding
  code was removed.
-Fixed a test query in 'insert.test'. The test was incorrect, so it was
modified to check the right behavior.
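
A minimal sketch of the per-test table pattern, assuming the standard
unique_database fixture and ImpalaTestSuite-style helpers (names
approximate):

  # Each test works in its own scratch database, so parallel runs never
  # touch a shared table.
  def test_insert_parallel_safe(self, unique_database):
      tbl = "%s.insert_target" % unique_database
      self.execute_query("create table %s (i int)" % tbl)
      self.execute_query("truncate table %s" % tbl)  # replaces SETUP
      self.execute_query("insert into %s values (1)" % tbl)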

Testing:
-tests/run-tests.py query_test/test_insert.py
-impala-py.test tests/query_test/test_insert.py
-the same for test_insert_permutation.py and test_load.py

Change-Id: I257e936868917a2fcc6c030f6c855b247e8a0eea
Reviewed-on: http://gerrit.cloudera.org:8080/15529
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-14 12:18:21 +00:00
Sahil Takiar
8b8a49e617 IMPALA-8557: Add '.txt' to text files, remove '.' at end of filenames
Writes to text tables on ABFS are failing because HADOOP-15860 recently
changed the ABFS behavior when writing files / folders that end with a
'.'. ABFS explicitly does not allow files / folders that end with a dot.
From the ABFS docs: "Avoid blob names that end with a dot (.), a forward
slash (/), or a sequence or combination of the two."

The behavior prior to HADOOP-15860 was to simply drop any trailing dots
when writing files or folders, but that can lead to various issues
because clients may try to read back a file that should exist on ABFS,
but doesn't. HADOOP-15860 changed the behavior so that any attempt to
write a file or folder with a trailing dot fails on ABFS.

Impala writes all text files with a trailing dot due to some odd
behavior in hdfs-table-sink.cc. The table sink writes files with
a "file extension" which is dependent on the file type. For example,
Parquet files have a file extension of ".parq". For some reason, text
files had no file extension, so Impala would try to write text files of
the following form:
"244c5ee8ece6f759-8b1a1e3b00000000_45513034_data.0.".

Several tables created during dataload, such as alltypes, already use
the '.txt' extension for their files. These tables are not created via
Impala's INSERT code path; their files are copied into the table.
However, several other tables created during dataload, such as
alltypesinsert, are populated via Impala. This patch changes the files
in these tables so that they end in '.txt'.

This patch adds the ".txt" extension to all written text files and
modifies the hdfs-table-sink.cc so that it doesn't add a trailing dot to
a filename if there is no file extension.
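
A Python rendering of the naming rule (the actual logic lives in
hdfs-table-sink.cc; the names here are illustrative):

  # Append the extension only when one exists, so filenames never end
  # with a trailing dot.
  def output_file_name(prefix, seq, extension):
      name = "%s_data.%d" % (prefix, seq)  # e.g. "244c5ee8..._data.0"
      return name + "." + extension if extension else name

  output_file_name("q1", 0, "parq")  # -> "q1_data.0.parq"
  output_file_name("q1", 0, "txt")   # -> "q1_data.0.txt"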

Testing:
* Ran core tests
* Re-ran affected ABFS tests
* Added test to validate that the correct file extension is used for
Parquet and text tables
* Manually validated that without the addition of the '.txt' file
extension, files are not written with a trailing dot

Change-Id: I2a9adacd45855cde86724e10f8a131e17ebf46f8
Reviewed-on: http://gerrit.cloudera.org:8080/14621
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-06 05:49:31 +00:00
Sahil Takiar
e8fda1f224 IMPALA-9117, IMPALA-7726: Fixed a few unit tests for ABFS
This patch makes the following changes / fixes when running Impala tests
on ABFS:
* Skips some tests in test_lineage.py that don't work on ABFS / ADLS
(they were already skipped for S3)
* Skips some tests in test_mt_dop.py; the test creates a directory that
ends with a period (and ABFS does not support writing files or
directories that end with a period)
* Removes the ABFS skip flag SkipIfABFS.trash (IMPALA-7726: "Drop with
purge tests fail against ABFS due to trash misbehavior"); I removed
these flags and looped the tests overnight with no failures, so it is
likely that whatever bug was causing this has now been fixed
* Now that HADOOP-15860 has been resolved, the agreed-upon ABFS behavior
is to fail if a client tries to write a file / directory that ends with
a period. I added a new entry to the SkipIfABFS class called
file_or_folder_name_ends_with_period and applied it where necessary
(see the sketch below)
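
A hedged sketch of what such a marker entry might look like; the real
definitions live in the test framework's skip helpers, and the
environment check is an assumption:

  import os
  import pytest

  # Assumption: the target filesystem is advertised via an env var.
  IS_ABFS = os.environ.get("TARGET_FILESYSTEM") == "abfs"

  class SkipIfABFS:
      file_or_folder_name_ends_with_period = pytest.mark.skipif(
          IS_ABFS, reason="ABFS rejects names ending with a period")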

Testing:
* Ran core tests on ABFS

Change-Id: I18ae5b0f7de6aa7628a1efd780ff30a0cc3c5285
Reviewed-on: http://gerrit.cloudera.org:8080/14636
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-06 05:44:01 +00:00
Tim Armstrong
548106f5e1 IMPALA-8451,IMPALA-8905: enable admission control for dockerised tests
This gives us some additional coverage for using admission
control in a simple but realistic configuration.

What are the implications of this change for test stability and
flakiness?

On one hand, we are adding some more unpredictability
to tests, because they may be queued for an arbitrary amount of
time. On the other, we can prevent queries from contending over
memory. Currently we rely on luck to prevent concurrent queries
from forcing each other out-of-memory.

I think the unpredictability from the queueing is
preferable, because we can generally work around these by
fixing tests that are sensitive to being queued, whereas
contention over memory requires us to use crude workarounds
like forcing tests to execute serially.

Added observability for the configured queue wait time for each pool.
I noticed that I did not have a direct way to observe the effective
value when I set configs. This is IMPALA-8905.

I had to tweak tests in a few ways:
* Tests with large strings needed higher memory limits.
* Hardcoded instances of default-pool had to handle root.default
  as well.
* test_query_mem_limit needed to run without a mem_limit. I
  created a special pool root.no-limits with no memory limits
  to allow that.

Testing:
Ran the dockerised build 5-6 times to flush out flaky tests.

Change-Id: I7517673f9e348780fcf7cd6ce1f12c9c5a55373a
Reviewed-on: http://gerrit.cloudera.org:8080/13942
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-27 01:54:39 +00:00
Zoltan Borok-Nagy
3e9cac0cac IMPALA-8854: fix acid insert tests
test_acid_nonacid_insert has been failing lately. HMS became more
strict about checking the capabilities of its clients, and it seems
the Python client doesn't set any capabilities for itself, so HMS
rejects its attempts to create and drop tables.

Now, instead of using the RESET utility from the e2e test framework
(to drop and re-create tables), the test uses a unique database
and creates the tables through Impala. Different file formats are
exercised with the help of the DEFAULT_FILE_FORMAT query option.
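
An illustrative sketch of the approach (client API names approximate):

  # Exercise several file formats through one statement set by flipping
  # the DEFAULT_FILE_FORMAT query option instead of RESET-ing shared
  # tables.
  for fmt in ["parquet", "textfile"]:
      client.execute("set DEFAULT_FILE_FORMAT=%s" % fmt)
      client.execute("create table %s.t_%s (i int)" % (unique_database, fmt))
      client.execute("insert into %s.t_%s values (1)" % (unique_database, fmt))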

Change-Id: I3a82338a7820d0ee748c961c8656fa3319c3929c
Reviewed-on: http://gerrit.cloudera.org:8080/14064
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-15 13:02:55 +00:00
Tim Armstrong
cef76db392 IMPALA-8854: skip test_acid_insert
The test is failing because of a Hive version change in some
configurations. Disabling for now until it can be fixed.

Change-Id: I3bc5cce8b9c3843b5bb8ac4d29b2219411f671b4
Reviewed-on: http://gerrit.cloudera.org:8080/14056
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2019-08-14 09:27:14 +00:00
Csaba Ringhofer
a0c00e508f Bump CDP_BUILD_NUMBER to 1318335
The main reason for bumping is to include HIVE-21838.
Also skips / fixes some tests.

Change-Id: I432e8c02dbd349a3507bfabfef2727914537652c
Reviewed-on: http://gerrit.cloudera.org:8080/14005
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-08 15:29:13 +00:00
Zoltan Borok-Nagy
48bb93d474 IMPALA-8636: fix flakiness of ACID INSERT tests
I had to add @UniqueDatabase.parametrize(sync_ddl=True) to some e2e
tests because they were broken in exhaustive mode. When the tests run
with sync_ddl=True, the test files are executed against multiple
impalads, which means that each statement in the .test file is executed
against a random impalad.
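
A usage sketch; the decorator comes from the e2e framework, and the
test body shown is illustrative:

  @UniqueDatabase.parametrize(sync_ddl=True)
  def test_acid_insert(self, vector, unique_database):
      self.run_test_case('QueryTest/acid-insert', vector,
                         use_db=unique_database)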

Change-Id: Ic724e77833ed9ea58268e1857de0d33f9577af8b
Reviewed-on: http://gerrit.cloudera.org:8080/13966
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-01 17:44:26 +00:00
Zoltan Borok-Nagy
6360657cb4 IMPALA-8636: Implement INSERT for insert-only ACID tables
This commit adds INSERT support for insert-only ACID tables.

The Frontend opens a transaction for INSERT statements when the target
table is transactional. It also allocates a write ID for the target
table. The Frontend aborts the transaction if an error occurs during
analysis/planning.

The Backend gets the transaction id and the write id in TFinalizeParams.
The write id is also set for the HDFS table sinks. The sinks write
the files at their final destination, which is an ACID base or delta
directory. There is no need for finalization of transactional INSERTs.

When the sinks have finished writing the data, the Coordinator invokes
updateCatalog() on catalogd, which also commits the transaction if
everything went well; otherwise the Coordinator aborts the transaction.

Testing:
* added new tables during dataload
* added acid-insert.test file with INSERT statements against the new
  tables
* test insertions between ACID and non-ACID tables
* test error scenarios via debug actions
* added integration test with Hive to test_hms_integration.py. The test
  inserts data with Impala and reads with Hive. (These integration
  tests only run with exhaustive exploration strategy)

TODO in following commits:
* add locks and heartbeats (without heartbeats long-running transactions
  might be aborted by HMS)
* implement TRUNCATE
* CTAS creates files in the 'root' directory of the table/partition. It
  is handled correctly during SELECT, but it would be better to create a
  base directory from the beginning. Hive creates a delta directory
  for CTAS.

Change-Id: Id6c36fa6902676f06b4e38730f737becfc7c06ad
Reviewed-on: http://gerrit.cloudera.org:8080/13559
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-27 13:45:51 +00:00
Abhishek
97a6a3c807 IMPALA-8617: Add support for lz4 in parquet
A new enum value LZ4_BLOCKED was added to the THdfsCompression enum to
distinguish it from the existing LZ4 codec. The LZ4_BLOCKED codec
represents the block compression scheme used by Hadoop. It's similar to
SNAPPY_BLOCKED as far as the block format is concerned, with the only
difference being the codec used for compression and decompression.

Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's
lz4 block compression scheme.

The Lz4BlockCompressor treats the input
as a single block and generates a compressed block with the following
layout:
  <4 byte big endian uncompressed size>
  <4 byte big endian compressed size>
  <lz4 compressed block>
The HDFS Parquet table writer should call the Lz4BlockCompressor
with the ideal input size (the unit of compression in Parquet is a
page), so the Lz4BlockCompressor does not further break down the input
into smaller blocks.

The Lz4BlockDecompressor, on the other hand, should be compatible with
blocks written by Impala and other engines in the Hadoop ecosystem. It
can decompress data in the following format:
  <4 byte big endian uncompressed size>
  <4 byte big endian compressed size>
  <lz4 compressed block>
  ...
  <4 byte big endian compressed size>
  <lz4 compressed block>
  ...
  <repeated until the uncompressed size from the outer block is consumed>
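
A hedged Python sketch of the layout, using the third-party 'lz4'
package in raw block mode (the real implementation is C++ in the
backend; the decoder below handles only the single-inner-block case
that the compressor emits):

  import struct
  import lz4.block

  def lz4_blocked_compress(data):
      # One inner block: outer uncompressed size, then <clen><block>.
      comp = lz4.block.compress(data, store_size=False)
      return struct.pack(">II", len(data), len(comp)) + comp

  def lz4_blocked_decompress(buf):
      total, clen = struct.unpack_from(">II", buf, 0)
      # Real readers loop over inner <clen><block> pairs until 'total'
      # bytes have been produced; one pair suffices for our writer.
      return lz4.block.decompress(buf[8:8 + clen], uncompressed_size=total)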

Externally users can now set the lz4 codec for parquet using:
  set COMPRESSION_CODEC=lz4
This gets translated into LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, when reading lz4 compressed parquet
data, the LZ4_BLOCKED codec is used.

Testing:
 - Added unit tests for LZ4_BLOCKED in decompress-test.cc
 - Added unit tests for Hadoop compatibility in decompress-test.cc,
   basically being able to decompress an outer block with multiple inner
   blocks (the Lz4BlockDecompressor description above)
 - Added interoperability tests for Hive and Impala for all parquet
   codecs. New test added to
   tests/custom_cluster/test_hive_parquet_codec_interop.py

Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Reviewed-on: http://gerrit.cloudera.org:8080/13582
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-19 04:43:43 +00:00
Abhishek
51e8175c62 IMPALA-8450: Add support for zstd in parquet
Makefile was updated to include zstd in the ${IMPALA_HOME}/toolchain
directory. Other changes were made to make zstd headers and libs
accessible.

Class ZstandardCompressor/ZstandardDecompressor was added to provide
interfaces for calling ZSTD_compress/ZSTD_decompress functions. Zstd
supports different compression levels (clevel) from 1 to
ZSTD_maxCLevel(). Zstd also supports negative clevels, but since the
negative values represent uncompressed data they won't be supported. The default
clevel is ZSTD_CLEVEL_DEFAULT.

HdfsParquetTableWriter was updated to support the ZSTD codec. The
new codecs can be set using the existing query option as follows:
  set COMPRESSION_CODEC=ZSTD:<clevel>;
  set COMPRESSION_CODEC=ZSTD; // uses ZSTD_CLEVEL_DEFAULT
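
A small round-trip illustration with the Python 'zstandard' package
(not Impala's C++ code path; 3 is zstd's documented default level):

  import zstandard as zstd

  def zstd_roundtrip(data, clevel=3):
      comp = zstd.ZstdCompressor(level=clevel).compress(data)
      return zstd.ZstdDecompressor().decompress(comp)

  assert zstd_roundtrip(b"hello" * 100, clevel=10) == b"hello" * 100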

Testing:
  - Added unit test in DecompressorTest class with ZSTD_CLEVEL_DEFAULT
    clevel and a random clevel. The unit test decompresses compressed
    input data and validates the result. It also tests for
    expected behavior when passing an over/under sized buffer for
    decompressing.
  - Added unit tests for valid/invalid values for COMPRESSION_CODEC.
  - Added an e2e test in test_insert_parquet.py which tests writing and
    reading (null/non-null) data into/from a table (with columns of
    different data types) using multiple codecs. Other existing e2e
    tests were updated to also use the parquet/zstd table format.
  - Manual interoperability tests were run between Impala and Hive.

Change-Id: Id2c0e26e6f7fb2dc4024309d733983ba5197beb7
Reviewed-on: http://gerrit.cloudera.org:8080/13507
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-05 11:15:04 +00:00
Michael Ho
ed0a2b6010 IMPALA-7176: Increase wait time in test_insert_mem_limit
test_insert_mem_limit in test_insert.py will wait for all fragments
to exit before proceeding after the test query hits a memory limit.
On some slower EC2 instances with Centos6, the default wait time of
60s may occasionally cause the test to fail due to timing out while
waiting for all fragments to exit.

This change increases the timeout to 180 seconds.

Testing done:
- Looped test_insert_mem_limit for 100 times on Centos6;

Change-Id: I2e14bef79c6c6fb0004270319f1c491194260438
Reviewed-on: http://gerrit.cloudera.org:8080/13292
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-10 00:38:54 +00:00
Tim Armstrong
23d7a6dce6 IMPALA-8492: reenable large string tests in docker
IMPALA-4865 is fixed so these now pass. I noticed
that the IMPALA-4874 test occasionally hit
"Memory Limit Exceeded" when looped, so I reduced
the data size there slightly.

Testing:
Looped the tests locally against a dockerised minicluster
for a while.

Change-Id: I030f4eff2d3fb771fc92b760efb13170e68285dc
Reviewed-on: http://gerrit.cloudera.org:8080/13233
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-05 00:00:31 +00:00
Tim Armstrong
2ca7f8e7c0 IMPALA-7995: part 1: fixes for e2e dockerised impala tests
This fixes all core e2e tests running on my local dockerised
minicluster build. I do not yet have a CI job or script running
but I wanted to get feedback on these changes sooner. The second
part of the change will include the CI script and any follow-on
fixes required for the exhaustive tests.

The following fixes were required:
* Detect docker_network from TEST_START_CLUSTER_ARGS
* get_webserver_port() does not depend on the caller passing in
  the default webserver port. It failed previously because it
  relied on start-impala-cluster.py setting -webserver_port
  for *all* processes.
* Add SkipIf markers for tests that don't make sense or are
  non-trivial to fix for containerised Impala.
* Support loading Impala-lzo plugin from host for tests that depend on
  it.
* Fix some tests that had 'localhost' hardcoded - instead it should
  be $INTERNAL_LISTEN_HOST, which defaults to localhost.
* Fix bug with sorting impala daemons by backend port, which is
  the same for all dockerised impalads.

Testing:
I ran tests locally as follows after having set up a docker network and
starting other services:

  ./buildall.sh -noclean -notests -ninja
  ninja -j $IMPALA_BUILD_THREADS docker_images
  export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster"
  export FE_TEST=false
  export BE_TEST=false
  export JDBC_TEST=false
  export CLUSTER_TEST=false
  ./bin/run-all-tests.sh

Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755
Reviewed-on: http://gerrit.cloudera.org:8080/12639
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-13 02:42:32 +00:00
Philip Zeyliger
214f61a180 IMPALA-8250: Clean up JNI warnings.
Using LIBHDFS_OPTS+="-Xcheck:jni" revealed a handful of warnings related to
(a) checking for exceptions and (b) leaking local references.

Checking for exceptions required sprinkling RETURN_ERROR_IF_EXC
left and right. I chose not to expand the JniCall infrastructure
to handle this more generally at the moment.

The leaky local references are a bit harder. In the logs, they show up
as "WARNING: JNI local refs: 2597, exceeds capacity: 35" or similar. A
few of these errors seem not to be in our code. The ones that I've
found in our code stemmed from HBaseTableScanner::GetRowKey(): this
method uses local references and wasn't releasing them. Using a
JniLocalFrame seems to have taken care of the warnings.

I have added code to skip test_large_strings when JNI checking is
enabled. This test takes forever (presumably because JNI is checking
bounds on strings very aggressively), and times out. The timeout also
causes some metric-related checks to fail (since a query is still in
flight).

Debugging this required customizing my JDK to give stack traces
when these warnings occurred. The following diff facilitated
this.

  diff -r 76a9c9cf14f1 src/share/vm/prims/jniCheck.cpp
  --- a/src/share/vm/prims/jniCheck.cpp	Tue Jan 15 10:43:31 2019 +0000
  +++ b/src/share/vm/prims/jniCheck.cpp	Wed Feb 27 11:57:13 2019 -0800
  @@ -143,11 +143,30 @@
   static const char * fatal_instance_field_mismatch = "Field type (instance) mismatch in JNI get/set field operations";
   static const char * fatal_non_string = "JNI string operation received a non-string";

  +// thisone: whether to print every time, or maybe, depending on future
  +// how many future stacks we want printed (totally racy); helps catch
  +// missing exception handling if there's a way to tickle that code
  +// reliably.
  +static inline void dump_native_stack(JavaThread* thr, bool thisone, int future) {
  +  static int fut_stacks = 0; // racy!
  +  if (fut_stacks > 0) {
  +    thisone = true;
  +    fut_stacks--;
  +  }
  +  if (future > 0) fut_stacks = future;
  +  if (thisone) {
  +    frame fr = os::current_frame();
  +    char buf[6000];
  +    tty->print_cr("Thread: %s %d", thr->get_thread_name(), thr->osthread()->thread_id());
  +    print_native_stack(tty, fr, thr, buf, sizeof(buf));
  +  }
  +}

   // When in VM state:
   static void ReportJNIWarning(JavaThread* thr, const char *msg) {
     tty->print_cr("WARNING in native method: %s", msg);
     thr->print_stack();
  +  dump_native_stack(thr, true, 0);
   }

   // When in NATIVE state:
  @@ -199,11 +218,14 @@
         tty->print_cr("WARNING in native method: JNI call made without checking exceptions when required to from %s",
           thr->get_pending_jni_exception_check());
         thr->print_stack();
  +      dump_native_stack(thr, true, 10);
       )
       thr->clear_pending_jni_exception_check(); // Just complain once
     }
   }

  +
  +
   /**
    * Add to the planned number of handles. I.e. plus current live & warning threshold
    */
  @@ -254,9 +276,12 @@
         tty->print_cr("WARNING: JNI local refs: %zu, exceeds capacity: %zu",
             live_handles, planned_capacity);
         thr->print_stack();
  +      dump_native_stack(thr, true, 0);
       )
       // Complain just the once, reset to current + warn threshold
       add_planned_handle_capacity(handles, 0);
  +  } else {
  +    dump_native_stack(thr, false, 0);
     }
   }

Change-Id: Idd1709f749a764c1d947704bc64306493863b45f
Reviewed-on: http://gerrit.cloudera.org:8080/12660
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-08 03:35:09 +00:00
poojanilangekar
3a3ab7ff8f IMPALA-6544/IMPALA-7070: Disable tests which fail due to S3's eventual consistency
This patch is a temporary fix to disable tests which fail due to
S3's eventually consistent behavior. The permanent fix would
involve running tests with S3Guard enabled.

Change-Id: I676faa191bec8b156e430661c22ee69242eeba9d
Reviewed-on: http://gerrit.cloudera.org:8080/12203
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-12 03:42:48 +00:00
Tim Armstrong
11a48234ec IMPALA-7648: add tests for OOM from scanning big string
This extends test_insert_large_string to exercise the OOM code path.

Change-Id: I0d1e9b2e8cf6e167da2542950ae90717d4865e9b
Reviewed-on: http://gerrit.cloudera.org:8080/12123
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-12-27 22:23:40 +00:00
Bikramjeet Vig
cf28d8acbd IMPALA-7994: Prevent test_insert_large_string from causing OOM issues
test_insert_large_string uses up to 4GB of untracked memory, which
results in random OOMs during exhaustive testing on release builds.
Queries run faster on release builds, which might result in a different
set of tests running together compared to debug builds. As a result,
queries requiring more memory can run together with
test_insert_large_string and eventually encounter OOM errors.

Testing:
Successfully ran exhaustive tests twice on release build.

Change-Id: I6c950f6860b2f86865dbc5ce60055175e2c0bebc
Reviewed-on: http://gerrit.cloudera.org:8080/12110
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-12-19 21:28:04 +00:00
Tim Armstrong
58cd69ac48 IMPALA-402: test for random partitioning in insert
This adds a basic regression test for the bug reported in IMPALA-402.

Testing:
Exhaustive build.

Looped the modified test overnight.

Change-Id: I4bbca5c64977cadf79dabd72f0c8876a40fdf410
Reviewed-on: http://gerrit.cloudera.org:8080/11799
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-06 00:00:12 +00:00
Sean Mackrory
7a022cf36a IMPALA-7681. Add Azure Blob File System (ADLS Gen2) support.
HADOOP-15407 adds a new FileSystem implementation called "ABFS" for the
ADLS Gen2 service. It's in the hadoop-azure module as a replacement for
WASB. Filesystem semantics should be the same, so skipped tests and
other behavior changes have simply mirrored what is done for ADLS Gen1
by default. Tests skipped on ADLS Gen1 due to eventual consistency of
the Python client can be run against ADLS Gen2.

Change-Id: I5120b071760e7655e78902dce8483f8f54de445d
Reviewed-on: http://gerrit.cloudera.org:8080/11630
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-20 06:43:00 +00:00
Tim Armstrong
d05f73f415 IMPALA-7647: Add HS2/Impyla dimension to TestQueries
I used some ideas from Alex Leblang's abandoned patch:
https://gerrit.cloudera.org/#/c/137/ in order to run .test files through
HS2. The advantage of using Impyla is that much of the code will be
reusable for any Python client implementing the standard Python dbapi
and does not require us implementing yet another thrift client.

This gives us better coverage of non-trivial result sets from HS2,
including handling of NULLs, error logs and more interesting result
sets than the basic HS2 tests.

I added HS2 coverage to TestQueries, which has a reasonable variety of
queries and covers the data types in alltypes. I also added
TestDecimalQueries, TestStringQuery and TestCharFormats to get coverage
of DECIMAL, CHAR and VARCHAR that aren't in alltypes. Coverage of
result sets with NULLs was limited, so I added a couple of queries.

Places where results differ from Beeswax:
* Impyla is a Python dbapi client, so it must convert timestamps into
  Python datetime objects, which only have microsecond precision.
  Therefore result timestamps with nanosecond precision are truncated.
* The HS2 interface reports the NULL type as BOOLEAN as a workaround for
  IMPALA-914.
* The Beeswax interface reported VARCHAR as STRING, but HS2 reports
  VARCHAR.

I dealt with different results by adding additional result sections so
that the expected differences between the clients/protocols were
explicit.

Limitations:
* Not all of the same methods are implemented as for beeswax, so some
  tests that have more complicated interactions with the client will not
  work with HS2 yet.
* We don't have a way to get the affected row count for inserts.

I also simplified the ImpalaConnection API by removing some unnecessary
methods and moved some generic methods to the base class.

Testing:
* Confirmed that it detected IMPALA-7588 by re-applying the buggy patch.
* Ran exhaustive and CentOS6 tests.

Change-Id: I9908ccc4d3df50365be8043b883cacafca52661e
Reviewed-on: http://gerrit.cloudera.org:8080/11546
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-09 00:45:10 +00:00
Tianyi Wang
21d92aacbf IMPALA-7019: Schedule EC as remote & disable failed tests
This patch schedules HDFS EC files without considering locality. Failed
tests are disabled, and a jenkins build should succeed with export
ERASURE_CODING=true.

Testing: It passes core tests.

Cherry-picks: not for 2.x.

Change-Id: I138738d3e28e5daa1718c05c04cd9dd146c4ff84
Reviewed-on: http://gerrit.cloudera.org:8080/10413
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-22 01:10:14 +00:00
Joe McDonnell
1e6544f7da IMPALA-7023: Wait for fragments to finish for test_insert.py
The arrangement of tests in test_insert.py changed with
IMPALA-7010, splitting out the memory limit tests into
test_insert_mem_limit(). On exhaustive, the combination
of test dimensions means test_insert_mem_limit() executes
11 different combinations. Each of these statements can
use a large amount of memory and this is not cleaned
up immediately. This has been causing
test_insert_overwrite(), which immediately follows
test_insert_mem_limit(), to hit the process memory limit.

This changes test_insert_mem_limit() to make it wait
for its fragments to finish.

Change-Id: I5642e9cb32dd02afd74dde7e0d3b31bddbee3ccd
Reviewed-on: http://gerrit.cloudera.org:8080/10426
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-16 23:33:57 +00:00
Tim Armstrong
25c13bfdd6 IMPALA-7010: don't run memory usage tests on non-HDFS
Moved a number of tests with tuned mem_limits. In some cases
this required separating the tests from non-tuned functional
tests.

TestQueryMemLimit used very high and very low limits only, so seemed
safe to run in all configurations.

Change-Id: I9686195a29dde2d87b19ef8bb0e93e08f8bee662
Reviewed-on: http://gerrit.cloudera.org:8080/10370
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-05-11 22:41:49 +00:00
Michael Ho
ed72910e96 IMPALA-6262: Always initialize runtime profile for DataSink
This change moves the creation of the runtime profile from DataSink::Prepare()
to the ctor of DataSink derived classes. This makes sure that DataSink::Close()
and other functions can access the profile even if the DataSink fails to initialize.

Testing done: Added a test case which triggers failure in the initialization of output
expressions in a HdfsTableSink. Impalad crashed consistently without the fix.

Change-Id: I2a683000ef180027b929dbebe78bc2a530a4767e
Reviewed-on: http://gerrit.cloudera.org:8080/8770
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-07 09:47:09 +00:00
Tim Armstrong
5b670f49b6 IMPALA-5640: re-enable gzip for parquet insert tests
This addresses a gap in test coverage. There are no known bugs here so
we expect this to work.

Testing:
Ran exhaustive build.

Change-Id: I4bea8bac37bb1e72f3ba0b2e162e6fc544aec8a8
Reviewed-on: http://gerrit.cloudera.org:8080/7398
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Impala Public Jenkins
2017-07-12 00:18:44 +00:00
Lars Volker
933f2ce7fd IMPALA-4876: Remove _test suffix from test names
Now that IMPALA-4735 has been fixed, this suffix is no longer needed to
make test names prefix-free.

Change-Id: Ie63d145c94c02ec67e81d0c51a33d20685fba73e
Reviewed-on: http://gerrit.cloudera.org:8080/5886
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-02-03 05:16:09 +00:00
David Knupp
f590bc0da6 IMPALA-4750: Rename test infra classes so they don't mimic test classes.
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prefix
the class names with 'Impala':

git grep -l 'TestDimension' | xargs \
    sed -i 's/TestDimension/ImpalaTestDimension/g'

git grep -l 'TestMatrix' | xargs \
    sed -i 's/TestMatrix/ImpalaTestMatrix/g'

git grep -l 'TestVector' | xargs \
    sed -i 's/TestVector/ImpalaTestVector/g'

The tests all passed in an exhaustive run on the upstream jenkins
server:

http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/

Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-26 23:40:22 +00:00
Lars Volker
8ea21d099f IMPALA-2523: Make HdfsTableSink aware of clustered input
IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test_insert took 1778 seconds, compared to 1002 seconds with just the
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Reviewed-on: http://gerrit.cloudera.org:8080/4863
Reviewed-by: Lars Volker <lv@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-22 02:51:20 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Taras Bobrovytsky
609b80410e Clean up Python test import statements
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.

Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-07-15 23:26:18 +00:00
Tim Armstrong
3810b7c413 IMPALA-3780: avoid many small reads past end of block
The text scanner had some pathological behaviour when reading
significantly past the end of its scan range, e.g. reading a 256MB
string that's split across blocks. ScannerContext defaulted to issuing
1KB reads, even if the scan node requested significantly more data.
E.g. if the Parquet scanner called ReadBytes(16MB), this was chopped up
into 1KB reads, which were reassembled in boundary_buffer_.

Increase the minimum read size in this case to 64kb. Reading that amount
of data should not have any significant overhead even if we only read
a few bytes past the end of the scan range.

ScannerContext implements a saner default algorithm that will work better
if scanners make many small reads: it starts with 64kb reads and doubles
the size of each successive read past the end of the scan range. We
also correctly pass the 'read_past_size' into GetNextBuffer(), so that
we always read the right amount of data.
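
The schedule itself is simple arithmetic; a Python rendering (the real
code is in ScannerContext, and the names are illustrative):

  MIN_READ_PAST_SIZE = 64 * 1024  # 64KB starting point

  def read_past_size(reads_past_end, requested):
      # 64KB, 128KB, 256KB, ... doubled on each successive read past the
      # end of the scan range, but never less than the scanner asked for.
      return max(MIN_READ_PAST_SIZE << reads_past_end, requested)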

Also save some time by pre-sizing the boundary buffer to the correct
size instead of reallocating it multiple times.

Testing:
Add test case that exercises the code paths for very large strings.

Performance:
The queries in the test case are vastly faster than before. E.g. 0.6s
vs ~60s for the count(*) query.

Change-Id: Id90c5dea44f07dba5dd465cf325fbff28be34137
Reviewed-on: http://gerrit.cloudera.org:8080/3518
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-07-08 19:42:18 -07:00
Sailesh Mukil
ed7f5ebf53 IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems
Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.

S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.

The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions about the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.

Added a python S3 client based on 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions and has the same function
signatures as PyWebHdfsClient, deriving from an abstract base class
BaseFileSystem so that the clients can be used interchangeably through
a 'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.
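
A hedged sketch of the wrapper shape (method names approximate; the
real classes live in the test utilities):

  import boto3

  class BaseFileSystem(object):
      def create_file(self, path, data):
          raise NotImplementedError

      def delete_file(self, path):
          raise NotImplementedError

  class S3Client(BaseFileSystem):
      def __init__(self, bucket):
          self.s3 = boto3.resource("s3")
          self.bucket = bucket

      def create_file(self, path, data):
          self.s3.Object(self.bucket, path).put(Body=data)

      def delete_file(self, path):
          self.s3.Object(self.bucket, path).delete()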

P.S.: Currently, test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS performance than
smaller inserts.

ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.

Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:49 -07:00
Michael Brown
2826df93d2 IMPALA-3427: test_insert_wide_table: use unique_database fixture
When test_insert_wide_table runs in parallel, it attempts to DROP/CREATE
tables with the same name in the same database. A race thus exists in
which one test can stomp on another's tables.

The fix is to use the unique_database fixture, in which each test can
run in parallel within its own database, and thus its own table.

Change-Id: If3b22f1b089d20c6024d6d680af82f102342394e
Reviewed-on: http://gerrit.cloudera.org:8080/2871
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Michael Brown <mikeb@cloudera.com>
2016-05-12 14:17:44 -07:00
Vlad Berindei
b6c20b2a40 Allow Impala to run against local filesystem.
Allow Impala to start only with a running HMS (and no additional services like HDFS,
HBase, Hive, YARN) and use the local file system.

Skip all tests that need these services, use HDFS caching, or assume that multiple
impalads are running.

To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions, since this is where the test data will be extracted.

Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS             1348 tests passed
S3               1157 tests passed
Local Filesystem 1161 tests passed

Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
2015-12-05 06:48:32 +00:00
Casey Ching
074e5b4349 Remove hashbang from non-script python files
Many python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.

Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-04 05:26:07 +00:00
ishaan
09e5eaeda2 Introduce classes for pytest's skipif markers.
This patch encapsulates pytest's skipif markers in classes. It leads to the following
benefits:
  - Provide context and grouping for tests being skipped.
  - As we improve test reporting, annotations will give us a better idea of coverage.

Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
2015-04-19 03:09:59 +00:00
Dan Hecht
c8fb10f50a S3: Some more work toward enabling additional S3 test coverage
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3 to help see what coverage is missing.  Soon
we'll be reworking some tests and/or adding new tests to get back the
important gaps.

Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables.  This is a step toward enabling some
more tests against S3.

Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this fails since the ms db is in
use.

Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-03 08:29:13 +00:00
ishaan
11cd7d1d46 Blacklist tests that don't work on s3
This patch introduces a new pytest marker that skips tests that currently don't work when
s3 is used as the underlying file system. The set of blacklisted tests is a superset of
the tests that cannot be run with s3. Follow-up patches will remove some of the test files
from the blacklist.

Change-Id: I39a58223d3435f0bd6496ffd00a2d483b751693d
Reviewed-on: http://gerrit.cloudera.org:8080/82
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
2015-02-24 01:43:28 +00:00
Nong Li
d52a620737 Add support for writing compressed text.
Change-Id: I314b925594801ae4b5c47248d998801aa0b37270
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4205
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-07 22:08:30 -07:00
Victor Bittorf
820e1c070b Support writing to Avro files
Introduces support for writing tables stored as Avro files. This supports writing all
data types except TIMESTAMP. Supports the following COMPRESSION_CODECs: NONE, DEFLATE,
SNAPPY.

Change-Id: Ica62063a4f172533c30dd1e8b0a11856da452467
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3863
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 15c6066d05d5077bee0d5123d26777b0715eb9c6)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4056
2014-08-27 13:41:42 -07:00
Victor Bittorf
f2ef06bef1 SEQUENCEFILE: Add support for writing sequence files.
This supports both uncompressed and block compressed formats. Row compressed formats are
not supported. The type of compression is specified using a query parameter
COMPRESSION_CODEC with values NONE, GZIP, BZIP2, and SNAPPY.

Note: this patch only has basic testing. More extensive testing will be done when this
writer is used in data loading.

Change-Id: Id284bd4f3a28e27e49d56b1127cdc83c736feb61
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3541
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-08-17 12:45:10 -07:00
Nong Li
fd35cee887 Reorganize/reduce end to end test time.
This patch does a few things:
1) Move the metadata tests into their own folder under tests/. I think it's useful to
loosely categorize them so it's easier to run a subset of the tests that are most
useful for the changes you are making.

2) Reduce the test vectors for query_tests. We should have identical coverage in
the daily exhaustive runs but the normal runs should be much better. In particular,
deemphasizing scanner tests since that code is more stable now.

3) Misc test cleanup/consolidate python test files/etc.

Change-Id: I03c2f34877aed192c2a50665bd5e15fa85e12f1e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3831
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-08-17 12:43:57 -07:00
Ippokratis Pandis
15f26f3991 Fixing a bug in test_insert that was inserting dummy records in tinytable for text/codec
Change-Id: I3e7ac5cd97de6ff4cd0eab411ac4ed4ff6c77508
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3751
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
2014-08-05 13:19:50 -07:00
Ippokratis Pandis
99b23f1285 Skipping the TestUnsupportedInsertFormats test for compressed text.
Realized that if we run it with text/codec it will successfully insert a record
("hi", "there") in tinytable, which is undesirable. Skipping this test for
compressed text.

Change-Id: I8076f4fcac17d7f7530ac4993b0d4b3bd89d5fa0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3668
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-08-05 13:19:49 -07:00
Ippokratis Pandis
572a4aed95 Updating the expectation of TestUnsupportedInsertFormats test.
Previously the test expected that inserting with any combination of text/codec where
codec!=none would fail. With the read compressed text patch, such an insert will
succeed by inserting uncompressed text, ignoring the compression type. This is a
temporary fix for this (code coverage) test to succeed, until the text writers are
in.

Change-Id: Ib34bf661ee90e3bcbb76a225df8809ae13e835fa
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3659
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
2014-08-05 13:19:49 -07:00
Lenni Kuff
b3ebfddadd Allow tests to access query result column values by col alias or col position
For example, you can now do something like:
result_set = execute("select * from tbl")
result_row = result_set[0]
result_row['col_alias'] or result_row[4]

to access column values. If the column alias/position does not exist an exception is
thrown.
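
A hedged sketch of how such dual indexing might be implemented (names
illustrative, not the framework's actual class):

  class ResultRow(object):
      """Allows lookup by integer position or by column alias."""
      def __init__(self, values, aliases):
          self.values = values
          self.by_alias = dict((a, i) for i, a in enumerate(aliases))

      def __getitem__(self, key):
          if isinstance(key, int):
              return self.values[key]  # IndexError if out of range
          return self.values[self.by_alias[key]]  # KeyError if unknown

  row = ResultRow([1, "abc"], ["id", "name"])
  assert row[1] == row["name"] == "abc"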

Change-Id: Ie4b65619ed17fd90bf39e0966a7fc7e1180dbc5c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2719
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2922
2014-06-09 23:24:26 -07:00
Skye Wanderman-Milne
60db4d4d82 CDH-18416: Don't inline ReadWriteUtil::ReadZLong()
For wide Avro tables, ReadZLong() would get inlined many times into a
single function body, causing LLVM to crash. Not inlining doesn't seem
to have a performance impact on narrow tables, and helps with wide
tables.

This change also adds tests over wide (i.e. many-column) tables. The
test tables are produced by specifying shell commands to generate test
tables in functional_schema_template.sql, which are executed in
generate-schema-statements.py. In the SQL templates, sections starting
with a ` are treated as shell commands. The output of the shell
command is then used as the section text. This is only a starting
point; it isn't currently implemented for all sections, and may have
to be tweaked if we use this mechanism for all tables.
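
A minimal sketch of the backtick convention (function name illustrative):

  import subprocess

  def resolve_section_text(section):
      # Sections starting with ` are shell commands; their output becomes
      # the section text. Anything else is used verbatim.
      if section.startswith("`"):
          return subprocess.check_output(section[1:], shell=True).decode()
      return section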

Change-Id: Ife0d857d19b21534167a34c8bc06bc70bef34910
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2206
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
(cherry picked from commit 1c5951e3cce25a048208ab9bb3a3aed95e41cf67)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2353
Tested-by: jenkins
2014-04-28 15:58:15 -07:00