This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 for why the workload
affected exploration_strategy(). An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and exhaustive
runs before this change.
get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope for this patch.
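A minimal sketch of the consolidated default (the import path and
class body details are assumptions; suites that genuinely need a
different workload keep their overrides):

  from tests.common.base_test_suite import BaseTestSuite  # assumed path

  class ImpalaTestSuite(BaseTestSuite):
    @classmethod
    def get_workload(cls):
      # Single shared default; custom cluster suites no longer
      # inherit 'tpch'.
      return 'functional-query'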
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the remaining tests that have unexercised
exec_options. Some test reorganization is done to clarify their test
dimension declarations. The WARNING log added by IMPALA-13323 is
turned into a pytest.fail() with an error message suggesting how to
fix the problem.
Fixed some flake8 warnings and errors as well.
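A hedged sketch of the strengthened check (the helper name and the
message wording are assumptions, not the exact code):

  import pytest

  def check_exec_options_exercised(declared, exercised):
    # What IMPALA-13323 only logged as a WARNING now fails fast,
    # with a hint about how to declare the option properly.
    unused = sorted(set(declared) - set(exercised))
    if unused:
      pytest.fail("exec_option(s) %s are never exercised; add them to "
                  "the test's exec_option dimension or remove them."
                  % unused)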
Testing:
- Pass EE and custom cluster tests in exhaustive exploration.
Change-Id: I33bb4b6c4ff50b55a082460dd9944d2aa3511e11
Reviewed-on: http://gerrit.cloudera.org:8080/21743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for reading compressed JSON files to the
JSON scanner. Because the decompression code can largely be reused from
HdfsTextScanner, this patch moves that part of the code from
HdfsTextScanner to HdfsScanner so that HdfsJsonScanner can also call it.
As it reuses the relevant code from the text scanner, the compression
formats supported by the JSON scanner are the same as those supported by
the text scanner.
Tests
- Most of the existing end-to-end JSON format tests can run on
compressed JSON format too.
Change-Id: I2471855d97d4cdd51363b321055e6b06aa6d81e8
Reviewed-on: http://gerrit.cloudera.org:8080/20482
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3 changes list-producing builtins such as range, map, and
filter to be lazy. Code that expects them to produce a list
immediately will fail, e.g.
Python 2:
range(0,5) == [0,1,2,3,4]
True
Python 3:
range(0,5) == [0,1,2,3,4]
False
The fix is to wrap locations with list(). i.e.
Python 3:
list(range(0,5)) == [0,1,2,3,4]
True
Since the base builtins are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc.). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
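On Python 2, the converted code looks like this (assuming the
'future' package is installed):

  from builtins import range, map, filter  # Python 3 (lazy) semantics

  evens = filter(lambda x: x % 2 == 0, range(10))  # lazy iterator now
  as_list = list(evens)  # materialize only where a real list is needed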
Most of the changes were done via these futurize fixes:
- libfuturize.fixes.fix_xrange_with_import
- lib2to3.fixes.fix_map
- lib2to3.fixes.fix_filter
This eliminates the pylint warnings:
- xrange-builtin
- range-builtin-not-iterating
- map-builtin-not-iterating
- zip-builtin-not-iterating
- filter-builtin-not-iterating
- reduce-builtin
- deprecated-itertools-function
Testing:
- Ran core job
Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
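The combined import line and a typical '//' conversion look like this
(illustrative):

  from __future__ import absolute_import, division, print_function

  # With true division, 7 / 2 == 3.5 on both Python 2 and 3; switch
  # to '//' where an integer result is required, e.g. for an index:
  lo, hi = 0, 9
  mid = (lo + hi) // 2  # == 4 on both versions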
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Python 3 now treats print as a function and requires
parentheses when invoking it.
print "Hello World!"
is now:
print("Hello World!")
This fixes all locations to use the function
invocation. This is more complicated when the output
is redirected to a file or when the trailing newline
is suppressed.
print >> sys.stderr , "Hello World!"
is now:
print("Hello World!", file=sys.stderr)
To support this properly and guarantee equivalent behavior
between python 2 and python 3, all files that use print
now add this import:
from __future__ import print_function
This also fixes random flake8 issues that intersect with
the changes.
Testing:
- check-python-syntax.sh shows no errors related to print
Change-Id: Ib634958369ad777a41e72d80c8053b74384ac351
Reviewed-on: http://gerrit.cloudera.org:8080/19552
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reverts support for o3fs as a default filesystem added in IMPALA-9442.
Updates test setup to use ofs instead.
Munges absolute paths in Iceberg metadata to match the new location
required for ofs. Ozone has strict requirements on volume and bucket
names, so all tables must be created within a bucket (e.g. inside
/impala/test-warehouse/).
Change-Id: I45e90d30b2e68876dec0db3c43ac15ee510b17bd
Reviewed-on: http://gerrit.cloudera.org:8080/19001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some tests in test_compressed_formats.py hard-coded the filenames
of the table files; they used "000000_0". The number after the
underscore is the "attempt id", which can be non-zero if there were
failed attempts during file writing.
I modified the test to do a filesystem listing to retrieve the
filename.
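A hedged sketch of the lookup (the listing helper shown is an
assumption):

  def _find_data_file(self, table_dir):
    # List the directory instead of hard-coding '000000_0' so that
    # a non-zero attempt id still matches.
    files = self.filesystem_client.ls(table_dir)
    return next(f for f in files if f.startswith('000000_'))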
Testing
* I manually renamed one of my files to 000000_1 and re-ran the test.
Change-Id: I265faf8d2e7f4251b18264052eededbeb2296f57
Reviewed-on: http://gerrit.cloudera.org:8080/16518
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED
type compression in the backend. However, for non-block filesystems,
the frontend is incorrectly passing THdfsCompression::SNAPPY instead.
On debug builds, this leads to a DCHECK when trying to read
Snappy-compressed text. On release builds, it fails to decompress
the data.
This fixes the frontend to always pass THdfsCompression::SNAPPY_BLOCKED
for Snappy-compressed text.
This reworks query_test/test_compressed_formats.py to provide better
coverage:
- Changed the RC and Seq test cases to verify that the file extension
doesn't matter. Added Avro to this case as well.
- Fixed the text case to use appropriate extensions (fixing IMPALA-9004)
- Changed the utility function so it doesn't use Hive. This allows it
to be enabled on non-HDFS filesystems like S3.
- Changed the test to use unique_database and allow parallel execution.
- Changed the test to run in the core job, so it now has coverage on
the usual S3 test configuration. It is reasonably quick (1-2 minutes)
and runs in parallel.
Testing:
- Exhaustive job
- Core s3 job
- Changed the frontend to force it to use the code for non-block
filesystems (i.e. the TFileSplitGeneratorSpec code) and
verified that it is now able to read Snappy-compressed text.
Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Reviewed-on: http://gerrit.cloudera.org:8080/16278
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
This removes Impala-lzo from the Impala development environment.
Impala-lzo is no longer built as part of the Impala build. The LZO plugin
is no longer loaded. LZO tables are not loaded during dataload,
and LZO is no longer tested.
This removes some obsolete scan APIs that were only used by Impala-lzo.
With this commit, Impala-lzo would require code changes to build
against Impala.
The plugin infrastructure is not removed, and this leaves some
LZO support code in place. If someone were to decide to revive
Impala-lzo, they would still be able to load it as a plugin
and get the same functionality as before. This plugin support
may be removed later.
Testing:
- Dryrun of GVO
- Modified TestPartitionMetadataUncompressedTextOnly's
test_unsupported_text_compression() to add LZO case
Change-Id: I3a4f12247d8872b7e14c9feb4b2c58cfd60d4c0e
Reviewed-on: http://gerrit.cloudera.org:8080/15814
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Updated CDP build to 7.2.1.0-57 to include new Hive features such as
HIVE-22995.
In the minicluster, hive.create.as.acid and hive.create.as.insert.only
default to false, so by default Hive creates external-type tables
located in the external warehouse directory.
Due to HIVE-22995, "desc db" returns the external warehouse directory.
For these reasons, some tests need to use the external warehouse dir.
Also adds a new test for "CREATE DATABASE ... LOCATION".
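A hedged sketch of what such a test can look like (the helper
methods, the unique_database fixture, and the WAREHOUSE constant are
assumptions):

  def test_create_database_location(self, vector, unique_database):
    db = unique_database + "_loc"
    self.execute_query(
        "CREATE DATABASE %s LOCATION '%s/%s.db'" % (db, WAREHOUSE, db))
    try:
      result = self.execute_query("DESCRIBE DATABASE %s" % db)
      assert any("%s.db" % db in row for row in result.data)
    finally:
      self.execute_query("DROP DATABASE %s" % db)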
Tested:
Re-ran the failed tests in the minicluster.
Ran exhaustive tests.
Change-Id: I57926babf4caebfd365e6be65a399f12ea68687f
Reviewed-on: http://gerrit.cloudera.org:8080/15990
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In this patch, we add support for reading zstd-encoded text files.
This includes:
1. Support for reading zstd files written by Hive, which uses
streaming compression.
2. Support for reading zstd files compressed by the standard zstd
library, which uses block compression.
To support decompressing both formats, a ProcessBlockStreaming()
function is added to the zstd decompressor.
Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end-to-end tests:
1. Hive and Impala integration: for four compression codecs, write in
Hive and read from Impala.
2. zstd library and Impala integration: copy a zstd-library-compressed
file to HDFS and read it from Impala.
Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Reviewed-on: http://gerrit.cloudera.org:8080/15023
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hive 3 changed the typical storage model for tables to split them
between two directories:
- hive.metastore.warehouse.dir stores managed tables (which is now
defined to be only transactional tables)
- hive.metastore.warehouse.external.dir stores external tables
(everything that is not a transactional table)
In more recent commits of Hive, there is now validation that the
external tables cannot be stored in the managed directory. In order
to adopt these newer versions of Hive, we need to use separate
directories for external vs managed warehouses.
Most of our test tables are not transactional, so they would reside
in the external directory. To keep the test changes small, this uses
/test-warehouse for the external directory and /test-warehouse/managed
for the managed directory. Having the managed directory be a subdirectory
of /test-warehouse means that the data snapshot code should not need to
change.
The Hive 2 configuration doesn't change as it does not have this concept.
Since this changes the dataload layout, this also sets the CDH_MAJOR_VERSION
to 7 for USE_CDP_HIVE=true. This means that dataload will use a separate
location for data as compared to USE_CDP_HIVE=false. That should reduce
conflicts between the two configurations.
Testing:
- Ran exhaustive tests with USE_CDP_HIVE=false
- Ran exhaustive tests with USE_CDP_HIVE=true (with current Hive version)
- Verified that dataload succeeds and tests are able to run with a newer
Hive version.
Change-Id: I3db69f1b8ca07ae98670429954f5f7a1a359eaec
Reviewed-on: http://gerrit.cloudera.org:8080/15026
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Add the -d option and -f option to the following commands:
`hdfs dfs -copyFromLocal <localsrc> URI`
`hdfs dfs -put [ - | <localsrc1> .. ] <dst>`
`hdfs dfs -cp URI [URI ...] <dest>`
The -d option "Skip[s] creation of temporary file with the suffix
._COPYING_.", which improves the performance of these commands on S3
since S3 does not support metadata-only renames.
The -f option "Overwrites the destination if it already exists".
Combined with HADOOP-13884, this mitigates S3 consistency issues by
avoiding a HEAD request to check whether the destination file exists.
Added the method 'copy_from_local' to the BaseFilesystem class.
Re-factored most usages of the aforementioned HDFS commands to use
the filesystem_client. Some usages were not appropriate / worth
refactoring, so occasionally this patch just adds the '-d' and '-f'
options explicitly. All calls to '-put' were replaced with
'copyFromLocal' because they both copy files from the local fs to a HDFS
compatible target fs.
Since WebHDFS does not have good support for copying files, this patch
removes the copy functionality from the PyWebHdfsClientWithChmod.
Re-factored the hdfs_client so that it uses a DelegatingHdfsClient
that delegates to either the HadoopFsCommandLineClient or
PyWebHdfsClientWithChmod.
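A hedged sketch of the command-line path (method bodies and exact
signatures are assumptions):

  from subprocess import check_call

  class BaseFilesystem(object):
    def copy_from_local(self, srcs, dst):
      raise NotImplementedError

  class HadoopFsCommandLineClient(BaseFilesystem):
    def copy_from_local(self, srcs, dst):
      # 'srcs' is a list of local paths. -d skips the '._COPYING_'
      # temp file; -f overwrites the destination without the
      # existence check that is costly on S3.
      check_call(['hdfs', 'dfs', '-copyFromLocal', '-d', '-f']
                 + srcs + [dst])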
Testing:
* Ran core tests on HDFS and S3
Change-Id: I0d45db1c00554e6fb6bcc0b552596d86d4e30144
Reviewed-on: http://gerrit.cloudera.org:8080/14311
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support to Impala for scanning .DEFLATE files of
tables stored as text. To avoid confusion, it should be noted that
although these files have a compression type of DEFLATE in Impala,
they should be treated as if their compression type is DEFAULT.
Hadoop tools such as Hive and MapReduce support reading and writing
text files compressed using the deflate algorithm, which is the default
compression type. Hadoop uses the zlib library (an implementation of
the DEFLATE algorithm) to compress text files into .DEFLATE files.
These files are not in the raw deflate format but in the zlib format:
zlib supports three flavors of deflate output, and Hadoop uses the
flavor that wraps the deflate stream in zlib headers rather than
emitting raw deflate.
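The difference is easy to see with Python's standard zlib module
(illustrative):

  import zlib

  data = b"hello world" * 100
  zlib_wrapped = zlib.compress(data)  # zlib format, as in .DEFLATE files
  co = zlib.compressobj(9, zlib.DEFLATED, -15)  # negative wbits = raw
  raw = co.compress(data) + co.flush()
  assert zlib_wrapped[:1] == b"\x78"  # zlib header byte; raw has none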
Testing:
There is a pre-existing unit test that validates compressing and
decompressing data with compression type DEFLATE. Also, modified
existing end-to-end testing that simulates querying files of various
formats and compression types. All core and exhaustive tests pass.
Change-Id: I45e41ab5a12637d396fef0812a09d71fa839b27a
Reviewed-on: http://gerrit.cloudera.org:8080/13857
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
"hive -e SQL..." without further parameters no longer works
when USE_CDP_HIVE=true (it doesn't establish a connection).
Some tests used this to load data.
These calls can be replaced with ImpalaTestSuite.run_stmt_in_hive()
which seems like a good idea regardless of the Hive 3 efforts.
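The replacement pattern in a test looks like this (the statement
shown is illustrative):

  # Before: subprocess call to `hive -e "LOAD DATA ..."`, which no
  # longer connects under USE_CDP_HIVE=true.
  self.run_stmt_in_hive("LOAD DATA INPATH '/tmp/data' INTO TABLE tmp_tbl")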
Testing:
- ran the related tests (some run only in exhaustive mode)
Change-Id: I874ac344ffd176ffd7b8540d57126a7026f4c4f6
Reviewed-on: http://gerrit.cloudera.org:8080/13282
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
HADOOP-15407 adds a new FileSystem implementation called "ABFS" for the
ADLS Gen2 service. It's in the hadoop-azure module as a replacement for
WASB. Filesystem semantics should be the same, so skipped tests and
other behavior changes have simply mirrored what is done for ADLS Gen1
by default. Tests skipped on ADLS Gen1 due to eventual consistency of
the Python client can be run against ADLS Gen2.
Change-Id: I5120b071760e7655e78902dce8483f8f54de445d
Reviewed-on: http://gerrit.cloudera.org:8080/11630
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch removes write support for unsupported formats like Sequence,
Avro and compressed text. Also, the related query options
ALLOW_UNSUPPORTED_FORMATS and SEQ_COMPRESSION_MODE have been migrated
to the REMOVED query options type.
Testing:
Ran exhaustive build.
Change-Id: I821dc7495a901f1658daa500daf3791b386c7185
Reviewed-on: http://gerrit.cloudera.org:8080/10823
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch includes a change to the framework to permit the passing
of a username to the run_stmt_in_hive() method in the ImpalaTestSuite
class, but retains the same default value as before.
This is to allow a test to issue a 'select count(*) from foo' query
through Hive. Hive needs to set up a job to perform this query, and
HDFS write access to do so. In typical cases, the HDFS user is 'hdfs';
however, it may be necessary to change this depending on the cluster.
On a local mini-cluster, the username appears to be irrelevant, so
this won't affect locally run tests.
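A hedged sketch of the resulting signature (the default value shown
is an assumption; the commit only says it is unchanged):

  import getpass

  class ImpalaTestSuite(BaseTestSuite):
    @classmethod
    def run_stmt_in_hive(cls, stmt, username=getpass.getuser()):
      # 'username' now flows through to the Hive connection; omitting
      # it preserves the previous default behavior.
      pass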
Tested by running the core set of tests on a local minicluster to
make sure there were no regressions. Also confirmed that the test
in question now passes on a remote physical cluster.
Change-Id: I1cc8824800e4339874b9c4e3a84969baf848d941
Reviewed-on: http://gerrit.cloudera.org:8080/7046
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
This patch leverages the AdlFileSystem in Hadoop to allow
Impala to talk to the Azure Data Lake Store. This patch has
functional changes as well as adds test infrastructure for
testing Impala over ADLS.
We do not support ACLs on ADLS since the Hadoop ADLS
connector does not integrate ADLS ACLs with Hadoop users/groups.
For testing, we use the azure-data-lake-store-python client
from Microsoft. This client seems to have some consistency
issues. For example, a drop table through Impala will delete
the files in ADLS; however, listing that directory through
the Python client immediately after the drop will still show
the files. This behavior is unexpected since ADLS claims to be
strongly consistent. Some tests have been skipped due to this
limitation with the tag SkipIfADLS.slow_client. Tracked by
IMPALA-5335.
The azure-data-lake-store-python client also only works on CentOS 6.6
and later, so the Python dependencies for Azure will not be downloaded
when TARGET_FILESYSTEM is not "adls". ADLS tests are expected
to run on a machine running at least CentOS 6.6.
Note: This is only a test limitation, not a functional one. Clusters
with older OSes like CentOS 6.4 will still work with ADLS.
Added another dependency to bootstrap_build.sh for the ADLS Python
client.
Testing: Ran core tests with and without TARGET_FILESYSTEM as
'adls' to make sure that all tests pass and that nothing breaks.
Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542
Reviewed-on: http://gerrit.cloudera.org:8080/6910
Tested-by: Impala Public Jenkins
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
The TestTableWriters.test_seq_writer_hive_compatibility test introduced
in IMPALA-3079 had to be skipped for non-HDFS filesystems.
Change-Id: Ic7dbe2529818865f871b66d78642ed956d1ee039
Reviewed-on: http://gerrit.cloudera.org:8080/6746
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes the following issues in the Sequence File Writer:
1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong()
were broken. As a result, Impala created corrupt uncompressed
sequence files.
2. KEY_CLASS_NAME was missing from the sequence file header. As a
result, Hive could not read back uncompressed sequence files
created by Impala.
3. Impala created record-compressed sequence files with empty keys
block. As a result, Hive could not read back record-compressed
sequence files created by Impala.
4. Impala created block-compressed files with:
- empty key-lengths block
- empty keys block
- empty value-lengths block
This resulted in invalid block-compressed sequence files that Hive could
not read back.
5. In some cases the wrong Record-compression flag was written to the
sequence file header. As a result, Hive could not read back record-
compressed sequence files created by Impala.
6. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the
beginning of blocks in block-compressed sequence files. Hive could
not read these files back.
7. The calculation of block sizes in the SnappyBlockCompressor class
was incorrect for odd-length buffers.
Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Reviewed-on: http://gerrit.cloudera.org:8080/6107
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prefix
the class names with "Impala":
git grep -l 'TestDimension' | xargs \
sed -i 's/TestDimension/ImpalaTestDimension/g'
git grep -l 'TestMatrix' | xargs \
sed -i 's/TestMatrix/ImpalaTestMatrix/g'
git grep -l 'TestVector' | xargs \
sed -i 's/TestVector/ImpalaTestVector/g'
The tests all passed in an exhaustive run on the upstream jenkins
server:
http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/
Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.
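For example:
from tests.common.impala_test_suite import *
becomes an explicit import such as:
from tests.common.impala_test_suite import ImpalaTestSuite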
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in the decompressor, and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.
Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.
In addition, the byte size of a tuple is still assumed to be
within 32-bit. If it needs to be upgraded to 64-bit, it will be
done in a separate change.
A new test has been added to test the decompression of a 2GB
snappy block compressed text file.
Change-Id: Ic1af1564953ac02aca2728646973199381c86e5f
Reviewed-on: http://gerrit.cloudera.org:8080/3575
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This reverts commit 1ffb2bd5a2a2faaa759ebdbaf49bf00aa8f86b5e.
Unbreak the packaging builds for now.
Change-Id: Id079acb83d35b51ba4dfe1c8042e1c5ec891d807
Reviewed-on: http://gerrit.cloudera.org:8080/3543
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Michael Ho <kwho@cloudera.com>
This change extends MemPool, FreePool and StringBuffer to support
64-bit allocations, fixes a bug in the decompressor, and extends various
places in the code to support 64-bit allocation sizes. With this
change, the text scanner can now decompress compressed files larger
than 1GB.
Note that the UDF interfaces FunctionContext::Allocate() and
FunctionContext::Reallocate() still use 32-bit for the input
argument to avoid breaking compatibility.
In addition, the byte size of a tuple is still assumed to be
within 32-bit. If it needs to be upgraded to 64-bit, it will be
done in a separate change.
Change-Id: I7ed28083d809a86d801a9c063a0aa32c50d32b20
Reviewed-on: http://gerrit.cloudera.org:8080/2781
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.
S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.
The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.
Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which wraps the
boto3 functions and has the same function signatures as
PyWebHdfsClient, deriving from an abstract base class BaseFileSystem
so that the clients can be used interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.
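A hedged sketch of the wrapper idea (method names mirror
PyWebHdfsClient; the exact signatures and the BaseFileSystem class
from this patch are assumptions):

  import boto3

  class S3Client(BaseFileSystem):
    def __init__(self, bucket):
      self.bucket = boto3.resource('s3').Bucket(bucket)

    def create_file(self, path, file_data, **kwargs):
      # Same signature shape as PyWebHdfsClient.create_file(), so
      # tests can use either client through 'generic_client'.
      self.bucket.put_object(Key=path, Body=file_data)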
P.S.: Currently, test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS performance than
smaller INSERTs.
ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.
Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Fix a bug in which Impala only reads the first stream
of a multi-stream bz2/gzip file.
Changes the bz2 decoder to read the file in a streaming
fashion rather than reading the entire file into memory
before it can be decompressed.
Change-Id: Icbe617d03a69953f0bf3aa0f7c30d34bc612f9f8
(cherry picked from commit b6d0b4e059329633dc50f1f73ebe35b7ac317a8e)
Reviewed-on: http://gerrit.cloudera.org:8080/2219
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
Allow Impala to start with only a running HMS (and no additional services like HDFS,
HBase, Hive, YARN) and use the local file system.
Skip all tests that need these services, that use HDFS caching, or that assume multiple
impalads are running.
To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions since this is the location where the test data will be extracted.
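For example (the prefix path is illustrative):

  export TARGET_FILESYSTEM=local
  export WAREHOUSE_LOCATION_PREFIX=/home/$USER/impala-warehouse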
Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS 1348 tests passed
S3 1157 tests passed
Local Filesystem 1161 tests passed
Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
Due to IMPALA-1619, allocating StringBuffer larger than 1GB could
cause Impala to crash. Check the requested buffer size in advance and
fail the request if it is larger than 1GB. Once IMPALA-1619 is
fixed, we should revert this change.
Change-Id: Iffd1e701614b520ce58922ada2400386661eedb1
(cherry picked from commit 74ba16770eeade36ab77c86ed99d9248c60b0131)
Reviewed-on: http://gerrit.cloudera.org:8080/869
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
A few tests execute Hive queries via the command line. These tests should ideally not be
run when the underlying filesystem is either S3 or Isilon. This patch disables them.
Change-Id: Ieb968f4f109e02ee893a0478b0ffeb16e5b3ff4c
Reviewed-on: http://gerrit.cloudera.org:8080/446
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch encapsulates pytest's skipif markers in classes. It leads to the following
benefits:
- Provide context and grouping for tests being skipped.
- As we improve test reporting, annotations will give us a better idea of coverage.
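A minimal illustration of the pattern (class and attribute names are
modeled on the patch, but the exact members and import path are
assumptions):

  import pytest
  from tests.util.filesystem_utils import IS_S3  # assumed path

  class SkipIfS3:
    insert = pytest.mark.skipif(IS_S3,
                                reason="INSERT not yet supported on S3")
    hive = pytest.mark.skipif(IS_S3,
                              reason="Hive CLI tests don't work on S3")

  # Usage: decorate a test method with @SkipIfS3.insert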
Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
Fix up more locations to allow the tests to run on a secondary filesystem.
In particular, database locations need to be located on the target
filesystem or else any tables created without locations will be in HDFS
and not actually give coverage on S3.
Change-Id: Ifcc4a47ecaa235b23d305784b844788732d5fa05
Reviewed-on: http://gerrit.cloudera.org:8080/143
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces a new pytest marker that skips tests that currently don't work when
S3 is used as the underlying file system. The set of blacklisted tests is a superset of
tests that cannot be run with s3. Follow up patches will remove some of the test files
from the blacklist.
Change-Id: I39a58223d3435f0bd6496ffd00a2d483b751693d
Reviewed-on: http://gerrit.cloudera.org:8080/82
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Currently we do not support per record compression for SEQUENCEFILE; we do support no
compression and block compression. Per record compression is typically very slow
(since the compressor is invoked per record in the table) and not widely used.
We chose to add support for per record compression as part of our effort to use Impala
for all of our testdata loading infrastructure. We have per record compressed tables
in testdata, so even though there is no customer demand for per record compression,
we need it to migrate our data loading off of Hive.
Change-Id: I6ea98ae0d31cceff8236b4b006c3a9fc00f64131
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5302
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f62a76f8d00b8dbc2846deb36ee5f65031ad846e)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5322
The sequence writer test had an issue with zlib on certain cluster machines, making
this a flaky test. This has passed several times locally and in private builds. This
re-enables the test because the failures could not be reproduced in private builds.
Change-Id: I0aeea3a2d000e711e5a84427a7b40592e1eef75b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5077
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
This fails for the same reason as the sequence writer. It passes locally but fails in
zlib on the jenkins boxes. I suspect something is wrong with our gzip codec or the
version of zlib installed on those machines (we've disabled this for parquet as well).
Change-Id: I706186fbb6207fa694b4e61c7114e17c1ffe3482
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4221
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4260
Reviewed-by: Nong Li <nong@cloudera.com>
Introduces support for writing tables stored as Avro files. This supports writing all
data types except TIMESTAMP. Supports the following COMPRESSION_CODECs: NONE, DEFLATE,
SNAPPY.
Change-Id: Ica62063a4f172533c30dd1e8b0a11856da452467
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3863
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 15c6066d05d5077bee0d5123d26777b0715eb9c6)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4056
This supports both uncompressed and block compressed formats. Row compressed formats are
not supported. The type of compression is specified using a query parameter
COMPRESSION_CODEC with values NONE, GZIP, BZIP2, and SNAPPY.
Note: this patch only has basic testing. More extensive testing will be done when this
avro writer is used in data loading.
Change-Id: Id284bd4f3a28e27e49d56b1127cdc83c736feb61
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3541
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Adding the ability to read compressed text. Reading the compression type from the
file descriptors. Trying to homogenize the scanners' interfaces a bit more.
Removing the LZO_TEXT file format, since it was not actually a file format.
Modifying the tests to load and test also text/{snap,gzip,bzip} databases.
Note that this patch requires some changes to Impala-lzo as well.
Change-Id: Ic0742ba11f106ba545050bdb71795efbff70ef74
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3549
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ippokratis Pandis <ipandis@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3651
Tested-by: jenkins
This updates the tests to run more test cases in parallel and also removes some
unneeded "invalidate metadata" calls. This cut down the 'serial' execution time
for me by 10+ minutes.
Change-Id: I04b4d6db508a26a1a2e4b972bcf74f4d8b9dde5a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/757
Tested-by: jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>