impala

mirror of https://github.com/apache/impala.git synced 2025-12-21 02:48:14 -05:00

Author	SHA1	Message	Date
Csaba Ringhofer	95a073aa08	IMPALA-14223: Cleanup subdirectories in INSERT OVERWRITE If an external table contains data files in subdirectories, and recursive listing is enabled, Impala considers the files in the subdirectories as part of the table. However, currently INSERT OVERWRITE and TRUNCATE do not always delete these files, leading to data corruption. This change takes care of INSERT OVERWRITE. Before this change, for unpartitioned external tables, only top-level data files were deleted and data files in subdirectories (whether hidden, ignored or normal) were kept. After this change, directories are also deleted in addition to (non-hidden) data files, with the exception of hidden and ignored directories. (Note: for ignored directories, see --ignored_dir_prefix_list). Note that for partitioned tables, INSERT OVERWRITE completely removes the partition directories that are affected, and this change does not alter that. Testing: - extended the tests in test_recursive_listing.py::TestRecursiveListing Change-Id: I1a40a22e18e6a384da982d300422ac8995ed0273 Reviewed-on: http://gerrit.cloudera.org:8080/23165 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>	2025-07-15 12:30:40 +00:00
Csaba Ringhofer	f98b697c7b	IMPALA-13929: Make 'functional-query' the default workload in tests This change adds get_workload() to ImpalaTestSuite and removes it from all test suites that already returned 'functional-query'. get_workload() is also removed from CustomClusterTestSuite which used to return 'tpch'. All other changes besides impala_test_suite.py and custom_cluster_test_suite.py are just mass removals of get_workload() functions. The behavior is only changed in custom cluster tests that didn't override get_workload(). By returning 'functional-query' instead of 'tpch', exploration_strategy() will no longer return 'core' in 'exhaustive' test runs. See IMPALA-3947 on why workload affected exploration_strategy. An example for affected test is TestCatalogHMSFailures which was skipped both in core and exhaustive runs before this change. get_workload() functions that return a different workload than 'functional-query' are not changed - it is possible that some of these also don't handle exploration_strategy() as expected, but individually checking these tests is out of scope in this patch. Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115 Reviewed-on: http://gerrit.cloudera.org:8080/22726 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-04-08 07:12:55 +00:00
Riza Suminto	6fbde72969	IMPALA-13694: Add ImpalaTestSuite.__reset_impala_clients method This patch adds a __reset_impala_clients() method in ImpalaConnection. __reset_impala_clients() simply clears the configuration. It is called in each setup_method() to ensure that each EE test uses a clean test client. All subclasses of ImpalaTestSuite that declare a setup() method are refactored to declare setup_method() instead, to match the newer py.test convention. Also implement teardown_method() to complement setup_method(). See "Method and function level setup/teardown" at https://docs.pytest.org/en/stable/how-to/xunit_setup.html. CustomClusterTestSuite fully overrides setup_method() and teardown_method() because its subclasses can be destructive. Custom cluster test methods often restart the whole Impala cluster, rendering the default impala clients initialized at setup_class() unusable. Each subclass of CustomClusterTestSuite is responsible for ensuring that the impala client it is using is in a good state. This patch improves BeeswaxConnection and ImpylaHS2Connection to only consider non-REMOVED options as their default options. They look up valid (not REMOVED) query options in their own appropriate way, and memorize the option names as lowercase strings and the values as strings. List values are wrapped in double quotes. The log in ImpalaConnection.set_configuration_option() is differentiated from how a SET query looks. Note that ImpalaTestSuite.run_test_case() modifies and restores query options written in a .test file by issuing SET queries, not by calling ImpalaConnection.set_configuration_option(). This remains unchanged. Consistently lowercase query options everywhere in the Impala test code infrastructure. Fixed several tests that had been unknowingly overriding the 'exec_option' vector dimension due to a case-sensitive mismatch. Also fixed some flake8 issues. Added convenience methods execute_query_using_vector() and create_impala_client_from_vector() in ImpalaTestSuite. Testing: - Pass core tests. Change-Id: Ieb47fec9f384cb58b19fdbd10ff7aa0850ad6277 Reviewed-on: http://gerrit.cloudera.org:8080/22404 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Reviewed-by: Jason Fehr <jfehr@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-02-06 04:03:33 +00:00
Joe McDonnell	eb66d00f9f	IMPALA-11974: Fix lazy list operators for Python 3 compatibility Python 3 changes list operators such as range, map, and filter to be lazy. Some code that expects the list operators to happen immediately will fail. e.g. Python 2: range(0,5) == [0,1,2,3,4] True Python 3: range(0,5) == [0,1,2,3,4] False The fix is to wrap locations with list(). i.e. Python 3: list(range(0,5)) == [0,1,2,3,4] True Since the base operators are now lazy, Python 3 also removes the old lazy versions (e.g. xrange, ifilter, izip, etc). This uses future's builtins package to convert the code to the Python 3 behavior (i.e. xrange -> future's builtins.range). Most of the changes were done via these futurize fixes: - libfuturize.fixes.fix_xrange_with_import - lib2to3.fixes.fix_map - lib2to3.fixes.fix_filter This eliminates the pylint warnings: - xrange-builtin - range-builtin-not-iterating - map-builtin-not-iterating - zip-builtin-not-iterating - filter-builtin-not-iterating - reduce-builtin - deprecated-itertools-function Testing: - Ran core job Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f Reviewed-on: http://gerrit.cloudera.org:8080/19589 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
Michael Smith	3577030df6	IMPALA-11562: Revert support for o3fs as default filesystem Reverts support for o3fs as a default filesystem added in IMPALA-9442. Updates test setup to use ofs instead. Munges absolute paths in Iceberg metadata to match the new location required for ofs. Ozone has strict requirements on volume and bucket names, so all tables must be created within a bucket (e.g. inside /impala/test-warehouse/). Change-Id: I45e90d30b2e68876dec0db3c43ac15ee510b17bd Reviewed-on: http://gerrit.cloudera.org:8080/19001 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-09-28 22:35:48 +00:00
Michael Smith	1eb0510eaa	IMPALA-11456: Collapse filesystem Skip logic Combines all SkipIf* classes for different filesystems into a single SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the rest as filesystem-specific special cases. The 'jira' option is removed in favor of specific flags for each issue. Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1 Reviewed-on: http://gerrit.cloudera.org:8080/18781 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-08-10 22:37:08 +00:00
Michael Smith	830625b104	IMPALA-9442: Add Ozone to minicluster Adds Ozone as an alternative to hdfs in the minicluster. Select by setting `export TARGET_FILESYSTEM=ozone`. With that flag, run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot because Ozone does not support HBase (HDDS-3589); snapshot loading doesn't work yet primarily due to HDDS-5502. Uses the o3fs interface because Ozone puts specific restrictions on bucket names (no underscores, for instance), and it was a lot easier to use an interface where everything is written to a single bucket than to update all Impala's use of HDFS-style paths to make `test-warehouse` a bucket inside a volume. Specifies reduced Ozone client retries during shutdown where Ozone may not be available. Passes tests with FE_TEST=false BE_TEST=false. Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be Reviewed-on: http://gerrit.cloudera.org:8080/18738 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>	2022-08-03 16:58:20 +00:00
Fucun Chu	157086cb80	IMPALA-10771: Add Tencent COS support This patch adds support for COS(Cloud Object Storage). Using the hadoop-cos, the implementation is similar to other remote FileSystems. New flags for COS: - num_cos_io_threads: Number of COS I/O threads. Defaults to be 16. Follow-up: - Support for caching COS file handles will be addressed in IMPALA-10772. - test_concurrent_inserts and test_failing_inserts in test_acid_stress.py are skipped due to slow file listing on COS (IMPALA-10773). Tests: - Upload hdfs test data to a COS bucket. Modify all locations in HMS DB to point to the COS bucket. Remove some hdfs caching params. Run CORE tests. Change-Id: Idce135a7591d1b4c74425e365525be3086a39821 Reviewed-on: http://gerrit.cloudera.org:8080/17503 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-12-08 16:32:02 +00:00
stiga-huang	2dfc68d852	IMPALA-7712: Support Google Cloud Storage This patch adds support for GCS(Google Cloud Storage). Using the gcs-connector, the implementation is similar to other remote FileSystems. New flags for GCS: - num_gcs_io_threads: Number of GCS I/O threads. Defaults to be 16. Follow-up: - Support for spilling to GCS will be addressed in IMPALA-10561. - Support for caching GCS file handles will be addressed in IMPALA-10568. - test_concurrent_inserts and test_failing_inserts in test_acid_stress.py are skipped due to slow file listing on GCS (IMPALA-10562). - Some tests are skipped due to issues introduced by /etc/hosts setting on GCE instances (IMPALA-10563). Tests: - Compile and create hdfs test data on a GCE instance. Upload test data to a GCS bucket. Modify all locations in HMS DB to point to the GCS bucket. Remove some hdfs caching params. Run CORE tests. - Compile and load snapshot data to a GCS bucket. Run CORE tests. Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b Reviewed-on: http://gerrit.cloudera.org:8080/17121 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-13 11:20:08 +00:00
Tim Armstrong	a2c5d953b0	IMPALA-8121: part 2: use local catalog in containers This enables "modern" catalog features including the local catalog and HMS notification support in the dockerised minicluster by default. The flags can be overridden if needed. Skip tests affected by these bugs: * IMPALA-8486 (LibCache invalidations) * IMPALA-8458 (alter column stats) * IMPALA-7131 (data sources not supported) * IMPALA-7538 (HDFS caching DDL not supported) * IMPALA-8489 TestRecoverPartitions.test_post_invalidate fails with IllegalStateException * IMPALA-8459 (cannot drop Kudu table) * IMPALA-7539 (insert permission checks) Fix handling of table properties in _get_properties() to avoid including properties from unrelated sections. This caused problems because of additional properties added by metastore event processing. Rewrite test_partition_ddl_predicates() to change file formats rather than use HDFS caching DDL. Update the various test_kudu_col* tests to not expect staleness of Kudu metadata for catalog V2. Fix IMPALA-8464 so that testMetaDataGetColumnComments() allows the table comment to be present, which is the new behaviour. Add a new end-to-end test test_get_tables() that tests the precise behaviour for different catalog versions so as to not lose coverage. Change-Id: I900d4b718cca98bcf86d36a2e64c0b6a424a5b7c Reviewed-on: http://gerrit.cloudera.org:8080/13226 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-05-10 12:06:01 +00:00
Tim Armstrong	2ca7f8e7c0	IMPALA-7995: part 1: fixes for e2e dockerised impala tests This fixes all core e2e tests running on my local dockerised minicluster build. I do not yet have a CI job or script running but I wanted to get feedback on these changes sooner. The second part of the change will include the CI script and any follow-on fixes required for the exhaustive tests. The following fixes were required: * Detect docker_network from TEST_START_CLUSTER_ARGS * get_webserver_port() does not depend on the caller passing in the default webserver port. It failed previously because it relied on start-impala-cluster.py setting -webserver_port for all processes. * Add SkipIf markers for tests that don't make sense or are non-trivial to fix for containerised Impala. * Support loading Impala-lzo plugin from host for tests that depend on it. * Fix some tests that had 'localhost' hardcoded - instead it should be $INTERNAL_LISTEN_HOST, which defaults to localhost. * Fix bug with sorting impala daemons by backend port, which is the same for all dockerised impalads. Testing: I ran tests locally as follows after having set up a docker network and starting other services: ./buildall.sh -noclean -notests -ninja ninja -j $IMPALA_BUILD_THREADS docker_images export TEST_START_CLUSTER_ARGS="--docker_network=impala-cluster" export FE_TEST=false export BE_TEST=false export JDBC_TEST=false export CLUSTER_TEST=false ./bin/run-all-tests.sh Change-Id: Iee86cbd2c4631a014af1e8cef8e1cd523a812755 Reviewed-on: http://gerrit.cloudera.org:8080/12639 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-04-13 02:42:32 +00:00
Sean Mackrory	7a022cf36a	IMPALA-7681. Add Azure Blob File System (ADLS Gen2) support. HADOOP-15407 adds a new FileSystem implementation called "ABFS" for the ADLS Gen2 service. It's in the hadoop-azure module as a replacement for WASB. Filesystem semantics should be the same, so skipped tests and other behavior changes have simply mirrored what is done for ADLS Gen1 by default. Tests skipped on ADLS Gen1 due to eventual consistency of the Python client can be run against ADLS Gen2. Change-Id: I5120b071760e7655e78902dce8483f8f54de445d Reviewed-on: http://gerrit.cloudera.org:8080/11630 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-10-20 06:43:00 +00:00
Todd Lipcon	35bce6b12b	IMPALA-7311. Allow INSERT on writable partitions even if some other partition is READ_ONLY This changes the permissions-checking of INSERT so that, if a partition is specified, we only verify writability of the specific explicit partition. This allows insertion into a table even if it contains one or more read-only partitions. This matches the existing behavior of LOAD DATA. New regression tests are added which failed prior to the fix. Change-Id: I1dd81100ae73fcabdbfaf679c20cea7dc102cd13 Reviewed-on: http://gerrit.cloudera.org:8080/10974 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Tianyi Wang <twang@cloudera.com>	2018-08-07 23:40:11 +00:00
Sailesh Mukil	50bd015f2d	IMPALA-5333: Add support for Impala to work with ADLS This patch leverages the AdlFileSystem in Hadoop to allow Impala to talk to the Azure Data Lake Store. This patch has functional changes as well as adds test infrastructure for testing Impala over ADLS. We do not support ACLs on ADLS since the Hadoop ADLS connector does not integrate ADLS ACLs with Hadoop users/groups. For testing, we use the azure-data-lake-store-python client from Microsoft. This client seems to have some consistency issues. For example, a drop table through Impala will delete the files in ADLS, however, listing that directory through the python client immediately after the drop, will still show the files. This behavior is unexpected since ADLS claims to be strongly consistent. Some tests have been skipped due to this limitation with the tag SkipIfADLS.slow_client. Tracked by IMPALA-5335. The azure-data-lake-store-python client also only works on CentOS 6.6 and over, so the python dependencies for Azure will not be downloaded when the TARGET_FILESYSTEM is not "adls". While running ADLS tests, the expectation will be that it runs on a machine that is at least running CentOS 6.6. Note: This is only a test limitation, not a functional one. Clusters with older OSes like CentOS 6.4 will still work with ADLS. Added another dependency to bootstrap_build.sh for the ADLS Python client. Testing: Ran core tests with and without TARGET_FILESYSTEM as 'adls' to make sure that all tests pass and that nothing breaks. Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542 Reviewed-on: http://gerrit.cloudera.org:8080/6910 Tested-by: Impala Public Jenkins Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>	2017-05-25 19:35:24 +00:00
Lars Volker	8ea21d099f	IMPALA-2523: Make HdfsTableSink aware of clustered input IMPALA-2521 introduced clustering for insert statements. This change makes the HdfsTableSink aware of clustered inputs, so that partitions are opened, written, and closed one by one. This change also adds/modifies tests in several ways: - clustered insert tests switch from selecting all rows from alltypessmall to alltypes. Together with varying settings for batch_size, this results in a larger number of row batches being written. - clustered insert tests select from alltypes instead of functional.alltypes to make sure we also select from various input formats. - clustered insert tests have been added to select from alltypestiny to create inserts with 1 and 2 rows per partition respectively. - exhaustive insert tests now use different values for batch_size: 1, 16, 0 (meaning default, 1024). This is limited to uncompressed parquet files, to maintain a reasonable runtime. On my machine execution of test.insert took 1778 seconds, compared to 1002 seconds with just the default row batch size. - There is additional testing in test_insert_behaviour.py to make sure that insertion over several row batches only creates one file per partition. - It renames the test_insert method to make it unique in the file and allow for effective filtering with -k. - It adds tests to the Analyzer test suite. Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0 Reviewed-on: http://gerrit.cloudera.org:8080/4863 Reviewed-by: Lars Volker <lv@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-22 02:51:20 +00:00
Anuj Phadke	b66829f15f	IMPALA-2700: ASCII NUL characters are doubled on insert into text tables Currently the scanner processes the '\0' character as a non-special character whereas the writer treats it as a special character. The writer appends a special character before writing, which causes the ASCII NUL characters to double since they are the default escape characters. This adds a check to treat '\0' as a non-special character in the writer. Change-Id: Ia30fa314d1ee1e99f9e7598466eb1570ca7940fc Reviewed-on: http://gerrit.cloudera.org:8080/3876 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-08-10 04:09:38 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt \| xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files_txt \| xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Sailesh Mukil	ed7f5ebf53	IMPALA-1878: Support INSERT and LOAD DATA on S3 and between filesystems Previously Impala disallowed LOAD DATA and INSERT on S3. This patch functionally enables LOAD DATA and INSERT on S3 without making major changes for the sake of improving performance over S3. This patch also enables both INSERT and LOAD DATA between file systems. S3 does not support the rename operation, so the staged files in S3 are copied instead of renamed, which contributes to the slow performance on S3. The FinalizeSuccessfulInsert() function now does not make any underlying assumptions of the filesystem it is on and works across all supported filesystems. This is done by adding a full URI field to the base directory for a partition in the TInsertPartitionStatus. Also, the HdfsOp class now does not assume a single filesystem and gets connections to the filesystems based on the URI of the file it is operating on. Added a python S3 client called 'boto3' to access S3 from the python tests. A new class called S3Client is introduced which creates wrappers around the boto3 functions and has the same function signatures as PyWebHdfsClient by deriving from a base abstract class BaseFileSystem so that they can be used interchangeably through a 'generic_client'. test_load.py is refactored to use this generic client. The ImpalaTestSuite setup creates a client according to the TARGET_FILESYSTEM environment variable and assigns it to the 'generic_client'. P.S: Currently, the test_load.py runs 4x slower on S3 than on HDFS. Performance needs to be improved in future patches. INSERT performance is slower than on HDFS too. This is mainly because of an extra copy that happens between staging and the final location of a file. However, larger INSERTs come closer to HDFS performance than smaller inserts. ACLs are not taken care of for S3 in this patch. It is something that still needs to be discussed before implementing. Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d Reviewed-on: http://gerrit.cloudera.org:8080/2574 Reviewed-by: Sailesh Mukil <sailesh@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:49 -07:00
Michael Brown	58219eac2c	IMPALA-2537: EE tests: create and use unique database fixture To speed up tests and reduce flakiness, introduce a pytest fixture whereby a test maintainer may request a database unique to his test. Such databases are suitable for tests that need to create tables within Python test code. Because the database name is unique to the test, the test can create any tables within that database it wants without fear that the same tables will be picked up by another test. Unique databases effectively guarantee a unique namespace for tables. To generate the database name, we use the CRC32 checksum of the test's so-called pytest test ID. This ID is a long string containing the test's module path, class (if applicable), function name, and parameter set (e.g., vector). We then concatenate the CRC32 checksum with the test function name, so that it's easier to identify the test to which the database belongs. The test author may also override the prefix by parametrizing the fixture. We then use a pytest fixture to create the database, hand the name to the test using the fixture, and clean up the database automatically after the test completes. The command `impala-py.test --fixtures` executed from the tests/ directory explains the full usage. Finally, we modify a few tests to show how test maintainers can use this fixture. Not supported here are databases used by .test files, creation of hive databases, databases with special CREATE parameters such as LOCATION and COMMENT, or asking the fixture to create multiple databases. Also not supported would be attempted parallel runs of the same test with the same test parameters. Testing: 1. Manual testing of the fixture usage, both in vanilla and parametrized context. 2. Manual runs of the tests modified. 3. An exhaustive exploration strategy test run. Change-Id: I74d200da8a59379388e1edfbb849828f92a1b3b7 Reviewed-on: http://gerrit.cloudera.org:8080/1821 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-03-16 18:29:57 +00:00
Vlad Berindei	b6c20b2a40	Allow Impala to run against local filesystem. Allow Impala to start only with a running HMS (and no additional services like HDFS, HBase, Hive, YARN) and use the local file system. Skip all tests that need these services, use HDFS caching or assume that multiple impalads are running. To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has permissions since this is the location where the test data will be extracted. Test coverage (with core strategy) in comparison with HDFS and S3: HDFS 1348 tests passed S3 1157 tests passed Local Filesystem 1161 tests passed Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03 Reviewed-on: http://gerrit.cloudera.org:8080/1352 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins Readability: Alex Behm <alex.behm@cloudera.com>	2015-12-05 06:48:32 +00:00
Juan Yu	1055d86793	IMPALA-2496: Fix flaky test in test_insert_behaviour.py Change-Id: I1c7f0206ba703e7c5766c79f0331724eb551893f Reviewed-on: http://gerrit.cloudera.org:8080/1203 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-10-08 15:16:34 -07:00
Casey Ching	074e5b4349	Remove hashbang from non-script python files Many python files had a hashbang and the executable bit set though they were not intended to be run as standalone scripts. That makes determining which python files are actually scripts very difficult. A future patch will update the hashbang in real python scripts so they use $IMPALA_HOME/bin/impala-python. Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba Reviewed-on: http://gerrit.cloudera.org:8080/599 Reviewed-by: Casey Ching <casey@cloudera.com> Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-04 05:26:07 +00:00
ishaan	dbc78aaa2c	Enable isilon end to end tests for Impala. This patch introduces changes to run tests against Isilon, combined with minor cleanup of the test and client code. For Isilon, it: - Populates the SkipIfIsilon class with appropriate pytest markers. - Introduces a new default for the hdfs client in order to connect to Isilon. - Cleans up a few test files to take the underlying filesystem into account. - Cleans up the interface for metadata/test_insert_behaviour, query_test/test_ddl On the client side, we introduce a wrapper around a few of pywebhdfs's methods, specifically: - delete_file_dir does not throw an error if the file does not exist. - get_file_dir_status automatically strips the leading '/' Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f Reviewed-on: http://gerrit.cloudera.org:8080/370 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-05-27 22:25:12 +00:00
Juan Yu	4810e51446	IMPALA-2008: Fix wrong warning when insert overwrite to empty table The libhdfs hdfsListDirectory API documentation is wrong. It says it returns NULL when there is an error, but it also returns NULL when the directory is empty. Impala needs to check errno to determine whether an error happened. The HDFS issue is addressed by HDFS-8407. Change-Id: I9574c321a56fe339d4ccc3bb5bea59bc41f48ac4 (cherry picked from commit 20da688af19ca41576c82fd7b7d49b4346dbae92) Reviewed-on: http://gerrit.cloudera.org:8080/394 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-05-22 20:23:39 +00:00
ishaan	09e5eaeda2	Introduce classes for pytest's skipif markers. This patch encapsulates pytests's skipif markers in classes. It leads to the following benefits: - Provide context and grouping for tests being skipped. - As we improve test reporting, annotations will give us a better idea of coverage. Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9 Reviewed-on: http://gerrit.cloudera.org:8080/297 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins Reviewed-on: http://gerrit.cloudera.org:8080/343	2015-04-19 03:09:59 +00:00
Dan Hecht	c8fb10f50a	S3: Some more work toward enabling additional S3 test coverage Add skip markers for S3 that can be used to categorize the tests that are skipped against S3 to help see what coverage is missing. Soon we'll be reworking some tests and/or adding new tests to get back the important gaps. Also, add a mechanism to parameterize paths in the .test files, and start using these new variables. This is a step toward enabling some more tests against S3. Finally, a fix for buildall.sh to stop the minicluster before applying the metastore snapshot. Otherwise, this fails since the ms db is in use. Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0 Reviewed-on: http://gerrit.cloudera.org:8080/127 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-03-03 08:29:13 +00:00
Juan Yu	04e8b32df5	IMPALA-1805: Impala's ACLs check do not consider all group ACLs, only checked first one. Impala only checks the first applicable ACL entry, but a user can be a member of more than one group. If any of these matching group entries contains the requested permissions, access should be granted. Change-Id: I16164ee906cf147e2f1f2fd389762593e85a1e84 Reviewed-on: http://gerrit.cloudera.org:8080/104 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Internal Jenkins	2015-02-25 20:40:16 +00:00
ishaan	11cd7d1d46	Blacklist tests that don't work on s3 This patch introduces a new pytest marker that skips tests that currently don't work when s3 is used as the underlying file system. The set of blacklisted tests is a superset of tests that cannot be run with s3. Follow up patches will remove some of the test files from the blacklist. Change-Id: I39a58223d3435f0bd6496ffd00a2d483b751693d Reviewed-on: http://gerrit.cloudera.org:8080/82 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Internal Jenkins	2015-02-24 01:43:28 +00:00
Juan Yu	9b7fa09ab6	IMPALA-1438: Zero-row insert creates empty data file DataSink should check if there is any row to output before creating data file or sending data so it doesn't generate zero size data file or create empty partition directory. It also fixes the issue reported in IMPALA-1432. Change-Id: I58c995f7d5cda203c23bdd9d09776e4cf35c2246 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5545 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: jenkins	2014-12-11 20:06:57 -08:00
Henry Robinson	08feed674b	IMPALA-837: Use '_' as prefix for insert staging directory Although Hive ignores directories with '_' or '.' as prefixes, some tools only ignore those beginning with '_'. Change-Id: I499491b0cb1919c4b3a46efcc45b57ad56bfdf86 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4985 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins (cherry picked from commit 469d4bd85b33fd0282594197055bb5dce47ecc9e) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4999	2014-10-29 21:45:18 -07:00
Juan Yu	3973608d34	IMPALA-1338: HDFS does not return all ACLs in getAclStatus() Impala needs to combine file permission and getAclStatus output to get full Acl list and use that to check permissions. Change-Id: I6d5884932423573e545680a2747d85bdf5793909 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4683 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Juan Yu <jyu@cloudera.com>	2014-10-08 16:49:07 -07:00
Juan Yu	2d46029dab	Fix bug that FsPermissionChecker doesn't apply Acl mask properly. Change-Id: Ia609002e7f63260c9eb5d2080d64017e1d1cc1c9 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4682 Reviewed-by: Juan Yu <jyu@cloudera.com> Tested-by: Juan Yu <jyu@cloudera.com>	2014-10-08 16:48:44 -07:00
Henry Robinson	267b81142d	[CDH5] IMPALA-1279: Check ACLs for INSERT and LOAD statements This patch forces LOAD and INSERT to check ACLs during analysis. We mimic the behaviour of HDFS's ACL checking by adding code to FsPermissionChecker. Change-Id: I42660db1da13ceaef63f582cff2c2078e08f90a1 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4428 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins	2014-09-19 20:34:48 -07:00
Henry Robinson	9d0173c647	[CDH5] Disable ACL tests The tests pass every time locally (in a 60 minute run), but fail intermittently on our build machines. Change-Id: I62d5ea0df8c42728a538b29bd16006be3179bfd3 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3489 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: jenkins	2014-07-14 15:38:11 -07:00
Henry Robinson	ff32821c6b	[CDH5] Test to confirm that ACLs are inherited correctly on INSERT Change-Id: I781a6b7203c2e12b484162954abae51a6443bead Reviewed-on: http://gerrit.ent.cloudera.com:8080/3076 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-07-09 19:04:55 -07:00
Henry Robinson	60cbe1b0e1	IMPALA-741: Support partitions with non-existent HDFS locations If a partition had a location that did not exist in HDFS, Impala would refuse to load its metadata. This meant a typo could render a table unloadable. We fix this problem by removing the existence check from the frontend, and by inheriting access from the first extant parent of the partition directory. Fixing this exposed a second issue, where Impala wouldn't create directories for partitions in the right place after an INSERT if the partition location had been changed. To get this right we have to plumb the partition ID through to Coordinator::FinalizeSuccessfulInsert(), so that the coordinator can look up the partition's location from the query-wide descriptor table. As a by-product, this patch rationalises the per-partition, per-fragment statistics gathering a little bit by putting almost all the per-partition stats into TInsertPartitionStatus. Change-Id: I9ee0a1a1ef62cf28f55be3249e8142c362083163 Reviewed-on: http://gerrit.ent.cloudera.com:8080/2851 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins	2014-06-08 18:44:45 -07:00
Henry Robinson	99c37aac37	IMPALA-827: Add an option for directories created by INSERT to inherit their parent's permissions This patch adds --insert_inherit_permissions. If true, all new partition directories created by INSERT will inherit their permissions from their parent. When false, the directories are created with the default permissions. Change-Id: Ib2b4c251e51ea5048387169678e8dde34ecfe5f6 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1917 Tested-by: jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2014-04-04 10:25:20 -07:00
Henry Robinson	635dd7d289	IMPALA-875: Respect isAnalyzed_ in IntLiteral expressions Partition column expressions are analysed twice for INSERT statements - once to infer the type and so to add a possible cast, and once to compute stats on the resulting expr. However, this process resulted in a partition column expr that was an IntLiteral getting the smallest type that would contain its value, rather than retaining the column-compatible type that had been assigned to it. This patch does the minimum thing, which is make IntLiteral.analyze() idempotent. Doing the same thing to Expr and LiteralExpr unearths some other bugs, which we will have to fix in a follow-on patch (see IMPALA-884). Change-Id: Ie22fc5d3f4832c735a1ebc0ef78f50d736f597fd Reviewed-on: http://gerrit.ent.cloudera.com:8080/1931 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins (cherry picked from commit 1912d65ea21a5025d385948642f0d4aadad91abf) Reviewed-on: http://gerrit.ent.cloudera.com:8080/1947	2014-03-17 17:35:12 -07:00
Henry Robinson	05c8e4da93	IMPALA-624: Inserts should respect changes in partition location Impala would ignore changes in a partition's location (by ALTER TABLE ... SET LOCATION ...). Change-Id: I9fdc1f09f9d848aa1a4ade3d4f35f8de9cbd18a5 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1647 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/1824	2014-03-08 13:21:06 -08:00
Henry Robinson	89a0beb56a	IMPALA-449: Better cleanup after an INSERT fails This patch goes some way to improving recovery after an INSERT fails. Inserts now write intermediate results to <table_dir>/.impala_insert_staging. After execution completes, either successfully or not, the query-specific directory under that directory is deleted. This doesn't complete the job for better cleanup (although this goes as far as IMPALA-449 suggests). Two things to do in the future: * Have each backend delete its own staging files on error. The difficulty getting there now is that backends don't know if they are cancelled in error or because a LIMIT was reached. * If the operation to move files to their final destinations should fail during FinalizeQuery(), the coordinator should perform compensation actions and delete the files that made it. Note: We also considered a query-wide and impalad-wide option to change the staging dir. There are advantages to this (all intermediate results go to a known location which is easy to clean up on failure), but also security and other operational concerns. Worth revisiting in the future. Change-Id: Ia54cf36db6a382e359877f87d7d40aad7fdb77be Reviewed-on: http://gerrit.ent.cloudera.com:8080/670 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:53:37 -08:00
Lenni Kuff	a2cbd2820e	Add Catalog Service and support for automatic metadata refresh The Impala CatalogService manages the caching and dissemination of cluster-wide metadata. The CatalogService combines the metadata from the Hive Metastore, the NameNode, and potentially additional sources in the future. The CatalogService uses the StateStore to broadcast metadata updates across the cluster. The CatalogService also directly handles executing metadata update requests from impalad servers (DDL requests). It exposes a Thrift interface to allow impalads to directly connect and execute their DDL operations. The CatalogService has two main components - a C++ server that implements StateStore integration, Thrift service implementation, and exporting of the debug webpage/metrics. The other main component is the Java Catalog that manages caching and updating of all the metadata. For each StateStore heartbeat, a delta of all metadata updates is broadcast to the rest of the cluster. Some Notes On the Changes --- * The metadata is all sent as thrift structs. To do this all catalog objects (Tables/Views, Databases, UDFs) have a thrift struct to represent them. These are sent with each statestore delta update. * The existing Catalog class has been separated into two separate sub-classes: an ImpladCatalog and a CatalogServiceCatalog. See the comments on those classes for more details. What is working: * New CatalogService created * Working with statestore delta updates and latest UDF changes * DDL performed on Node 1 is now visible on all other nodes without a "refresh". * Each DDL operation against the Catalog Service will return the catalog version that contains the change. An impalad will wait for the statestore heartbeat that contains this version before returning from the DDL command. * All table types (Hbase, Hdfs, Views) getting their metadata propagated properly * Block location information included in CS updates and used by Impalads * Column and table stats included in CS updates and used by Impalads * Query tests are all passing Still TODO: * Directly return catalog object metadata from DDL requests * Poll the Hive Metastore to detect new/dropped/modified tables * Reorganize the FE code for the Catalog Service. I don't think we want everything in the same JAR. Change-Id: I8c61296dac28fb98bcfdc17361f4f141d3977eda Reviewed-on: http://gerrit.ent.cloudera.com:8080/601 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:53:11 -08:00
Henry Robinson	a46276325c	IMPALA-415: Don't delete hidden files in the root directory for INSERT OVERWRITE INSERT OVERWRITE into an unpartitioned table is supposed to remove all data files from the root. This should not include hidden files or directories. This patch excludes hidden files from deletion, and adds a test case. Partition directories are still removed in their entirety: the cost of statting a large number of files and directories rather than issuing a single "rm -rf" outweighs the benefits of preserving hidden files for now. Hive does not preserve hidden files in either configuration. Change-Id: Ia73e55e011c26c88f14745075210cf359764e3c1 Reviewed-on: http://gerrit.ent.cloudera.com:8080/418 Tested-by: jenkins Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:52:50 -08:00

43 Commits