123 Commits

Author SHA1 Message Date
stiga-huang
0b619962e6 IMPALA-14011: Skip test_no_hms_event_incremental_refresh_transactional_table on new Hive versions
The hms_event_incremental_refresh_transactional_table feature is
mature and has been enabled for years, so we'd like to deprecate the
option of turning it off. However, for older Hive versions like
Apache Hive 3 that don't provide sufficient APIs for Impala to
process COMMIT_TXN events, users can still turn it off.

This patch skips
test_no_hms_event_incremental_refresh_transactional_table when running
on CDP Hive.

To run the test on Apache Hive 3, the test is adjusted to create the
ACID table using tblproperties instead of a "create transactional
table" statement.

Tests:
 - Ran the test on CDP Hive and Apache Hive 3.

Change-Id: I93379e5331072bec1d3a4769f7d7ab59431478ee
Reviewed-on: http://gerrit.cloudera.org:8080/23435
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-17 20:11:33 +00:00
Riza Suminto
28cff4022d IMPALA-14333: Run impala-py.test using Python3
Running exhaustive tests with the env var IMPALA_USE_PYTHON3_TESTS=true
revealed some tests that require adjustment. This patch makes those
adjustments, which mostly revolve around encoding differences and str
vs bytes types in Python3. This patch also switches the default to run
pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true. The
following are the details:

Change the hash() function in conftest.py to crc32() to produce a
deterministic hash. Hash randomization has been enabled by default
since Python 3.3 (see
https://docs.python.org/3/reference/datamodel.html#object.__hash__).
This caused test sharding (like --shard_tests=1/2) to produce an
inconsistent set of tests per shard. Always restart the minicluster
during custom cluster tests if the --shard_tests argument is set,
because the test order may change and affect test correctness,
depending on whether a test runs on a fresh minicluster or not.
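The deterministic sharding described above can be sketched roughly as
follows (the function name and shard selection are illustrative, not
the actual conftest.py code):

```python
import zlib

def shard_of(test_name, num_shards):
    # zlib.crc32 is stable across runs and interpreters, unlike the
    # built-in hash(), which is randomized per process since Python 3.3.
    return zlib.crc32(test_name.encode("utf-8")) % num_shards

# Example: with --shard_tests=1/2, keep only tests landing in shard 0.
tests = ["test_a", "test_b", "test_c"]
shard_1_of_2 = [t for t in tests if shard_of(t, 2) == 0]
```

Because crc32 of the same name always yields the same value, every
shard sees a consistent subset of tests across runs.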

Moved one test case from delimited-latin-text.test to
test_delimited_text.py for easier binary comparison.

Add bytes_to_str() as a utility function to decode bytes in Python3.
This is often needed when inspecting the return value of
subprocess.check_output() as a string.
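A minimal sketch of such a helper (the real bytes_to_str() in the
patch may differ in signature and error handling):

```python
import subprocess

def bytes_to_str(b):
    # subprocess.check_output() returns bytes under Python 3; decode so
    # callers can keep comparing against str literals as before.
    if isinstance(b, bytes):
        return b.decode("utf-8")
    return b

out = bytes_to_str(subprocess.check_output(["echo", "hello"]))
```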

Implement DataTypeMetaclass.__lt__ to substitute for
DataTypeMetaclass.__cmp__, which is ignored in Python3 (see
https://peps.python.org/pep-0207/).
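The shape of that change can be illustrated with a toy metaclass (the
class names here are hypothetical, not Impala's actual type classes):

```python
class DataTypeMetaclass(type):
    # Python 3 ignores __cmp__ (PEP 207); sorting the data-type classes
    # now needs an explicit rich comparison on the metaclass.
    def __lt__(cls, other):
        return cls.__name__ < other.__name__

class Int(metaclass=DataTypeMetaclass):
    pass

class Boolean(metaclass=DataTypeMetaclass):
    pass

# sorted() compares the classes themselves, which dispatches to the
# metaclass __lt__ defined above.
ordered = sorted([Int, Boolean])
```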

Fix WEB_CERT_ERR difference in test_ipv6.py.

Fix trivial integer parsing in test_restart_services.py.

Fix various encoding issues in test_saml2_sso.py,
test_shell_commandline.py, and test_shell_interactive.py.

Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1.

Switch to binary comparison in test_iceberg.py where needed.

Specify text mode when calling tempfile.NamedTemporaryFile().
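The reason is that NamedTemporaryFile() defaults to binary mode
('w+b'), so writing a str raises TypeError under Python 3; requesting
text mode keeps str writes working:

```python
import tempfile

# Default mode is 'w+b': f.write("some text") would raise TypeError in
# Python 3. Passing mode='w+t' (plus an encoding) accepts str again.
with tempfile.NamedTemporaryFile(mode="w+t", encoding="utf-8") as f:
    f.write("hello")
    f.seek(0)
    content = f.read()
```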

Simplify create_impala_shell_executable_dimension to skip testing the
dev and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The
reason is that several UTF-8 related tests in test_shell_commandline.py
break in the Python3 pytest + Python2 impala-shell combo. This skipping
already happens automatically on build OSes without system Python2
available, like RHEL9 (where the IMPALA_SYSTEM_PYTHON2 env var is
empty).

Removed unused vector argument and fixed some trivial flake8 issues.

Several tests required logic modifications due to intermittent issues
in Python3 pytest. These include:

Add _run_query_with_client() in test_ranger.py to allow reusing a
single Impala client for running several queries. Ensure clients are
closed when the test is done. Mark several tests in test_ranger.py
with SkipIfFS.hive because they run queries through beeline +
HiveServer2, but the Ozone and S3 build environments do not start
HiveServer2 by default.

Increase the sleep period from 0.1 to 0.5 seconds per iteration in
test_statestore.py and mark TestStatestore to execute serially,
because TServer appears to shut down more slowly when run concurrently
with other tests. Handle the deprecation of Thread.setDaemon() as well.
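The setDaemon() fix is mechanical: Thread.setDaemon() is deprecated
since Python 3.10 in favor of the daemon attribute (the helper name
below is illustrative):

```python
import threading

def start_background(target):
    t = threading.Thread(target=target)
    # Thread.setDaemon(True) is deprecated since Python 3.10; assign
    # the daemon attribute (or pass daemon=True) instead.
    t.daemon = True
    t.start()
    return t

worker = start_background(lambda: None)
worker.join()
```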

Always set force_restart=True for each test method in TestLoggingCore,
TestShellInteractiveReconnect, and TestQueryRetries to prevent them
from reusing the minicluster from a previous test method. Some of
these tests destroy the minicluster (kill impalad) and will produce a
minidump if the metrics verifier for the next test fails to detect a
healthy minicluster state.

Testing:
Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true.

Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4
Reviewed-on: http://gerrit.cloudera.org:8080/23319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-03 10:01:29 +00:00
Csaba Ringhofer
c45e3e7968 IMPALA-14109: Remove SkipIfCatalogV2.hms_event_polling_disabled
This skipIf used the coordinator webui to check whether the flag
was set, and skipped the test if the cluster was not running.
The skipIf was only used in custom cluster tests, where the cluster
is restarted with new flags anyway, so the flags of the previous
cluster are not relevant.

Change-Id: I455b39eff95e45d02c7b9e0b35d8e7fe03145bb1
Reviewed-on: http://gerrit.cloudera.org:8080/22960
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-02 16:45:53 +00:00
jfehr
742d8d05f5 IMPALA-14090: Move Some Stable Custom Cluster Tests to Exhaustive
Moves several custom cluster tests out of core and into exhaustive
only. The tests were chosen based on their stability, lack of recent
modifications, and coverage of rare/corner cases.

Testing was accomplished by running both core and exhaustive tests
and manually verifying the tests were or were not skipped as
expected.

Change-Id: If99c015a0cb5d95b1607ca2be48d2dea04194f81
Reviewed-on: http://gerrit.cloudera.org:8080/22963
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-02 07:53:37 +00:00
Joe McDonnell
5b4afb4f8f IMPALA-13368: Fixup Redhat detection for Python >= 3.8
Python 3.8 removed the platform.linux_distribution() function which is
currently used to detect Redhat. This switches to using the 'distro'
package, which implements the same functionality across different
Python versions. Since Redhat 6 is no longer supported, this removes
the detection of Redhat 6 and associated skip logic.
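A hedged sketch of what such detection looks like with the third-party
'distro' package (this is not Impala's actual code; the id strings are
taken from the distro package's documented return values):

```python
try:
    import distro  # third-party replacement: pip install distro

    def is_redhat_family():
        # distro.id() returns a short id such as 'rhel', 'centos', or
        # 'fedora' on Red Hat derived systems, on any Python version.
        return distro.id() in ("rhel", "centos", "fedora")
except ImportError:
    def is_redhat_family():
        # Conservative fallback when the package is unavailable.
        return False
```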

Testing:
 - Ran a core job

Change-Id: I0dfaf798c0239f6068f29adbd2eafafdbbfd66c3
Reviewed-on: http://gerrit.cloudera.org:8080/22073
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-17 07:28:51 +00:00
Yida Wu
f2a09b6dda IMPALA-12907: Add testcases for TPC-H/TPC-DS queries with tuple caching
Added testcases to run TPC-H and TPC-DS queries twice with tuple
caching to verify that Impala won't crash and to ensure the
correctness of the results.

Testcases allow mt_dop to be 0 or 4.

Also added the environment variables for the tuple cache to
run-all-tests.sh and added a skipif to test_tuple_cache_tpc_queries.py
to skip the tests if the tuple cache is not enabled.

Tests:
Ran the tests in the build with tuple cache enabled.

Change-Id: I967372744d8dda25cbe372aefec04faec5a76847
Reviewed-on: http://gerrit.cloudera.org:8080/21628
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-16 07:30:26 +00:00
wzhou-code
08f8a30025 IMPALA-12910: Support running TPCH/TPCDS queries for JDBC tables
This patch adds a script to create external JDBC tables for the TPCH
and TPCDS datasets, and adds unit tests to run TPCH and TPCDS queries
against external JDBC tables with Impala-Impala federation. Note that
JDBC tables are mapping tables; they don't take additional disk space.
It fixes a race condition in the caching of SQL DataSource objects by
using a new DataSourceObjectCache class, which checks the reference
count before closing a SQL DataSource.
Adds a new query option 'clean_dbcp_ds_cache' with a default value of
true. When it is set to false, a SQL DataSource object will not be
closed when its reference count reaches 0, and will be kept in the
cache until the SQL DataSource is idle for more than 5 minutes. The
flag variable 'dbcp_data_source_idle_timeout_s' is added to make that
duration configurable.
java.sql.Connection.close() sometimes fails to remove a closed
connection from the connection pool, which causes JDBC worker threads
to wait a long time for available connections from the pool. The
workaround is to call the BasicDataSource.invalidateConnection() API
to close a connection.
Two flag variables are added for the DBCP configuration properties
'maxTotal' and 'maxWaitMillis'. Note that the 'maxActive' and 'maxWait'
properties were renamed to 'maxTotal' and 'maxWaitMillis' respectively
in apache.commons.dbcp v2.
Fixes a bug in database type comparison: the type strings specified by
the user could be lower case or a mix of upper/lower case, but the
code compared the types against an upper-case string.
Fixes an issue to close the SQL DataSource object in JdbcDataSource.open()
and JdbcDataSource.getNext() when errors are returned from DBCP APIs
or JDBC drivers.

testdata/bin/create-tpc-jdbc-tables.py supports creating JDBC tables
for Impala-Impala, Postgres, and MySQL.
The following sample commands create TPCDS JDBC tables for
Impala-Impala federation with a remote coordinator running at
10.19.10.86, and a Postgres server running at 10.19.10.86:
  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=IMPALA --database_host=10.19.10.86 --clean

  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=POSTGRES --database_host=10.19.10.86 \
    --database_name=tpcds --clean

TPCDS tests for JDBC tables run only for release/exhaustive builds.
TPCH tests for JDBC tables run for core and exhaustive builds, except
in Dockerized builds.

Remaining Issues:
 - tpcds-decimal_v2-q80a failed with returned rows not matching expected
   results for some decimal values. This will be fixed in IMPALA-13018.

Testing:
 - Passed core tests.
 - Passed query_test/test_tpcds_queries.py in release/exhaustive build.
 - Manually verified that only one SQL DataSource object was created for
   test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since query option
   'clean_dbcp_ds_cache' was set as false, and the SQL DataSource object
   was closed by cleanup thread.

Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a
Reviewed-on: http://gerrit.cloudera.org:8080/21304
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-02 02:14:20 +00:00
stiga-huang
9ecf0cbfc7 IMPALA-12054: Lazily check Kudu flags in tests
I usually shut down Kudu in my dev env to save some resources. However,
tests that import skip.py will fail if the Kudu cluster is not running
locally, even if the tests are unrelated to Kudu. The cause is that
Kudu web pages are accessed when the module is imported, which fails
if the Kudu cluster is not running.

This patch exposes the decorators of SkipIfKudu as methods just like
what we did in SkipIfCatalogV2, so Kudu web pages can be checked lazily
when needed.
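The lazy-evaluation idea can be shown with a toy decorator (the real
patch uses pytest.mark.skipif markers; the names and the stand-in
check below are hypothetical):

```python
def kudu_web_ui_reachable():
    # Stand-in for the real check, which fetches the Kudu master web
    # UI. Because the check lives inside a function, importing the
    # module no longer touches the network.
    return False

class SkipIfKudu:
    @staticmethod
    def no_master():
        # The check runs only when a test actually builds the
        # decorator, i.e. lazily, not at import time of skip.py.
        should_skip = not kudu_web_ui_reachable()

        def decorator(fn):
            fn.skip = should_skip
            return fn
        return decorator

@SkipIfKudu.no_master()
def test_kudu_scan():
    pass
```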

Tests:
 - Ran Kudu tests.
 - Ran some Kudu unrelated tests without launching the Kudu cluster.

Change-Id: Ic7a8282b59d72322085c21c70a5019c51b586a52
Reviewed-on: http://gerrit.cloudera.org:8080/20904
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-01-16 23:12:30 +00:00
wzhou-code
a2c2f118d2 IMPALA-12375: Make DataSource Object persistent
DataSource objects are saved in an in-memory cache in the Catalog
server; they are not persisted to the HMS. The objects are lost after
the Catalog server is restarted, and users need to recreate DataSource
objects before creating new external DataSource tables.
This patch makes DataSource objects persistent by saving them as
DataConnector objects with type "impalaDataSource" in HMS. Since HMS
events for DataConnector are not handled, the Catalog server has to
refresh DataSource objects when the catalogd becomes active.
Note that this feature is not supported for Apache Hive 3.1 and older
versions.

Testing:
 - Added two end-to-end unit tests with restarting of Catalog server,
   and catalogd HA failover.
   These two tests are skipped when USE_APACHE_HIVE is set to true
   and the Apache Hive version is 3.x or older.
 - Passed all-build-options-ub2004.
 - Passed core test.

Change-Id: I500a99142bb62ce873e693d573064ad4ffa153ab
Reviewed-on: http://gerrit.cloudera.org:8080/20768
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
2024-01-03 03:25:18 +00:00
Riza Suminto
967ed18407 IMPALA-12528: Deflake test_hdfs_scanner_thread_non_reserved_bytes
The prior deflake attempt in IMPALA-12499 does not seem sufficient.
There are still sporadic failures happening in
test_hdfs_scanner_thread_non_reserved_bytes. This patch further
attempts to deflake it by:
- Injecting a 100ms sleep every time a scanner thread obtains a new
  scan range.
- Running it serially.
- Skipping it in dockerized environments.

This patch also fixes small comment mistakes in hdfs-scan-node.cc.

Testing:
- Loop and pass the test 100 times in local minicluster environment.

Change-Id: I5715cf16c87ff0de51afd2fa778c5b591409d376
Reviewed-on: http://gerrit.cloudera.org:8080/20640
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-11-02 12:12:20 +00:00
wzhou-code
c77a457520 IMPALA-7131: Support external data sources in LocalCatalog mode
This patch makes external data sources work in LocalCatalog mode:
 - Add APIs in CatalogdMetaProvider to fetch DataSource from Catalog
   server through RPC.
 - Add getDataSources() and getDataSource() in LocalCatalog.
 - Add LocalDataSourceTable class for loading DataSource table in
   LocalCatalog.
 - Handle request for loading DataSource in CatalogServiceCatalog on
   Catalog server.
 - Enable tests which are skipped by
   SkipIfCatalogV2.data_sources_unsupported().
   Remove SkipIfCatalogV2.data_sources_unsupported().
 - Add end-to-end tests for LocalCatalog mode.

Testing:
 - Passed core tests

Change-Id: I40841c9be9064ac67771c4d3f5acbb3b552a2e55
Reviewed-on: http://gerrit.cloudera.org:8080/20574
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-10-30 16:04:47 +00:00
Riza Suminto
05890c1c84 IMPALA-12499: Deflake test_hdfs_scanner_thread_mem_scaling
IMPALA-11068 added three new tests to
hdfs-scanner-thread-mem-scaling.test. The first one was failing
intermittently, most likely because the fragment right above the scan
does not pull row batches fast enough. This patch attempts to deflake
the tests by replacing it with a simple count star query. The three
test cases are now contained in their own
test_hdfs_scanner_thread_non_reserved_bytes and will be skipped for
sanitized builds.

Testing:
- Loop and pass test_hdfs_scanner_thread_non_reserved_bytes a hundred
  times.

Change-Id: I7c99b2ef70b71e148cedb19037e2d99702966d6e
Reviewed-on: http://gerrit.cloudera.org:8080/20593
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-20 00:41:59 +00:00
Michael Smith
e27e4eb54a IMPALA-11941: (Addendum) ease testing other JDKs
Makes it simpler to build with one JDK and run tests with another.
TEST_JDK_VERSION sets IMPALA_JDK_VERSION before running tests, so the
Impala cluster is started with that JDK. TEST_JAVA_HOME_OVERRIDE sets
IMPALA_JAVA_HOME_OVERRIDE if a non-OS version of Java is required.

Restart Kudu with original JAVA_HOME in frontend tests.

Also skips restarting Hive, Kudu, and Ranger in tests as they'll restart
with a different JDK than originally started with.

Testing:
1. built normally
2. ran "TEST_JDK_VERSION=17 run-all-tests.sh"
3. verified various logs contain "java.specification.version:17"

Change-Id: I46b5515efd9537d63b843dbc42aa93b376efce00
Reviewed-on: http://gerrit.cloudera.org:8080/20143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-27 02:10:17 +00:00
Michael Smith
8f35c7f4aa IMPALA-12052: Update EC policy string for Ozone
HDDS-7122 changed how Ozone prints chunk size from bytes to KB, as in
1024k rather than 1048576. That makes it consistent with HDFS reporting.

Our tests verify the chunk size reported in SHOW output. Updates the
expected erasure code policy string to match the new format.

Updates CDP_OZONE_VERSION to a build that includes HDDS-7122. However
this build includes two regressions that we work around for the moment:
- HDDS-8543: FSO layout reports incorrect replication config for
             directories in EC buckets
- HDDS-8289: FSO layout listStatus operations get slower with lots of
             files and filesystem operations

Testing:
- ran test suite with Ozone Erasure Coding

Change-Id: I5354de61bbc507931a1d5bc86f6466c0dd50fc30
Reviewed-on: http://gerrit.cloudera.org:8080/19870
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-05-18 17:34:33 +00:00
Riza Suminto
baddaf2241 IMPALA-12144: Skip TestTpcdsQueryWithProcessingCost if dockerised
There are signs of flakiness in TestTpcdsQueryWithProcessingCost
within the dockerised environment. The flakiness seems to happen due
to the tighter per-process memory limit in the dockerised environment.
This patch skips TestTpcdsQueryWithProcessingCost in the dockerised
environment.

Testing:
- Hack SkipIfDockerizedCluster.insufficient_mem_limit to return True if
  IS_HDFS and confirm that the whole TestTpcdsQueryWithProcessingCost is
  skipped.

Change-Id: Ibb6b2d4258a2c6613d1954552f21641b42cb3c38
Reviewed-on: http://gerrit.cloudera.org:8080/19892
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-16 20:40:11 +00:00
Michael Smith
5018a30d6d IMPALA-11476: (Addendum) enable test_erasure_coding for Ozone
Enables test_erasure_coding for Ozone. HDDS-7603 is planned for Ozone
1.4.0, but we test with a CDP build - 1.3.0.7.2.17.0-127 - that already
includes this fix. Since this is a testing-only change, it seems safe
to rely on that.

Testing:
- Ran test_erasure_coding on Ozone with EC

Change-Id: Iee57c008102db7fac89abcea9a140c867178bb08
Reviewed-on: http://gerrit.cloudera.org:8080/19578
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-03-13 22:06:46 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
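The two behaviors the commit emulates can be seen side by side; under
Python 3 the `__future__` imports are no-ops, so this snippet behaves
identically on both versions:

```python
from __future__ import absolute_import, division, print_function

# With the division import, '/' is true division on Python 2 as well,
# matching Python 3 and yielding a float for two ints.
ratio = 7 / 2        # 3.5
# '//' is explicit floor division, substituted where an integer result
# is required (e.g. indices, counts of records).
index = 7 // 2       # 3
```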

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
yx91490
f4d306cbca IMPALA-11629: Support for huawei OBS FileSystem
This patch adds support for huawei OBS (Object Storage Service)
FileSystem. The implementation is similar to other remote FileSystems.

New flags for OBS:
- num_obs_io_threads: Number of OBS I/O threads. Defaults to be 16.

Testing:
 - Upload hdfs test data to an OBS bucket. Modify all locations in HMS
   DB to point to the OBS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: I84a54dbebcc5b71e9bcdd141dae9e95104d98cb1
Reviewed-on: http://gerrit.cloudera.org:8080/19110
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-09 08:10:19 +00:00
Peter Rozsa
1d05381b7b IMPALA-11745: Add Hive's ESRI geospatial functions as builtins
This change adds geospatial functions from Hive's ESRI library
as builtin UDFs. Plain Hive UDFs are imported without changes,
but the generic and varargs functions are handled differently:
generic functions are added with all combinations of their
parameters (the cartesian product of the parameters), and
varargs functions are unfolded into simple functions taking n
parameters. The varargs function wrappers are generated at build
time and can be configured in gen_geospatial_udf_wrappers.py.
These additional steps are required because of limitations in
Impala's UDF Executor (lack of varargs support and only partial
generics support); if those are improved, the additional
wrapping/mapping steps could be removed.

Changes regarding function handling/creating are sourced from
https://gerrit.cloudera.org/c/19177

A new backend flag was added to turn this feature on/off
as "geospatial_library". The default value is "NONE" which
means no geospatial function gets registered
as builtin, "HIVE_ESRI" value enables this implementation.

The ESRI geospatial implementation for Hive is currently only
available in Hive 4, but CDP Hive backported it to Hive 3;
therefore, for Apache Hive this feature is disabled
regardless of the "geospatial_library" flag.

Known limitations:
 - ST_MultiLineString, ST_MultiPolygon only work
   with the WKT overload
 - ST_Polygon supports a maximum of 6 pairs of coordinates
 - ST_MultiPoint, ST_LineString support a maximum of 7
   pairs of coordinates
 - ST_ConvexHull, ST_Union support a maximum of 6 geoms

These limits can be increased in gen_geospatial_udf_wrappers.py

Tests:
 - test_geospatial_udfs.py added based on
   https://github.com/Esri/spatial-framework-for-hadoop

Co-Authored-by: Csaba Ringhofer <csringhofer@cloudera.com>

Change-Id: If0ca02a70b4ba244778c9db6d14df4423072b225
Reviewed-on: http://gerrit.cloudera.org:8080/19425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-07 20:18:47 +00:00
Michael Smith
bbb0b4939d IMPALA-11476: Support Ozone erasure coding
Adds support for identifying erasure coding policy with Ozone. Enables
testing Ozone with erasure coding.

Omits support for identifying erasure coding policy with the o3fs
protocol as that protocol is effectively deprecated and its classes
don't provide access to the ObjectStore.

Refactors volumeBucketPair to use StringTokenizer.

Test updates:
- test_exclusive_coordinator_plan: Ozone+EC blocks are 768MB, which is
  larger than all tables in our test environment. Use tpch_parquet which
  we rely on having 3 files (by loading from snapshot in this case).
- test_new_file_shorter: receives an EOFException when seeking with EC
- test_local_read: erasure-coded-bytes-read is also tied to IMPALA-11697
- test_erasure_coding: Ozone doesn't report files as erasure-coded
  (HDDS-7603)

Testing:
- Passes core E2E and custom cluster tests with TARGET_FILESYSTEM=ozone
  and ERASURE_CODING=true.

Change-Id: I201e2e33ce94bbc1e81631a0a315884bcc8047d1
Reviewed-on: http://gerrit.cloudera.org:8080/19324
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-25 18:18:28 +00:00
Gergely Fürnstáhl
4595371ea2 IMPALA-11821: Adjusting manifest_length and absolute paths in case of metadata rewrite
testdata/bin/rewrite-iceberg-metadata.py rewrites manifest and snapshot
files using the provided prefix for file paths. Snapshot files also
store the length of the manifest files, so that needs to be adjusted
too.

Additionally, improved the path rewrite to handle absolute paths
correctly and to pretty-print the dumped metadata JSON files.

Testing:
 - Tested locally, manually verified the rewrites
 - Tested on Ozone, automatically rewriting the test data and running
test_iceberg.py

Change-Id: I89b9208f25552012cc1ab16fa60a819dd5a683d9
Reviewed-on: http://gerrit.cloudera.org:8080/19412
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-17 22:33:58 +00:00
noemi
4a05eaf988 IMPALA-11807: Fix TestIcebergTable.test_avro_file_format and test_mixed_file_format
Iceberg hardcodes URIs in metadata files. If the table was written
in a certain storage location and then moved to another file system,
the hardcoded URIs will still point to the old location instead of
the current one. Therefore Impala will be unable to read the table.

TestIcebergTable.test_avro_file_format and test_mixed_file_format
use Hive from Impala to write tables. If the tables are created in
a different file system than the one they will be read from, the tests
fail due to the invalid URIs.
Skipping these 2 tests if testing is not done on HDFS.

Updated the data load schema of the 2 test tables created by Hive and
set LOCATION to the same as in the previous test tables. If this
makes it possible to rewrite the URIs in the metadata and makes the
tables accessible from another file system as well later, then the
tests can be enabled again.

Testing:
 - Testing locally on HDFS minicluster
 - Triggered an Ozone build to verify that it is skipped on a different
   file system

Change-Id: Ie2f126de80c6e7f825d02f6814fcf69ae320a781
Reviewed-on: http://gerrit.cloudera.org:8080/19387
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-22 19:45:21 +00:00
Michael Smith
a469a9cf19 IMPALA-11730: Add support for spilling to Ozone
Adds support for spilling to Ozone (ofs and o3fs schemes) for parity
with HDFS. Note that ofs paths start with <volume>/<bucket>, which have
naming restrictions; tmp/impala-scratch is a valid name, so something
like ofs://localhost:9862/tmp would work as a scratch directory (volume
tmp, implicit bucket impala-scratch).

Updates tests to determine the correct path from the environment. Fixes
backend tests to work with Ozone as well. Guards test_scratch_disk.py
behind a new flag for filesystems that support spilling. Updates metric
verification to wait for scratch-space-bytes-used to be non-zero, as it
seems to update slower with Ozone.

Refactors TmpDir to remove extraneous variables and functions. Each
implementation is expected to handle its own token parsing.

Initializes default_fs in ExecEnv when using TestEnv. Previously it was
uninitialized, and uses of default_fs would return an empty string.

Testing:
- Ran backend, end-to-end, and custom cluster tests with Ozone.
- Ran test_scratch_disk.py exhaustive runs with Ozone and HDFS.

Change-Id: I5837c30357363f727ca832fb94169f2474fb4f6f
Reviewed-on: http://gerrit.cloudera.org:8080/19251
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-08 17:20:33 +00:00
Michael Smith
8cd4a1e4e5 IMPALA-11584: Enable minicluster tests for Ozone
Enables tests guarded by SkipIfNotHdfsMinicluster to run on Ozone as
well as HDFS. Plans are still skipped for Ozone because there's
Ozone-specific text in the plan output.

Updates explain output to allow for Ozone, which has a block size of
256MB instead of 128MB. One of the partitions read in test_explain is
~180MB, straddling the difference between Ozone and HDFS.

Testing: ran affected tests with Ozone.

Change-Id: I6b06ceacf951dbc966aa409cf24a310c9676fe7f
Reviewed-on: http://gerrit.cloudera.org:8080/19250
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-12-06 21:18:33 +00:00
Michael Smith
f8443d9828 IMPALA-11697: Enable SkipIf.not_hdfs tests for Ozone
Convert SkipIf.not_hdfs to SkipIf.not_dfs for tests that require
filesystem semantics, adding more feature test coverage with Ozone.

Creates a separate not_scratch_fs flag for scratch dir tests as they're
not supported with Ozone yet. Filed IMPALA-11730 to address this.

Preserves not_hdfs for a specific test that uses the dfsadmin CLI to put
it in safemode.

Adds sfs_ofs_unsupported for SmallFileSystem tests. This should work for
many of our filesystems based on
ebb1e2fa99/ql/src/java/org/apache/hadoop/hive/ql/io/SingleFileSystem.java (L62-L87). Makes sfs tests work on S3.

Adds hardcoded_uris for IcebergV2 tests where deletes are implemented
as hardcoded URIs in parquet files. Adding a parquet read/write
library for Python is beyond the scope of this patch.

Change-Id: Iafc1dac52d013e74a459fdc4336c26891a256ef1
Reviewed-on: http://gerrit.cloudera.org:8080/19254
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-11-21 18:51:30 +00:00
yacai
c953426692 IMPALA-11683: Support Aliyun OSS File System
This patch adds support for OSS (Aliyun Object Storage Service).
Using hadoop-aliyun, the implementation is similar to other
remote FileSystems.

Tests:
- Prepare:
  Initialize OSS-related environment variables:
  OSS_ACCESS_KEY_ID, OSS_SECRET_ACCESS_KEY, OSS_ACCESS_ENDPOINT.
  Compile and create hdfs test data on a ECS instance. Upload test data
  to an OSS bucket.
- Modify all locations in HMS DB to point to the OSS bucket.
  Remove some hdfs caching params. Run CORE tests.

Change-Id: I267e6531da58e3ac97029fea4c5e075724587910
Reviewed-on: http://gerrit.cloudera.org:8080/19165
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-16 10:14:49 +00:00
Michael Smith
d1d4f183da IMPALA-11704: Delay hdfsOpenFile with data cache
Delays hdfsOpenFile until after data cache lookup if using a data cache.
IMPALA-10147 implemented this, but only when using the file handle
cache. This patch adds an additional check in case file handle caching
is disabled.

In networked environments, hdfsOpenFile can take significant time, as
observed in a TPC-DS run of q90 where TotalRawHdfsOpenFileTime
represented a majority of time spent for HDFS_SCAN_NODE. This patch
brings that time to 0 with a primed data cache.

Change-Id: I9429a41fb16de27ccb57730203f95559df0dbfb6
Reviewed-on: http://gerrit.cloudera.org:8080/19204
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-06 00:23:08 +00:00
Michael Smith
eed92b223f IMPALA-7092: Restore tests after HDFS fixes
Restore EC tests that were disabled until HDFS-13539 and HDFS-13540 were
fixed, as the fixes are available in the current version of Hadoop we
test.

Testing: ran these tests with EC enabled.

Change-Id: I8b0bbc604601e6fab742f145c1adfb3c47b3fb6e
Reviewed-on: http://gerrit.cloudera.org:8080/19159
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-11-04 22:19:21 +00:00
Michael Smith
a870a11e64 IMPALA-7098: Re-enable tests under EC
Re-enables tests under erasure coding, or provides more specific
exceptions.

Erasure coding uses multiple data blocks to construct a block group. Our
tests use RS-3-2-1024k, which includes 3 data blocks in a block group.
Each of these blocks is sized according to `dfs.block.size`, so block
groups by default hold up to 384MB of data.

Impala schedules work to executors based on blocks reported by HDFS,
which for EC actually represent block groups. So with default block
size, a file in EC has 1/3rd the number of schedulable blocks. In the
case of tpch.lineitem, this produces 2 parquet files instead of 3 and
reduces the number of executors scheduled to read parquet lineitem, as
follows:

1. lineitem.tbl is loaded via Hive. With EC it uses 2 block groups,
   without EC it uses 6 blocks.
2. parquet lineitem is created by select/insert from lineitem.tbl.
   Impala schedules reads to executors based on available blocks, so
   with EC this gets scheduled across 2 executors instead of 3 and each
   executor writes a separate parquet file.
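
The arithmetic behind this scheduling difference can be sketched in
Python (a hedged illustration; the function and its parameters are ours,
not Impala's scheduler code, and the ~720MB file size is an assumption
consistent with the 6-block/2-group split described above):

```python
import math

def schedulable_units(file_size_mb, block_size_mb=128, ec_data_blocks=3,
                      erasure_coded=False):
    """Number of schedulable units HDFS reports for a file.

    Without EC, each 128MB block is schedulable on its own. With
    RS-3-2-1024k, a block group spans ec_data_blocks * block_size_mb of
    data (3 * 128MB = 384MB), so a file exposes about 1/3rd the units.
    """
    unit_mb = block_size_mb * (ec_data_blocks if erasure_coded else 1)
    return math.ceil(file_size_mb / unit_mb)

# A ~720MB lineitem.tbl: 6 plain blocks, but only 2 EC block groups,
# so the scan is scheduled across 2 executors instead of 3.
print(schedulable_units(720))                      # 6
print(schedulable_units(720, erasure_coded=True))  # 2
```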

Change-Id: Ib452024993e35d5a8d2854c6b2085115b26e40df
Reviewed-on: http://gerrit.cloudera.org:8080/19172
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-11-04 22:13:50 +00:00
Michael Smith
19114c7205 IMPALA-11578: Exclude locality test for remote FS
Exclude test_scheduler_locality when the filesystem can only be remote.

Change-Id: Ie6198421f21bc2520773ecbb34ffaf65969ebc43
Reviewed-on: http://gerrit.cloudera.org:8080/18980
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-19 21:11:09 +00:00
Michael Smith
79e474d310 IMPALA-10213: Add test for local vs remote scheduling
Impala already supports locality-aware scheduling with Ozone because it
returns location data on partitions. That data doesn't include specific
storage ids in getStorageIds, so we skip a warning that will always
trigger on Ozone.

Updates Ozone to add implicit rules mapping localhost -> 127.0.0.1 for
local development. HDFS translates localhost to 127.0.0.1 for host names
in its location data, which Impala will identify as colocated with
executors in the dev environment. Ozone doesn't, and the default Impala
hostname is the machine hostname - not localhost - so without this
change all HDFS access in the minicluster is local but all Ozone access
is remote.
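
The colocation check described here can be sketched roughly in Python
(the function name and the explicit normalization rule are ours,
mirroring the implicit localhost -> 127.0.0.1 mapping this patch adds
for Ozone, not Impala's actual scheduler code):

```python
def is_local_read(block_hosts, executor_host):
    """A scan is local when the executor's address matches a host the
    filesystem reports for the block. HDFS already reports 'localhost'
    as '127.0.0.1'; Ozone reports the machine hostname, so without an
    equivalent mapping every dev-minicluster Ozone read looks remote."""
    def normalize(host):
        return "127.0.0.1" if host == "localhost" else host
    return normalize(executor_host) in {normalize(h) for h in block_hosts}

print(is_local_read(["127.0.0.1"], "localhost"))           # True: local
print(is_local_read(["my-dev-box.example"], "localhost"))  # False: remote
```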

Adds a test to verify local vs remote assignment by using custom
clusters with hostnames that either do or don't match storage hostnames.

Change-Id: I4e5606528404c3d4fd164c03dec8315345be5f6d
Reviewed-on: http://gerrit.cloudera.org:8080/18841
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-09-07 17:13:04 +00:00
Michael Smith
cf7490ccbd IMPALA-11464: (Addendum) Skip tests in Ozone
Updates the skip for new recursive listing tests to match the comment so
that they're only run on HDFS. The previous skip only roughly matched
the set of all non-HDFS filesystems, and didn't automatically include
new filesystems.

Change-Id: I80de83d506138b57a969258b2f6dcf112dd2e44d
Reviewed-on: http://gerrit.cloudera.org:8080/18934
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-01 09:53:16 +00:00
Michael Smith
1eb0510eaa IMPALA-11456: Collapse filesystem Skip logic
Combines all SkipIf* classes for different filesystems into a single
SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the
rest as filesystem-specific special cases. The 'jira' option is removed
in favor of specific flags for each issue.

Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1
Reviewed-on: http://gerrit.cloudera.org:8080/18781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-10 22:37:08 +00:00
Michael Smith
830625b104 IMPALA-9442: Add Ozone to minicluster
Adds Ozone as an alternative to hdfs in the minicluster. Select by
setting `export TARGET_FILESYSTEM=ozone`. With that flag,
run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot
because Ozone does not support HBase (HDDS-3589); snapshot loading
doesn't work yet primarily due to HDDS-5502.

Uses the o3fs interface because Ozone puts specific restrictions on
bucket names (no underscores, for instance), and it was a lot easier to
use an interface where everything is written to a single bucket than to
update all Impala's use of HDFS-style paths to make `test-warehouse` a
bucket inside a volume.

Specifies reduced Ozone client retries during shutdown where Ozone may
not be available.

Passes tests with FE_TEST=false BE_TEST=false.

Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be
Reviewed-on: http://gerrit.cloudera.org:8080/18738
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-08-03 16:58:20 +00:00
Gergely Fürnstáhl
8965059c2c IMPALA-11358: Fixed Kudu table's missing comment
If Kudu-HMS integration is enabled, Kudu creates the table in HMS too,
which was missing the comment field. Added the code to forward the
comment field to Kudu during creation.

Testing:

Added a test to verify the comment is present when the integration is
enabled.
Re-enabled several Kudu tests, as IMPALA-8751 (and follow-ups) fixed the
Hive 3 notification incompatibility.

Change-Id: Idf66f8b4679b00da6693a27fed79b04e8f6afb55
Reviewed-on: http://gerrit.cloudera.org:8080/18627
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-20 19:38:33 +00:00
Qifan Chen
07a3e6e0df IMPALA-10992: Planner changes for estimated peak memory
This patch provides replan support for multiple executor group sets.
Each executor group set is associated with a distinct number of nodes
and a threshold for estimated memory per host in bytes that can be
denoted as [<group_name_prefix>:<#nodes>, <threshold>].

In the patch, a query of type EXPLAIN, QUERY or DML can be compiled
more than once. In each attempt, per host memory is estimated and
compared with the threshold of an executor group set. If the estimated
memory is no more than the threshold, the iteration process terminates
and the final plan is determined. The executor group set with the
threshold is selected to run the query.
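
The iterative selection loop can be sketched as follows (a simplified
Python illustration; the function name is ours, and the two group sets
come from the artificial test setup described later in this message,
not from production defaults):

```python
# [<group_name_prefix>:<#nodes>, <threshold>] pairs, ascending by threshold.
GROUP_SETS = [("regular", 3, 64 * 2**20),   # 64MB
              ("large",   3, 8 * 2**50)]    # 8PB

def pick_group_set(estimate_per_host_bytes):
    """Compile against each executor group set in turn, stopping at the
    first one whose memory threshold covers the per-host estimate."""
    for name, num_nodes, threshold in GROUP_SETS:
        estimated = estimate_per_host_bytes(num_nodes)
        if estimated <= threshold:
            return name
    return GROUP_SETS[-1][0]  # fall back to the largest group set

# A query estimated at 408MB per host exceeds the 64MB threshold, so it
# takes a second compilation and runs on the 'large' set.
print(pick_group_set(lambda n: 408 * 2**20))  # large
```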

A new query option 'enable_replan', defaulting to 1 (enabled), is added.
It can be set to 0 to disable this patch and to generate the distributed
plan for the default executor group.

To avoid long compilation times, the following enhancements are enabled.
Note that 1) can be disabled when a relevant metadata change is
detected.

 1. Authorization is performed only for the 1st compilation;
 2. openTransaction() is called for transactional queries in 1st
    compilation and the saved transactional info is used in
    subsequent compilations. Similar logic is applied to Kudu
    transactional queries.

To facilitate testing, the patch imposes an artificial two executor
group setup in FE as follows.

 1. [regular:<#nodes>, 64MB]
 2. [large:<#nodes>, 8PB]

This setup is enabled when a new query option 'test_replan' is set
to 1 in backend tests, or RuntimeEnv.INSTANCE.isTestEnv() is true as
in most frontend tests. This query option is set to 0 by default.

Compilation time increases when a query is compiled in several
iterations, as shown below for several TPC-DS queries. The increase
is mostly due to redundant work in either the single-node plan creation
or the value-transfer-graph recomputation phase. For small queries, the
increase can be avoided if they can be compiled in single iteration
by properly setting the smallest threshold among all executor group
sets. For example, for the set of queries listed below, the smallest
threshold can be set to 320MB to catch both q15 and q21 in one
compilation.

                              Compilation time (ms)
 Queries  Estimated Memory   2-iterations  1-iteration  % increase
 q1           408MB              60.14         25.75      133.56%
 q11         1.37GB             261.00        109.61      138.11%
 q10a         519MB             139.24         54.52      155.39%
 q13          339MB             143.82         60.08      139.38%
 q14a        3.56GB             762.68        312.92      143.73%
 q14b        2.20GB             522.01        245.13      112.95%
 q15          314MB               9.73          4.28      127.33%
 q21          275MB              16.00          8.18       95.59%
 q23a        1.50GB             461.69        231.78       99.19%
 q23b        1.34GB             461.31        219.61      110.05%
 q4          2.60GB             218.05        105.07      107.52%
 q67         5.16GB             694.59        334.24      101.82%

Testing:
 1. Almost all FE and BE tests are now run in the artificial two
    executor setup except a few where a specific cluster configuration
    is desirable;
 2. Ran core tests successfully;
 3. Added a new observability test and a new query assignment test;
 4. Disabled concurrent insert test (test_concurrent_inserts) and
    failing inserts (test_failing_inserts) test in local catalog mode
    due to flakiness. Reported both in IMPALA-11189 and IMPALA-11191.

Change-Id: I75cf17290be2c64fd4b732a5505bdac31869712a
Reviewed-on: http://gerrit.cloudera.org:8080/18178
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Qifan Chen <qchen@cloudera.com>
2022-03-21 20:17:28 +00:00
Fucun Chu
157086cb80 IMPALA-10771: Add Tencent COS support
This patch adds support for COS (Cloud Object Storage). Using the
hadoop-cos connector, the implementation is similar to that of other
remote FileSystems.

New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.

Follow-up:
- Support for caching COS file handles will be addressed in
   IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   COS (IMPALA-10773).

Tests:
 - Upload hdfs test data to a COS bucket. Modify all locations in HMS
   DB to point to the COS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-12-08 16:32:02 +00:00
Vihang Karajgaonkar
5a9dcd108d IMPALA-8795: Turn on events processing by default
This commit turns on events processing by default. The default
polling interval is 1 second, which can be overridden by setting
hms_event_polling_interval_s to a non-default value.

With event polling turned on by default, this patch also moves
test_event_processing.py to tests/metadata instead of the custom
cluster tests. Some tests within test_event_processing.py that
needed non-default configurations were moved to
tests/custom_cluster/test_events_custom_configs.py.

Additionally, some other tests were modified to take into account
the automatic ability of Impala to detect newly added tables
from Hive.

Testing done:
1. Ran exhaustive tests by turning on the events processing multiple
times.
2. Ran exhaustive tests by disabling events processing.
3. Ran dockerized tests.

Change-Id: I9a8b1871a98b913d0ad8bb26a104a296b6a06122
Reviewed-on: http://gerrit.cloudera.org:8080/17612
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2021-08-09 17:22:31 +00:00
Qifan Chen
fdb6a4e264 IMPALA-10532: TestOverlapMinMaxFilters.test_overlap_min_max_filters seems flaky
This change disables the overlap min/max filter test for hdfs in
erasure coding, due to the query plan change (from 3-node scan to
2-node scan) which splits the row groups among scan nodes differently.

The SkipIfEC class in test harness skip.py is enhanced with a new
skip reason 'different_scan_split' to facilitate this action.

Testing:
  1. Ran unit tests;
  2. Ran core tests.

Change-Id: I527de530f7db1ce959e7ef2ae3ced18677221c9f
Reviewed-on: http://gerrit.cloudera.org:8080/17289
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-15 09:44:10 +00:00
stiga-huang
2dfc68d852 IMPALA-7712: Support Google Cloud Storage
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to that of other remote
FileSystems.

New flags for GCS:
 - num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.

Follow-up:
 - Support for spilling to GCS will be addressed in IMPALA-10561.
 - Support for caching GCS file handles will be addressed in
   IMPALA-10568.
 - test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   GCS (IMPALA-10562).
 - Some tests are skipped due to issues introduced by /etc/hosts setting
   on GCE instances (IMPALA-10563).

Tests:
 - Compile and create hdfs test data on a GCE instance. Upload test data
   to a GCS bucket. Modify all locations in HMS DB to point to the GCS
   bucket. Remove some hdfs caching params. Run CORE tests.
 - Compile and load snapshot data to a GCS bucket. Run CORE tests.

Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-13 11:20:08 +00:00
Joe McDonnell
35bae939ab IMPALA-10427: Remove SkipIfS3.eventually_consistent pytest marker
These tests were disabled due to S3's eventually consistent
behavior. Now that S3 is strongly consistent, these tests do
not need to be disabled.

Testing:
 - Ran s3 core job

Change-Id: Ie9041f530bf3a818f8954b31a3d01d9f6753d7d4
Reviewed-on: http://gerrit.cloudera.org:8080/16931
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-07 23:53:56 +00:00
Tim Armstrong
62c19e6339 IMPALA-10366: skip test_runtime_profile_aggregated for EC
The schedule for erasure coded data results in 3 instead
of 4 instances of the fragment with the scan. Skip the
test - we don't need special coverage for erasure coding.

Change-Id: I2bb47d89f6d6c59242f2632c481f26d93e28e33e
Reviewed-on: http://gerrit.cloudera.org:8080/16799
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-01 16:44:18 +00:00
Qifan Chen
6493f87357 IMPALA-10334: test_stats_extrapolation output doesn't match on erasure coding build
This patch skips test_stats_extrapolation for erasure code builds. The
reason is that an extra erasure code information line can be included
in the scan explain section when an HDFS table is erasure coded. This
makes the explain output different between a normal build and an
erasure code build. A new reason 'contain_full_explain' is added to
SkipIfEC to facilitate this.

Testing:
  Ran erasure coding version of the EE and CLUSTER tests.
  Ran core tests

Change-Id: I16c11aa0a1ec2d4569c272d2454915041039f950
Reviewed-on: http://gerrit.cloudera.org:8080/16756
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-23 20:36:58 +00:00
Vihang Karajgaonkar
c9ccb61acb IMPALA-10286: Disable metadata.test_catalogd_debug_actions on S3
This patch disables metadata/test_catalogd_debug_actions
test on S3 builds due to its flakiness. The root cause of this
seems to be that listing time on S3 is variable and the test
becomes flaky because it measures the time taken by the refresh
command after a certain debug action is set.

Testing:
1. Ran the test on my local environment to make sure it
compiles fine.

Change-Id: I30bd10de468ad449c4a143a65cdcba97d9f0cd78
Reviewed-on: http://gerrit.cloudera.org:8080/16745
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-19 01:54:35 +00:00
Zoltan Borok-Nagy
981ef10465 IMPALA-10215: Implement INSERT INTO for non-partitioned Iceberg tables (Parquet)
This commit adds support for INSERT INTO statements against Iceberg
tables when the table is non-partitioned and the underlying file format
is Parquet.

We still use Impala's HdfsParquetTableWriter to write the data files,
though they needed some modifications to conform to the Iceberg spec,
namely:
 * write Iceberg/Parquet 'field_id' for the columns
 * TIMESTAMPs are encoded as INT64 micros (without time zone)

We use DmlExecState to transfer information from the table sink
operators to the coordinator, then updateCatalog() invokes the
AppendFiles API to add files atomically. DmlExecState is encoded in
protobuf, communication with the Frontend uses Thrift. Therefore to
avoid defining Iceberg DataFile multiple times they are stored in
FlatBuffers.

The commit also does some corrections on Impala type <-> Iceberg type
mapping:
 * Impala TIMESTAMP is Iceberg TIMESTAMP (without time zone)
 * Impala CHAR is Iceberg FIXED

Testing:
 * Added INSERT tests to iceberg-insert.test
 * Added negative tests to iceberg-negative.test
 * I also did some manual testing with Spark. Spark is able to read
   Iceberg tables written by Impala unless we use TIMESTAMPs. In that
   case Spark rejects the data files because it only accepts TIMESTAMPs
   with time zone.
 * Added concurrent INSERT tests to test_insert_stress.py

Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e
Reviewed-on: http://gerrit.cloudera.org:8080/16545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-26 20:01:09 +00:00
stiga-huang
b02fad2db4 IMPALA-7538: Support HDFS caching with LocalCatalog
This patch adds support for HDFS caching in LocalCatalog coordinators.
Catalogd propagates HdfsCachePools the same way as in catalog-v1.
They are cached in LocalCatalog coordinators as in v1 and are not
"fetch-on-demand", since only cache pool names are cached.

The isMarkedCached markers of HdfsTable and HdfsPartition are also
propagated to the LocalCatalog coordinators for correctly handling
ShowTableStats and ShowPartitions statements with caching information.

Tests:
 - Revive hdfs caching related tests in metadata/test_ddl.py and
   query_test/test_hdfs_caching.py for LocalCatalog.

Change-Id: I661f7b76a9575f6f5b3fa2c6feebda1a5d7c3712
Reviewed-on: http://gerrit.cloudera.org:8080/16058
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 22:03:24 +00:00
Tim Armstrong
6ec6aaae8e IMPALA-3695: Remove KUDU_IS_SUPPORTED
Testing:
Ran exhaustive tests.

Change-Id: I059d7a42798c38b570f25283663c284f2fcee517
Reviewed-on: http://gerrit.cloudera.org:8080/16085
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 01:11:18 +00:00
Joe McDonnell
3e76da9f51 IMPALA-9708: Remove Sentry support
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.

Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
   "ranger" requires extra parameters to be set. This changes the
   default value of authorization_provider to "", which translates
   internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
 - authorization_policy_provider_class
 - sentry_catalog_polling_frequency_s
 - sentry_config
3. The authorization_factory_class may be obsolete now that
   there is only one authorization policy, but this leaves it
   in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
   that is removed. There are still Maven dependencies coming
   from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
   is not removed and it is still called from testdata/bin/kill-all.sh.

Testing:
 - Core job passes

Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 17:43:40 +00:00
Zoltan Borok-Nagy
8aa0652871 IMPALA-9484: Full ACID Milestone 1: properly scan files that have full ACID schema
Full ACID row format looks like this:

{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}

User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.

Because of that, in the catalog we create the inverse of the above
schema:

{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  }
  "i": 1
}

This way very little modification is needed in the frontend. And the
hidden columns can be easily retrieved via 'SELECT row__id.*' when we
need those for debugging/testing.
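
The inversion between the file schema and the catalog schema can be
illustrated with a small sketch (the field names come from the examples
above; the function itself is ours, not Impala's frontend code):

```python
ACID_COLS = ("operation", "originalTransaction", "bucket",
             "rowId", "currentTransaction")

def invert_full_acid_row(file_row):
    """Turn a file-schema row (user columns nested under 'row') into the
    catalog-schema shape: ACID metadata tucked under a hidden 'row__id'
    struct and user columns hoisted to the top level."""
    catalog_row = {"row__id": {c: file_row[c] for c in ACID_COLS}}
    catalog_row.update(file_row["row"])
    return catalog_row

file_row = {"operation": 0, "originalTransaction": 1, "bucket": 536870912,
            "rowId": 0, "currentTransaction": 1, "row": {"i": 1}}
assert invert_full_acid_row(file_row) == {
    "row__id": {"operation": 0, "originalTransaction": 1,
                "bucket": 536870912, "rowId": 0, "currentTransaction": 1},
    "i": 1,
}
```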

We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.

Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.

Newly added Tests:
 * specific queries about hidden columns (full-acid-rowid.test)
 * SHOW CREATE TABLE (show-create-table-full-acid.test)
 * DESCRIBE [FORMATTED] TABLE (describe-path.test)
 * INSERT should be forbidden (acid-negative.test)
 * added tests for column masking (
   ranger_column_masking_complex_types.test)

Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-02 12:01:41 +00:00
Fang-Yu Rao
01684ab3aa IMPALA-9191 (part 1): Allow Impala to run tests without Sentry
This patch adds an environment variable DISABLE_SENTRY to allow Impala
to run tests without Sentry. Specifically, we start up Sentry only when
$DISABLE_SENTRY does not evaluate to true. The corresponding Sentry FE
and E2E tests will also be skipped if $DISABLE_SENTRY is true.

Moreover, in this patch we will set DISABLE_SENTRY to true if
$USE_CDP_HIVE evaluates to true, allowing one to only test Impala's
authorization with Ranger when support for Sentry is dropped after we
switch to the CDP Hive.

Note that in this patch we also change the way we generate
hive-site.xml when $DISABLE_SENTRY is true. To be more precise, when
generating hive-site.xml, we do not add the Sentry server as a metastore
event listener if $DISABLE_SENTRY is true. Recall that both CDH Hive and
CDP Hive would make an RPC to the registered listeners every time after
the method of create_database_core() in HiveMetaStore.java is called,
which happens when Hive instead of Impala is used to create a database,
e.g., when some databases in the TPC-DS data set are created during the
execution of create-load-data.sh. Thus the removal of Sentry as an event
listener is necessary when $DISABLE_SENTRY is true in that it prevents
the HiveMetaStore from repeatedly trying to connect to a Sentry server
that is not online, which could make create-load-data.sh time out.

Testing:
Except for two currently known issues, IMPALA-9513 and IMPALA-9451,
verified this patch passes the exhaustive tests in the DEBUG build
- when $USE_CDP_HIVE is false, and
- when $USE_CDP_HIVE is true.

Change-Id: Ifa3f1840a77a7b32310a5c8b78a2c26300ccb41e
Reviewed-on: http://gerrit.cloudera.org:8080/15505
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-21 20:14:33 +00:00