123 Commits

Author SHA1 Message Date
stiga-huang
0b619962e6 IMPALA-14011: Skip test_no_hms_event_incremental_refresh_transactional_table on new Hive versions
The hms_event_incremental_refresh_transactional_table feature is
mature and has been enabled for years, so we'd like to deprecate the
option of turning it off. However, for older Hive versions like
Apache Hive 3 that don't provide sufficient APIs for Impala to
process COMMIT_TXN events, users can still turn it off.

This patch skips
test_no_hms_event_incremental_refresh_transactional_table when running
on CDP Hive.

To run the test on Apache Hive 3, the test is adjusted to create the
ACID table using tblproperties instead of a "create transactional
table" statement.

Tests:
 - Ran the test on CDP Hive and Apache Hive 3.

Change-Id: I93379e5331072bec1d3a4769f7d7ab59431478ee
Reviewed-on: http://gerrit.cloudera.org:8080/23435
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-17 20:11:33 +00:00
Riza Suminto
28cff4022d IMPALA-14333: Run impala-py.test using Python3
Running exhaustive tests with the env var IMPALA_USE_PYTHON3_TESTS=true
revealed some tests that require adjustment. This patch makes those
adjustments, which mostly revolve around encoding differences and str
vs bytes types in Python3. This patch also switches the default to run
pytest with Python3 by setting IMPALA_USE_PYTHON3_TESTS=true. The
following are the details:

Change the hash() function in conftest.py to crc32() to produce a
deterministic hash. Hash randomization has been enabled by default
since Python 3.3 (see
https://docs.python.org/3/reference/datamodel.html#object.__hash__).
This caused test sharding (like --shard_tests=1/2) to produce an
inconsistent set of tests per shard. Always restart the minicluster
during custom cluster tests if the --shard_tests argument is set,
because the test order may change and affect test correctness,
depending on whether a test runs on a fresh minicluster or not.
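The deterministic sharding described above can be sketched roughly as
follows (the function name and shard selection are illustrative, not
the actual conftest.py code):

```python
import zlib

def shard_of(test_name, num_shards):
    # zlib.crc32 is stable across runs and interpreters, unlike the
    # built-in hash(), which is randomized per process since Python 3.3.
    return zlib.crc32(test_name.encode("utf-8")) % num_shards

# Example: with --shard_tests=1/2, keep only tests landing in shard 0.
tests = ["test_a", "test_b", "test_c"]
shard_1_of_2 = [t for t in tests if shard_of(t, 2) == 0]
```

Because crc32 of the same name always yields the same value, every
shard sees a consistent subset of tests across runs.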

Moved one test case from delimited-latin-text.test to
test_delimited_text.py for easier binary comparison.

Add bytes_to_str() as a utility function to decode bytes in Python3.
This is often needed when inspecting the return value of
subprocess.check_output() as a string.
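A minimal sketch of such a helper (the real bytes_to_str() in the
patch may differ in signature and error handling):

```python
import subprocess

def bytes_to_str(b):
    # subprocess.check_output() returns bytes under Python 3; decode so
    # callers can keep comparing against str literals as before.
    if isinstance(b, bytes):
        return b.decode("utf-8")
    return b

out = bytes_to_str(subprocess.check_output(["echo", "hello"]))
```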

Implement DataTypeMetaclass.__lt__ to substitute for
DataTypeMetaclass.__cmp__, which is ignored in Python3 (see
https://peps.python.org/pep-0207/).
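The shape of that change can be illustrated with a toy metaclass (the
class names here are hypothetical, not Impala's actual type classes):

```python
class DataTypeMetaclass(type):
    # Python 3 ignores __cmp__ (PEP 207); sorting the data-type classes
    # now needs an explicit rich comparison on the metaclass.
    def __lt__(cls, other):
        return cls.__name__ < other.__name__

class Int(metaclass=DataTypeMetaclass):
    pass

class Boolean(metaclass=DataTypeMetaclass):
    pass

# sorted() compares the classes themselves, which dispatches to the
# metaclass __lt__ defined above.
ordered = sorted([Int, Boolean])
```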

Fix WEB_CERT_ERR difference in test_ipv6.py.

Fix trivial integer parsing in test_restart_services.py.

Fix various encoding issues in test_saml2_sso.py,
test_shell_commandline.py, and test_shell_interactive.py.

Change timeout in Impala.for_each_impalad() from sys.maxsize to 2^31-1.

Switch to binary comparison in test_iceberg.py where needed.

Specify text mode when calling tempfile.NamedTemporaryFile().
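The reason is that NamedTemporaryFile() defaults to binary mode
('w+b'), so writing a str raises TypeError under Python 3; requesting
text mode keeps str writes working:

```python
import tempfile

# Default mode is 'w+b': f.write("some text") would raise TypeError in
# Python 3. Passing mode='w+t' (plus an encoding) accepts str again.
with tempfile.NamedTemporaryFile(mode="w+t", encoding="utf-8") as f:
    f.write("hello")
    f.seek(0)
    content = f.read()
```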

Simplify create_impala_shell_executable_dimension to skip testing the
dev and python2 impala-shell when IMPALA_USE_PYTHON3_TESTS=true. The
reason is that several UTF-8 related tests in test_shell_commandline.py
break in the Python3 pytest + Python2 impala-shell combo. This skipping
already happens automatically on build OSes without system Python2
available, like RHEL9 (where the IMPALA_SYSTEM_PYTHON2 env var is
empty).

Removed unused vector argument and fixed some trivial flake8 issues.

Several tests required logic modifications due to intermittent issues
in Python3 pytest. These include:

Add _run_query_with_client() in test_ranger.py to allow reusing a
single Impala client for running several queries. Ensure clients are
closed when the test is done. Mark several tests in test_ranger.py
with SkipIfFS.hive because they run queries through beeline +
HiveServer2, but the Ozone and S3 build environments do not start
HiveServer2 by default.

Increase the sleep period from 0.1 to 0.5 seconds per iteration in
test_statestore.py and mark TestStatestore to execute serially,
because TServer appears to shut down more slowly when run concurrently
with other tests. Handle the deprecation of Thread.setDaemon() as well.
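The setDaemon() fix is mechanical: Thread.setDaemon() is deprecated
since Python 3.10 in favor of the daemon attribute (the helper name
below is illustrative):

```python
import threading

def start_background(target):
    t = threading.Thread(target=target)
    # Thread.setDaemon(True) is deprecated since Python 3.10; assign
    # the daemon attribute (or pass daemon=True) instead.
    t.daemon = True
    t.start()
    return t

worker = start_background(lambda: None)
worker.join()
```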

Always set force_restart=True for each test method in TestLoggingCore,
TestShellInteractiveReconnect, and TestQueryRetries to prevent them
from reusing the minicluster from a previous test method. Some of
these tests destroy the minicluster (kill impalad) and will produce a
minidump if the metrics verifier for the next test fails to detect a
healthy minicluster state.

Testing:
Pass exhaustive tests with IMPALA_USE_PYTHON3_TESTS=true.

Change-Id: I401a93b6cc7bcd17f41d24e7a310e0c882a550d4
Reviewed-on: http://gerrit.cloudera.org:8080/23319
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-03 10:01:29 +00:00
Csaba Ringhofer
c45e3e7968 IMPALA-14109: Remove SkipIfCatalogV2.hms_event_polling_disabled
This skipIf used the coordinator webui to check whether the flag
was set, and skipped the test if the cluster was not running.
The skipIf was only used in custom cluster tests, where the cluster
is restarted with new flags anyway, so the flags of the previous
cluster are not relevant.

Change-Id: I455b39eff95e45d02c7b9e0b35d8e7fe03145bb1
Reviewed-on: http://gerrit.cloudera.org:8080/22960
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-02 16:45:53 +00:00
jfehr
742d8d05f5 IMPALA-14090: Move Some Stable Custom Cluster Tests to Exhaustive
Moves several custom cluster tests out of core and into exhaustive
only. The tests were chosen based on their stability, lack of recent
modifications, and coverage of rare/corner cases.

Testing was accomplished by running both core and exhaustive tests
and manually verifying the tests were or were not skipped as
expected.

Change-Id: If99c015a0cb5d95b1607ca2be48d2dea04194f81
Reviewed-on: http://gerrit.cloudera.org:8080/22963
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-02 07:53:37 +00:00
Joe McDonnell
5b4afb4f8f IMPALA-13368: Fixup Redhat detection for Python >= 3.8
Python 3.8 removed the platform.linux_distribution() function which is
currently used to detect Redhat. This switches to using the 'distro'
package, which implements the same functionality across different
Python versions. Since Redhat 6 is no longer supported, this removes
the detection of Redhat 6 and associated skip logic.
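A hedged sketch of what such detection looks like with the third-party
'distro' package (this is not Impala's actual code; the id strings are
taken from the distro package's documented return values):

```python
try:
    import distro  # third-party replacement: pip install distro

    def is_redhat_family():
        # distro.id() returns a short id such as 'rhel', 'centos', or
        # 'fedora' on Red Hat derived systems, on any Python version.
        return distro.id() in ("rhel", "centos", "fedora")
except ImportError:
    def is_redhat_family():
        # Conservative fallback when the package is unavailable.
        return False
```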

Testing:
 - Ran a core job

Change-Id: I0dfaf798c0239f6068f29adbd2eafafdbbfd66c3
Reviewed-on: http://gerrit.cloudera.org:8080/22073
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-12-17 07:28:51 +00:00
Yida Wu
f2a09b6dda IMPALA-12907: Add testcases for TPC-H/TPC-DS queries with tuple caching
Added testcases to run TPC-H and TPC-DS queries twice with tuple
caching to verify that Impala won't crash and to ensure the
correctness of the results.

Testcases allow mt_dop to be 0 or 4.

Also added the environment variables for the tuple cache to
run-all-tests.sh and added a skipif to test_tuple_cache_tpc_queries.py
to skip the tests if the tuple cache is not enabled.

Tests:
Ran the tests in the build with tuple cache enabled.

Change-Id: I967372744d8dda25cbe372aefec04faec5a76847
Reviewed-on: http://gerrit.cloudera.org:8080/21628
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-08-16 07:30:26 +00:00
wzhou-code
08f8a30025 IMPALA-12910: Support running TPCH/TPCDS queries for JDBC tables
This patch adds a script to create external JDBC tables for the TPCH
and TPCDS datasets, and adds unit tests to run TPCH and TPCDS queries
against external JDBC tables with Impala-Impala federation. Note that
JDBC tables are mapping tables; they don't take additional disk space.
It fixes a race condition in the caching of SQL DataSource objects by
using a new DataSourceObjectCache class, which checks the reference
count before closing a SQL DataSource.
Adds a new query option 'clean_dbcp_ds_cache' with a default value of
true. When it is set to false, a SQL DataSource object will not be
closed when its reference count reaches 0, and will be kept in the
cache until the SQL DataSource is idle for more than 5 minutes. The
flag variable 'dbcp_data_source_idle_timeout_s' is added to make that
duration configurable.
java.sql.Connection.close() sometimes fails to remove a closed
connection from the connection pool, which causes JDBC worker threads
to wait a long time for available connections from the pool. The
workaround is to call the BasicDataSource.invalidateConnection() API
to close a connection.
Two flag variables are added for the DBCP configuration properties
'maxTotal' and 'maxWaitMillis'. Note that the 'maxActive' and 'maxWait'
properties were renamed to 'maxTotal' and 'maxWaitMillis' respectively
in apache.commons.dbcp v2.
Fixes a bug in database type comparison: the type strings specified by
the user could be lower case or a mix of upper/lower case, but the
code compared the types against an upper-case string.
Fixes an issue to close the SQL DataSource object in JdbcDataSource.open()
and JdbcDataSource.getNext() when errors are returned from DBCP APIs
or JDBC drivers.

testdata/bin/create-tpc-jdbc-tables.py supports creating JDBC tables
for Impala-Impala, Postgres, and MySQL.
The following sample commands create TPCDS JDBC tables for
Impala-Impala federation with a remote coordinator running at
10.19.10.86, and a Postgres server running at 10.19.10.86:
  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=IMPALA --database_host=10.19.10.86 --clean

  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=POSTGRES --database_host=10.19.10.86 \
    --database_name=tpcds --clean

TPCDS tests for JDBC tables run only for release/exhaustive builds.
TPCH tests for JDBC tables run for core and exhaustive builds, except
in Dockerized builds.

Remaining Issues:
 - tpcds-decimal_v2-q80a failed with returned rows not matching expected
   results for some decimal values. This will be fixed in IMPALA-13018.

Testing:
 - Passed core tests.
 - Passed query_test/test_tpcds_queries.py in release/exhaustive build.
 - Manually verified that only one SQL DataSource object was created for
   test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since query option
   'clean_dbcp_ds_cache' was set as false, and the SQL DataSource object
   was closed by cleanup thread.

Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a
Reviewed-on: http://gerrit.cloudera.org:8080/21304
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-02 02:14:20 +00:00
stiga-huang
9ecf0cbfc7 IMPALA-12054: Lazily check Kudu flags in tests
I usually shut down Kudu in my dev env to save some resources. However,
tests that import skip.py will fail if the Kudu cluster is not running
locally, even if the tests are unrelated to Kudu. The cause is that
Kudu web pages are accessed when the module is imported, which fails
if the Kudu cluster is not running.

This patch exposes the decorators of SkipIfKudu as methods just like
what we did in SkipIfCatalogV2, so Kudu web pages can be checked lazily
when needed.
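The lazy-evaluation idea can be shown with a toy decorator (the real
patch uses pytest.mark.skipif markers; the names and the stand-in
check below are hypothetical):

```python
def kudu_web_ui_reachable():
    # Stand-in for the real check, which fetches the Kudu master web
    # UI. Because the check lives inside a function, importing the
    # module no longer touches the network.
    return False

class SkipIfKudu:
    @staticmethod
    def no_master():
        # The check runs only when a test actually builds the
        # decorator, i.e. lazily, not at import time of skip.py.
        should_skip = not kudu_web_ui_reachable()

        def decorator(fn):
            fn.skip = should_skip
            return fn
        return decorator

@SkipIfKudu.no_master()
def test_kudu_scan():
    pass
```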

Tests:
 - Ran Kudu tests.
 - Ran some Kudu unrelated tests without launching the Kudu cluster.

Change-Id: Ic7a8282b59d72322085c21c70a5019c51b586a52
Reviewed-on: http://gerrit.cloudera.org:8080/20904
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-01-16 23:12:30 +00:00
wzhou-code
a2c2f118d2 IMPALA-12375: Make DataSource Object persistent
DataSource objects are saved in an in-memory cache in the Catalog
server; they are not persisted to the HMS. The objects are lost after
the Catalog server is restarted, and users need to recreate DataSource
objects before creating new external DataSource tables.
This patch makes DataSource objects persistent by saving them as
DataConnector objects with type "impalaDataSource" in HMS. Since HMS
events for DataConnector are not handled, the Catalog server has to
refresh DataSource objects when the catalogd becomes active.
Note that this feature is not supported for Apache Hive 3.1 and older
versions.

Testing:
 - Added two end-to-end unit tests with restarting of Catalog server,
   and catalogd HA failover.
   These two tests are skipped when USE_APACHE_HIVE is set to true
   and the Apache Hive version is 3.x or older.
 - Passed all-build-options-ub2004.
 - Passed core test.

Change-Id: I500a99142bb62ce873e693d573064ad4ffa153ab
Reviewed-on: http://gerrit.cloudera.org:8080/20768
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
2024-01-03 03:25:18 +00:00
Riza Suminto
967ed18407 IMPALA-12528: Deflake test_hdfs_scanner_thread_non_reserved_bytes
The prior deflake attempt in IMPALA-12499 does not seem sufficient.
There are still sporadic failures happening in
test_hdfs_scanner_thread_non_reserved_bytes. This patch further
attempts to deflake it by:
- Injecting a 100ms sleep every time a scanner thread obtains a new
  scan range.
- Running it serially.
- Skipping it in dockerized environments.

This patch also fixes small comment mistakes in hdfs-scan-node.cc.

Testing:
- Loop and pass the test 100 times in local minicluster environment.

Change-Id: I5715cf16c87ff0de51afd2fa778c5b591409d376
Reviewed-on: http://gerrit.cloudera.org:8080/20640
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-11-02 12:12:20 +00:00
wzhou-code
c77a457520 IMPALA-7131: Support external data sources in LocalCatalog mode
This patch makes external data sources work in LocalCatalog mode:
 - Add APIs in CatalogdMetaProvider to fetch DataSource from Catalog
   server through RPC.
 - Add getDataSources() and getDataSource() in LocalCatalog.
 - Add LocalDataSourceTable class for loading DataSource table in
   LocalCatalog.
 - Handle request for loading DataSource in CatalogServiceCatalog on
   Catalog server.
 - Enable tests which are skipped by
   SkipIfCatalogV2.data_sources_unsupported().
   Remove SkipIfCatalogV2.data_sources_unsupported().
 - Add end-to-end tests for LocalCatalog mode.

Testing:
 - Passed core tests

Change-Id: I40841c9be9064ac67771c4d3f5acbb3b552a2e55
Reviewed-on: http://gerrit.cloudera.org:8080/20574
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
2023-10-30 16:04:47 +00:00
Riza Suminto
05890c1c84 IMPALA-12499: Deflake test_hdfs_scanner_thread_mem_scaling
IMPALA-11068 added three new tests to
hdfs-scanner-thread-mem-scaling.test. The first one was failing
intermittently, most likely because the fragment right above the scan
does not pull row batches fast enough. This patch attempts to deflake
the tests by replacing it with a simple count star query. The three
test cases are now contained in their own
test_hdfs_scanner_thread_non_reserved_bytes and will be skipped for
sanitized builds.

Testing:
- Loop and pass test_hdfs_scanner_thread_non_reserved_bytes a hundred
  times.

Change-Id: I7c99b2ef70b71e148cedb19037e2d99702966d6e
Reviewed-on: http://gerrit.cloudera.org:8080/20593
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-20 00:41:59 +00:00
Michael Smith
e27e4eb54a IMPALA-11941: (Addendum) ease testing other JDKs
Makes it simpler to build with one JDK and run tests with another.
TEST_JDK_VERSION sets IMPALA_JDK_VERSION before running tests, so the
Impala cluster is started with that JDK. TEST_JAVA_HOME_OVERRIDE sets
IMPALA_JAVA_HOME_OVERRIDE if a non-OS version of Java is required.

Restart Kudu with original JAVA_HOME in frontend tests.

Also skips restarting Hive, Kudu, and Ranger in tests as they'll restart
with a different JDK than originally started with.

Testing:
1. built normally
2. ran "TEST_JDK_VERSION=17 run-all-tests.sh"
3. verified various logs contain "java.specification.version:17"

Change-Id: I46b5515efd9537d63b843dbc42aa93b376efce00
Reviewed-on: http://gerrit.cloudera.org:8080/20143
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-07-27 02:10:17 +00:00
Michael Smith
8f35c7f4aa IMPALA-12052: Update EC policy string for Ozone
HDDS-7122 changed how Ozone prints chunk size from bytes to KB, as in
1024k rather than 1048576. That makes it consistent with HDFS reporting.

Our tests verify the chunk size reported in SHOW output. Updates the
expected erasure code policy string to match the new format.

Updates CDP_OZONE_VERSION to a build that includes HDDS-7122. However
this build includes two regressions that we work around for the moment:
- HDDS-8543: FSO layout reports incorrect replication config for
             directories in EC buckets
- HDDS-8289: FSO layout listStatus operations get slower with lots of
             files and filesystem operations

Testing:
- ran test suite with Ozone Erasure Coding

Change-Id: I5354de61bbc507931a1d5bc86f6466c0dd50fc30
Reviewed-on: http://gerrit.cloudera.org:8080/19870
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-05-18 17:34:33 +00:00
Riza Suminto
baddaf2241 IMPALA-12144: Skip TestTpcdsQueryWithProcessingCost if dockerised
There are signs of flakiness in TestTpcdsQueryWithProcessingCost
within the dockerised environment. The flakiness seems to happen due
to the tighter per-process memory limit in the dockerised environment.
This patch skips TestTpcdsQueryWithProcessingCost in the dockerised
environment.

Testing:
- Hack SkipIfDockerizedCluster.insufficient_mem_limit to return True if
  IS_HDFS and confirm that the whole TestTpcdsQueryWithProcessingCost is
  skipped.

Change-Id: Ibb6b2d4258a2c6613d1954552f21641b42cb3c38
Reviewed-on: http://gerrit.cloudera.org:8080/19892
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-16 20:40:11 +00:00
Michael Smith
5018a30d6d IMPALA-11476: (Addendum) enable test_erasure_coding for Ozone
Enables test_erasure_coding for Ozone. HDDS-7603 is planned for Ozone
1.4.0, but we test with a CDP build - 1.3.0.7.2.17.0-127 - that already
includes this fix. Since this is a testing-only change, it seems safe
to rely on that.

Testing:
- Ran test_erasure_coding on Ozone with EC

Change-Id: Iee57c008102db7fac89abcea9a140c867178bb08
Reviewed-on: http://gerrit.cloudera.org:8080/19578
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-03-13 22:06:46 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
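The two behaviors the commit emulates can be seen side by side; under
Python 3 the `__future__` imports are no-ops, so this snippet behaves
identically on both versions:

```python
from __future__ import absolute_import, division, print_function

# With the division import, '/' is true division on Python 2 as well,
# matching Python 3 and yielding a float for two ints.
ratio = 7 / 2        # 3.5
# '//' is explicit floor division, substituted where an integer result
# is required (e.g. indices, counts of records).
index = 7 // 2       # 3
```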

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
yx91490
f4d306cbca IMPALA-11629: Support for huawei OBS FileSystem
This patch adds support for huawei OBS (Object Storage Service)
FileSystem. The implementation is similar to other remote FileSystems.

New flags for OBS:
- num_obs_io_threads: Number of OBS I/O threads. Defaults to be 16.

Testing:
 - Upload hdfs test data to an OBS bucket. Modify all locations in HMS
   DB to point to the OBS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: I84a54dbebcc5b71e9bcdd141dae9e95104d98cb1
Reviewed-on: http://gerrit.cloudera.org:8080/19110
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-09 08:10:19 +00:00
Peter Rozsa
1d05381b7b IMPALA-11745: Add Hive's ESRI geospatial functions as builtins
This change adds geospatial functions from Hive's ESRI library
as builtin UDFs. Plain Hive UDFs are imported without changes,
but the generic and varargs functions are handled differently:
generic functions are added with all combinations of their
parameters (the cartesian product of the parameters), and
varargs functions are unfolded into simple functions taking n
parameters. The varargs function wrappers are generated at build
time and can be configured in gen_geospatial_udf_wrappers.py.
These additional steps are required because of limitations in
Impala's UDF Executor (lack of varargs support and only partial
generics support); if those are improved, the additional
wrapping/mapping steps could be removed.

Changes regarding function handling/creating are sourced from
https://gerrit.cloudera.org/c/19177

A new backend flag was added to turn this feature on/off
as "geospatial_library". The default value is "NONE" which
means no geospatial function gets registered
as builtin, "HIVE_ESRI" value enables this implementation.

The ESRI geospatial implementation for Hive is currently only
available in Hive 4, but CDP Hive backported it to Hive 3;
therefore, for Apache Hive this feature is disabled
regardless of the "geospatial_library" flag.

Known limitations:
 - ST_MultiLineString, ST_MultiPolygon only work
   with the WKT overload
 - ST_Polygon supports a maximum of 6 pairs of coordinates
 - ST_MultiPoint, ST_LineString support a maximum of 7
   pairs of coordinates
 - ST_ConvexHull, ST_Union support a maximum of 6 geoms

These limits can be increased in gen_geospatial_udf_wrappers.py

Tests:
 - test_geospatial_udfs.py added based on
   https://github.com/Esri/spatial-framework-for-hadoop

Co-Authored-by: Csaba Ringhofer <csringhofer@cloudera.com>

Change-Id: If0ca02a70b4ba244778c9db6d14df4423072b225
Reviewed-on: http://gerrit.cloudera.org:8080/19425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-07 20:18:47 +00:00
Michael Smith
bbb0b4939d IMPALA-11476: Support Ozone erasure coding
Adds support for identifying erasure coding policy with Ozone. Enables
testing Ozone with erasure coding.

Omits support for identifying erasure coding policy with the o3fs
protocol as that protocol is effectively deprecated and its classes
don't provide access to the ObjectStore.

Refactors volumeBucketPair to use StringTokenizer.

Test updates:
- test_exclusive_coordinator_plan: Ozone+EC blocks are 768MB, which is
  larger than all tables in our test environment. Use tpch_parquet which
  we rely on having 3 files (by loading from snapshot in this case).
- test_new_file_shorter: receives an EOFException when seeking with EC
- test_local_read: erasure-coded-bytes-read is also tied to IMPALA-11697
- test_erasure_coding: Ozone doesn't report files as erasure-coded
  (HDDS-7603)

Testing:
- Passes core E2E and custom cluster tests with TARGET_FILESYSTEM=ozone
  and ERASURE_CODING=true.

Change-Id: I201e2e33ce94bbc1e81631a0a315884bcc8047d1
Reviewed-on: http://gerrit.cloudera.org:8080/19324
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-25 18:18:28 +00:00
Gergely Fürnstáhl
4595371ea2 IMPALA-11821: Adjusting manifest_length and absolute paths in case of metadata rewrite
testdata/bin/rewrite-iceberg-metadata.py rewrites manifest and snapshot
files using the provided prefix for file paths. Snapshot files also
store the length of the manifest files, so that needs to be adjusted
too.

Additionally, improved the path rewrite to handle absolute paths
correctly and to pretty-print the dumped metadata JSON files.

Testing:
 - Tested locally, manually verified the rewrites
 - Tested on Ozone, automatically rewriting the test data and running
test_iceberg.py

Change-Id: I89b9208f25552012cc1ab16fa60a819dd5a683d9
Reviewed-on: http://gerrit.cloudera.org:8080/19412
Reviewed-by: Noemi Pap-Takacs <npaptakacs@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-01-17 22:33:58 +00:00
noemi
4a05eaf988 IMPALA-11807: Fix TestIcebergTable.test_avro_file_format and test_mixed_file_format
Iceberg hardcodes URIs in metadata files. If the table was written
in a certain storage location and then moved to another file system,
the hardcoded URIs will still point to the old location instead of
the current one. Therefore Impala will be unable to read the table.

TestIcebergTable.test_avro_file_format and test_mixed_file_format
use Hive from Impala to write tables. If the tables are created in
a different file system than the one they will be read from, the tests
fail due to the invalid URIs.
Skipping these 2 tests if testing is not done on HDFS.

Updated the data load schema of the 2 test tables created by Hive and
set LOCATION to the same as in the previous test tables. If this
makes it possible to rewrite the URIs in the metadata and makes the
tables accessible from another file system as well later, then the
tests can be enabled again.

Testing:
 - Testing locally on HDFS minicluster
 - Triggered an Ozone build to verify that it is skipped on a different
   file system

Change-Id: Ie2f126de80c6e7f825d02f6814fcf69ae320a781
Reviewed-on: http://gerrit.cloudera.org:8080/19387
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-22 19:45:21 +00:00
Michael Smith
a469a9cf19 IMPALA-11730: Add support for spilling to Ozone
Adds support for spilling to Ozone (ofs and o3fs schemes) for parity
with HDFS. Note that ofs paths start with <volume>/<bucket>, which have
naming restrictions; tmp/impala-scratch is a valid name, so something
like ofs://localhost:9862/tmp would work as a scratch directory (volume
tmp, implicit bucket impala-scratch).

Updates tests to determine the correct path from the environment. Fixes
backend tests to work with Ozone as well. Guards test_scratch_disk.py
behind a new flag for filesystems that support spilling. Updates metric
verification to wait for scratch-space-bytes-used to be non-zero, as it
seems to update slower with Ozone.

Refactors TmpDir to remove extraneous variables and functions. Each
implementation is expected to handle its own token parsing.

Initializes default_fs in ExecEnv when using TestEnv. Previously it was
uninitialized, and uses of default_fs would return an empty string.

Testing:
- Ran backend, end-to-end, and custom cluster tests with Ozone.
- Ran test_scratch_disk.py exhaustive runs with Ozone and HDFS.

Change-Id: I5837c30357363f727ca832fb94169f2474fb4f6f
Reviewed-on: http://gerrit.cloudera.org:8080/19251
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-12-08 17:20:33 +00:00
Michael Smith
8cd4a1e4e5 IMPALA-11584: Enable minicluster tests for Ozone
Enables tests guarded by SkipIfNotHdfsMinicluster to run on Ozone as
well as HDFS. Plans are still skipped for Ozone because there's
Ozone-specific text in the plan output.

Updates explain output to allow for Ozone, which has a block size of
256MB instead of 128MB. One of the partitions read in test_explain is
~180MB, straddling the difference between Ozone and HDFS.

Testing: ran affected tests with Ozone.

Change-Id: I6b06ceacf951dbc966aa409cf24a310c9676fe7f
Reviewed-on: http://gerrit.cloudera.org:8080/19250
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-12-06 21:18:33 +00:00
Michael Smith
f8443d9828 IMPALA-11697: Enable SkipIf.not_hdfs tests for Ozone
Convert SkipIf.not_hdfs to SkipIf.not_dfs for tests that require
filesystem semantics, adding more feature test coverage with Ozone.

Creates a separate not_scratch_fs flag for scratch dir tests as they're
not supported with Ozone yet. Filed IMPALA-11730 to address this.

Preserves not_hdfs for a specific test that uses the dfsadmin CLI to put
it in safemode.

Adds sfs_ofs_unsupported for SmallFileSystem tests. This should work for
many of our filesystems based on
ebb1e2fa99/ql/src/java/org/apache/hadoop/hive/ql/io/SingleFileSystem.java (L62-L87). Makes sfs tests work on S3.

Adds hardcoded_uris for IcebergV2 tests where deletes are implemented
as hardcoded URIs in parquet files. Adding a parquet read/write
library for Python is beyond the scope of this patch.

Change-Id: Iafc1dac52d013e74a459fdc4336c26891a256ef1
Reviewed-on: http://gerrit.cloudera.org:8080/19254
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-11-21 18:51:30 +00:00
yacai
c953426692 IMPALA-11683: Support Aliyun OSS File System
This patch adds support for OSS (Aliyun Object Storage Service).
Using hadoop-aliyun, the implementation is similar to other
remote FileSystems.

Tests:
- Prepare:
  Initialize OSS-related environment variables:
  OSS_ACCESS_KEY_ID, OSS_SECRET_ACCESS_KEY, OSS_ACCESS_ENDPOINT.
  Compile and create hdfs test data on a ECS instance. Upload test data
  to an OSS bucket.
- Modify all locations in HMS DB to point to the OSS bucket.
  Remove some hdfs caching params. Run CORE tests.

Change-Id: I267e6531da58e3ac97029fea4c5e075724587910
Reviewed-on: http://gerrit.cloudera.org:8080/19165
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-16 10:14:49 +00:00
Michael Smith
d1d4f183da IMPALA-11704: Delay hdfsOpenFile with data cache
Delays hdfsOpenFile until after data cache lookup if using a data cache.
IMPALA-10147 implemented this, but only when using the file handle
cache. This patch adds an additional check in case file handle caching
is disabled.

In networked environments, hdfsOpenFile can take significant time, as
observed in a TPC-DS run of q90 where TotalRawHdfsOpenFileTime
represented a majority of time spent for HDFS_SCAN_NODE. This patch
brings that time to 0 with a primed data cache.

Change-Id: I9429a41fb16de27ccb57730203f95559df0dbfb6
Reviewed-on: http://gerrit.cloudera.org:8080/19204
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-11-06 00:23:08 +00:00
Michael Smith
eed92b223f IMPALA-7092: Restore tests after HDFS fixes
Restore EC tests that were disabled until HDFS-13539 and HDFS-13540 were
fixed, as the fixes are available in the current version of Hadoop we
test.

Testing: ran these tests with EC enabled.

Change-Id: I8b0bbc604601e6fab742f145c1adfb3c47b3fb6e
Reviewed-on: http://gerrit.cloudera.org:8080/19159
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-11-04 22:19:21 +00:00
Michael Smith
a870a11e64 IMPALA-7098: Re-enable tests under EC
Re-enables tests under erasure coding, or provides more specific
exceptions.

Erasure coding uses multiple data blocks to construct a block group. Our
tests use RS-3-2-1024k, which includes 3 data blocks in a block group.
Each of these blocks is sized according to `dfs.block.size`, so block
groups by default hold up to 384MB of data.

Impala schedules work to executors based on blocks reported by HDFS,
which for EC actually represent block groups. So with default block
size, a file in EC has 1/3rd the number of schedulable blocks. In the
case of tpch.lineitem, this produces 2 parquet files instead of 3 and
reduces the number of executors scheduled to read parquet lineitem, as
follows:

1. lineitem.tbl is loaded via Hive. With EC it uses 2 block groups,
   without EC it uses 6 blocks.
2. parquet lineitem is created by select/insert from lineitem.tbl.
   Impala schedules reads to executors based on available blocks, so
   with EC this gets scheduled across 2 executors instead of 3 and each
   executor writes a separate parquet file.
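
The arithmetic behind this scheduling difference can be sketched in
Python (a hedged illustration; the function and its parameters are ours,
not Impala's scheduler code, and the ~720MB file size is an assumption
consistent with the 6-block/2-group split described above):

```python
import math

def schedulable_units(file_size_mb, block_size_mb=128, ec_data_blocks=3,
                      erasure_coded=False):
    """Number of schedulable units HDFS reports for a file.

    Without EC, each 128MB block is schedulable on its own. With
    RS-3-2-1024k, a block group spans ec_data_blocks * block_size_mb of
    data (3 * 128MB = 384MB), so a file exposes about 1/3rd the units.
    """
    unit_mb = block_size_mb * (ec_data_blocks if erasure_coded else 1)
    return math.ceil(file_size_mb / unit_mb)

# A ~720MB lineitem.tbl: 6 plain blocks, but only 2 EC block groups,
# so the scan is scheduled across 2 executors instead of 3.
print(schedulable_units(720))                      # 6
print(schedulable_units(720, erasure_coded=True))  # 2
```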

Change-Id: Ib452024993e35d5a8d2854c6b2085115b26e40df
Reviewed-on: http://gerrit.cloudera.org:8080/19172
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-11-04 22:13:50 +00:00
Michael Smith
19114c7205 IMPALA-11578: Exclude locality test for remote FS
Exclude test_scheduler_locality when the filesystem can only be remote.

Change-Id: Ie6198421f21bc2520773ecbb34ffaf65969ebc43
Reviewed-on: http://gerrit.cloudera.org:8080/18980
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-19 21:11:09 +00:00
Michael Smith
79e474d310 IMPALA-10213: Add test for local vs remote scheduling
Impala already supports locality-aware scheduling with Ozone because it
returns location data on partitions. That data doesn't include specific
storage ids in getStorageIds, so we skip a warning that will always
trigger on Ozone.

Updates Ozone to add implicit rules mapping localhost -> 127.0.0.1 for
local development. HDFS translates localhost to 127.0.0.1 for host names
in its location data, which Impala will identify as colocated with
executors in the dev environment. Ozone doesn't, and the default Impala
hostname is the machine hostname - not localhost - so without this
change all HDFS access in the minicluster is local but all Ozone access
is remote.
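
The colocation check described here can be sketched roughly in Python
(the function name and the explicit normalization rule are ours,
mirroring the implicit localhost -> 127.0.0.1 mapping this patch adds
for Ozone, not Impala's actual scheduler code):

```python
def is_local_read(block_hosts, executor_host):
    """A scan is local when the executor's address matches a host the
    filesystem reports for the block. HDFS already reports 'localhost'
    as '127.0.0.1'; Ozone reports the machine hostname, so without an
    equivalent mapping every dev-minicluster Ozone read looks remote."""
    def normalize(host):
        return "127.0.0.1" if host == "localhost" else host
    return normalize(executor_host) in {normalize(h) for h in block_hosts}

print(is_local_read(["127.0.0.1"], "localhost"))           # True: local
print(is_local_read(["my-dev-box.example"], "localhost"))  # False: remote
```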

Adds a test to verify local vs remote assignment by using custom
clusters with hostnames that either do or don't match storage hostnames.

Change-Id: I4e5606528404c3d4fd164c03dec8315345be5f6d
Reviewed-on: http://gerrit.cloudera.org:8080/18841
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-09-07 17:13:04 +00:00
Michael Smith
cf7490ccbd IMPALA-11464: (Addendum) Skip tests in Ozone
Updates the skip for new recursive listing tests to match the comment so
that they're only run on HDFS. The previous skip only roughly matched
the set of all non-HDFS filesystems, and didn't automatically include
new filesystems.

Change-Id: I80de83d506138b57a969258b2f6dcf112dd2e44d
Reviewed-on: http://gerrit.cloudera.org:8080/18934
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-09-01 09:53:16 +00:00
Michael Smith
1eb0510eaa IMPALA-11456: Collapse filesystem Skip logic
Combines all SkipIf* classes for different filesystems into a single
SkipIfFS class. Many cases are simplified to 'not IS_HDFS', with the
rest as filesystem-specific special cases. The 'jira' option is removed
in favor of specific flags for each issue.

Change-Id: Ib928a6274baaaec45614887b9e762346a25812a1
Reviewed-on: http://gerrit.cloudera.org:8080/18781
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-10 22:37:08 +00:00
Michael Smith
830625b104 IMPALA-9442: Add Ozone to minicluster
Adds Ozone as an alternative to hdfs in the minicluster. Select by
setting `export TARGET_FILESYSTEM=ozone`. With that flag,
run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot
because Ozone does not support HBase (HDDS-3589); snapshot loading
doesn't work yet primarily due to HDDS-5502.

Uses the o3fs interface because Ozone puts specific restrictions on
bucket names (no underscores, for instance), and it was a lot easier to
use an interface where everything is written to a single bucket than to
update all Impala's use of HDFS-style paths to make `test-warehouse` a
bucket inside a volume.

Specifies reduced Ozone client retries during shutdown where Ozone may
not be available.

Passes tests with FE_TEST=false BE_TEST=false.

Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be
Reviewed-on: http://gerrit.cloudera.org:8080/18738
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2022-08-03 16:58:20 +00:00
Gergely Fürnstáhl
8965059c2c IMPALA-11358: Fixed Kudu table's missing comment
If Kudu-HMS integration is enabled, Kudu creates the table in HMS too,
which was missing the comment field. Added the code to forward the
comment field to Kudu during creation.

Testing:

Added a test to verify the comment is present when the integration is
enabled.
Re-enabled several Kudu tests, as IMPALA-8751 (and follow-ups) fixed the
Hive 3 notification incompatibility.

Change-Id: Idf66f8b4679b00da6693a27fed79b04e8f6afb55
Reviewed-on: http://gerrit.cloudera.org:8080/18627
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-06-20 19:38:33 +00:00
Qifan Chen
07a3e6e0df IMPALA-10992: Planner changes for estimated peak memory
This patch provides replan support for multiple executor group sets.
Each executor group set is associated with a distinct number of nodes
and a threshold for estimated memory per host in bytes that can be
denoted as [<group_name_prefix>:<#nodes>, <threshold>].

In the patch, a query of type EXPLAIN, QUERY or DML can be compiled
more than once. In each attempt, per host memory is estimated and
compared with the threshold of an executor group set. If the estimated
memory is no more than the threshold, the iteration process terminates
and the final plan is determined. The executor group set with the
threshold is selected to run the query.
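
The iterative selection loop can be sketched as follows (a simplified
Python illustration; the function name is ours, and the two group sets
come from the artificial test setup described later in this message,
not from production defaults):

```python
# [<group_name_prefix>:<#nodes>, <threshold>] pairs, ascending by threshold.
GROUP_SETS = [("regular", 3, 64 * 2**20),   # 64MB
              ("large",   3, 8 * 2**50)]    # 8PB

def pick_group_set(estimate_per_host_bytes):
    """Compile against each executor group set in turn, stopping at the
    first one whose memory threshold covers the per-host estimate."""
    for name, num_nodes, threshold in GROUP_SETS:
        estimated = estimate_per_host_bytes(num_nodes)
        if estimated <= threshold:
            return name
    return GROUP_SETS[-1][0]  # fall back to the largest group set

# A query estimated at 408MB per host exceeds the 64MB threshold, so it
# takes a second compilation and runs on the 'large' set.
print(pick_group_set(lambda n: 408 * 2**20))  # large
```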

A new query option 'enable_replan', defaulting to 1 (enabled), is added.
It can be set to 0 to disable this patch and to generate the distributed
plan for the default executor group.

To avoid long compilation times, the following enhancements are enabled.
Note that 1) can be disabled when a relevant metadata change is
detected.

 1. Authorization is performed only for the 1st compilation;
 2. openTransaction() is called for transactional queries in 1st
    compilation and the saved transactional info is used in
    subsequent compilations. Similar logic is applied to Kudu
    transactional queries.

To facilitate testing, the patch imposes an artificial two executor
group setup in FE as follows.

 1. [regular:<#nodes>, 64MB]
 2. [large:<#nodes>, 8PB]

This setup is enabled when a new query option 'test_replan' is set
to 1 in backend tests, or RuntimeEnv.INSTANCE.isTestEnv() is true as
in most frontend tests. This query option is set to 0 by default.

Compilation time increases when a query is compiled in several
iterations, as shown below for several TPC-DS queries. The increase
is mostly due to redundant work in either the single-node plan creation
or the value-transfer-graph recomputation phase. For small queries, the
increase can be avoided if they can be compiled in single iteration
by properly setting the smallest threshold among all executor group
sets. For example, for the set of queries listed below, the smallest
threshold can be set to 320MB to catch both q15 and q21 in one
compilation.

                              Compilation time (ms)
 Queries  Estimated Memory   2-iterations  1-iteration  % increase
 q1           408MB              60.14         25.75      133.56%
 q11         1.37GB             261.00        109.61      138.11%
 q10a         519MB             139.24         54.52      155.39%
 q13          339MB             143.82         60.08      139.38%
 q14a        3.56GB             762.68        312.92      143.73%
 q14b        2.20GB             522.01        245.13      112.95%
 q15          314MB               9.73          4.28      127.33%
 q21          275MB              16.00          8.18       95.59%
 q23a        1.50GB             461.69        231.78       99.19%
 q23b        1.34GB             461.31        219.61      110.05%
 q4          2.60GB             218.05        105.07      107.52%
 q67         5.16GB             694.59        334.24      101.82%

Testing:
 1. Almost all FE and BE tests are now run in the artificial two
    executor setup except a few where a specific cluster configuration
    is desirable;
 2. Ran core tests successfully;
 3. Added a new observability test and a new query assignment test;
 4. Disabled concurrent insert test (test_concurrent_inserts) and
    failing inserts (test_failing_inserts) test in local catalog mode
    due to flakiness. Reported both in IMPALA-11189 and IMPALA-11191.

Change-Id: I75cf17290be2c64fd4b732a5505bdac31869712a
Reviewed-on: http://gerrit.cloudera.org:8080/18178
Reviewed-by: Qifan Chen <qchen@cloudera.com>
Tested-by: Qifan Chen <qchen@cloudera.com>
2022-03-21 20:17:28 +00:00
Fucun Chu
157086cb80 IMPALA-10771: Add Tencent COS support
This patch adds support for COS (Cloud Object Storage). Using the
hadoop-cos connector, the implementation is similar to that of other
remote FileSystems.

New flags for COS:
- num_cos_io_threads: Number of COS I/O threads. Defaults to 16.

Follow-up:
- Support for caching COS file handles will be addressed in
   IMPALA-10772.
- test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   COS (IMPALA-10773).

Tests:
 - Upload hdfs test data to a COS bucket. Modify all locations in HMS
   DB to point to the COS bucket. Remove some hdfs caching params.
   Run CORE tests.

Change-Id: Idce135a7591d1b4c74425e365525be3086a39821
Reviewed-on: http://gerrit.cloudera.org:8080/17503
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-12-08 16:32:02 +00:00
Vihang Karajgaonkar
5a9dcd108d IMPALA-8795: Turn on events processing by default
This commit turns on events processing by default. The default
polling interval is 1 second, which can be overridden by setting
hms_event_polling_interval_s to a non-default value.

With event polling turned on by default, this patch also moves
test_event_processing.py to tests/metadata instead of the custom
cluster tests. Some tests within test_event_processing.py that
needed non-default configurations were moved to
tests/custom_cluster/test_events_custom_configs.py.

Additionally, some other tests were modified to take into account
the automatic ability of Impala to detect newly added tables
from Hive.

Testing done:
1. Ran exhaustive tests by turning on the events processing multiple
times.
2. Ran exhaustive tests by disabling events processing.
3. Ran dockerized tests.

Change-Id: I9a8b1871a98b913d0ad8bb26a104a296b6a06122
Reviewed-on: http://gerrit.cloudera.org:8080/17612
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
2021-08-09 17:22:31 +00:00
Qifan Chen
fdb6a4e264 IMPALA-10532: TestOverlapMinMaxFilters.test_overlap_min_max_filters seems flaky
This change disables the overlap min/max filter test for hdfs in
erasure coding, due to the query plan change (from 3-node scan to
2-node scan) which splits the row groups among scan nodes differently.

The SkipIfEC class in test harness skip.py is enhanced with a new
skip reason 'different_scan_split' to facilitate this action.

Testing:
  1. Ran unit tests;
  2. Ran core tests.

Change-Id: I527de530f7db1ce959e7ef2ae3ced18677221c9f
Reviewed-on: http://gerrit.cloudera.org:8080/17289
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-04-15 09:44:10 +00:00
stiga-huang
2dfc68d852 IMPALA-7712: Support Google Cloud Storage
This patch adds support for GCS (Google Cloud Storage). Using the
gcs-connector, the implementation is similar to that of other remote
FileSystems.

New flags for GCS:
 - num_gcs_io_threads: Number of GCS I/O threads. Defaults to 16.

Follow-up:
 - Support for spilling to GCS will be addressed in IMPALA-10561.
 - Support for caching GCS file handles will be addressed in
   IMPALA-10568.
 - test_concurrent_inserts and test_failing_inserts in
   test_acid_stress.py are skipped due to slow file listing on
   GCS (IMPALA-10562).
 - Some tests are skipped due to issues introduced by /etc/hosts setting
   on GCE instances (IMPALA-10563).

Tests:
 - Compile and create hdfs test data on a GCE instance. Upload test data
   to a GCS bucket. Modify all locations in HMS DB to point to the GCS
   bucket. Remove some hdfs caching params. Run CORE tests.
 - Compile and load snapshot data to a GCS bucket. Run CORE tests.

Change-Id: Ia91ec956de3b620cccf6a1244b56b7da7a45b32b
Reviewed-on: http://gerrit.cloudera.org:8080/17121
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-13 11:20:08 +00:00
Joe McDonnell
35bae939ab IMPALA-10427: Remove SkipIfS3.eventually_consistent pytest marker
These tests were disabled due to S3's eventually consistent
behavior. Now that S3 is strongly consistent, these tests do
not need to be disabled.

Testing:
 - Ran s3 core job

Change-Id: Ie9041f530bf3a818f8954b31a3d01d9f6753d7d4
Reviewed-on: http://gerrit.cloudera.org:8080/16931
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-07 23:53:56 +00:00
Tim Armstrong
62c19e6339 IMPALA-10366: skip test_runtime_profile_aggregated for EC
The schedule for erasure coded data results in 3 instead
of 4 instances of the fragment with the scan. Skip the
test - we don't need special coverage for erasure coding.

Change-Id: I2bb47d89f6d6c59242f2632c481f26d93e28e33e
Reviewed-on: http://gerrit.cloudera.org:8080/16799
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-01 16:44:18 +00:00
Qifan Chen
6493f87357 IMPALA-10334: test_stats_extrapolation output doesn't match on erasure coding build
This patch skips test_stats_extrapolation for erasure code builds. The
reason is that an extra erasure code information line can be included
in the scan explain section when an HDFS table is erasure coded. This
makes the explain output different between a normal build and an
erasure code build. A new reason 'contain_full_explain' is added to
SkipIfEC to facilitate this.

Testing:
  Ran erasure coding version of the EE and CLUSTER tests.
  Ran core tests

Change-Id: I16c11aa0a1ec2d4569c272d2454915041039f950
Reviewed-on: http://gerrit.cloudera.org:8080/16756
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-23 20:36:58 +00:00
Vihang Karajgaonkar
c9ccb61acb IMPALA-10286: Disable metadata.test_catalogd_debug_actions on S3
This patch disables metadata/test_catalogd_debug_actions
test on S3 builds due to its flakiness. The root cause of this
seems to be that listing time on S3 is variable and the test
becomes flaky because it measures the time taken by the refresh
command after a certain debug action is set.

Testing:
1. Ran the test on my local environment to make sure it
compiles fine.

Change-Id: I30bd10de468ad449c4a143a65cdcba97d9f0cd78
Reviewed-on: http://gerrit.cloudera.org:8080/16745
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-19 01:54:35 +00:00
Zoltan Borok-Nagy
981ef10465 IMPALA-10215: Implement INSERT INTO for non-partitioned Iceberg tables (Parquet)
This commit adds support for INSERT INTO statements against Iceberg
tables when the table is non-partitioned and the underlying file format
is Parquet.

We still use Impala's HdfsParquetTableWriter to write the data files,
though they needed some modifications to conform to the Iceberg spec,
namely:
 * write Iceberg/Parquet 'field_id' for the columns
 * TIMESTAMPs are encoded as INT64 micros (without time zone)

We use DmlExecState to transfer information from the table sink
operators to the coordinator, then updateCatalog() invokes the
AppendFiles API to add files atomically. DmlExecState is encoded in
protobuf, communication with the Frontend uses Thrift. Therefore to
avoid defining Iceberg DataFile multiple times they are stored in
FlatBuffers.

The commit also does some corrections on Impala type <-> Iceberg type
mapping:
 * Impala TIMESTAMP is Iceberg TIMESTAMP (without time zone)
 * Impala CHAR is Iceberg FIXED

Testing:
 * Added INSERT tests to iceberg-insert.test
 * Added negative tests to iceberg-negative.test
 * I also did some manual testing with Spark. Spark is able to read
   Iceberg tables written by Impala unless we use TIMESTAMPs. In that
   case Spark rejects the data files because it only accepts TIMESTAMPs
   with time zone.
 * Added concurrent INSERT tests to test_insert_stress.py

Change-Id: I5690fb6c2cc51f0033fa26caf8597c80a11bcd8e
Reviewed-on: http://gerrit.cloudera.org:8080/16545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-26 20:01:09 +00:00
stiga-huang
b02fad2db4 IMPALA-7538: Support HDFS caching with LocalCatalog
This patch adds support for HDFS caching in LocalCatalog coordinators.
Catalogd propagates HdfsCachePools the same way as in catalog-v1.
They are cached in LocalCatalog coordinators as in v1 and are not
"fetch-on-demand", since only cache pool names are cached.

The isMarkedCached markers of HdfsTable and HdfsPartition are also
propagated to the LocalCatalog coordinators for correctly handling
ShowTableStats and ShowPartitions statements with caching information.

Tests:
 - Revive hdfs caching related tests in metadata/test_ddl.py and
   query_test/test_hdfs_caching.py for LocalCatalog.

Change-Id: I661f7b76a9575f6f5b3fa2c6feebda1a5d7c3712
Reviewed-on: http://gerrit.cloudera.org:8080/16058
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 22:03:24 +00:00
Tim Armstrong
6ec6aaae8e IMPALA-3695: Remove KUDU_IS_SUPPORTED
Testing:
Ran exhaustive tests.

Change-Id: I059d7a42798c38b570f25283663c284f2fcee517
Reviewed-on: http://gerrit.cloudera.org:8080/16085
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-18 01:11:18 +00:00
Joe McDonnell
3e76da9f51 IMPALA-9708: Remove Sentry support
Impala 4 decided to drop Sentry support in favor of Ranger. This
removes Sentry support and related tests. It retires startup
flags related to Sentry and does the first round of removing
obsolete code. This does not adjust documentation to remove
references to Sentry, and other dead code will be removed
separately.

Some issues came up when implementing this. Here is a summary
of how this patch resolves them:
1. authorization_provider currently defaults to "sentry", but
   "ranger" requires extra parameters to be set. This changes the
   default value of authorization_provider to "", which translates
   internally to the noop policy that does no authorization.
2. These flags are Sentry specific and are now retired:
 - authorization_policy_provider_class
 - sentry_catalog_polling_frequency_s
 - sentry_config
3. The authorization_factory_class may be obsolete now that
   there is only one authorization policy, but this leaves it
   in place.
4. Sentry is the last component using CDH_COMPONENTS_HOME, so
   that is removed. There are still Maven dependencies coming
   from the CDH_BUILD_NUMBER repository, so that is not removed.
5. To make the transition easier, testdata/bin/kill-sentry-service.sh
   is not removed and it is still called from testdata/bin/kill-all.sh.

Testing:
 - Core job passes

Change-Id: I8e99c15936d6d250cf258e3a1dcba11d3eb4661e
Reviewed-on: http://gerrit.cloudera.org:8080/15833
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-05-20 17:43:40 +00:00
Zoltan Borok-Nagy
8aa0652871 IMPALA-9484: Full ACID Milestone 1: properly scan files that have full ACID schema
Full ACID row format looks like this:

{
  "operation": 0,
  "originalTransaction": 1,
  "bucket": 536870912,
  "rowId": 0,
  "currentTransaction": 1,
  "row": {"i": 1}
}

User columns are nested under "row". In the frontend we need to create
slot descriptors that correspond to the file schema. In the catalog we
could mimic the file schema but that would introduce several
complexities and corner cases in column resolution. Also in query
results the heading of the above user column would be "row.i". Star
expansion should also be modified, etc.

Because of that, in the catalog we create the inverse of the above
schema:

{
  "row__id":
  {
    "operation": 0,
    "originalTransaction": 1,
    "bucket": 536870912,
    "rowId": 0,
    "currentTransaction": 1
  }
  "i": 1
}

This way very little modification is needed in the frontend. And the
hidden columns can be easily retrieved via 'SELECT row__id.*' when we
need those for debugging/testing.
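
The inversion between the file schema and the catalog schema can be
illustrated with a small sketch (the field names come from the examples
above; the function itself is ours, not Impala's frontend code):

```python
ACID_COLS = ("operation", "originalTransaction", "bucket",
             "rowId", "currentTransaction")

def invert_full_acid_row(file_row):
    """Turn a file-schema row (user columns nested under 'row') into the
    catalog-schema shape: ACID metadata tucked under a hidden 'row__id'
    struct and user columns hoisted to the top level."""
    catalog_row = {"row__id": {c: file_row[c] for c in ACID_COLS}}
    catalog_row.update(file_row["row"])
    return catalog_row

file_row = {"operation": 0, "originalTransaction": 1, "bucket": 536870912,
            "rowId": 0, "currentTransaction": 1, "row": {"i": 1}}
assert invert_full_acid_row(file_row) == {
    "row__id": {"operation": 0, "originalTransaction": 1,
                "bucket": 536870912, "rowId": 0, "currentTransaction": 1},
    "i": 1,
}
```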

We only need to change Path.getAbsolutePath() to return a schema path
that corresponds to the file schema. Also in the backend we need some
extra juggling in OrcSchemaResolver::ResolveColumn() to retrieve the
table schema path from the file schema path.

Testing:
I changed data loading to load ORC files in full ACID format by default.
With this change we should be able to scan full ACID tables that are
not minor-compacted, don't have deleted rows, and don't have original
files.

Newly added Tests:
 * specific queries about hidden columns (full-acid-rowid.test)
 * SHOW CREATE TABLE (show-create-table-full-acid.test)
 * DESCRIBE [FORMATTED] TABLE (describe-path.test)
 * INSERT should be forbidden (acid-negative.test)
 * added tests for column masking (
   ranger_column_masking_complex_types.test)

Change-Id: Ic2e2afec00c9a5cf87f1d61b5fe52b0085844bcb
Reviewed-on: http://gerrit.cloudera.org:8080/15395
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-04-02 12:01:41 +00:00
Fang-Yu Rao
01684ab3aa IMPALA-9191 (part 1): Allow Impala to run tests without Sentry
This patch adds an environment variable DISABLE_SENTRY to allow Impala
to run tests without Sentry. Specifically, we start up Sentry only when
$DISABLE_SENTRY does not evaluate to true. The corresponding Sentry FE
and E2E tests will also be skipped if $DISABLE_SENTRY is true.

Moreover, in this patch we will set DISABLE_SENTRY to true if
$USE_CDP_HIVE evaluates to true, allowing one to only test Impala's
authorization with Ranger when support for Sentry is dropped after we
switch to the CDP Hive.

Note that in this patch we also change the way we generate
hive-site.xml when $DISABLE_SENTRY is true. To be more precise, when
generating hive-site.xml, we do not add the Sentry server as a metastore
event listener if $DISABLE_SENTRY is true. Recall that both CDH Hive and
CDP Hive would make an RPC to the registered listeners every time after
the method of create_database_core() in HiveMetaStore.java is called,
which happens when Hive instead of Impala is used to create a database,
e.g., when some databases in the TPC-DS data set are created during the
execution of create-load-data.sh. Thus the removal of Sentry as an event
listener is necessary when $DISABLE_SENTRY is true in that it prevents
the HiveMetaStore from repeatedly trying to connect to a Sentry server
that is not online, which could make create-load-data.sh time out.

Testing:
Except for two currently known issues, IMPALA-9513 and IMPALA-9451,
verified this patch passes the exhaustive tests in the DEBUG build
- when $USE_CDP_HIVE is false, and
- when $USE_CDP_HIVE is true.

Change-Id: Ifa3f1840a77a7b32310a5c8b78a2c26300ccb41e
Reviewed-on: http://gerrit.cloudera.org:8080/15505
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-21 20:14:33 +00:00