This change updates the way column names are
projected in the SQL query generated for JDBC
external tables. Instead of relying on optional
mapping or default behavior, all column names are now
explicitly quoted using appropriate quote characters.
Column names are now wrapped with quote characters
based on the JDBC driver being used:
1. Backticks (`) for Hive, Impala and MySQL
2. Double quotes (") for all other databases
This improves support for case-sensitive or
reserved column names.
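The quoting rule above can be sketched minimally as follows (the driver-name strings and helper names are illustrative assumptions, not Impala's actual constants or code):

```python
# Hypothetical sketch of per-driver identifier quoting; the driver-name
# strings below are assumptions, not Impala's actual constants.
BACKTICK_DRIVERS = {"HIVE", "IMPALA", "MYSQL"}

def quote_column(name, driver):
    """Wrap a column name in the quote character for the given driver."""
    quote = "`" if driver.upper() in BACKTICK_DRIVERS else '"'
    # Escape any embedded quote characters by doubling them.
    return quote + name.replace(quote, quote * 2) + quote

def project_columns(columns, driver):
    """Build the SELECT list with every column explicitly quoted."""
    return ", ".join(quote_column(c, driver) for c in columns)
```

Explicit quoting makes reserved words like `order` and mixed-case names usable as column names regardless of the target database.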
Change-Id: I5da5bc7ea5df8f094b7e2877a0ebf35662f93805
Reviewed-on: http://gerrit.cloudera.org:8080/23066
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
This adds more tests in test_catalogd_ha.py for warm failover.
Refactored _test_metadata_after_failover to run in the following way:
- Run DDL/DML in the active catalogd.
- Kill the active catalogd and wait until the failover finishes.
- Verify the DDL/DML results in the new active catalogd.
- Restart the killed catalogd
It accepts two methods as parameters to perform the DDL/DML and the
verification. In the last step, the killed catalogd is restarted so we
keep having 2 catalogds and can merge these tests into a single test by
invoking _test_metadata_after_failover for different method pairs. This
saves some test time.
The following DDL/DML statements are tested:
- CreateTable
- AddPartition
- REFRESH
- DropPartition
- INSERT
- DropTable
After each failover, the table is verified to be warmed up (i.e. loaded).
Also validates startup flags to make sure enable_insert_events and
enable_reload_events are both set to true when warm failover is enabled,
i.e. --catalogd_ha_reset_metadata_on_failover=false.
Change-Id: I6b20adeb0bd175592b425e521138c41196347600
Reviewed-on: http://gerrit.cloudera.org:8080/23206
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
This patch improves the availability of catalogd under huge INVALIDATE
METADATA operations. Previously, CatalogServiceCatalog.reset() held
versionLock_.writeLock() for the whole reset duration. When the number
of databases, tables, or functions is large, this write lock can be held
for a long time, preventing any other catalog operation from proceeding.
This patch improves the situation by:
1. Making CatalogServiceCatalog.reset() rebuild dbCache_ in place and
occasionally release the write lock between rebuild stages.
2. Fetching database, table, and function metadata from the MetaStore in
the background using an ExecutorService. Added a catalog_reset_max_threads
flag to control the number of threads used for the parallel fetch.
In order to do so, lexicographic order must be enforced during reset(),
and all Db invalidations within a single stage must be complete before
the write lock is released. Stages should run in approximately the same
amount of time. A catalog operation over a database must ensure that no
reset operation is currently running, or that the database name is
lexicographically less than the current database-under-invalidation.
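The invariant above can be illustrated with a simplified sketch (names are hypothetical; the real logic lives in CatalogServiceCatalog and CatalogResetManager):

```python
# Simplified illustration of the staged-reset invariant: an operation on
# database `db_name` may proceed without waiting only if no reset is
# running, or the name sorts strictly before the database currently being
# invalidated. All names here are hypothetical.
def can_proceed_without_wait(db_name, reset_running, db_under_invalidation):
    if not reset_running:
        return True
    # Databases are invalidated in lexicographic order, so anything
    # strictly before the current one has already been rebuilt.
    return db_name < db_under_invalidation
```

Enforcing lexicographic order is what makes this check possible: a name before the database-under-invalidation is guaranteed to be in an already-rebuilt stage.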
This patch adds CatalogResetManager to do the background metadata
fetching and provide helper methods that facilitate waiting for reset
progress. CatalogServiceCatalog must hold versionLock_.writeLock()
before calling most CatalogResetManager methods.
These are the methods in the CatalogServiceCatalog class that must wait
for CatalogResetManager.waitOngoingMetadataFetch():
addDb()
addFunction()
addIncompleteTable()
addTable()
invalidateTableIfExists()
removeDb()
removeFunction()
removeTable()
renameTable()
replaceTableIfUnchanged()
tryLock()
updateDb()
InvalidateAwareDbSnapshotIterator.hasNext()
A concurrent global IM must wait until the currently running global IM
completes. The waiting happens by calling waitFullMetadataFetch().
CatalogServiceCatalog.getAllDbs() gets a snapshot of the dbCache_ values
at a point in time. With this patch, it is now possible that some Db in
this snapshot may be removed from dbCache_ by a concurrent reset().
Callers that care about snapshot integrity, like
CatalogServiceCatalog.getCatalogDelta(), should be careful when
iterating the snapshot. They must iterate in lexicographic order,
similar to reset(), and make sure they do not go beyond the current
database-under-invalidation. They also must skip the Db currently being
inspected if Db.isRemoved() returns true.
Added the helper class InvalidateAwareDbSnapshot for this kind of
iteration. Override CatalogServiceCatalog.getDb() and
CatalogServiceCatalog.getDbs() to wait until the first metadata reset
completes or the looked-up Db is found in the cache.
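A simplified sketch of this snapshot iteration (function and parameter names are hypothetical; the real iteration, including waiting for reset progress, is in InvalidateAwareDbSnapshot):

```python
# Hypothetical sketch of safely iterating a dbCache_ snapshot during a
# concurrent reset(): iterate in lexicographic order, stop at the
# database currently under invalidation, and skip Dbs already removed.
# The real code waits for the reset to advance instead of stopping.
def iterate_snapshot(snapshot_dbs, db_under_invalidation, is_removed):
    for db in sorted(snapshot_dbs):
        if db_under_invalidation is not None and db >= db_under_invalidation:
            break  # Do not go beyond the database-under-invalidation.
        if is_removed(db):
            continue  # Skip Dbs removed by the concurrent reset().
        yield db
```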
Expand test_restart_catalogd_twice to test_restart_legacy_catalogd_twice
and test_restart_local_catalogd_twice. Update
CustomClusterTestSuite.wait_for_wm_init_complete() to correctly pass
timeout values to helper methods that it calls. Reduce cluster_size from
10 to 3 in a few tests of test_workload_mgmt_init.py to avoid flakiness.
Fixed HMS connection leak between tests in AuthorizationStmtTest (see
IMPALA-8073).
Testing:
- Pass exhaustive tests.
Change-Id: Ib4ae2154612746b34484391c5950e74b61f85c9d
Reviewed-on: http://gerrit.cloudera.org:8080/22640
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.
All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.
The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 on why workload affected
exploration_strategy. An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and exhaustive
runs before this change.
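The effect described above can be sketched as a hypothetical simplification of the workload-dependent logic referenced in IMPALA-3947 (the function below is an illustration, not the actual implementation):

```python
# Hypothetical simplification of how a test's workload historically
# affected its exploration strategy (see IMPALA-3947): non-default
# workloads downgraded 'exhaustive' runs to 'core'.
def exploration_strategy(workload, requested):
    if requested == "exhaustive" and workload != "functional-query":
        return "core"
    return requested
```

Under this model, a custom cluster test that returned 'tpch' never actually ran in exhaustive mode, which is why switching the default to 'functional-query' changes behavior.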
get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope in this patch.
Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is an enhancement request to support JDBC tables
created by Hive JDBC Storage handler. This is essentially
done by making JDBC table properties compatible with
Impala. The properties are translated when the table is loaded, and the
translation is maintained only in the Impala cluster, i.e. it is not
written back to HMS.
Impala includes JDBC drivers for PostgreSQL and MySQL
making 'driver.url' not mandatory in such cases. The
Impala JDBC driver is still required for Impala-to-Impala
JDBC connections. Additionally, Hive allows adding database
driver JARs at runtime via Beeline, enabling users to
dynamically include JDBC driver JARs. However, Impala does
not support adding database driver JARs at runtime,
making the driver.url field still useful
in cases where additional drivers are needed.
'hive.sql.query' property is not handled in this patch.
It'll be covered in a separate jira.
Testing: End-to-end tests are included in
test_ext_data_sources.py.
Change-Id: I1674b93a02f43df8c1a449cdc54053cc80d9c458
Reviewed-on: http://gerrit.cloudera.org:8080/22134
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test failed with a "Data source does not exist" error due
to name conflicts with pre-existing DataSource objects.
To resolve this, each data source name is made unique for
each concurrently running test dimension. The fix ensures
that the test runs smoothly without encountering errors
related to conflicting data source names.
Testing: To test this change, it needs to be built with the
-ubsan flag, after which a bash script is triggered to set
some environment variables, followed by the './bin/run-all-tests.sh'
command to make sure all tests are run. Some important
environment variables of the bash script include:
1. EXPLORATION_STRATEGY set to exhaustive to ensure all
possible scenarios are covered.
2. The specific test file to run is query_test/
test_ext_data_sources.py::TestExtDataSources
::test_data_source_tables and custom_cluster/
test_ext_data_sources.py, while frontend (FE), backend (BE),
and cluster tests are disabled. End-to-end tests are enabled
(EE_TEST=true), with iteration and failure limits also
specified.
Change-Id: I29822855da8136e013c8a62bb0489a181bf131ae
Reviewed-on: http://gerrit.cloudera.org:8080/21815
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In some deployment environments, JDBC tables are set as transactional
tables by default. This causes catalogd to fail to load the metadata for
JDBC tables. This patch explicitly adds the table property
"transactional=false" for JDBC tables to avoid them being set as
transactional tables.
Operations on JDBC tables are processed only on the coordinator. The
number of processed rows should be estimated as 0 for DataSourceScanNode
by the planner so that coordinator-only query plans are generated for
simple queries on JDBC tables and the queries can be executed without
involving executor nodes. Also adds a Preconditions check to make sure
numNodes equals 1 for DataSourceScanNode.
Updates FileSystemUtil.copyFileFromUriToLocal() function to write log
message for all types of exceptions.
Testing:
- Fixed planner tests for data source tables.
- Ran end-to-end tests of JDBC tables with query option
'exec_single_node_rows_threshold' as default value 100.
- Passed core-tests.
Change-Id: I556faeda923a4a11d4bef8c1250c9616f77e6fa6
Reviewed-on: http://gerrit.cloudera.org:8080/21141
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
TestExtDataSources::test_catalogd_ha_failover failed to delete the data
source object after the catalog service failed over to the standby
catalogd. Log messages showed that the coordinator tried to submit the
DDL request to the original active catalogd since it had not yet
received the failover notification from the statestored.
To fix the flaky test, wait until the coordinator receives the failover
notification from the statestored before executing the DDL request to
drop the data source.
Testing:
- Looped the test for more than a hundred runs without failure.
Change-Id: Ia6225271357740c055c25fdd349f1dc9162c2f53
Reviewed-on: http://gerrit.cloudera.org:8080/21078
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
for external data source tables
This patch adds support for the DATE datatype in predicates
for external data sources.
Testing:
- Added tests for date predicates with operators:
'=', '>', '<', '>=', '<=', '!=', 'BETWEEN'.
Change-Id: Ibf13cbefaad812a0f78755c5791d82b24a3395e4
Reviewed-on: http://gerrit.cloudera.org:8080/20915
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A previous patch for IMPALA-12642 added support for query options on
Impala external JDBC tables. It added one unit test to verify the query
option ENABLED_RUNTIME_FILTER_TYPES by checking the queries in the
Queries Web page. The test failed on Ubuntu 18.04 since the value of the
query option is shown as a single-quoted string, instead of a
double-quoted string. This patch fixes the error.
Testing:
- Ran tests/custom_cluster/test_ext_data_sources.py on Ubuntu 18.04,
and Ubuntu 20.04.
Change-Id: I996c8fac038132f2b132d5e6ac36aca1dff59d72
Reviewed-on: http://gerrit.cloudera.org:8080/20978
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: gaurav singh <gsingh@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
environment setup fails.
This patch modifies the mysql tests to be marked as xfailed
if the mysql environment fails to set up successfully.
Change-Id: Ib7829aed09d25ff3e636004f3d1f32ecc6f37299
Reviewed-on: http://gerrit.cloudera.org:8080/20975
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
This patch uses JDBC connection string to apply query options to the
Impala server by setting the properties in "jdbc.properties" when
creating JDBC external DataSource table.
jdbc.properties is specified as a comma-delimited key=value string, like
"MEM_LIMIT=1000000000, ENABLED_RUNTIME_FILTER_TYPES=\"BLOOM,MIN_MAX\"".
Fixed Impala to allow the value of ENABLED_RUNTIME_FILTER_TYPES to have
double quotes at the beginning and end of the string.
jdbc.properties can be used for other databases like Postgres and MySQL
to set additional properties. The test cases will be added in a
separate patch.
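A minimal sketch of parsing such a comma-delimited key=value string while keeping a double-quoted value like "BLOOM,MIN_MAX" intact (this parser is an illustration of the format described above, not Impala's actual implementation):

```python
import re

# Illustrative parser for strings like
#   'MEM_LIMIT=1000000000, ENABLED_RUNTIME_FILTER_TYPES="BLOOM,MIN_MAX"'
# Commas inside double-quoted values must not split the pairs.
def parse_jdbc_properties(s):
    # Split on commas that leave an even number of quotes to their right,
    # i.e. commas that are not inside a double-quoted value.
    pairs = re.split(r',\s*(?=(?:[^"]*"[^"]*")*[^"]*$)', s)
    props = {}
    for pair in pairs:
        if not pair.strip():
            continue
        key, _, value = pair.partition("=")
        props[key.strip()] = value.strip().strip('"')
    return props
```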
Testing:
- Added end-to-end tests for setting query options on Impala JDBC
tables.
- Passed core tests.
Change-Id: I47687b7a93e90cea8ebd5f3fc280c9135bd97992
Reviewed-on: http://gerrit.cloudera.org:8080/20837
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
DataSource objects are saved in an in-memory cache in the Catalog
server. They are not persisted to the HMS. The objects are lost after
the Catalog server is restarted, and users need to recreate DataSource
objects before creating new external DataSource tables.
This patch makes DataSource objects persistent by saving them
as DataConnector objects with type "impalaDataSource" in HMS.
Since HMS events for DataConnector are not handled, the Catalog server
has to refresh DataSource objects when the catalogd becomes active.
Note that this feature is not supported for Apache Hive 3.1 and older
versions.
Testing:
- Added two end-to-end unit tests with restarting of the Catalog
server, and catalogd HA failover.
These two tests are skipped when USE_APACHE_HIVE is set to true
and the Apache Hive version is 3.x or older.
- Passed all-build-options-ub2004.
- Passed core test.
Change-Id: I500a99142bb62ce873e693d573064ad4ffa153ab
Reviewed-on: http://gerrit.cloudera.org:8080/20768
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
This patch adds support to read Impala tables in the Impala cluster
through JDBC external data source. It also adds a new counter
NumExternalDataSourceGetNext in profile for the total number of calls
to ExternalDataSource::GetNext().
Setting query options for Impala will be supported in a follow-up patch.
Testing:
- Added an end-to-end unit test to read Impala tables from Impala
cluster through JDBC external data source.
Manually ran the unit-test with Impala tables in Impala cluster on a
remote host by setting $INTERNAL_LISTEN_HOST in jdbc.url as the ip
address of the remote host on which an Impala cluster is running.
- Added LDAP test for reading table through JDBC external data source
with LDAP authentication.
Manually ran the unit-test with Impala tables in a remote Impala
cluster.
- Passed core tests.
Change-Id: I79ad3273932b658cb85c9c17cc834fa1b5fbd64f
Reviewed-on: http://gerrit.cloudera.org:8080/20731
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
This patch adds a check for the existence of the mysqld.sock
file in the directory /var/run/mysqld/ inside the mysqld
docker container. If the file is not present, the
test is skipped.
Testing: tested manually with and without the mysqld.sock
file.
Change-Id: I393fd03fa6efd4c11781d219f66978a4f556c668
Reviewed-on: http://gerrit.cloudera.org:8080/20780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
tables for MySQL
This patch adds MySQL tests for the "external data source"
mechanism in Impala that implements data sources for querying over JDBC.
This patch also fixes the handling of case-sensitive table and
column names in MySQL queries.
Testing:
- Added a unit test for mysql and ran it with the JDBC
driver mysql-connector-j-8.1.0.jar. This test requires
adding docker to the sudoers group. Also, the test is
only run in 'exhaustive' mode.
Change-Id: I446ec3d4ebaf53c8edac0b2d181514bde587dfae
Reviewed-on: http://gerrit.cloudera.org:8080/20710
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
DataSourceScanNode does not handle eos properly in
DataSourceScanNode::GetNext(). Rows returned from the
external data source could be dropped if data_source_batch_size
is set to a value greater than the default of 1024.
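The required behavior can be illustrated with a small sketch (highly simplified; the real code is C++ in DataSourceScanNode::GetNext()): when the source's batch is larger than the consumer's row-batch capacity, the excess rows must be buffered and flushed after eos rather than dropped.

```python
# Simplified illustration of draining an external data source whose
# batches may exceed the consumer's capacity; names are hypothetical.
def drain_source(batches, capacity):
    """Yield rows in chunks of at most `capacity`, losing none, even
    when the final (eos) batch is larger than `capacity`."""
    buffered = []
    for batch in batches:
        buffered.extend(batch)
        while len(buffered) >= capacity:
            yield buffered[:capacity]
            buffered = buffered[capacity:]
    if buffered:  # Flush the remainder after eos instead of dropping it.
        yield buffered
```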
Testing:
- Added an end-to-end test with data_source_batch_size as 2048.
The test failed without the fix and passed with it.
Also added a test with data_source_batch_size as 512.
- Passed core tests.
Change-Id: I978d0a65faa63a47ec86a0127c0bee8dfb79530b
Reviewed-on: http://gerrit.cloudera.org:8080/20636
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
creating external jdbc tables
This patch builds on top of IMPALA-5741 to copy the jdbc jar from
remote filesystems: Ozone and S3. Previously, only hdfs was supported.
Testing:
Commented out "@skipif.not_hdfs" qualifier in files:
- tests/query_test/test_ext_data_sources.py
- tests/custom_cluster/test_ext_data_sources.py
1) tested locally by running tests:
- impala-py.test tests/query_test/test_ext_data_sources.py
- impala-py.test tests/custom_cluster/test_ext_data_sources.py
2) tested using jenkins job for ozone and S3
Change-Id: I804fa3d239a4bedcd31569f2b46edb7316d7f004
Reviewed-on: http://gerrit.cloudera.org:8080/20639
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
This patch makes external data sources work in LocalCatalog mode:
- Add APIs in CatalogdMetaProvider to fetch DataSource from Catalog
server through RPC.
- Add getDataSources() and getDataSource() in LocalCatalog.
- Add LocalDataSourceTable class for loading DataSource table in
LocalCatalog.
- Handle request for loading DataSource in CatalogServiceCatalog on
Catalog server.
- Enable tests which are skipped by
SkipIfCatalogV2.data_sources_unsupported().
Remove SkipIfCatalogV2.data_sources_unsupported().
- Add end-to-end tests for LocalCatalog mode.
Testing:
- Passed core tests
Change-Id: I40841c9be9064ac67771c4d3f5acbb3b552a2e55
Reviewed-on: http://gerrit.cloudera.org:8080/20574
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>