This patch moves the source files of the jdbc package to fe.
The data source location is now optional: a data source can be created
without specifying an HDFS location. In that case Impala assumes the data
source class is in the classpath and that an instance of it can be
created with the current class loader. Impala still tries to load the jar
file of the data source at runtime if one is set in the data source
location.
Testing:
- Passed core test
- Passed dockerised-tests
Change-Id: I0daff8db6231f161ec27b45b51d78e21733d9b1f
Reviewed-on: http://gerrit.cloudera.org:8080/20971
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
for external data source table
This patch adds support for datatype date as predicates
for external data sources.
Testing:
- Added tests for date predicates with operators:
'=', '>', '<', '>=', '<=', '!=', 'BETWEEN'.
Change-Id: Ibf13cbefaad812a0f78755c5791d82b24a3395e4
Reviewed-on: http://gerrit.cloudera.org:8080/20915
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Also sets dependencyManagement to force using the same version
for jackson-databind, jackson-core and jackson-annotations. This is
needed because datagenerator depends on kitesdk, which would pull in a
very old jackson-core version (2.3.1) and lead to build failures
with the newer jackson.databind.
Change-Id: I8440426da1395045cf149aca0044286015861e5f
Reviewed-on: http://gerrit.cloudera.org:8080/20914
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch uses the JDBC connection string to apply query options to the
Impala server by setting the properties in "jdbc.properties" when
creating a JDBC external DataSource table.
jdbc.properties is specified as a comma-delimited key=value string, like
"MEM_LIMIT=1000000000, ENABLED_RUNTIME_FILTER_TYPES=\"BLOOM,MIN_MAX\"".
Fixed Impala to allow the value of ENABLED_RUNTIME_FILTER_TYPES to have
double quotes at the beginning and end of the string.
jdbc.properties can be used for other databases like Postgres and MySQL
to set additional properties. The test cases will be added in a separate
patch.
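To illustrate the format, here is a minimal Python sketch of splitting
such a string while keeping commas inside double quotes together. It is
not Impala's parser; the function name parse_jdbc_properties is
hypothetical, and keeping the surrounding quotes in the value is an
assumption made for the illustration.

```python
def parse_jdbc_properties(text):
    """Split a comma-delimited key=value string into a dict.

    Commas inside double quotes do not split a pair. The quotes are kept
    in the value (an assumption of this sketch).
    """
    pairs = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == ',' and not in_quotes:
            pairs.append(''.join(current))
            current = []
        else:
            current.append(ch)
    if current:
        pairs.append(''.join(current))

    props = {}
    for pair in pairs:
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition('=')
        if not sep:
            raise ValueError("expected key=value, got %r" % pair)
        props[key.strip()] = value.strip()
    return props
```

With the example string above, the quoted "BLOOM,MIN_MAX" stays a single
value instead of being split at its inner comma.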
Testing:
- Added end-to-end tests for setting query options on Impala JDBC
tables.
- Passed core tests.
Change-Id: I47687b7a93e90cea8ebd5f3fc280c9135bd97992
Reviewed-on: http://gerrit.cloudera.org:8080/20837
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
external data source
In the current implementation of external JDBC data source,
the user has to provide both the username and password in
plain text which is not a good practice.
This patch extends the existing implementation so that the user can
provide either:
a) username and password
b) username or key and keystore
If the user provides the password, then that password is used.
However, if no password is provided and the user provides only the
key/keystore, then it fetches the password from the secure jceks
keystore.
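The precedence described above can be sketched in Python as follows.
This is an illustration only, not the patch's code: resolve_password and
the read_keystore_secret callable are hypothetical names, and returning
None when neither source is available is an assumption.

```python
def resolve_password(password, key, keystore, read_keystore_secret):
    """Pick the password for a JDBC connection.

    read_keystore_secret is a hypothetical callable (keystore, key) -> str
    standing in for a real jceks keystore lookup.
    """
    if password is not None:
        # An explicitly provided password is always used.
        return password
    if key is not None and keystore is not None:
        # No password given: fetch it from the keystore.
        return read_keystore_secret(keystore, key)
    # Assumption of this sketch: nothing configured means no password.
    return None
```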
Testing:
- Added unit test TestExtDataSourcesWithKeyStore
Change-Id: Iec83a9b6e00456f0a1bbee747bd752b2cf9bf238
Reviewed-on: http://gerrit.cloudera.org:8080/20809
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support to read Impala tables in the Impala cluster
through JDBC external data source. It also adds a new counter
NumExternalDataSourceGetNext in profile for the total number of calls
to ExternalDataSource::GetNext().
Setting query options for Impala will be supported in a following patch.
Testing:
- Added an end-to-end unit test to read Impala tables from Impala
cluster through JDBC external data source.
Manually ran the unit-test with Impala tables in Impala cluster on a
remote host by setting $INTERNAL_LISTEN_HOST in jdbc.url as the ip
address of the remote host on which an Impala cluster is running.
- Added LDAP test for reading table through JDBC external data source
with LDAP authentication.
Manually ran the unit-test with Impala tables in a remote Impala
cluster.
- Passed core tests.
Change-Id: I79ad3273932b658cb85c9c17cc834fa1b5fbd64f
Reviewed-on: http://gerrit.cloudera.org:8080/20731
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
tables for MySQL
This patch adds MySQL tests for the "external data source"
mechanism in Impala to implement data source for querying JDBC.
This patch also fixes the handling of case-sensitive table and
column names for MySQL query.
Testing:
- Added unit test for MySQL and ran unit-test with JDBC
driver mysql-connector-j-8.1.0.jar. This test requires
adding docker to the sudoers group. Also, the test is
only run in 'exhaustive' mode.
Change-Id: I446ec3d4ebaf53c8edac0b2d181514bde587dfae
Reviewed-on: http://gerrit.cloudera.org:8080/20710
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
IMPALA-12373 introduces small string optimisation, after which not all
strings will have a var-len part.
IMPALA-12159 adds support for ORDER BY for collections of variable
length types in the select list, but the test tables it uses only/mostly
contain short strings.
This patch has two modifications:
1. It introduces longer strings in 'collection_tbl' and
'collection_struct_mix'. It also adds two more rows to the existing one
in 'collection_tbl' so that it can be used in sorting tests. These
tables are only used by complex types tests, so the impact is limited.
2. It modifies RandomNestedDataGenerator.java, so that now it takes a
parameter for string length. Some variable names are changed to clearer
names. The references to and uses of RandomNestedDataGenerator are
updated.
Change-Id: Ief770d6bc9258fce159a733d5afa34fe594b96f8
Reviewed-on: http://gerrit.cloudera.org:8080/20718
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
in GenericJdbcDatabaseAccessor close() function
The earlier change had a bug where we were deleting
the temporary jdbc jar file from the /tmp directory
too early, before it could be loaded. The
GenericJdbcDatabaseAccessor class loader loads classes
on demand. Hence move the file deletion logic to the
GenericJdbcDatabaseAccessor close() function instead.
Testing:
1. Make sure the Impala cluster has been started.
2. Copy the jar files of JDBC drivers and the data source library into
HDFS.
${IMPALA_HOME}/testdata/bin/copy-ext-data-sources.sh
Verify that the mysql-jdbc.jar is present in the hdfs path:
hadoop fs -ls /test-warehouse/data-sources/jdbc-drivers
3. Create an `alltypes` table in the mysql database.
${IMPALA_HOME}/testdata/bin/load-ext-data-sources.sh
4. Create mysql data source tables (alltypes_jdbc_datasource and
alltypes_jdbc_datasource_2).
${IMPALA_HOME}/bin/impala-shell.sh -f\
${IMPALA_HOME}/testdata/bin/create-ext-data-source-table.sql
5. Make sure that the mysql jar file is not present in the classpath
grep 'mysql' /home/gsingh/Impala/fe/target/build-classpath.txt \
/home/gsingh/Impala/fe/target/test-classpath.txt \
/home/gsingh/Impala/java/executor-deps/target/build-executor-\
deps-classpath.txt | wc -l
returns 0
6. Run the impala-shell query:
use functional;
select count(*) from alltypes_jdbc_mysql_datasource;
executes successfully and returns the row count.
Change-Id: I1becc01a9d93a99be8f47dfe99258dea3a8abeb3
Reviewed-on: http://gerrit.cloudera.org:8080/20706
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Backend function DataSourceScanNode::GetNext() handles count queries
inefficiently. Even when no column data is returned from the
external data source, it still tries to materialize rows and add
them to the RowBatch one by one, up to the row count. It also
calls GetNextInputBatch() multiple times (count / batch_size), and
each GetNextInputBatch() call invokes a JNI function in the external
data source.
This patch improves the DataSourceScanNode::GetNext() and
JdbcDataSource.getNext() to avoid unnecessary function calls.
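For a sense of scale, the count / batch_size figure mentioned above can
be worked out with a small Python sketch. The function name and the
sample row count and batch size are assumptions chosen for illustration;
they are not taken from the patch.

```python
def input_batch_calls(row_count, batch_size):
    """Number of batches needed to cover row_count rows (ceiling division)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return -(-row_count // batch_size)


# Assumed sample values: one million rows, 1024 rows per batch.
calls = input_batch_calls(1000000, 1024)  # 977
```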
Testing:
- Ran query_test/test_ext_data_sources.py, which includes count
queries for jdbc external tables.
- Passed core-tests.
Change-Id: I9953dca949eb773022f1d6dcf48d8877857635d6
Reviewed-on: http://gerrit.cloudera.org:8080/20653
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
/tmp after class loaded
This patch fixes the bug added in the previous patch for IMPALA-12470.
It adds the prefix "file://" to the unix standard path string to
create the corresponding valid hadoop.fs.Path object. For example:
"/tmp" is converted to "file:///tmp".
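The conversion can be sketched in Python as below. This is only an
illustration of the string handling, not the patch's Java code; the
function name to_local_file_uri is hypothetical, and leaving paths that
already carry a scheme untouched is an assumption.

```python
def to_local_file_uri(path):
    """Prefix an absolute unix path with file:// so it forms a valid URI."""
    if "://" in path:
        # Assumption: a path that already has a scheme is left as-is.
        return path
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % path)
    return "file://" + path
```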
Testing:
1. Deleted all the jar files in the /tmp directory.
2. Ran the local jdbc ext data sources tests:
- impala-py.test tests/query_test/test_ext_data_sources.py
- impala-py.test tests/custom_cluster/test_ext_data_sources.py
3. After the tests completed successfully, verified that there were
no .jar files in the /tmp directory.
Change-Id: Iab7cc66383bc62f209987dd3fb42fc3fc6604726
Reviewed-on: http://gerrit.cloudera.org:8080/20654
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
creating external jdbc table
This patch builds on top of IMPALA-5741 to copy the jdbc jar from
remote filesystems: Ozone and S3. Currently we only support hdfs.
Testing:
Commented out "@skipif.not_hdfs" qualifier in files:
- tests/query_test/test_ext_data_sources.py
- tests/custom_cluster/test_ext_data_sources.py
1) tested locally by running tests:
- impala-py.test tests/query_test/test_ext_data_sources.py
- impala-py.test tests/custom_cluster/test_ext_data_sources.py
2) tested using jenkins job for ozone and S3
Change-Id: I804fa3d239a4bedcd31569f2b46edb7316d7f004
Reviewed-on: http://gerrit.cloudera.org:8080/20639
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Wenzhe Zhou <wzhou@cloudera.com>
Uses the imported Hadoop version for the hadoop-aliyun module, which is
a tool in the hadoop project. This allows us to exclude vulnerable
versions of jdom that were previously included via hadoop-aliyun.
Change-Id: I270f3895ec668d9fb907f35b04cad2f149e3d0de
Reviewed-on: http://gerrit.cloudera.org:8080/20532
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch uses the "external data source" mechanism in Impala to
implement data source for querying JDBC.
It has some limitations due to the restrictions of "external data
source":
- It is not distributed, e.g., the fragment is unpartitioned. The queries
are executed on the coordinator.
- Queries which read the following data types from external JDBC tables
are not supported:
BINARY, CHAR, DATETIME, and COMPLEX.
- Only binary predicates with operators =, !=, <=, >=,
<, > can be pushed to the RDBMS.
- The following data types are not supported for predicates:
DECIMAL, TIMESTAMP, DATE, and BINARY.
- External tables with complex types of columns are not supported.
- Support is limited to the following databases:
MySQL, Postgres, Oracle, MSSQL, H2, DB2, and JETHRO_DATA.
- Catalog V2 is not supported (IMPALA-7131).
- DataSource objects are not persistent (IMPALA-12375).
Additional fixes are planned on top of this patch.
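The predicate limitations listed above can be summarized in a short
Python sketch. It only restates the two lists from this message as a
check; the function and constant names are hypothetical and this is not
the planner's actual logic.

```python
# Operators that can be pushed to the RDBMS, per the list above.
PUSHABLE_OPERATORS = {"=", "!=", "<=", ">=", "<", ">"}
# Column types not supported for predicates, per the list above.
UNSUPPORTED_PREDICATE_TYPES = {"DECIMAL", "TIMESTAMP", "DATE", "BINARY"}


def can_push_down(operator, column_type):
    """Return True if a binary predicate would qualify under these limits."""
    return (operator in PUSHABLE_OPERATORS
            and column_type.upper() not in UNSUPPORTED_PREDICATE_TYPES)
```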
Source files under jdbc/conf, jdbc/dao and jdbc/exception are
replicated from Hive JDBC Storage Handler.
In order to query the RDBMS tables, the following steps should be
followed (note that existing data source table will be rebuilt):
1. Make sure the Impala cluster has been started.
2. Copy the jar files of JDBC drivers and the data source library into
HDFS.
${IMPALA_HOME}/testdata/bin/copy-ext-data-sources.sh
3. Create an `alltypes` table in the Postgres database.
${IMPALA_HOME}/testdata/bin/load-ext-data-sources.sh
4. Create data source tables (alltypes_jdbc_datasource and
alltypes_jdbc_datasource_2).
${IMPALA_HOME}/bin/impala-shell.sh -f\
${IMPALA_HOME}/testdata/bin/create-ext-data-source-table.sql
5. The data source tables created in the last step are now ready to
query. There is no need to restart the Impala cluster.
Testing:
- Added unit-test for Postgres and ran unit-test with JDBC driver
postgresql-42.5.1.jar.
- Ran manual unit-test for MySQL with JDBC driver
mysql-connector-j-8.1.0.jar.
- Ran core tests successfully.
Change-Id: I8244e978c7717c6f1452f66f1630b6441392e7d2
Reviewed-on: http://gerrit.cloudera.org:8080/17842
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We need a couple of hive changes HIVE-27319 and HIVE-27337 for catalogD
to work with latest HMS server to fix IMPALA-11768 and IMPALA-11939
respectively.
Bump CDP_BUILD_NUMBER (GBN) to 44206393
Bump various CDP version numbers to be based on 7.2.18.0-273
TESTING: Exhaustive tests ran clean
Added a couple of tests for IMPALA-11939 and IMPALA-11768
Change-Id: I117873b628aed3e24280f9fcd79643f918c8d5f3
Reviewed-on: http://gerrit.cloudera.org:8080/20420
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch allows a third party to inject their own frontend
class instead of using the default JniFrontend included in the
project.
The test case includes an interface that runs queries as normal
except for the "select 1" query which gets changed to "select 42".
Change-Id: I89e677da557b39232847644b6ff17510e2b3c3d5
Reviewed-on: http://gerrit.cloudera.org:8080/20459
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Maven 3.9.x offers a new dependency resolver, HttpClient, which allows
downloading project dependencies in parallel.
This patch bumps the Maven version installed by bootstrap_system.sh to
v3.9.2, and adds the flags enabling the new resolver to download
dependencies (including POM files) in parallel. Parallelism is set to
10 threads.
The flags are added to a project-specific Maven setting file in the
newly created java/.mvn directory. The settings file is added to the
RAT exclusion list in bin/rat_exclude_files.txt.
The --show-version flag is added for debugging purposes.
The same flags are added to the JAMM subproject as well.
The new resolver in Maven 3.9 has also changed the warning message
emitted for missing component checksums, so the new warning string
is added to the filter in bin/mvn-quiet.sh
Unfortunately Maven 3.9 has also changed the way it responds to missing
checksum files: the resolver now emits a stack trace when checksums
cannot be determined, and missing checksums are not explicitly ignored.
Detailed documentation for the new Maven resolver in Maven 3.9.0+ is
located at:
https://maven.apache.org/guides/mini/guide-resolver-transport.html
resolver configuration reference:
https://maven.apache.org/resolver/configuration.html
Tests:
- verified in a core-mode test run with Maven 3.9.2 installed
- verified in a local build using an earlier version of Maven
to verify that the new default setting does not cause regressions
with the old dependency resolver.
Change-Id: I75d05215effc724f5bd471646fb352f37443e185
Reviewed-on: http://gerrit.cloudera.org:8080/20142
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
This adds the --batch-mode flag to the maven invocation
that builds jamm. That disables some of the download progress
output, reducing the total size of the output.
Testing:
- Ran a build locally
Change-Id: I1634240b191168b13cf3be7c9266e21a746844b1
Reviewed-on: http://gerrit.cloudera.org:8080/20196
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Produces a shaded copy of a pre-release jamm jar that supports Java 17.
Building a copy of jamm and directly depending on it meant any consumer
of Impala would have to provide their own build of it.
Testing: ran custom_cluster/test_local_catalog.py with Java 8 and 17
Change-Id: Ida42d720a2639b65391c07a9237556311e04fac6
Reviewed-on: http://gerrit.cloudera.org:8080/20147
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
RandomNestedDataGenerator can be used to produce parquet files with
random data from Avro schemas. This change makes it possible to provide
a seed value for the random generator so the generated files are
reproducible. The seed can be given as the last (optional) command line
argument. It is parsed as a Java 'long'.
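The effect of a fixed seed can be illustrated with a small Python
sketch. This is not the generator's Java code and Python's random module
produces different values than java.util.Random; the function name
make_rng is hypothetical. The range check mirrors the fact that the seed
is parsed as a Java 'long'.

```python
import random

JAVA_LONG_MIN = -2**63
JAVA_LONG_MAX = 2**63 - 1


def make_rng(seed_text=None):
    """Build a random generator, seeded if an optional seed string is given."""
    if seed_text is None:
        # No seed argument: the output differs from run to run.
        return random.Random()
    seed = int(seed_text)
    if not JAVA_LONG_MIN <= seed <= JAVA_LONG_MAX:
        raise ValueError("seed does not fit in a Java long: %s" % seed_text)
    return random.Random(seed)
```

Two generators built from the same seed string yield the same sequence,
which is what makes the generated files reproducible.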
Testing:
- manually verified that when run with the same arguments (including
the seed), the data generator produces the same results
Change-Id: Iee33604bbfe12895100afbd0f98ac302dee9a238
Reviewed-on: http://gerrit.cloudera.org:8080/20136
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Daniel Becker <daniel.becker@cloudera.com>
Enables building for Java 17 - and particularly using Java 17 in
containers - but won't run a minicluster fully with Java 17 as some
projects (Hadoop) don't yet support it.
Starting with Java 15, ehcache.sizeof encounters
UnsupportedOperationException: can't get field offset on a hidden class
in class members pointing to capturing lambda functions. Java 17 also
introduces new modules that need to be added to add-opens. Both of these
pose problems for continued use of ehcache.
Adds https://github.com/jbellis/jamm as a new cache weigher for Java
15+. We build from HEAD as an external project until Java 17 support is
released (https://github.com/jbellis/jamm/issues/44). Adds the
'java_weigher' option to select 'sizeof' or 'jamm'; defaults to 'auto',
which uses jamm for Java 15+ and sizeof for everything else. Also adds
metrics for viewing cache weight results.
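The 'auto' selection rule can be sketched in Python as follows. This
only restates the rule described above; the function name choose_weigher
is hypothetical and rejecting unknown values is an assumption.

```python
def choose_weigher(java_weigher, java_major_version):
    """Resolve the 'java_weigher' option to 'sizeof' or 'jamm'."""
    if java_weigher in ("sizeof", "jamm"):
        # An explicit choice is used as given.
        return java_weigher
    if java_weigher == "auto":
        # jamm for Java 15+, sizeof for everything else.
        return "jamm" if java_major_version >= 15 else "sizeof"
    raise ValueError("unknown java_weigher value: %r" % java_weigher)
```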
Adds JAVA_HOME/lib/server to LD_LIBRARY_PATH in run-jvm-binary to
simplify switching between JDK versions for testing. You can now
- export IMPALA_JDK_VERSION=11
- source bin/impala-config.sh
- start-impala-cluster.py
and have Impala running a different JDK version (11).
Retains add-opens calls that are still necessary due to dependencies'
use of lambdas for jamm, and all others for ehcache. Add-opens are still
required as a fallback, as noted in
https://github.com/jbellis/jamm#object-graph-crawling. We catch the
exceptions jamm and ehcache throw - CannotAccessFieldException,
UnsupportedOperationException - to avoid crashing Impala, and add it to
the list of banned log messages (as we should add-opens when we find
them).
Testing:
- container test run with Java 11 and 17 (excludes custom cluster)
- manual custom_cluster/test_local_catalog.py +
test_banned_log_messages.py run with Java 11 and 17 (Java 8 build)
- full Java 11 build (passed except IMPALA-12184)
- added test that catalog cache entry size metrics fit reasonable bounds
- added unit test for utility to find jamm jar file in classpath
Change-Id: Ic378896f572e030a3a019646a96a32a07866a737
Reviewed-on: http://gerrit.cloudera.org:8080/19863
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change removes the Text-typed overload for BufferAlteringUDF to
avoid ambiguous function matchings. It also changes the 2-parameter
function in BufferAlteringUDF to cover Text typed arguments.
Tests:
- test_udfs.py manually executed
Change-Id: I3a17240ce39fef41b0453f162ab5752f1c940f41
Reviewed-on: http://gerrit.cloudera.org:8080/20038
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds new environment variable IMPALA_JDK_VERSION which can be 'system',
'8', or '11'. The default is 'system', which uses the same logic as
before. If set to 8 or 11, it will ignore the system java and search for
java of that specific version (based on specific directories for Ubuntu
and Redhat). This is used by bin/bootstrap_system.sh to determine
whether to install java 8 or java 11 (other versions can come later). If
IMPALA_JDK_VERSION=11, then bin/start-impala-cluster.py adds the opens
needed to deal with the ehcache issue.
This no longer puts JAVA_HOME in bin/impala-config-local.sh as part of
bootstrap_system.sh. Instead, it provides a new environment variable
IMPALA_JAVA_HOME_OVERRIDE, which will be preferred over
IMPALA_JDK_VERSION.
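The precedence between the two variables can be sketched in Python as
below. This is an illustration, not the logic in bin/impala-config.sh:
resolve_java_home is a hypothetical name, and find_jdk stands in for the
distribution-specific directory search, whose paths are not shown here.

```python
def resolve_java_home(env, system_java_home, find_jdk):
    """Pick a Java home from the environment settings described above.

    find_jdk is a hypothetical callable mapping '8' or '11' to a JDK
    directory.
    """
    override = env.get("IMPALA_JAVA_HOME_OVERRIDE")
    if override:
        # The override is preferred over IMPALA_JDK_VERSION.
        return override
    version = env.get("IMPALA_JDK_VERSION", "system")
    if version == "system":
        return system_java_home
    if version in ("8", "11"):
        return find_jdk(version)
    raise ValueError("unsupported IMPALA_JDK_VERSION: %r" % version)
```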
This also updates the versions of Maven plugins related to the build.
Source and target releases are still set to Java 8 compatibility.
Adds a verifier to the end of run-all-tests that
InaccessibleObjectException is not present in impalad logs. Tested with
JDBC_TEST=false EE_TEST=false FE_TEST=false BE_TEST=false \
CLUSTER_TEST_FILES=custom_cluster/test_local_catalog.py \
run-all-tests.sh
Testing: ran test suite with Java 11
This reverts the revert commit 1b6011c, restoring these changes minus
code to update IMPALA_JDK_VERSION based on $JAVA -version as that could
break subsequent sourcing of impala-config.sh.
Change-Id: Ie16504ad5738b1f228f97044afd3d9017ccc6c53
Reviewed-on: http://gerrit.cloudera.org:8080/19928
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds new environment variable IMPALA_JDK_VERSION which can be 'system',
'8', or '11'. The default is 'system', which uses the same logic as
before. If set to 8 or 11, it will ignore the system java and search for
java of that specific version (based on specific directories for Ubuntu
and Redhat). This is used by bin/bootstrap_system.sh to determine
whether to install java 8 or java 11 (other versions can come later). If
IMPALA_JDK_VERSION=11, then bin/start-impala-cluster.py adds the opens
needed to deal with the ehcache issue.
This no longer puts JAVA_HOME in bin/impala-config-local.sh as part of
bootstrap_system.sh. Instead, it provides a new environment variable
IMPALA_JAVA_HOME_OVERRIDE, which will be preferred over
IMPALA_JDK_VERSION.
This also updates the versions of Maven plugins related to the build.
Source and target releases are still set to Java 8 compatibility.
Adds a verifier to the end of run-all-tests that
InaccessibleObjectException is not present in impalad logs. Tested with
JDBC_TEST=false EE_TEST=false FE_TEST=false BE_TEST=false \
CLUSTER_TEST_FILES=custom_cluster/test_local_catalog.py \
run-all-tests.sh
Testing: ran test suite with Java 11
Change-Id: I15d309e2092c12d7fdd2c99b727f3a8eed8bc07a
Reviewed-on: http://gerrit.cloudera.org:8080/19539
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Replaces constructor calls for object versions of primitives - Integer,
Long, Float, Double, Boolean - with optimized valueOf calls as using
constructors for these is deprecated according to jdeprscan.
Removes override of finalize. Use of finalize is deprecated, and
hive-udf-call.cc ensures we always call close when unloading the UDF.
Adds try-with-resources to UdfExecutorTest to handle test cleanup.
Updates BigDecimal.setScale to use RoundingMode.
Change-Id: Idfb053223b6e098e6032502f873361696dd2da84
Reviewed-on: http://gerrit.cloudera.org:8080/19721
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change fixes the behavior of BytesWritable and TextWritable's
getBytes() method. The returned byte array can now be treated as
the underlying buffer: it is loaded before the UDF's evaluation
and tracks changes like a regular Java byte array; a resize
operation still resets the reference. The operations that wrote back
to the native heap were also removed, as these are now handled in
the byte array. The ImpalaStringWritable class is also removed;
writables that used it before now store the data directly.
Tests:
- Test UDFs added as BufferAlteringUdf and GenericBufferAlteringUdf
- E2E test ran for UDFs
Change-Id: Ifb28bd0dce7b0482c7abe1f61f245691fcbfe212
Reviewed-on: http://gerrit.cloudera.org:8080/19507
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this patch if an argument of a GenericUDF was NULL, then Impala
passed it as null instead of a DeferredObject. This was incorrect, as
a DeferredObject is expected with a get() function that returns null.
See the Jira for more details and GenericUDF examples in Hive.
TestGenericUdf's NULL handling was further broken in IMPALA-11549,
leading to throwing null pointer exceptions when the UDF's result is
NULL. This test bug was not detected, because Hive udf tests were
running with default abort_java_udf_on_exception=false, which means
that exceptions from Hive UDFs only led to warnings and returning NULL,
which was the expected result in all affected test queries.
This patch fixes the behavior in HiveUdfExecutorGeneric and improves
FE/EE tests to catch null handling related issues. Most Hive UDF tests
are run with abort_java_udf_on_exception=true after this patch to treat
exceptions in UDFs as errors. The ones where the test checks that NULL
is returned if an exception is thrown while abort_java_udf_on_exception
is false are moved to new .test files.
TestGenericUdf is also fixed (and simplified) to handle NULL return
values correctly.
Change-Id: I53238612f4037572abb6d2cc913dd74ee830a9c9
Reviewed-on: http://gerrit.cloudera.org:8080/19499
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for Huawei OBS (Object Storage Service)
FileSystem. The implementation is similar to other remote FileSystems.
New flags for OBS:
- num_obs_io_threads: Number of OBS I/O threads. Defaults to 16.
Testing:
- Upload hdfs test data to an OBS bucket. Modify all locations in HMS
DB to point to the OBS bucket. Remove some hdfs caching params.
Run CORE tests.
Change-Id: I84a54dbebcc5b71e9bcdd141dae9e95104d98cb1
Reviewed-on: http://gerrit.cloudera.org:8080/19110
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds geospatial functions from Hive's ESRI library
as builtin UDFs. Plain Hive UDFs are imported without changes,
but the generic and varargs functions are handled differently;
generic functions are added with all of the combinations of
their parameters (cartesian product of the parameters), and
varargs functions are unfolded into simple functions with a fixed
number of parameters. The varargs function wrappers are generated at build
time and they can be configured in
gen_geospatial_udf_wrappers.py. These additional steps are
required because of the limitations in Impala's UDF Executor
(lack of varargs support and only partial generics support)
which could be further improved; in this case, the additional
wrapping/mapping steps could be removed.
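The two expansions described above can be sketched in Python. This is
not the content of gen_geospatial_udf_wrappers.py; the function names,
the ST_Example name and the type names are hypothetical, and starting
the varargs unfolding at one argument is an assumption.

```python
from itertools import product


def expand_generic(name, param_type_options):
    """One signature per combination of the allowed types of each parameter."""
    return [(name, combo) for combo in product(*param_type_options)]


def unfold_varargs(name, arg_type, max_args):
    """One fixed-arity signature for each argument count up to max_args."""
    return [(name, (arg_type,) * n) for n in range(1, max_args + 1)]


# Two parameters with 2 and 3 allowed types give 2 * 3 = 6 signatures.
generic_sigs = expand_generic(
    "ST_Example", [["STRING", "BINARY"], ["INT", "BIGINT", "DOUBLE"]])
# A varargs function capped at 3 arguments gives 3 signatures.
vararg_sigs = unfold_varargs("ST_Example", "BINARY", 3)
```

The fixed upper bound on the unfolded arity is why the functions listed
under the known limitations accept only a limited number of arguments.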
Changes regarding function handling/creating are sourced from
https://gerrit.cloudera.org/c/19177
A new backend flag was added to turn this feature on/off
as "geospatial_library". The default value is "NONE" which
means no geospatial function gets registered
as builtin, "HIVE_ESRI" value enables this implementation.
The ESRI geospatial implementation for Hive is currently only
available in Hive 4, but CDP Hive backported it to Hive 3,
therefore for Apache Hive this feature is disabled
regardless of the "geospatial_library" flag.
Known limitations:
- ST_MultiLineString and ST_MultiPolygon only work
with the WKT overload
- ST_Polygon supports a maximum of 6 pairs of coordinates
- ST_MultiPoint and ST_LineString support a maximum of 7
pairs of coordinates
- ST_ConvexHull and ST_Union support a maximum of 6 geoms
These limits can be increased in gen_geospatial_udf_wrappers.py
Tests:
- test_geospatial_udfs.py added based on
https://github.com/Esri/spatial-framework-for-hadoop
Co-Authored-by: Csaba Ringhofer <csringhofer@cloudera.com>
Change-Id: If0ca02a70b4ba244778c9db6d14df4423072b225
Reviewed-on: http://gerrit.cloudera.org:8080/19425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Without HIVE-24498 we get java.lang.NoClassDefFoundError exceptions
when we write Iceberg tables via Hive. This makes it hard to write
interop tests between Hive and Impala which use Iceberg tables.
I also exclude some private Java components to get things built.
Change-Id: I486c2b1b224f72e082e331a57cf25a37ebb9fa54
Reviewed-on: http://gerrit.cloudera.org:8080/19331
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Laszlo Gaal <laszlo.gaal@cloudera.com>
Reviewed-by: Tamas Mate <tmater@apache.org>
Before this patch only the Writable* types were accepted in GenericUdfs
as return types, while some GenericUdfs in the wild return primitive java
types (e.g. Integer instead of IntWritable). For legacy Hive UDFs these
return types were already handled, so the only change needed was to
map the ObjectInspector subclasses (e.g. JavaIntObjectInspector) to the
correct JavaUdfDataType in Impala.
Testing:
- Added a subclass for TestGenericUdf (TestGenericUdfWithJavaReturnTypes)
that returns primitive java types (probably inheriting in the opposite
direction would be more logical, but the diff is smaller this way).
- Changed EE tests to also use TestGenericUdfWithJavaReturnTypes.
- Changed FE tests (UdfExecutorTest) to check both
TestGenericUdfWithJavaReturnTypes and TestGenericUdf.
- Also added a test with BINARY type to UdfExecutorTest as this was
forgotten during the original BINARY patch.
Change-Id: I30679045d6693ebd35718b6f1a22aaa4963c1e63
Reviewed-on: http://gerrit.cloudera.org:8080/19304
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Updates the test environment to default to the CDP build of Ozone, as
the latest build of CDP Hive depends on pre-release features unavailable
in Ozone 1.2.1. Apache Ozone 1.2 can still be used by setting
USE_APACHE_OZONE=true.
The latest CDP build also includes a version of Ozone based on
ozone#master with a candidate version of 1.3.0. Both Apache and CDP
therefore have builds of Ozone we can test with that use the new
artifact names introduced in Ozone 1.2, so this patch cleans up setup
that was only needed for Ozone versions prior to 1.2.
Change-Id: I1177a1b820fe21adca9f8c1cc51ff73ee001d3f2
Reviewed-on: http://gerrit.cloudera.org:8080/19247
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
This patch adds support for OSS (Aliyun Object Storage Service).
Using the hadoop-aliyun, the implementation is similar to other
remote FileSystems.
Tests:
- Prepare:
Initialize OSS-related environment variables:
OSS_ACCESS_KEY_ID, OSS_SECRET_ACCESS_KEY, OSS_ACCESS_ENDPOINT.
Compile and create hdfs test data on an ECS instance. Upload test data
to an OSS bucket.
- Modify all locations in HMS DB to point to the OSS bucket.
Remove some hdfs caching params. Run CORE tests.
Change-Id: I267e6531da58e3ac97029fea4c5e075724587910
Reviewed-on: http://gerrit.cloudera.org:8080/19165
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upgrades guava to 31.1-jre and jackson-databind to 2.13.4.2 to address
CVEs. Adds environment variables for commonly-updated components so they
can be customized via the branch-specific impala-config-branch.sh in a
way that allows both to be updated regularly without merge conflicts.
Also updates httpcomponents.httpcore to 4.4.14 to be consistent with
other httpcomponents libraries included transitively.
Change-Id: I1c2c4481ca3f498abf302aa05361d950b1ed1216
Reviewed-on: http://gerrit.cloudera.org:8080/19147
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Use dependencyManagement to simplify Java dependencies by directly
controlling versions of transitive dependencies instead of using
exclusions and direct inclusion.
Dependency management specifies versions authoritatively, so redundant
version declarations are also removed.
Change-Id: I424a175135855dcbd38ae432ea111cca5f562633
Reviewed-on: http://gerrit.cloudera.org:8080/19146
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Switches from log4j 1.x to reload4j, a maintained fork. Updates slf4j to
the latest version so we can include all CVE fixes.
slf4j 2.0.x requires Java 8 and adds a backward-compatible fluent
logging API. Neither seems like a problem for Impala.
Bans all use of log4j 1.x so we only use reload4j.
Change-Id: I5238b9c8247af3e0f4cb05c0b76a75bfee37f5c8
Reviewed-on: http://gerrit.cloudera.org:8080/19102
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch upgrades the Spring framework to 5.3.20 to
address multiple CVEs:
- CVE-2022-22971
- CVE-2022-22968
- CVE-2022-22970
Testing:
- Ran core job
- Ran custom cluster tests in exhaustive mode
Change-Id: I33f4f1d22fc27227e31d744658a17c16b61b9677
Reviewed-on: http://gerrit.cloudera.org:8080/19091
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit handles the case for a specific kind of corrupt function
within the Hive Metastore in the following situation:
A valid Hive SQL function gets created in HMS. This UDF is written in
Java and must derive from the "UDF" class. After creating this function
in Impala, we then replace the underlying jar file with a class that
does NOT derive from the "UDF" class.
In this scenario, catalogd should reject the function and still start
up gracefully. Before this commit, catalogd failed to start because
the Hive function FunctionUtils.getUDFClassType() depends on the UDAF
class and was throwing a LinkageError, so we need to include the UDAF
class in the shaded jar.
Change-Id: I54e7a1df6d018ba6cf5ecf32dc9946edf86e2112
Reviewed-on: http://gerrit.cloudera.org:8080/18927
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tamas Mate <tmater@apache.org>
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.
In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can only be cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.
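To illustrate the base64 point in the list above, here is a minimal Python
sketch. It is not Impala or Hive code and the sample bytes are made up. It
shows why arbitrary bytes need an encoding step before they can be stored
in a text-based format:

```python
import base64

# Hypothetical BINARY value: a control byte, a newline, and two bytes
# that are not valid UTF-8. Written raw, it would corrupt a text file.
raw = b"\x01\n\xff\xfe"

# Base64 maps the value to plain ASCII without newlines or delimiters.
encoded = base64.b64encode(raw)

# Decoding restores the original bytes exactly.
decoded = base64.b64decode(encoded)

print(encoded)
print(decoded == raw)
```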
As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows adding special
handling for BINARY.
Only a few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can only be used by casting to STRING first.
Adding support for more of these functions is easy, as the BINARY type
simply has to be "connected" to the existing STRING function's
signature. Functions whose result depends on utf8_mode need to ensure
that with BINARY they always work as if utf8_mode=0 (for example,
length() is mapped to bytes(), as length() counts UTF-8 characters if
utf8_mode=1).
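The difference between counting characters and counting bytes can be
sketched in Python. This is an illustration only, not the Impala
implementation. The function names char_length and byte_length are
hypothetical, and Python's code point count stands in for "UTF-8
characters":

```python
def char_length(data: bytes) -> int:
    """Count decoded characters, similar to length() with utf8_mode=1."""
    return len(data.decode("utf-8"))


def byte_length(data: bytes) -> int:
    """Count raw bytes, similar to bytes(); the right answer for BINARY."""
    return len(data)


# 'é' is one character but occupies two bytes in UTF-8.
value = "héllo".encode("utf-8")

print(char_length(value))  # 5
print(byte_length(value))  # 6
```

A BINARY value may not be valid UTF-8 at all, in which case a character
count is not even defined. That is another reason to always use the byte
count for BINARY.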
All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitly to ensure backward compatibility.
See IMPALA-11340 for details.
The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example of the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.
Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
to test scanning.
- Removed functional.unsupported_types and related tests, as now
Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added for the DATE type
Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Hadoop provides hadoop-cloud-storage, which includes most of
the dependencies that Impala currently uses like hadoop-aws,
hadoop-azure, Knox's gateway-cloud-bindings, etc. Hadoop has
put in a lot of work to make sure that this package includes
the right version of dependencies (including shading some
dependencies for GCS). It seems like this is a more reliable
way to consume these dependencies.
This switches the Java build to use hadoop-cloud-storage
and removes the dependencies that it replaces. This eliminates
the need to control the version of oauth and GCS, as those
are determined by hadoop-cloud-storage.
Change-Id: I3a1631289f990513823c2b17eb9241cc1b5a7ffd
Reviewed-on: http://gerrit.cloudera.org:8080/18817
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Adds Ozone as an alternative to hdfs in the minicluster. Select by
setting `export TARGET_FILESYSTEM=ozone`. With that flag,
run-mini-dfs.sh will start Ozone instead of HDFS. Requires a snapshot
because Ozone does not support HBase (HDDS-3589); snapshot loading
doesn't work yet primarily due to HDDS-5502.
Uses the o3fs interface because Ozone puts specific restrictions on
bucket names (no underscores, for instance), and it was a lot easier to
use an interface where everything is written to a single bucket than to
update all of Impala's uses of HDFS-style paths to make `test-warehouse` a
bucket inside a volume.
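The bucket-name restriction can be sketched as follows. The pattern is an
assumed, simplified rule (lowercase letters, digits, hyphens and periods,
3 to 63 characters); the actual Ozone rules may be stricter. The name
`tpch_parquet` is only an example of an underscore-containing name:

```python
import re

# Assumed, simplified naming rule; not taken from the Ozone source.
BUCKET_NAME = re.compile(r"[a-z0-9.-]{3,63}")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name satisfies the simplified rule above."""
    return BUCKET_NAME.fullmatch(name) is not None


print(is_valid_bucket_name("test-warehouse"))  # True
print(is_valid_bucket_name("tpch_parquet"))    # False
```

With o3fs, names like the second one only ever appear as keys inside a
single bucket, so they never have to pass this kind of check.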
Specifies reduced Ozone client retries during shutdown where Ozone may
not be available.
Passes tests with FE_TEST=false BE_TEST=false.
Change-Id: Ibf8b0f7b2d685d8b011df1926e12bf5434b5a2be
Reviewed-on: http://gerrit.cloudera.org:8080/18738
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Loading new classes from the same jar in the constructor of UDFs
did not work in the catalog because the URLClassLoader was closed
too early. Extended the lifecycle of the class loader a bit to
let the catalog finish all initialisation.
Note that the instantiation of legacy Hive UDFs doesn't seem
necessary in the catalog, as we can get all relevant info from
the class. Generic UDFs do need to be instantiated to be able
to call initialize().
Testing:
- added new classes to load in test UDFs and loaded these
in constructor / initialize()
- ran the Hive UDF ee tests
Change-Id: If16e38b8fc3b2577a5d32104ea9e6948b9562e24
Reviewed-on: http://gerrit.cloudera.org:8080/18611
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Upgrades google-oauth-client and google-oauth-client-java6 to 1.33.3 to
address CVE-2021-22573. These are included as dependencies of
com.google.cloud.bigdataoss/gcs-connector, which does not yet have a
release that includes versions 1.33.3 or later.
Change-Id: I8d95913f26e6073373374e169ee045881f40f065
Reviewed-on: http://gerrit.cloudera.org:8080/18683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>