impala

mirror of https://github.com/apache/impala.git synced 2026-01-17 12:00:29 -05:00

Author	SHA1	Message	Date
Peter Rozsa	6b571eb7e4	IMPALA-12184: Java UDF increment on an empty string is inconsistent This change removes the Text-typed overload for BufferAlteringUDF to avoid ambiguous function matchings. It also changes the 2-parameter function in BufferAlteringUDF to cover Text typed arguments. Tests: - test_udfs.py manually executed Change-Id: I3a17240ce39fef41b0453f162ab5752f1c940f41 Reviewed-on: http://gerrit.cloudera.org:8080/20038 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-06-20 17:00:35 +00:00
Peter Rozsa	afe59f7f0d	IMPALA-11854: ImpalaStringWritable's underlying array can't be changed in UDFs This change fixes the behavior of BytesWritable and TextWritable's getBytes() method. Now the returned byte array could be handled as the underlying buffer as it gets loaded before the UDF's evaluation, and tracks the changes as a regular Java byte array; the resizing operation still resets the reference. The operations that wrote back to the native heap were also removed as these operations are now handled in the byte array. ImpalaStringWritable class is also removed, writables that used it before now store the data directly. Tests: - Test UDFs added as BufferAlteringUdf and GenericBufferAlteringUdf - E2E test ran for UDFs Change-Id: Ifb28bd0dce7b0482c7abe1f61f245691fcbfe212 Reviewed-on: http://gerrit.cloudera.org:8080/19507 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-03-08 19:54:38 +00:00
Csaba Ringhofer	67bb870aa3	IMPALA-11911: Fix NULL argument handling in Hive GenericUDFs Before this patch if an argument of a GenericUDF was NULL, then Impala passed it as null instead of a DeferredObject. This was incorrect, as a DeferredObject is expected with a get() function that returns null. See the Jira for more details and GenericUDF examples in Hive. TestGenericUdf's NULL handling was further broken in IMPALA-11549, leading to throwing null pointer exceptions when the UDF's result is NULL. This test bug was not detected, because Hive udf tests were running with default abort_java_udf_on_exception=false, which means that exceptions from Hive UDFs only led to warnings and returning NULL, which was the expected result in all affected test queries. This patch fixes the behavior in HiveUdfExecutorGeneric and improves FE/EE tests to catch null handling related issues. Most Hive UDF tests are run with abort_java_udf_on_exception=true after this patch to treat exceptions in UDFs as errors. The ones where the test checks that NULL is returned if an exception is thrown while abort_java_udf_on_exception is false are moved to new .test files. TestGenericUdf is also fixed (and simplified) to handle NULL return values correctly. Change-Id: I53238612f4037572abb6d2cc913dd74ee830a9c9 Reviewed-on: http://gerrit.cloudera.org:8080/19499 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-03-06 13:45:56 +00:00
Csaba Ringhofer	7ca11dfc7f	IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows adding special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length counts utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitly to ensure backward compatibility. See IMPALA-11340 for details. 
The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Reviewed-on: http://gerrit.cloudera.org:8080/16066 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-08-19 13:55:42 +00:00
Steve Carlin	ca5ea4aeab	IMPALA-11162: Support GenericUDFs for Hive Hive has 2 types of UDFs. This commit contains limited support for the second generation UDFs called GenericUDFs. The main limitations are as follows: Decimal types are not supported. The Impala framework determines the precision and scale of the decimal return type. However, the Hive GenericUDFs allow the capability to choose its own return type based on the parameters. Until this can be resolved, it is safer to forbid decimals from being used. Note that this limitation currently exists in the first generation of Hive Java UDFs. Complex types are not supported. Functions are not extracted from the jar file. The first generation of Hive UDFs allowed this because the method prototypes are explicitly defined and can be determined at function creation time. For GenericUDFs, the return types are determined based on the parameters passed in when running a query. For the same reason as above, GenericUDFs cannot be made permanent. They will need to be recreated every time the server is restarted. This is a severe limitation and will be resolved in the near future. Change-Id: Ie6fd09120db413fade94410c83ebe8ff104013cd Reviewed-on: http://gerrit.cloudera.org:8080/18295 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2022-05-11 15:10:28 +00:00
Steve Carlin	504e0d0012	IMPALA-11056: Create option to fail query on Java UDF exceptions This commit will create a new query option, "abort_java_udf_on_exception". The current and default behavior is that when the Java UDF throws an exception, a warning is logged and the function returns NULL. If the query option is set to true, the query will fail. Change-Id: Ifece20cf16a6575f1c498238f754440e870e2ce9 Reviewed-on: http://gerrit.cloudera.org:8080/18080 Reviewed-by: Kurt Deschler <kdeschle@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Aman Sinha <amsinha@cloudera.com>	2022-01-27 23:06:36 +00:00
Daniel Becker	dc08b657e8	IMPALA-7658: Proper codegen for HiveUdfCall Implementing codegen for HiveUdfCall. Testing: Verified that java udf tests pass locally. Benchmarks: Used a UDF from TestUdf.java that adds three integers: create function tpch15_parquet.sum3(int, int, int) returns int location '/test-warehouse/impala-hive-udfs.jar' symbol='org.apache.impala.TestUdf'; Used the following query on the master branch and the change's branch: set num_nodes=1; set mt_dop=1; select min(tpch15_parquet.sum3(cast(l_orderkey as int), cast(l_partkey as int), cast(l_suppkey as int))) from tpch15_parquet.lineitem; Results averaged over 100 runs after warmup: Master: 20.6346s, stddev: 0.3132411856765332 This change: 19.0256s, stddev: 0.42039019873436 This is a ~7.8% improvement. Change-Id: I2f994dac550f297ed3c88491816403f237d4d747 Reviewed-on: http://gerrit.cloudera.org:8080/16314 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-04 00:55:02 +00:00
Philip Zeyliger	635149b85f	IMPALA-8016: Fix lifecycle of classloader for UDFs. The ClassLoader whence a UDF was loaded needs to be kept open for executions of the UDF, so that the UDF can load other classes from the same jar. (A typical scenario might be a utility class.) This was broken by the fix to IMPALA-7668. This commit moves closing the ClassLoader to the close() function. A test for a UDF that imports a static method from another file has been added. Doing so failed without this change. Change-Id: Ic02e42fb25a2754ede21fe00312a60f07e0ba8a2 Reviewed-on: http://gerrit.cloudera.org:8080/12125 Reviewed-by: Philip Zeyliger <philip@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-12-27 21:02:57 +00:00
Dan Hecht	924066a4fa	IMPALA-5580: fix Java UDFs that return NULL strings In commit `741421de`, we accidentally made it so that is_null=true StringVals became is_null=false with len=0. Fix that and add a regression test. Change-Id: I34d288aad66a2609484058c9a177c02200cb6a6e Reviewed-on: http://gerrit.cloudera.org:8080/7364 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Impala Public Jenkins	2017-07-07 01:30:59 +00:00
Tim Armstrong	d7246d64c7	IMPALA-1430,IMPALA-4108: codegen all builtin aggregate functions This change enables codegen for all builtin aggregate functions, e.g. timestamp functions and group_concat. There are several parts to the change: * Adding support for generic UDAs. Previously the codegen code did not handle multiple input arguments or NULL return values. * Defaulting to using the UDA interface when there is not a special codegen path (we have implementations of all builtin aggregate functions for the interpreted path). * Remove all the logic to disable codegen for the special cases that now are supported. Also fix the generation of code to get/set NULL bits since I needed to add functionality there anyway. Testing: Add tests that check that codegen was enabled for builtin aggregate functions. Also fix some gaps in the preexisting tests. Also add tests for UDAs that check input/output nulls are handled correctly, in anticipation of enabling codegen for arbitrary UDAs. The tests are run with both codegen enabled and disabled. To avoid flaky tests, we switch the UDF tests to use "unique_database". Perf: Ran local TPC-H and targeted perf. Spent a lot of time on TPC-H Q1, since my original approach regressed it ~5%. In the end the problem was to do with the ordering of loads/stores to the slot and null bit in the generated code: the previous version of the code exploited some properties of the particular aggregate function. I ended up replicating this behaviour to avoid regressing perf. Change-Id: Id9dc21d1d676505d3617e1e4f37557397c4fb260 Reviewed-on: http://gerrit.cloudera.org:8080/4655 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-09 03:27:12 +00:00
Tim Armstrong	381e719065	IMPALA-4266: Java udf returning string can give incorrect results The memory management of string results was wrong: strings returned from Exprs must live until the next time FreeLocalAllocations() is called. Otherwise the buffer holding the string is freed or reused by the next UDF call. The fix is to copy string values into a buffer with the right lifetime. Testing: Added a regression test based on Bharath's example that reproduced the bug reliably. Change-Id: I705d271814cb1143f67d8a12f4fd87bab7a8e161 Reviewed-on: http://gerrit.cloudera.org:8080/4941 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Internal Jenkins	2016-11-08 02:47:11 +00:00
Skye Wanderman-Milne	9f366645ab	IMPALA-3378/IMPALA-3379: fix various JNI issues This patch: 1) Removes JniUtil::Cleanup() and JniUtil::global_refs_. We never called Cleanup(), and all the jobjects in global_refs_ are meant to have the lifetime of the impalad process. This makes JniUtil::GetGlobalClassRef() and JniUtil::LocalToGlobalRef() thread-safe (which fixes IMPALA-3379). 2) Introduces a new JniUtil::FreeGlobalRef() method, which is a wrapper around the JNI DeleteGlobalRef() method. 3) Change JNI users to use the JniUtil methods instead of the JNI methods directly where appropriate. This makes error checking more consistent, and makes it easier to find all JNI uses. This is possible since GetGlobalClassRef() and LocalToGlobalRef() are now thread-safe and don't leak jobjects. 4) Removes HiveUdfCall::JniContext::cl, as well as other JNI constants, and replaces them with process-wide static singletons. It then moves the initialization to a new HiveUdfCall::Init() method which is once called in the main thread at the beginning of the process. This fixes IMPALA-3378. 5) Deletes the JniContext created for each HiveUdfCall Unfortunately I am not able to repro IMPALA-3378 so there is no test case (I didn't attempt IMPALA-3379 but it's similar). Change-Id: I8cd089e355d2ee2d5ace81f05b214272c05cf941 Reviewed-on: http://gerrit.cloudera.org:8080/2820 Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:41 -07:00
Bharath Vissapragada	ef0dac661c	IMPALA-2843: Persist hive udfs across catalog restarts This commit adds a new feature to persist hive/java udfs across catalog restarts. IMPALA-1748 already added this for non-java udfs by storing them in parameters map of the Db object and reading them back at catalog startup. However we follow a different approach for hive udfs by converting them to Hive's function format and adding them as hive functions to the metastore. This makes it possible to share udfs between hive and Impala as the udfs added from one service are accessible to other. This commit takes care of format conversions between hive and impala and the user can add a function just once in either of the services. Background: Hive and impala treat udfs differently. Hive resolves the evaluate function in the udf class at runtime depending on the data types of the input arguments. So user can add one function by name and can pass any arguments to it as long as there is a compatible evaluate function in the udf class. However Impala takes the input types of the udf as a part of function definition (that maps to only one evaluate function) and loads the function only for those set of input argument types. If we have multiple 'evaluate' methods, we need to add multiple functions one for each of them. This commit adds new variants of CREATE | DROP FUNCTIONS to Impala which let the user create and drop hive/java udfs without input argument types or return types. Catalog takes care of loading/dropping the udf signatures corresponding to each "evaluate" method in the udf symbol class. 
The syntax is as follows, CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts> DROP FUNCTION [IF EXISTS] <function name> Examples: CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf'; CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2'; DROP FUNCTION foo; DROP FUNCTION IF EXISTS bar; The older way of creating hive/java udfs with specific signature is still supported, however they are not persisted across restarts. So a restart of catalog can wipe them out. Additionally this commit also loads all the compatible java udfs added outside of Impala and they needn't be separately loaded. One thing to note here is that the functions added using the new CREATE FUNCTION can only be dropped using the new DROP FUNCTION syntax (without signature). The same rule applies for the java udfs added using the old CREATE FUNCTION syntax (with signature). Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d Reviewed-on: http://gerrit.cloudera.org:8080/2250 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Internal Jenkins	2016-02-19 23:04:03 -08:00
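
The IMPALA-2843 entry above says the catalog loads one function signature per "evaluate" method in the UDF symbol class. A minimal Java sketch of that idea follows, using plain reflection. It is not Impala's catalog code: `EvaluateSignatures`, `ExampleUdf`, and `signatures` are hypothetical names chosen for illustration, and the real implementation also maps Java types to Impala types, which is omitted here.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EvaluateSignatures {
    // Hypothetical stand-in for a Hive-style UDF class with overloaded evaluate methods.
    public static class ExampleUdf {
        public Integer evaluate(Integer a) { return a; }
        public Integer evaluate(Integer a, Integer b) { return a + b; }
        public String evaluate(String s) { return s; }
        public String helper() { return ""; } // not named "evaluate", so it is skipped
    }

    // Collects one textual signature per declared "evaluate" method.
    // Results are sorted because getDeclaredMethods() has no guaranteed order.
    public static List<String> signatures(Class<?> udfClass) {
        List<String> result = new ArrayList<>();
        for (Method m : udfClass.getDeclaredMethods()) {
            if (!m.getName().equals("evaluate") || m.isSynthetic()) continue;
            StringBuilder sb = new StringBuilder("evaluate(");
            Class<?>[] params = m.getParameterTypes();
            for (int i = 0; i < params.length; i++) {
                if (i > 0) sb.append(", ");
                sb.append(params[i].getSimpleName());
            }
            sb.append(") -> ").append(m.getReturnType().getSimpleName());
            result.add(sb.toString());
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        // Prints one line per signature that a signature-less CREATE FUNCTION would register.
        for (String s : signatures(ExampleUdf.class)) {
            System.out.println(s);
        }
    }
}
```

Under these assumptions, a class with three `evaluate` overloads yields three signatures, which matches the commit's description of one Impala function per overload.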

13 Commits