CREATE FUNCTION Statement CREATE FUNCTION

Creates a user-defined function (UDF), which you can use to implement custom logic during SELECT or INSERT operations.

The syntax is different depending on whether you create a scalar UDF, which is called once for each row and implemented by a single function, or a user-defined aggregate function (UDA), which is implemented by multiple functions that compute intermediate results across sets of rows.

In and higher, the syntax is also different for creating or dropping scalar Java-based UDFs. The statements for Java UDFs use a new syntax, without any argument types or return type specified. Java-based UDFs created using the new syntax persist across restarts of the Impala catalog server, and can be shared transparently between Impala and Hive.

To create a persistent scalar C++ UDF with CREATE FUNCTION:

CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...]) RETURNS return_type LOCATION 'hdfs_path_to_dot_so' SYMBOL='symbol_name'

To create a persistent Java UDF with CREATE FUNCTION: CREATE FUNCTION [IF NOT EXISTS] [db_name.]function_name LOCATION 'hdfs_path_to_jar' SYMBOL='class_name'

To create a persistent UDA, which must be written in C++, issue a CREATE AGGREGATE FUNCTION statement:

CREATE [AGGREGATE] FUNCTION [IF NOT EXISTS] [db_name.]function_name([arg_type[, arg_type...]) RETURNS return_type [INTERMEDIATE type_spec] LOCATION 'hdfs_path' [INIT_FN='function] UPDATE_FN='function MERGE_FN='function [PREPARE_FN='function] [CLOSEFN='function] [SERIALIZE_FN='function] [FINALIZE_FN='function]

Varargs notation:

Variable-length argument lists are supported for C++ UDFs, but currently not for Java UDFs.

If the underlying implementation of your function accepts a variable number of arguments:

See for how to code a C++ UDF to accept variable-length argument lists.

Scalar and aggregate functions:

The simplest kind of user-defined function returns a single scalar value each time it is called, typically once for each row in the result set. This general kind of function is what is usually meant by UDF. User-defined aggregate functions (UDAs) are a specialized kind of UDF that produce a single value based on the contents of multiple rows. You usually use UDAs in combination with a GROUP BY clause to condense a large result set into a smaller one, or even a single row summarizing column values across an entire table.

You create UDAs by using the CREATE AGGREGATE FUNCTION syntax. The clauses INIT_FN, UPDATE_FN, MERGE_FN, SERIALIZE_FN, FINALIZE_FN, and INTERMEDIATE only apply when you create a UDA rather than a scalar UDF.

The *_FN clauses specify functions to call at different phases of function processing.

If you use a consistent naming convention for each of the underlying functions, Impala can automatically determine the names based on the first such clause, so the others are optional.

For end-to-end examples of UDAs, see .

Impala can run UDFs that were created through Hive, as long as they refer to Impala-compatible data types (not composite or nested column types). Hive can run Java-based UDFs that were created through Impala, but not Impala UDFs written in C++.

Persistence:

For additional examples of all kinds of user-defined functions, see .

The following example shows how to take a Java jar file and make all the functions inside one of its classes into UDFs under a single (overloaded) function name in Impala. Each CREATE FUNCTION or DROP FUNCTION statement applies to all the overloaded Java functions with the same name. This example uses the signatureless syntax for CREATE FUNCTION and DROP FUNCTION, which is available in and higher.

At the start, the jar file is in the local filesystem. Then it is copied into HDFS, so that it is available for Impala to reference through the CREATE FUNCTION statement and queries that refer to the Impala function name.

$ jar -tvf udf-examples.jar 0 Mon Feb 22 04:06:50 PST 2016 META-INF/ 122 Mon Feb 22 04:06:48 PST 2016 META-INF/MANIFEST.MF 0 Mon Feb 22 04:06:46 PST 2016 org/ 0 Mon Feb 22 04:06:46 PST 2016 org/apache/ 0 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/ 2460 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/IncompatibleUdfTest.class 541 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdfException.class 3438 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/JavaUdfTest.class 5872 Mon Feb 22 04:06:46 PST 2016 org/apache/impala/TestUdf.class ... $ hdfs dfs -put udf-examples.jar /user/impala/udfs $ hdfs dfs -ls /user/impala/udfs Found 2 items -rw-r--r-- 3 jrussell supergroup 853 2015-10-09 14:05 /user/impala/udfs/hello_world.jar -rw-r--r-- 3 jrussell supergroup 7366 2016-06-08 14:25 /user/impala/udfs/udf-examples.jar

In impala-shell, the CREATE FUNCTION refers to the HDFS path of the jar file and the fully qualified class name inside the jar. Each of the functions inside the class becomes an Impala function, each one overloaded under the specified Impala function name.

[localhost:21000] > create function testudf location '/user/impala/udfs/udf-examples.jar' symbol='org.apache.impala.TestUdf'; [localhost:21000] > show functions; +-------------+---------------------------------------+-------------+---------------+ | return type | signature | binary type | is persistent | +-------------+---------------------------------------+-------------+---------------+ | BIGINT | testudf(BIGINT) | JAVA | true | | BOOLEAN | testudf(BOOLEAN) | JAVA | true | | BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | | BOOLEAN | testudf(BOOLEAN, BOOLEAN, BOOLEAN) | JAVA | true | | DOUBLE | testudf(DOUBLE) | JAVA | true | | DOUBLE | testudf(DOUBLE, DOUBLE) | JAVA | true | | DOUBLE | testudf(DOUBLE, DOUBLE, DOUBLE) | JAVA | true | | FLOAT | testudf(FLOAT) | JAVA | true | | FLOAT | testudf(FLOAT, FLOAT) | JAVA | true | | FLOAT | testudf(FLOAT, FLOAT, FLOAT) | JAVA | true | | INT | testudf(INT) | JAVA | true | | DOUBLE | testudf(INT, DOUBLE) | JAVA | true | | INT | testudf(INT, INT) | JAVA | true | | INT | testudf(INT, INT, INT) | JAVA | true | | SMALLINT | testudf(SMALLINT) | JAVA | true | | SMALLINT | testudf(SMALLINT, SMALLINT) | JAVA | true | | SMALLINT | testudf(SMALLINT, SMALLINT, SMALLINT) | JAVA | true | | STRING | testudf(STRING) | JAVA | true | | STRING | testudf(STRING, STRING) | JAVA | true | | STRING | testudf(STRING, STRING, STRING) | JAVA | true | | TINYINT | testudf(TINYINT) | JAVA | true | +-------------+---------------------------------------+-------------+---------------+

These are all simple functions that return their single arguments, or sum, concatenate, and so on their multiple arguments. Impala determines which overloaded function to use based on the number and types of the arguments.

insert into bigint_x values (1), (2), (4), (3); select testudf(x) from bigint_x; +-----------------+ | udfs.testudf(x) | +-----------------+ | 1 | | 2 | | 4 | | 3 | +-----------------+ insert into int_x values (1), (2), (4), (3); select testudf(x, x+1, x*x) from int_x; +-------------------------------+ | udfs.testudf(x, x + 1, x * x) | +-------------------------------+ | 4 | | 9 | | 25 | | 16 | +-------------------------------+ select testudf(x) from string_x; +-----------------+ | udfs.testudf(x) | +-----------------+ | one | | two | | four | | three | +-----------------+ select testudf(x,x) from string_x; +--------------------+ | udfs.testudf(x, x) | +--------------------+ | oneone | | twotwo | | fourfour | | threethree | +--------------------+

The previous example used the same Impala function name as the name of the class. This example shows how the Impala function name is independent of the underlying Java class or function names. A second CREATE FUNCTION statement results in a set of overloaded functions all named my_func, to go along with the overloaded functions all named testudf.

create function my_func location '/user/impala/udfs/udf-examples.jar' symbol='org.apache.impala.TestUdf'; show functions; +-------------+---------------------------------------+-------------+---------------+ | return type | signature | binary type | is persistent | +-------------+---------------------------------------+-------------+---------------+ | BIGINT | my_func(BIGINT) | JAVA | true | | BOOLEAN | my_func(BOOLEAN) | JAVA | true | | BOOLEAN | my_func(BOOLEAN, BOOLEAN) | JAVA | true | ... | BIGINT | testudf(BIGINT) | JAVA | true | | BOOLEAN | testudf(BOOLEAN) | JAVA | true | | BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | ...

The corresponding DROP FUNCTION statement with no signature drops all the overloaded functions with that name.

drop function my_func; show functions; +-------------+---------------------------------------+-------------+---------------+ | return type | signature | binary type | is persistent | +-------------+---------------------------------------+-------------+---------------+ | BIGINT | testudf(BIGINT) | JAVA | true | | BOOLEAN | testudf(BOOLEAN) | JAVA | true | | BOOLEAN | testudf(BOOLEAN, BOOLEAN) | JAVA | true | ...

The signatureless CREATE FUNCTION syntax for Java UDFs ensures that the functions shown in this example remain available after the Impala service (specifically, the Catalog Server) are restarted.

for more background information, usage instructions, and examples for Impala UDFs;