<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept rev="1.2" id="udfs">

<title>User-Defined Functions (UDFs)</title>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Impala Functions"/>
<data name="Category" value="UDFs"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p>
User-defined functions (frequently abbreviated as UDFs) let you code your own application logic for
processing column values during an Impala query. For example, a UDF could perform calculations using an
external math library, combine several column values into one, do geospatial calculations, or perform other
kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
</p>

<p>
You can use UDFs to simplify query logic when producing reports, or to transform data in flexible ways when
copying from one table to another with the <codeph>INSERT ... SELECT</codeph> syntax.
</p>

<p>You might be familiar with this feature from other database products,
under names such as stored functions or stored routines.</p>

<p>
Impala support for UDFs is available in Impala 1.2 and higher:
</p>

<ul>
<li>
In Impala 1.1, using UDFs in a query required using the Hive shell. (Because Impala and Hive share the same
metastore database, you could switch to Hive to run just those queries requiring UDFs, then switch back to
Impala.)
</li>

<li>
Starting in Impala 1.2, Impala can run both high-performance native code UDFs written in C++, and
Java-based Hive UDFs that you might already have written.
</li>

<li>
Impala can run scalar UDFs that return a single value for each row of the result set, and user-defined
aggregate functions (UDAFs) that return a value based on a set of rows. Currently, Impala does not support
user-defined table functions (UDTFs) or window functions.
</li>
</ul>

<p outputclass="toc inpage"/>
</conbody>

<concept id="udf_concepts">

<title>UDF Concepts</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
</metadata>
</prolog>

<conbody>

<p>
Depending on your use case, you might write all-new functions, reuse Java UDFs that you have already
written for Hive, or port Hive Java UDF code to higher-performance native Impala UDFs in C++. You can code
either scalar functions for producing results one row at a time, or more complex aggregate functions for
doing analysis across sets of rows. The following sections discuss these different aspects of working with UDFs.
</p>

<p outputclass="toc inpage"/>
</conbody>

<concept id="udfs_udafs">

<title>UDFs and UDAFs</title>

<conbody>

<p>
Depending on your use case, the user-defined functions (UDFs) you write might accept or produce different
numbers of input and output values:
</p>

<ul>
<li>
The most general kind of user-defined function (the one typically referred to by the abbreviation UDF)
takes a single input value and produces a single output value. When used in a query, it is called once
for each row in the result set. For example:
<codeblock>select customer_name, is_frequent_customer(customer_id) from customers;
select obfuscate(sensitive_column) from sensitive_data;</codeblock>
</li>

<li>
A user-defined aggregate function (UDAF) accepts a group of values and returns a single value. You use
UDAFs to summarize and condense sets of rows, in the same style as the built-in <codeph>COUNT()</codeph>,
<codeph>MAX()</codeph>, <codeph>SUM()</codeph>, and <codeph>AVG()</codeph> functions. When called in a
query that uses the <codeph>GROUP BY</codeph> clause, the function is called once for each combination
of <codeph>GROUP BY</codeph> values. For example:
<codeblock>-- Evaluates multiple rows but returns a single value.
select closest_restaurant(latitude, longitude) from places;

-- Evaluates batches of rows and returns a separate value for each batch.
select most_profitable_location(store_id, sales, expenses, tax_rate, depreciation) from franchise_data group by year;</codeblock>
</li>

<li>
Currently, Impala does not support other categories of user-defined functions, such as user-defined
table functions (UDTFs) or window functions.
</li>
</ul>
</conbody>
</concept>
<concept id="native_udfs">

<title>Native Impala UDFs</title>

<conbody>

<p>
Impala supports UDFs written in C++, in addition to supporting existing Hive UDFs written in Java.
Where practical, use C++ UDFs because the compiled native code can yield higher performance, with
UDF execution time often 10x faster for a C++ UDF than the equivalent Java UDF.
</p>
</conbody>
</concept>
<concept id="udfs_hive">

<title>Using Hive UDFs with Impala</title>

<conbody>

<p>
Impala can run Java-based user-defined functions (UDFs), originally written for Hive, with no changes,
subject to the following conditions:
</p>

<ul>
<li>
The parameters and return value must all use scalar data types supported by Impala. For example, complex or nested
types are not supported.
</li>

<li>
Hive/Java UDFs must extend the
<codeph>org.apache.hadoop.hive.ql.exec.UDF</codeph> class.
</li>

<li>
Currently, Hive UDFs that accept or return the <codeph>TIMESTAMP</codeph> type are not supported.
</li>

<li>
Prior to <keyword keyref="impala25_full"/>, the return type must be a <q>Writable</q> type such as <codeph>Text</codeph> or
<codeph>IntWritable</codeph>, rather than a Java primitive type such as <codeph>String</codeph> or
<codeph>int</codeph>. Otherwise, the UDF returns <codeph>NULL</codeph>.
<ph rev="2.5.0">In <keyword keyref="impala25_full"/> and higher, this restriction is lifted, and both
UDF arguments and return values can be Java primitive types.</ph>
</li>

<li>
Hive UDAFs and UDTFs are not supported.
</li>

<li>
Typically, a Java UDF will execute several times slower in Impala than the equivalent native UDF
written in C++.
</li>

<li rev="2.5.0 IMPALA-2843">
In <keyword keyref="impala25_full"/> and higher, you can transparently call Hive Java UDFs through Impala,
or call Impala Java UDFs through Hive. This feature does not apply to built-in Hive functions.
Any Impala Java UDFs created with older versions must be re-created using the new <codeph>CREATE FUNCTION</codeph>
syntax, without any signature for arguments or the return value.
</li>
</ul>

<p>
To take full advantage of the Impala architecture and performance features, you can also write
Impala-specific UDFs in C++.
</p>

<p>
For background about Java-based Hive UDFs, see the
<xref href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF" scope="external" format="html">Hive
documentation for UDFs</xref>. For examples or tutorials for writing such UDFs, search the web for
related blog posts.
</p>
<p>
The ideal way to understand how to reuse Java-based UDFs (originally written for Hive) with Impala is to
take some of the Hive built-in functions (implemented as Java UDFs) and run the applicable JAR files
through the UDF deployment process for Impala, creating new UDFs with different names:
</p>

<ol>
<li>
Take a copy of the Hive JAR file containing the Hive built-in functions. For example, the path might be
like <filepath>/usr/lib/hive/lib/hive-exec-0.10.0.jar</filepath>, with different version
numbers corresponding to your specific level of <keyword keyref="distro"/>.
</li>

<li>
Use <codeph>jar tf <varname>jar_file</varname></codeph> to see a list of the classes inside the JAR.
You will see names like <codeph>org/apache/hadoop/hive/ql/udf/UDFLower.class</codeph> and
<codeph>org/apache/hadoop/hive/ql/udf/UDFOPNegative.class</codeph>. Make a note of the names of the
functions you want to experiment with. When you specify the entry points for the Impala <codeph>CREATE
FUNCTION</codeph> statement, change the slash characters to dots and strip off the
<codeph>.class</codeph> suffix, for example <codeph>org.apache.hadoop.hive.ql.udf.UDFLower</codeph> and
<codeph>org.apache.hadoop.hive.ql.udf.UDFOPNegative</codeph>.
</li>

<li>
Copy that file to an HDFS location that Impala can read. (In the examples here, we renamed the file to
<filepath>hive-builtins.jar</filepath> in HDFS for simplicity.)
</li>

<li>
For each Java-based UDF that you want to call through Impala, issue a <codeph>CREATE FUNCTION</codeph>
statement, with a <codeph>LOCATION</codeph> clause containing the full HDFS path of the JAR file, and a
<codeph>SYMBOL</codeph> clause with the fully qualified name of the class, using dots as separators and
without the <codeph>.class</codeph> extension. Remember that user-defined functions are associated with
a particular database, so issue a <codeph>USE</codeph> statement for the appropriate database first, or
specify the SQL function name as
<codeph><varname>db_name</varname>.<varname>function_name</varname></codeph>. Use completely new names
for the SQL functions, because Impala UDFs cannot have the same name as Impala built-in functions.
</li>

<li>
Call the function from your queries, passing arguments of the correct type to match the function
signature. These arguments could be references to columns, arithmetic or other kinds of expressions,
the results of <codeph>CAST</codeph> functions to ensure correct data types, and so on.
</li>
</ol>

<note>
<p conref="../shared/impala_common.xml#common/refresh_functions_tip"/>
</note>
<example>

<title>Java UDF Example: Reusing lower() Function</title>

<p>
For example, the following <cmdname>impala-shell</cmdname> session creates an Impala UDF
<codeph>my_lower()</codeph> that reuses the Java code for the Hive <codeph>lower()</codeph> built-in
function. We cannot call it <codeph>lower()</codeph> because Impala does not allow UDFs to have the
same name as built-in functions. From SQL, we call the function in a basic way (in a query with no
<codeph>WHERE</codeph> clause), directly on a column, and on the results of a string expression:
</p>

<!-- To do: adapt for signatureless syntax per IMPALA-2843. -->
<codeblock>[localhost:21000] > create database udfs;
[localhost:21000] > use udfs;
[localhost:21000] > create function lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';
ERROR: AnalysisException: Function cannot have the same name as a builtin: lower
[localhost:21000] > create function my_lower(string) returns string location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFLower';
[localhost:21000] > select my_lower('Some String NOT ALREADY LOWERCASE');
+----------------------------------------------------+
| udfs.my_lower('some string not already lowercase') |
+----------------------------------------------------+
| some string not already lowercase                  |
+----------------------------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > create table t2 (s string);
[localhost:21000] > insert into t2 values ('lower'),('UPPER'),('Init cap'),('CamelCase');
Inserted 4 rows in 2.28s
[localhost:21000] > select * from t2;
+-----------+
| s         |
+-----------+
| lower     |
| UPPER     |
| Init cap  |
| CamelCase |
+-----------+
Returned 4 row(s) in 0.47s
[localhost:21000] > select my_lower(s) from t2;
+------------------+
| udfs.my_lower(s) |
+------------------+
| lower            |
| upper            |
| init cap         |
| camelcase        |
+------------------+
Returned 4 row(s) in 0.54s
[localhost:21000] > select my_lower(concat('ABC ',s,' XYZ')) from t2;
+------------------------------------------+
| udfs.my_lower(concat('abc ', s, ' xyz')) |
+------------------------------------------+
| abc lower xyz                            |
| abc upper xyz                            |
| abc init cap xyz                         |
| abc camelcase xyz                        |
+------------------------------------------+
Returned 4 row(s) in 0.22s</codeblock>

</example>
<example>

<title>Java UDF Example: Reusing negative() Function</title>

<p>
Here is an example that reuses the Hive Java code for the <codeph>negative()</codeph> built-in
function. This example demonstrates how the data types of the arguments must match precisely with the
function signature. At first, we create an Impala SQL function that can only accept an integer
argument. Impala cannot find a matching function when the query passes a floating-point argument,
although we can call the integer version of the function by casting the argument. Then we overload the
same function name to also accept a floating-point argument.
</p>

<codeblock>[localhost:21000] > create table t (x int);
[localhost:21000] > insert into t values (1), (2), (4), (100);
Inserted 4 rows in 1.43s
[localhost:21000] > create function my_neg(bigint) returns bigint location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFOPNegative';
[localhost:21000] > select my_neg(4);
+----------------+
| udfs.my_neg(4) |
+----------------+
| -4             |
+----------------+
[localhost:21000] > select my_neg(x) from t;
+----------------+
| udfs.my_neg(x) |
+----------------+
| -1             |
| -2             |
| -4             |
| -100           |
+----------------+
Returned 4 row(s) in 0.60s
[localhost:21000] > select my_neg(4.0);
ERROR: AnalysisException: No matching function with signature: udfs.my_neg(FLOAT).
[localhost:21000] > select my_neg(cast(4.0 as int));
+-------------------------------+
| udfs.my_neg(cast(4.0 as int)) |
+-------------------------------+
| -4                            |
+-------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > create function my_neg(double) returns double location '/user/hive/udfs/hive.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFOPNegative';
[localhost:21000] > select my_neg(4.0);
+------------------+
| udfs.my_neg(4.0) |
+------------------+
| -4               |
+------------------+
Returned 1 row(s) in 0.11s</codeblock>

<p audience="hidden">
You can find the sample files mentioned here in <xref keyref="udf_samples"/>.
</p>

</example>
</conbody>
</concept>
</concept>
<concept id="udf_runtime">
<title>Runtime Environment for UDFs</title>
<conbody>
<p>
By default, Impala copies UDFs into <filepath>/tmp</filepath>,
and you can configure this location through the <codeph>--local_library_dir</codeph>
startup flag for the <cmdname>impalad</cmdname> daemon.
</p>
</conbody>
</concept>
<concept id="udf_demo_env">

<title>Installing the UDF Development Package</title>

<conbody>

<p rev="">
To develop UDFs for Impala, download and install the <codeph>impala-udf-devel</codeph> package (RHEL-based
distributions) or <codeph>impala-udf-dev</codeph> (Ubuntu and Debian). This package contains
header files, sample source, and build configuration files.
</p>

<ol>
<li audience="hidden">
Start at <xref keyref="archive_root"/>.
</li>

<li>
Locate the appropriate <codeph>.repo</codeph> or list file for your operating system version.
</li>

<li>
Use the familiar <codeph>yum</codeph>, <codeph>zypper</codeph>, or <codeph>apt-get</codeph> commands
depending on your operating system. For the package name, specify <codeph>impala-udf-devel</codeph>
(RHEL-based distributions) or <codeph>impala-udf-dev</codeph> (Ubuntu and Debian).
</li>
</ol>

<note>
The UDF development code does not rely on Impala being installed on the same machine. You can write and
compile UDFs on a minimal development system, then deploy them on a different one for use with Impala.
</note>

<p>
When you are ready to start writing your own UDFs, download the sample code and build scripts from
<xref keyref="udf-samples">the Impala sample UDF GitHub repository</xref>.
Then see <xref href="impala_udf.xml#udf_coding"/> for how to code UDFs, and
<xref href="impala_udf.xml#udf_tutorial"/> for how to build and run UDFs.
</p>
</conbody>
</concept>
<concept id="udf_coding">

<title>Writing User-Defined Functions (UDFs)</title>

<conbody>

<p>
Before starting UDF development, make sure to install the development package and download the UDF code
samples, as described in <xref href="#udf_demo_env"/>.
</p>

<p>
When writing UDFs:
</p>

<ul>
<li>
Keep in mind the data type differences as you transfer values from the high-level SQL to your lower-level
UDF code. For example, in the UDF code you might be much more aware of how many bytes different kinds of
integers require.
</li>

<li>
Use best practices for function-oriented programming: choose arguments carefully, avoid side effects,
make each function do a single thing, and so on.
</li>
</ul>
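The point about integer widths can be made concrete with static assertions. This is an illustrative sketch, not Impala code: it assumes only that the UDF value structs carry their payloads in the standard fixed-width types noted in the comments.

```cpp
#include <cstdint>

// Each SQL integer type maps to a fixed-width payload in the corresponding
// value struct (TinyIntVal, SmallIntVal, IntVal, BigIntVal). These widths are
// what UDF code actually manipulates, unlike the abstract SQL type names.
static_assert(sizeof(int8_t)  == 1, "TINYINT payload is 1 byte");
static_assert(sizeof(int16_t) == 2, "SMALLINT payload is 2 bytes");
static_assert(sizeof(int32_t) == 4, "INT payload is 4 bytes");
static_assert(sizeof(int64_t) == 8, "BIGINT payload is 8 bytes");
```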
<p outputclass="toc inpage"/>
</conbody>

<concept id="udf_exploring">

<title>Getting Started with UDF Coding</title>
<prolog>
<metadata>
<!-- OK, this is not something a Hadoop newbie would tackle, but being lenient and inclusive in this initial pass, so including the GS tag. -->
<data name="Category" value="Getting Started"/>
</metadata>
</prolog>

<conbody>

<p>
To understand the layout and member variables and functions of the predefined UDF data types, examine the
header file <filepath>/usr/include/impala_udf/udf.h</filepath>:
</p>

<codeblock>// This is the only Impala header required to develop UDFs and UDAs. This header
// contains the types that need to be used and the FunctionContext object. The context
// object serves as the interface object between the UDF/UDA and the impala process.</codeblock>

<p>
For the basic declarations needed to write a scalar UDF, see the header file
<xref keyref="udf-sample.h"><filepath>udf-sample.h</filepath></xref>
within the sample build environment, which defines a simple function
named <codeph>AddUdf()</codeph>:
</p>

<codeblock>#ifndef IMPALA_UDF_SAMPLE_UDF_H
#define IMPALA_UDF_SAMPLE_UDF_H

#include <impala_udf/udf.h>

using namespace impala_udf;

IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2);

#endif
</codeblock>

<p>
For sample C++ code for a simple function named <codeph>AddUdf()</codeph>, see the source file
<filepath>udf-sample.cc</filepath> within the sample build environment:
</p>

<codeblock>#include "udf-sample.h"

// In this sample we are declaring a UDF that adds two ints and returns an int.
IntVal AddUdf(FunctionContext* context, const IntVal& arg1, const IntVal& arg2) {
  if (arg1.is_null || arg2.is_null) return IntVal::null();
  return IntVal(arg1.val + arg2.val);
}

// Multiple UDFs can be defined in the same file</codeblock>
</conbody>
</concept>
<concept id="udfs_args">

<title>Data Types for Function Arguments and Return Values</title>

<conbody>

<p>
Each value that a user-defined function can accept as an argument or return as a result value must map to
a SQL data type that you could specify for a table column.
</p>

<p conref="../shared/impala_common.xml#common/udfs_no_complex_types"/>

<p>
Each data type has a corresponding structure defined in the C++ and Java header files, with two member
fields and some predefined comparison operators and constructors:
</p>

<ul>
<li>
<p>
<codeph>is_null</codeph> indicates whether the value is <codeph>NULL</codeph> or not.
<codeph>val</codeph> holds the actual argument or return value when it is non-<codeph>NULL</codeph>.
</p>
</li>

<li>
<p>
Each struct also defines a <codeph>null()</codeph> member function that constructs an instance of the
struct with the <codeph>is_null</codeph> flag set.
</p>
</li>

<li>
<p>
The built-in SQL comparison operators and clauses such as <codeph><</codeph>,
<codeph>>=</codeph>, <codeph>BETWEEN</codeph>, and <codeph>ORDER BY</codeph> all work
automatically based on the SQL return type of each UDF. For example, Impala knows how to evaluate
<codeph>BETWEEN 1 AND udf_returning_int(col1)</codeph> or <codeph>ORDER BY
udf_returning_string(col2)</codeph> without you declaring any comparison operators within the UDF
itself.
</p>
<p>
For convenience within your UDF code, each struct defines <codeph>==</codeph> and <codeph>!=</codeph>
operators for comparisons with other structs of the same type. These are for typical C++ comparisons
within your own code, not necessarily reproducing SQL semantics. For example, if the
<codeph>is_null</codeph> flag is set in both structs, they compare as equal. That behavior of
<codeph>null</codeph> comparisons is different from SQL (where <codeph>NULL == NULL</codeph> is
<codeph>NULL</codeph> rather than <codeph>true</codeph>), but more in line with typical C++ behavior.
</p>
</li>

<li>
<p>
Each kind of struct has one or more constructors that define a filled-in instance of the struct,
optionally with default values.
</p>
</li>

<li>
<p>
Impala cannot process UDFs that accept composite or nested types
as arguments or return them as result values. This limitation
applies both to Impala UDFs written in C++ and Java-based Hive
UDFs.
</p>
</li>

<li>
<p>
You can overload functions by creating multiple functions with the same SQL name but different
argument types. For overloaded functions, you must use different C++ or Java entry point names in the
underlying functions.
</p>
</li>
</ul>
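The bullet points above can be sketched as a minimal stand-in struct. The real definitions live in <impala_udf/udf.h>; the name IntValSketch and this reduced layout are illustrative only, but they show the is_null/val pairing, the null() factory, and the C++-style equality in which two NULLs compare equal (unlike SQL).

```cpp
#include <cassert>
#include <cstdint>

// Illustrative reduction of the pattern used by the impala_udf value structs:
// a null flag plus a raw value, a null() factory, and operator==/operator!=
// that treat two NULL instances as equal (C++ convenience, not SQL semantics).
struct IntValSketch {
  bool is_null;
  int32_t val;

  IntValSketch() : is_null(false), val(0) {}
  explicit IntValSketch(int32_t v) : is_null(false), val(v) {}

  static IntValSketch null() {
    IntValSketch v;
    v.is_null = true;
    return v;
  }

  bool operator==(const IntValSketch& other) const {
    if (is_null != other.is_null) return false;
    return is_null || val == other.val;  // two NULLs compare as equal here
  }
  bool operator!=(const IntValSketch& other) const { return !(*this == other); }
};
```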
<p>
The data types defined on the C++ side (in <filepath>/usr/include/impala_udf/udf.h</filepath>) are:
</p>

<ul>
<li>
<p>
<codeph>IntVal</codeph> represents an <codeph>INT</codeph> column.
</p>
</li>

<li>
<p>
<codeph>BigIntVal</codeph> represents a <codeph>BIGINT</codeph> column. Even if you do not need the
full range of a <codeph>BIGINT</codeph> value, it can be useful to code your function arguments as
<codeph>BigIntVal</codeph> to make it convenient to call the function with different kinds of integer
columns and expressions as arguments. Impala automatically casts smaller integer types to larger ones
when appropriate, but does not implicitly cast large integer types to smaller ones.
</p>
</li>

<li>
<p>
<codeph>SmallIntVal</codeph> represents a <codeph>SMALLINT</codeph> column.
</p>
</li>

<li>
<p>
<codeph>TinyIntVal</codeph> represents a <codeph>TINYINT</codeph> column.
</p>
</li>

<li>
<p>
<codeph>StringVal</codeph> represents a <codeph>STRING</codeph> column. It has a <codeph>len</codeph>
field representing the length of the string, and a <codeph>ptr</codeph> field pointing to the string
data. It has constructors that create a new <codeph>StringVal</codeph> struct based on a
null-terminated C-style string, or a pointer plus a length; these new structs still refer to the
original string data rather than allocating a new buffer for the data. It also has a constructor that
takes a pointer to a <codeph>FunctionContext</codeph> struct and a length; this constructor does allocate
space for a new copy of the string data, for use in UDFs that return string values.
</p>
</li>

<li>
<p>
<codeph>BooleanVal</codeph> represents a <codeph>BOOLEAN</codeph> column.
</p>
</li>

<li>
<p>
<codeph>FloatVal</codeph> represents a <codeph>FLOAT</codeph> column.
</p>
</li>

<li>
<p>
<codeph>DoubleVal</codeph> represents a <codeph>DOUBLE</codeph> column.
</p>
</li>

<li>
<p>
<codeph>TimestampVal</codeph> represents a <codeph>TIMESTAMP</codeph> column. It has a
<codeph>date</codeph> field, a 32-bit integer representing the Gregorian date, that is, the number of
days past the epoch date. It also has a <codeph>time_of_day</codeph> field, a 64-bit integer representing
the current time of day in nanoseconds.
</p>
</li>

<!--
<li>
<p>
<codeph>AnyVal</codeph> is the parent type of all the other
structs. They inherit the <codeph>is_null</codeph> field from it.
You do not use this type directly in your code.
</p>
</li>
-->
</ul>
</conbody>
</concept>
<concept id="udf_varargs">

<title>Variable-Length Argument Lists</title>

<conbody>

<p>
UDFs typically take a fixed number of arguments, with each one named explicitly in the signature of your
C++ function. Your function can also accept additional optional arguments, all of the same type. For
example, you can concatenate two strings, three strings, four strings, and so on. Or you can compare two
numbers, three numbers, four numbers, and so on.
</p>

<p>
To accept a variable-length argument list, code the signature of your function like this:
</p>

<codeblock>StringVal Concat(FunctionContext* context, const StringVal& separator,
  int num_var_args, const StringVal* args);</codeblock>

<p>
In the <codeph>CREATE FUNCTION</codeph> statement, after the type of the first optional argument, include
<codeph>...</codeph> to indicate it could be followed by more arguments of the same type. For example,
the following function accepts a <codeph>STRING</codeph> argument, followed by one or more additional
<codeph>STRING</codeph> arguments:
</p>

<codeblock>[localhost:21000] > create function my_concat(string, string ...) returns string location '/user/test_user/udfs/sample.so' symbol='Concat';
</codeblock>

<p>
The call from the SQL query must pass at least one argument to the variable-length portion of the
argument list.
</p>

<p>
When Impala calls the function, it fills in the initial set of required arguments, then passes the number
of extra arguments and a pointer to the first of those optional arguments.
</p>
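A hypothetical body for a Concat like the one declared earlier in this section could walk the trailing array as follows. StringVal is replaced here by a bare stand-in (StrArg) so the sketch compiles on its own; the argument order mirrors the real signature: fixed arguments first, then the count of optional arguments and a pointer to the first of them.

```cpp
#include <cassert>
#include <string>

// Stand-in for impala_udf::StringVal, reduced for illustration.
struct StrArg { std::string s; };

// Mirrors the variadic-UDF calling convention: the fixed separator argument,
// then how many optional arguments follow, then a pointer to the first one.
std::string ConcatSketch(const StrArg& separator, int num_var_args,
                         const StrArg* args) {
  std::string out;
  for (int i = 0; i < num_var_args; ++i) {
    if (i > 0) out += separator.s;  // separator only between items
    out += args[i].s;
  }
  return out;
}
```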
</conbody>
</concept>

<concept id="udf_null">

<title>Handling NULL Values</title>

<conbody>

<p>
For correctness, performance, and reliability, it is important for each UDF to handle all situations
where any <codeph>NULL</codeph> values are passed to your function. For example, when passed a
<codeph>NULL</codeph>, UDFs typically also return <codeph>NULL</codeph>. In an aggregate function, which
could be passed a combination of real and <codeph>NULL</codeph> values, you might make the final value
into a <codeph>NULL</codeph> (as in <codeph>CONCAT()</codeph>), ignore the <codeph>NULL</codeph> value
(as in <codeph>AVG()</codeph>), or treat it the same as a numeric zero or empty string.
</p>

<p>
Each parameter type, such as <codeph>IntVal</codeph> or <codeph>StringVal</codeph>, has an
<codeph>is_null</codeph> Boolean member.
<!--
If your function has no effect when passed <codeph>NULL</codeph>
values,
-->
Test this flag immediately for each argument to your function, and if it is set, do not refer to the
<codeph>val</codeph> field of the argument structure. The <codeph>val</codeph> field is undefined when
the argument is <codeph>NULL</codeph>, so your function could go into an infinite loop or produce
incorrect results if you skip the special handling for <codeph>NULL</codeph>.
<!-- and return if so.
For <codeph>void</codeph> intermediate functions
within UDAs, you can return without specifying a value.
-->
</p>

<p>
If your function returns <codeph>NULL</codeph> when passed a <codeph>NULL</codeph> value, or in other
cases such as when a search string is not found, you can construct a null instance of the return type by
using its <codeph>null()</codeph> member function.
</p>
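The rule of testing is_null before touching val can be illustrated with an add function in the style of the AddUdf() sample shown earlier. The types here are reduced stand-ins so the sketch is self-contained; in real UDF code they would be the impala_udf types.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for impala_udf::IntVal, reduced for illustration.
struct IntValSketch {
  bool is_null;
  int32_t val;
  IntValSketch() : is_null(false), val(0) {}
  explicit IntValSketch(int32_t v) : is_null(false), val(v) {}
  static IntValSketch null() { IntValSketch v; v.is_null = true; return v; }
};

// Check is_null before reading val: val is undefined when the argument is NULL.
IntValSketch AddSketch(const IntValSketch& a, const IntValSketch& b) {
  if (a.is_null || b.is_null) return IntValSketch::null();  // NULL in, NULL out
  return IntValSketch(a.val + b.val);
}
```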
</conbody>
</concept>

<concept id="udf_malloc">

<title>Memory Allocation for UDFs</title>
<prolog>
<metadata>
<data name="Category" value="Memory"/>
</metadata>
</prolog>

<conbody>

<p>
By default, memory allocated within a UDF is deallocated when the function exits, which could be before
the query is finished. The input arguments remain allocated for the lifetime of the function, so you can
refer to them in the expressions for your return values. If you use temporary variables to construct
all-new string values, use the <codeph>StringVal()</codeph> constructor that takes an initial
<codeph>FunctionContext*</codeph> argument followed by a length, and copy the data into the newly
allocated memory buffer.
</p>
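The copy-out pattern can be sketched with a stand-in context. In a real UDF the allocation goes through the StringVal(FunctionContext*, len) constructor; CtxSketch and CopyOutSketch below are invented names that only illustrate the shape: allocate through the context so the buffer outlives the call, then copy the temporary result into it.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Stand-in for context-managed allocation: buffers obtained here survive the
// UDF call, unlike a local std::string that dies when the function returns.
struct CtxSketch {
  std::vector<std::vector<uint8_t>> buffers;
  uint8_t* Allocate(size_t len) {
    buffers.emplace_back(len);
    return buffers.back().data();
  }
};

// Mirrors the idiom: allocate through the context, then memcpy the temporary
// result into the newly allocated buffer before returning it.
std::pair<uint8_t*, size_t> CopyOutSketch(CtxSketch* ctx, const std::string& tmp) {
  uint8_t* ptr = ctx->Allocate(tmp.size());
  std::memcpy(ptr, tmp.data(), tmp.size());
  return {ptr, tmp.size()};
}
```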
</conbody>
</concept>

<concept rev="1.3.0" id="udf_threads">

<title>Thread-Safe Work Area for UDFs</title>

<conbody>

<p>
One way to improve performance of UDFs is to specify the optional <codeph>PREPARE_FN</codeph> and
<codeph>CLOSE_FN</codeph> clauses on the <codeph>CREATE FUNCTION</codeph> statement. The <q>prepare</q>
function sets up a thread-safe data structure in memory that you can use as a work area. The <q>close</q>
function deallocates that memory. Each subsequent call to the UDF within the same thread can access that
same memory area. There might be several such memory areas allocated on the same host, as UDFs are
parallelized using multiple threads.
</p>

<p>
Within this work area, you can set up predefined lookup tables, or record the results of complex
operations on data types such as <codeph>STRING</codeph> or <codeph>TIMESTAMP</codeph>. Saving the
results of previous computations rather than repeating the computation each time is an optimization known
as <xref href="http://en.wikipedia.org/wiki/Memoization" scope="external" format="html">memoization</xref>. For example,
if your UDF performs a regular expression match or date manipulation on a column that repeats the same
value over and over, you could store the last-computed value or a hash table of already-computed values,
and do a fast lookup to find the result for subsequent iterations of the UDF.
</p>

<p>
Each such function must have the signature:
</p>

<codeblock>void <varname>function_name</varname>(impala_udf::FunctionContext*, impala_udf::FunctionContext::FunctionScope)
</codeblock>

<p>
Currently, only <codeph>THREAD_SCOPE</codeph> is implemented, not <codeph>FRAGMENT_SCOPE</codeph>. See
<filepath>udf.h</filepath> for details about the scope values.
</p>
|
|
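<p>
A minimal sketch of the prepare/close pattern with a per-thread memoization cache, assuming a mock context
object; in a real UDF, the pointer to the work area is stored and retrieved through
<codeph>FunctionContext::SetFunctionState()</codeph> and <codeph>GetFunctionState()</codeph> (see
<filepath>udf.h</filepath>).
</p>
<codeblock>
```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Illustrative stand-in for the state-holding context.
struct ContextMock {
  void* state = nullptr;
};

using Cache = std::unordered_map<std::string, int>;

void MyPrepare(ContextMock* ctx) { ctx->state = new Cache(); }  // PREPARE_FN role
void MyClose(ContextMock* ctx) {                                // CLOSE_FN role
  delete static_cast<Cache*>(ctx->state);
}

// Per-row function: on a repeated input value, skip the expensive work and
// return the memoized result from the thread's work area.
int CachedLength(ContextMock* ctx, const std::string& s) {
  Cache* cache = static_cast<Cache*>(ctx->state);
  auto it = cache->find(s);
  if (it != cache->end()) return it->second;  // fast lookup for repeats
  int result = static_cast<int>(s.size());    // stand-in for the real work
  (*cache)[s] = result;
  return result;
}
```
</codeblock>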
</conbody>
</concept>

<concept id="udf_error_handling">

<title>Error Handling for UDFs</title>
<prolog>
<metadata>
<!-- A little bit of a stretch, but if you're doing UDFs and you need to debug you might look up Troubleshooting. -->
<data name="Category" value="Troubleshooting"/>
</metadata>
</prolog>

<conbody>

<p>
To handle errors in UDFs, you call functions that are members of the initial
<codeph>FunctionContext*</codeph> argument passed to your function.
</p>

<p>
A UDF can record one or more warnings, for conditions that indicate minor, recoverable problems that do
not cause the query to stop. The signature for this function is:
</p>

<codeblock>bool AddWarning(const char* warning_msg);</codeblock>

<p>
For a serious problem that requires cancelling the query, a UDF can set an error flag that prevents the
query from returning any results. The signature for this function is:
</p>

<codeblock>void SetError(const char* error_msg);</codeblock>
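<p>
A sketch of how the two calls divide the work, using a mock context that records the messages in memory
(the <codeph>SafeRatio()</codeph> function and its messages are made up for illustration; the real
<codeph>FunctionContext</codeph> forwards warnings to the query log and fails the query after
<codeph>SetError()</codeph>):
</p>
<codeblock>
```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative mock of the error-reporting members of FunctionContext.
struct ErrorContextMock {
  std::vector<std::string> warnings;
  std::string error;
  bool AddWarning(const char* warning_msg) {
    warnings.push_back(warning_msg);
    return true;
  }
  void SetError(const char* error_msg) { error = error_msg; }
};

// Divide two numbers, warning on a recoverable condition and flagging a
// fatal error on division by zero.
double SafeRatio(ErrorContextMock* ctx, double num, double denom) {
  if (denom == 0) {
    ctx->SetError("division by zero");  // serious: cancel the query
    return 0;
  }
  if (num < 0) ctx->AddWarning("negative numerator");  // minor: keep going
  return num / denom;
}
```
</codeblock>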
</conbody>
</concept>
</concept>

<concept id="udafs">

<title>Writing User-Defined Aggregate Functions (UDAFs)</title>

<conbody>

<p>
User-defined aggregate functions (UDAFs or UDAs) are a powerful and flexible category of user-defined
functions. If a query processes N rows, calling a UDAF during the query condenses the result set, to
anywhere from a single value (such as with the <codeph>SUM</codeph> or <codeph>MAX</codeph> functions) to
some number of values less than or equal to N (as in queries using the <codeph>GROUP BY</codeph> or
<codeph>HAVING</codeph> clause).
</p>

<p outputclass="toc inpage"/>

</conbody>
<concept id="uda_functions">

<title>The Underlying Functions for a UDA</title>

<conbody>

<p>
A UDAF must maintain a state value across subsequent calls, so that it can accumulate a result across a
set of calls, rather than derive it purely from one set of arguments. For that reason, a UDAF is
represented by multiple underlying functions:
</p>

<ul>
<li>
An initialization function that sets any counters to zero, creates empty buffers, and does any other
one-time setup for a query.
</li>
<li>
An update function that processes the arguments for each row in the query result set and accumulates an
intermediate result for each node. For example, this function might increment a counter, append to a
string buffer, or set flags.
</li>
<li>
A merge function that combines the intermediate results from two different nodes.
</li>
<li rev="2.0.0">
A serialize function that flattens any intermediate values containing pointers, and frees any memory
allocated during the init, update, and merge phases.
</li>
<li>
A finalize function that either passes through the combined result unchanged, or does one final
transformation.
</li>
</ul>

<p>
Intermediate values returned by the init, update, and merge functions that refer to memory allocations
must be allocated using <codeph>FunctionContext::Allocate()</codeph> and freed using <codeph>FunctionContext::Free()</codeph>.
Both the serialize and finalize functions are responsible for cleaning up the intermediate value and freeing such allocations.
<codeph>StringVal</codeph> results returned to Impala directly by the <codeph>Serialize()</codeph>, <codeph>Finalize()</codeph>, or <codeph>GetValue()</codeph> functions must be backed by
temporary results memory allocated using the <codeph>StringVal(FunctionContext*, int)</codeph> constructor,
<codeph>StringVal::CopyFrom(FunctionContext*, const uint8_t*, size_t)</codeph>, or <codeph>StringVal::Resize()</codeph>.
</p>
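<p>
The ownership rule can be sketched with mock types; the live-allocation counter below verifies that the
serialize step frees the intermediate allocation (plain <codeph>new[]</codeph> stands in for both
context-owned and Impala-owned memory, and the function names are illustrative):
</p>
<codeblock>
```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Mock allocator mirroring FunctionContext::Allocate()/Free().
struct AllocContextMock {
  int live = 0;
  uint8_t* Allocate(int len) { ++live; return new uint8_t[len]; }
  void Free(uint8_t* ptr) { --live; delete[] ptr; }
};

struct BufVal { uint8_t* ptr; int len; };      // stand-in for StringVal
struct AvgState { double sum; int64_t count; };

// Init allocates the intermediate through the context; Serialize copies it
// to a caller-owned buffer and frees the intermediate, as the rule requires.
BufVal AvgInitSketch(AllocContextMock* ctx) {
  int len = static_cast<int>(sizeof(AvgState));
  BufVal val{ctx->Allocate(len), len};
  std::memset(val.ptr, 0, val.len);
  return val;
}

BufVal AvgSerializeSketch(AllocContextMock* ctx, BufVal val) {
  BufVal result{new uint8_t[val.len], val.len};  // stands in for StringVal(ctx, len)
  std::memcpy(result.ptr, val.ptr, val.len);
  ctx->Free(val.ptr);  // the intermediate allocation must be freed here
  return result;
}
```
</codeblock>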
<p>
In the SQL syntax, you create a UDAF by using the statement <codeph>CREATE AGGREGATE FUNCTION</codeph>.
You specify the entry points of the underlying C++ functions using the clauses <codeph>INIT_FN</codeph>,
<codeph>UPDATE_FN</codeph>, <codeph>MERGE_FN</codeph>, <codeph rev="2.0.0">SERIALIZE_FN</codeph>, and
<codeph>FINALIZE_FN</codeph>.
</p>

<p>
<!-- To do:
Need an example to demonstrate exactly what tokens are used for init, merge, finalize in
this substitution.
-->
For convenience, you can use a naming convention for the underlying functions and Impala automatically
recognizes those entry points. Specify the <codeph>UPDATE_FN</codeph> clause, using an entry point name
containing the string <codeph>update</codeph> or <codeph>Update</codeph>. When you omit the other
<codeph>_FN</codeph> clauses from the SQL statement, Impala looks for entry points with names formed by
substituting the <codeph>update</codeph> or <codeph>Update</codeph> portion of the specified name.
</p>
<!--
[INIT_FN '<varname>function</varname>]
[UPDATE_FN '<varname>function</varname>]
[MERGE_FN '<varname>function</varname>]
[FINALIZE_FN '<varname>function</varname>]
-->

<p>
<xref keyref="uda-sample.h"><filepath>uda-sample.h</filepath></xref>:
</p>

<codeblock audience="hidden">#ifndef SAMPLES_UDA_H
#define SAMPLES_UDA_H

#include <impala_udf/udf.h>

using namespace impala_udf;

// This is an example of the COUNT aggregate function.
//
// Usage: > create aggregate function my_count(int) returns bigint
//          location '/user/doc_demo/libudasample.so' update_fn='CountUpdate';
//        > select my_count(col) from tbl;

void CountInit(FunctionContext* context, BigIntVal* val);
void CountUpdate(FunctionContext* context, const IntVal& input, BigIntVal* val);
void CountMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst);
BigIntVal CountFinalize(FunctionContext* context, const BigIntVal& val);

// This is an example of the AVG(double) aggregate function. This function needs to
// maintain two pieces of state, the current sum and the count. We do this using
// the StringVal intermediate type. When this UDA is registered, it would specify
// 16 bytes (8 byte sum + 8 byte count) as the size for this buffer.
//
// Usage: > create aggregate function my_avg(double) returns string
//          location '/user/doc_demo/libudasample.so' update_fn='AvgUpdate';
//        > select cast(my_avg(col) as double) from tbl;

void AvgInit(FunctionContext* context, StringVal* val);
void AvgUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val);
void AvgMerge(FunctionContext* context, const StringVal& src, StringVal* dst);
const StringVal AvgSerialize(FunctionContext* context, const StringVal& val);
StringVal AvgFinalize(FunctionContext* context, const StringVal& val);

// This is a sample of implementing the STRING_CONCAT aggregate function.
//
// Usage: > create aggregate function string_concat(string, string) returns string
//          location '/user/doc_demo/libudasample.so' update_fn='StringConcatUpdate';
//        > select string_concat(string_col, ",") from table;

void StringConcatInit(FunctionContext* context, StringVal* val);
void StringConcatUpdate(FunctionContext* context, const StringVal& arg1,
    const StringVal& arg2, StringVal* val);
void StringConcatMerge(FunctionContext* context, const StringVal& src, StringVal* dst);
const StringVal StringConcatSerialize(FunctionContext* context, const StringVal& val);
StringVal StringConcatFinalize(FunctionContext* context, const StringVal& val);

// This is an example of the variance aggregate function.
//
// Usage: > create aggregate function var(double) returns string
//          location '/user/doc_demo/libudasample.so' update_fn='VarianceUpdate';
//        > select cast(var(col) as double) from tbl;

void VarianceInit(FunctionContext* context, StringVal* val);
void VarianceUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val);
void VarianceMerge(FunctionContext* context, const StringVal& src, StringVal* dst);
const StringVal VarianceSerialize(FunctionContext* context, const StringVal& val);
StringVal VarianceFinalize(FunctionContext* context, const StringVal& val);

// An implementation of the Knuth online variance algorithm, which is also single pass and
// more numerically stable.
//
// Usage: > create aggregate function knuth_var(double) returns string
//          location '/user/doc_demo/libudasample.so' update_fn='KnuthVarianceUpdate';
//        > select cast(knuth_var(col) as double) from tbl;

void KnuthVarianceInit(FunctionContext* context, StringVal* val);
void KnuthVarianceUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val);
void KnuthVarianceMerge(FunctionContext* context, const StringVal& src, StringVal* dst);
const StringVal KnuthVarianceSerialize(FunctionContext* context, const StringVal& val);
StringVal KnuthVarianceFinalize(FunctionContext* context, const StringVal& val);

// The different steps of the UDA are composable. In this case, the UDA will use the
// other steps from the Knuth variance computation.
//
// Usage: > create aggregate function stddev(double) returns string
//          location '/user/doc_demo/libudasample.so' update_fn='KnuthVarianceUpdate'
//          finalize_fn="StdDevFinalize";
//        > select cast(stddev(col) as double) from tbl;

StringVal StdDevFinalize(FunctionContext* context, const StringVal& val);

// Utility function for serialization to StringVal
template <typename T>
StringVal ToStringVal(FunctionContext* context, const T& val);

#endif</codeblock>
<p>
<xref keyref="uda-sample.cc"><filepath>uda-sample.cc</filepath></xref>:
</p>

<codeblock audience="hidden">#include "uda-sample.h"
#include <assert.h>
#include <sstream>

using namespace impala_udf;
using namespace std;

template <typename T>
StringVal ToStringVal(FunctionContext* context, const T& val) {
  stringstream ss;
  ss << val;
  string str = ss.str();
  StringVal string_val(context, str.size());
  memcpy(string_val.ptr, str.c_str(), str.size());
  return string_val;
}

template <>
StringVal ToStringVal<DoubleVal>(FunctionContext* context, const DoubleVal& val) {
  if (val.is_null) return StringVal::null();
  return ToStringVal(context, val.val);
}

// ---------------------------------------------------------------------------
// This is a sample of implementing a COUNT aggregate function.
// ---------------------------------------------------------------------------
void CountInit(FunctionContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}

void CountUpdate(FunctionContext* context, const IntVal& input, BigIntVal* val) {
  if (input.is_null) return;
  ++val->val;
}

void CountMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}

BigIntVal CountFinalize(FunctionContext* context, const BigIntVal& val) {
  return val;
}

// ---------------------------------------------------------------------------
// This is a sample of implementing an AVG aggregate function.
// ---------------------------------------------------------------------------
struct AvgStruct {
  double sum;
  int64_t count;
};

// Initialize the StringVal intermediate to a zero'd AvgStruct
void AvgInit(FunctionContext* context, StringVal* val) {
  val->is_null = false;
  val->len = sizeof(AvgStruct);
  val->ptr = context->Allocate(val->len);
  memset(val->ptr, 0, val->len);
}

void AvgUpdate(FunctionContext* context, const DoubleVal& input, StringVal* val) {
  if (input.is_null) return;
  assert(!val->is_null);
  assert(val->len == sizeof(AvgStruct));
  AvgStruct* avg = reinterpret_cast<AvgStruct*>(val->ptr);
  avg->sum += input.val;
  ++avg->count;
}

void AvgMerge(FunctionContext* context, const StringVal& src, StringVal* dst) {
  if (src.is_null) return;
  const AvgStruct* src_avg = reinterpret_cast<const AvgStruct*>(src.ptr);
  AvgStruct* dst_avg = reinterpret_cast<AvgStruct*>(dst->ptr);
  dst_avg->sum += src_avg->sum;
  dst_avg->count += src_avg->count;
}

// A serialize function is necessary to free the intermediate state allocation. We use the
// StringVal constructor to allocate memory owned by Impala, copy the intermediate state,
// and free the original allocation. Note that memory allocated by the StringVal ctor is
// not necessarily persisted across UDA function calls, which is why we don't use it in
// AvgInit().
const StringVal AvgSerialize(FunctionContext* context, const StringVal& val) {
  assert(!val.is_null);
  StringVal result(context, val.len);
  memcpy(result.ptr, val.ptr, val.len);
  context->Free(val.ptr);
  return result;
}

StringVal AvgFinalize(FunctionContext* context, const StringVal& val) {
  assert(!val.is_null);
  assert(val.len == sizeof(AvgStruct));
  AvgStruct* avg = reinterpret_cast<AvgStruct*>(val.ptr);
  StringVal result;
  if (avg->count == 0) {
    result = StringVal::null();
  } else {
    // Copies the result to memory owned by Impala
    result = ToStringVal(context, avg->sum / avg->count);
  }
  context->Free(val.ptr);
  return result;
}

// ---------------------------------------------------------------------------
// This is a sample of implementing the STRING_CONCAT aggregate function.
// Example: select string_concat(string_col, ",") from table
// ---------------------------------------------------------------------------
// Delimiter to use if the separator is NULL.
static const StringVal DEFAULT_STRING_CONCAT_DELIM((uint8_t*)", ", 2);

void StringConcatInit(FunctionContext* context, StringVal* val) {
  val->is_null = true;
}

void StringConcatUpdate(FunctionContext* context, const StringVal& str,
    const StringVal& separator, StringVal* result) {
  if (str.is_null) return;
  if (result->is_null) {
    // This is the first string, simply set the result to be the value.
    uint8_t* copy = context->Allocate(str.len);
    memcpy(copy, str.ptr, str.len);
    *result = StringVal(copy, str.len);
    return;
  }

  const StringVal* sep_ptr = separator.is_null ? &DEFAULT_STRING_CONCAT_DELIM :
      &separator;

  // We need to grow the result buffer and then append the new string and
  // separator.
  int new_size = result->len + sep_ptr->len + str.len;
  result->ptr = context->Reallocate(result->ptr, new_size);
  memcpy(result->ptr + result->len, sep_ptr->ptr, sep_ptr->len);
  result->len += sep_ptr->len;
  memcpy(result->ptr + result->len, str.ptr, str.len);
  result->len += str.len;
}

void StringConcatMerge(FunctionContext* context, const StringVal& src, StringVal* dst) {
  if (src.is_null) return;
  StringConcatUpdate(context, src, ",", dst);
}

// A serialize function is necessary to free the intermediate state allocation. We use the
// StringVal constructor to allocate memory owned by Impala, copy the intermediate
// StringVal, and free the intermediate's memory. Note that memory allocated by the
// StringVal ctor is not necessarily persisted across UDA function calls, which is why we
// don't use it in StringConcatUpdate().
const StringVal StringConcatSerialize(FunctionContext* context, const StringVal& val) {
  if (val.is_null) return val;
  StringVal result(context, val.len);
  memcpy(result.ptr, val.ptr, val.len);
  context->Free(val.ptr);
  return result;
}

// Same as StringConcatSerialize().
StringVal StringConcatFinalize(FunctionContext* context, const StringVal& val) {
  if (val.is_null) return val;
  StringVal result(context, val.len);
  memcpy(result.ptr, val.ptr, val.len);
  context->Free(val.ptr);
  return result;
}</codeblock>
</conbody>
</concept>

<concept rev="2.3.0 IMPALA-1829" id="udf_intermediate">

<title>Intermediate Results for UDAs</title>

<conbody>

<p>
A user-defined aggregate function might produce and combine intermediate results during some phases of
processing, using a different data type than the final return value. For example, if you implement a
function similar to the built-in <codeph>AVG()</codeph> function, it must keep track of two values, the
number of values counted and the sum of those values. Or, you might accumulate a string value over the
course of a UDA, then in the end return a numeric or Boolean result.
</p>

<p>
In such a case, specify the data type of the intermediate results using the optional <codeph>INTERMEDIATE
<varname>type_name</varname></codeph> clause of the <codeph>CREATE AGGREGATE FUNCTION</codeph> statement.
If the intermediate data is a typeless byte array (for example, to represent a C++ struct or array),
specify the type name as <codeph>CHAR(<varname>n</varname>)</codeph>, with <varname>n</varname>
representing the number of bytes in the intermediate result buffer.
</p>

<p>
For an example of this technique, see the <codeph>trunc_sum()</codeph> aggregate function, which accumulates
intermediate results of type <codeph>DOUBLE</codeph> and returns <codeph>BIGINT</codeph> at the end.
View <xref keyref="test_udfs.py">the <codeph>CREATE FUNCTION</codeph> statement</xref>
and <xref keyref="test-udas.cc">the implementation of the underlying TruncSum*() functions</xref>
on GitHub.
</p>
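<p>
A sketch of mapping a C++ struct onto such a <codeph>CHAR(16)</codeph> buffer (the struct and function
names here are illustrative, not taken from the <codeph>trunc_sum()</codeph> example):
</p>
<codeblock>
```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The byte count in the INTERMEDIATE CHAR(16) clause must exactly match
// sizeof the struct: an 8-byte sum plus an 8-byte count.
struct AvgIntermediate {
  double sum;      // 8 bytes
  int64_t count;   // 8 bytes
};

static_assert(sizeof(AvgIntermediate) == 16,
              "must match the CHAR(16) declared in CREATE AGGREGATE FUNCTION");

// Update step operating directly on the raw CHAR(n) bytes.
void UpdateIntermediate(uint8_t* buf, double value) {
  AvgIntermediate* state = reinterpret_cast<AvgIntermediate*>(buf);
  state->sum += value;
  ++state->count;
}
```
</codeblock>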
</conbody>
</concept>
</concept>

<concept id="udf_building">

<title>Building and Deploying UDFs</title>
<prolog>
<metadata>
<data name="Category" value="Deploying"/>
<data name="Category" value="Building"/>
</metadata>
</prolog>

<conbody>

<p>
This section explains the steps to compile Impala UDFs from C++ source code, and deploy the resulting
libraries for use in Impala queries.
</p>

<p>
Impala ships with a sample build environment for UDFs that you can study, experiment with, and adapt for
your own use. This sample build environment starts with the <cmdname>cmake</cmdname> configuration command,
which reads the file <filepath>CMakeLists.txt</filepath> and generates a <filepath>Makefile</filepath>
customized for your particular directory paths. Then the <cmdname>make</cmdname> command runs the actual
build steps based on the rules in the <filepath>Makefile</filepath>.
</p>

<p>
Impala loads the shared library from an HDFS location. After building a shared library containing one or
more UDFs, use <codeph>hdfs dfs</codeph> or <codeph>hadoop fs</codeph> commands to copy the binary file to
an HDFS location readable by Impala.
</p>

<p>
The final step in deployment is to issue a <codeph>CREATE FUNCTION</codeph> statement in the
<cmdname>impala-shell</cmdname> interpreter to make Impala aware of the new function. See
<xref href="impala_create_function.xml#create_function"/> for syntax details. Because each function is
associated with a particular database, always issue a <codeph>USE</codeph> statement to the appropriate
database before creating a function, or specify a fully qualified name, that is, <codeph>CREATE FUNCTION
<varname>db_name</varname>.<varname>function_name</varname></codeph>.
</p>

<p>
As you update the UDF code and redeploy updated versions of a shared library, use <codeph>DROP
FUNCTION</codeph> and <codeph>CREATE FUNCTION</codeph> to let Impala pick up the latest version of the
code.
</p>

<note>
<p conref="../shared/impala_common.xml#common/udf_persistence_restriction"/>
<p>
See <xref href="impala_create_function.xml#create_function"/> and <xref href="impala_drop_function.xml#drop_function"/>
for the new syntax for the persistent Java UDFs.
</p>
</note>

<p>
Prerequisites for the build environment are:
</p>

<codeblock rev=""># Use the appropriate package installation command for your Linux distribution.
sudo yum install gcc-c++ cmake boost-devel
sudo yum install impala-udf-devel
# The package name on Ubuntu and Debian is impala-udf-dev.
</codeblock>
<p>
Then, unpack the sample code in <filepath>udf_samples.tar.gz</filepath> and use that as a template to set
up your build environment.
</p>

<p>
To build the original samples:
</p>

<codeblock># Process CMakeLists.txt and set up appropriate Makefiles.
cmake .
# Generate shared libraries from UDF and UDAF sample code,
# udf_samples/libudfsample.so and udf_samples/libudasample.so
make</codeblock>

<p>
The sample code to examine, experiment with, and adapt is in these files:
</p>

<ul>
<li>
<filepath>udf-sample.h</filepath>: Header file that declares the signature for a scalar UDF
(<codeph>AddUDF</codeph>).
</li>
<li>
<filepath>udf-sample.cc</filepath>: Sample source for a simple UDF that adds two integers. Because
Impala can reference multiple function entry points from the same shared library, you could add other UDF
functions in this file and add their signatures to the corresponding header file.
</li>
<li>
<filepath>udf-sample-test.cc</filepath>: Basic unit tests for the sample UDF.
</li>
<li>
<filepath>uda-sample.h</filepath>: Header file that declares the signature for sample aggregate
functions. The SQL functions will be called <codeph>COUNT</codeph>, <codeph>AVG</codeph>, and
<codeph>STRINGCONCAT</codeph>. Because aggregate functions require more elaborate coding to handle the
processing for multiple phases, there are several underlying C++ functions such as
<codeph>CountInit</codeph>, <codeph>AvgUpdate</codeph>, and <codeph>StringConcatFinalize</codeph>.
</li>
<li>
<filepath>uda-sample.cc</filepath>: Sample source for simple UDAFs that demonstrate how to manage the
state transitions as the underlying functions are called during the different phases of query processing.
<ul>
<li>
The UDAF that imitates the <codeph>COUNT</codeph> function keeps track of a single incrementing
number; the merge functions combine the intermediate count values from each Impala node, and the
combined number is returned verbatim by the finalize function.
</li>
<li>
The UDAF that imitates the <codeph>AVG</codeph> function keeps track of two numbers, a count of rows
processed and the sum of values for a column. These numbers are updated and merged as with
<codeph>COUNT</codeph>, then the finalize function divides them to produce and return the final
average value.
</li>
<li>
The UDAF that concatenates string values into a comma-separated list demonstrates how to manage
storage for a string that increases in length as the function is called for multiple rows.
</li>
</ul>
</li>
<li>
<filepath>uda-sample-test.cc</filepath>: Basic unit tests for the sample UDAFs.
</li>
</ul>
</conbody>
</concept>

<concept id="udf_performance">

<title>Performance Considerations for UDFs</title>
<prolog>
<metadata>
<data name="Category" value="Performance"/>
</metadata>
</prolog>

<conbody>

<p>
Because a UDF typically processes each row of a table, potentially being called billions of times, the
performance of each UDF is a critical factor in the speed of the overall ETL or ELT pipeline. Tiny
optimizations you can make within the function body can pay off in a big way when the function is called
over and over when processing a huge result set.
</p>
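<p>
One common example of such a tiny optimization is hoisting invariant setup work, such as compiling a
regular expression, out of the per-row path (the functions below are illustrative; in a real UDF the
compiled object would typically live in the thread-safe work area described earlier):
</p>
<codeblock>
```cpp
#include <cassert>
#include <regex>
#include <string>

// Slow pattern: the regular expression is recompiled for every row.
bool MatchesDigitSlow(const std::string& s) {
  std::regex re("[0-9]+");
  return std::regex_search(s, re);
}

// Fast pattern: the regular expression is compiled once and reused for
// every subsequent call.
bool MatchesDigitFast(const std::string& s) {
  static const std::regex re("[0-9]+");
  return std::regex_search(s, re);
}
```
</codeblock>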
</conbody>
</concept>

<concept id="udf_tutorial">

<title>Examples of Creating and Using UDFs</title>

<conbody>

<p>
This section demonstrates how to create and use all kinds of user-defined functions (UDFs).
</p>

<p audience="hidden">
For downloadable examples that you can experiment with, adapt, and use as templates for your own functions,
see <xref keyref="udf-samples" scope="external" format="html">the Impala sample UDF github</xref>.
You must have already installed the appropriate header files, as explained in
<xref href="impala_udf.xml#udf_demo_env"/>.
</p>

<!-- Limitation: mini-TOC currently doesn't include the <example> tags. -->

<!-- <p outputclass="toc inpage"/> -->

<example id="udf_sample_udf">

<title>Sample C++ UDFs: HasVowels, CountVowels, StripVowels</title>

<p>
This example shows 3 separate UDFs that operate on strings and return different data types. In the C++
code, the functions are <codeph>HasVowels()</codeph> (checks if a string contains any vowels),
<codeph>CountVowels()</codeph> (returns the number of vowels in a string), and
<codeph>StripVowels()</codeph> (returns a new string with vowels removed).
</p>

<p>
First, we add the signatures for these functions to <filepath>udf-sample.h</filepath> in the demo build
environment:
</p>

<codeblock>BooleanVal HasVowels(FunctionContext* context, const StringVal& input);
IntVal CountVowels(FunctionContext* context, const StringVal& arg1);
StringVal StripVowels(FunctionContext* context, const StringVal& arg1);</codeblock>
<p>
Then, we add the bodies of these functions to <filepath>udf-sample.cc</filepath>:
</p>

<codeblock>BooleanVal HasVowels(FunctionContext* context, const StringVal& input)
{
  if (input.is_null) return BooleanVal::null();

  int index;
  uint8_t *ptr;

  for (ptr = input.ptr, index = 0; index < input.len; index++, ptr++)
  {
    uint8_t c = tolower(*ptr);
    if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
    {
      return BooleanVal(true);
    }
  }
  return BooleanVal(false);
}

IntVal CountVowels(FunctionContext* context, const StringVal& arg1)
{
  if (arg1.is_null) return IntVal::null();

  int count;
  int index;
  uint8_t *ptr;

  for (ptr = arg1.ptr, count = 0, index = 0; index < arg1.len; index++, ptr++)
  {
    uint8_t c = tolower(*ptr);
    if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u')
    {
      count++;
    }
  }
  return IntVal(count);
}

StringVal StripVowels(FunctionContext* context, const StringVal& arg1)
{
  if (arg1.is_null) return StringVal::null();

  int index;
  std::string original((const char *)arg1.ptr,arg1.len);
  std::string shorter("");

  for (index = 0; index < original.length(); index++)
  {
    uint8_t c = original[index];
    uint8_t l = tolower(c);

    if (l == 'a' || l == 'e' || l == 'i' || l == 'o' || l == 'u')
    {
      ;
    }
    else
    {
      shorter.append(1, (char)c);
    }
  }
  // The modified string is stored in 'shorter', which is destroyed when this function ends. We need to make a string val
  // and copy the contents.
  StringVal result(context, shorter.size()); // Only the version of the ctor that takes a context object allocates new memory
  memcpy(result.ptr, shorter.c_str(), shorter.size());
  return result;
}</codeblock>
<p>
We build a shared library, <filepath>libudfsample.so</filepath>, and put the library file into HDFS
where Impala can read it:
</p>

<codeblock>$ make
[  0%] Generating udf_samples/uda-sample.ll
[ 16%] Built target uda-sample-ir
[ 33%] Built target udasample
[ 50%] Built target uda-sample-test
[ 50%] Generating udf_samples/udf-sample.ll
[ 66%] Built target udf-sample-ir
Scanning dependencies of target udfsample
[ 83%] Building CXX object CMakeFiles/udfsample.dir/udf-sample.o
Linking CXX shared library udf_samples/libudfsample.so
[ 83%] Built target udfsample
Linking CXX executable udf_samples/udf-sample-test
[100%] Built target udf-sample-test
$ hdfs dfs -put ./udf_samples/libudfsample.so /user/hive/udfs/libudfsample.so</codeblock>
<p>
|
|
Finally, we go into the <cmdname>impala-shell</cmdname> interpreter where we set up some sample data,
|
|
issue <codeph>CREATE FUNCTION</codeph> statements to set up the SQL function names, and call the
|
|
functions in some queries:
|
|
</p>
|
|
|
|
<codeblock>[localhost:21000] > create database udf_testing;
|
|
[localhost:21000] > use udf_testing;

[localhost:21000] > create function has_vowels (string) returns boolean location '/user/hive/udfs/libudfsample.so' symbol='HasVowels';
[localhost:21000] > select has_vowels('abc');
+------------------------+
| udfs.has_vowels('abc') |
+------------------------+
| true                   |
+------------------------+
Returned 1 row(s) in 0.13s
[localhost:21000] > select has_vowels('zxcvbnm');
+----------------------------+
| udfs.has_vowels('zxcvbnm') |
+----------------------------+
| false                      |
+----------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select has_vowels(null);
+-----------------------+
| udfs.has_vowels(null) |
+-----------------------+
| NULL                  |
+-----------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > select s, has_vowels(s) from t2;
+-----------+--------------------+
| s         | udfs.has_vowels(s) |
+-----------+--------------------+
| lower     | true               |
| UPPER     | true               |
| Init cap  | true               |
| CamelCase | true               |
+-----------+--------------------+
Returned 4 row(s) in 0.24s

[localhost:21000] > create function count_vowels (string) returns int location '/user/hive/udfs/libudfsample.so' symbol='CountVowels';
[localhost:21000] > select count_vowels('cat in the hat');
+-------------------------------------+
| udfs.count_vowels('cat in the hat') |
+-------------------------------------+
| 4                                   |
+-------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select s, count_vowels(s) from t2;
+-----------+----------------------+
| s         | udfs.count_vowels(s) |
+-----------+----------------------+
| lower     | 2                    |
| UPPER     | 2                    |
| Init cap  | 3                    |
| CamelCase | 4                    |
+-----------+----------------------+
Returned 4 row(s) in 0.23s
[localhost:21000] > select count_vowels(null);
+-------------------------+
| udfs.count_vowels(null) |
+-------------------------+
| NULL                    |
+-------------------------+
Returned 1 row(s) in 0.12s

[localhost:21000] > create function strip_vowels (string) returns string location '/user/hive/udfs/libudfsample.so' symbol='StripVowels';
[localhost:21000] > select strip_vowels('abcdefg');
+------------------------------+
| udfs.strip_vowels('abcdefg') |
+------------------------------+
| bcdfg                        |
+------------------------------+
Returned 1 row(s) in 0.11s
[localhost:21000] > select strip_vowels('ABCDEFG');
+------------------------------+
| udfs.strip_vowels('ABCDEFG') |
+------------------------------+
| BCDFG                        |
+------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select strip_vowels(null);
+-------------------------+
| udfs.strip_vowels(null) |
+-------------------------+
| NULL                    |
+-------------------------+
Returned 1 row(s) in 0.16s
[localhost:21000] > select s, strip_vowels(s) from t2;
+-----------+----------------------+
| s         | udfs.strip_vowels(s) |
+-----------+----------------------+
| lower     | lwr                  |
| UPPER     | PPR                  |
| Init cap  | nt cp                |
| CamelCase | CmlCs                |
+-----------+----------------------+
Returned 4 row(s) in 0.24s</codeblock>

</example>

<example id="udf_sample_uda">

<title>Sample C++ UDA: SumOfSquares</title>

<p>
This example demonstrates a user-defined aggregate function (UDA) that produces the sum of the squares of
its input values.
</p>

<p>
The coding for a UDA is a little more involved than for a scalar UDF, because the processing is split into
several phases, each implemented by a different function. Each phase is relatively straightforward: the
<q>update</q> and <q>merge</q> phases, where most of the work is done, read an input value and combine it
with some accumulated intermediate value.
</p>
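
<p>
The way these phases fit together can be sketched in plain, standalone C++. The
<codeph>FunctionContext</codeph> and <codeph>BigIntVal</codeph> structs below are simplified stand-ins
for the real types in Impala's <filepath>udf.h</filepath>, and the <codeph>SumOfSquares()</codeph>
driver is purely illustrative: each node aggregates its own rows with the init and update functions,
then a coordinator combines the partial results with merge and produces the answer with finalize.
</p>

<codeblock>#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-ins for the types in Impala's udf.h, for illustration only.
struct FunctionContext {};
struct BigIntVal { bool is_null; int64_t val; };

// Same shape as the SumOfSquares phase functions shown later in this example.
void Init(FunctionContext*, BigIntVal* v) { v->is_null = false; v->val = 0; }
void Update(FunctionContext*, const BigIntVal& in, BigIntVal* v) {
  if (in.is_null) return;
  v->val += in.val * in.val;
}
void Merge(FunctionContext*, const BigIntVal& src, BigIntVal* dst) { dst->val += src.val; }
BigIntVal Finalize(FunctionContext*, const BigIntVal& v) { return v; }

// Drive the phases the way a distributed query would: each "node" aggregates
// its own rows with Init/Update, then a coordinator combines the partial
// results with Merge and produces the final answer with Finalize.
int64_t SumOfSquares(const std::vector<std::vector<int64_t>>& per_node_rows) {
  FunctionContext ctx;
  BigIntVal total;
  Init(&ctx, &total);
  for (const std::vector<int64_t>& rows : per_node_rows) {
    BigIntVal partial;
    Init(&ctx, &partial);
    for (int64_t x : rows) {
      BigIntVal in = {false, x};
      Update(&ctx, in, &partial);
    }
    Merge(&ctx, partial, &total);
  }
  return Finalize(&ctx, total).val;
}

int main() {
  // Rows 1..4 split across two nodes: 1 + 4 + 9 + 16 = 30.
  assert(SumOfSquares({{1, 2}, {3, 4}}) == 30);
  // A single node holding all the rows produces the same answer.
  assert(SumOfSquares({{1, 2, 3, 4}}) == 30);
  return 0;
}</codeblock>

<p>
Note that the result is independent of how the rows are split across nodes, which is exactly the
property the merge phase must guarantee.
</p>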

<p>
As in our sample UDF from the previous example, we add function signatures to a header file (in this
case, <filepath>uda-sample.h</filepath>). Because this is a math-oriented UDA, we make two versions of
each function, one accepting an integer value and the other accepting a floating-point value.
</p>

<codeblock>void SumOfSquaresInit(FunctionContext* context, BigIntVal* val);
void SumOfSquaresInit(FunctionContext* context, DoubleVal* val);

void SumOfSquaresUpdate(FunctionContext* context, const BigIntVal& input, BigIntVal* val);
void SumOfSquaresUpdate(FunctionContext* context, const DoubleVal& input, DoubleVal* val);

void SumOfSquaresMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst);
void SumOfSquaresMerge(FunctionContext* context, const DoubleVal& src, DoubleVal* dst);

BigIntVal SumOfSquaresFinalize(FunctionContext* context, const BigIntVal& val);
DoubleVal SumOfSquaresFinalize(FunctionContext* context, const DoubleVal& val);</codeblock>

<p>
We add the function bodies to a C++ source file (in this case, <filepath>uda-sample.cc</filepath>):
</p>

<codeblock>void SumOfSquaresInit(FunctionContext* context, BigIntVal* val) {
  val->is_null = false;
  val->val = 0;
}
void SumOfSquaresInit(FunctionContext* context, DoubleVal* val) {
  val->is_null = false;
  val->val = 0.0;
}

void SumOfSquaresUpdate(FunctionContext* context, const BigIntVal& input, BigIntVal* val) {
  if (input.is_null) return;
  val->val += input.val * input.val;
}
void SumOfSquaresUpdate(FunctionContext* context, const DoubleVal& input, DoubleVal* val) {
  if (input.is_null) return;
  val->val += input.val * input.val;
}

void SumOfSquaresMerge(FunctionContext* context, const BigIntVal& src, BigIntVal* dst) {
  dst->val += src.val;
}
void SumOfSquaresMerge(FunctionContext* context, const DoubleVal& src, DoubleVal* dst) {
  dst->val += src.val;
}

BigIntVal SumOfSquaresFinalize(FunctionContext* context, const BigIntVal& val) {
  return val;
}
DoubleVal SumOfSquaresFinalize(FunctionContext* context, const DoubleVal& val) {
  return val;
}</codeblock>

<p>
As with the sample UDF, we build a shared library and put it into HDFS:
</p>

<codeblock>$ make
[  0%] Generating udf_samples/uda-sample.ll
[ 16%] Built target uda-sample-ir
Scanning dependencies of target udasample
[ 33%] Building CXX object CMakeFiles/udasample.dir/uda-sample.o
Linking CXX shared library udf_samples/libudasample.so
[ 33%] Built target udasample
Scanning dependencies of target uda-sample-test
[ 50%] Building CXX object CMakeFiles/uda-sample-test.dir/uda-sample-test.o
Linking CXX executable udf_samples/uda-sample-test
[ 50%] Built target uda-sample-test
[ 50%] Generating udf_samples/udf-sample.ll
[ 66%] Built target udf-sample-ir
[ 83%] Built target udfsample
[100%] Built target udf-sample-test

$ hdfs dfs -put ./udf_samples/libudasample.so /user/hive/udfs/libudasample.so</codeblock>

<p>
To create the SQL function, we issue a <codeph>CREATE AGGREGATE FUNCTION</codeph> statement and specify
the underlying C++ function names for the different phases:
</p>

<codeblock>[localhost:21000] > use udf_testing;

[localhost:21000] > create table sos (x bigint, y double);
[localhost:21000] > insert into sos values (1, 1.1), (2, 2.2), (3, 3.3), (4, 4.4);
Inserted 4 rows in 1.10s

[localhost:21000] > create aggregate function sum_of_squares(bigint) returns bigint
                  > location '/user/hive/udfs/libudasample.so'
                  > init_fn='SumOfSquaresInit'
                  > update_fn='SumOfSquaresUpdate'
                  > merge_fn='SumOfSquaresMerge'
                  > finalize_fn='SumOfSquaresFinalize';

[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1*1 + 2*2 + 3*3 + 4*4;
+-------------------------------+
| 1 * 1 + 2 * 2 + 3 * 3 + 4 * 4 |
+-------------------------------+
| 30                            |
+-------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select sum_of_squares(x) from sos;
+------------------------+
| udfs.sum_of_squares(x) |
+------------------------+
| 30                     |
+------------------------+
Returned 1 row(s) in 0.35s</codeblock>

<p>
Until we create the overloaded version of the UDA, it can only handle a single data type. To allow it to
handle <codeph>DOUBLE</codeph> as well as <codeph>BIGINT</codeph>, we issue another <codeph>CREATE
AGGREGATE FUNCTION</codeph> statement:
</p>

<codeblock>[localhost:21000] > select sum_of_squares(y) from sos;
ERROR: AnalysisException: No matching function with signature: udfs.sum_of_squares(DOUBLE).

[localhost:21000] > create aggregate function sum_of_squares(double) returns double
                  > location '/user/hive/udfs/libudasample.so'
                  > init_fn='SumOfSquaresInit'
                  > update_fn='SumOfSquaresUpdate'
                  > merge_fn='SumOfSquaresMerge'
                  > finalize_fn='SumOfSquaresFinalize';

[localhost:21000] > -- Compute the same value using literals or the UDA;
[localhost:21000] > select 1.1*1.1 + 2.2*2.2 + 3.3*3.3 + 4.4*4.4;
+-----------------------------------------------+
| 1.1 * 1.1 + 2.2 * 2.2 + 3.3 * 3.3 + 4.4 * 4.4 |
+-----------------------------------------------+
| 36.3                                          |
+-----------------------------------------------+
Returned 1 row(s) in 0.12s
[localhost:21000] > select sum_of_squares(y) from sos;
+------------------------+
| udfs.sum_of_squares(y) |
+------------------------+
| 36.3                   |
+------------------------+
Returned 1 row(s) in 0.35s</codeblock>

<p>
Typically, you use a UDA in queries with <codeph>GROUP BY</codeph> clauses, to produce a result set with
a separate aggregate value for each combination of values from the <codeph>GROUP BY</codeph> clause.
Let's change our sample table to use <codeph>0</codeph> to flag rows containing even values, and
<codeph>1</codeph> to flag rows containing odd values. Then the <codeph>GROUP BY</codeph> query can
return two values: the sum of the squares of the even values, and the sum of the squares of the odd
values:
</p>

<codeblock>[localhost:21000] > insert overwrite sos values (1, 1), (2, 0), (3, 1), (4, 0);
Inserted 4 rows in 1.24s

[localhost:21000] > -- Compute 1 squared + 3 squared, and 2 squared + 4 squared;
[localhost:21000] > select y, sum_of_squares(x) from sos group by y;
+---+------------------------+
| y | udfs.sum_of_squares(x) |
+---+------------------------+
| 1 | 10                     |
| 0 | 20                     |
+---+------------------------+
Returned 2 row(s) in 0.43s</codeblock>

</example>

</conbody>
</concept>

<concept id="udf_security">

<title>Security Considerations for User-Defined Functions</title>
<prolog>
<metadata>
<data name="Category" value="Security"/>
</metadata>
</prolog>

<conbody>

<p>
When the Impala authorization feature is enabled:
</p>

<ul>
<li>
To call a UDF in a query, you must have the required read privilege for any databases and tables used in
the query.
</li>
<li>
The <codeph>CREATE FUNCTION</codeph> statement requires:
<ul>
<li>The <codeph>CREATE</codeph> privilege on the database.</li>
<li>The <codeph>ALL</codeph> privilege on the URI, where the URI is the value
you specified for the <codeph>LOCATION</codeph> clause in the
<codeph>CREATE FUNCTION</codeph> statement.</li>
</ul>
</li>
</ul>
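
<p>
For example, under role-based authorization (Sentry or Ranger, depending on the platform), the two
privileges could be granted along these lines. The role name <codeph>udf_dev</codeph>, the database
name, and the HDFS path here are hypothetical placeholders; substitute your own:
</p>

<codeblock>GRANT CREATE ON DATABASE udf_testing TO ROLE udf_dev;
GRANT ALL ON URI 'hdfs://localhost:8020/user/hive/udfs' TO ROLE udf_dev;</codeblock>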

<p>
See <xref href="impala_authorization.xml#authorization"/> for details about authorization in Impala.
</p>
</conbody>
</concept>

<concept id="udf_limits">

<title>Limitations and Restrictions for Impala UDFs</title>

<conbody>

<p>
The following limitations and restrictions apply to Impala UDFs in the current release:
</p>

<ul>
<li>
Impala does not support Hive UDFs that accept or return composite or nested types, or other types not
available in Impala tables.
</li>

<li>
<p conref="../shared/impala_common.xml#common/current_user_caveat"/>
</li>

<li>
All Impala UDFs must be deterministic, that is, they must produce the same output each time they are
passed the same argument values. For example, an Impala UDF must not call functions such as
<codeph>rand()</codeph> to produce different values for each invocation. It must not retrieve data from
external sources, such as from disk or over the network.
</li>

<li>
An Impala UDF must not spawn other threads or processes.
</li>

<li rev="2.5.0 IMPALA-2843">
Prior to <keyword keyref="impala25_full"/>, when the <cmdname>catalogd</cmdname> process was restarted,
all UDFs became undefined and had to be reloaded. In <keyword keyref="impala25_full"/> and higher, this
limitation only applies to older Java UDFs. Re-create those UDFs using the new
<codeph>CREATE FUNCTION</codeph> syntax for Java UDFs, which excludes the function signature,
to remove the limitation entirely.
</li>

<li>
Impala currently does not support user-defined table functions (UDTFs).
</li>

<li rev="2.0.0">
The <codeph>CHAR</codeph> and <codeph>VARCHAR</codeph> types cannot be used as input arguments or return
values for UDFs.
</li>
</ul>
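
<p>
As a workaround for the last restriction, declare the UDF argument as <codeph>STRING</codeph> and cast
<codeph>CHAR</codeph> or <codeph>VARCHAR</codeph> values at the call site. The function and table
names here are hypothetical:
</p>

<codeblock>-- my_vowel_count() was created with a STRING argument; c is a CHAR(10) column.
SELECT my_vowel_count(CAST(c AS STRING)) FROM char_table;</codeblock>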

</conbody>
</concept>
</concept>