This commit adds a new feature to persist Hive/Java UDFs across
catalog restarts. IMPALA-1748 already added this for non-Java
UDFs by storing them in the parameters map of the Db object and
reading them back at catalog startup. However, we follow a
different approach for Hive UDFs: we convert them to Hive's
function format and add them to the metastore as Hive functions.
This makes it possible to share UDFs between Hive and Impala, since
UDFs added from one service are accessible to the other. This commit
takes care of the format conversions between Hive and Impala, so the
user only needs to add a function once, in either service.
Background: Hive and Impala treat UDFs differently. Hive resolves the
evaluate function in the UDF class at runtime, depending on the data
types of the input arguments. So a user can add one function by name and
pass any arguments to it, as long as there is a compatible evaluate
function in the UDF class. Impala, however, takes the input types of the
UDF as part of the function definition (which maps to exactly one evaluate
function) and loads the function only for that set of input argument
types. If the class has multiple 'evaluate' methods, we need to add
multiple functions, one for each of them.
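A minimal sketch of that one-signature-per-overload mapping. The overload
list and the helper name are hypothetical and only illustrate the idea; the
catalog itself discovers the 'evaluate' methods from the UDF class.

```python
# Hypothetical description of a Java UDF class: each entry is one
# 'evaluate' overload, given as (argument types, return type).
TEST_UDF_EVALUATE_METHODS = [
    (("INT",), "INT"),
    (("STRING", "STRING"), "STRING"),
]


def impala_signatures(fn_name, evaluate_methods):
    """Return one Impala-style signature string per 'evaluate' overload."""
    return [
        "{0}({1}) RETURNS {2}".format(fn_name, ", ".join(args), ret)
        for args, ret in evaluate_methods
    ]


# One function name in Hive becomes two separate functions in Impala.
signatures = impala_signatures("foo", TEST_UDF_EVALUATE_METHODS)
```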
This commit adds new variants of CREATE | DROP FUNCTION to Impala, which
let the user create and drop Hive/Java UDFs without specifying input
argument types or return types. The catalog takes care of loading/dropping
the UDF signatures corresponding to each "evaluate" method in the UDF symbol
class. The syntax is as follows:
CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts>
DROP FUNCTION [IF EXISTS] <function name>
Examples:
CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;
The older way of creating Hive/Java UDFs with a specific signature is still
supported, but such functions are *not* persisted across restarts, so a
restart of the catalog wipes them out. Additionally, this commit loads all
compatible Java UDFs that were added outside of Impala, so they needn't be
loaded separately. One thing to note is that functions added using the new
CREATE FUNCTION syntax can only be dropped using the new DROP FUNCTION
syntax (without signature). Likewise, Java UDFs added using the old
CREATE FUNCTION syntax (with signature) can only be dropped using the old
DROP FUNCTION syntax (with signature).
Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d
Reviewed-on: http://gerrit.cloudera.org:8080/2250
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
This patch extends the SHOW statement to also support
user-defined functions and user-defined aggregate functions.
The syntax of the new SHOW statements is as follows:
SHOW CREATE [AGGREGATE] FUNCTION [<db_name>.]<func_name>;
<db_name> and <func_name> are the names of the database
and udf/uda respectively.
Sample outputs of the new SHOW statements are as follows:
Query: show create function fn
+------------------------------------------------------------------+
| result |
+------------------------------------------------------------------+
| CREATE FUNCTION default.fn() |
| RETURNS INT |
| LOCATION 'hdfs://localhost:20500/test-warehouse/libTestUdfs.so' |
| SYMBOL='_Z2FnPN10impala_udf15FunctionContextE' |
| |
+------------------------------------------------------------------+
Query: show create aggregate function agg_fn
+------------------------------------------------------------------------------------------+
| result |
+------------------------------------------------------------------------------------------+
| CREATE AGGREGATE FUNCTION default.agg_fn(INT) |
| RETURNS BIGINT |
| LOCATION 'hdfs://localhost:20500/test-warehouse/libudasample.so' |
| UPDATE_FN='_Z11CountUpdatePN10impala_udf15FunctionContextERKNS_6IntValEPNS_9BigIntValE' |
| INIT_FN='_Z9CountInitPN10impala_udf15FunctionContextEPNS_9BigIntValE' |
| MERGE_FN='_Z10CountMergePN10impala_udf15FunctionContextERKNS_9BigIntValEPS2_' |
| FINALIZE_FN='_Z13CountFinalizePN10impala_udf15FunctionContextERKNS_9BigIntValE' |
| |
+------------------------------------------------------------------------------------------+
Please note that all the overloaded functions which match
the given function name and category will be printed.
This patch also extends the python test infrastructure to
support expected results which include newline characters.
A new subsection comment called 'MULTI_LINE' has been added
for the 'RESULT' section. With this comment, a test can
include its multi-line output inside [ ] and the content
inside [ ] will be treated as a single line, including the
newline character.
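A rough sketch of how such bracketed results could be folded into single
rows. The function name and the exact bracket handling are assumptions made
for illustration; this is not the parser added by this patch.

```python
def parse_multi_line_results(section_text):
    """Split a RESULT section into rows, treating [ ... ] as one row."""
    rows = []
    pending = None  # lines collected since an unmatched '['
    for line in section_text.split("\n"):
        if pending is None:
            if line.startswith("[") and line.endswith("]"):
                # Bracketed value that fits on one line.
                rows.append(line[1:-1])
            elif line.startswith("["):
                # Start of a multi-line value.
                pending = [line[1:]]
            else:
                rows.append(line)
        elif line.endswith("]"):
            # End of a multi-line value: keep the embedded newlines.
            pending.append(line[:-1])
            rows.append("\n".join(pending))
            pending = None
        else:
            pending.append(line)
    if pending is not None:
        raise ValueError("unterminated '[' in MULTI_LINE result")
    return rows


rows = parse_multi_line_results("[CREATE FUNCTION default.fn()\nRETURNS INT]\n1")
```

Here the two bracketed lines come back as one expected row containing a
newline, followed by the ordinary row "1".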
Change-Id: Idbe433eeaf5e24ed55c31d905fea2a6160c46011
Reviewed-on: http://gerrit.cloudera.org:8080/1271
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Add skip markers for S3 that categorize the tests skipped against S3,
to help see what coverage is missing. Soon we'll be reworking some
tests and/or adding new tests to close the important gaps.
Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables. This is a step toward enabling some
more tests against S3.
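A minimal sketch of this kind of substitution. The variable name
WAREHOUSE_ROOT, the '$' syntax and the bucket path are hypothetical; the
variables actually introduced by this change are not listed here.

```python
import re


def substitute_path_vars(test_text, variables):
    """Replace $NAME placeholders with values; leave unknown names as-is."""
    def replace(match):
        name = match.group(1)
        return variables.get(name, match.group(0))

    return re.sub(r"\$([A-Z_]+)", replace, test_text)


# The same .test line can then target HDFS or S3 by changing one value.
line = "LOCATION '$WAREHOUSE_ROOT/libTestUdfs.so'"
s3_line = substitute_path_vars(line, {"WAREHOUSE_ROOT": "s3a://bucket/test-warehouse"})
```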
Finally, fix buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, applying the snapshot fails because
the metastore db is in use.
Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
When dropping functions, we need to remove the function from the list
of Functions with that name AND remove the list from the Function map if
the list is now empty. The second part wasn't happening.
Also fixes test_ddl to properly create all test databases.
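The two-step removal can be sketched as follows. This is an illustrative
model using a plain dict of lists, not the catalog code itself; the
function and variable names are made up for the example.

```python
def drop_function(functions_by_name, name, signature):
    """Remove one overload; drop the map entry when no overloads remain."""
    overloads = functions_by_name.get(name)
    if not overloads or signature not in overloads:
        return False
    overloads.remove(signature)
    if not overloads:
        # The step that was previously missing: remove the empty list.
        del functions_by_name[name]
    return True


fns = {"fn": ["fn()", "fn(INT)"]}
first = drop_function(fns, "fn", "fn()")
second = drop_function(fns, "fn", "fn(INT)")
```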
Change-Id: Id85af7d5db74a31161f48bea3816bdf734063133
Reviewed-on: http://gerrit.ent.cloudera.com:8080/952
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
This change adds support for cluster-synchronized catalog operations. This provides the
guarantee that after a catalog op completes, all other subscribers to the catalog topic have
also processed that update. This is useful when load balancing, because a common workflow
is to target a different impalad for each statement executed.
For example, suppose each of the following statements is executed sequentially, each
targeting a different node:
1) CREATE TABLE Foo
2) INSERT INTO Foo
3) SELECT * FROM Foo
4) INSERT INTO Foo ....
Since both the INSERT and the CREATE update the catalog, this sequence would not work as
expected without this patch. The user might either get a "table not found" error or be
missing partition information from the INSERT.
The downside is that this approach to DDL takes a bit longer because we need to wait
until all subscribers have processed an update. If all nodes are healthy, DDL should not
take significantly longer than it does today. However, a single bad node
might slow down or completely block the completion of all DDL operations. By default
this feature is disabled, but it can be enabled using a new query option: SYNCED_DDL=1
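As a rough model of the wait, the sketch below polls until every subscriber
reports the target catalog version. All names are hypothetical, and the poll
limit is an assumption of the sketch only; it is not a description of how
this patch bounds (or does not bound) the wait.

```python
import time


def wait_for_catalog_version(get_subscriber_versions, target_version,
                             max_polls=100, poll_interval_s=0.1):
    """Return True once every subscriber has processed target_version."""
    for _ in range(max_polls):
        versions = get_subscriber_versions()
        if all(v >= target_version for v in versions.values()):
            return True
        time.sleep(poll_interval_s)
    return False


# Simulated cluster: impalad-2 lags and advances one version per poll.
_versions = {"impalad-1": 5, "impalad-2": 3}


def _fake_poll():
    snapshot = dict(_versions)
    for node in _versions:
        _versions[node] = min(_versions[node] + 1, 5)
    return snapshot


synced = wait_for_catalog_version(_fake_poll, target_version=5,
                                  poll_interval_s=0)
# A node that never catches up keeps the DDL from completing.
stuck = wait_for_catalog_version(lambda: {"impalad-1": 5, "impalad-2": 1},
                                 target_version=5, max_polls=3,
                                 poll_interval_s=0)
```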
To test this, the base test suite was updated to support selecting a random impalad
to execute each query section in a query test file. This is currently only enabled
for the insert and DDL tests, but could be leveraged by more tests in the future.
TODO: Add additional failure tests around this functionality.
TODO: Add an explicit "sync" statement so users do not need to run all their DDL
in this mode (since it is slower).
Change-Id: I45e757a931bf2a4740cc0cdd1e76ce49a1e22b83
Reviewed-on: http://gerrit.ent.cloudera.com:8080/899
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Unfortunately, the BE does not have the codegen path to execute UDAs.
This puts some restrictions on the UDAs we can run.
- No IR UDAs
- No varargs
- Must have 8 arguments or fewer.
The code to do this is almost all there for UDFs but I'm not sure I'll get to it.
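The three restrictions amount to a check like the one below. This is only a
sketch with made-up names and messages, not the actual validation code.

```python
MAX_UDA_ARGS = 8  # limit stated above


def check_uda_supported(num_args, has_varargs, is_ir):
    """Return a list of reasons the UDA cannot run; empty means it can."""
    problems = []
    if is_ir:
        problems.append("IR UDAs are not supported")
    if has_varargs:
        problems.append("varargs UDAs are not supported")
    if num_args > MAX_UDA_ARGS:
        problems.append("UDAs may have at most %d arguments" % MAX_UDA_ARGS)
    return problems


ok = check_uda_supported(num_args=8, has_varargs=False, is_ir=False)
too_many = check_uda_supported(num_args=9, has_varargs=False, is_ir=False)
```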
Change-Id: I8a06e635a9138397c8474a5704c3e588bb92347b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/703
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).
This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted, but if we get
it wrong, the user can always specify the full symbol.
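To give a feel for the conversion, here is a deliberately limited sketch of
Itanium C++ ABI mangling for functions whose first parameter is
impala_udf::FunctionContext*. It is not the code added by this patch: the
helper names are invented, and it refuses signatures that repeat an argument
type, since those need the ABI's numbered substitutions (S0_, S1_, ...).

```python
UDF_NAMESPACE = "impala_udf"


def _len_prefixed(identifier):
    """Itanium source-name encoding: <length><identifier>."""
    return "{0}{1}".format(len(identifier), identifier)


def mangle_udf_symbol(fn_name, args=()):
    """Mangle fn_name(FunctionContext*, args...).

    args is a sequence of (type_name, kind) pairs, where type_name is a
    type in the impala_udf namespace and kind is 'const_ref' or 'ptr'.
    """
    symbol = "_Z" + _len_prefixed(fn_name)
    # The first parameter spells out the namespace in full.
    symbol += ("PN" + _len_prefixed(UDF_NAMESPACE) +
               _len_prefixed("FunctionContext") + "E")
    seen = {"FunctionContext"}
    for type_name, kind in args:
        if type_name in seen:
            raise NotImplementedError(
                "repeated types need numbered substitutions")
        seen.add(type_name)
        prefix = {"const_ref": "RK", "ptr": "P"}[kind]
        # 'S_' is the substitution for the already-seen namespace.
        symbol += prefix + "NS_" + _len_prefixed(type_name) + "E"
    return symbol


fn_symbol = mangle_udf_symbol("Fn")
update_symbol = mangle_udf_symbol(
    "CountUpdate", [("IntVal", "const_ref"), ("BigIntVal", "ptr")])
```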
Some other minor cleanup in:
- JNI from FE to BE
- UDFs/UDAs that are loaded as test data
Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
I looked around some, and I think having create/drop/show [aggregate] function
is reasonable and extends nicely to UDTs.
The create aggregate function statement can accept a lot of arguments. For the
non-essential ones, I went with resolving them by name rather than position (i.e.
argName="value"). I think this is better for the user than specifying them by position.
The grammar is:
CREATE AGGREGATE FUNCTION <name>(<arg_types>) RETURNS <type> [INTERMEDIATE <type>]
LOCATION '/path' UpdateFn='Fn' [comment='comment']
[SerializeFn='symbol'] [MergeFn='symbol'] [InitFn='symbol'] [FinalizeFn='symbol']
The optional args at the end can be in any order. If the other symbols are not
specified, we derive them from the required UpdateFn symbol. The analyzer
tries to resolve them and fails if it can't find a derived symbol in the binary.
The simplest example would be:
CREATE AGGREGATE FUNCTION count(float) RETURNS BIGINT LOCATION '/path'
UpdateFn='CountUpdateFn';
In this case we assume the intermediate type is the return type and the other functions
are called 'CountInitFn', 'CountSerializeFn', 'CountMergeFn' and 'CountFinalizeFn'.
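The derivation can be pictured as the sketch below. It assumes the rule is
"replace the last 'Update' in the UpdateFn symbol", which matches the
example above but is otherwise a guess; the helper name is hypothetical.

```python
def derive_uda_symbols(update_fn, overrides=None):
    """Fill in the optional UDA symbols from the required UpdateFn."""
    if "Update" not in update_fn:
        raise ValueError("cannot derive symbols from %r" % update_fn)
    head, _, tail = update_fn.rpartition("Update")
    symbols = {"UpdateFn": update_fn}
    for role in ("Init", "Serialize", "Merge", "Finalize"):
        symbols[role + "Fn"] = head + role + tail
    # Symbols the user specified by name take precedence.
    symbols.update(overrides or {})
    return symbols


derived = derive_uda_symbols("CountUpdateFn")
custom = derive_uda_symbols("CountUpdateFn", {"MergeFn": "MyMerge"})
```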
Change-Id: Iefc5741293050f5b295df28e9d1a7d039ead8675
Reviewed-on: http://gerrit.ent.cloudera.com:8080/513
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>