Commit Graph

17 Commits

Author SHA1 Message Date
Bharath Vissapragada
ef0dac661c IMPALA-2843: Persist hive udfs across catalog restarts
This commit adds a new feature to persist hive/java udfs across
catalog restarts. IMPALA-1748 already added this for non-java
udfs by storing them in parameters map of the Db object and
reading them back at catalog startup. However we follow a
different approach for hive udfs by converting them to Hive's
function format and adding them as hive functions to the metastore.
This makes it possible to share udfs between hive and Impala as the
udfs added from one service are accessible to other. This commit
takes care of format conversions between hive and impala and user
can just add function once in either of the services.

Background: Hive and impala treat udfs differently. Hive resolves the
evaluate function in the udf class at runtime depending on the data
types of the input arguments. So user can add one function by name and
can pass any arguments to it as long as there is a compatible evaluate
function in the udf class. However Impala takes the input types of the
udf as a part of function definition (that maps to only one evaluate
function) and loads the function only for those set of input argument
types. If we have multiple 'evaluate' methods, we need to add multiple
functions one for each of them.

This commit adds new variants of CREATE | DROP FUNCTIONS  to Impala which
lets the user to create and drop hive/java udfs without input argument
types or return types. Catalog takes care of loading/dropping the udf
signatures corresponding to each "evaluate" method in the udf symbol
class. The syntax is as follows,

CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts>
DROP FUNCTION [IF EXISTS] <function name>

Examples:

CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;

The older way of creating hive/java udfs with specific signature is still supported,
however they are *not* persisted across restarts. So a restart of catalog can
wipe them out. Additionally this commit also loads all the compatible java udfs
added outside of Impala and they needn't be separately loaded. One thing
to note here is that the functions added using the new CREATE FUNCTION
can only be dropped using the new DROP FUNCTION syntax (without
signature). The same rule applies for the java udfs added using the old
CREATE FUNCTION syntax (with signature).

Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d
Reviewed-on: http://gerrit.cloudera.org:8080/2250
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 23:04:03 -08:00
Michael Ho
f0c2742641 IMPALA-2004: Implement "SHOW CREATE" for udfs and udas.
This patch extends the SHOW statement to also support
user-defined functions and user-defined aggregate functions.
The syntax of the new SHOW statements is as follows:

SHOW CREATE [AGGREGATE] FUNCTION [<db_name>.]<func_name>;

<db_name> and <func_name> are the names of the database
and udf/uda respectively.

Sample outputs of the new SHOW statements are as follows:

Query: show create function fn
+------------------------------------------------------------------+
| result                                                           |
+------------------------------------------------------------------+
| CREATE FUNCTION default.fn()                                     |
|  RETURNS INT                                                     |
|  LOCATION 'hdfs://localhost:20500/test-warehouse/libTestUdfs.so' |
|  SYMBOL='_Z2FnPN10impala_udf15FunctionContextE'                  |
|                                                                  |
+------------------------------------------------------------------+

Query: show create aggregate function agg_fn
+------------------------------------------------------------------------------------------+
| result                                                                                   |
+------------------------------------------------------------------------------------------+
| CREATE AGGREGATE FUNCTION default.agg_fn(INT)                                            |
|  RETURNS BIGINT                                                                          |
|  LOCATION 'hdfs://localhost:20500/test-warehouse/libudasample.so'                        |
|  UPDATE_FN='_Z11CountUpdatePN10impala_udf15FunctionContextERKNS_6IntValEPNS_9BigIntValE' |
|  INIT_FN='_Z9CountInitPN10impala_udf15FunctionContextEPNS_9BigIntValE'                   |
|  MERGE_FN='_Z10CountMergePN10impala_udf15FunctionContextERKNS_9BigIntValEPS2_'           |
|  FINALIZE_FN='_Z13CountFinalizePN10impala_udf15FunctionContextERKNS_9BigIntValE'         |
|                                                                                          |
+------------------------------------------------------------------------------------------+

Please note that all the overloaded functions which match
the given function name and category will be printed.

This patch also extends the python test infrastructure to
support expected results which include newline characters.
A new subsection comment called 'MULTI_LINE' has been added
for the 'RESULT' section. With this comment, a test can
include its multi-line output inside [ ] and the content
inside [ ] will be treated as a single line, including the
newline character.

Change-Id: Idbe433eeaf5e24ed55c31d905fea2a6160c46011
Reviewed-on: http://gerrit.cloudera.org:8080/1271
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-10-23 05:11:07 +00:00
Dan Hecht
c8fb10f50a S3: Some more work toward enabling additional S3 test coverage
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3 to help see what coverage is missing.  Soon
we'll be reworking some tests and/or adding new tests to get back the
important gaps.

Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables.  This is a step toward enabling some
more tests against S3.

Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this fails since the ms db is in
use.

Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-03 08:29:13 +00:00
Alex Behm
de75278125 Add SHOW ANALYTIC FUNCTIONS and additional analysis checks.
Change-Id: Ic1aac60fb9b094349b9cfbec68608ac50fc5660c
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4298
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-09-13 00:19:21 -07:00
Nong Li
1cab95066d Add the return type as a column for SHOW FUNCTIONS.
Also includes some misc pattern matching cleanup.

Change-Id: I6c9ec78b094a73864b4d669afbd75a48c9bf9585
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2199
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2271
2014-04-17 17:58:13 -07:00
Lenni Kuff
6bba0c8ffe Fix bug cleaning up removed Functions and fix test_ddl to create all test dbs
When dropping functions, we neeed to remove the function from the list
of Functions with that name AND remove the list from the Function map if
the list is empty. The second part wasn't happening.

Also fixes the test_ddl to properly create all test databases.

Change-Id: Id85af7d5db74a31161f48bea3816bdf734063133
Reviewed-on: http://gerrit.ent.cloudera.com:8080/952
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:00 -08:00
Lenni Kuff
39f77b8b8f Add support for cluster-synchronized catalog operations
This change adds support for cluster-synchronized catalog operations. This provides the
guaranteethat after a catalog op completes, all other subscribers to the catalog topic have
also processed that update. This is useful when load balancing, because a common workflow
is to target a different impalad for each statement executed.
For example if each of the following were executed sequentially, but targeting
a different node:
1) CREATE TABLE Foo
2) INSERT INTO Foo
3) SELECT * FROM Foo
4) INSERT INTO Foo ....

Since both the INSERT and the CREATE update the catalog, it would not work as expected
without this patch. The user might either get a "table not found" error or would be
missing partition information from the INSERT.

The downside is that this approach to DDL takes a bit longer because we need to wait
until all subscribers have processed an update. If all nodes are healthy, this overhead
should not be significantly longer than the current DDL time. However, a single bad node
might slow down or completely block the completion of all DDL operations. By default
this feature is disabled, but it can be enabled using a new query option: SYNCED_DDL=1

To test this, the base test suite was updated to support selecting a random impalad
to execute each query section in a query test file. This is currently only enabled
for the insert and DDL tests, but could be leveraged by more tests in the future.

TODO: Add additional failure tests around this functionality.
TODO: Add an explicit "sync" statement so users do not need to run all their DDL
in this mode (since it is slower).

Change-Id: I45e757a931bf2a4740cc0cdd1e76ce49a1e22b83
Reviewed-on: http://gerrit.ent.cloudera.com:8080/899
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:58 -08:00
Nong Li
601f24a198 UDA execution loose ends.
Unfortunately, the BE does not have the codegen path to execute UDAs.
This puts some restrictions on the UDAs we can run.

- No IR UDAs
- No varargs
- Must have 8 arguments or less.

The code to do this is almost all there for UDFs but I'm not sure I'll get to it.

Change-Id: I8a06e635a9138397c8474a5704c3e588bb92347b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/703
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:38 -08:00
Nong Li
a944a1fe52 'Invalidate metadata' no longer clears user functions.
Change-Id: I36de18fefa1d515a7960c2bf8c116d5217c388d6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/726
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:36 -08:00
Lenni Kuff
01c8c43fec Uniquify FUNCTION catalog topic entry keys by including parent database name
Change-Id: I6aa49520f548ddfcd557e2f908a09be454765e8c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/698
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:53:29 -08:00
Nong Li
6b9a7de02e Add symbol resolution during analysis for create function stmts.
Before this, we had to specify the entire mangled symbol. This can be quite
long and quite tedious (take a look at some of the create UDA test cases that
specify all the symbols).

This patch adds some code to convert from the user function signature to the
mangled name. This means the user can specify the unmangled name and we can
do the symbol lookup. The mangling rules are pretty convoluted but if it is
messed up, the user can always specify the full symbol.

Some other minor cleanup in:
  - JNI from FE to BE
  - UDFs/UDAs that are loaded as test data

Change-Id: I733dbf3a72cb7b06221c27e622d161bcca0d74a8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/624
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:53:20 -08:00
Nong Li
4bb1e8c854 Add varargs to UDF/UDA parser/analyzer.
Change-Id: I4c3f2e74f6c29cee4b0b787c058b0455b16a11fd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/548
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:05 -08:00
Nong Li
e39de94316 Add parser/analysis to support UDAs.
I looked around some and I think having create/drop/show [aggregate] function
seems reasonable and extends nicely for UDTs.

The create aggregate function can accept a lot of arguments. The non-essential one, I
went with resolving them by name rather than position (i.e. argName="value"). I think
this is better for the user than specifying it by position.

The grammar is:
CREATE AGGREGATE <name>(<arg_types>) RETURNS <type> [INTERMEDIATE <type>]
LOCATION '/path' UpdateFn='Fn' [comment='comment']
[SerializeFn='symbol'] [MergeFn='symbol'] [InitFn='symbol'] [FinalizeFn='symbol']

The optional args at the end can be in any order. If the other symbols are not
specified, we derive them from the UpdateFn symbol that's required. The analyzer
would try to figure it out and fail if we can't find the derived symbol in the binary.

The simplest example would be:
CREATE AGGREGATE FUNCTION count(float) RETURNS BIGINT LOCATION '/path'
UpdateFn='CountUpdateFn';

In which case we assume the intermediate type is the return type and the other functions
are called 'CountInitFn', 'CountSerializeFn', 'CountMergeFn' 'CountFinalizeFn'.

Change-Id: Iefc5741293050f5b295df28e9d1a7d039ead8675
Reviewed-on: http://gerrit.ent.cloudera.com:8080/513
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:59 -08:00
Nong Li
a0bf45a0b4 Add udf type.
Change-Id: Ic5f52c127750cc9c847a3e34d3fdcfc78bee5a8a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/454
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:48 -08:00
Nong Li
308650f208 Fix create function ddl test setup issue.
Change-Id: I30c9a4342efbdb17bd53fb14bdcee172506cdadb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/447
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:44 -08:00
Nong Li
8eb727b585 UDF ddl cleanup
Change-Id: I381fed277b5809727d2d8bf430258c01d2d0ae1f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/436
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:43 -08:00
Nong Li
2394ae2e66 UDF parsing and analysis.
Change-Id: If8058c1cb66bf5e9c7049d4b78f5882b46c03fc1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/318
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:32 -08:00