Commit Graph

149 Commits

Author SHA1 Message Date
Eyizoha
3af1930229 IMPALA-12322: Support converting UTC timestamps read from Kudu to local time
This patch adds a query option 'convert_kudu_utc_timestamps' similar to
'convert_legacy_hive_parquet_utc_timestamps'. When enabled, it converts
UTC timestamps read from Kudu to local timestamps.

The corresponding modification also include predicate pushdown and
runtime filter. Due to the ambiguity of timestamps caused by daylight
saving time changes, it is difficult to resolve in the bloom filter.
This patch additionally introduces a query option
'disable_kudu_local_timestamp_bloom_filter' to default disable the Kudu
timestamp bloom filter after enabling time zone conversion in order to
avoid erroneously filtering out data. However, for regions that do not
observe daylight saving time, it can be set to false to re-enable the
Kudu local timestamp bloom filter.

Testing:
- Add TestKuduTimestampConvert in query_test/test_kudu.py
Perform end-to-end testing in a custom cluster, including basic Kudu UTC
timestamp conversion testing, as well as checking if related predicate
pushdown and runtime filters are working correctly (even with timestamps
involving daylight saving time conversions).

Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3
Reviewed-on: http://gerrit.cloudera.org:8080/20681
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2023-12-14 13:19:35 +00:00
Csaba Ringhofer
7ca11dfc7f IMPALA-9482: Support for BINARY columns
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 13:55:42 +00:00
Csaba Ringhofer
f24296aed5 IMPALA-11355: Add STRING overloads for hour/minute/second/millisecond
IMPALA-9531 dropped support for "dateless timestamps",
e.g. cast("12:05:05" as timestamp) now returns NULL.

This led to breaking functions like minute("12:05:05"), as minute()
expects a timestamp, and Impala adds an implicit cast, so what actually
happens is minute(cast("12:05:05" as timestamp)), which returns NULL.

This change adds overloads for similar functions that take STRING
instead of TIMESTAMP parameter. The same functions already take a
STRING parameter in Hive and mySQL.

The changes in the parser mainly restore code removed in IMPALA-9531.

Note that these functions could be potentially optimized by returning
parts of the parse result without converting them to boost time first,
but this is not done here to make the change minimal.

Testing:
- restored related tests in expr-test and added some new ones for
  malformed time-of-day strings
- added benchmarks for the new overloads and fixed the ones for the
  old functions (they tested NULL)

Change-Id: I6cc1c851ee71ab4fcc58105c7e9931155a483679
Reviewed-on: http://gerrit.cloudera.org:8080/18718
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-14 21:03:43 +00:00
Michael Smith
2243f331cb IMPALA-11274: CNF Rewrite causes a regress in join node performance
This patch defines a subset of all predicates that are common and
relatively inexpensive to compute. Such predicates must involve
columns, constants, simple math or cast functions only.

Examples of the subset of the predicates allowed:

  1. (a = 1 AND cast(b as int) = 2) OR (c = d AND e = f)
  2. a in ('1', '2', '3') OR ((b = 'abc') AND (c = d))
  3. (a between 1 and 100) OR ((b is null) AND (c = d))

Examples of the predicates not allowed:

  1. (upper(a) != 'Y') AND b = 2) OR (c = d AND e = f)
  2. (coalesce(CAST(a AS string), '') = '') AND b = 2) OR
     (c = d AND e = f)

This patch further restricts the predicates to be converted to
conjunctive normal form (CNF) to be such a subset, with the aim to
reduce the run-time evaluation overhead of CNFs in which some
of the predicates can be duplicated.

Uses a cache in branching expressions to avoid visiting the entire
subtree on each call to applyRuleBottomUp. Skips cache complexity on
casts as they don't branch and are unlikely to be deeply nested.

Testing:
- New expression writer tests
- New planner tests

Change-Id: I326406c6b004fe31ec0e2a2f390a3845b8925aa9
Reviewed-on: http://gerrit.cloudera.org:8080/18458
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-25 05:37:17 +00:00
stiga-huang
35375b3287 IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

The above cases will be focus in follow-up JIRAs. This patch addes the
initial implementation of UTF-8 aware case conversion functions.

--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the
bytes in strings are converted to wide characters using std::mbrtowc().
Each wide character (wchar_t) will then be converted using std::towupper
or std::towlower correspondingly. We then convert them back to multi
bytes using std::wcrtomb().

Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ascii characters, which will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching impala daemons
for better debugging. We will support customized locale in IMPALA-11080.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Reviewed-on: http://gerrit.cloudera.org:8080/17785
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-15 18:40:59 +00:00
pranav.lodha
bde995483a IMPALA-955: BYTES built-in function
The Bytes function returns the number of bytes contained
in the specified byte string. There are changes in
4 files. A few testcases are also added in
be/src/exprs/expr-test.cc and an end-to end test in
testdata/workloads/functional-query/queries/QueryTest/exprs.test.

Change-Id: I0bd06c3d6dba354d71f63c649eaa8f9f74d266ee
Reviewed-on: http://gerrit.cloudera.org:8080/18210
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-11 07:01:58 +00:00
stiga-huang
3850d49711 IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example, by default, it will mask uppercase to 'X', lowercase to 'x',
digits to 'n' and leave other characters unmasked. For masking all
characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte-to-byte, which have
inconsistent results with Hive when the string contains unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded into 3 bytes in UTF-8 so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-15 05:04:07 +00:00
Amogh Margoor
8e7a6227a4 IMPALA-10730: Add MD5 fuction to compute 128-bit checksum for a string.
Built-in function has been added to compute MD5 128-bit checksum for
a non-null string. If input string is null, then output of the
function is null too. In FIPS mode, MD5 is disabled and function
will throw error on invocation.

Testing:
1. Added expression unit tests.
2. Added end-to-end tests for MD5.

Change-Id: Id406d30a7cc6573212b302fbfec43eb848352ff2
Reviewed-on: http://gerrit.cloudera.org:8080/17567
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-10 22:24:16 +00:00
Fucun Chu
ce21fe74b8 IMPALA-10689: Implement ds_cpc_union_f() function.
This function receives two strings that are serialized Apache
DataSketches CPC sketches. Union two sketches and returns the
resulting sketch of union.

Example:
select ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2))
from sketch_tbl;
+---------------------------------------------------+
| ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) |
+---------------------------------------------------+
| 15                                                |
+---------------------------------------------------+

Change-Id: Ib5c616316bf2bf2ff437678e9a44a15339920150
Reviewed-on: http://gerrit.cloudera.org:8080/17440
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-07 12:53:23 +00:00
Fucun Chu
67de4a48b0 IMPALA-10688: Implement ds_cpc_stringify() function
This function receives a string that is a serialized Apache
DataSketches CPC sketch and returns its stringified format.

A stringified format should look like and contains the following data:

select ds_cpc_stringify(ds_cpc_sketch(float_col)) from
functional_parquet.alltypestiny;
+--------------------------------------------+
| ds_cpc_stringify(ds_cpc_sketch(float_col)) |
+--------------------------------------------+
| ### CPC sketch summary:                    |
|    lg_k           : 11                     |
|    seed hash      : 93cc                   |
|    C              : 2                      |
|    flavor         : 1                      |
|    merged         : true                   |
|    intresting col : 0                      |
|    table entries  : 2                      |
|    window         : not allocated          |
| ### End sketch summary                     |
|                                            |
+--------------------------------------------+

Change-Id: I8c9d089bfada6bebd078d8f388d2e146c79e5285
Reviewed-on: http://gerrit.cloudera.org:8080/17373
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2021-05-26 06:42:02 +00:00
Amogh Margoor
86beb2f9dd IMPALA-10679: Add builtin functions to comptute SHA-1 and SHA-2 digest.
Built-in functions to compute SHA-1 digest and SHA-2 family of digest
has been added. Support for SHA2 digest includes SHA224, SHA256,
SHA384 and SHA512. In FIPS mode SHA1, SHA224 and SHA256 have been
disabled and will throw error. SHA2 functions will also throw error
for unsupported bit length i.e., bit length apart from 224, 256, 384,
512.

Testing:
1. Added Unit test for expressions.
2. Added end-to-end test for new functions.

Change-Id: If163b7abda17cca3074c86519d59bcfc6ace21be
Reviewed-on: http://gerrit.cloudera.org:8080/17464
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-18 18:48:10 +00:00
Fucun Chu
e39c30b3cd IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
These functions can be used to get cardinality estimates of data
using CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
Figures-of-Merit Comparison of the HLL and CPC Sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare perfomance
   with ndv(). Depending on data characteristics ndv() appears 2x-3x
   faster. CPC gives closer estimate than current ndv(). CPC is more
   accurate than HLL in some cases

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Reviewed-on: http://gerrit.cloudera.org:8080/16656
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-11 18:07:40 +00:00
Fucun Chu
77d6acd032 IMPALA-10581: Implement ds_theta_intersect_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. Computes the intersection of two sketches
of same or different column and returns the resulting sketch of
intersection.

Example:
select ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2))
from sketch_tbl;
+-----------------------------------------------------------+
| ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) |
+-----------------------------------------------------------+
| 5                                                         |
+-----------------------------------------------------------+

Change-Id: I335eada00730036d5433775cfe673e0e4babaa01
Reviewed-on: http://gerrit.cloudera.org:8080/17186
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-29 15:59:49 +00:00
Fucun Chu
622e3c95ad IMPALA-10580: Implement ds_theta_union_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. Union two sketches and returns the
resulting sketch of union.

Example:
select ds_theta_estimate(ds_theta_union_f(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) |
+-------------------------------------------------------+
| 15                                                    |
+-------------------------------------------------------+

Change-Id: I8329979b81ceeaad739a43fab79768ca9c2916fa
Reviewed-on: http://gerrit.cloudera.org:8080/17179
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-24 15:16:07 +00:00
Fucun Chu
3e82501531 IMPALA-10558: Implement ds_theta_exclude() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. Computes the a-not-b set operation given
two sketches of same or different column.

Example:
select ds_theta_estimate(ds_theta_exclude(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) |
+-------------------------------------------------------+
| 5                                                     |
+-------------------------------------------------------+

Change-Id: I05119fd8c652c07ff248a99e44b0da3541e46ca3
Reviewed-on: http://gerrit.cloudera.org:8080/17153
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-17 22:14:44 +00:00
Fucun Chu
65c6a81ed9 IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions
These functions can be used to get cardinality estimates of data
using Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theta_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch25_parquet.lineitem to compare perfomance
   with ds_hll_*. ds_theta_* is faster than ds_hll_* on the original
   data, the difference is around 1%-10%. ds_hll_estimate() is faster
   than ds_theta_estimate() on existing sketch. HLL and Theta gives
   closer estimate except for string. see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Reviewed-on: http://gerrit.cloudera.org:8080/17008
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-17 17:09:48 +00:00
stiga-huang
e8720b40f1 IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
A unicode character can be encoded into 1-4 bytes in UTF-8. String
functions will return undesired results when the input contains unicode
characters, because we deal with a string as a byte array. For instance,
length() returns the length in bytes, not in unicode characters.

UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support in some string functions so they can have
UTF-8 aware behavior. For compatibility with the old versions, a new
query option, UTF8_MODE, is added for turning on/off the UTF-8 aware
behavior. Currently, only length(), substring() and reverse() support
it. Other function supports will be added in later patches.

String functions will check the query option and switch to use the
desired implementation. It's similar to how we use the decimal_v2 query
option in builtin functions.

For easy testing, the UTF-8 aware version of string functions are
also exposed as builtin functions (named by utf8_*, e.g. utf8_length).

Tests:
 - Add BE tests for utf8 functions.
 - Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 00:43:39 +00:00
stiga-huang
9bb7157bf0 IMPALA-10387: Add missing overloads of mask functions used in Ranger default masking policies
The mask functions in Hive are implemented through GenericUDFs which can
accept an infinite number of function signatures. Impala currently don't
support GenericUDFs. So we provide builtin mask functions with limited
overloads.

This patch adds some missing overloads that could be used by Ranger
default masking policies, e.g. MASK_HASH, MASK_SHOW_LAST_4,
MASK_DATE_SHOW_YEAR, etc.

Tests:
 - Add test coverage on all default masking policies applied on all
   supported types.

Change-Id: Icf3e70fd7aa9f3b6d6b508b776696e61ec1fcc2e
Reviewed-on: http://gerrit.cloudera.org:8080/16930
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-15 13:01:53 +00:00
Gabor Kaszab
6cb7cecacf IMPALA-10237: Support Bucket and Truncate partition transforms as built-in functions
This patch implements Truncate and Bucket partition transforms in
Impala BE as built-in functions. The expectation is that these
functions give the same result as Iceberg's implementation of the same
functions. These built-in functions are invisible so users won't be
able to invoke them e.g. from impala-shell.

Truncate:
  - Supported types are IntVal, BigIntVal, StringVal, DecimalVal.
  - Receives an input from the above types and a width.
  - Returns the same type as the input.
  - Expected behaviour is explained here:
    https://iceberg.apache.org/spec/#truncate-transform-details
Bucket:
  - Supported types are IntVal, BigIntVal, StringVal, DecimalVal,
    DateVal, TimestampVal.
  - Receives an input from the above types and the number of buckets as
    IntVal.
  - Returns IntVal.
  - Expected behaviour is explained here:
    https://iceberg.apache.org/spec/#bucket-transform-details

Change-Id: I485680cf79d96d578dd8cfbfd554bec468fe84bd
Reviewed-on: http://gerrit.cloudera.org:8080/16741
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-19 03:09:06 +00:00
Fucun Chu
8ea49e9b02 IMPALA-10134: Implement ds_hll_union_f() function.
This function receives two strings that are serialized Apache DataSketches
HLL sketches. Union two sketches and returns the resulting sketch of union.

Example:
select ds_hll_estimate(ds_hll_union_f(i_i, h_i))
from hll_sketches_impala_hive2;
+-------------------------------------------+
| ds_hll_estimate(ds_hll_union_f(i_i, h_i)) |
+-------------------------------------------+
| 7                                         |
+-------------------------------------------+

Change-Id: Ic06e959ed956af5cedbfc7d4d063141d5babb2a8
Reviewed-on: http://gerrit.cloudera.org:8080/16711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-25 14:33:10 +00:00
Fucun Chu
193c2e773f IMPALA-10132: Implement ds_hll_estimate_bounds_as_string() function.
This function receives a string that is a serialized Apache DataSketches
HLL sketch and optional kappa that is a number of standard deviations
from the mean: 1, 2 or 3 (default 2). Returns estimate and bounds with
the values separated with commas.
The result is three values: estimate, lower bound and upper bound.

   ds_hll_estimate_bounds_as_string(sketch [, kappa])

Kappa:
 1 represent the 68.3% confidence bounds
 2 represent the 95.4% confidence bounds
 3 represent the 99.7% confidence bounds

Note, ds_hll_estimate_bounds() should return an Array of doubles as
the result but with that we have to wait for the complex type support.
Until, we provide ds_hll_estimate_bounds_as_string() that can be
deprecated once we have array support. Tracking Jira for returning
complex types from functions is IMPALA-9520.

Example:
select ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) from
functional_parquet.alltypestiny;
+----------------------------------------------------------+
| ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) |
+----------------------------------------------------------+
| 2,2,2.0002                                               |
+----------------------------------------------------------+

Change-Id: I46bf8263e8fd3877a087b9cb6f0d1a2392bb9153
Reviewed-on: http://gerrit.cloudera.org:8080/16626
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-29 17:45:01 +00:00
Adam Tamas
99e5f5a885 IMPALA-10133:Implement ds_hll_stringify function.
This function receives a string that is a serialized Apache DataSketches
HLL sketch and returns its stringified format.

A stringified format should look like and contains the following data:

select ds_hll_stringify(ds_hll_sketch(float_col)) from
functional_parquet.alltypestiny;
+--------------------------------------------+
| ds_hll_stringify(ds_hll_sketch(float_col)) |
+--------------------------------------------+
| ### HLL sketch summary:                    |
|   Log Config K   : 12                      |
|   Hll Target     : HLL_4                   |
|   Current Mode   : LIST                    |
|   LB             : 2                       |
|   Estimate       : 2                       |
|   UB             : 2.0001                  |
|   OutOfOrder flag: false                   |
|   Coupon count   : 2                       |
| ### End HLL sketch summary                 |
|                                            |
+--------------------------------------------+

Change-Id: I85dbf20b5114dd75c300eef0accabe90eac240a0
Reviewed-on: http://gerrit.cloudera.org:8080/16382
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-03 12:11:48 +00:00
Adam Tamas
4cb3c3556e IMPALA-10108: Implement ds_kll_stringify function
This function receives a string that is a serialized Apache DataSketches
KLL sketch and returns its stringified format.

A stringified format should look like and contains the following data:

select ds_kll_stringify(ds_kll_sketch(float_col))
from functional_parquet.alltypestiny;
+--------------------------------------------+
| ds_kll_stringify(ds_kll_sketch(float_col)) |
+--------------------------------------------+
| ### KLL sketch summary:                    |
|    K              : 200                    |
|    min K          : 200                    |
|    M              : 8                      |
|    N              : 8                      |
|    Epsilon        : 1.33%                  |
|    Epsilon PMF    : 1.65%                  |
|    Empty          : false                  |
|    Estimation mode: false                  |
|    Levels         : 1                      |
|    Sorted         : false                  |
|    Capacity items : 200                    |
|    Retained items : 8                      |
|    Storage bytes  : 64                     |
|    Min value      : 0                      |
|    Max value      : 1.1                    |
| ### End sketch summary                     |
|                                            |
+--------------------------------------------+

Change-Id: I97f654a4838bf91e3e0bed6a00d78b2c7aa96f75
Reviewed-on: http://gerrit.cloudera.org:8080/16370
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-02 10:49:10 +00:00
Gabor Kaszab
28d94851b1 IMPALA-10020: Implement ds_kll_cdf_as_string() function
This is the support for Cumulative Distribution Function (CDF) from
Apache DataSketches KLL algorithm collection. It receives a serialized
KLL sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), (-inf, 5), (-inf, 10), (-inf, +inf)
Returns a comma separated string where each value in the string is a
number in the range of [0,1] and shows that what percentage of the
data is in the particular ranges.

Note, ds_kll_cdf() should return an Array of doubles as the result but
with that we have to wait for the complex type support. Until, we
provide ds_kll_cdf_as_string() that can be deprecated once we
have array support. Tracking Jira for returning complex types from
functions is IMPALA-9520.

Example:
select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.2,0.401644,1,1                                         |
+----------------------------------------------------------+

Change-Id: I77e6afc4556ad05a295b89f6d06c2e4a6bb2cf82
Reviewed-on: http://gerrit.cloudera.org:8080/16359
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-26 10:59:49 +00:00
Gabor Kaszab
a8a35edbc4 IMPALA-10019: Implement ds_kll_pmf_as_string() function
This is the support for Probabilistic Mass Function (PMF) from Apache
DataSketches KLL algorithm collection. It receives a serialized KLL
sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), [1, 5), [5, 10), [10, +inf)
Returns a comma separated string where each value in the string is a
number in the range of [0,1] and shows that what percentage of the
data is in the particular ranges.

Note, ds_kll_pmf() should return an Array of doubles as the result but
with that we have to wait for the complex type support. Until, we
provide ds_kll_pmf_as_string() that can be deprecated once we
have array support. Tracking Jira for returning complex types from
functions is IMPALA-9520.

Example:
select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.202192,0.199452,0.598356,0                             |
+----------------------------------------------------------+

Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba
Reviewed-on: http://gerrit.cloudera.org:8080/16336
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-26 01:50:16 +00:00
Gabor Kaszab
41065845e9 IMPALA-9962: Implement ds_kll_quantiles_as_string() function
This function is very similar to ds_kll_quantile() but this one can
receive any number of rank parameters and returns a comma separated
string that holds the results for all of the given ranks.
For more details about ds_kll_quantile() see IMPALA-9959.

Note, ds_kll_quantiles() should return an Array of floats as the result
but with that we have to wait for the complex type support. Until, we
provide ds_kll_quantiles_as_string() that can be deprecated once we
have array support. Tracking Jira for returning complex types from
functions is IMPALA-9520.

Change-Id: I76f6039977f4e14ded89a3ee4bc4e6ff855f5e7f
Reviewed-on: http://gerrit.cloudera.org:8080/16324
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-25 18:06:22 +00:00
Gabor Kaszab
0736fcf691 IMPALA-10018: Implement ds_kll_rank() function
ds_kll_rank() receives two parameters: a STRING that represents a
serialized DataSketches KLL sketch and a float to provide a probing
value in the sketch.
Returns a DOUBLE that is the rank of the given probing value in the
range of [0,1]. E.g. a return value of 0.2 means that the probing value
given as parameter is greater than the 20% of all the values in the
sketch. Note, this is an approximate calculation.

Change-Id: I95857886dfbb8c84aeeaf718c0e610012fda4be0
Reviewed-on: http://gerrit.cloudera.org:8080/16283
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-07 12:58:40 +00:00
Gabor Kaszab
87aeb2ad78 IMPALA-9963: Implement ds_kll_n() function
This function receives a serialized Apache DataSketches KLL sketch
and returns how many input values were fed into this sketch.

Change-Id: I166e87a468e68e888ac15fca7429ac2552dbb781
Reviewed-on: http://gerrit.cloudera.org:8080/16259
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-06 19:15:04 +00:00
Gabor Kaszab
033a4607e2 IMPALA-9959: Implement ds_kll_sketch() and ds_kll_quantile() functions
ds_kll_sketch() is an aggregate function that receives a float
parameter (e.g. a float column of a table) and returns a serialized
Apache DataSketches KLL sketch of the input data set wrapped into
STRING type. This sketch can be saved into a table or view and later
used for quantile approximations. ds_kll_quantile() receives two
parameters: a STRING parameter that contains a serialized KLL sketch
and a DOUBLE that represents the rank of the quantile in the range of
[0,1]. E.g. rank=0.1 means the approximate value in the sketch where
10% of the sketched items are less than or equals to this value.

Testing:
  - Added automated tests on small data sets to check the basic
    functionality of sketching and getting a quantile approximate.
  - Tested on TPCH25_parquet.lineitem to check that sketching and
    approximating works on bigger scale as well where serialize/merge
    phases are also required. On this scale the error range of the
    quantile approximation is within 1-1.5%

Change-Id: I11de5fe10bb5d0dd42fb4ee45c4f21cb31963e52
Reviewed-on: http://gerrit.cloudera.org:8080/16235
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-31 14:34:49 +00:00
Adam Tamas
1bafb7bd29 IMPALA-9531: Dropped support for dateless timestamps
Removed the support for dateless timestamps.
During dateless timestamp casts if the format doesn't contain
date part we get an error during tokenization of the format.
If the input str doesn't contain a date part then we get null result.

Examples:
select cast('01:02:59' as timestamp);
This will come back as NULL value.

select to_timestamp('01:01:01', 'HH:mm:ss');
select cast('01:02:59' as timestamp format 'HH12:MI:SS');
select cast('12 AM' as timestamp FORMAT 'AM.HH12');
These will come back with a parsing errors.

Casting from a table will generate similar results.

Testing:
Modified the previous tests related to dateless timestamps.
Added test to read fromtables which are still containing dateless
timestamps and covered timestamp to string path when no date tokens
are requested in the output string.

Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Reviewed-on: http://gerrit.cloudera.org:8080/15866
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2020-07-08 19:32:15 +00:00
Gabor Kaszab
7e456dfa9d IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2020-07-07 14:11:21 +00:00
stiga-huang
0936384271 IMPALA-9010: Add builtin mask functions
There're 6 builtin GenericUDFs for column masking in Hive:
  mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_hash(value)
  mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar,
      dayValue, monthValue, yearValue)

Description of the parameters:
   value      - value to mask. Supported types: TINYINT, SMALLINT, INT,
                BIGINT, STRING, VARCHAR, CHAR, DATE(only for mask()).
   charCount  - number of characters. Default value: 4
   upperChar  - character to replace upper-case characters with. Specify
                -1 to retain original character. Default value: 'X'
   lowerChar  - character to replace lower-case characters with. Specify
                -1 to retain original character. Default value: 'x'
   digitChar  - character to replace digit characters with. Specify -1
                to retain original character. Default value: 'n'
   otherChar  - character to replace all other characters with. Specify
                -1 to retain original character. Default value: -1
   numberChar - character to replace digits in a number with. Valid
                values: 0-9. Default value: '1'
   dayValue   - value to replace day field in a date with.
                Specify -1 to retain original value. Valid values: 1-31.
                Default value: 1
   monthValue - value to replace month field in a date with. Specify -1
                to retain original value. Valid values: 0-11. Default
                value: 0
   yearValue  - value to replace year field in a date with. Specify -1
                to retain original value. Default value: 1

In Hive, these functions accept variable length of arguments in
non-restricted types:
   mask_show_first_n(val)
   mask_show_first_n(val, 8)
   mask_show_first_n(val, 8, 'X', 'x', 'n')
   mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2)
   mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9')
The arguments of upperChar, lowerChar, digitChar, otherChar and
numberChar can be in string or numeric types.

Impala doesn't support Hive GenericUDFs, so we are lack of these mask
functions to support Ranger column masking policies. On the other hand,
we want the masking functions to be evaluated in the C++ builtin logic
rather than calling out to java UDFs for performance. This patch
introduces our builtin implementation of them.

We currently don't have a corresponding framework for GenericUDF
(IMPALA-9271), so we implement these by overloads. However, it may
requires hundreds of overloads to cover all possible combinations. We
just implement some important overloads, including
 - those used by Ranger default masking policies,
 - those with simple arguments and may be useful for users,
 - an overload with all arguments in int type for full functionality.
   Char argument need to be converted to their ASCII value.

Tests:
 - Add BE tests in expr-test

Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664
Reviewed-on: http://gerrit.cloudera.org:8080/14963
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-17 15:34:34 +00:00
norbert.luksa
a862282811 IMPALA-8709: Add Damerau-Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate restricted
Damerau-Levenshtein edit distance (optimal string alignment).
Implmented as dle_dst() and damerau_levenshtein(). If either value is
NULL or both values are NULL returns NULL which differs from Netezza's
dle_dst() which returns the length of the not NULL value or 0 if both
values are NULL. The NULL behavior matches the existing levenshtein()
function.

Also cleans up levenshtein tests.

Testing:
- Added unit tests to expr-test.cc
- Manual testing on over 1400 string pairs from
  http://marvin.cs.uidaho.edu/misspell.html and results match Netezza

Change-Id: Ib759817ec15e7075bf49d51e494e45c8af4db94d
Reviewed-on: http://gerrit.cloudera.org:8080/13794
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-22 21:39:21 +00:00
luksan47
8db7f27ddd IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function
The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calcuates the Jaro-Similarity of the
strings, then adds more weight to the result if there are
common prefixes. (Jaro-Winkler)
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-13 18:25:32 +00:00
Attila Jeges
f40935a30e IMPALA-7369: part 2: Add INTERVAL expr support and built-in functions for DATE
This change implements INTERVAL expression support for DATE type and
adds several DATE related built-in functions. The efficiency of the
DateValue::ToYearMonthDay() function used in many of the built-in
functions below was also improved.

The following functions are supported in Hive:

  INT YEAR(DATE d)
  Extracts year of the 'd' date, returns it as an int in 0-9999 range.

  INT MONTH(DATE d)
  Extracts month of the 'd' date and returns it as an int in 1-12
  range.

  INT DAY(DATE d), INT DAYOFMONTH(DATE d)
  Extracts day-of-month of the 'd' date and returns it as an int in
  1-31 range.

  INT QUARTER(DATE d)
  Extracts quarter of the 'd' date and returns it as an int in 1-4
  range.

  INT DAYOFWEEK(DATE d)
  Extracts day-of-week of the 'd' date and returns it as an int in
  1-7 range. 1 is Sunday and 7 is Saturday.

  INT DAYOFYEAR(DATE d)
  Extracts day-of-year of the 'd' date and returns it as an int in
  1-366 range.

  INT WEEKOFYEAR(DATE d)
  Extracts week-of-year of the 'd' date and returns it as an int in
  1-53 range.

  STRING DAYNAME(DATE d)
  Returns the day field from a 'd' date, converted to the string
  corresponding to that day name. The range of return values is
  "Sunday" to "Saturday".

  STRING MONTHNAME(DATE d)
  Returns the month field from a 'd' date, converted to the string
  corresponding to that month name. The range of return values is
  "January" to "December".

  DATE NEXT_DAY(DATE d, STRING weekday)
  Returns the first date which is later than 'd' and named as
  'weekday'. 'weekday' is 3 letters or full name of the day of the
  week.

  DATE LAST_DAY(DATE d)
  Returns the last day of the month which the 'd' date belongs to.

  INT DATEDIFF(DATE d1, DATE d2)
  Returns the number of days from 'd1' date to 'd2' date.

  DATE CURRENT_DATE()
  Returns the current date (in the local time zone).

  INT INT_MONTHS_BETWEEN(DATE d1, DATE d2)
  Returns the number of months between 'd1' and 'd2' dates, as an int
  representing only the full months that passed.
  If 'd1' represents an earlier date than 'd2', the result is
  negative.

  DOUBLE MONTHS_BETWEEN(DATE d1, DATE d2)
  Returns the number of months between 'd1' and 'd2' dates. Can
  include a fractional part representing extra days in addition to the
  full months between the dates. The fractional component is computed
  by dividing the difference in days by 31 (regardless of the month).
  If 'd1' represents an earlier date than 'd2', the result is
  negative.

  DATE ADD_YEARS(DATE d, INT/BIGINT num_years),
  DATE SUB_YEARS(DATE d, INT/BIGINT num_years)
  Adds/subtracts a specified number of years to a 'd' date value.

  DATE ADD_MONTHS(DATE d, INT/BIGINT num_months),
  DATE SUB_MONTHS(DATE d, INT/BIGINT num_months)
  Adds/subtracts a specified number of months to a date value.
  If 'd' is the last day of a month, the returned date will fall on
  the last day of the target month too.

  DATE ADD_DAYS(DATE d, INT/BIGINT num_days),
  DATE SUB_DAYS(DATE d, INT/BIGINT num_days)
  Adds/subtracts a specified number of days to a date value.

  DATE ADD_WEEKS(DATE d, INT/BIGINT num_weeks),
  DATE SUB_WEEKS(DATE d, INT/BIGINT num_weeks)
  Adds/subtracts a specified number of weeks to a date value.

The following function doesn't exist in Hive but supported by Amazon
Redshift

  INT DATE_CMP(DATE d1, DATE d2)
  Compares 'd1' and 'd2' dates. Returns:
  1. NULL, if either 'd1' or 'd2' is NULL
  2. -1 if d1 < d2
  3. 1 if d1 > d2
  4. 0 if d1 == d2
  (https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_CMP.html)

Change-Id: If404bffdaf055c769e79ffa8f193bac415cfdd1a
Reviewed-on: http://gerrit.cloudera.org:8080/13648
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-25 23:06:25 +00:00
Attila Jeges
f0678b06e6 IMPALA-7369: part 1: Implement TRUNC, DATE_TRUNC, EXTRACT, DATE_PART functions for DATE
These functions are somewhat similar in that each of them takes a DATE
argument and a time unit to work with.

They work identically to the corresponding TIMESTAMP functions. The
only difference is that the DATE functions don't accept time-of-day
units.

TRUNC(DATE d, STRING unit)
Truncates a DATE value to the specified time unit. The 'unit' argument
is case insensitive. This argument string can be one of:
  SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year.
  Q: Quarter.
  MONTH, MON, MM, RM: Month.
  DDD, DD, J: Day.
  DAY, DY, D: Starting day (Monday) of the week.
  WW: Truncates to the most recent date, no later than 'd', which is
      on the same day of the week as the first day of year.
  W: Truncates to the most recent date, no later than 'd', which is on
     the same day of the week as the first day of month.

The impelementation mirrors Impala's TRUNC(TIMESTAMP ts, STRING unit)
function. Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions201.htm
.

DATE_TRUNC(STRING unit, DATE d)
Truncates a DATE value to the specified precision. The 'unit' argument
is case insensitive. This argument string can be one of: DAY, WEEK,
MONTH, YEAR, DECADE, CENTURY, MILLENNIUM.

The implementation mirrors Impala's DATE_TRUNC(STRING unit,
TIMESTAMP ts) function. Vertica has a similar function too.
Reference:
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/
    SQLReferenceManual/Functions/Date-Time/DATE_TRUNC.htm
.

EXTRACT(DATE d, STRING unit), EXTRACT(unit FROM DATE d)
Returns one of the numeric date fields from a DATE value. The 'unit'
string can be one of YEAR, QUARTER, MONTH, DAY. This argument value is
case-insensitive.

The implementation mirrors that Impala's EXTRACT(TIMESTAMP ts,
STRING unit). Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
.

DATE_PART(STRING unit, DATE date)
Similar to EXTRACT(), with the argument order reversed. Supports the
same date units as EXTRACT().

The implementation mirrors Impala's DATE_PART(STRING unit,
TIMESTAMP ts) function.

Change-Id: I843358a45eb5faa2c134994600546fc1d0a797c8
Reviewed-on: http://gerrit.cloudera.org:8080/13363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-05 14:23:23 +00:00
Attila Jeges
b5805de3e6 IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.

This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.

- Explicit casting between DATE and other types (note that invalid
  casts will fail with an error just like invalid DECIMAL_V2 casts,
  while failed casts to other types do no lead to warning or error):
    - from STRING to DATE. The string value must be formatted as
      yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
      the time component is optional. If the time component is
      present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as
      yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time of day
      component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time of day
      component is set to 00:00:00.

- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a
      context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a
      context where a TIMESTAMP value is expected.

- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
  implicit conversions are now all possible, the existing function
  overload resolution logic is not adequate anymore.
  For example, it resolves the
  if(false, '2011-01-01', DATE '1499-02-02') function call to the
  if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
  function, instead of the if(BOOLEAN, DATE, DATE) version.

  This is clearly wrong, so the function overload resolution logic had
  to be changed to resolve function calls to the best-fit overloaded
  function definition if there are multiple applicable candidates.

  An overloaded function definition is an applicable candidate for a
  function call if each actual parameter in the function call either
  matches the corresponding formal parameter's type (without casting)
  or is implicitly castable to that type.

  When looking for the best-fit applicable candidate, a parameter
  match score (i.e. the number of actual parameters in the function
  call that match their corresponding formal parameter's type without
  casting) is calculated and the applicable candidate with the highest
  parameter match score is chosen.

  There's one more issue that the new resolution logic has to address:
  if two applicable candidates have the same parameter match score and
  the only difference between the two is that the first one requires a
  STRING -> TIMESTAMP implicit cast for some of its parameters while
  the second one requires a STRING -> DATE implicit cast for the same
  parameters then the first candidate has to be chosen not to break
  backward compatibility.
  E.g: year('2019-02-15') function call must resolve to
  year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is not
  implemented yet, so this is not an issue at the moment but it will
  be in the future.
  When the resolution algorithm considers overloaded function
  definitions, first it orders them lexicographically by the types in
  their parameter lists. To ensure the backward compatible behavior
  Primitivetype.DATE enum value has to come after
  PrimitiveType.TIMESTAMP.

- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
  math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them
in one change-set.

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
  corresponding HBASE table 'functional_hbase.date_tbl') was
  introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
    - since DATE type is supported for TEXT and HBASE fileformats
      only, most DATE tests were implemented separately in
      tests/query_test/test_date_queries.py.

Note, that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-23 13:33:57 +00:00
Greg Rahn
ba9b78c103 IMPALA-7759: Add Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate Levenshtein edit
distance. Implemented as levenshtein() to match PostgreSQL in
both functionality and name and also added le_dst() alias for Netezza,
compatibility, but note that levenshtein() differs in functionality in
that if either value is NULL or both values are NULL, levenshtein()
returns NULL, where Netezza's le_dst() returns the length of the not
NULL value or 0 if both values are NULL.

Testing:
- Added unit tests to expr-test.cc
- Manual test on 966289 string pairs and results match PostgreSQL
- Added changes to qgen tests for PostgreSQL comparison

Change-Id: I549d33ab7cebfa10db2934461c8ec91e2cc1cdcb
Reviewed-on: http://gerrit.cloudera.org:8080/11793
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-12-02 10:39:44 +00:00
stiga-huang
ddef2cb9b1 IMPALA-376: add built-in functions for parsing JSON
This patch implements the same function as Hive UDF get_json_object.
We reuse RapidJson to parse the json string. In order to track the
memory used in RapidJson, we wrap FunctionContext into an allocator.

get_json_object accepts two parameters: a json string and a selector
(json path). We parse the json string into a Document tree and then
perform BFS according to the selector. For example, to process
    get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a'
selector on all values in the queue. The final results is 1,2,3 in the
queue. As there're multiple results, they should be encapsulated into
an array. The output results is a string of '[1,2,3]'.

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-29 11:59:03 +00:00
Andrew Sherman
15e40a3c94 IMPALA-589: Add sql function returning the impalad coordinator hostname.
In every execution of an Impala query, one of the impalad daemons acts
as the coordinator node. In some cases, such as when using a proxy, a
user cannot predict which host will act as the coordinator. To aid in
diagnosis, we provide a sql function which returns the name of the host
on which the coordinator is running.

EXTERNAL DESCRIPTION:

Add a builtin function called coordinator(), which returns the name of
the host which is running the impalad that is acting as the coordinator
for the current query.

TESTING:
- Added a basic unit test for the new function.
- Added a unit test which simulates the case when coord_address is
  unset.
- Added a query that uses coordinator() to exprs.test
- Hand tested in a development deployment.
- Ran regression tests and got a clean run.

Change-Id: I94d6e2664ba659b48df53c5c06f67b502c533e47
Reviewed-on: http://gerrit.cloudera.org:8080/11459
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-24 20:26:44 +00:00
Adam Holley
7002ad3b35 IMPALA-7398: Add logged_in_user alias for effective_user
This patch adds an alias to the effective_user function so that
views created in Hive using the logged_in_user function will work
in Impala.

Example:
CREATE VIEW foo.view1 AS SELECT * FROM foo.table1 WHERE
name=logged_in_user();

Tests:
- Added function and ran delegation tests
- Ran backend tests
- Ran custom-cluster tests including delegation

Change-Id: Id63f243e0fffbe2798f1f9dbc4cc3ebe9d9529a6
Reviewed-on: http://gerrit.cloudera.org:8080/11184
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-08-11 06:02:52 +00:00
aphadke
dde930830b IMPALA-4848: Add WIDTH_BUCKET() function
Syntax :
width_bucket(expr decimal, min_val decimal, max_val decimal,
  num_buckets int)

This function creates equiwidth histograms , where the histogram range
is divided into num_buckets buckets having identical sizes. This
function returns the bucket in which the expr value would fall. min_val
and max_val are the minimum and maximum value of the histogram range
respectively.

-> This function returns NULL if expr is a NULL.
-> This function returns 0 if expr < min_val
-> This function returns num_buckets + 1 if expr > max_val

 E.g.
[localhost:21000] > select width_bucket(8, 1, 20, 3);
+---------------------------+
| width_bucket(8, 1, 20, 3) |
+---------------------------+
| 2                         |
+---------------------------+

Change-Id: I081bc916b1bef7b929ca161a9aade3b54c6b858f
Reviewed-on: http://gerrit.cloudera.org:8080/6023
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-06-30 04:47:23 +00:00
Tim Armstrong
2995be8238 IMPALA-5607: part 1: breaking extract/date_part changes
This is the compatibility-breaking part of Jinchul Kim's change
to add additional units. To support nanoseconds we need to
widen the output type of these functions. We also change
the meaning of "milliseconds" to include the seconds component.

Cherry-picks: not for 2.x

Change-Id: I42d83712d9bb3a4900bec38a9c009dcf2a1fe019
Reviewed-on: http://gerrit.cloudera.org:8080/9957
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-10 04:00:37 +00:00
Taras Bobrovytsky
8fec1911e5 IMPALA-6230, IMPALA-6468: Fix the output type of round() and related fns
Before this patch, the output type of round() ceil() floor() trunc() was
not always the same as the input type. It was also inconsistent in
general. For example, round(double) returned an integer, but
round(double, int) returned a double.

After looking at other database systems, we decided that the guideline
should be that the output type should be the same as the input type. In
this patch, we change the behavior of the previously mentioned functions
so that if a double is given then a double is returned.

We also modify the rounding behavior to always round away from zero.
Before, we were rounding towards positive infinity in some cases.

Testinging:
- Updated tests
- Ran an exhaustive build which passed.

Cherry-picks: not for 2.x

Change-Id: I77541678012edab70b182378b11ca8753be53f97
Reviewed-on: http://gerrit.cloudera.org:8080/9346
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-24 04:43:01 +00:00
Greg Rahn
d91df9b63f IMPALA-6537: Add missing ODBC scalar functions
This patch contains the following builtin function changes:

New aliases for existing functions:
- LEFT() same as STRLEFT()
- RIGHT() same as STRRIGHT()
- WEEK() same as WEEKOFYEAR()

New functions:
- QUARTER()
- MONTHNAME()

Refactors:
- Remove TimestampFunctions::DayName and add LongDayName to match pattern of
  TimestampFunctions::ShortDayName

Additionally, it adds the unit of QUARTER to EXTRACT() and DATE_PART()

Testing:
- manual testing comparing the translated ODBC functions to the
  non-translated ones
- added at least one new expr-test for aliases
- new expr-tests added for new functions

Change-Id: Ia60af2b4de8c098be7ecb3e60840e459ae10d499
Reviewed-on: http://gerrit.cloudera.org:8080/9376
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-23 07:19:07 +00:00
Jinchul
1b1087eb05 IMPALA-3282: Adds regexp_escape built-in function
Escapes the following special characters in RE2 library:
.\+*?[^]$(){}=!<>|:-

Testing:
Add some unit tests into ExprTest.StringRegexpFunctions
Add some E2E tests into exprs.test

Change-Id: I84c3e0ded26f6eb20794c38b75be9b25cd111e4b
Reviewed-on: http://gerrit.cloudera.org:8080/8900
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-02-01 05:14:14 +00:00
Zoram Thanga
c2d27ca823 IMPALA-6059: Enhance ltrim()/rtrim() functions to trim any set of
characters.

This patch generalizes ltrim()/rtrim() functions to accept a second
argument that specifies the set of characters to be removed from the
leading/trailing end of the target string:

ltrim(string text[, characters text])
rtrim(string text[, characters text])

A common string trimming method has been added to StringFunctions,
which is called from the general ltrim/rtrim/btrim functions. The
functions also share prepare and close operations.

New StringFunctions tests have been added to ExprTest for the new
forms of ltrim() and rtrim(). New tests to cover handling of special
characters have also been added.

Note that our string handling functions only work with the ASCII
character set. Handling of other character sets lies outside the
scope of this patch.

The existing ltrim()/rtrim()/trim() functions that take only one
argument have been updated to use the more general methods.

Testing: Queries like the following were run on a 1.5-billion row
tpch_parquet.lineitem table, with the old and new implementations
to ensure there is no performance regression:

  1. select count(trim(l_shipinstruct)), count(trim(l_returnflag)), ...
  2. select count(*) from t where trim(l_shipinstruct) = '' and ...

Change-Id: I8a5ae3f59762e70c3268a01e14ed57a9e36b8d79
Reviewed-on: http://gerrit.cloudera.org:8080/8349
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-23 23:44:46 +00:00
Jinchul
6041865031 IMPALA-3651: Adds murmur_hash() built-in function
murmur_hash relys on HashUtil::MurmurHash2_64 which MurmurHash2 64-bit
version.

Testing:
Add unit tests for primitive types: ExprTest.MurmurHashFunction
Add E2E tests into exprs.test

Change-Id: I14d56ffb8fab256f3f66a2669271fd4b3c50cc29
Reviewed-on: http://gerrit.cloudera.org:8080/8893
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-10 20:17:26 +00:00
Zachary Amsden
f53ce3b16d IMPALA-4513: Promote integer types for ABS()
The internal representation of the most negative number
in two's complement requires 1 more bit to represent the
positive version.  This means ABS() must promote integer
types to the next highest width.

Change-Id: I86cc880e78258d5f90471bd8af4caeb4305eed77
Reviewed-on: http://gerrit.cloudera.org:8080/8004
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-23 02:41:32 +00:00
Philip Zeyliger
02302b7cfe IMPALA-5211: Simplifying nullif conditional.
This commit:
* Converts nullif(x, y) into if(x IS DISTINCT FROM y, x, NULL).
* Re-writes x IS DINSTINCT FROM y -> FALSE if x.equals(y).
* Removes backend implementation of nullif.

As is the case with all conversions, the original nullif(...) is
replaced with if(...) in error messages, explain plans, and so on.

It's important and subtle that the conversion uses "x IS DISTINCT FROM y"
rather than "x != y" so that the simplification can be made while
handling null values correctly. ("x != x" may be either false or null,
but x is distinct from x is always false.)

Testing:
* Added new tests to ExprRewriteRulesTests for nullif and the if(x
  distinct from y, ...) simplification.
* New test for the rewrite in ParserTest.
* Adds an nvl2() test, incidentally.
* Confirmed (using EclEmma, which uses jococo engine) that coverage is good.
* Ran the tests.

Change-Id: Id91ca968a0c0be44e1ec54ad8602f91a5cb2e0e5
Reviewed-on: http://gerrit.cloudera.org:8080/7829
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-09-15 22:48:52 +00:00