175 Commits

Author SHA1 Message Date
ttttttz
5d1f1e0180 IMPALA-14183: Rename the environment variable USE_APACHE_HIVE to USE_APACHE_HIVE_3
When the environment variable USE_APACHE_HIVE is set to true, build
Impala for adapting to Apache Hive 3.x. In order to better distinguish it
from Apache Hive 2.x later, rename USE_APACHE_HIVE to USE_APACHE_HIVE_3.
Additionally, to facilitate referencing different versions of the Hive
MetastoreShim, the major version of Hive has been added to the environment
variable IMPALA_HIVE_DIST_TYPE.

Change-Id: I11b5fe1604b6fc34469fb357c98784b7ad88574d
Reviewed-on: http://gerrit.cloudera.org:8080/21724
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-12-03 13:38:45 +00:00
Arnab Karmakar
a2a11dec62 IMPALA-13263: Add single-argument overload for ST_ConvexHull()
Implemented a single-argument version of ST_ConvexHull() to align with
PostGIS behavior and simplify usage across geometry types.

Testing:
Added new tests in test_geospatial_functions.py for ST_ConvexHull(),
which previously had no test coverage, to verify correctness across
supported geometry types.

Change-Id: Idb17d98f5e75929ec0143aa16195a84dd6e50796
Reviewed-on: http://gerrit.cloudera.org:8080/23604
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2025-11-18 10:26:04 +00:00
Joe McDonnell
1913ab46ed IMPALA-14501: Migrate most scripts from impala-python to impala-python3
To remove the dependency on Python 2, existing scripts need to use
python3 rather than python. These commands find those
locations (for impala-python and regular python):
git grep impala-python | grep -v impala-python3 | grep -v impala-python-common | grep -v init-impala-python
git grep bin/python | grep -v python3

This removes or switches most of these locations by various means:
1. If a python file has a #!/bin/env impala-python (or python) but
   doesn't have a main function, it removes the hash-bang and makes
   sure that the file is not executable.
2. Most scripts can simply switch from impala-python to impala-python3
   (or python to python3) with minimal changes.
3. The cm-api pypi package (which doesn't support Python 3) has been
   replaced by the cm-client pypi package and interfaces have changed.
   Rather than migrating the code (which hasn't been used in years), this
   deletes the old code and stops installing cm-api into the virtualenv.
   The code can be restored and revamped if there is any interest in
   interacting with CM clusters.
4. This switches tests/comparison over to impala-python3, but this code has
   bit-rotted. Some pieces can be run manually, but it can't be fully
   verified with Python 3. It shouldn't hold back the migration on its own.
5. This also replaces locations of impala-python in comments / documentation /
   READMEs.
6. kazoo (used for interacting with HBase) needed to be upgraded to a
   version that supports Python 3. The newest version of kazoo requires
   upgrades of other component versions, so this uses kazoo 2.8.0 to avoid
   needing other upgrades.

The two remaining uses of impala-python are:
 - bin/cmake_aux/create_virtualenv.sh
 - bin/impala-env-versioned-python
These will be removed separately when we drop Python 2 support
completely. In particular, these are useful for testing impala-shell
with Python 2 until we stop supporting Python 2 for impala-shell.

The docker-based tests still use /usr/bin/python, but this can
be switched over independently (and doesn't impact impala-python)

Testing:
 - Ran core job
 - Ran build + dataload on Centos 7, Redhat 8
 - Manual testing of individual scripts (except some bitrotted areas like the
   random query generator)

Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc
Reviewed-on: http://gerrit.cloudera.org:8080/23468
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2025-10-22 16:30:17 +00:00
pranavyl
a61b90f860 IMPALA-13039: AES Encryption/ Decryption Support in Impala
AES (Advanced Encryption Standard) crypto functions are
widely recognized and respected encryption algorithm used to protect
sensitive data which operate by transforming plaintext data into
ciphertext using a symmetric key, ensuring confidentiality and
integrity. This standard specifies the Rijndael algorithm, a symmetric
block cipher that can process data blocks of 128 bits, using cipher
keys with lengths of 128 and 256 bits. The patch makes use of the
EVP_*() algorithms from the OpenSSL library.

The patch includes:
1. AES-GCM, AES-CTR, and AES-CFB encryption functionalities and
AES-GCM, AES-ECB, AES-CTR, and AES-CFB decryption functionalities.
2. Support for both 128-bit and 256-bit key sizes for GCM and ECB modes.
3. Enhancements to EncryptionKey class to accommodate various AES modes.

The aes_encrypt() and aes_decrypt() functions serve as entry
points for encryption and decryption operations, handling
encryption and decryption based on user-provided keys, AES modes,
and initialization vectors (IVs). The implementation includes key
length validation and IV vector size checks to ensure data
integrity and confidentiality.

Multiple AES modes: GCM, CFB, CTR for encryption, and GCM, CFB, CTR
and ECB for decryption are supported to provide flexibility and
compatibility with various use cases and OpenSSL features. AES-GCM
is set as the default mode due to its strong security properties.
AES-CTR and AES-CFB are provided as fallbacks for environments where
AES-GCM may not be supported. Note that AES-GCM is not available in
OpenSSL versions prior to 1.0.1, so having multiple methods ensures
broader compatibility.

Testing: The patch is thouroughly tested and the tests are included in
exprs.test.

Change-Id: I3902f2b1d95da4d06995cbd687e79c48e16190c9
Reviewed-on: http://gerrit.cloudera.org:8080/20447
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-27 22:16:37 +00:00
Yida Wu
702131b677 IMPALA-13565: Add general AI platform support to ai_generate_text
Currently only OpenAI sites are allowed for ai_generate_text(),
this patch adds support for general AI platforms to
the ai_generate_text function. It introduces a new flag,
ai_additional_platforms, allowing Impala to access additional
AI platforms. For these general AI platforms, only the openai
standard is supported, and the default api credential serves as
the api token for general platforms.

The ai_api_key_jceks_secret parameter has been renamed to
auth_credential to support passing both plain text and jceks
encrypted secrets.

A new impala_options parameter is added to ai_generate_text() to
enable future extensions. Adds the api_standard option to
impala_options, with "openai" as the only supported standard.
Adds the credential_type option to impala_options for allowing
the plain text as the token, by default it is set to jceks.
Adds the payload option to impala_options for customized
payload input. If set, the request will use the provided
customized payload directly, and the response will follow the
openai standard for parsing. The customized payload size must not
exceed 5MB.

Adding the impala_options parameter to ai_generate_text() should
be fine for backward compatibility, as this is a relatively new
feature.

Example:
1. Add the site to ai_api_additional_platforms,like:
ai_additional_platforms='new_ai.site,new_ai.com'
2. Example sql:
select ai_generate_text("https://new_ai.com/v1/chat/completions",
"hello", "model-name", "ai-api-token", "platform params",
'{"api_standard":"openai", "credential_type":"plain",
"payload":"payload content"}}')

Tests:
Added a new test AiFunctionsTestAdditionalSites.
Manual tested the example with the Cloudera AI platform.
Passed core and asan tests.

Change-Id: I4ea2e1946089f262dda7ace73d5f7e37a5c98b14
Reviewed-on: http://gerrit.cloudera.org:8080/22130
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
2025-01-16 19:12:07 +00:00
Mihaly Szjatinya
22723d0f27 IMPALA-7086: Cache timezone in *_utc_timestamp()
Added Prepare - Close routine around from/to_utc standard functions.
This gives a consistent time improvement for constant timezones.

Given sample table with 600M timestamp rows, on all-default
environment the query below gives a stable 2-3 seconds improvement.
SELECT count(*) FROM a_table
where from_utc_timestamp(ts, "a_timezone") > "a_date";

Averaged results for Release, SET MT_OP=1, SET DISABLE_CODEGEN=TRUE:
from_utc: 16,53s -> 12,53s
to_utc: 14,02 - > 11,53

Change-Id: Icdf5ff82c5d0554333aef1bc3bba034a4cf48230
Reviewed-on: http://gerrit.cloudera.org:8080/21735
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-09-06 18:08:26 +00:00
Yida Wu
9837637d93 IMPALA-12920: Support ai_generate_text built-in function for OpenAI's chat completion API
Added support for following built-in functions:
- ai_generate_text_default(prompt)
- ai_generate_text(ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

'ai_endpoint', 'ai_model' and 'ai_api_key_jceks_secret' are flagfile
options. 'ai_generate_text_default(prompt)' syntax expects all these
to be set to proper values. The other syntax, will try to use the
provided input parameter values, but fallback to instance level values
if the inputs are NULL or empty.

Only public OpenAI (api.openai.com) and Azure OpenAI (openai.azure.com)
API endpoints are currently supported.

Exposed these functions in FunctionContext so that they can also be
called from UDFs:
- ai_generate_text_default(context, model)
- ai_generate_text(context, ai_endpoint, prompt, ai_model,
  ai_api_key_jceks_secret, additional_params)

Testing:
- Added unit tests for AiGenerateTextInternal function
- Added fe test for JniFrontend::getSecretFromKeyStore
- Ran manual tests to make sure Impala can talk with OpenAI LLMs using
'ai_generate_text' built-in function. Example sql:
select ai_generate_text("https://api.openai.com/v1/chat/completions",
"hello", "gpt-3.5-turbo", "open-ai-key",
'{"temperature": 0.9, "model": "gpt-4"}')
- Tested using standalone UDF SDK and made sure that the UDFs can invoke
  BuiltInFunctions (ai_generate_text and ai_generate_text_default)

Change-Id: Id4446957f6030bab1f985fdd69185c3da07d7c4b
Reviewed-on: http://gerrit.cloudera.org:8080/21168
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-11 07:25:50 +00:00
jasonmfehr
3e4fdeece1 IMPALA-12824: Removes the prettyprint_duration Built-in Function
The prettyprint_duration function was originally
implemented in IMPALA-12824 to work with the workload
management tables which stored durations in integer
nanoseconds. These tables have changed to store decimal
seconds.

The prettyprint_duration function would have required a
large investment of time to make it work with decimal
values, and since the new format is more human readable
anyways, this function has been removed.

Change-Id: If2154c2ed9a7217ed4b7587adeae87df55ff03dc
Reviewed-on: http://gerrit.cloudera.org:8080/21208
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-03-28 06:58:56 +00:00
jasonmfehr
d03ffc70f2 IMPALA-12824: Adds built-in functions prettyprint_duration and prettyprint_bytes.
The prettyprint_duration function takes an integer input containing a
number of nanoseconds and returns a human readable value breaking down
the input by hours, minutes, seconds, milliseconds, microseconds, and
nanoseconds.

The prettyprint_bytes function takes an integer input containing a
number of bytes and returns a human readable values breaking down the
input by gigabytes, megabytes, kilobytes, and bytes.

Functionality tests were added to the existing expr-test suite that
tests built-in functions.

Functional-query workloads were added in two new .test files under the
testdata directory to exercise these two new functions. Corresponding
pytests were added to run the tests in these new .test files.

Benchmarks were added to expr-benchmark, and new benchmarks were
generated with a release build running on a machine with the cpu
Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz.

Documentation was added to the built-in string functions docs.

Change-Id: I3e76632ce21ad2ca5df474160338699a542a6913
Reviewed-on: http://gerrit.cloudera.org:8080/21038
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-02-21 04:23:28 +00:00
Eyizoha
e489ab35b1 IMPALA-12718: Provides UTF-8 support for the trim functions
Currently, the trim function (including BTRIM, LTRIM, RTRIM) cannot
correctly handle strings containing multi-byte UTF-8 characters.
Multi-byte UTF-8 characters are interpreted as multiple single-byte
characters, leading to unexpected results.

This patch provides UTF-8 support for the trim functions, enabling these
functions to correctly handle multi-byte UTF-8 characters (when set
utf8_mode=true). It also introduces a set of trim functions with the
'utf8_' prefix, offering the same capability even when utf8_mode is not
enabled.

Testing:
 - Added new BE test case in ExprTest#Utf8Test
 - Added new E2E test case in TestUtf8StringFunctions

Change-Id: I5cfaffd71009f16eae75910af835bd2a34410856
Reviewed-on: http://gerrit.cloudera.org:8080/20926
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-02-02 11:06:24 +00:00
Eyizoha
3af1930229 IMPALA-12322: Support converting UTC timestamps read from Kudu to local time
This patch adds a query option 'convert_kudu_utc_timestamps' similar to
'convert_legacy_hive_parquet_utc_timestamps'. When enabled, it converts
UTC timestamps read from Kudu to local timestamps.

The corresponding modification also include predicate pushdown and
runtime filter. Due to the ambiguity of timestamps caused by daylight
saving time changes, it is difficult to resolve in the bloom filter.
This patch additionally introduces a query option
'disable_kudu_local_timestamp_bloom_filter' to default disable the Kudu
timestamp bloom filter after enabling time zone conversion in order to
avoid erroneously filtering out data. However, for regions that do not
observe daylight saving time, it can be set to false to re-enable the
Kudu local timestamp bloom filter.

Testing:
- Add TestKuduTimestampConvert in query_test/test_kudu.py
Perform end-to-end testing in a custom cluster, including basic Kudu UTC
timestamp conversion testing, as well as checking if related predicate
pushdown and runtime filters are working correctly (even with timestamps
involving daylight saving time conversions).

Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3
Reviewed-on: http://gerrit.cloudera.org:8080/20681
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
2023-12-14 13:19:35 +00:00
Michael Smith
0a42185d17 IMPALA-9627: Update utility scripts for Python 3 (part 2)
We're starting to see environments where the system Python ('python') is
Python 3. Updates utility and build scripts to work with Python 3, and
updates check-pylint-py3k.sh to check scripts that use system python.

Fixes other issues found during a full build and test run with Python
3.8 as the default for 'python'.

Fixes a impala-shell tip that was supposed to have been two tips (and
had no space after period when they were printed).

Removes out-of-date deploy.py and various Python 2.6 workarounds.

Testing:
- Full build with /usr/bin/python pointed to python3
- run-all-tests passed with python pointed to python3
- ran push_to_asf.py

Change-Id: Idff388aff33817b0629347f5843ec34c78f0d0cb
Reviewed-on: http://gerrit.cloudera.org:8080/19697
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
2023-04-26 18:52:23 +00:00
Peter Rozsa
1d05381b7b IMPALA-11745: Add Hive's ESRI geospatial functions as builtins
This change adds geospatial functions from Hive's ESRI library
as builtin UDFs. Plain Hive UDFs are imported without changes,
but the generic and varargs functions are handled differently;
generic functions are added with all of the combinations of
their parameters (cartesian product of the parameters), and
varargs functions are unfolded as an nth parameter simple
function. The varargs function wrappers are generated at build
time and they can be configured in
gen_geospatial_udf_wrappers.py. These additional steps are
required because of the limitations in Impala's UDF Executor
(lack of varargs support and only partial generics support)
which could be further improved; in this case, the additional
wrapping/mapping steps could be removed.

Changes regarding function handling/creating are sourced from
https://gerrit.cloudera.org/c/19177

A new backend flag was added to turn this feature on/off
as "geospatial_library". The default value is "NONE" which
means no geospatial function gets registered
as builtin, "HIVE_ESRI" value enables this implementation.

The ESRI geospatial implementation for Hive currently only
available in Hive 4, but CDP Hive backported it to Hive 3,
therefore for Apache Hive this feature is disabled
regardless of the "geospatial_library" flag.

Known limitations:
 - ST_MultiLineString, ST_MultiPolygon only works
   with the WKT overload
 - ST_Polygon supports a maximum of 6 pairs of coordinates
 - ST_MultiPoint, ST_LineString supports a maximum of 7
   pairs of coordinates
 - ST_ConvexHull, ST_Union supports a maximum of 6 geoms

These limits can be increased in gen_geospatial_udf_wrappers.py

Tests:
 - test_geospatial_udfs.py added based on
   https://github.com/Esri/spatial-framework-for-hadoop

Co-Authored-by: Csaba Ringhofer <csringhofer@cloudera.com>

Change-Id: If0ca02a70b4ba244778c9db6d14df4423072b225
Reviewed-on: http://gerrit.cloudera.org:8080/19425
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-02-07 20:18:47 +00:00
Csaba Ringhofer
7ca11dfc7f IMPALA-9482: Support for BINARY columns
This patch adds support for BINARY columns for all table formats with
the exception of Kudu.

In Hive the main difference between STRING and BINARY is that STRING is
assumed to be UTF8 encoded, while BINARY can be any byte array.
Some other differences in Hive:
- BINARY can be only cast from/to STRING
- Only a small subset of built-in STRING functions support BINARY.
- In several file formats (e.g. text) BINARY is base64 encoded.
- No NDV is calculated during COMPUTE STATISTICS.

As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly
identical, especially from the backend's perspective. For this reason,
BINARY is implemented a bit differently compared to other types:
while the frontend treats STRING and BINARY as two separate types, most
of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g.
in SlotDesc. Only the following parts of backend need to differentiate
between STRING and BINARY:
- table scanners
- table writers
- HS2/Beeswax service
These parts have access to column metadata, which allows to add special
handling for BINARY.

Only a very few builtins are allowed for BINARY at the moment:
- length
- min/max/count
- coalesce and similar "selector" functions
Other STRING functions can be only used by casting to STRING first.
Adding support for more of these functions is very easy, as simply
the BINARY type has to be "connected" to the already existing STRING
function's signature. Functions where the result depends on utf8_mode
need to ensure that with BINARY it always works as if utf8_mode=0 (for
example length() is mapped to bytes() as length count utf8 chars if
utf8_mode=1).

All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY,
though in case of legacy Hive UDFs it is only supported if the argument
and return types are set explicitely to ensure backward compatibility.
See IMPALA-11340 for details.

The original plan was to behave as close to Hive as possible, but I
realized that Hive has more relaxed casting rules than Impala, which
led to STRING<->BINARY casts being necessary in more cases in Impala.
This was needed to disallow passing a BINARY to functions that expect
a STRING argument. An example for the difference is that in
INSERT ... VALUES () string literals need to be explicitly cast to
BINARY, while this is not needed in Hive.

Testing:
- Added functional.binary_tbl for all file formats (except Kudu)
  to test scanning.
- Removed functional.unsupported_types and related tests, as now
  Impala supports all (non-complex) types that Hive does.
- Added FE/EE tests mainly based on the ones added to the DATE type

Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582
Reviewed-on: http://gerrit.cloudera.org:8080/16066
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-19 13:55:42 +00:00
Joe McDonnell
ba4cb95b62 IMPALA-11257: Fix CMake warnings for module names and cmake_minimum_required
This fixes a few different CMake warnings:
1. This removes cmake_minimum_required invocations except for the
   top-most CMakeLists.txt. This eliminates the warnings like this:
     Compatibility with CMake < 2.8.12 will be removed from a future version of
     CMake.

     Update the VERSION argument <min> value or use a ...<max> suffix to tell
     CMake that the project does not need compatibility with older versions.
   Moving to a later version also required setting CMAKE_ENABLE_EXPORTS
   to continue exporting symbols.
2. This modifies the module names so that they match the corresponding
   module names from Find*.cmake. This is mostly dealing with case
   differences. This address warnings like:
     The package name passed to `find_package_handle_standard_args` (PROTOBUF)
     does not match the name of the calling package (Protobuf).  This can lead
     to problems in calling code that expects `find_package` result variables
     (e.g., `_FOUND`) to follow a certain pattern.
   This fixed the detection logic for KerberosPrograms, and so it required
   adding more Kerberos packages to bin/bootstrap_build.sh.
3. This adds a missing .cc suffix. This addresses the following warning:
     CMake Warning (dev) at be/src/util/CMakeLists.txt:141 (add_library):
     Policy CMP0115 is not set: Source file extensions must be explicit.  Run
     "cmake --help-policy CMP0115" for policy details.  Use the cmake_policy
     command to set the policy and suppress this warning.

These fixes mostly match how these warnings were handled in
Apache Kudu.

Testing:
 - Ran GVO

Change-Id: I2a97dd07cdd0831e90882a2035415ac71d670147
Reviewed-on: http://gerrit.cloudera.org:8080/18444
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-08-11 05:48:36 +00:00
Csaba Ringhofer
f24296aed5 IMPALA-11355: Add STRING overloads for hour/minute/second/millisecond
IMPALA-9531 dropped support for "dateless timestamps",
e.g. cast("12:05:05" as timestamp) now returns NULL.

This led to breaking functions like minute("12:05:05"), as minute()
expects a timestamp, and Impala adds an implicit cast, so what actually
happens is minute(cast("12:05:05" as timestamp)), which returns NULL.

This change adds overloads for similar functions that take STRING
instead of TIMESTAMP parameter. The same functions already take a
STRING parameter in Hive and mySQL.

The changes in the parser mainly restore code removed in IMPALA-9531.

Note that these functions could be potentially optimized by returning
parts of the parse result without converting them to boost time first,
but this is not done here to make the change minimal.

Testing:
- restored related tests in expr-test and added some new ones for
  malformed time-of-day strings
- added benchmarks for the new overloads and fixed the ones for the
  old functions (they tested NULL)

Change-Id: I6cc1c851ee71ab4fcc58105c7e9931155a483679
Reviewed-on: http://gerrit.cloudera.org:8080/18718
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-07-14 21:03:43 +00:00
Michael Smith
2243f331cb IMPALA-11274: CNF Rewrite causes a regress in join node performance
This patch defines a subset of all predicates that are common and
relatively inexpensive to compute. Such predicates must involve
columns, constants, simple math or cast functions only.

Examples of the subset of the predicates allowed:

  1. (a = 1 AND cast(b as int) = 2) OR (c = d AND e = f)
  2. a in ('1', '2', '3') OR ((b = 'abc') AND (c = d))
  3. (a between 1 and 100) OR ((b is null) AND (c = d))

Examples of the predicates not allowed:

  1. (upper(a) != 'Y') AND b = 2) OR (c = d AND e = f)
  2. (coalesce(CAST(a AS string), '') = '') AND b = 2) OR
     (c = d AND e = f)

This patch further restricts the predicates to be converted to
conjunctive normal form (CNF) to be such a subset, with the aim to
reduce the run-time evaluation overhead of CNFs in which some
of the predicates can be duplicated.

Uses a cache in branching expressions to avoid visiting the entire
subtree on each call to applyRuleBottomUp. Skips cache complexity on
casts as they don't branch and are unlikely to be deeply nested.

Testing:
- New expression writer tests
- New planner tests

Change-Id: I326406c6b004fe31ec0e2a2f390a3845b8925aa9
Reviewed-on: http://gerrit.cloudera.org:8080/18458
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-05-25 05:37:17 +00:00
stiga-huang
35375b3287 IMPALA-2019(part-4): Add UTF-8 support for case conversion functions
There are 3 builtin case conversion string functions: upper(), lower(),
and initcap(). Previously they only convert English alphabetic
characters. This patch adds support to deal with Unicode characters.

There are many corner cases in case conversion depending on the locale
and context. E.g.
1) Case conversion is locale-sensitive.
Turkish has 4 letter "I"s. English has only two, a lowercase dotted i
and an uppercase dotless I. Turkish has lowercase and uppercase forms of
both dotted and dotless I. So simply converting "i" to "I" for upper
case is wrong in Turkish:
    +-------+--------+---------+
    |       | Dotted | Dotless |
    +-------+--------+---------+
    | Upper | İ      | I       |
    +-------+--------+---------+
    | Lower | i      | ı       |
    +-------+--------+---------+

2) Case conversion may change a string's length.
The German word "grüßen" should be converted to "GRÜSSEN" in upper case:
the letter "ß" should be converted to "SS".

3) Case conversion is context-sensitive.
The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the
Greek letter "Σ" is converted to "σ" or to "ς", depending on its
position in the word.

The above cases will be focus in follow-up JIRAs. This patch addes the
initial implementation of UTF-8 aware case conversion functions.

--------
Implementation:
In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the
bytes in strings are converted to wide characters using std::mbrtowc().
Each wide character (wchar_t) will then be converted using std::towupper
or std::towlower correspondingly. We then convert them back to multi
bytes using std::wcrtomb().

Note that these builtins are locale aware. If impalad is launched
without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't
recognize non-ascii characters, which will return unexpected results.
Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C".
This patch also logs the current locale when launching impala daemons
for better debugging. We will support customized locale in IMPALA-11080.

Test:
 - Add BE unit tests and e2e tests.

Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd
Reviewed-on: http://gerrit.cloudera.org:8080/17785
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-15 18:40:59 +00:00
pranav.lodha
bde995483a IMPALA-955: BYTES built-in function
The Bytes function returns the number of bytes contained
in the specified byte string. There are changes in
4 files. A few testcases are also added in
be/src/exprs/expr-test.cc and an end-to end test in
testdata/workloads/functional-query/queries/QueryTest/exprs.test.

Change-Id: I0bd06c3d6dba354d71f63c649eaa8f9f74d266ee
Reviewed-on: http://gerrit.cloudera.org:8080/18210
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2022-02-11 07:01:58 +00:00
stiga-huang
3850d49711 IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions
Mask functions are used in Ranger column masking policies to mask
sensitive data. There are 5 mask functions: mask(), mask_first_n(),
mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as
an example, by default, it will mask uppercase to 'X', lowercase to 'x',
digits to 'n' and leave other characters unmasked. For masking all
characters to '*', we can use
  mask(my_col, '*', '*', '*', '*');
The current implementations mask strings byte-to-byte, which have
inconsistent results with Hive when the string contains unicode
characters:
  mask('中国', '*', '*', '*', '*') => '******'
Each Chinese character is encoded into 3 bytes in UTF-8 so we get the
above result. The result in Hive is '**' since there are two Chinese
characters.

This patch provides consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode,
the masked unit of a string is a unicode code point.

Implementation
 - Extends the existing MaskTransform function to deal with unicode code
   points(represented by uint32_t).
 - Extends the existing GetFirstChar function to get the code point of
   given masked charactors in UTF-8 mode.
 - Implement a MaskSubStrUtf8 method as the core functionality.
 - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode.
 - For better testing, this patch also adds an overload for all mask
   functions for only masking other chars but keeping the
   upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X').

Tests
 - Add BE tests in expr-test
 - Add e2e tests in utf8-string-functions.test

Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Reviewed-on: http://gerrit.cloudera.org:8080/17780
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-09-15 05:04:07 +00:00
Amogh Margoor
8e7a6227a4 IMPALA-10730: Add MD5 fuction to compute 128-bit checksum for a string.
Built-in function has been added to compute MD5 128-bit checksum for
a non-null string. If input string is null, then output of the
function is null too. In FIPS mode, MD5 is disabled and function
will throw error on invocation.

Testing:
1. Added expression unit tests.
2. Added end-to-end tests for MD5.

Change-Id: Id406d30a7cc6573212b302fbfec43eb848352ff2
Reviewed-on: http://gerrit.cloudera.org:8080/17567
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-10 22:24:16 +00:00
Fucun Chu
ce21fe74b8 IMPALA-10689: Implement ds_cpc_union_f() function.
This function receives two strings that are serialized Apache
DataSketches CPC sketches. Union two sketches and returns the
resulting sketch of union.

Example:
select ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2))
from sketch_tbl;
+---------------------------------------------------+
| ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) |
+---------------------------------------------------+
| 15                                                |
+---------------------------------------------------+

Change-Id: Ib5c616316bf2bf2ff437678e9a44a15339920150
Reviewed-on: http://gerrit.cloudera.org:8080/17440
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-06-07 12:53:23 +00:00
Fucun Chu
67de4a48b0 IMPALA-10688: Implement ds_cpc_stringify() function
This function receives a string that is a serialized Apache
DataSketches CPC sketch and returns its stringified format.

A stringified format should look like and contains the following data:

select ds_cpc_stringify(ds_cpc_sketch(float_col)) from
functional_parquet.alltypestiny;
+--------------------------------------------+
| ds_cpc_stringify(ds_cpc_sketch(float_col)) |
+--------------------------------------------+
| ### CPC sketch summary:                    |
|    lg_k           : 11                     |
|    seed hash      : 93cc                   |
|    C              : 2                      |
|    flavor         : 1                      |
|    merged         : true                   |
|    intresting col : 0                      |
|    table entries  : 2                      |
|    window         : not allocated          |
| ### End sketch summary                     |
|                                            |
+--------------------------------------------+

Change-Id: I8c9d089bfada6bebd078d8f388d2e146c79e5285
Reviewed-on: http://gerrit.cloudera.org:8080/17373
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2021-05-26 06:42:02 +00:00
Amogh Margoor
86beb2f9dd IMPALA-10679: Add builtin functions to comptute SHA-1 and SHA-2 digest.
Built-in functions to compute SHA-1 digest and SHA-2 family of digest
has been added. Support for SHA2 digest includes SHA224, SHA256,
SHA384 and SHA512. In FIPS mode SHA1, SHA224 and SHA256 have been
disabled and will throw error. SHA2 functions will also throw error
for unsupported bit length i.e., bit length apart from 224, 256, 384,
512.

Testing:
1. Added Unit test for expressions.
2. Added end-to-end test for new functions.

Change-Id: If163b7abda17cca3074c86519d59bcfc6ace21be
Reviewed-on: http://gerrit.cloudera.org:8080/17464
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-18 18:48:10 +00:00
Fucun Chu
e39c30b3cd IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
These functions can be used to get cardinality estimates of data
using CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
Figures-of-Merit Comparison of the HLL and CPC Sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare perfomance
   with ndv(). Depending on data characteristics ndv() appears 2x-3x
   faster. CPC gives closer estimate than current ndv(). CPC is more
   accurate than HLL in some cases

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Reviewed-on: http://gerrit.cloudera.org:8080/16656
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-05-11 18:07:40 +00:00
Fucun Chu
77d6acd032 IMPALA-10581: Implement ds_theta_intersect_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. Computes the intersection of two sketches
of same or different column and returns the resulting sketch of
intersection.

Example:
select ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2))
from sketch_tbl;
+-----------------------------------------------------------+
| ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) |
+-----------------------------------------------------------+
| 5                                                         |
+-----------------------------------------------------------+

Change-Id: I335eada00730036d5433775cfe673e0e4babaa01
Reviewed-on: http://gerrit.cloudera.org:8080/17186
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-29 15:59:49 +00:00
Fucun Chu
622e3c95ad IMPALA-10580: Implement ds_theta_union_f() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. Union two sketches and returns the
resulting sketch of union.

Example:
select ds_theta_estimate(ds_theta_union_f(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) |
+-------------------------------------------------------+
| 15                                                    |
+-------------------------------------------------------+

Change-Id: I8329979b81ceeaad739a43fab79768ca9c2916fa
Reviewed-on: http://gerrit.cloudera.org:8080/17179
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-24 15:16:07 +00:00
Fucun Chu
3e82501531 IMPALA-10558: Implement ds_theta_exclude() function
This function receives two strings that are serialized Apache
DataSketches Theta sketches. Computes the a-not-b set operation given
two sketches of same or different column.

Example:
select ds_theta_estimate(ds_theta_exclude(sketch1, sketch2))
from sketch_tbl;
+-------------------------------------------------------+
| ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) |
+-------------------------------------------------------+
| 5                                                     |
+-------------------------------------------------------+

Change-Id: I05119fd8c652c07ff248a99e44b0da3541e46ca3
Reviewed-on: http://gerrit.cloudera.org:8080/17153
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-03-17 22:14:44 +00:00
Fucun Chu
65c6a81ed9 IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions
These functions can be used to get cardinality estimates of data
using Theta algorithm from Apache DataSketches. ds_theta_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized Theta sketch in string format. This can be written to a
table or be fed directly to ds_theta_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the Theta sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' Theta see:
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch25_parquet.lineitem to compare perfomance
   with ds_hll_*. ds_theta_* is faster than ds_hll_* on the original
   data, the difference is around 1%-10%. ds_hll_estimate() is faster
   than ds_theta_estimate() on existing sketch. HLL and Theta gives
   closer estimate except for string. see IMPALA-10464.

Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc
Reviewed-on: http://gerrit.cloudera.org:8080/17008
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-02-17 17:09:48 +00:00
stiga-huang
e8720b40f1 IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions
A unicode character can be encoded into 1-4 bytes in UTF-8. String
functions will return undesired results when the input contains unicode
characters, because we deal with a string as a byte array. For instance,
length() returns the length in bytes, not in unicode characters.

UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem.
This patch adds UTF-8 support in some string functions so they can have
UTF-8 aware behavior. For compatibility with the old versions, a new
query option, UTF8_MODE, is added for turning on/off the UTF-8 aware
behavior. Currently, only length(), substring() and reverse() support
it. Other function supports will be added in later patches.

String functions will check the query option and switch to use the
desired implementation. It's similar to how we use the decimal_v2 query
option in builtin functions.

For easy testing, the UTF-8 aware version of string functions are
also exposed as builtin functions (named by utf8_*, e.g. utf8_length).

Tests:
 - Add BE tests for utf8 functions.
 - Add e2e tests for the UTF8_MODE query option.

Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c
Reviewed-on: http://gerrit.cloudera.org:8080/16908
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-26 00:43:39 +00:00
stiga-huang
9bb7157bf0 IMPALA-10387: Add missing overloads of mask functions used in Ranger default masking policies
The mask functions in Hive are implemented through GenericUDFs which can
accept an infinite number of function signatures. Impala currently don't
support GenericUDFs. So we provide builtin mask functions with limited
overloads.

This patch adds some missing overloads that could be used by Ranger
default masking policies, e.g. MASK_HASH, MASK_SHOW_LAST_4,
MASK_DATE_SHOW_YEAR, etc.

Tests:
 - Add test coverage on all default masking policies applied on all
   supported types.

Change-Id: Icf3e70fd7aa9f3b6d6b508b776696e61ec1fcc2e
Reviewed-on: http://gerrit.cloudera.org:8080/16930
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2021-01-15 13:01:53 +00:00
Gabor Kaszab
6cb7cecacf IMPALA-10237: Support Bucket and Truncate partition transforms as built-in functions
This patch implements Truncate and Bucket partition transforms in
Impala BE as built-in functions. The expectation is that these
functions give the same result as Iceberg's implementation of the same
functions. These built-in functions are invisible so users won't be
able to invoke them e.g. from impala-shell.

Truncate:
  - Supported types are IntVal, BigIntVal, StringVal, DecimalVal.
  - Receives an input from the above types and a width.
  - Returns the same type as the input.
  - Expected behaviour is explained here:
    https://iceberg.apache.org/spec/#truncate-transform-details
Bucket:
  - Supported types are IntVal, BigIntVal, StringVal, DecimalVal,
    DateVal, TimestampVal.
  - Receives an input from the above types and the number of buckets as
    IntVal.
  - Returns IntVal.
  - Expected behaviour is explained here:
    https://iceberg.apache.org/spec/#bucket-transform-details

Change-Id: I485680cf79d96d578dd8cfbfd554bec468fe84bd
Reviewed-on: http://gerrit.cloudera.org:8080/16741
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-12-19 03:09:06 +00:00
Fucun Chu
8ea49e9b02 IMPALA-10134: Implement ds_hll_union_f() function.
This function receives two strings that are serialized Apache DataSketches
HLL sketches. Union two sketches and returns the resulting sketch of union.

Example:
select ds_hll_estimate(ds_hll_union_f(i_i, h_i))
from hll_sketches_impala_hive2;
+-------------------------------------------+
| ds_hll_estimate(ds_hll_union_f(i_i, h_i)) |
+-------------------------------------------+
| 7                                         |
+-------------------------------------------+

Change-Id: Ic06e959ed956af5cedbfc7d4d063141d5babb2a8
Reviewed-on: http://gerrit.cloudera.org:8080/16711
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-11-25 14:33:10 +00:00
Fucun Chu
193c2e773f IMPALA-10132: Implement ds_hll_estimate_bounds_as_string() function.
This function receives a string that is a serialized Apache DataSketches
HLL sketch and optional kappa that is a number of standard deviations
from the mean: 1, 2 or 3 (default 2). Returns estimate and bounds with
the values separated with commas.
The result is three values: estimate, lower bound and upper bound.

   ds_hll_estimate_bounds_as_string(sketch [, kappa])

Kappa:
 1 represent the 68.3% confidence bounds
 2 represent the 95.4% confidence bounds
 3 represent the 99.7% confidence bounds

Note, ds_hll_estimate_bounds() should return an Array of doubles as
the result but with that we have to wait for the complex type support.
Until, we provide ds_hll_estimate_bounds_as_string() that can be
deprecated once we have array support. Tracking Jira for returning
complex types from functions is IMPALA-9520.

Example:
select ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) from
functional_parquet.alltypestiny;
+----------------------------------------------------------+
| ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) |
+----------------------------------------------------------+
| 2,2,2.0002                                               |
+----------------------------------------------------------+

Change-Id: I46bf8263e8fd3877a087b9cb6f0d1a2392bb9153
Reviewed-on: http://gerrit.cloudera.org:8080/16626
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-10-29 17:45:01 +00:00
Adam Tamas
99e5f5a885 IMPALA-10133:Implement ds_hll_stringify function.
This function receives a string that is a serialized Apache DataSketches
HLL sketch and returns its stringified format.

A stringified format should look like and contains the following data:

select ds_hll_stringify(ds_hll_sketch(float_col)) from
functional_parquet.alltypestiny;
+--------------------------------------------+
| ds_hll_stringify(ds_hll_sketch(float_col)) |
+--------------------------------------------+
| ### HLL sketch summary:                    |
|   Log Config K   : 12                      |
|   Hll Target     : HLL_4                   |
|   Current Mode   : LIST                    |
|   LB             : 2                       |
|   Estimate       : 2                       |
|   UB             : 2.0001                  |
|   OutOfOrder flag: false                   |
|   Coupon count   : 2                       |
| ### End HLL sketch summary                 |
|                                            |
+--------------------------------------------+

Change-Id: I85dbf20b5114dd75c300eef0accabe90eac240a0
Reviewed-on: http://gerrit.cloudera.org:8080/16382
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-03 12:11:48 +00:00
Adam Tamas
4cb3c3556e IMPALA-10108: Implement ds_kll_stringify function
This function receives a string that is a serialized Apache DataSketches
KLL sketch and returns its stringified format.

A stringified format should look like and contains the following data:

select ds_kll_stringify(ds_kll_sketch(float_col))
from functional_parquet.alltypestiny;
+--------------------------------------------+
| ds_kll_stringify(ds_kll_sketch(float_col)) |
+--------------------------------------------+
| ### KLL sketch summary:                    |
|    K              : 200                    |
|    min K          : 200                    |
|    M              : 8                      |
|    N              : 8                      |
|    Epsilon        : 1.33%                  |
|    Epsilon PMF    : 1.65%                  |
|    Empty          : false                  |
|    Estimation mode: false                  |
|    Levels         : 1                      |
|    Sorted         : false                  |
|    Capacity items : 200                    |
|    Retained items : 8                      |
|    Storage bytes  : 64                     |
|    Min value      : 0                      |
|    Max value      : 1.1                    |
| ### End sketch summary                     |
|                                            |
+--------------------------------------------+

Change-Id: I97f654a4838bf91e3e0bed6a00d78b2c7aa96f75
Reviewed-on: http://gerrit.cloudera.org:8080/16370
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-09-02 10:49:10 +00:00
Gabor Kaszab
28d94851b1 IMPALA-10020: Implement ds_kll_cdf_as_string() function
This is the support for Cumulative Distribution Function (CDF) from
Apache DataSketches KLL algorithm collection. It receives a serialized
KLL sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), (-inf, 5), (-inf, 10), (-inf, +inf)
Returns a comma separated string where each value in the string is a
number in the range of [0,1] and shows that what percentage of the
data is in the particular ranges.

Note, ds_kll_cdf() should return an Array of doubles as the result but
with that we have to wait for the complex type support. Until, we
provide ds_kll_cdf_as_string() that can be deprecated once we
have array support. Tracking Jira for returning complex types from
functions is IMPALA-9520.

Example:
select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.2,0.401644,1,1                                         |
+----------------------------------------------------------+

Change-Id: I77e6afc4556ad05a295b89f6d06c2e4a6bb2cf82
Reviewed-on: http://gerrit.cloudera.org:8080/16359
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-26 10:59:49 +00:00
Gabor Kaszab
a8a35edbc4 IMPALA-10019: Implement ds_kll_pmf_as_string() function
This is the support for Probabilistic Mass Function (PMF) from Apache
DataSketches KLL algorithm collection. It receives a serialized KLL
sketch and one or more float values to represent ranges in the
sketched values.
E.g. [1, 5, 10] will mean the following ranges:
(-inf, 1), [1, 5), [5, 10), [10, +inf)
Returns a comma separated string where each value in the string is a
number in the range of [0,1] and shows that what percentage of the
data is in the particular ranges.

Note, ds_kll_pmf() should return an Array of doubles as the result but
with that we have to wait for the complex type support. Until, we
provide ds_kll_pmf_as_string() that can be deprecated once we
have array support. Tracking Jira for returning complex types from
functions is IMPALA-9520.

Example:
select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.202192,0.199452,0.598356,0                             |
+----------------------------------------------------------+

Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba
Reviewed-on: http://gerrit.cloudera.org:8080/16336
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-26 01:50:16 +00:00
Gabor Kaszab
41065845e9 IMPALA-9962: Implement ds_kll_quantiles_as_string() function
This function is very similar to ds_kll_quantile() but this one can
receive any number of rank parameters and returns a comma separated
string that holds the results for all of the given ranks.
For more details about ds_kll_quantile() see IMPALA-9959.

Note, ds_kll_quantiles() should return an Array of floats as the result
but with that we have to wait for the complex type support. Until, we
provide ds_kll_quantiles_as_string() that can be deprecated once we
have array support. Tracking Jira for returning complex types from
functions is IMPALA-9520.

Change-Id: I76f6039977f4e14ded89a3ee4bc4e6ff855f5e7f
Reviewed-on: http://gerrit.cloudera.org:8080/16324
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-25 18:06:22 +00:00
Gabor Kaszab
0736fcf691 IMPALA-10018: Implement ds_kll_rank() function
ds_kll_rank() receives two parameters: a STRING that represents a
serialized DataSketches KLL sketch and a float to provide a probing
value in the sketch.
Returns a DOUBLE that is the rank of the given probing value in the
range of [0,1]. E.g. a return value of 0.2 means that the probing value
given as parameter is greater than the 20% of all the values in the
sketch. Note, this is an approximate calculation.

Change-Id: I95857886dfbb8c84aeeaf718c0e610012fda4be0
Reviewed-on: http://gerrit.cloudera.org:8080/16283
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-07 12:58:40 +00:00
Gabor Kaszab
87aeb2ad78 IMPALA-9963: Implement ds_kll_n() function
This function receives a serialized Apache DataSketches KLL sketch
and returns how many input values were fed into this sketch.

Change-Id: I166e87a468e68e888ac15fca7429ac2552dbb781
Reviewed-on: http://gerrit.cloudera.org:8080/16259
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-08-06 19:15:04 +00:00
Gabor Kaszab
033a4607e2 IMPALA-9959: Implement ds_kll_sketch() and ds_kll_quantile() functions
ds_kll_sketch() is an aggregate function that receives a float
parameter (e.g. a float column of a table) and returns a serialized
Apache DataSketches KLL sketch of the input data set wrapped into
STRING type. This sketch can be saved into a table or view and later
used for quantile approximations. ds_kll_quantile() receives two
parameters: a STRING parameter that contains a serialized KLL sketch
and a DOUBLE that represents the rank of the quantile in the range of
[0,1]. E.g. rank=0.1 means the approximate value in the sketch where
10% of the sketched items are less than or equals to this value.

Testing:
  - Added automated tests on small data sets to check the basic
    functionality of sketching and getting a quantile approximate.
  - Tested on TPCH25_parquet.lineitem to check that sketching and
    approximating works on bigger scale as well where serialize/merge
    phases are also required. On this scale the error range of the
    quantile approximation is within 1-1.5%

Change-Id: I11de5fe10bb5d0dd42fb4ee45c4f21cb31963e52
Reviewed-on: http://gerrit.cloudera.org:8080/16235
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-31 14:34:49 +00:00
Adam Tamas
1bafb7bd29 IMPALA-9531: Dropped support for dateless timestamps
Removed the support for dateless timestamps.
During dateless timestamp casts if the format doesn't contain
date part we get an error during tokenization of the format.
If the input str doesn't contain a date part then we get null result.

Examples:
select cast('01:02:59' as timestamp);
This will come back as NULL value.

select to_timestamp('01:01:01', 'HH:mm:ss');
select cast('01:02:59' as timestamp format 'HH12:MI:SS');
select cast('12 AM' as timestamp FORMAT 'AM.HH12');
These will come back with a parsing errors.

Casting from a table will generate similar results.

Testing:
Modified the previous tests related to dateless timestamps.
Added test to read fromtables which are still containing dateless
timestamps and covered timestamp to string path when no date tokens
are requested in the output string.

Change-Id: I48c49bf027cc4b917849b3d58518facba372b322
Reviewed-on: http://gerrit.cloudera.org:8080/15866
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
2020-07-08 19:32:15 +00:00
Gabor Kaszab
7e456dfa9d IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
2020-07-07 14:11:21 +00:00
stiga-huang
0936384271 IMPALA-9010: Add builtin mask functions
There're 6 builtin GenericUDFs for column masking in Hive:
  mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_hash(value)
  mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar,
      dayValue, monthValue, yearValue)

Description of the parameters:
   value      - value to mask. Supported types: TINYINT, SMALLINT, INT,
                BIGINT, STRING, VARCHAR, CHAR, DATE(only for mask()).
   charCount  - number of characters. Default value: 4
   upperChar  - character to replace upper-case characters with. Specify
                -1 to retain original character. Default value: 'X'
   lowerChar  - character to replace lower-case characters with. Specify
                -1 to retain original character. Default value: 'x'
   digitChar  - character to replace digit characters with. Specify -1
                to retain original character. Default value: 'n'
   otherChar  - character to replace all other characters with. Specify
                -1 to retain original character. Default value: -1
   numberChar - character to replace digits in a number with. Valid
                values: 0-9. Default value: '1'
   dayValue   - value to replace day field in a date with.
                Specify -1 to retain original value. Valid values: 1-31.
                Default value: 1
   monthValue - value to replace month field in a date with. Specify -1
                to retain original value. Valid values: 0-11. Default
                value: 0
   yearValue  - value to replace year field in a date with. Specify -1
                to retain original value. Default value: 1

In Hive, these functions accept variable length of arguments in
non-restricted types:
   mask_show_first_n(val)
   mask_show_first_n(val, 8)
   mask_show_first_n(val, 8, 'X', 'x', 'n')
   mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2)
   mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9')
The arguments of upperChar, lowerChar, digitChar, otherChar and
numberChar can be in string or numeric types.

Impala doesn't support Hive GenericUDFs, so we are lack of these mask
functions to support Ranger column masking policies. On the other hand,
we want the masking functions to be evaluated in the C++ builtin logic
rather than calling out to java UDFs for performance. This patch
introduces our builtin implementation of them.

We currently don't have a corresponding framework for GenericUDF
(IMPALA-9271), so we implement these by overloads. However, it may
requires hundreds of overloads to cover all possible combinations. We
just implement some important overloads, including
 - those used by Ranger default masking policies,
 - those with simple arguments and may be useful for users,
 - an overload with all arguments in int type for full functionality.
   Char argument need to be converted to their ASCII value.

Tests:
 - Add BE tests in expr-test

Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664
Reviewed-on: http://gerrit.cloudera.org:8080/14963
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-17 15:34:34 +00:00
norbert.luksa
a862282811 IMPALA-8709: Add Damerau-Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate restricted
Damerau-Levenshtein edit distance (optimal string alignment).
Implmented as dle_dst() and damerau_levenshtein(). If either value is
NULL or both values are NULL returns NULL which differs from Netezza's
dle_dst() which returns the length of the not NULL value or 0 if both
values are NULL. The NULL behavior matches the existing levenshtein()
function.

Also cleans up levenshtein tests.

Testing:
- Added unit tests to expr-test.cc
- Manual testing on over 1400 string pairs from
  http://marvin.cs.uidaho.edu/misspell.html and results match Netezza

Change-Id: Ib759817ec15e7075bf49d51e494e45c8af4db94d
Reviewed-on: http://gerrit.cloudera.org:8080/13794
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-22 21:39:21 +00:00
luksan47
8db7f27ddd IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function
The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calcuates the Jaro-Similarity of the
strings, then adds more weight to the result if there are
common prefixes. (Jaro-Winkler)
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-13 18:25:32 +00:00
Attila Jeges
f40935a30e IMPALA-7369: part 2: Add INTERVAL expr support and built-in functions for DATE
This change implements INTERVAL expression support for DATE type and
adds several DATE related built-in functions. The efficiency of the
DateValue::ToYearMonthDay() function used in many of the built-in
functions below was also improved.

The following functions are supported in Hive:

  INT YEAR(DATE d)
  Extracts year of the 'd' date, returns it as an int in 0-9999 range.

  INT MONTH(DATE d)
  Extracts month of the 'd' date and returns it as an int in 1-12
  range.

  INT DAY(DATE d), INT DAYOFMONTH(DATE d)
  Extracts day-of-month of the 'd' date and returns it as an int in
  1-31 range.

  INT QUARTER(DATE d)
  Extracts quarter of the 'd' date and returns it as an int in 1-4
  range.

  INT DAYOFWEEK(DATE d)
  Extracts day-of-week of the 'd' date and returns it as an int in
  1-7 range. 1 is Sunday and 7 is Saturday.

  INT DAYOFYEAR(DATE d)
  Extracts day-of-year of the 'd' date and returns it as an int in
  1-366 range.

  INT WEEKOFYEAR(DATE d)
  Extracts week-of-year of the 'd' date and returns it as an int in
  1-53 range.

  STRING DAYNAME(DATE d)
  Returns the day field from a 'd' date, converted to the string
  corresponding to that day name. The range of return values is
  "Sunday" to "Saturday".

  STRING MONTHNAME(DATE d)
  Returns the month field from a 'd' date, converted to the string
  corresponding to that month name. The range of return values is
  "January" to "December".

  DATE NEXT_DAY(DATE d, STRING weekday)
  Returns the first date which is later than 'd' and named as
  'weekday'. 'weekday' is 3 letters or full name of the day of the
  week.

  DATE LAST_DAY(DATE d)
  Returns the last day of the month which the 'd' date belongs to.

  INT DATEDIFF(DATE d1, DATE d2)
  Returns the number of days from 'd1' date to 'd2' date.

  DATE CURRENT_DATE()
  Returns the current date (in the local time zone).

  INT INT_MONTHS_BETWEEN(DATE d1, DATE d2)
  Returns the number of months between 'd1' and 'd2' dates, as an int
  representing only the full months that passed.
  If 'd1' represents an earlier date than 'd2', the result is
  negative.

  DOUBLE MONTHS_BETWEEN(DATE d1, DATE d2)
  Returns the number of months between 'd1' and 'd2' dates. Can
  include a fractional part representing extra days in addition to the
  full months between the dates. The fractional component is computed
  by dividing the difference in days by 31 (regardless of the month).
  If 'd1' represents an earlier date than 'd2', the result is
  negative.

  DATE ADD_YEARS(DATE d, INT/BIGINT num_years),
  DATE SUB_YEARS(DATE d, INT/BIGINT num_years)
  Adds/subtracts a specified number of years to a 'd' date value.

  DATE ADD_MONTHS(DATE d, INT/BIGINT num_months),
  DATE SUB_MONTHS(DATE d, INT/BIGINT num_months)
  Adds/subtracts a specified number of months to a date value.
  If 'd' is the last day of a month, the returned date will fall on
  the last day of the target month too.

  DATE ADD_DAYS(DATE d, INT/BIGINT num_days),
  DATE SUB_DAYS(DATE d, INT/BIGINT num_days)
  Adds/subtracts a specified number of days to a date value.

  DATE ADD_WEEKS(DATE d, INT/BIGINT num_weeks),
  DATE SUB_WEEKS(DATE d, INT/BIGINT num_weeks)
  Adds/subtracts a specified number of weeks to a date value.

The following function doesn't exist in Hive but supported by Amazon
Redshift

  INT DATE_CMP(DATE d1, DATE d2)
  Compares 'd1' and 'd2' dates. Returns:
  1. NULL, if either 'd1' or 'd2' is NULL
  2. -1 if d1 < d2
  3. 1 if d1 > d2
  4. 0 if d1 == d2
  (https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_CMP.html)

Change-Id: If404bffdaf055c769e79ffa8f193bac415cfdd1a
Reviewed-on: http://gerrit.cloudera.org:8080/13648
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-25 23:06:25 +00:00
Attila Jeges
f0678b06e6 IMPALA-7369: part 1: Implement TRUNC, DATE_TRUNC, EXTRACT, DATE_PART functions for DATE
These functions are somewhat similar in that each of them takes a DATE
argument and a time unit to work with.

They work identically to the corresponding TIMESTAMP functions. The
only difference is that the DATE functions don't accept time-of-day
units.

TRUNC(DATE d, STRING unit)
Truncates a DATE value to the specified time unit. The 'unit' argument
is case insensitive. This argument string can be one of:
  SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year.
  Q: Quarter.
  MONTH, MON, MM, RM: Month.
  DDD, DD, J: Day.
  DAY, DY, D: Starting day (Monday) of the week.
  WW: Truncates to the most recent date, no later than 'd', which is
      on the same day of the week as the first day of year.
  W: Truncates to the most recent date, no later than 'd', which is on
     the same day of the week as the first day of month.

The impelementation mirrors Impala's TRUNC(TIMESTAMP ts, STRING unit)
function. Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions201.htm
.

DATE_TRUNC(STRING unit, DATE d)
Truncates a DATE value to the specified precision. The 'unit' argument
is case insensitive. This argument string can be one of: DAY, WEEK,
MONTH, YEAR, DECADE, CENTURY, MILLENNIUM.

The implementation mirrors Impala's DATE_TRUNC(STRING unit,
TIMESTAMP ts) function. Vertica has a similar function too.
Reference:
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/
    SQLReferenceManual/Functions/Date-Time/DATE_TRUNC.htm
.

EXTRACT(DATE d, STRING unit), EXTRACT(unit FROM DATE d)
Returns one of the numeric date fields from a DATE value. The 'unit'
string can be one of YEAR, QUARTER, MONTH, DAY. This argument value is
case-insensitive.

The implementation mirrors that Impala's EXTRACT(TIMESTAMP ts,
STRING unit). Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
.

DATE_PART(STRING unit, DATE date)
Similar to EXTRACT(), with the argument order reversed. Supports the
same date units as EXTRACT().

The implementation mirrors Impala's DATE_PART(STRING unit,
TIMESTAMP ts) function.

Change-Id: I843358a45eb5faa2c134994600546fc1d0a797c8
Reviewed-on: http://gerrit.cloudera.org:8080/13363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-05 14:23:23 +00:00
Attila Jeges
b5805de3e6 IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.

This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.

- Explicit casting between DATE and other types (note that invalid
  casts will fail with an error just like invalid DECIMAL_V2 casts,
  while failed casts to other types do no lead to warning or error):
    - from STRING to DATE. The string value must be formatted as
      yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
      the time component is optional. If the time component is
      present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as
      yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time of day
      component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time of day
      component is set to 00:00:00.

- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a
      context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a
      context where a TIMESTAMP value is expected.

- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
  implicit conversions are now all possible, the existing function
  overload resolution logic is not adequate anymore.
  For example, it resolves the
  if(false, '2011-01-01', DATE '1499-02-02') function call to the
  if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
  function, instead of the if(BOOLEAN, DATE, DATE) version.

  This is clearly wrong, so the function overload resolution logic had
  to be changed to resolve function calls to the best-fit overloaded
  function definition if there are multiple applicable candidates.

  An overloaded function definition is an applicable candidate for a
  function call if each actual parameter in the function call either
  matches the corresponding formal parameter's type (without casting)
  or is implicitly castable to that type.

  When looking for the best-fit applicable candidate, a parameter
  match score (i.e. the number of actual parameters in the function
  call that match their corresponding formal parameter's type without
  casting) is calculated and the applicable candidate with the highest
  parameter match score is chosen.

  There's one more issue that the new resolution logic has to address:
  if two applicable candidates have the same parameter match score and
  the only difference between the two is that the first one requires a
  STRING -> TIMESTAMP implicit cast for some of its parameters while
  the second one requires a STRING -> DATE implicit cast for the same
  parameters then the first candidate has to be chosen not to break
  backward compatibility.
  E.g: year('2019-02-15') function call must resolve to
  year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is not
  implemented yet, so this is not an issue at the moment but it will
  be in the future.
  When the resolution algorithm considers overloaded function
  definitions, first it orders them lexicographically by the types in
  their parameter lists. To ensure the backward compatible behavior
  Primitivetype.DATE enum value has to come after
  PrimitiveType.TIMESTAMP.

- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
  math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them
in one change-set.

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
  corresponding HBASE table 'functional_hbase.date_tbl') was
  introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
    - since DATE type is supported for TEXT and HBASE fileformats
      only, most DATE tests were implemented separately in
      tests/query_test/test_date_queries.py.

Note, that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-23 13:33:57 +00:00