impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 09:58:28 -05:00

Author	SHA1	Message	Date
ttttttz	5d1f1e0180	IMPALA-14183: Rename the environment variable USE_APACHE_HIVE to USE_APACHE_HIVE_3 When the environment variable USE_APACHE_HIVE is set to true, build Impala for adapting to Apache Hive 3.x. In order to better distinguish it from Apache Hive 2.x later, rename USE_APACHE_HIVE to USE_APACHE_HIVE_3. Additionally, to facilitate referencing different versions of the Hive MetastoreShim, the major version of Hive has been added to the environment variable IMPALA_HIVE_DIST_TYPE. Change-Id: I11b5fe1604b6fc34469fb357c98784b7ad88574d Reviewed-on: http://gerrit.cloudera.org:8080/21724 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-12-03 13:38:45 +00:00
Arnab Karmakar	a2a11dec62	IMPALA-13263: Add single-argument overload for ST_ConvexHull() Implemented a single-argument version of ST_ConvexHull() to align with PostGIS behavior and simplify usage across geometry types. Testing: Added new tests in test_geospatial_functions.py for ST_ConvexHull(), which previously had no test coverage, to verify correctness across supported geometry types. Change-Id: Idb17d98f5e75929ec0143aa16195a84dd6e50796 Reviewed-on: http://gerrit.cloudera.org:8080/23604 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2025-11-18 10:26:04 +00:00
Joe McDonnell	1913ab46ed	IMPALA-14501: Migrate most scripts from impala-python to impala-python3 To remove the dependency on Python 2, existing scripts need to use python3 rather than python. These commands find those locations (for impala-python and regular python): git grep impala-python \| grep -v impala-python3 \| grep -v impala-python-common \| grep -v init-impala-python git grep bin/python \| grep -v python3 This removes or switches most of these locations by various means: 1. If a python file has a #!/bin/env impala-python (or python) but doesn't have a main function, it removes the hash-bang and makes sure that the file is not executable. 2. Most scripts can simply switch from impala-python to impala-python3 (or python to python3) with minimal changes. 3. The cm-api pypi package (which doesn't support Python 3) has been replaced by the cm-client pypi package and interfaces have changed. Rather than migrating the code (which hasn't been used in years), this deletes the old code and stops installing cm-api into the virtualenv. The code can be restored and revamped if there is any interest in interacting with CM clusters. 4. This switches tests/comparison over to impala-python3, but this code has bit-rotted. Some pieces can be run manually, but it can't be fully verified with Python 3. It shouldn't hold back the migration on its own. 5. This also replaces locations of impala-python in comments / documentation / READMEs. 6. kazoo (used for interacting with HBase) needed to be upgraded to a version that supports Python 3. The newest version of kazoo requires upgrades of other component versions, so this uses kazoo 2.8.0 to avoid needing other upgrades. The two remaining uses of impala-python are: - bin/cmake_aux/create_virtualenv.sh - bin/impala-env-versioned-python These will be removed separately when we drop Python 2 support completely. In particular, these are useful for testing impala-shell with Python 2 until we stop supporting Python 2 for impala-shell. The docker-based tests still use /usr/bin/python, but this can be switched over independently (and doesn't impact impala-python) Testing: - Ran core job - Ran build + dataload on Centos 7, Redhat 8 - Manual testing of individual scripts (except some bitrotted areas like the random query generator) Change-Id: If209b761290bc7e7c716c312ea757da3e3bca6dc Reviewed-on: http://gerrit.cloudera.org:8080/23468 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2025-10-22 16:30:17 +00:00
pranavyl	a61b90f860	IMPALA-13039: AES Encryption/ Decryption Support in Impala AES (Advanced Encryption Standard) crypto functions are widely recognized and respected encryption algorithm used to protect sensitive data which operate by transforming plaintext data into ciphertext using a symmetric key, ensuring confidentiality and integrity. This standard speciﬁes the Rijndael algorithm, a symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128 and 256 bits. The patch makes use of the EVP_*() algorithms from the OpenSSL library. The patch includes: 1. AES-GCM, AES-CTR, and AES-CFB encryption functionalities and AES-GCM, AES-ECB, AES-CTR, and AES-CFB decryption functionalities. 2. Support for both 128-bit and 256-bit key sizes for GCM and ECB modes. 3. Enhancements to EncryptionKey class to accommodate various AES modes. The aes_encrypt() and aes_decrypt() functions serve as entry points for encryption and decryption operations, handling encryption and decryption based on user-provided keys, AES modes, and initialization vectors (IVs). The implementation includes key length validation and IV vector size checks to ensure data integrity and confidentiality. Multiple AES modes: GCM, CFB, CTR for encryption, and GCM, CFB, CTR and ECB for decryption are supported to provide flexibility and compatibility with various use cases and OpenSSL features. AES-GCM is set as the default mode due to its strong security properties. AES-CTR and AES-CFB are provided as fallbacks for environments where AES-GCM may not be supported. Note that AES-GCM is not available in OpenSSL versions prior to 1.0.1, so having multiple methods ensures broader compatibility. Testing: The patch is thouroughly tested and the tests are included in exprs.test. Change-Id: I3902f2b1d95da4d06995cbd687e79c48e16190c9 Reviewed-on: http://gerrit.cloudera.org:8080/20447 Reviewed-by: Daniel Becker <daniel.becker@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2025-01-27 22:16:37 +00:00
Yida Wu	702131b677	IMPALA-13565: Add general AI platform support to ai_generate_text Currently only OpenAI sites are allowed for ai_generate_text(), this patch adds support for general AI platforms to the ai_generate_text function. It introduces a new flag, ai_additional_platforms, allowing Impala to access additional AI platforms. For these general AI platforms, only the openai standard is supported, and the default api credential serves as the api token for general platforms. The ai_api_key_jceks_secret parameter has been renamed to auth_credential to support passing both plain text and jceks encrypted secrets. A new impala_options parameter is added to ai_generate_text() to enable future extensions. Adds the api_standard option to impala_options, with "openai" as the only supported standard. Adds the credential_type option to impala_options for allowing the plain text as the token, by default it is set to jceks. Adds the payload option to impala_options for customized payload input. If set, the request will use the provided customized payload directly, and the response will follow the openai standard for parsing. The customized payload size must not exceed 5MB. Adding the impala_options parameter to ai_generate_text() should be fine for backward compatibility, as this is a relatively new feature. Example: 1. Add the site to ai_api_additional_platforms,like: ai_additional_platforms='new_ai.site,new_ai.com' 2. Example sql: select ai_generate_text("https://new_ai.com/v1/chat/completions", "hello", "model-name", "ai-api-token", "platform params", '{"api_standard":"openai", "credential_type":"plain", "payload":"payload content"}}') Tests: Added a new test AiFunctionsTestAdditionalSites. Manual tested the example with the Cloudera AI platform. Passed core and asan tests. Change-Id: I4ea2e1946089f262dda7ace73d5f7e37a5c98b14 Reviewed-on: http://gerrit.cloudera.org:8080/22130 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Abhishek Rawat <arawat@cloudera.com>	2025-01-16 19:12:07 +00:00
Mihaly Szjatinya	22723d0f27	IMPALA-7086: Cache timezone in _utc_timestamp() Added Prepare - Close routine around from/to_utc standard functions. This gives a consistent time improvement for constant timezones. Given sample table with 600M timestamp rows, on all-default environment the query below gives a stable 2-3 seconds improvement. SELECT count() FROM a_table where from_utc_timestamp(ts, "a_timezone") > "a_date"; Averaged results for Release, SET MT_OP=1, SET DISABLE_CODEGEN=TRUE: from_utc: 16,53s -> 12,53s to_utc: 14,02 - > 11,53 Change-Id: Icdf5ff82c5d0554333aef1bc3bba034a4cf48230 Reviewed-on: http://gerrit.cloudera.org:8080/21735 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-09-06 18:08:26 +00:00
Yida Wu	9837637d93	IMPALA-12920: Support ai_generate_text built-in function for OpenAI's chat completion API Added support for following built-in functions: - ai_generate_text_default(prompt) - ai_generate_text(ai_endpoint, prompt, ai_model, ai_api_key_jceks_secret, additional_params) 'ai_endpoint', 'ai_model' and 'ai_api_key_jceks_secret' are flagfile options. 'ai_generate_text_default(prompt)' syntax expects all these to be set to proper values. The other syntax, will try to use the provided input parameter values, but fallback to instance level values if the inputs are NULL or empty. Only public OpenAI (api.openai.com) and Azure OpenAI (openai.azure.com) API endpoints are currently supported. Exposed these functions in FunctionContext so that they can also be called from UDFs: - ai_generate_text_default(context, model) - ai_generate_text(context, ai_endpoint, prompt, ai_model, ai_api_key_jceks_secret, additional_params) Testing: - Added unit tests for AiGenerateTextInternal function - Added fe test for JniFrontend::getSecretFromKeyStore - Ran manual tests to make sure Impala can talk with OpenAI LLMs using 'ai_generate_text' built-in function. Example sql: select ai_generate_text("https://api.openai.com/v1/chat/completions", "hello", "gpt-3.5-turbo", "open-ai-key", '{"temperature": 0.9, "model": "gpt-4"}') - Tested using standalone UDF SDK and made sure that the UDFs can invoke BuiltInFunctions (ai_generate_text and ai_generate_text_default) Change-Id: Id4446957f6030bab1f985fdd69185c3da07d7c4b Reviewed-on: http://gerrit.cloudera.org:8080/21168 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-04-11 07:25:50 +00:00
jasonmfehr	3e4fdeece1	IMPALA-12824: Removes the prettyprint_duration Built-in Function The prettyprint_duration function was originally implemented in IMPALA-12824 to work with the workload management tables which stored durations in integer nanoseconds. These tables have changed to store decimal seconds. The prettyprint_duration function would have required a large investment of time to make it work with decimal values, and since the new format is more human readable anyways, this function has been removed. Change-Id: If2154c2ed9a7217ed4b7587adeae87df55ff03dc Reviewed-on: http://gerrit.cloudera.org:8080/21208 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-03-28 06:58:56 +00:00
jasonmfehr	d03ffc70f2	IMPALA-12824: Adds built-in functions prettyprint_duration and prettyprint_bytes. The prettyprint_duration function takes an integer input containing a number of nanoseconds and returns a human readable value breaking down the input by hours, minutes, seconds, milliseconds, microseconds, and nanoseconds. The prettyprint_bytes function takes an integer input containing a number of bytes and returns a human readable values breaking down the input by gigabytes, megabytes, kilobytes, and bytes. Functionality tests were added to the existing expr-test suite that tests built-in functions. Functional-query workloads were added in two new .test files under the testdata directory to exercise these two new functions. Corresponding pytests were added to run the tests in these new .test files. Benchmarks were added to expr-benchmark, and new benchmarks were generated with a release build running on a machine with the cpu Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz. Documentation was added to the built-in string functions docs. Change-Id: I3e76632ce21ad2ca5df474160338699a542a6913 Reviewed-on: http://gerrit.cloudera.org:8080/21038 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-02-21 04:23:28 +00:00
Eyizoha	e489ab35b1	IMPALA-12718: Provides UTF-8 support for the trim functions Currently, the trim function (including BTRIM, LTRIM, RTRIM) cannot correctly handle strings containing multi-byte UTF-8 characters. Multi-byte UTF-8 characters are interpreted as multiple single-byte characters, leading to unexpected results. This patch provides UTF-8 support for the trim functions, enabling these functions to correctly handle multi-byte UTF-8 characters (when set utf8_mode=true). It also introduces a set of trim functions with the 'utf8_' prefix, offering the same capability even when utf8_mode is not enabled. Testing: - Added new BE test case in ExprTest#Utf8Test - Added new E2E test case in TestUtf8StringFunctions Change-Id: I5cfaffd71009f16eae75910af835bd2a34410856 Reviewed-on: http://gerrit.cloudera.org:8080/20926 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-02-02 11:06:24 +00:00
Eyizoha	3af1930229	IMPALA-12322: Support converting UTC timestamps read from Kudu to local time This patch adds a query option 'convert_kudu_utc_timestamps' similar to 'convert_legacy_hive_parquet_utc_timestamps'. When enabled, it converts UTC timestamps read from Kudu to local timestamps. The corresponding modification also include predicate pushdown and runtime filter. Due to the ambiguity of timestamps caused by daylight saving time changes, it is difficult to resolve in the bloom filter. This patch additionally introduces a query option 'disable_kudu_local_timestamp_bloom_filter' to default disable the Kudu timestamp bloom filter after enabling time zone conversion in order to avoid erroneously filtering out data. However, for regions that do not observe daylight saving time, it can be set to false to re-enable the Kudu local timestamp bloom filter. Testing: - Add TestKuduTimestampConvert in query_test/test_kudu.py Perform end-to-end testing in a custom cluster, including basic Kudu UTC timestamp conversion testing, as well as checking if related predicate pushdown and runtime filters are working correctly (even with timestamps involving daylight saving time conversions). Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3 Reviewed-on: http://gerrit.cloudera.org:8080/20681 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>	2023-12-14 13:19:35 +00:00
Michael Smith	0a42185d17	IMPALA-9627: Update utility scripts for Python 3 (part 2) We're starting to see environments where the system Python ('python') is Python 3. Updates utility and build scripts to work with Python 3, and updates check-pylint-py3k.sh to check scripts that use system python. Fixes other issues found during a full build and test run with Python 3.8 as the default for 'python'. Fixes a impala-shell tip that was supposed to have been two tips (and had no space after period when they were printed). Removes out-of-date deploy.py and various Python 2.6 workarounds. Testing: - Full build with /usr/bin/python pointed to python3 - run-all-tests passed with python pointed to python3 - ran push_to_asf.py Change-Id: Idff388aff33817b0629347f5843ec34c78f0d0cb Reviewed-on: http://gerrit.cloudera.org:8080/19697 Reviewed-by: Michael Smith <michael.smith@cloudera.com> Tested-by: Michael Smith <michael.smith@cloudera.com>	2023-04-26 18:52:23 +00:00
Peter Rozsa	1d05381b7b	IMPALA-11745: Add Hive's ESRI geospatial functions as builtins This change adds geospatial functions from Hive's ESRI library as builtin UDFs. Plain Hive UDFs are imported without changes, but the generic and varargs functions are handled differently; generic functions are added with all of the combinations of their parameters (cartesian product of the parameters), and varargs functions are unfolded as an nth parameter simple function. The varargs function wrappers are generated at build time and they can be configured in gen_geospatial_udf_wrappers.py. These additional steps are required because of the limitations in Impala's UDF Executor (lack of varargs support and only partial generics support) which could be further improved; in this case, the additional wrapping/mapping steps could be removed. Changes regarding function handling/creating are sourced from https://gerrit.cloudera.org/c/19177 A new backend flag was added to turn this feature on/off as "geospatial_library". The default value is "NONE" which means no geospatial function gets registered as builtin, "HIVE_ESRI" value enables this implementation. The ESRI geospatial implementation for Hive currently only available in Hive 4, but CDP Hive backported it to Hive 3, therefore for Apache Hive this feature is disabled regardless of the "geospatial_library" flag. Known limitations: - ST_MultiLineString, ST_MultiPolygon only works with the WKT overload - ST_Polygon supports a maximum of 6 pairs of coordinates - ST_MultiPoint, ST_LineString supports a maximum of 7 pairs of coordinates - ST_ConvexHull, ST_Union supports a maximum of 6 geoms These limits can be increased in gen_geospatial_udf_wrappers.py Tests: - test_geospatial_udfs.py added based on https://github.com/Esri/spatial-framework-for-hadoop Co-Authored-by: Csaba Ringhofer <csringhofer@cloudera.com> Change-Id: If0ca02a70b4ba244778c9db6d14df4423072b225 Reviewed-on: http://gerrit.cloudera.org:8080/19425 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-02-07 20:18:47 +00:00
Csaba Ringhofer	7ca11dfc7f	IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Reviewed-on: http://gerrit.cloudera.org:8080/16066 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-08-19 13:55:42 +00:00
Joe McDonnell	ba4cb95b62	IMPALA-11257: Fix CMake warnings for module names and cmake_minimum_required This fixes a few different CMake warnings: 1. This removes cmake_minimum_required invocations except for the top-most CMakeLists.txt. This eliminates the warnings like this: Compatibility with CMake < 2.8.12 will be removed from a future version of CMake. Update the VERSION argument <min> value or use a ...<max> suffix to tell CMake that the project does not need compatibility with older versions. Moving to a later version also required setting CMAKE_ENABLE_EXPORTS to continue exporting symbols. 2. This modifies the module names so that they match the corresponding module names from Find*.cmake. This is mostly dealing with case differences. This address warnings like: The package name passed to `find_package_handle_standard_args` (PROTOBUF) does not match the name of the calling package (Protobuf). This can lead to problems in calling code that expects `find_package` result variables (e.g., `_FOUND`) to follow a certain pattern. This fixed the detection logic for KerberosPrograms, and so it required adding more Kerberos packages to bin/bootstrap_build.sh. 3. This adds a missing .cc suffix. This addresses the following warning: CMake Warning (dev) at be/src/util/CMakeLists.txt:141 (add_library): Policy CMP0115 is not set: Source file extensions must be explicit. Run "cmake --help-policy CMP0115" for policy details. Use the cmake_policy command to set the policy and suppress this warning. These fixes mostly match how these warnings were handled in Apache Kudu. Testing: - Ran GVO Change-Id: I2a97dd07cdd0831e90882a2035415ac71d670147 Reviewed-on: http://gerrit.cloudera.org:8080/18444 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-08-11 05:48:36 +00:00
Csaba Ringhofer	f24296aed5	IMPALA-11355: Add STRING overloads for hour/minute/second/millisecond IMPALA-9531 dropped support for "dateless timestamps", e.g. cast("12:05:05" as timestamp) now returns NULL. This led to breaking functions like minute("12:05:05"), as minute() expects a timestamp, and Impala adds an implicit cast, so what actually happens is minute(cast("12:05:05" as timestamp)), which returns NULL. This change adds overloads for similar functions that take STRING instead of TIMESTAMP parameter. The same functions already take a STRING parameter in Hive and mySQL. The changes in the parser mainly restore code removed in IMPALA-9531. Note that these functions could be potentially optimized by returning parts of the parse result without converting them to boost time first, but this is not done here to make the change minimal. Testing: - restored related tests in expr-test and added some new ones for malformed time-of-day strings - added benchmarks for the new overloads and fixed the ones for the old functions (they tested NULL) Change-Id: I6cc1c851ee71ab4fcc58105c7e9931155a483679 Reviewed-on: http://gerrit.cloudera.org:8080/18718 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-07-14 21:03:43 +00:00
Michael Smith	2243f331cb	IMPALA-11274: CNF Rewrite causes a regress in join node performance This patch defines a subset of all predicates that are common and relatively inexpensive to compute. Such predicates must involve columns, constants, simple math or cast functions only. Examples of the subset of the predicates allowed: 1. (a = 1 AND cast(b as int) = 2) OR (c = d AND e = f) 2. a in ('1', '2', '3') OR ((b = 'abc') AND (c = d)) 3. (a between 1 and 100) OR ((b is null) AND (c = d)) Examples of the predicates not allowed: 1. (upper(a) != 'Y') AND b = 2) OR (c = d AND e = f) 2. (coalesce(CAST(a AS string), '') = '') AND b = 2) OR (c = d AND e = f) This patch further restricts the predicates to be converted to conjunctive normal form (CNF) to be such a subset, with the aim to reduce the run-time evaluation overhead of CNFs in which some of the predicates can be duplicated. Uses a cache in branching expressions to avoid visiting the entire subtree on each call to applyRuleBottomUp. Skips cache complexity on casts as they don't branch and are unlikely to be deeply nested. Testing: - New expression writer tests - New planner tests Change-Id: I326406c6b004fe31ec0e2a2f390a3845b8925aa9 Reviewed-on: http://gerrit.cloudera.org:8080/18458 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-25 05:37:17 +00:00
stiga-huang	35375b3287	IMPALA-2019(part-4): Add UTF-8 support for case conversion functions There are 3 builtin case conversion string functions: upper(), lower(), and initcap(). Previously they only convert English alphabetic characters. This patch adds support to deal with Unicode characters. There are many corner cases in case conversion depending on the locale and context. E.g. 1) Case conversion is locale-sensitive. Turkish has 4 letter "I"s. English has only two, a lowercase dotted i and an uppercase dotless I. Turkish has lowercase and uppercase forms of both dotted and dotless I. So simply converting "i" to "I" for upper case is wrong in Turkish: +-------+--------+---------+ \| \| Dotted \| Dotless \| +-------+--------+---------+ \| Upper \| İ \| I \| +-------+--------+---------+ \| Lower \| i \| ı \| +-------+--------+---------+ 2) Case conversion may change a string's length. The German word "grüßen" should be converted to "GRÜSSEN" in upper case: the letter "ß" should be converted to "SS". 3) Case conversion is context-sensitive. The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the Greek letter "Σ" is converted to "σ" or to "ς", depending on its position in the word. The above cases will be focus in follow-up JIRAs. This patch addes the initial implementation of UTF-8 aware case conversion functions. -------- Implementation: In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the bytes in strings are converted to wide characters using std::mbrtowc(). Each wide character (wchar_t) will then be converted using std::towupper or std::towlower correspondingly. We then convert them back to multi bytes using std::wcrtomb(). Note that these builtins are locale aware. If impalad is launched without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't recognize non-ascii characters, which will return unexpected results. Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C". This patch also logs the current locale when launching impala daemons for better debugging. We will support customized locale in IMPALA-11080. Test: - Add BE unit tests and e2e tests. Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd Reviewed-on: http://gerrit.cloudera.org:8080/17785 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-15 18:40:59 +00:00
pranav.lodha	bde995483a	IMPALA-955: BYTES built-in function The Bytes function returns the number of bytes contained in the specified byte string. There are changes in 4 files. A few testcases are also added in be/src/exprs/expr-test.cc and an end-to end test in testdata/workloads/functional-query/queries/QueryTest/exprs.test. Change-Id: I0bd06c3d6dba354d71f63c649eaa8f9f74d266ee Reviewed-on: http://gerrit.cloudera.org:8080/18210 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-11 07:01:58 +00:00
stiga-huang	3850d49711	IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions Mask functions are used in Ranger column masking policies to mask sensitive data. There are 5 mask functions: mask(), mask_first_n(), mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as an example, by default, it will mask uppercase to 'X', lowercase to 'x', digits to 'n' and leave other characters unmasked. For masking all characters to '', we can use mask(my_col, '', '', '', ''); The current implementations mask strings byte-to-byte, which have inconsistent results with Hive when the string contains unicode characters: mask('中国', '', '', '', '') => '***' Each Chinese character is encoded into 3 bytes in UTF-8 so we get the above result. The result in Hive is '' since there are two Chinese characters. This patch provides consistent masking behavior with Hive for strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode, the masked unit of a string is a unicode code point. Implementation - Extends the existing MaskTransform function to deal with unicode code points(represented by uint32_t). - Extends the existing GetFirstChar function to get the code point of given masked charactors in UTF-8 mode. - Implement a MaskSubStrUtf8 method as the core functionality. - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode. - For better testing, this patch also adds an overload for all mask functions for only masking other chars but keeping the upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X'). Tests - Add BE tests in expr-test - Add e2e tests in utf8-string-functions.test Change-Id: I1276eccc94c9528507349b155a51e76f338367d5 Reviewed-on: http://gerrit.cloudera.org:8080/17780 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-09-15 05:04:07 +00:00
Amogh Margoor	8e7a6227a4	IMPALA-10730: Add MD5 fuction to compute 128-bit checksum for a string. Built-in function has been added to compute MD5 128-bit checksum for a non-null string. If input string is null, then output of the function is null too. In FIPS mode, MD5 is disabled and function will throw error on invocation. Testing: 1. Added expression unit tests. 2. Added end-to-end tests for MD5. Change-Id: Id406d30a7cc6573212b302fbfec43eb848352ff2 Reviewed-on: http://gerrit.cloudera.org:8080/17567 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-06-10 22:24:16 +00:00
Fucun Chu	ce21fe74b8	IMPALA-10689: Implement ds_cpc_union_f() function. This function receives two strings that are serialized Apache DataSketches CPC sketches. Union two sketches and returns the resulting sketch of union. Example: select ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) from sketch_tbl; +---------------------------------------------------+ \| ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) \| +---------------------------------------------------+ \| 15 \| +---------------------------------------------------+ Change-Id: Ib5c616316bf2bf2ff437678e9a44a15339920150 Reviewed-on: http://gerrit.cloudera.org:8080/17440 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-06-07 12:53:23 +00:00
Fucun Chu	67de4a48b0	IMPALA-10688: Implement ds_cpc_stringify() function This function receives a string that is a serialized Apache DataSketches CPC sketch and returns its stringified format. A stringified format should look like and contains the following data: select ds_cpc_stringify(ds_cpc_sketch(float_col)) from functional_parquet.alltypestiny; +--------------------------------------------+ \| ds_cpc_stringify(ds_cpc_sketch(float_col)) \| +--------------------------------------------+ \| ### CPC sketch summary: \| \| lg_k : 11 \| \| seed hash : 93cc \| \| C : 2 \| \| flavor : 1 \| \| merged : true \| \| intresting col : 0 \| \| table entries : 2 \| \| window : not allocated \| \| ### End sketch summary \| \| \| +--------------------------------------------+ Change-Id: I8c9d089bfada6bebd078d8f388d2e146c79e5285 Reviewed-on: http://gerrit.cloudera.org:8080/17373 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>	2021-05-26 06:42:02 +00:00
Amogh Margoor	86beb2f9dd	IMPALA-10679: Add builtin functions to comptute SHA-1 and SHA-2 digest. Built-in functions to compute SHA-1 digest and SHA-2 family of digest has been added. Support for SHA2 digest includes SHA224, SHA256, SHA384 and SHA512. In FIPS mode SHA1, SHA224 and SHA256 have been disabled and will throw error. SHA2 functions will also throw error for unsupported bit length i.e., bit length apart from 224, 256, 384, 512. Testing: 1. Added Unit test for expressions. 2. Added end-to-end test for new functions. Change-Id: If163b7abda17cca3074c86519d59bcfc6ace21be Reviewed-on: http://gerrit.cloudera.org:8080/17464 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-05-18 18:48:10 +00:00
Fucun Chu	e39c30b3cd	IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions These functions can be used to get cardinality estimates of data using CPC algorithm from Apache DataSketches. ds_cpc_sketch() receives a dataset, e.g. a column from a table, and returns a serialized CPC sketch in string format. This can be written to a table or be fed directly to ds_cpc_estimate() that returns the cardinality estimate for that sketch. Similar to the HLL sketch, the primary use-case for the CPC sketch is for counting distinct values as a stream, and then merging multiple sketches together for a total distinct count. For more details about Apache DataSketches' CPC see: http://datasketches.apache.org/docs/CPC/CPC.html Figures-of-Merit Comparison of the HLL and CPC Sketches see: https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on tpch_parquet.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3x faster. CPC gives closer estimate than current ndv(). CPC is more accurate than HLL in some cases Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a Reviewed-on: http://gerrit.cloudera.org:8080/16656 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-05-11 18:07:40 +00:00
Fucun Chu	77d6acd032	IMPALA-10581: Implement ds_theta_intersect_f() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Computes the intersection of two sketches of same or different column and returns the resulting sketch of intersection. Example: select ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) from sketch_tbl; +-----------------------------------------------------------+ \| ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) \| +-----------------------------------------------------------+ \| 5 \| +-----------------------------------------------------------+ Change-Id: I335eada00730036d5433775cfe673e0e4babaa01 Reviewed-on: http://gerrit.cloudera.org:8080/17186 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-29 15:59:49 +00:00
Fucun Chu	622e3c95ad	IMPALA-10580: Implement ds_theta_union_f() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Union two sketches and returns the resulting sketch of union. Example: select ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) from sketch_tbl; +-------------------------------------------------------+ \| ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) \| +-------------------------------------------------------+ \| 15 \| +-------------------------------------------------------+ Change-Id: I8329979b81ceeaad739a43fab79768ca9c2916fa Reviewed-on: http://gerrit.cloudera.org:8080/17179 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-24 15:16:07 +00:00
Fucun Chu	3e82501531	IMPALA-10558: Implement ds_theta_exclude() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Computes the a-not-b set operation given two sketches of same or different column. Example: select ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) from sketch_tbl; +-------------------------------------------------------+ \| ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) \| +-------------------------------------------------------+ \| 5 \| +-------------------------------------------------------+ Change-Id: I05119fd8c652c07ff248a99e44b0da3541e46ca3 Reviewed-on: http://gerrit.cloudera.org:8080/17153 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-17 22:14:44 +00:00
Fucun Chu	65c6a81ed9	IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions These functions can be used to get cardinality estimates of data using Theta algorithm from Apache DataSketches. ds_theta_sketch() receives a dataset, e.g. a column from a table, and returns a serialized Theta sketch in string format. This can be written to a table or be fed directly to ds_theta_estimate() that returns the cardinality estimate for that sketch. Similar to the HLL sketch, the primary use-case for the Theta sketch is for counting distinct values as a stream, and then merging multiple sketches together for a total distinct count. For more details about Apache DataSketches' Theta see: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on tpch25_parquet.lineitem to compare perfomance with ds_hll_. ds_theta_ is faster than ds_hll_* on the original data, the difference is around 1%-10%. ds_hll_estimate() is faster than ds_theta_estimate() on existing sketch. HLL and Theta gives closer estimate except for string. see IMPALA-10464. Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc Reviewed-on: http://gerrit.cloudera.org:8080/17008 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-02-17 17:09:48 +00:00
stiga-huang	e8720b40f1	IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions A unicode character can be encoded into 1-4 bytes in UTF-8. String functions will return undesired results when the input contains unicode characters, because we deal with a string as a byte array. For instance, length() returns the length in bytes, not in unicode characters. UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem. This patch adds UTF-8 support in some string functions so they can have UTF-8 aware behavior. For compatibility with the old versions, a new query option, UTF8_MODE, is added for turning on/off the UTF-8 aware behavior. Currently, only length(), substring() and reverse() support it. Other function supports will be added in later patches. String functions will check the query option and switch to use the desired implementation. It's similar to how we use the decimal_v2 query option in builtin functions. For easy testing, the UTF-8 aware version of string functions are also exposed as builtin functions (named by utf8_*, e.g. utf8_length). Tests: - Add BE tests for utf8 functions. - Add e2e tests for the UTF8_MODE query option. Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c Reviewed-on: http://gerrit.cloudera.org:8080/16908 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-01-26 00:43:39 +00:00
stiga-huang	9bb7157bf0	IMPALA-10387: Add missing overloads of mask functions used in Ranger default masking policies The mask functions in Hive are implemented through GenericUDFs which can accept an infinite number of function signatures. Impala currently don't support GenericUDFs. So we provide builtin mask functions with limited overloads. This patch adds some missing overloads that could be used by Ranger default masking policies, e.g. MASK_HASH, MASK_SHOW_LAST_4, MASK_DATE_SHOW_YEAR, etc. Tests: - Add test coverage on all default masking policies applied on all supported types. Change-Id: Icf3e70fd7aa9f3b6d6b508b776696e61ec1fcc2e Reviewed-on: http://gerrit.cloudera.org:8080/16930 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-01-15 13:01:53 +00:00
Gabor Kaszab	6cb7cecacf	IMPALA-10237: Support Bucket and Truncate partition transforms as built-in functions This patch implements Truncate and Bucket partition transforms in Impala BE as built-in functions. The expectation is that these functions give the same result as Iceberg's implementation of the same functions. These built-in functions are invisible so users won't be able to invoke them e.g. from impala-shell. Truncate: - Supported types are IntVal, BigIntVal, StringVal, DecimalVal. - Receives an input from the above types and a width. - Returns the same type as the input. - Expected behaviour is explained here: https://iceberg.apache.org/spec/#truncate-transform-details Bucket: - Supported types are IntVal, BigIntVal, StringVal, DecimalVal, DateVal, TimestampVal. - Receives an input from the above types and the number of buckets as IntVal. - Returns IntVal. - Expected behaviour is explained here: https://iceberg.apache.org/spec/#bucket-transform-details Change-Id: I485680cf79d96d578dd8cfbfd554bec468fe84bd Reviewed-on: http://gerrit.cloudera.org:8080/16741 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-12-19 03:09:06 +00:00
Fucun Chu	8ea49e9b02	IMPALA-10134: Implement ds_hll_union_f() function. This function receives two strings that are serialized Apache DataSketches HLL sketches. Union two sketches and returns the resulting sketch of union. Example: select ds_hll_estimate(ds_hll_union_f(i_i, h_i)) from hll_sketches_impala_hive2; +-------------------------------------------+ \| ds_hll_estimate(ds_hll_union_f(i_i, h_i)) \| +-------------------------------------------+ \| 7 \| +-------------------------------------------+ Change-Id: Ic06e959ed956af5cedbfc7d4d063141d5babb2a8 Reviewed-on: http://gerrit.cloudera.org:8080/16711 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-11-25 14:33:10 +00:00
Fucun Chu	193c2e773f	IMPALA-10132: Implement ds_hll_estimate_bounds_as_string() function. This function receives a string that is a serialized Apache DataSketches HLL sketch and optional kappa that is a number of standard deviations from the mean: 1, 2 or 3 (default 2). Returns estimate and bounds with the values separated with commas. The result is three values: estimate, lower bound and upper bound. ds_hll_estimate_bounds_as_string(sketch [, kappa]) Kappa: 1 represent the 68.3% confidence bounds 2 represent the 95.4% confidence bounds 3 represent the 99.7% confidence bounds Note, ds_hll_estimate_bounds() should return an Array of doubles as the result but with that we have to wait for the complex type support. Until, we provide ds_hll_estimate_bounds_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Example: select ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) from functional_parquet.alltypestiny; +----------------------------------------------------------+ \| ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) \| +----------------------------------------------------------+ \| 2,2,2.0002 \| +----------------------------------------------------------+ Change-Id: I46bf8263e8fd3877a087b9cb6f0d1a2392bb9153 Reviewed-on: http://gerrit.cloudera.org:8080/16626 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-10-29 17:45:01 +00:00
Adam Tamas	99e5f5a885	IMPALA-10133:Implement ds_hll_stringify function. This function receives a string that is a serialized Apache DataSketches HLL sketch and returns its stringified format. A stringified format should look like and contains the following data: select ds_hll_stringify(ds_hll_sketch(float_col)) from functional_parquet.alltypestiny; +--------------------------------------------+ \| ds_hll_stringify(ds_hll_sketch(float_col)) \| +--------------------------------------------+ \| ### HLL sketch summary: \| \| Log Config K : 12 \| \| Hll Target : HLL_4 \| \| Current Mode : LIST \| \| LB : 2 \| \| Estimate : 2 \| \| UB : 2.0001 \| \| OutOfOrder flag: false \| \| Coupon count : 2 \| \| ### End HLL sketch summary \| \| \| +--------------------------------------------+ Change-Id: I85dbf20b5114dd75c300eef0accabe90eac240a0 Reviewed-on: http://gerrit.cloudera.org:8080/16382 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-03 12:11:48 +00:00
Adam Tamas	4cb3c3556e	IMPALA-10108: Implement ds_kll_stringify function This function receives a string that is a serialized Apache DataSketches KLL sketch and returns its stringified format. A stringified format should look like and contains the following data: select ds_kll_stringify(ds_kll_sketch(float_col)) from functional_parquet.alltypestiny; +--------------------------------------------+ \| ds_kll_stringify(ds_kll_sketch(float_col)) \| +--------------------------------------------+ \| ### KLL sketch summary: \| \| K : 200 \| \| min K : 200 \| \| M : 8 \| \| N : 8 \| \| Epsilon : 1.33% \| \| Epsilon PMF : 1.65% \| \| Empty : false \| \| Estimation mode: false \| \| Levels : 1 \| \| Sorted : false \| \| Capacity items : 200 \| \| Retained items : 8 \| \| Storage bytes : 64 \| \| Min value : 0 \| \| Max value : 1.1 \| \| ### End sketch summary \| \| \| +--------------------------------------------+ Change-Id: I97f654a4838bf91e3e0bed6a00d78b2c7aa96f75 Reviewed-on: http://gerrit.cloudera.org:8080/16370 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-02 10:49:10 +00:00
Gabor Kaszab	28d94851b1	IMPALA-10020: Implement ds_kll_cdf_as_string() function This is the support for Cumulative Distribution Function (CDF) from Apache DataSketches KLL algorithm collection. It receives a serialized KLL sketch and one or more float values to represent ranges in the sketched values. E.g. [1, 5, 10] will mean the following ranges: (-inf, 1), (-inf, 5), (-inf, 10), (-inf, +inf) Returns a comma separated string where each value in the string is a number in the range of [0,1] and shows that what percentage of the data is in the particular ranges. Note, ds_kll_cdf() should return an Array of doubles as the result but with that we have to wait for the complex type support. Until, we provide ds_kll_cdf_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Example: select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) from alltypes; +----------------------------------------------------------+ \| ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) \| +----------------------------------------------------------+ \| 0.2,0.401644,1,1 \| +----------------------------------------------------------+ Change-Id: I77e6afc4556ad05a295b89f6d06c2e4a6bb2cf82 Reviewed-on: http://gerrit.cloudera.org:8080/16359 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-26 10:59:49 +00:00
Gabor Kaszab	a8a35edbc4	IMPALA-10019: Implement ds_kll_pmf_as_string() function This is the support for Probabilistic Mass Function (PMF) from Apache DataSketches KLL algorithm collection. It receives a serialized KLL sketch and one or more float values to represent ranges in the sketched values. E.g. [1, 5, 10] will mean the following ranges: (-inf, 1), [1, 5), [5, 10), [10, +inf) Returns a comma separated string where each value in the string is a number in the range of [0,1] and shows that what percentage of the data is in the particular ranges. Note, ds_kll_pmf() should return an Array of doubles as the result but with that we have to wait for the complex type support. Until, we provide ds_kll_pmf_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Example: select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) from alltypes; +----------------------------------------------------------+ \| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) \| +----------------------------------------------------------+ \| 0.202192,0.199452,0.598356,0 \| +----------------------------------------------------------+ Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba Reviewed-on: http://gerrit.cloudera.org:8080/16336 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-26 01:50:16 +00:00
Gabor Kaszab	41065845e9	IMPALA-9962: Implement ds_kll_quantiles_as_string() function This function is very similar to ds_kll_quantile() but this one can receive any number of rank parameters and returns a comma separated string that holds the results for all of the given ranks. For more details about ds_kll_quantile() see IMPALA-9959. Note, ds_kll_quantiles() should return an Array of floats as the result but with that we have to wait for the complex type support. Until, we provide ds_kll_quantiles_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Change-Id: I76f6039977f4e14ded89a3ee4bc4e6ff855f5e7f Reviewed-on: http://gerrit.cloudera.org:8080/16324 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-25 18:06:22 +00:00
Gabor Kaszab	0736fcf691	IMPALA-10018: Implement ds_kll_rank() function ds_kll_rank() receives two parameters: a STRING that represents a serialized DataSketches KLL sketch and a float to provide a probing value in the sketch. Returns a DOUBLE that is the rank of the given probing value in the range of [0,1]. E.g. a return value of 0.2 means that the probing value given as parameter is greater than the 20% of all the values in the sketch. Note, this is an approximate calculation. Change-Id: I95857886dfbb8c84aeeaf718c0e610012fda4be0 Reviewed-on: http://gerrit.cloudera.org:8080/16283 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-07 12:58:40 +00:00
Gabor Kaszab	87aeb2ad78	IMPALA-9963: Implement ds_kll_n() function This function receives a serialized Apache DataSketches KLL sketch and returns how many input values were fed into this sketch. Change-Id: I166e87a468e68e888ac15fca7429ac2552dbb781 Reviewed-on: http://gerrit.cloudera.org:8080/16259 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-06 19:15:04 +00:00
Gabor Kaszab	033a4607e2	IMPALA-9959: Implement ds_kll_sketch() and ds_kll_quantile() functions ds_kll_sketch() is an aggregate function that receives a float parameter (e.g. a float column of a table) and returns a serialized Apache DataSketches KLL sketch of the input data set wrapped into STRING type. This sketch can be saved into a table or view and later used for quantile approximations. ds_kll_quantile() receives two parameters: a STRING parameter that contains a serialized KLL sketch and a DOUBLE that represents the rank of the quantile in the range of [0,1]. E.g. rank=0.1 means the approximate value in the sketch where 10% of the sketched items are less than or equals to this value. Testing: - Added automated tests on small data sets to check the basic functionality of sketching and getting a quantile approximate. - Tested on TPCH25_parquet.lineitem to check that sketching and approximating works on bigger scale as well where serialize/merge phases are also required. On this scale the error range of the quantile approximation is within 1-1.5% Change-Id: I11de5fe10bb5d0dd42fb4ee45c4f21cb31963e52 Reviewed-on: http://gerrit.cloudera.org:8080/16235 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-07-31 14:34:49 +00:00
Adam Tamas	1bafb7bd29	IMPALA-9531: Dropped support for dateless timestamps Removed the support for dateless timestamps. During dateless timestamp casts if the format doesn't contain date part we get an error during tokenization of the format. If the input str doesn't contain a date part then we get null result. Examples: select cast('01:02:59' as timestamp); This will come back as NULL value. select to_timestamp('01:01:01', 'HH:mm:ss'); select cast('01:02:59' as timestamp format 'HH12:MI:SS'); select cast('12 AM' as timestamp FORMAT 'AM.HH12'); These will come back with a parsing errors. Casting from a table will generate similar results. Testing: Modified the previous tests related to dateless timestamps. Added test to read fromtables which are still containing dateless timestamps and covered timestamp to string path when no date tokens are requested in the output string. Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Reviewed-on: http://gerrit.cloudera.org:8080/15866 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>	2020-07-08 19:32:15 +00:00
Gabor Kaszab	7e456dfa9d	IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Reviewed-on: http://gerrit.cloudera.org:8080/16000 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2020-07-07 14:11:21 +00:00
stiga-huang	0936384271	IMPALA-9010: Add builtin mask functions There're 6 builtin GenericUDFs for column masking in Hive: mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_first_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_hash(value) mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar, dayValue, monthValue, yearValue) Description of the parameters: value - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING, VARCHAR, CHAR, DATE(only for mask()). charCount - number of characters. Default value: 4 upperChar - character to replace upper-case characters with. Specify -1 to retain original character. Default value: 'X' lowerChar - character to replace lower-case characters with. Specify -1 to retain original character. Default value: 'x' digitChar - character to replace digit characters with. Specify -1 to retain original character. Default value: 'n' otherChar - character to replace all other characters with. Specify -1 to retain original character. Default value: -1 numberChar - character to replace digits in a number with. Valid values: 0-9. Default value: '1' dayValue - value to replace day field in a date with. Specify -1 to retain original value. Valid values: 1-31. Default value: 1 monthValue - value to replace month field in a date with. Specify -1 to retain original value. Valid values: 0-11. Default value: 0 yearValue - value to replace year field in a date with. Specify -1 to retain original value. Default value: 1 In Hive, these functions accept variable length of arguments in non-restricted types: mask_show_first_n(val) mask_show_first_n(val, 8) mask_show_first_n(val, 8, 'X', 'x', 'n') mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2) mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9') The arguments of upperChar, lowerChar, digitChar, otherChar and numberChar can be in string or numeric types. Impala doesn't support Hive GenericUDFs, so we are lack of these mask functions to support Ranger column masking policies. On the other hand, we want the masking functions to be evaluated in the C++ builtin logic rather than calling out to java UDFs for performance. This patch introduces our builtin implementation of them. We currently don't have a corresponding framework for GenericUDF (IMPALA-9271), so we implement these by overloads. However, it may requires hundreds of overloads to cover all possible combinations. We just implement some important overloads, including - those used by Ranger default masking policies, - those with simple arguments and may be useful for users, - an overload with all arguments in int type for full functionality. Char argument need to be converted to their ASCII value. Tests: - Add BE tests in expr-test Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664 Reviewed-on: http://gerrit.cloudera.org:8080/14963 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-01-17 15:34:34 +00:00
norbert.luksa	a862282811	IMPALA-8709: Add Damerau-Levenshtein edit distance built-in function This patch adds new built-in functions to calculate restricted Damerau-Levenshtein edit distance (optimal string alignment). Implmented as dle_dst() and damerau_levenshtein(). If either value is NULL or both values are NULL returns NULL which differs from Netezza's dle_dst() which returns the length of the not NULL value or 0 if both values are NULL. The NULL behavior matches the existing levenshtein() function. Also cleans up levenshtein tests. Testing: - Added unit tests to expr-test.cc - Manual testing on over 1400 string pairs from http://marvin.cs.uidaho.edu/misspell.html and results match Netezza Change-Id: Ib759817ec15e7075bf49d51e494e45c8af4db94d Reviewed-on: http://gerrit.cloudera.org:8080/13794 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-11-22 21:39:21 +00:00
luksan47	8db7f27ddd	IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function The added functions return the Jaro/Jaro-Winkler similarity/distance of two strings. The algorithm calcuates the Jaro-Similarity of the strings, then adds more weight to the result if there are common prefixes. (Jaro-Winkler) For more detail, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance Extended the algorithm with another optional parameter: boost threshold The prefix weight will only be applied if the Jaro-similarity exceeds the given threshold. By default, its value is 0.7. The new built-in functions are: * jaro_distance, jaro_dst * jaro_similarity, jaro_sim * jaro_winkler_distance, jw_dst * jaro_winkler_similarity, jw_sim Testing: * Added unit tests to expr-test.cc * Manual testing over 1400 word pairs from http://marvin.cs.uidaho.edu/misspell.html Results match Apache commons Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c Reviewed-on: http://gerrit.cloudera.org:8080/13870 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-08-13 18:25:32 +00:00
Attila Jeges	f40935a30e	IMPALA-7369: part 2: Add INTERVAL expr support and built-in functions for DATE This change implements INTERVAL expression support for DATE type and adds several DATE related built-in functions. The efficiency of the DateValue::ToYearMonthDay() function used in many of the built-in functions below was also improved. The following functions are supported in Hive: INT YEAR(DATE d) Extracts year of the 'd' date, returns it as an int in 0-9999 range. INT MONTH(DATE d) Extracts month of the 'd' date and returns it as an int in 1-12 range. INT DAY(DATE d), INT DAYOFMONTH(DATE d) Extracts day-of-month of the 'd' date and returns it as an int in 1-31 range. INT QUARTER(DATE d) Extracts quarter of the 'd' date and returns it as an int in 1-4 range. INT DAYOFWEEK(DATE d) Extracts day-of-week of the 'd' date and returns it as an int in 1-7 range. 1 is Sunday and 7 is Saturday. INT DAYOFYEAR(DATE d) Extracts day-of-year of the 'd' date and returns it as an int in 1-366 range. INT WEEKOFYEAR(DATE d) Extracts week-of-year of the 'd' date and returns it as an int in 1-53 range. STRING DAYNAME(DATE d) Returns the day field from a 'd' date, converted to the string corresponding to that day name. The range of return values is "Sunday" to "Saturday". STRING MONTHNAME(DATE d) Returns the month field from a 'd' date, converted to the string corresponding to that month name. The range of return values is "January" to "December". DATE NEXT_DAY(DATE d, STRING weekday) Returns the first date which is later than 'd' and named as 'weekday'. 'weekday' is 3 letters or full name of the day of the week. DATE LAST_DAY(DATE d) Returns the last day of the month which the 'd' date belongs to. INT DATEDIFF(DATE d1, DATE d2) Returns the number of days from 'd1' date to 'd2' date. DATE CURRENT_DATE() Returns the current date (in the local time zone). INT INT_MONTHS_BETWEEN(DATE d1, DATE d2) Returns the number of months between 'd1' and 'd2' dates, as an int representing only the full months that passed. If 'd1' represents an earlier date than 'd2', the result is negative. DOUBLE MONTHS_BETWEEN(DATE d1, DATE d2) Returns the number of months between 'd1' and 'd2' dates. Can include a fractional part representing extra days in addition to the full months between the dates. The fractional component is computed by dividing the difference in days by 31 (regardless of the month). If 'd1' represents an earlier date than 'd2', the result is negative. DATE ADD_YEARS(DATE d, INT/BIGINT num_years), DATE SUB_YEARS(DATE d, INT/BIGINT num_years) Adds/subtracts a specified number of years to a 'd' date value. DATE ADD_MONTHS(DATE d, INT/BIGINT num_months), DATE SUB_MONTHS(DATE d, INT/BIGINT num_months) Adds/subtracts a specified number of months to a date value. If 'd' is the last day of a month, the returned date will fall on the last day of the target month too. DATE ADD_DAYS(DATE d, INT/BIGINT num_days), DATE SUB_DAYS(DATE d, INT/BIGINT num_days) Adds/subtracts a specified number of days to a date value. DATE ADD_WEEKS(DATE d, INT/BIGINT num_weeks), DATE SUB_WEEKS(DATE d, INT/BIGINT num_weeks) Adds/subtracts a specified number of weeks to a date value. The following function doesn't exist in Hive but supported by Amazon Redshift INT DATE_CMP(DATE d1, DATE d2) Compares 'd1' and 'd2' dates. Returns: 1. NULL, if either 'd1' or 'd2' is NULL 2. -1 if d1 < d2 3. 1 if d1 > d2 4. 0 if d1 == d2 (https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_CMP.html) Change-Id: If404bffdaf055c769e79ffa8f193bac415cfdd1a Reviewed-on: http://gerrit.cloudera.org:8080/13648 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-25 23:06:25 +00:00
Attila Jeges	f0678b06e6	IMPALA-7369: part 1: Implement TRUNC, DATE_TRUNC, EXTRACT, DATE_PART functions for DATE These functions are somewhat similar in that each of them takes a DATE argument and a time unit to work with. They work identically to the corresponding TIMESTAMP functions. The only difference is that the DATE functions don't accept time-of-day units. TRUNC(DATE d, STRING unit) Truncates a DATE value to the specified time unit. The 'unit' argument is case insensitive. This argument string can be one of: SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year. Q: Quarter. MONTH, MON, MM, RM: Month. DDD, DD, J: Day. DAY, DY, D: Starting day (Monday) of the week. WW: Truncates to the most recent date, no later than 'd', which is on the same day of the week as the first day of year. W: Truncates to the most recent date, no later than 'd', which is on the same day of the week as the first day of month. The impelementation mirrors Impala's TRUNC(TIMESTAMP ts, STRING unit) function. Hive and Oracle SQL have a similar function too. Reference: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions201.htm . DATE_TRUNC(STRING unit, DATE d) Truncates a DATE value to the specified precision. The 'unit' argument is case insensitive. This argument string can be one of: DAY, WEEK, MONTH, YEAR, DECADE, CENTURY, MILLENNIUM. The implementation mirrors Impala's DATE_TRUNC(STRING unit, TIMESTAMP ts) function. Vertica has a similar function too. Reference: https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/ SQLReferenceManual/Functions/Date-Time/DATE_TRUNC.htm . EXTRACT(DATE d, STRING unit), EXTRACT(unit FROM DATE d) Returns one of the numeric date fields from a DATE value. The 'unit' string can be one of YEAR, QUARTER, MONTH, DAY. This argument value is case-insensitive. The implementation mirrors that Impala's EXTRACT(TIMESTAMP ts, STRING unit). Hive and Oracle SQL have a similar function too. Reference: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm . DATE_PART(STRING unit, DATE date) Similar to EXTRACT(), with the argument order reversed. Supports the same date units as EXTRACT(). The implementation mirrors Impala's DATE_PART(STRING unit, TIMESTAMP ts) function. Change-Id: I843358a45eb5faa2c134994600546fc1d0a797c8 Reviewed-on: http://gerrit.cloudera.org:8080/13363 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-05 14:23:23 +00:00
Attila Jeges	b5805de3e6	IMPALA-7368: Add initial support for DATE type DATE values describe a particular year/month/day in the form yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a time of day component. The range of values supported for the DATE type is 0000-01-01 to 9999-12-31. This initial DATE type support covers TEXT and HBASE fileformats only. 'DateValue' is used as the internal type to represent DATE values. The changes are as follows: - Support for DATE literal syntax. - Explicit casting between DATE and other types (note that invalid casts will fail with an error just like invalid DECIMAL_V2 casts, while failed casts to other types do no lead to warning or error): - from STRING to DATE. The string value must be formatted as yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory, the time component is optional. If the time component is present, it will be truncated silently. - from DATE to STRING. The resulting string value is formatted as yyyy-MM-dd. - from TIMESTAMP to DATE. The source timestamp's time of day component is ignored. - from DATE to TIMESTAMP. The target timestamp's time of day component is set to 00:00:00. - Implicit casting between DATE and other types: - from STRING to DATE if the source string value is used in a context where a DATE value is expected. - from DATE to TIMESTAMP if the source date value is used in a context where a TIMESTAMP value is expected. - Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP implicit conversions are now all possible, the existing function overload resolution logic is not adequate anymore. For example, it resolves the if(false, '2011-01-01', DATE '1499-02-02') function call to the if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded function, instead of the if(BOOLEAN, DATE, DATE) version. This is clearly wrong, so the function overload resolution logic had to be changed to resolve function calls to the best-fit overloaded function definition if there are multiple applicable candidates. An overloaded function definition is an applicable candidate for a function call if each actual parameter in the function call either matches the corresponding formal parameter's type (without casting) or is implicitly castable to that type. When looking for the best-fit applicable candidate, a parameter match score (i.e. the number of actual parameters in the function call that match their corresponding formal parameter's type without casting) is calculated and the applicable candidate with the highest parameter match score is chosen. There's one more issue that the new resolution logic has to address: if two applicable candidates have the same parameter match score and the only difference between the two is that the first one requires a STRING -> TIMESTAMP implicit cast for some of its parameters while the second one requires a STRING -> DATE implicit cast for the same parameters then the first candidate has to be chosen not to break backward compatibility. E.g: year('2019-02-15') function call must resolve to year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is not implemented yet, so this is not an issue at the moment but it will be in the future. When the resolution algorithm considers overloaded function definitions, first it orders them lexicographically by the types in their parameter lists. To ensure the backward compatible behavior Primitivetype.DATE enum value has to come after PrimitiveType.TIMESTAMP. - Codegen infrastructure changes for expression evaluation. - 'IS [NOT] NULL' and '[NOT] IN' predicates. - Common comparison operators (including the 'BETWEEN' operator). - Infrastructure changes for built-in functions. - Some built-in functions: conditional, aggregate, analytical and math functions. - C++ UDF/UDA support. - Support partitioning and grouping by DATE. - Beeswax, HiveServer2 support. These items are tightly coupled and it makes sense to implement them in one change-set. Testing: - A new partitioned TEXT table 'functional.date_tbl' (and the corresponding HBASE table 'functional_hbase.date_tbl') was introduced for DATE-related tests. - BE and FE tests were extended to cover DATE type. - E2E tests: - since DATE type is supported for TEXT and HBASE fileformats only, most DATE tests were implemented separately in tests/query_test/test_date_queries.py. Note, that this change-set is not a complete DATE type implementation, but it lays the foundation for future work: - Add date support to the random query generator. - Implement a complete set of built-in functions. - Add Parquet support. - Add Kudu support. - Optionally support Avro and ORC. For further details, see IMPALA-6169. Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f Reviewed-on: http://gerrit.cloudera.org:8080/12481 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-04-23 13:33:57 +00:00

1 2 3 4

175 Commits