impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
Eyizoha	3af1930229	IMPALA-12322: Support converting UTC timestamps read from Kudu to local time This patch adds a query option 'convert_kudu_utc_timestamps' similar to 'convert_legacy_hive_parquet_utc_timestamps'. When enabled, it converts UTC timestamps read from Kudu to local timestamps. The corresponding modification also include predicate pushdown and runtime filter. Due to the ambiguity of timestamps caused by daylight saving time changes, it is difficult to resolve in the bloom filter. This patch additionally introduces a query option 'disable_kudu_local_timestamp_bloom_filter' to default disable the Kudu timestamp bloom filter after enabling time zone conversion in order to avoid erroneously filtering out data. However, for regions that do not observe daylight saving time, it can be set to false to re-enable the Kudu local timestamp bloom filter. Testing: - Add TestKuduTimestampConvert in query_test/test_kudu.py Perform end-to-end testing in a custom cluster, including basic Kudu UTC timestamp conversion testing, as well as checking if related predicate pushdown and runtime filters are working correctly (even with timestamps involving daylight saving time conversions). Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3 Reviewed-on: http://gerrit.cloudera.org:8080/20681 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>	2023-12-14 13:19:35 +00:00
Csaba Ringhofer	7ca11dfc7f	IMPALA-9482: Support for BINARY columns This patch adds support for BINARY columns for all table formats with the exception of Kudu. In Hive the main difference between STRING and BINARY is that STRING is assumed to be UTF8 encoded, while BINARY can be any byte array. Some other differences in Hive: - BINARY can be only cast from/to STRING - Only a small subset of built-in STRING functions support BINARY. - In several file formats (e.g. text) BINARY is base64 encoded. - No NDV is calculated during COMPUTE STATISTICS. As Impala doesn't treat STRINGs as UTF8, BINARY and STRING become nearly identical, especially from the backend's perspective. For this reason, BINARY is implemented a bit differently compared to other types: while the frontend treats STRING and BINARY as two separate types, most of the backend uses PrimitiveType::TYPE_STRING for BINARY too, e.g. in SlotDesc. Only the following parts of backend need to differentiate between STRING and BINARY: - table scanners - table writers - HS2/Beeswax service These parts have access to column metadata, which allows to add special handling for BINARY. Only a very few builtins are allowed for BINARY at the moment: - length - min/max/count - coalesce and similar "selector" functions Other STRING functions can be only used by casting to STRING first. Adding support for more of these functions is very easy, as simply the BINARY type has to be "connected" to the already existing STRING function's signature. Functions where the result depends on utf8_mode need to ensure that with BINARY it always works as if utf8_mode=0 (for example length() is mapped to bytes() as length count utf8 chars if utf8_mode=1). All kinds of UDFs (native, Hive legacy, Hive generic) support BINARY, though in case of legacy Hive UDFs it is only supported if the argument and return types are set explicitely to ensure backward compatibility. See IMPALA-11340 for details. The original plan was to behave as close to Hive as possible, but I realized that Hive has more relaxed casting rules than Impala, which led to STRING<->BINARY casts being necessary in more cases in Impala. This was needed to disallow passing a BINARY to functions that expect a STRING argument. An example for the difference is that in INSERT ... VALUES () string literals need to be explicitly cast to BINARY, while this is not needed in Hive. Testing: - Added functional.binary_tbl for all file formats (except Kudu) to test scanning. - Removed functional.unsupported_types and related tests, as now Impala supports all (non-complex) types that Hive does. - Added FE/EE tests mainly based on the ones added to the DATE type Change-Id: I36861a9ca6c2047b0d76862507c86f7f153bc582 Reviewed-on: http://gerrit.cloudera.org:8080/16066 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-08-19 13:55:42 +00:00
Csaba Ringhofer	f24296aed5	IMPALA-11355: Add STRING overloads for hour/minute/second/millisecond IMPALA-9531 dropped support for "dateless timestamps", e.g. cast("12:05:05" as timestamp) now returns NULL. This led to breaking functions like minute("12:05:05"), as minute() expects a timestamp, and Impala adds an implicit cast, so what actually happens is minute(cast("12:05:05" as timestamp)), which returns NULL. This change adds overloads for similar functions that take STRING instead of TIMESTAMP parameter. The same functions already take a STRING parameter in Hive and mySQL. The changes in the parser mainly restore code removed in IMPALA-9531. Note that these functions could be potentially optimized by returning parts of the parse result without converting them to boost time first, but this is not done here to make the change minimal. Testing: - restored related tests in expr-test and added some new ones for malformed time-of-day strings - added benchmarks for the new overloads and fixed the ones for the old functions (they tested NULL) Change-Id: I6cc1c851ee71ab4fcc58105c7e9931155a483679 Reviewed-on: http://gerrit.cloudera.org:8080/18718 Reviewed-by: Riza Suminto <riza.suminto@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-07-14 21:03:43 +00:00
Michael Smith	2243f331cb	IMPALA-11274: CNF Rewrite causes a regress in join node performance This patch defines a subset of all predicates that are common and relatively inexpensive to compute. Such predicates must involve columns, constants, simple math or cast functions only. Examples of the subset of the predicates allowed: 1. (a = 1 AND cast(b as int) = 2) OR (c = d AND e = f) 2. a in ('1', '2', '3') OR ((b = 'abc') AND (c = d)) 3. (a between 1 and 100) OR ((b is null) AND (c = d)) Examples of the predicates not allowed: 1. (upper(a) != 'Y') AND b = 2) OR (c = d AND e = f) 2. (coalesce(CAST(a AS string), '') = '') AND b = 2) OR (c = d AND e = f) This patch further restricts the predicates to be converted to conjunctive normal form (CNF) to be such a subset, with the aim to reduce the run-time evaluation overhead of CNFs in which some of the predicates can be duplicated. Uses a cache in branching expressions to avoid visiting the entire subtree on each call to applyRuleBottomUp. Skips cache complexity on casts as they don't branch and are unlikely to be deeply nested. Testing: - New expression writer tests - New planner tests Change-Id: I326406c6b004fe31ec0e2a2f390a3845b8925aa9 Reviewed-on: http://gerrit.cloudera.org:8080/18458 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-05-25 05:37:17 +00:00
stiga-huang	35375b3287	IMPALA-2019(part-4): Add UTF-8 support for case conversion functions There are 3 builtin case conversion string functions: upper(), lower(), and initcap(). Previously they only convert English alphabetic characters. This patch adds support to deal with Unicode characters. There are many corner cases in case conversion depending on the locale and context. E.g. 1) Case conversion is locale-sensitive. Turkish has 4 letter "I"s. English has only two, a lowercase dotted i and an uppercase dotless I. Turkish has lowercase and uppercase forms of both dotted and dotless I. So simply converting "i" to "I" for upper case is wrong in Turkish: +-------+--------+---------+ \| \| Dotted \| Dotless \| +-------+--------+---------+ \| Upper \| İ \| I \| +-------+--------+---------+ \| Lower \| i \| ı \| +-------+--------+---------+ 2) Case conversion may change a string's length. The German word "grüßen" should be converted to "GRÜSSEN" in upper case: the letter "ß" should be converted to "SS". 3) Case conversion is context-sensitive. The Greek word "ὈΔΥΣΣΕΎΣ" should be converted to "ὀδυσσεύς", where the Greek letter "Σ" is converted to "σ" or to "ς", depending on its position in the word. The above cases will be focus in follow-up JIRAs. This patch addes the initial implementation of UTF-8 aware case conversion functions. -------- Implementation: In UTF-8 mode (turned on by set UTF8_MODE=true) of these functions, the bytes in strings are converted to wide characters using std::mbrtowc(). Each wide character (wchar_t) will then be converted using std::towupper or std::towlower correspondingly. We then convert them back to multi bytes using std::wcrtomb(). Note that these builtins are locale aware. If impalad is launched without a UTF-8 aware locale, e.g. LC_ALL="C", these builtins can't recognize non-ascii characters, which will return unexpected results. Thus we modify our docker images to set LC_ALL="C.UTF-8" instead of "C". This patch also logs the current locale when launching impala daemons for better debugging. We will support customized locale in IMPALA-11080. Test: - Add BE unit tests and e2e tests. Change-Id: I443e89d46f4638ce85664b021666bc4f03ee8abd Reviewed-on: http://gerrit.cloudera.org:8080/17785 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-15 18:40:59 +00:00
pranav.lodha	bde995483a	IMPALA-955: BYTES built-in function The Bytes function returns the number of bytes contained in the specified byte string. There are changes in 4 files. A few testcases are also added in be/src/exprs/expr-test.cc and an end-to end test in testdata/workloads/functional-query/queries/QueryTest/exprs.test. Change-Id: I0bd06c3d6dba354d71f63c649eaa8f9f74d266ee Reviewed-on: http://gerrit.cloudera.org:8080/18210 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2022-02-11 07:01:58 +00:00
stiga-huang	3850d49711	IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions Mask functions are used in Ranger column masking policies to mask sensitive data. There are 5 mask functions: mask(), mask_first_n(), mask_last_n(), mask_show_first_n(), mask_show_last_n(). Take mask() as an example, by default, it will mask uppercase to 'X', lowercase to 'x', digits to 'n' and leave other characters unmasked. For masking all characters to '', we can use mask(my_col, '', '', '', ''); The current implementations mask strings byte-to-byte, which have inconsistent results with Hive when the string contains unicode characters: mask('中国', '', '', '', '') => '***' Each Chinese character is encoded into 3 bytes in UTF-8 so we get the above result. The result in Hive is '' since there are two Chinese characters. This patch provides consistent masking behavior with Hive for strings under the UTF-8 mode, i.e., set UTF8_MODE=true. In UTF-8 mode, the masked unit of a string is a unicode code point. Implementation - Extends the existing MaskTransform function to deal with unicode code points(represented by uint32_t). - Extends the existing GetFirstChar function to get the code point of given masked charactors in UTF-8 mode. - Implement a MaskSubStrUtf8 method as the core functionality. - Swith to use MaskSubStrUtf8 instead of MaskSubStr in UTF-8 mode. - For better testing, this patch also adds an overload for all mask functions for only masking other chars but keeping the upper/lower/digit chars unmasked. E.g. mask({col}, -1, -1, -1, 'X'). Tests - Add BE tests in expr-test - Add e2e tests in utf8-string-functions.test Change-Id: I1276eccc94c9528507349b155a51e76f338367d5 Reviewed-on: http://gerrit.cloudera.org:8080/17780 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-09-15 05:04:07 +00:00
Amogh Margoor	8e7a6227a4	IMPALA-10730: Add MD5 fuction to compute 128-bit checksum for a string. Built-in function has been added to compute MD5 128-bit checksum for a non-null string. If input string is null, then output of the function is null too. In FIPS mode, MD5 is disabled and function will throw error on invocation. Testing: 1. Added expression unit tests. 2. Added end-to-end tests for MD5. Change-Id: Id406d30a7cc6573212b302fbfec43eb848352ff2 Reviewed-on: http://gerrit.cloudera.org:8080/17567 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-06-10 22:24:16 +00:00
Fucun Chu	ce21fe74b8	IMPALA-10689: Implement ds_cpc_union_f() function. This function receives two strings that are serialized Apache DataSketches CPC sketches. Union two sketches and returns the resulting sketch of union. Example: select ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) from sketch_tbl; +---------------------------------------------------+ \| ds_cpc_estimate(ds_cpc_union_f(sketch1, sketch2)) \| +---------------------------------------------------+ \| 15 \| +---------------------------------------------------+ Change-Id: Ib5c616316bf2bf2ff437678e9a44a15339920150 Reviewed-on: http://gerrit.cloudera.org:8080/17440 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-06-07 12:53:23 +00:00
Fucun Chu	67de4a48b0	IMPALA-10688: Implement ds_cpc_stringify() function This function receives a string that is a serialized Apache DataSketches CPC sketch and returns its stringified format. A stringified format should look like and contains the following data: select ds_cpc_stringify(ds_cpc_sketch(float_col)) from functional_parquet.alltypestiny; +--------------------------------------------+ \| ds_cpc_stringify(ds_cpc_sketch(float_col)) \| +--------------------------------------------+ \| ### CPC sketch summary: \| \| lg_k : 11 \| \| seed hash : 93cc \| \| C : 2 \| \| flavor : 1 \| \| merged : true \| \| intresting col : 0 \| \| table entries : 2 \| \| window : not allocated \| \| ### End sketch summary \| \| \| +--------------------------------------------+ Change-Id: I8c9d089bfada6bebd078d8f388d2e146c79e5285 Reviewed-on: http://gerrit.cloudera.org:8080/17373 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>	2021-05-26 06:42:02 +00:00
Amogh Margoor	86beb2f9dd	IMPALA-10679: Add builtin functions to comptute SHA-1 and SHA-2 digest. Built-in functions to compute SHA-1 digest and SHA-2 family of digest has been added. Support for SHA2 digest includes SHA224, SHA256, SHA384 and SHA512. In FIPS mode SHA1, SHA224 and SHA256 have been disabled and will throw error. SHA2 functions will also throw error for unsupported bit length i.e., bit length apart from 224, 256, 384, 512. Testing: 1. Added Unit test for expressions. 2. Added end-to-end test for new functions. Change-Id: If163b7abda17cca3074c86519d59bcfc6ace21be Reviewed-on: http://gerrit.cloudera.org:8080/17464 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-05-18 18:48:10 +00:00
Fucun Chu	e39c30b3cd	IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions These functions can be used to get cardinality estimates of data using CPC algorithm from Apache DataSketches. ds_cpc_sketch() receives a dataset, e.g. a column from a table, and returns a serialized CPC sketch in string format. This can be written to a table or be fed directly to ds_cpc_estimate() that returns the cardinality estimate for that sketch. Similar to the HLL sketch, the primary use-case for the CPC sketch is for counting distinct values as a stream, and then merging multiple sketches together for a total distinct count. For more details about Apache DataSketches' CPC see: http://datasketches.apache.org/docs/CPC/CPC.html Figures-of-Merit Comparison of the HLL and CPC Sketches see: https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on tpch_parquet.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3x faster. CPC gives closer estimate than current ndv(). CPC is more accurate than HLL in some cases Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a Reviewed-on: http://gerrit.cloudera.org:8080/16656 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-05-11 18:07:40 +00:00
Fucun Chu	77d6acd032	IMPALA-10581: Implement ds_theta_intersect_f() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Computes the intersection of two sketches of same or different column and returns the resulting sketch of intersection. Example: select ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) from sketch_tbl; +-----------------------------------------------------------+ \| ds_theta_estimate(ds_theta_intersect_f(sketch1, sketch2)) \| +-----------------------------------------------------------+ \| 5 \| +-----------------------------------------------------------+ Change-Id: I335eada00730036d5433775cfe673e0e4babaa01 Reviewed-on: http://gerrit.cloudera.org:8080/17186 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-29 15:59:49 +00:00
Fucun Chu	622e3c95ad	IMPALA-10580: Implement ds_theta_union_f() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Union two sketches and returns the resulting sketch of union. Example: select ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) from sketch_tbl; +-------------------------------------------------------+ \| ds_theta_estimate(ds_theta_union_f(sketch1, sketch2)) \| +-------------------------------------------------------+ \| 15 \| +-------------------------------------------------------+ Change-Id: I8329979b81ceeaad739a43fab79768ca9c2916fa Reviewed-on: http://gerrit.cloudera.org:8080/17179 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-24 15:16:07 +00:00
Fucun Chu	3e82501531	IMPALA-10558: Implement ds_theta_exclude() function This function receives two strings that are serialized Apache DataSketches Theta sketches. Computes the a-not-b set operation given two sketches of same or different column. Example: select ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) from sketch_tbl; +-------------------------------------------------------+ \| ds_theta_estimate(ds_theta_exclude(sketch1, sketch2)) \| +-------------------------------------------------------+ \| 5 \| +-------------------------------------------------------+ Change-Id: I05119fd8c652c07ff248a99e44b0da3541e46ca3 Reviewed-on: http://gerrit.cloudera.org:8080/17153 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-03-17 22:14:44 +00:00
Fucun Chu	65c6a81ed9	IMPALA-10463: Implement ds_theta_sketch() and ds_theta_estimate() functions These functions can be used to get cardinality estimates of data using Theta algorithm from Apache DataSketches. ds_theta_sketch() receives a dataset, e.g. a column from a table, and returns a serialized Theta sketch in string format. This can be written to a table or be fed directly to ds_theta_estimate() that returns the cardinality estimate for that sketch. Similar to the HLL sketch, the primary use-case for the Theta sketch is for counting distinct values as a stream, and then merging multiple sketches together for a total distinct count. For more details about Apache DataSketches' Theta see: https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on tpch25_parquet.lineitem to compare perfomance with ds_hll_. ds_theta_ is faster than ds_hll_* on the original data, the difference is around 1%-10%. ds_hll_estimate() is faster than ds_theta_estimate() on existing sketch. HLL and Theta gives closer estimate except for string. see IMPALA-10464. Change-Id: I14f24c16b815eec75cf90bb92c8b8b0363dcbfbc Reviewed-on: http://gerrit.cloudera.org:8080/17008 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-02-17 17:09:48 +00:00
stiga-huang	e8720b40f1	IMPALA-2019(Part-1): Provide UTF-8 support in length, substring and reverse functions A unicode character can be encoded into 1-4 bytes in UTF-8. String functions will return undesired results when the input contains unicode characters, because we deal with a string as a byte array. For instance, length() returns the length in bytes, not in unicode characters. UTF-8 is the dominant unicode encoding used in the Hadoop ecosystem. This patch adds UTF-8 support in some string functions so they can have UTF-8 aware behavior. For compatibility with the old versions, a new query option, UTF8_MODE, is added for turning on/off the UTF-8 aware behavior. Currently, only length(), substring() and reverse() support it. Other function supports will be added in later patches. String functions will check the query option and switch to use the desired implementation. It's similar to how we use the decimal_v2 query option in builtin functions. For easy testing, the UTF-8 aware version of string functions are also exposed as builtin functions (named by utf8_*, e.g. utf8_length). Tests: - Add BE tests for utf8 functions. - Add e2e tests for the UTF8_MODE query option. Change-Id: I0aaf3544e89f8a3d531ad6afe056b3658b525b7c Reviewed-on: http://gerrit.cloudera.org:8080/16908 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-01-26 00:43:39 +00:00
stiga-huang	9bb7157bf0	IMPALA-10387: Add missing overloads of mask functions used in Ranger default masking policies The mask functions in Hive are implemented through GenericUDFs which can accept an infinite number of function signatures. Impala currently don't support GenericUDFs. So we provide builtin mask functions with limited overloads. This patch adds some missing overloads that could be used by Ranger default masking policies, e.g. MASK_HASH, MASK_SHOW_LAST_4, MASK_DATE_SHOW_YEAR, etc. Tests: - Add test coverage on all default masking policies applied on all supported types. Change-Id: Icf3e70fd7aa9f3b6d6b508b776696e61ec1fcc2e Reviewed-on: http://gerrit.cloudera.org:8080/16930 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2021-01-15 13:01:53 +00:00
Gabor Kaszab	6cb7cecacf	IMPALA-10237: Support Bucket and Truncate partition transforms as built-in functions This patch implements Truncate and Bucket partition transforms in Impala BE as built-in functions. The expectation is that these functions give the same result as Iceberg's implementation of the same functions. These built-in functions are invisible so users won't be able to invoke them e.g. from impala-shell. Truncate: - Supported types are IntVal, BigIntVal, StringVal, DecimalVal. - Receives an input from the above types and a width. - Returns the same type as the input. - Expected behaviour is explained here: https://iceberg.apache.org/spec/#truncate-transform-details Bucket: - Supported types are IntVal, BigIntVal, StringVal, DecimalVal, DateVal, TimestampVal. - Receives an input from the above types and the number of buckets as IntVal. - Returns IntVal. - Expected behaviour is explained here: https://iceberg.apache.org/spec/#bucket-transform-details Change-Id: I485680cf79d96d578dd8cfbfd554bec468fe84bd Reviewed-on: http://gerrit.cloudera.org:8080/16741 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-12-19 03:09:06 +00:00
Fucun Chu	8ea49e9b02	IMPALA-10134: Implement ds_hll_union_f() function. This function receives two strings that are serialized Apache DataSketches HLL sketches. Union two sketches and returns the resulting sketch of union. Example: select ds_hll_estimate(ds_hll_union_f(i_i, h_i)) from hll_sketches_impala_hive2; +-------------------------------------------+ \| ds_hll_estimate(ds_hll_union_f(i_i, h_i)) \| +-------------------------------------------+ \| 7 \| +-------------------------------------------+ Change-Id: Ic06e959ed956af5cedbfc7d4d063141d5babb2a8 Reviewed-on: http://gerrit.cloudera.org:8080/16711 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-11-25 14:33:10 +00:00
Fucun Chu	193c2e773f	IMPALA-10132: Implement ds_hll_estimate_bounds_as_string() function. This function receives a string that is a serialized Apache DataSketches HLL sketch and optional kappa that is a number of standard deviations from the mean: 1, 2 or 3 (default 2). Returns estimate and bounds with the values separated with commas. The result is three values: estimate, lower bound and upper bound. ds_hll_estimate_bounds_as_string(sketch [, kappa]) Kappa: 1 represent the 68.3% confidence bounds 2 represent the 95.4% confidence bounds 3 represent the 99.7% confidence bounds Note, ds_hll_estimate_bounds() should return an Array of doubles as the result but with that we have to wait for the complex type support. Until, we provide ds_hll_estimate_bounds_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Example: select ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) from functional_parquet.alltypestiny; +----------------------------------------------------------+ \| ds_hll_estimate_bounds_as_string(ds_hll_sketch(int_col)) \| +----------------------------------------------------------+ \| 2,2,2.0002 \| +----------------------------------------------------------+ Change-Id: I46bf8263e8fd3877a087b9cb6f0d1a2392bb9153 Reviewed-on: http://gerrit.cloudera.org:8080/16626 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-10-29 17:45:01 +00:00
Adam Tamas	99e5f5a885	IMPALA-10133:Implement ds_hll_stringify function. This function receives a string that is a serialized Apache DataSketches HLL sketch and returns its stringified format. A stringified format should look like and contains the following data: select ds_hll_stringify(ds_hll_sketch(float_col)) from functional_parquet.alltypestiny; +--------------------------------------------+ \| ds_hll_stringify(ds_hll_sketch(float_col)) \| +--------------------------------------------+ \| ### HLL sketch summary: \| \| Log Config K : 12 \| \| Hll Target : HLL_4 \| \| Current Mode : LIST \| \| LB : 2 \| \| Estimate : 2 \| \| UB : 2.0001 \| \| OutOfOrder flag: false \| \| Coupon count : 2 \| \| ### End HLL sketch summary \| \| \| +--------------------------------------------+ Change-Id: I85dbf20b5114dd75c300eef0accabe90eac240a0 Reviewed-on: http://gerrit.cloudera.org:8080/16382 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-03 12:11:48 +00:00
Adam Tamas	4cb3c3556e	IMPALA-10108: Implement ds_kll_stringify function This function receives a string that is a serialized Apache DataSketches KLL sketch and returns its stringified format. A stringified format should look like and contains the following data: select ds_kll_stringify(ds_kll_sketch(float_col)) from functional_parquet.alltypestiny; +--------------------------------------------+ \| ds_kll_stringify(ds_kll_sketch(float_col)) \| +--------------------------------------------+ \| ### KLL sketch summary: \| \| K : 200 \| \| min K : 200 \| \| M : 8 \| \| N : 8 \| \| Epsilon : 1.33% \| \| Epsilon PMF : 1.65% \| \| Empty : false \| \| Estimation mode: false \| \| Levels : 1 \| \| Sorted : false \| \| Capacity items : 200 \| \| Retained items : 8 \| \| Storage bytes : 64 \| \| Min value : 0 \| \| Max value : 1.1 \| \| ### End sketch summary \| \| \| +--------------------------------------------+ Change-Id: I97f654a4838bf91e3e0bed6a00d78b2c7aa96f75 Reviewed-on: http://gerrit.cloudera.org:8080/16370 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-09-02 10:49:10 +00:00
Gabor Kaszab	28d94851b1	IMPALA-10020: Implement ds_kll_cdf_as_string() function This is the support for Cumulative Distribution Function (CDF) from Apache DataSketches KLL algorithm collection. It receives a serialized KLL sketch and one or more float values to represent ranges in the sketched values. E.g. [1, 5, 10] will mean the following ranges: (-inf, 1), (-inf, 5), (-inf, 10), (-inf, +inf) Returns a comma separated string where each value in the string is a number in the range of [0,1] and shows that what percentage of the data is in the particular ranges. Note, ds_kll_cdf() should return an Array of doubles as the result but with that we have to wait for the complex type support. Until, we provide ds_kll_cdf_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Example: select ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) from alltypes; +----------------------------------------------------------+ \| ds_kll_cdf_as_string(ds_kll_sketch(float_col), 2, 4, 10) \| +----------------------------------------------------------+ \| 0.2,0.401644,1,1 \| +----------------------------------------------------------+ Change-Id: I77e6afc4556ad05a295b89f6d06c2e4a6bb2cf82 Reviewed-on: http://gerrit.cloudera.org:8080/16359 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-26 10:59:49 +00:00
Gabor Kaszab	a8a35edbc4	IMPALA-10019: Implement ds_kll_pmf_as_string() function This is the support for Probabilistic Mass Function (PMF) from Apache DataSketches KLL algorithm collection. It receives a serialized KLL sketch and one or more float values to represent ranges in the sketched values. E.g. [1, 5, 10] will mean the following ranges: (-inf, 1), [1, 5), [5, 10), [10, +inf) Returns a comma separated string where each value in the string is a number in the range of [0,1] and shows that what percentage of the data is in the particular ranges. Note, ds_kll_pmf() should return an Array of doubles as the result but with that we have to wait for the complex type support. Until, we provide ds_kll_pmf_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Example: select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) from alltypes; +----------------------------------------------------------+ \| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) \| +----------------------------------------------------------+ \| 0.202192,0.199452,0.598356,0 \| +----------------------------------------------------------+ Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba Reviewed-on: http://gerrit.cloudera.org:8080/16336 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-26 01:50:16 +00:00
Gabor Kaszab	41065845e9	IMPALA-9962: Implement ds_kll_quantiles_as_string() function This function is very similar to ds_kll_quantile() but this one can receive any number of rank parameters and returns a comma separated string that holds the results for all of the given ranks. For more details about ds_kll_quantile() see IMPALA-9959. Note, ds_kll_quantiles() should return an Array of floats as the result but with that we have to wait for the complex type support. Until, we provide ds_kll_quantiles_as_string() that can be deprecated once we have array support. Tracking Jira for returning complex types from functions is IMPALA-9520. Change-Id: I76f6039977f4e14ded89a3ee4bc4e6ff855f5e7f Reviewed-on: http://gerrit.cloudera.org:8080/16324 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-25 18:06:22 +00:00
Gabor Kaszab	0736fcf691	IMPALA-10018: Implement ds_kll_rank() function ds_kll_rank() receives two parameters: a STRING that represents a serialized DataSketches KLL sketch and a float to provide a probing value in the sketch. Returns a DOUBLE that is the rank of the given probing value in the range of [0,1]. E.g. a return value of 0.2 means that the probing value given as parameter is greater than the 20% of all the values in the sketch. Note, this is an approximate calculation. Change-Id: I95857886dfbb8c84aeeaf718c0e610012fda4be0 Reviewed-on: http://gerrit.cloudera.org:8080/16283 Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-07 12:58:40 +00:00
Gabor Kaszab	87aeb2ad78	IMPALA-9963: Implement ds_kll_n() function This function receives a serialized Apache DataSketches KLL sketch and returns how many input values were fed into this sketch. Change-Id: I166e87a468e68e888ac15fca7429ac2552dbb781 Reviewed-on: http://gerrit.cloudera.org:8080/16259 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-08-06 19:15:04 +00:00
Gabor Kaszab	033a4607e2	IMPALA-9959: Implement ds_kll_sketch() and ds_kll_quantile() functions ds_kll_sketch() is an aggregate function that receives a float parameter (e.g. a float column of a table) and returns a serialized Apache DataSketches KLL sketch of the input data set wrapped into STRING type. This sketch can be saved into a table or view and later used for quantile approximations. ds_kll_quantile() receives two parameters: a STRING parameter that contains a serialized KLL sketch and a DOUBLE that represents the rank of the quantile in the range of [0,1]. E.g. rank=0.1 means the approximate value in the sketch where 10% of the sketched items are less than or equals to this value. Testing: - Added automated tests on small data sets to check the basic functionality of sketching and getting a quantile approximate. - Tested on TPCH25_parquet.lineitem to check that sketching and approximating works on bigger scale as well where serialize/merge phases are also required. On this scale the error range of the quantile approximation is within 1-1.5% Change-Id: I11de5fe10bb5d0dd42fb4ee45c4f21cb31963e52 Reviewed-on: http://gerrit.cloudera.org:8080/16235 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-07-31 14:34:49 +00:00
Adam Tamas	1bafb7bd29	IMPALA-9531: Dropped support for dateless timestamps Removed the support for dateless timestamps. During dateless timestamp casts if the format doesn't contain date part we get an error during tokenization of the format. If the input str doesn't contain a date part then we get null result. Examples: select cast('01:02:59' as timestamp); This will come back as NULL value. select to_timestamp('01:01:01', 'HH:mm:ss'); select cast('01:02:59' as timestamp format 'HH12:MI:SS'); select cast('12 AM' as timestamp FORMAT 'AM.HH12'); These will come back with a parsing errors. Casting from a table will generate similar results. Testing: Modified the previous tests related to dateless timestamps. Added test to read fromtables which are still containing dateless timestamps and covered timestamp to string path when no date tokens are requested in the output string. Change-Id: I48c49bf027cc4b917849b3d58518facba372b322 Reviewed-on: http://gerrit.cloudera.org:8080/15866 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>	2020-07-08 19:32:15 +00:00
Gabor Kaszab	7e456dfa9d	IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate() These functions can be used to get cardinality estimates of data using HLL algorithm from Apache DataSketches. ds_hll_sketch() receives a dataset, e.g. a column from a table, and returns a serialized HLL sketch in string format. This can be written to a table or be fed directly to ds_hll_estimate() that returns the cardinality estimate for that sketch. Comparing to ndv() these functions bring more flexibility as once we fed data to the sketch it can be written to a table and next time we can save scanning through the dataset and simply return the estimate using the sketch. This doesn't come for free, however, as perfomance measurements show that ndv() is 2x-3.5x faster than sketching. On the other hand if we query the estimate from an existing sketch then the runtime is negligible. Another flexibility with these sketches is that they can be merged together so e.g. if we had saved a sketch for each of the partitions of a table then they can be combined with each other based on the query without touching the actual data. DataSketches HLL is sensitive for the order of the data fed to the sketch and as a result running these algorithms in Impala gets non-deterministic results within the error bounds of the algorithm. In terms of correctness DataSketches HLL is most of the time in 2% range from the correct result but there are occasional spikes where the difference is bigger but never goes out of the range of 5%. Even though the DataSketches HLL algorithm could be parameterized currently this implementation hard-codes these parameters and use HLL_4 and lg_k=12. For more details about Apache DataSketches' HLL implementation see: https://datasketches.apache.org/docs/HLL/HLL.html Testing: - Added some tests running estimates for small datasets where the amount of data is small enough to get the correct results. - Ran manual tests on TPCH25.lineitem to compare perfomance with ndv(). Depending on data characteristics ndv() appears 2x-3.5x faster. The lower the cardinality of the dataset the bigger the difference between the 2 algorithms is. - Ran manual tests on TPCH25.lineitem and functional_parquet.alltypes to compare correctness with ndv(). See results above. Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe Reviewed-on: http://gerrit.cloudera.org:8080/16000 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>	2020-07-07 14:11:21 +00:00
stiga-huang	0936384271	IMPALA-9010: Add builtin mask functions There're 6 builtin GenericUDFs for column masking in Hive: mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_first_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_hash(value) mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar, dayValue, monthValue, yearValue) Description of the parameters: value - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING, VARCHAR, CHAR, DATE(only for mask()). charCount - number of characters. Default value: 4 upperChar - character to replace upper-case characters with. Specify -1 to retain original character. Default value: 'X' lowerChar - character to replace lower-case characters with. Specify -1 to retain original character. Default value: 'x' digitChar - character to replace digit characters with. Specify -1 to retain original character. Default value: 'n' otherChar - character to replace all other characters with. Specify -1 to retain original character. Default value: -1 numberChar - character to replace digits in a number with. Valid values: 0-9. Default value: '1' dayValue - value to replace day field in a date with. Specify -1 to retain original value. Valid values: 1-31. Default value: 1 monthValue - value to replace month field in a date with. Specify -1 to retain original value. Valid values: 0-11. Default value: 0 yearValue - value to replace year field in a date with. Specify -1 to retain original value. Default value: 1 In Hive, these functions accept variable length of arguments in non-restricted types: mask_show_first_n(val) mask_show_first_n(val, 8) mask_show_first_n(val, 8, 'X', 'x', 'n') mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2) mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9') The arguments of upperChar, lowerChar, digitChar, otherChar and numberChar can be in string or numeric types. Impala doesn't support Hive GenericUDFs, so we are lack of these mask functions to support Ranger column masking policies. On the other hand, we want the masking functions to be evaluated in the C++ builtin logic rather than calling out to java UDFs for performance. This patch introduces our builtin implementation of them. We currently don't have a corresponding framework for GenericUDF (IMPALA-9271), so we implement these by overloads. However, it may requires hundreds of overloads to cover all possible combinations. We just implement some important overloads, including - those used by Ranger default masking policies, - those with simple arguments and may be useful for users, - an overload with all arguments in int type for full functionality. Char argument need to be converted to their ASCII value. Tests: - Add BE tests in expr-test Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664 Reviewed-on: http://gerrit.cloudera.org:8080/14963 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-01-17 15:34:34 +00:00
norbert.luksa	a862282811	IMPALA-8709: Add Damerau-Levenshtein edit distance built-in function This patch adds new built-in functions to calculate restricted Damerau-Levenshtein edit distance (optimal string alignment). Implmented as dle_dst() and damerau_levenshtein(). If either value is NULL or both values are NULL returns NULL which differs from Netezza's dle_dst() which returns the length of the not NULL value or 0 if both values are NULL. The NULL behavior matches the existing levenshtein() function. Also cleans up levenshtein tests. Testing: - Added unit tests to expr-test.cc - Manual testing on over 1400 string pairs from http://marvin.cs.uidaho.edu/misspell.html and results match Netezza Change-Id: Ib759817ec15e7075bf49d51e494e45c8af4db94d Reviewed-on: http://gerrit.cloudera.org:8080/13794 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-11-22 21:39:21 +00:00
luksan47	8db7f27ddd	IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function The added functions return the Jaro/Jaro-Winkler similarity/distance of two strings. The algorithm calcuates the Jaro-Similarity of the strings, then adds more weight to the result if there are common prefixes. (Jaro-Winkler) For more detail, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance Extended the algorithm with another optional parameter: boost threshold The prefix weight will only be applied if the Jaro-similarity exceeds the given threshold. By default, its value is 0.7. The new built-in functions are: * jaro_distance, jaro_dst * jaro_similarity, jaro_sim * jaro_winkler_distance, jw_dst * jaro_winkler_similarity, jw_sim Testing: * Added unit tests to expr-test.cc * Manual testing over 1400 word pairs from http://marvin.cs.uidaho.edu/misspell.html Results match Apache commons Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c Reviewed-on: http://gerrit.cloudera.org:8080/13870 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-08-13 18:25:32 +00:00
Attila Jeges	f40935a30e	IMPALA-7369: part 2: Add INTERVAL expr support and built-in functions for DATE This change implements INTERVAL expression support for DATE type and adds several DATE related built-in functions. The efficiency of the DateValue::ToYearMonthDay() function used in many of the built-in functions below was also improved. The following functions are supported in Hive: INT YEAR(DATE d) Extracts year of the 'd' date, returns it as an int in 0-9999 range. INT MONTH(DATE d) Extracts month of the 'd' date and returns it as an int in 1-12 range. INT DAY(DATE d), INT DAYOFMONTH(DATE d) Extracts day-of-month of the 'd' date and returns it as an int in 1-31 range. INT QUARTER(DATE d) Extracts quarter of the 'd' date and returns it as an int in 1-4 range. INT DAYOFWEEK(DATE d) Extracts day-of-week of the 'd' date and returns it as an int in 1-7 range. 1 is Sunday and 7 is Saturday. INT DAYOFYEAR(DATE d) Extracts day-of-year of the 'd' date and returns it as an int in 1-366 range. INT WEEKOFYEAR(DATE d) Extracts week-of-year of the 'd' date and returns it as an int in 1-53 range. STRING DAYNAME(DATE d) Returns the day field from a 'd' date, converted to the string corresponding to that day name. The range of return values is "Sunday" to "Saturday". STRING MONTHNAME(DATE d) Returns the month field from a 'd' date, converted to the string corresponding to that month name. The range of return values is "January" to "December". DATE NEXT_DAY(DATE d, STRING weekday) Returns the first date which is later than 'd' and named as 'weekday'. 'weekday' is 3 letters or full name of the day of the week. DATE LAST_DAY(DATE d) Returns the last day of the month which the 'd' date belongs to. INT DATEDIFF(DATE d1, DATE d2) Returns the number of days from 'd1' date to 'd2' date. DATE CURRENT_DATE() Returns the current date (in the local time zone). INT INT_MONTHS_BETWEEN(DATE d1, DATE d2) Returns the number of months between 'd1' and 'd2' dates, as an int representing only the full months that passed. If 'd1' represents an earlier date than 'd2', the result is negative. DOUBLE MONTHS_BETWEEN(DATE d1, DATE d2) Returns the number of months between 'd1' and 'd2' dates. Can include a fractional part representing extra days in addition to the full months between the dates. The fractional component is computed by dividing the difference in days by 31 (regardless of the month). If 'd1' represents an earlier date than 'd2', the result is negative. DATE ADD_YEARS(DATE d, INT/BIGINT num_years), DATE SUB_YEARS(DATE d, INT/BIGINT num_years) Adds/subtracts a specified number of years to a 'd' date value. DATE ADD_MONTHS(DATE d, INT/BIGINT num_months), DATE SUB_MONTHS(DATE d, INT/BIGINT num_months) Adds/subtracts a specified number of months to a date value. If 'd' is the last day of a month, the returned date will fall on the last day of the target month too. DATE ADD_DAYS(DATE d, INT/BIGINT num_days), DATE SUB_DAYS(DATE d, INT/BIGINT num_days) Adds/subtracts a specified number of days to a date value. DATE ADD_WEEKS(DATE d, INT/BIGINT num_weeks), DATE SUB_WEEKS(DATE d, INT/BIGINT num_weeks) Adds/subtracts a specified number of weeks to a date value. The following function doesn't exist in Hive but supported by Amazon Redshift INT DATE_CMP(DATE d1, DATE d2) Compares 'd1' and 'd2' dates. Returns: 1. NULL, if either 'd1' or 'd2' is NULL 2. -1 if d1 < d2 3. 1 if d1 > d2 4. 0 if d1 == d2 (https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_CMP.html) Change-Id: If404bffdaf055c769e79ffa8f193bac415cfdd1a Reviewed-on: http://gerrit.cloudera.org:8080/13648 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-25 23:06:25 +00:00
Attila Jeges	f0678b06e6	IMPALA-7369: part 1: Implement TRUNC, DATE_TRUNC, EXTRACT, DATE_PART functions for DATE These functions are somewhat similar in that each of them takes a DATE argument and a time unit to work with. They work identically to the corresponding TIMESTAMP functions. The only difference is that the DATE functions don't accept time-of-day units. TRUNC(DATE d, STRING unit) Truncates a DATE value to the specified time unit. The 'unit' argument is case insensitive. This argument string can be one of: SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year. Q: Quarter. MONTH, MON, MM, RM: Month. DDD, DD, J: Day. DAY, DY, D: Starting day (Monday) of the week. WW: Truncates to the most recent date, no later than 'd', which is on the same day of the week as the first day of year. W: Truncates to the most recent date, no later than 'd', which is on the same day of the week as the first day of month. The impelementation mirrors Impala's TRUNC(TIMESTAMP ts, STRING unit) function. Hive and Oracle SQL have a similar function too. Reference: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions201.htm . DATE_TRUNC(STRING unit, DATE d) Truncates a DATE value to the specified precision. The 'unit' argument is case insensitive. This argument string can be one of: DAY, WEEK, MONTH, YEAR, DECADE, CENTURY, MILLENNIUM. The implementation mirrors Impala's DATE_TRUNC(STRING unit, TIMESTAMP ts) function. Vertica has a similar function too. Reference: https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/ SQLReferenceManual/Functions/Date-Time/DATE_TRUNC.htm . EXTRACT(DATE d, STRING unit), EXTRACT(unit FROM DATE d) Returns one of the numeric date fields from a DATE value. The 'unit' string can be one of YEAR, QUARTER, MONTH, DAY. This argument value is case-insensitive. The implementation mirrors that Impala's EXTRACT(TIMESTAMP ts, STRING unit). Hive and Oracle SQL have a similar function too. Reference: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm . DATE_PART(STRING unit, DATE date) Similar to EXTRACT(), with the argument order reversed. Supports the same date units as EXTRACT(). The implementation mirrors Impala's DATE_PART(STRING unit, TIMESTAMP ts) function. Change-Id: I843358a45eb5faa2c134994600546fc1d0a797c8 Reviewed-on: http://gerrit.cloudera.org:8080/13363 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-05 14:23:23 +00:00
Attila Jeges	b5805de3e6	IMPALA-7368: Add initial support for DATE type DATE values describe a particular year/month/day in the form yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a time of day component. The range of values supported for the DATE type is 0000-01-01 to 9999-12-31. This initial DATE type support covers TEXT and HBASE fileformats only. 'DateValue' is used as the internal type to represent DATE values. The changes are as follows: - Support for DATE literal syntax. - Explicit casting between DATE and other types (note that invalid casts will fail with an error just like invalid DECIMAL_V2 casts, while failed casts to other types do no lead to warning or error): - from STRING to DATE. The string value must be formatted as yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory, the time component is optional. If the time component is present, it will be truncated silently. - from DATE to STRING. The resulting string value is formatted as yyyy-MM-dd. - from TIMESTAMP to DATE. The source timestamp's time of day component is ignored. - from DATE to TIMESTAMP. The target timestamp's time of day component is set to 00:00:00. - Implicit casting between DATE and other types: - from STRING to DATE if the source string value is used in a context where a DATE value is expected. - from DATE to TIMESTAMP if the source date value is used in a context where a TIMESTAMP value is expected. - Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP implicit conversions are now all possible, the existing function overload resolution logic is not adequate anymore. For example, it resolves the if(false, '2011-01-01', DATE '1499-02-02') function call to the if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded function, instead of the if(BOOLEAN, DATE, DATE) version. This is clearly wrong, so the function overload resolution logic had to be changed to resolve function calls to the best-fit overloaded function definition if there are multiple applicable candidates. An overloaded function definition is an applicable candidate for a function call if each actual parameter in the function call either matches the corresponding formal parameter's type (without casting) or is implicitly castable to that type. When looking for the best-fit applicable candidate, a parameter match score (i.e. the number of actual parameters in the function call that match their corresponding formal parameter's type without casting) is calculated and the applicable candidate with the highest parameter match score is chosen. There's one more issue that the new resolution logic has to address: if two applicable candidates have the same parameter match score and the only difference between the two is that the first one requires a STRING -> TIMESTAMP implicit cast for some of its parameters while the second one requires a STRING -> DATE implicit cast for the same parameters then the first candidate has to be chosen not to break backward compatibility. E.g: year('2019-02-15') function call must resolve to year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is not implemented yet, so this is not an issue at the moment but it will be in the future. When the resolution algorithm considers overloaded function definitions, first it orders them lexicographically by the types in their parameter lists. To ensure the backward compatible behavior Primitivetype.DATE enum value has to come after PrimitiveType.TIMESTAMP. - Codegen infrastructure changes for expression evaluation. - 'IS [NOT] NULL' and '[NOT] IN' predicates. - Common comparison operators (including the 'BETWEEN' operator). - Infrastructure changes for built-in functions. - Some built-in functions: conditional, aggregate, analytical and math functions. - C++ UDF/UDA support. - Support partitioning and grouping by DATE. - Beeswax, HiveServer2 support. These items are tightly coupled and it makes sense to implement them in one change-set. Testing: - A new partitioned TEXT table 'functional.date_tbl' (and the corresponding HBASE table 'functional_hbase.date_tbl') was introduced for DATE-related tests. - BE and FE tests were extended to cover DATE type. - E2E tests: - since DATE type is supported for TEXT and HBASE fileformats only, most DATE tests were implemented separately in tests/query_test/test_date_queries.py. Note, that this change-set is not a complete DATE type implementation, but it lays the foundation for future work: - Add date support to the random query generator. - Implement a complete set of built-in functions. - Add Parquet support. - Add Kudu support. - Optionally support Avro and ORC. For further details, see IMPALA-6169. Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f Reviewed-on: http://gerrit.cloudera.org:8080/12481 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-04-23 13:33:57 +00:00
Greg Rahn	ba9b78c103	IMPALA-7759: Add Levenshtein edit distance built-in function This patch adds new built-in functions to calculate Levenshtein edit distance. Implemented as levenshtein() to match PostgreSQL in both functionality and name and also added le_dst() alias for Netezza, compatibility, but note that levenshtein() differs in functionality in that if either value is NULL or both values are NULL, levenshtein() returns NULL, where Netezza's le_dst() returns the length of the not NULL value or 0 if both values are NULL. Testing: - Added unit tests to expr-test.cc - Manual test on 966289 string pairs and results match PostgreSQL - Added changes to qgen tests for PostgreSQL comparison Change-Id: I549d33ab7cebfa10db2934461c8ec91e2cc1cdcb Reviewed-on: http://gerrit.cloudera.org:8080/11793 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-12-02 10:39:44 +00:00
stiga-huang	ddef2cb9b1	IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as Hive UDF get_json_object. We reuse RapidJson to parse the json string. In order to track the memory used in RapidJson, we wrap FunctionContext into an allocator. get_json_object accepts two parameters: a json string and a selector (json path). We parse the json string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[].a'), we first perform '$[]' to extract all the items in the root array. Then we get a queue consists of {a:1},{a:2},{a:3} and perform '.a' selector on all values in the queue. The final results is 1,2,3 in the queue. As there're multiple results, they should be encapsulated into an array. The output results is a string of '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test * Add tests in test_alloc_fail.py to check handling of out of memory Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Reviewed-on: http://gerrit.cloudera.org:8080/10950 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-09-29 11:59:03 +00:00
Andrew Sherman	15e40a3c94	IMPALA-589: Add sql function returning the impalad coordinator hostname. In every execution of an Impala query, one of the impalad daemons acts as the coordinator node. In some cases, such as when using a proxy, a user cannot predict which host will act as the coordinator. To aid in diagnosis, we provide a sql function which returns the name of the host on which the coordinator is running. EXTERNAL DESCRIPTION: Add a builtin function called coordinator(), which returns the name of the host which is running the impalad that is acting as the coordinator for the current query. TESTING: - Added a basic unit test for the new function. - Added a unit test which simulates the case when coord_address is unset. - Added a query that uses coordinator() to exprs.test - Hand tested in a development deployment. - Ran regression tests and got a clean run. Change-Id: I94d6e2664ba659b48df53c5c06f67b502c533e47 Reviewed-on: http://gerrit.cloudera.org:8080/11459 Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-09-24 20:26:44 +00:00
Adam Holley	7002ad3b35	IMPALA-7398: Add logged_in_user alias for effective_user This patch adds an alias to the effective_user function so that views created in Hive using the logged_in_user function will work in Impala. Example: CREATE VIEW foo.view1 AS SELECT * FROM foo.table1 WHERE name=logged_in_user(); Tests: - Added function and ran delegation tests - Ran backend tests - Ran custom-cluster tests including delegation Change-Id: Id63f243e0fffbe2798f1f9dbc4cc3ebe9d9529a6 Reviewed-on: http://gerrit.cloudera.org:8080/11184 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-08-11 06:02:52 +00:00
aphadke	dde930830b	IMPALA-4848: Add WIDTH_BUCKET() function Syntax : width_bucket(expr decimal, min_val decimal, max_val decimal, num_buckets int) This function creates equiwidth histograms , where the histogram range is divided into num_buckets buckets having identical sizes. This function returns the bucket in which the expr value would fall. min_val and max_val are the minimum and maximum value of the histogram range respectively. -> This function returns NULL if expr is a NULL. -> This function returns 0 if expr < min_val -> This function returns num_buckets + 1 if expr > max_val E.g. [localhost:21000] > select width_bucket(8, 1, 20, 3); +---------------------------+ \| width_bucket(8, 1, 20, 3) \| +---------------------------+ \| 2 \| +---------------------------+ Change-Id: I081bc916b1bef7b929ca161a9aade3b54c6b858f Reviewed-on: http://gerrit.cloudera.org:8080/6023 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-06-30 04:47:23 +00:00
Tim Armstrong	2995be8238	IMPALA-5607: part 1: breaking extract/date_part changes This is the compatibility-breaking part of Jinchul Kim's change to add additional units. To support nanoseconds we need to widen the output type of these functions. We also change the meaning of "milliseconds" to include the seconds component. Cherry-picks: not for 2.x Change-Id: I42d83712d9bb3a4900bec38a9c009dcf2a1fe019 Reviewed-on: http://gerrit.cloudera.org:8080/9957 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-10 04:00:37 +00:00
Taras Bobrovytsky	8fec1911e5	IMPALA-6230, IMPALA-6468: Fix the output type of round() and related fns Before this patch, the output type of round() ceil() floor() trunc() was not always the same as the input type. It was also inconsistent in general. For example, round(double) returned an integer, but round(double, int) returned a double. After looking at other database systems, we decided that the guideline should be that the output type should be the same as the input type. In this patch, we change the behavior of the previously mentioned functions so that if a double is given then a double is returned. We also modify the rounding behavior to always round away from zero. Before, we were rounding towards positive infinity in some cases. Testinging: - Updated tests - Ran an exhaustive build which passed. Cherry-picks: not for 2.x Change-Id: I77541678012edab70b182378b11ca8753be53f97 Reviewed-on: http://gerrit.cloudera.org:8080/9346 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-24 04:43:01 +00:00
Greg Rahn	d91df9b63f	IMPALA-6537: Add missing ODBC scalar functions This patch contains the following builtin function changes: New aliases for existing functions: - LEFT() same as STRLEFT() - RIGHT() same as STRRIGHT() - WEEK() same as WEEKOFYEAR() New functions: - QUARTER() - MONTHNAME() Refactors: - Remove TimestampFunctions::DayName and add LongDayName to match pattern of TimestampFunctions::ShortDayName Additionally, it adds the unit of QUARTER to EXTRACT() and DATE_PART() Testing: - manual testing comparing the translated ODBC functions to the non-translated ones - added at least one new expr-test for aliases - new expr-tests added for new functions Change-Id: Ia60af2b4de8c098be7ecb3e60840e459ae10d499 Reviewed-on: http://gerrit.cloudera.org:8080/9376 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-23 07:19:07 +00:00
Jinchul	1b1087eb05	IMPALA-3282: Adds regexp_escape built-in function Escapes the following special characters in RE2 library: .\+*?[^]$(){}=!<>\|:- Testing: Add some unit tests into ExprTest.StringRegexpFunctions Add some E2E tests into exprs.test Change-Id: I84c3e0ded26f6eb20794c38b75be9b25cd111e4b Reviewed-on: http://gerrit.cloudera.org:8080/8900 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-02-01 05:14:14 +00:00
Zoram Thanga	c2d27ca823	IMPALA-6059: Enhance ltrim()/rtrim() functions to trim any set of characters. This patch generalizes ltrim()/rtrim() functions to accept a second argument that specifies the set of characters to be removed from the leading/trailing end of the target string: ltrim(string text[, characters text]) rtrim(string text[, characters text]) A common string trimming method has been added to StringFunctions, which is called from the general ltrim/rtrim/btrim functions. The functions also share prepare and close operations. New StringFunctions tests have been added to ExprTest for the new forms of ltrim() and rtrim(). New tests to cover handling of special characters have also been added. Note that our string handling functions only work with the ASCII character set. Handling of other character sets lies outside the scope of this patch. The existing ltrim()/rtrim()/trim() functions that take only one argument have been updated to use the more general methods. Testing: Queries like the following were run on a 1.5-billion row tpch_parquet.lineitem table, with the old and new implementations to ensure there is no performance regression: 1. select count(trim(l_shipinstruct)), count(trim(l_returnflag)), ... 2. select count(*) from t where trim(l_shipinstruct) = '' and ... Change-Id: I8a5ae3f59762e70c3268a01e14ed57a9e36b8d79 Reviewed-on: http://gerrit.cloudera.org:8080/8349 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-23 23:44:46 +00:00
Jinchul	6041865031	IMPALA-3651: Adds murmur_hash() built-in function murmur_hash relys on HashUtil::MurmurHash2_64 which MurmurHash2 64-bit version. Testing: Add unit tests for primitive types: ExprTest.MurmurHashFunction Add E2E tests into exprs.test Change-Id: I14d56ffb8fab256f3f66a2669271fd4b3c50cc29 Reviewed-on: http://gerrit.cloudera.org:8080/8893 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2018-01-10 20:17:26 +00:00
Zachary Amsden	f53ce3b16d	IMPALA-4513: Promote integer types for ABS() The internal representation of the most negative number in two's complement requires 1 more bit to represent the positive version. This means ABS() must promote integer types to the next highest width. Change-Id: I86cc880e78258d5f90471bd8af4caeb4305eed77 Reviewed-on: http://gerrit.cloudera.org:8080/8004 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-09-23 02:41:32 +00:00
Philip Zeyliger	02302b7cfe	IMPALA-5211: Simplifying nullif conditional. This commit: * Converts nullif(x, y) into if(x IS DISTINCT FROM y, x, NULL). * Re-writes x IS DINSTINCT FROM y -> FALSE if x.equals(y). * Removes backend implementation of nullif. As is the case with all conversions, the original nullif(...) is replaced with if(...) in error messages, explain plans, and so on. It's important and subtle that the conversion uses "x IS DISTINCT FROM y" rather than "x != y" so that the simplification can be made while handling null values correctly. ("x != x" may be either false or null, but x is distinct from x is always false.) Testing: * Added new tests to ExprRewriteRulesTests for nullif and the if(x distinct from y, ...) simplification. * New test for the rewrite in ParserTest. * Adds an nvl2() test, incidentally. * Confirmed (using EclEmma, which uses jococo engine) that coverage is good. * Ran the tests. Change-Id: Id91ca968a0c0be44e1ec54ad8602f91a5cb2e0e5 Reviewed-on: http://gerrit.cloudera.org:8080/7829 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-09-15 22:48:52 +00:00

1 2 3

149 Commits