impala

mirror of https://github.com/apache/impala.git synced 2026-01-27 06:10:53 -05:00

Author	SHA1	Message	Date
Tim Armstrong	da5b498c18	IMPALA-9373: more tactical IWYU fixes This is a grab-bag of fixes that I did with a mix of manual inspection. The techniques used were: * Getting preprocessor output for a few files by modifying command lines from compiler_commands.json to include -E. This is revealing because you see all the random unrelated cruft that gets pulled in. A useful one liner to extract an (approximate) list of headers from preprocessor output is: grep '^#.h' be/src/util/CMakeFiles/Util.dir/os-info.cc.i | grep -o '"."' | sort -u * Looking at the IWYU recommendations for guidance on what headers can be removed (and what need to be added). * Grepping for includes of headers, especially in other headers where they become viral. An example one-liner to find these: git grep -l 'include.<iostream>' | grep '\.h$' Non-exhaustive list of changes made: ----------------------------------- Unnest classes from TmpFileMgr so we can forward-declare them. This lets us remove tmp-file-mgr.h from buffer-pool.h and query-state.h, which are both widely included headers in the codebase. Also remove webserver.h from other headers, since it pulls in openssl-util.h and consequently a lot of openssl headers. Avoid including runtime/multi-precision.h in other headers. It pulls in a lot of boost multiprecision headers that are only needed for internal implementations of math and decimal operations. This required replacing some references to int128_t with __int128_t, which I don't think significantly hurts code readability. Also remove references to decimal-util.h where they're not needed, since it transitively pulls in multi-precision.h Reduce includes of boost/date_time modules, which are transitively included in many places via timestamp-value.h. Remove transitive dependencies of timestamp-value.h to avoid pulling in remaining boost date_time headers where not needed. 
Dependent headers are: scalar-expr-evaluator.h, expr-value.h Remove references to debug-util.h in other headers, because it pulls in a lot of thread headers. Remove references to llvm-codegen.h where possible, because it pulls in many llvm headers. Other opportunities: -------------------- boost/algorithm/string.hpp includes many string algorithms and pulls in a lot of headers. * util/string-parser.h is a giant header with many dependencies. * There's lots of redundancy between boost and standard c++ headers. Both pull in vast numbers of utility headers for C++ metaprogramming and similar things. If we reduced virality of boost headers this would help a lot, and also if we switch to equivalent standard headers where possible (e.g. unordered_map, unordered_set, function, bind, etc). Compile time with clang/ASAN: ----------------------------- Before: real 9m6.311s user 62m25.006s sys 2m44.798s After: real 8m17.073s user 55m38.425s sys 2m25.808s Change-Id: I8de71866bdf3211e53560d9bfe930e7657c4d7f1 Reviewed-on: http://gerrit.cloudera.org:8080/15248 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-03-25 03:37:32 +00:00
Sahil Takiar	984f675e05	IMPALA-5904: (part 2) Fix more TSAN bugs Fixes the following data races reported by TSAN: data race be/src/runtime/krpc-data-stream-sender.cc:581:3 in KrpcDataStreamSender::Channel::SerializeAndSendBatch(impala::RowBatch) Race condition when reading 'rpc_in_flight_batch_' outside of the 'lock_' * Since this race condition is only triggered inside a DCHECK, added a suppression data race be/src/util/stopwatch.h:183:9 in MonotonicStopWatch::RunningTime() const * Race condition on BlockingJoinNode::built_probe_overlap_stop_watch_; changed from a MonotonicStopWatch to a ConcurrentStopWatch data race be/src/exec/kudu-scan-node.cc:211:13 in KuduScanNode::ProcessScanToken(impala::KuduScanner, std::string const&) Some reads on KuduScanNode::done_ are racy, so I made 'done_' an AtomicBool; this has the added benefit that failed scans will be aborted as soon as 'done_' is set to false data race be/src/service/client-request-state.h:220:29 in ClientRequestState::eos() const * Race condition when reading / updating ClientRequestState::eos_; made 'eos_' an AtomicBool data race be/src/exec/parquet/parquet-column-readers.cc:497:9 in bool ScalarColumnReader<...>::ReadValueBatch<false>(...) * Race condition in SHOULD_TRIGGER_COL_READER_DEBUG_ACTION / parquet_column_reader_debug_count data race be/src/service/impala-server.cc:817:20 in ImpalaServer::ArchiveQuery(impala::ClientRequestState const&) * Race condition on some ClientRequestState fields when creating a QueryStateRecord Fixes IMPALA-9313: 'TSAN data race in TmpFileMgr::File::Blacklist' and adds a suppression for IMPALA-9404: 'Instantiations/ExprTest.MathConversionFunctions fails in TSAN builds'. 
Testing: * Ran core tests * Re-ran TSAN tests and confirmed the data races have been fixed Change-Id: I01feb40417dc5ea64ccb0c1044cfc3eed8508476 Reviewed-on: http://gerrit.cloudera.org:8080/15244 Reviewed-by: Sahil Takiar <stakiar@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-03-02 09:01:14 +00:00
wzhou-code	3224839876	IMPALA-8759: Use double precision for HLL finalize function The current HLL finalize function uses the single-precision data type float32 to calculate the estimate. It's not accurate for larger cardinalities beyond 1,000,000, since float32 only has 6~7 decimal digits of precision. This patch changes the single-precision data type to double precision for the HLL finalize function. Testing: - Passed all exhaustive tests. - Did benchmark for queries with NDV functions. The performance impact is negligible. See the following spreadsheet for the benchmark: https://docs.google.com/spreadsheets/d/1DIVOEs5C4MJL1b7O4MA_jkaM3Y-JSMFREjXCUHJ3eHc/edit#gid=0 Change-Id: I0c5a5229b682070b0bc14da287db5231159dbb3d Reviewed-on: http://gerrit.cloudera.org:8080/15167 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-02-26 11:06:12 +00:00
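The precision limit described in IMPALA-8759 can be reproduced outside Impala. The sketch below is not the HLL code; it only round-trips a Python float through a 32-bit encoding with the standard struct module to show where float32 starts dropping digits. The helper name to_float32 is hypothetical.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Between 2**20 and 2**21 consecutive float32 values are 0.125 apart,
# so the fractional part of a ~1.2 million estimate is already coarse.
print(to_float32(1234567.89))

# Above 2**24 float32 can no longer represent every integer.
print(to_float32(16777217.0))
```

A double keeps both values exactly as written, which is the motivation the commit message gives for the switch.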
Csaba Ringhofer	2c54dbe225	IMPALA-9385: Unix time conversion cleanup + ORC fix ORC scanner uses TimestampValue::FromUnixTimeNanos() to convert sec + nano representation to Impala's TimestampValue (day + nano). FromUnixTimeNanos was affected by flag use_local_tz_for_unix_timestamp_conversions, while that global option should not affect ORC. By default there was no conversion, but if the flag is 1, then timestamps were interpreted as UTC and converted to local time. This could be solved by creating a UTC version of FromUnixTimeNanos, but I decided to change the interface in the hope of making To/From timestamp functions less confusing. Changes: - Fixed the bug by passing UTC as timezone in the ORC scanner. - Changed the interface of these TimestampValue functions to expect a timezone pointer, interpret null as UTC and skip conversion. It would be also possible to pass the actual UTC timezone and check for this in the functions, but I guess it is easier to optimize the inlined functions this way. - Moved the checking of use_local_tz_for_unix_timestamp_conversions to RuntimeState and added property time_zone_for_unix_time_conversions() to return the timezone to use in Unix time conversions. This made TimestampValue's interface clearer and makes it easy to replace the flag with a query option if we want to. - Changed RuntimeState and the Parquet scanner to skip timezone conversion if convert_legacy_hive_parquet_utc_timestamps=1 but the timezone is UTC. This allows users to avoid the performance penalty of this flag by setting query option timezone to UTC in their session (IMPALA-7557). CCTZ is not good at this, actually conversions are slower with fixed offset timezones (including UTC) than with timezones that have DST/historical rule changes. Postponed changes: - Didn't remove the UTC versions of the functions yet, as that would require changing (and possibly rethinking) several BE tests and benchmarks (IMPALA-9409). 
Tests: - Added regression test for Orc and other file formats to check that they are not affected by this flag. - Extended test_hive_parquet_timestamp_conversion.py to cover the case when convert_legacy_hive_parquet_utc_timestamps=1 and timezone=UTC. Also did some cleanup there to use query option timezone instead of env var TZ. Change-Id: I14e2a7e512ccd013d5d9fe480a5467ed4c46b76e Reviewed-on: http://gerrit.cloudera.org:8080/15222 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-02-22 02:02:56 +00:00
Tim Armstrong	04fd9ae268	IMPALA-9373: Trial run of include-what-you-use Implemented recommendations from IWYU in a subset of files, mostly in util. Did a few cleanups related to systematic problems that I noticed as a result. I noticed that uid-util.h was pulling in boost UUID headers to a lot of compilation units, so refactored that a little bit, including pulling out the hash functions into unique-id-hash.h and moving some inline functions into client-request-state-map.cc. Systematically replaced the general boost mutex header with the internal pthread-based one. This is equivalent for us, since we assume that boost::mutex is implemented by pthread_mutex_t, e.g. for the implementation of ConditionVariable. Switch include guards to pragma once just as general cleanup. Prefix string with std:: consistently in headers so that they don't depend on "using" declarations pulled in from random headers. Look at includes of C++ stream headers, including iostream and stringstream, and replaced them with iosfwd or removed them if possible. Compile time: Measured a full ASAN build of the impalad binary on an 8 core machine with ccache enabled, but cleared. It used very slightly less CPU, probably because we are still pulling in most of the same system headers. Before: real 9m27.502s user 64m39.775s sys 2m49.002s After: real 9m26.561s user 64m28.948s sys 2m48.252s So for the moment, the only significant wins are on incremental builds, where touching header files should not require as many recompilations. Compile times should start to drop meaningfully once we thin out more unnecessary includes - currently it seems like most compile units end up with large chunks of boost/std code included via transitive header dependencies. Change-Id: I3450e0ffcb8b183e18ac59c8b33b9ecbd3f60e20 Reviewed-on: http://gerrit.cloudera.org:8080/15202 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-02-19 05:57:11 +00:00
Bikramjeet Vig	ba00551581	IMPALA-4080 [part 1]: Move codegen code from aggregation exec nodes to their plan nodes Refactored code to move codegen code from aggregation exec nodes to their plan nodes. Added some TODOs that will be fixed in the next few patches. Testing: - Ran queries and confirmed manually that the codegened code works. - Ran all e2e tests for agg nodes and partition joins. Change-Id: I58f52a262ac7d0af259d5bcda72ada93a851d3b2 Reviewed-on: http://gerrit.cloudera.org:8080/15053 Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-01-31 03:06:24 +00:00
stiga-huang	0936384271	IMPALA-9010: Add builtin mask functions There're 6 builtin GenericUDFs for column masking in Hive: mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_first_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_last_n(value, charCount, upperChar, lowerChar, digitChar, otherChar, numberChar) mask_hash(value) mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar, dayValue, monthValue, yearValue) Description of the parameters: value - value to mask. Supported types: TINYINT, SMALLINT, INT, BIGINT, STRING, VARCHAR, CHAR, DATE(only for mask()). charCount - number of characters. Default value: 4 upperChar - character to replace upper-case characters with. Specify -1 to retain original character. Default value: 'X' lowerChar - character to replace lower-case characters with. Specify -1 to retain original character. Default value: 'x' digitChar - character to replace digit characters with. Specify -1 to retain original character. Default value: 'n' otherChar - character to replace all other characters with. Specify -1 to retain original character. Default value: -1 numberChar - character to replace digits in a number with. Valid values: 0-9. Default value: '1' dayValue - value to replace day field in a date with. Specify -1 to retain original value. Valid values: 1-31. Default value: 1 monthValue - value to replace month field in a date with. Specify -1 to retain original value. Valid values: 0-11. Default value: 0 yearValue - value to replace year field in a date with. Specify -1 to retain original value. 
Default value: 1 In Hive, these functions accept variable length of arguments in non-restricted types: mask_show_first_n(val) mask_show_first_n(val, 8) mask_show_first_n(val, 8, 'X', 'x', 'n') mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2) mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9') The arguments of upperChar, lowerChar, digitChar, otherChar and numberChar can be in string or numeric types. Impala doesn't support Hive GenericUDFs, so we lack these mask functions to support Ranger column masking policies. On the other hand, we want the masking functions to be evaluated in the C++ builtin logic rather than calling out to java UDFs for performance. This patch introduces our builtin implementation of them. We currently don't have a corresponding framework for GenericUDF (IMPALA-9271), so we implement these by overloads. However, it may require hundreds of overloads to cover all possible combinations. We just implement some important overloads, including - those used by Ranger default masking policies, - those with simple arguments and may be useful for users, - an overload with all arguments in int type for full functionality. Char arguments need to be converted to their ASCII value. Tests: - Add BE tests in expr-test Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664 Reviewed-on: http://gerrit.cloudera.org:8080/14963 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2020-01-17 15:34:34 +00:00
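The parameter description in IMPALA-9010 can be illustrated with a small model for string inputs. This is a sketch in Python, not the C++ builtin: it covers only mask(), mask_show_first_n() and mask_first_n() on strings, treats only ASCII letters and digits as maskable, and uses None where the commit message uses -1 to mean "retain the original character". All function names below are hypothetical.

```python
def _mask_char(ch, upper="X", lower="x", digit="n", other=None):
    """Mask one character; a replacement of None keeps the original."""
    if "A" <= ch <= "Z":
        repl = upper
    elif "a" <= ch <= "z":
        repl = lower
    elif "0" <= ch <= "9":
        repl = digit
    else:
        repl = other
    return ch if repl is None else repl


def mask(value, **chars):
    """Mask every character of the string."""
    return "".join(_mask_char(c, **chars) for c in value)


def mask_show_first_n(value, char_count=4, **chars):
    """Leave the first char_count characters visible, mask the rest."""
    return value[:char_count] + mask(value[char_count:], **chars)


def mask_first_n(value, char_count=4, **chars):
    """Mask the first char_count characters, leave the rest visible."""
    return mask(value[:char_count], **chars) + value[char_count:]


print(mask("Hello-123"))
print(mask_show_first_n("Hello-123"))
print(mask_first_n("Hello-123", 4, upper="U"))
```

With the defaults listed in the commit message ('X', 'x', 'n', and other characters retained), the hyphen passes through unchanged.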
jchen	4c04e67738	IMPALA-8891: Fix non-standard null handling in concat_ws() This patch fixes the non-standard null handling logic for function 'concat_ws', while maintaining the original null handling for function 'concat' Existing behavior: For function concat_ws, any null string element in array argument 'strs' will result in null result, just like below: ------------------------------------------------ select concat_ws('-','foo',null,'bar') as expr1; +-------+ \| expr1 \| +-------+ \| NULL \| +-------+ New behavior: In this implementation, the function conforms to the Hive standard: 1. will join all the non-null string objects as the result 2. if all string objects are null, return empty string 3. if separator is null, return null Below is an example: ------------------------------------------------- select concat_ws('-','foo',null,'bar') as expr1; +----------+ \| expr1 \| +----------+ \| foo-bar \| +----------+ ------------------------------------------------ Key changes: * Reimplement function StringFunctions::ConcatWs by filtering the null value and only process the valid string values, based on original code structure. * StringFunctions::Concat was also reimplemented, as it used to call ConcatWs but should keep the original NULL handling. Testing: * Ran exhaustive tests. Change-Id: I64cd3bfbb952e431a0cf52a5835ac05d2513d29b Reviewed-on: http://gerrit.cloudera.org:8080/14885 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-12-17 07:11:20 +00:00
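The three rules listed in IMPALA-8891 are compact enough to model directly. The sketch below uses Python's None for SQL NULL and is only an illustration of the described semantics, not StringFunctions::ConcatWs itself; concat is included to show the NULL handling that the message says was kept.

```python
def concat_ws(sep, *strs):
    """Hive-style concat_ws: skip NULL elements, NULL separator gives NULL."""
    if sep is None:
        return None
    # Joining an empty list yields "", which covers the all-NULL case.
    return sep.join(s for s in strs if s is not None)


def concat(*strs):
    """concat keeps the original rule: any NULL argument gives NULL."""
    if any(s is None for s in strs):
        return None
    return "".join(strs)


print(concat_ws("-", "foo", None, "bar"))
print(concat("foo", None, "bar"))
```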
Attila Jeges	590da59a3c	IMPALA-8706: ISO:SQL:2016 datetime patterns - Milestone 4 This patch adds ISO 8601 week-based date format tokens on top of what was introduced in IMPALA-8703, IMPALA-8704 and IMPALA-8705. The ISO 8601 week-based date tokens may be used for both datetime to string and string to datetime conversion. The ISO 8601 week-based date tokens are as follows: - IYYY: 4-digit ISO 8601 week-numbering year. Week-numbering year is the year relating to the ISO 8601 week number (IW), which is the full week (Monday to Sunday) which contains January 4 of the Gregorian year. Behaves similarly to YYYY in that for datetime to string conversion, prefix digits for 1, 2, and 3-digit inputs are obtained from current ISO 8601 week-numbering year. - IYY: Last 3 digits of ISO 8601 week-numbering year. Behaves similarly to YYY in that for datetime to string conversion, prefix digit is obtained from current ISO 8601 week-numbering year and can accept 1 or 2-digit input. - IY: Last 2 digits of ISO 8601 week-numbering year. Behaves similarly to YY in that for datetime to string conversion, prefix digits are obtained from current ISO 8601 week-numbering year and can accept 1-digit input. - I: Last digit of ISO 8601 week-numbering year. Behaves similarly to Y in that for datetime to string conversion, prefix digits are obtained from current ISO 8601 week-numbering year. - IW: ISO 8601 week of year (1-53). Begins on the Monday closest to January 1 of the year. For string to datetime conversion, if the input ISO 8601 week does not exist in the input year, an error will be thrown. Note that IW is different from the other week-related tokens WW and W (implemented in IMPALA-8705). With WW and W weeks start with the first day of the year/month. ISO 8601 weeks on the other hand always start with Monday. - ID: ISO 8601 day of week (1-7). 1 means Monday and 7 means Sunday. 
When doing string to datetime conversion, the ISO 8601 week-based tokens are meant to be used together and not mixed with other ISO SQL date tokens. E.g. 'YYYY-IW-ID' is an invalid format string. The only exceptions are the day name tokens (DAY and DY) which may be used instead of ID with the rest of the ISO 8601 week-based date tokens. E.g. 'IYYY-IW-DAY' is a valid format string. Change-Id: I89a8c1b98742391cb7b331840d216558dbca362b Reviewed-on: http://gerrit.cloudera.org:8080/14852 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-12-13 20:02:16 +00:00
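The difference between the calendar year and the ISO 8601 week-numbering year (IYYY, IW, ID) is easiest to see around New Year. The sketch below uses Python's standard datetime module, whose isocalendar() follows the same ISO 8601 week rules; it is an independent illustration and does not call Impala.

```python
from datetime import date

# 2019-12-30 is a Monday that belongs to ISO week 1 of week-numbering
# year 2020, because that week contains January 4, 2020.
d = date(2019, 12, 30)
iso_year, iso_week, iso_day = d.isocalendar()
print(d.year, iso_year, iso_week, iso_day)

# The reverse direction corresponds to a string to datetime conversion
# with the week-based tokens used together (IYYY, IW, ID).
print(date.fromisocalendar(2020, 1, 1))

# A week number that does not exist in the given year is rejected.
try:
    date.fromisocalendar(2019, 53, 1)
except ValueError as err:
    print("invalid ISO week:", err)
```

This also suggests why mixing tokens such as 'YYYY-IW-ID' is rejected: a Gregorian year combined with an ISO week does not identify a single date near year boundaries.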
Gabor Kaszab	30c7a6a18c	IMPALA-8705: ISO:SQL:2016 datetime patterns - Milestone 3 This patch adds additional datetime format tokens on top of Milestone 1 (IMPALA-8703) and Milestone 2 (IMPALA-8704). The tokens introduced: - Full month name (MONTH, Month, month): In a string to datetime conversion this token can parse textual month name into a datetime type. In a datetime to string conversion this token gives the textual representation of a month. - Short month name (MON, Mon, mon): Similar to the full month name token but this works for 3-character month names like 'JAN'. - Full day name (DAY, Day, day): In a datetime to string conversion this token gives the textual representation of a day like 'Tuesday.' Not supported in a string to datetime conversion. - Short day name (DY, Dy, dy): Similar to full day name token but this works for 3-character day names like 'TUE'. Not supported in a string to datetime conversion. - Day of week (D): In a datetime to string conversion this gives a number in [1-7] where 1 represents Sunday. Not supported in a string to datetime conversion. - Quarter of year (Q): In a datetime to string conversion this gives a number in [1-4] representing a quarter of the year. Not supported in a string to datetime conversion. - Week of year (WW): In a datetime to string conversion this gives a number in [1-53] to represent the week of year where the first week starts from 1st of January. Not supported in a string to datetime conversion. - Week of month (W): In a datetime to string conversion this gives a number in [1-5] to represent the week of month where the first week starts from the first day of the month. Not supported in a string to datetime conversion. Change-Id: Ic797f19a1311b54e5d00d01d0a7afe1f0f21fb8f Reviewed-on: http://gerrit.cloudera.org:8080/14714 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-12-05 14:19:41 +00:00
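The numeric tokens from IMPALA-8705 (D, Q, WW, W) can be modelled with integer arithmetic. The formulas below are an assumption derived from the wording of the commit message (weeks counted in 7-day blocks from January 1 or from the first day of the month); they are not taken from the Impala source, and the function names are hypothetical.

```python
from datetime import date


def day_of_week(d):
    """D token: 1 = Sunday ... 7 = Saturday."""
    # isoweekday() is 1 = Monday ... 7 = Sunday, so rotate by one day.
    return d.isoweekday() % 7 + 1


def quarter(d):
    """Q token: 1-4."""
    return (d.month - 1) // 3 + 1


def week_of_year(d):
    """WW token: 1-53, week 1 starts on January 1."""
    return (d.timetuple().tm_yday - 1) // 7 + 1


def week_of_month(d):
    """W token: 1-5, week 1 starts on the first day of the month."""
    return (d.day - 1) // 7 + 1


d = date(2019, 12, 31)  # a Tuesday
print(day_of_week(d), quarter(d), week_of_year(d), week_of_month(d))
```

Unlike the ISO week tokens from Milestone 4, these weeks are not aligned to Monday, so December 31 always falls in week 53.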
norbert.luksa	a862282811	IMPALA-8709: Add Damerau-Levenshtein edit distance built-in function This patch adds new built-in functions to calculate restricted Damerau-Levenshtein edit distance (optimal string alignment). Implemented as dle_dst() and damerau_levenshtein(). If either value is NULL or both values are NULL returns NULL which differs from Netezza's dle_dst() which returns the length of the not NULL value or 0 if both values are NULL. The NULL behavior matches the existing levenshtein() function. Also cleans up levenshtein tests. Testing: - Added unit tests to expr-test.cc - Manual testing on over 1400 string pairs from http://marvin.cs.uidaho.edu/misspell.html and results match Netezza Change-Id: Ib759817ec15e7075bf49d51e494e45c8af4db94d Reviewed-on: http://gerrit.cloudera.org:8080/13794 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-11-22 21:39:21 +00:00
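For reference, the restricted Damerau-Levenshtein distance (optimal string alignment) named in IMPALA-8709 can be written as a textbook dynamic program. The sketch below is that textbook version in Python with None standing in for NULL; it is not the Impala implementation, and osa_distance is a hypothetical name.

```python
def osa_distance(a, b):
    """Optimal string alignment distance; None if either input is None."""
    if a is None or b is None:
        return None
    m, n = len(a), len(b)
    # d[i][j] = distance between the first i chars of a and first j of b.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


print(osa_distance("ab", "ba"))
print(osa_distance("ca", "abc"))
```

The pair "ca" / "abc" is the usual way to tell the restricted variant apart: it scores 3 here, whereas unrestricted Damerau-Levenshtein would give 2 because it allows edits inside an already transposed pair.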
Attila Jeges	684a54a89e	IMPALA-7368: Change supported year range for DATE values to 1..9999 Before this patch the supported year range for DATE type started with year 0. This contradicts the ANSI SQL standard that defines the valid DATE value range to be 0001-01-01 to 9999-12-31. Change-Id: Iefdf1c036834763f52d44d0c39a25a1f04e41e07 Reviewed-on: http://gerrit.cloudera.org:8080/14349 Reviewed-by: Attila Jeges <attilaj@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-10-04 18:36:22 +00:00
Tim Armstrong	983e3a66de	IMPALA-2138: part 1: initial cleanup This is a mixed bag of simplifications, debugging improvements and test fixes that came up in the projection work. I had to update some planner tests because some expressions now include their arguments. Various things in the planner tests were stale, so there are spurious changes in the expected output that are ignored by the plan verification. Change-Id: I75d2c8cab79988300c1a9c6c23d6ccea53da7d23 Reviewed-on: http://gerrit.cloudera.org:8080/14265 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-09-20 22:15:41 +00:00
Gabor Kaszab	bca1b43efb	IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1 This enhancement introduces FORMAT clause for CAST() operator that is applicable for casts between string types and timestamp types. Instead of accepting SimpleDateFormat patterns the FORMAT clause supports datetime patterns following the ISO:SQL:2016 standard. Note, the CAST() operator without the FORMAT clause still uses Impala's implementation of SimpleDateFormat handling. Similarly, the existing conversion functions such as to_timestamp(), from_timestamp() etc. remain unchanged and use SimpleDateFormat. Contrary to how these functions work the FORMAT clause must specify a string literal and cannot be used with any other kind of a string expression. Milestone 1 contains all the format tokens covered by the SQL standard. Further milestones will add more functionality on top of this list to cover functionality provided by other RDBMS systems. List of tokens implemented by this change: - YYYY, YYY, YY, Y: Year tokens - RRRR, RR: Round year tokens - MM: Month (1-12) - DD: Day (1-31) - DDD: Day of year (1-366) - HH, HH12: Hour of day (1-12) - HH24: Hour of day (0-23) - MI: Minute (0-59) - SS: Second (0-59) - SSSSS: Second of day (0-86399) - FF, FF1, ..., FF9: Fractional second - AM, PM, A.M., P.M.: Meridiem indicators - TZH: Timezone hour (-99-+99) - TZM: Timezone minute (0-99) - Separators: - . / , ' ; : space - ISO8601 date indicators (T, Z) Some notes about the matching algorithm: - The parsing algorithm uses these tokens in a case insensitive manner. - The separators are interchangeable with each other. For example a '-' separator in the format will match with a '.' character in the input. - The length of the separator sequences is handled flexibly meaning that a single separator character in the format for instance would match with a multi-separator sequence in the input. 
- In a string type to timestamp conversion the timezone offset tokens are parsed, expected to match with the input but they don't adjust the result as the input is already expected to be in UTC format. Usage example: SELECT CAST('01-02-2019' AS TIMESTAMP FORMAT 'MM-DD-YYYY'); SELECT CAST('2019.10.10 13:30:40.123456 +01:30' AS TIMESTAMP FORMAT 'YYYY-MM-DD HH24:MI:SS.FF9 TZH:TZM'); SELECT CAST(timestamp_column as STRING FORMAT "YYYY MM HH12 YY") from some_table; Change-Id: I19d8d097a45ae6f103b6cd1b2d81aad38dfd9e23 Reviewed-on: http://gerrit.cloudera.org:8080/13722 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-09-19 18:46:19 +00:00
Sahil Takiar	151835116a	IMPALA-7312: Non-blocking mode for Fetch() RPC Adds the query option FETCH_ROWS_TIMEOUT_MS to control the client timeout when fetching rows. Set to 10 seconds by default to avoid unnecessary fetch requests. Timeout applies when result spooling is enabled or disabled. When result spooling is disabled, the timeout controls how long the client thread will wait for a single RowBatch to be produced by the coordinator fragment. When result spooling is enabled, a client can fetch multiple RowBatches at a time, so the timeout controls the total time spent waiting for RowBatches to be produced. The timeout applies to both waiting for rows to be sent by the fragment instance thread, and waiting for rows to be materialized (e.g. the time measured by RowMaterializationTimer). Testing: * Added new tests to test_fetch.py * Ran core tests Change-Id: I331acaba23a65dab43cca48e9dc0dc957b9c632d Reviewed-on: http://gerrit.cloudera.org:8080/14157 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-09-10 05:56:57 +00:00
norbertluksa	1f904719e4	IMPALA-7770: SPLIT_PART to support negative indexes Third parameter of SPLIT_PART (nth field) accepts now negative values, and searches the string backwards. Testing: * Added unit tests to expr-test.cc Change-Id: I2db762989a90bd95661a59eb9c11a29eb2edfafb Reviewed-on: http://gerrit.cloudera.org:8080/13880 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-08-15 11:06:49 +00:00
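The negative-index behavior added by IMPALA-7770 can be sketched as follows. This is a Python model, not the built-in: the choice to return an empty string for an out-of-range index and to reject index 0 is an assumption of this sketch rather than something stated in the commit message.

```python
def split_part(s, delim, n):
    """Return the n-th field of s; negative n counts from the end."""
    if n == 0:
        # Assumption: index 0 is treated as invalid.
        raise ValueError("field index must not be 0")
    parts = s.split(delim)
    idx = n - 1 if n > 0 else len(parts) + n
    if idx < 0 or idx >= len(parts):
        # Assumption: out-of-range fields yield an empty string.
        return ""
    return parts[idx]


print(split_part("a,b,c", ",", 2))
print(split_part("a,b,c", ",", -1))
```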
luksan47	8db7f27ddd	IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function The added functions return the Jaro/Jaro-Winkler similarity/distance of two strings. The algorithm calculates the Jaro-Similarity of the strings, then adds more weight to the result if there are common prefixes. (Jaro-Winkler) For more detail, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance Extended the algorithm with another optional parameter: boost threshold The prefix weight will only be applied if the Jaro-similarity exceeds the given threshold. By default, its value is 0.7. The new built-in functions are: * jaro_distance, jaro_dst * jaro_similarity, jaro_sim * jaro_winkler_distance, jw_dst * jaro_winkler_similarity, jw_sim Testing: * Added unit tests to expr-test.cc * Manual testing over 1400 word pairs from http://marvin.cs.uidaho.edu/misspell.html Results match Apache commons Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c Reviewed-on: http://gerrit.cloudera.org:8080/13870 Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-08-13 18:25:32 +00:00
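The algorithm outlined in IMPALA-8752 (Jaro similarity, then a prefix bonus applied only above a boost threshold) can be sketched in Python. The boost threshold default of 0.7 comes from the commit message; the scaling factor of 0.1 and the prefix cap of 4 characters are the conventional Jaro-Winkler values and are assumptions here, as are the function names.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    l1, l2 = len(s1), len(s2)
    if l1 == 0 or l2 == 0:
        return 0.0
    window = max(max(l1, l2) // 2 - 1, 0)
    m1, m2 = [False] * l1, [False] * l2
    matches = 0
    for i in range(l1):
        for j in range(max(0, i - window), min(l2, i + window + 1)):
            if not m2[j] and s1[i] == s2[j]:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(l1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / l1 + matches / l2 + (matches - t) / matches) / 3


def jaro_winkler_similarity(s1, s2, scaling=0.1, boost_threshold=0.7):
    """Add the prefix bonus only if the Jaro score exceeds the threshold."""
    jaro = jaro_similarity(s1, s2)
    if jaro <= boost_threshold:
        return jaro
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))
```

The corresponding distances are simply one minus the similarity.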
Joe McDonnell	4cc3ff9c67	IMPALA-8176: Convert simple backend tests to the unified executable This converts tests with trivial main() functions to the unified executable. This means that the code change is strictly removing main() functions and updating the CMakeLists.txt files. Any test that requires a change larger than that will be addressed separately. The only exceptions are: - exec/incr-stats-util-test.cc requires naming changes to avoid conflicts with util/rle-test.cc - runtime/decimal-test.cc simplified the naming to make the CMakeLists.txt arguments easier. The new test libraries are marked STATIC, because they are linked into a single binary (unifiedbetests) and googletest has problems with tests in shared libraries. Converting this set of tests saves about 18GB of disk space for a debug build and saves a minute or two of link time. For any CMakeLists.txt that has unified tests, this adds a comment for each test that is not unified. Testing: - Ran backend tests in DEBUG and ASAN modes on Centos7 - Ran backend tests in DEBUG mode on Centos6 Change-Id: I840d0f9b70edb3a7195a2a33b21fd2874d4c52bd Reviewed-on: http://gerrit.cloudera.org:8080/13515 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-07-19 22:01:00 +00:00
Tim Armstrong	c353cf7a64	IMPALA-8713: fix stack overflow in unhex() Write the results into the output heap buffer instead of into a temporary stack buffer. No additional memory is used because AnyValUtil::FromBuffer() allocated a temporary buffer anyway. Testing: Added a targeted test to expr-test that caused a crash before this fix. Change-Id: Ie0c1760511a04c0823fc465cf6e529e9681b2488 Reviewed-on: http://gerrit.cloudera.org:8080/13743 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-28 03:02:01 +00:00
Jiawei Wang	dbba52c77c	IMPALA-8665: Include extra info in error message when date cast fails This change extends the error message Impala yields when casting STRING to DATE (explicitly or implicitly) fails. The new error message includes the violating string value. Testing: changes -> date-partitioning.test & date.test query_test/test_date_queries.py test passed Example: select cast('20' as date); ERROR: UDF ERROR: String to Date parse failed. Invalid string val: "20" Change-Id: If800b7696515cd61afee27220c55ff2440a86f04 Reviewed-on: http://gerrit.cloudera.org:8080/13680 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-27 14:39:46 +00:00
Attila Jeges	f40935a30e	IMPALA-7369: part 2: Add INTERVAL expr support and built-in functions for DATE This change implements INTERVAL expression support for DATE type and adds several DATE related built-in functions. The efficiency of the DateValue::ToYearMonthDay() function used in many of the built-in functions below was also improved. The following functions are supported in Hive: INT YEAR(DATE d) Extracts year of the 'd' date, returns it as an int in 0-9999 range. INT MONTH(DATE d) Extracts month of the 'd' date and returns it as an int in 1-12 range. INT DAY(DATE d), INT DAYOFMONTH(DATE d) Extracts day-of-month of the 'd' date and returns it as an int in 1-31 range. INT QUARTER(DATE d) Extracts quarter of the 'd' date and returns it as an int in 1-4 range. INT DAYOFWEEK(DATE d) Extracts day-of-week of the 'd' date and returns it as an int in 1-7 range. 1 is Sunday and 7 is Saturday. INT DAYOFYEAR(DATE d) Extracts day-of-year of the 'd' date and returns it as an int in 1-366 range. INT WEEKOFYEAR(DATE d) Extracts week-of-year of the 'd' date and returns it as an int in 1-53 range. STRING DAYNAME(DATE d) Returns the day field from a 'd' date, converted to the string corresponding to that day name. The range of return values is "Sunday" to "Saturday". STRING MONTHNAME(DATE d) Returns the month field from a 'd' date, converted to the string corresponding to that month name. The range of return values is "January" to "December". DATE NEXT_DAY(DATE d, STRING weekday) Returns the first date which is later than 'd' and named as 'weekday'. 'weekday' is 3 letters or full name of the day of the week. DATE LAST_DAY(DATE d) Returns the last day of the month which the 'd' date belongs to. INT DATEDIFF(DATE d1, DATE d2) Returns the number of days from 'd1' date to 'd2' date. DATE CURRENT_DATE() Returns the current date (in the local time zone). 
INT INT_MONTHS_BETWEEN(DATE d1, DATE d2) Returns the number of months between 'd1' and 'd2' dates, as an int representing only the full months that passed. If 'd1' represents an earlier date than 'd2', the result is negative. DOUBLE MONTHS_BETWEEN(DATE d1, DATE d2) Returns the number of months between 'd1' and 'd2' dates. Can include a fractional part representing extra days in addition to the full months between the dates. The fractional component is computed by dividing the difference in days by 31 (regardless of the month). If 'd1' represents an earlier date than 'd2', the result is negative. DATE ADD_YEARS(DATE d, INT/BIGINT num_years), DATE SUB_YEARS(DATE d, INT/BIGINT num_years) Adds/subtracts a specified number of years to a 'd' date value. DATE ADD_MONTHS(DATE d, INT/BIGINT num_months), DATE SUB_MONTHS(DATE d, INT/BIGINT num_months) Adds/subtracts a specified number of months to a date value. If 'd' is the last day of a month, the returned date will fall on the last day of the target month too. DATE ADD_DAYS(DATE d, INT/BIGINT num_days), DATE SUB_DAYS(DATE d, INT/BIGINT num_days) Adds/subtracts a specified number of days to a date value. DATE ADD_WEEKS(DATE d, INT/BIGINT num_weeks), DATE SUB_WEEKS(DATE d, INT/BIGINT num_weeks) Adds/subtracts a specified number of weeks to a date value. The following function doesn't exist in Hive but supported by Amazon Redshift INT DATE_CMP(DATE d1, DATE d2) Compares 'd1' and 'd2' dates. Returns: 1. NULL, if either 'd1' or 'd2' is NULL 2. -1 if d1 < d2 3. 1 if d1 > d2 4. 0 if d1 == d2 (https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_CMP.html) Change-Id: If404bffdaf055c769e79ffa8f193bac415cfdd1a Reviewed-on: http://gerrit.cloudera.org:8080/13648 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-25 23:06:25 +00:00
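The last-day rule described for ADD_MONTHS / SUB_MONTHS above can be illustrated with a small sketch. This is not Impala's DateValue code: the `Ymd` struct and the `AddMonths` helper are hypothetical names, range checking against the 0000-9999 limits is omitted, and clamping the day when the target month is shorter (for example Jan 30 plus one month) is an assumption that the commit message does not spell out.

```cpp
#include <cassert>

// Hypothetical calendar triple used only for this sketch.
struct Ymd {
  int year;
  int month;  // 1-12
  int day;    // 1-31
};

inline bool operator==(const Ymd& a, const Ymd& b) {
  return a.year == b.year && a.month == b.month && a.day == b.day;
}

inline bool IsLeapYear(int year) {
  return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

inline int DaysInMonth(int year, int month) {
  assert(month >= 1 && month <= 12);
  static const int kDays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  return (month == 2 && IsLeapYear(year)) ? 29 : kDays[month - 1];
}

// Adds 'num_months' (which may be negative) to 'd'.
inline Ymd AddMonths(const Ymd& d, long long num_months) {
  // Count months since year 0 so the arithmetic is a single addition.
  long long total = static_cast<long long>(d.year) * 12 + (d.month - 1) + num_months;
  long long year = total / 12;
  long long month0 = total % 12;
  if (month0 < 0) {  // C++ division truncates toward zero; convert to floor.
    month0 += 12;
    --year;
  }
  int target_year = static_cast<int>(year);
  int target_month = static_cast<int>(month0) + 1;
  int last = DaysInMonth(target_year, target_month);
  // Rule from the commit message: last day of month maps to last day of month.
  bool was_last_day = d.day == DaysInMonth(d.year, d.month);
  // Assumption (not stated in the commit): clamp days that do not exist
  // in the target month.
  int day = (was_last_day || d.day > last) ? last : d.day;
  return Ymd{target_year, target_month, day};
}
```

Under these assumptions, `AddMonths(Ymd{2019, 2, 28}, 1)` lands on 2019-03-31 rather than 2019-03-28, because February 28th is the last day of its month in 2019.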
Jim Apple	45c6c46bf6	IMPALA-5031: signed overflow is undefined behavior Fix remaining signed overflow undefined behaviors in end-to-end tests. The interesting part of the backtraces: exprs/aggregate-functions-ir.cc:464:25: runtime error: signed integer overflow: 0x5a4728ca063b522c0b728f8000000000 + 0x3c2f7086aed236c807a1b50000000000 cannot be represented in type '__int128' #0 AggregateFunctions::DecimalAvgMerge( impala_udf::FunctionContext, impala_udf::StringVal const&, impala_udf::StringVal) exprs/aggregate-functions-ir.cc:464:25 #1 AggFnEvaluator::Update(TupleRow const, Tuple, void) exprs/agg-fn-evaluator.cc:327:7 #2 AggFnEvaluator::Add(TupleRow const, Tuple) exprs/agg-fn-evaluator.h:257:3 #3 Aggregator::UpdateTuple(AggFnEvaluator, Tuple, TupleRow, bool) exec/aggregator.cc:167:24 #4 NonGroupingAggregator::AddBatchImpl(RowBatch) exec/non-grouping-aggregator-ir.cc:27:5 #5 NonGroupingAggregator::AddBatch(RuntimeState, RowBatch) exec/non-grouping-aggregator.cc:124:45 #6 AggregationNode::Open(RuntimeState) exec/aggregation-node.cc:70:57 exprs/aggregate-functions-ir.cc:513:12: runtime error: signed integer overflow: -8282081183197145958 + -4473782455107795527 cannot be represented in type 'long' #0 void AggregateFunctions::SumUpdate<impala_udf::BigIntVal, impala_udf::BigIntVal>(impala_udf::FunctionContext, impala_udf::BigIntVal const&, impala_udf::BigIntVal) exprs/aggregate-functions-ir.cc:513:12 #1 AggFnEvaluator::Update(TupleRow const, Tuple, void) exprs/agg-fn-evaluator.cc:327:7 #2 AggFnEvaluator::Add(TupleRow const, Tuple) exprs/agg-fn-evaluator.h:257:3 #3 Aggregator::UpdateTuple(AggFnEvaluator*, Tuple, TupleRow, bool) exec/aggregator.cc:167:24 #4 NonGroupingAggregator::AddBatchImpl(RowBatch) exec/non-grouping-aggregator-ir.cc:27:5 #5 NonGroupingAggregator::AddBatch(RuntimeState, RowBatch) exec/non-grouping-aggregator.cc:124:45 #6 AggregationNode::Open(RuntimeState) exec/aggregation-node.cc:70:57 exprs/aggregate-functions-ir.cc:585:14: runtime error: signed integer 
overflow: 0x5a4728ca063b522c0b728f8000000000 + 0x3c2f7086aed236c807a1b50000000000 cannot be represented in type '__int128' #0 AggregateFunctions::SumDecimalMerge( impala_udf::FunctionContext, impala_udf::DecimalVal const&, impala_udf::DecimalVal) exprs/aggregate-functions-ir.cc:585:14 #1 AggFnEvaluator::Update(TupleRow const, Tuple, void) exprs/agg-fn-evaluator.cc:327:7 #2 AggFnEvaluator::Add(TupleRow const, Tuple) exprs/agg-fn-evaluator.h:257:3 #3 Aggregator::UpdateTuple(AggFnEvaluator*, Tuple, TupleRow, bool) exec/aggregator.cc:167:24 #4 NonGroupingAggregator::AddBatchImpl(RowBatch) exec/non-grouping-aggregator-ir.cc:27:5 #5 NonGroupingAggregator::AddBatch(RuntimeState, RowBatch) exec/non-grouping-aggregator.cc:124:45 #6 AggregationNode::Open(RuntimeState) exec/aggregation-node.cc:70:57 runtime/decimal-value.inline.h:145:12: runtime error: signed integer overflow: 18 0x0785ee10d5da46d900f436a000000000 cannot be represented in type '__int128' #0 DecimalValue<__int128>::ScaleTo(int, int, int, bool) const runtime/decimal-value.inline.h:145:12 #1 DecimalOperators::ScaleDecimalValue( impala_udf::FunctionContext, DecimalValue<int> const&, int, int, int) exprs/decimal-operators-ir.cc:132:41 #2 DecimalOperators::RoundDecimal(impala_udf::FunctionContext, impala_udf::DecimalVal const&, int, int, int, int, DecimalOperators::DecimalRoundOp const&) exprs/decimal-operators-ir.cc:465:16 #3 DecimalOperators::RoundDecimal(impala_udf::FunctionContext, impala_udf::DecimalVal const&, DecimalOperators::DecimalRoundOp const&) exprs/decimal-operators-ir.cc:519:10 #4 DecimalOperators::CastToDecimalVal( impala_udf::FunctionContext, impala_udf::DecimalVal const&) exprs/decimal-operators-ir.cc:529:10 #5 impala_udf::DecimalVal ScalarFnCall::InterpretEval <impala_udf::DecimalVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:485:208 #6 ScalarFnCall::GetDecimalVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:618:44 #7 
ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const) exprs/scalar-expr-evaluator.cc:321:27 #8 ScalarExprEvaluator::GetValue(TupleRow const) exprs/scalar-expr-evaluator.cc:251:10 #9 Java_org_apache_impala_service_FeSupport_NativeEvalExprsWithoutRow service/fe-support.cc:246:26 #10 (<unknown module>) runtime/multi-precision.h:116:21: runtime error: negation of 0x80000000000000000000000000000000 cannot be represented in type 'int128_t' (aka '__int128'); cast to an unsigned type to negate this value to itself #0 ConvertToInt128(boost::multiprecision::number <boost::multiprecision::backends::cpp_int_backend<256u, 256u, (boost::multiprecision::cpp_integer_type)1, (boost::multiprecision::cpp_int_check_type)0, void>, (boost::multiprecision::expression_template_option)0>, __int128, bool) runtime/multi-precision.h:116:21 #1 DecimalValue<__int128> DecimalValue<__int128>::Multiply<__int128>(int, DecimalValue<__int128> const&, int, int, int, bool, bool) const runtime/decimal-value.inline.h:438:16 #2 DecimalOperators::Multiply_DecimalVal_DecimalVal( impala_udf::FunctionContext, impala_udf::DecimalVal const&, impala_udf::DecimalVal const&) exprs/decimal-operators-ir.cc:859:3336 #3 impala_udf::DecimalVal ScalarFnCall::InterpretEval <impala_udf::DecimalVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:485:376 #4 ScalarFnCall::GetDecimalVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:618:44 #5 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const) exprs/scalar-expr-evaluator.cc:321:27 #6 ScalarExprEvaluator::GetValue(TupleRow const) exprs/scalar-expr-evaluator.cc:251:10 #7 Java_org_apache_impala_service_FeSupport_NativeEvalExprsWithoutRow service/fe-support.cc:246:26 #8 (<unknown module>) util/runtime-profile-counters.h:194:24: runtime error: signed integer overflow: -1263418397011577524 + -9223370798768111350 cannot be represented in type 'long' #0 RuntimeProfile::AveragedCounter::UpdateCounter 
(RuntimeProfile::Counter) util/runtime-profile-counters.h:194:24 #1 RuntimeProfile::UpdateAverage(RuntimeProfile) util/runtime-profile.cc:199:20 #2 RuntimeProfile::UpdateAverage(RuntimeProfile) util/runtime-profile.cc:245:14 #3 Coordinator::BackendState::UpdateExecStats (vector<Coordinator::FragmentStats, allocator<Coordinator::FragmentStats> > const&) runtime/coordinator-backend-state.cc:429:22 #4 Coordinator::ComputeQuerySummary() runtime/coordinator.cc:775:20 #5 Coordinator::HandleExecStateTransition(Coordinator::ExecState, Coordinator::ExecState) runtime/coordinator.cc:567:3 #6 Coordinator::SetNonErrorTerminalState(Coordinator::ExecState) runtime/coordinator.cc:484:3 #7 Coordinator::GetNext(QueryResultSet, int, bool) runtime/coordinator.cc:657:53 #8 ClientRequestState::FetchRowsInternal(int, QueryResultSet) service/client-request-state.cc:943:34 #9 ClientRequestState::FetchRows(int, QueryResultSet) service/client-request-state.cc:835:36 #10 ImpalaServer::FetchInternal(TUniqueId const&, bool, int, beeswax::Results) service/impala-beeswax-server.cc:545:40 #11 ImpalaServer::fetch(beeswax::Results&, beeswax::QueryHandle const&, bool, int) service/impala-beeswax-server.cc:178:19 #12 beeswax::BeeswaxServiceProcessor::process_fetch(int, apache::thrift::protocol::TProtocol, apache::thrift::protocol::TProtocol, void) generated-sources/gen-cpp/BeeswaxService.cpp:3398:13 #13 beeswax::BeeswaxServiceProcessor::dispatchCall (apache::thrift::protocol::TProtocol, apache::thrift::protocol::TProtocol, string const&, int, void) generated-sources/gen-cpp/BeeswaxService.cpp:3200:3 #14 ImpalaServiceProcessor::dispatchCall (apache::thrift::protocol::TProtocol, apache::thrift::protocol::TProtocol, string const&, int, void) generated-sources/gen-cpp/ImpalaService.cpp:1824:48 #15 apache::thrift::TDispatchProcessor::process (boost::shared_ptr<apache::thrift::protocol::TProtocol>, boost::shared_ptr<apache::thrift::protocol::TProtocol>, void) 
toolchain/thrift-0.9.3-p5/include/thrift/TDispatchProcessor.h:121:12 Change-Id: I73dd6802ec1023275d09a99a2950f3558313fc8e Reviewed-on: http://gerrit.cloudera.org:8080/13437 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-10 08:34:27 +00:00
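The commit message above lists the sanitizer reports but not the fixes themselves. As a rough sketch of two common ways to avoid this class of undefined behavior, the helpers below either wrap around through unsigned arithmetic or report the overflow. The names `WrappingAdd`, `CheckedAdd` and `WrappingNegate` are hypothetical and are not taken from the Impala source; `__builtin_add_overflow` is a GCC/Clang builtin, and the unsigned-to-signed conversion is implementation-defined before C++20 (modular on GCC and Clang).

```cpp
#include <cstdint>

// Wrap-around addition: unsigned arithmetic is defined to wrap modulo 2^64,
// so the sum is computed on unsigned values and converted back.
inline int64_t WrappingAdd(int64_t a, int64_t b) {
  return static_cast<int64_t>(static_cast<uint64_t>(a) + static_cast<uint64_t>(b));
}

// Overflow-detecting addition: returns false and leaves the wrapped value in
// '*result' if the mathematical sum does not fit in int64_t.
inline bool CheckedAdd(int64_t a, int64_t b, int64_t* result) {
  return !__builtin_add_overflow(a, b, result);
}

// Negation that maps INT64_MIN to itself instead of invoking undefined
// behavior, following the sanitizer's hint to negate via an unsigned type.
inline int64_t WrappingNegate(int64_t v) {
  return static_cast<int64_t>(0 - static_cast<uint64_t>(v));
}
```

Whether wrapping or failing is the right behavior depends on the call site: an aggregate such as SUM over BIGINT may accept wrap-around, while a decimal operation would more plausibly need to signal overflow.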
Attila Jeges	f0678b06e6	IMPALA-7369: part 1: Implement TRUNC, DATE_TRUNC, EXTRACT, DATE_PART functions for DATE These functions are somewhat similar in that each of them takes a DATE argument and a time unit to work with. They work identically to the corresponding TIMESTAMP functions. The only difference is that the DATE functions don't accept time-of-day units. TRUNC(DATE d, STRING unit) Truncates a DATE value to the specified time unit. The 'unit' argument is case insensitive. This argument string can be one of: SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year. Q: Quarter. MONTH, MON, MM, RM: Month. DDD, DD, J: Day. DAY, DY, D: Starting day (Monday) of the week. WW: Truncates to the most recent date, no later than 'd', which is on the same day of the week as the first day of year. W: Truncates to the most recent date, no later than 'd', which is on the same day of the week as the first day of month. The implementation mirrors Impala's TRUNC(TIMESTAMP ts, STRING unit) function. Hive and Oracle SQL have a similar function too. Reference: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions201.htm . DATE_TRUNC(STRING unit, DATE d) Truncates a DATE value to the specified precision. The 'unit' argument is case insensitive. This argument string can be one of: DAY, WEEK, MONTH, YEAR, DECADE, CENTURY, MILLENNIUM. The implementation mirrors Impala's DATE_TRUNC(STRING unit, TIMESTAMP ts) function. Vertica has a similar function too. Reference: https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/ SQLReferenceManual/Functions/Date-Time/DATE_TRUNC.htm . EXTRACT(DATE d, STRING unit), EXTRACT(unit FROM DATE d) Returns one of the numeric date fields from a DATE value. The 'unit' string can be one of YEAR, QUARTER, MONTH, DAY. This argument value is case-insensitive. The implementation mirrors Impala's EXTRACT(TIMESTAMP ts, STRING unit). Hive and Oracle SQL have a similar function too.
Reference: http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm . DATE_PART(STRING unit, DATE date) Similar to EXTRACT(), with the argument order reversed. Supports the same date units as EXTRACT(). The implementation mirrors Impala's DATE_PART(STRING unit, TIMESTAMP ts) function. Change-Id: I843358a45eb5faa2c134994600546fc1d0a797c8 Reviewed-on: http://gerrit.cloudera.org:8080/13363 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-05 14:23:23 +00:00
Tim Armstrong	95a1da2d32	IMPALA-8578: part 2: move metrics code to .cc files This moves a lot of metric function definitions into .cc files, to reduce the size of compilation units and to reduce the frequency of recompilation required when making changes to metrics. This moves most of the large, non-perf-critical metric functions into .cc files. For template classes, this requires explicitly instantiating all combinations of template parameters that are used in impala, including in tests. Disable weak-template-vtables warning because of spurious warnings on template instantiations. See https://bugs.llvm.org/show_bug.cgi?id=18733 Change-Id: I78ad045ded6e6a7b7524711be9302c26115b97b9 Reviewed-on: http://gerrit.cloudera.org:8080/13501 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-06-05 03:29:47 +00:00
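The technique mentioned above, moving template member definitions out of the header and explicitly instantiating the combinations that are used, can be sketched as follows. `SimpleMetric` is a hypothetical class, not one of Impala's metric types, and the header/.cc split is shown with comments because the example has to fit in a single block.

```cpp
#include <cstdint>
#include <string>

// ---- would live in a header (hypothetical simple-metric.h) ----
template <typename T>
class SimpleMetric {
 public:
  explicit SimpleMetric(T initial) : value_(initial) {}

  // Small and potentially perf-critical: stays inline in the header.
  void Increment(T delta) { value_ += delta; }

  // Larger, non-perf-critical: only declared here, defined in the .cc file.
  std::string ToHumanReadable() const;

 private:
  T value_;
};

// ---- would live in the matching .cc file ----
template <typename T>
std::string SimpleMetric<T>::ToHumanReadable() const {
  return "value=" + std::to_string(value_);
}

// Explicit instantiations: every template argument used anywhere in the
// program (tests included) must be listed, otherwise linking fails with an
// undefined reference.
template class SimpleMetric<int64_t>;
template class SimpleMetric<double>;
```

The trade-off is the one the commit message names: headers get lighter and edits to the .cc file no longer trigger widespread recompilation, at the cost of maintaining the list of instantiations by hand.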
Tim Armstrong	d4648e87b4	IMPALA-4356,IMPALA-7331: codegen all ScalarExprs Based on initial draft patch by Pooja Nilangekar. Codegen'd expressions can be executed in two ways - either by being called directly from a fully codegen'd function, or from interpreted code via a function pointer (previously ScalarFnCall::scalar_fn_wrapper_). This change moves the function pointer from ScalarFnCall to its base class ScalarExpr, so the full expr tree can be codegen'd, not just the ScalarFnCall subtrees. The key refactoring and improvements are: * ScalarExpr::GetVal() switches between interpreted and the codegen'd function pointer code paths in an inline function, avoiding a virtual function call to ScalarFnCall::GetVal(). * Boilerplate logic is moved to ScalarExpr::GetCodegendComputeFn(), which calls a virtual function GetCodegenComputeFnImpl(). * ScalarFnCall's logic for deciding whether to interpret or codegen is better abstracted and exposed to ScalarExpr as IsInterpretable() and ShouldCodegen() methods. * The ScalarExpr::codegend_compute_fn_ function pointer is only populated for expressions that are "codegen entry points". These include the roots of expr trees and non-root expressions where the parent expression calls GetVal() from the pseudo-codegen'd GetCodegendComputeFnWrapper(). ScalarFnCall is always initialised for interpreted execution. Otherwise the function pointer is needed for non-root expressions, e.g. to support ScalarExprEvaluator::GetConstantVal(). * Latent bugs/gaps for codegen of CollectionVal are fixed. CollectionVal is modified to use the StringVal memory layout to allow code sharing with StringVal. These fixes allowed simplification of IsNotEmptyPredicate codegen (from IMPALA-7657).
I chose to tackle two problems in one change - adding support for generating codegen'd function pointers for all ScalarExprs, and adding the "entry point" concept - to avoid a blow-up in the number of codegen'd entry points that could lead to longer codegen times and/or worse code because of inlining changes. IMPALA-7331 (CHAR codegen support functions) is also fixed because it was simpler to enable CHAR codegen within ScalarExpr than to carry forward the existing CHAR workarounds from ScalarFnCall. The CHAR-specific codegen support required in the scalar expr subsystem is very limited. StringVal intermediates are used everywhere. Only SlotRef actually operates on the different tuple layout, and the required codegen support for SlotRef already exists for UDA intermediates anyway. Testing: * Ran exhaustive tests. Perf: * Ran a basic insert benchmark, which went from 10.1s to 7.6s create table foo stored as parquet as select case when l_orderkey % 2 = 0 then 'aaa' else 'bbb' end from tpch30_parquet.lineitem; * Ran a basic CHAR expr test: set num_nodes=1; set mt_dop=1; select count() from lineitem where cast(l_linestatus as CHAR(2)) = 'O ' and cast(l_returnflag as CHAR(2)) = 'N ' The time spent in the scan went from 520ms to 220ms. Added perf regression test to tpcds-insert, similar to the manual benchmark. * Ran single-node TPC-H with large and small scale factors, to estimate impact on execution perf and query startup time, respectively.
+----------+-----------------------+---------+------------+------------+----------------+ \| Workload \| File Format \| Avg (s) \| Delta(Avg) \| GeoMean(s) \| Delta(GeoMean) \| +----------+-----------------------+---------+------------+------------+----------------+ \| TPCH(30) \| parquet / none / none \| 6.84 \| -0.18% \| 4.49 \| -0.31% \| +----------+-----------------------+---------+------------+------------+----------------+ +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Iters \| Median Diff(%) \| MW Zval \| Tval \| +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+ \| TPCH(30) \| TPCH-Q20 \| parquet / none / none \| 2.58 \| 2.47 \| +4.18% \| 1.29% \| 0.88% \| 5 \| +4.12% \| 2.31 \| 5.81 \| \| TPCH(30) \| TPCH-Q17 \| parquet / none / none \| 4.81 \| 4.61 \| +4.33% \| 2.18% \| 2.15% \| 5 \| +3.91% \| 1.73 \| 3.09 \| \| TPCH(30) \| TPCH-Q21 \| parquet / none / none \| 26.45 \| 26.16 \| +1.09% \| 0.37% \| 0.50% \| 5 \| +1.36% \| 2.02 \| 3.94 \| \| TPCH(30) \| TPCH-Q9 \| parquet / none / none \| 15.92 \| 15.75 \| +1.09% \| 2.87% \| 1.65% \| 5 \| +0.88% \| 0.29 \| 0.73 \| \| TPCH(30) \| TPCH-Q12 \| parquet / none / none \| 2.38 \| 2.35 \| +1.12% \| 1.64% \| 1.11% \| 5 \| +0.80% \| 1.15 \| 1.26 \| \| TPCH(30) \| TPCH-Q14 \| parquet / none / none \| 2.94 \| 2.91 \| +1.13% \| 7.68% \| 5.37% \| 5 \| -0.34% \| -0.29 \| 0.27 \| \| TPCH(30) \| TPCH-Q18 \| parquet / none / none \| 18.10 \| 18.02 \| +0.42% \| 2.70% \| 0.56% \| 5 \| +0.28% \| 0.29 \| 0.34 \| \| TPCH(30) \| TPCH-Q8 \| parquet / none / none \| 4.72 \| 4.72 \| -0.04% \| 1.20% \| 1.65% \| 5 \| +0.05% \| 0.00 \| -0.04 \| \| TPCH(30) \| TPCH-Q19 \| parquet / none / none \| 3.92 \| 3.93 \| -0.26% 
\| 1.08% \| 2.36% \| 5 \| +0.20% \| 0.58 \| -0.23 \| \| TPCH(30) \| TPCH-Q6 \| parquet / none / none \| 1.27 \| 1.27 \| -0.28% \| 0.22% \| 0.88% \| 5 \| +0.09% \| 0.29 \| -0.68 \| \| TPCH(30) \| TPCH-Q16 \| parquet / none / none \| 2.64 \| 2.65 \| -0.45% \| 1.65% \| 0.65% \| 5 \| -0.24% \| -0.58 \| -0.57 \| \| TPCH(30) \| TPCH-Q22 \| parquet / none / none \| 3.10 \| 3.13 \| -0.76% \| 1.47% \| 1.12% \| 5 \| -0.21% \| -0.29 \| -0.93 \| \| TPCH(30) \| TPCH-Q2 \| parquet / none / none \| 1.20 \| 1.21 \| -0.80% \| 2.26% \| 2.47% \| 5 \| -0.82% \| -1.15 \| -0.53 \| \| TPCH(30) \| TPCH-Q4 \| parquet / none / none \| 1.97 \| 1.99 \| -1.37% \| 1.84% \| 3.21% \| 5 \| -0.47% \| -0.58 \| -0.83 \| \| TPCH(30) \| TPCH-Q13 \| parquet / none / none \| 11.53 \| 11.63 \| -0.91% \| 0.46% \| 0.49% \| 5 \| -0.95% \| -2.02 \| -3.08 \| \| TPCH(30) \| TPCH-Q10 \| parquet / none / none \| 5.13 \| 5.21 \| -1.51% \| 2.24% \| 4.05% \| 5 \| -0.94% \| -0.58 \| -0.73 \| \| TPCH(30) \| TPCH-Q5 \| parquet / none / none \| 3.61 \| 3.66 \| -1.40% \| 0.66% \| 0.79% \| 5 \| -1.33% \| -1.73 \| -3.05 \| \| TPCH(30) \| TPCH-Q7 \| parquet / none / none \| 19.42 \| 19.71 \| -1.52% \| 1.34% \| 1.39% \| 5 \| -1.22% \| -1.44 \| -1.76 \| \| TPCH(30) \| TPCH-Q3 \| parquet / none / none \| 5.08 \| 5.15 \| -1.49% \| 1.34% \| 0.73% \| 5 \| -1.35% \| -1.44 \| -2.20 \| \| TPCH(30) \| TPCH-Q15 \| parquet / none / none \| 3.42 \| 3.49 \| -1.92% \| 0.93% \| 1.47% \| 5 \| -1.53% \| -1.15 \| -2.49 \| \| TPCH(30) \| TPCH-Q11 \| parquet / none / none \| 1.15 \| 1.19 \| -3.17% \| 2.27% \| 1.95% \| 5 \| -4.21% \| -1.15 \| -2.41 \| \| TPCH(30) \| TPCH-Q1 \| parquet / none / none \| 9.26 \| 9.63 \| -3.85% \| 0.62% \| 0.59% \| 5 \| -3.78% \| -2.31 \| -10.25 \| +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+ Cluster Name: UNKNOWN Lab Run Info: UNKNOWN Impala Version: impalad version 3.2.0-SNAPSHOT RELEASE () Baseline 
Impala Version: impalad version 3.2.0-SNAPSHOT RELEASE (2019-03-19) +----------+-----------------------+---------+------------+------------+----------------+ \| Workload \| File Format \| Avg (s) \| Delta(Avg) \| GeoMean(s) \| Delta(GeoMean) \| +----------+-----------------------+---------+------------+------------+----------------+ \| TPCH(2) \| parquet / none / none \| 0.90 \| -0.08% \| 0.80 \| -0.05% \| +----------+-----------------------+---------+------------+------------+----------------+ +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+ \| Workload \| Query \| File Format \| Avg(s) \| Base Avg(s) \| Delta(Avg) \| StdDev(%) \| Base StdDev(%) \| Iters \| Median Diff(%) \| MW Zval \| Tval \| +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+ \| TPCH(2) \| TPCH-Q18 \| parquet / none / none \| 1.22 \| 1.19 \| +1.93% \| 3.81% \| 4.46% \| 20 \| +3.34% \| 1.62 \| 1.46 \| \| TPCH(2) \| TPCH-Q10 \| parquet / none / none \| 0.74 \| 0.73 \| +1.97% \| 3.36% \| 2.94% \| 20 \| +0.97% \| 1.88 \| 1.95 \| \| TPCH(2) \| TPCH-Q11 \| parquet / none / none \| 0.49 \| 0.48 \| +1.91% \| 6.19% \| 4.64% \| 20 \| +0.25% \| 0.95 \| 1.09 \| \| TPCH(2) \| TPCH-Q4 \| parquet / none / none \| 0.43 \| 0.43 \| +1.99% \| 6.26% \| 5.86% \| 20 \| +0.15% \| 0.92 \| 1.03 \| \| TPCH(2) \| TPCH-Q15 \| parquet / none / none \| 0.50 \| 0.49 \| +1.82% \| 7.32% \| 6.35% \| 20 \| +0.26% \| 1.01 \| 0.83 \| \| TPCH(2) \| TPCH-Q1 \| parquet / none / none \| 0.98 \| 0.97 \| +0.79% \| 4.64% \| 2.73% \| 20 \| +0.36% \| 0.77 \| 0.65 \| \| TPCH(2) \| TPCH-Q19 \| parquet / none / none \| 0.83 \| 0.83 \| +0.65% \| 3.33% \| 2.80% \| 20 \| +0.44% \| 2.18 \| 0.67 \| \| TPCH(2) \| TPCH-Q14 \| parquet / none / none \| 0.62 \| 0.62 \| +0.97% \| 2.86% \| 1.00% \| 20 \| +0.04% \| 0.13 \| 1.42 \| \| TPCH(2) \| 
TPCH-Q3 \| parquet / none / none \| 0.88 \| 0.87 \| +0.57% \| 2.17% \| 1.74% \| 20 \| +0.29% \| 1.15 \| 0.92 \| \| TPCH(2) \| TPCH-Q12 \| parquet / none / none \| 0.53 \| 0.53 \| +0.27% \| 4.58% \| 5.78% \| 20 \| +0.46% \| 1.47 \| 0.16 \| \| TPCH(2) \| TPCH-Q17 \| parquet / none / none \| 0.72 \| 0.72 \| +0.15% \| 3.64% \| 5.55% \| 20 \| +0.21% \| 0.86 \| 0.10 \| \| TPCH(2) \| TPCH-Q21 \| parquet / none / none \| 2.05 \| 2.05 \| +0.21% \| 1.99% \| 2.37% \| 20 \| +0.01% \| 0.25 \| 0.30 \| \| TPCH(2) \| TPCH-Q5 \| parquet / none / none \| 1.28 \| 1.27 \| +0.24% \| 1.61% \| 1.80% \| 20 \| -0.02% \| -0.57 \| 0.44 \| \| TPCH(2) \| TPCH-Q13 \| parquet / none / none \| 1.27 \| 1.27 \| -0.34% \| 1.69% \| 1.83% \| 20 \| -0.20% \| -1.65 \| -0.61 \| \| TPCH(2) \| TPCH-Q7 \| parquet / none / none \| 1.72 \| 1.73 \| -0.55% \| 2.40% \| 1.69% \| 20 \| -0.03% \| -0.42 \| -0.83 \| \| TPCH(2) \| TPCH-Q8 \| parquet / none / none \| 1.27 \| 1.28 \| -0.68% \| 3.10% \| 3.89% \| 20 \| -0.06% \| -0.54 \| -0.62 \| \| TPCH(2) \| TPCH-Q6 \| parquet / none / none \| 0.36 \| 0.36 \| -0.84% \| 0.79% \| 3.51% \| 20 \| -0.07% \| -0.36 \| -1.04 \| \| TPCH(2) \| TPCH-Q2 \| parquet / none / none \| 0.65 \| 0.65 \| -1.17% \| 4.76% \| 5.99% \| 20 \| -0.05% \| -0.25 \| -0.69 \| \| TPCH(2) \| TPCH-Q9 \| parquet / none / none \| 1.59 \| 1.62 \| -2.01% \| 1.45% \| 5.12% \| 20 \| -0.16% \| -1.24 \| -1.69 \| \| TPCH(2) \| TPCH-Q20 \| parquet / none / none \| 0.68 \| 0.69 \| -1.73% \| 4.35% \| 4.43% \| 20 \| -0.49% \| -1.74 \| -1.25 \| \| TPCH(2) \| TPCH-Q22 \| parquet / none / none \| 0.38 \| 0.40 \| -2.89% \| 7.42% \| 6.39% \| 20 \| -0.21% \| -0.66 \| -1.34 \| \| TPCH(2) \| TPCH-Q16 \| parquet / none / none \| 0.59 \| 0.62 \| -4.01% \| 6.33% \| 5.83% \| 20 \| -4.72% \| -1.39 \| -2.13 \| +----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+ Change-Id: I839d7a3a2f5e1309c33a1f66013ef11628c5dc11 Reviewed-on: 
http://gerrit.cloudera.org:8080/12797 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-05-15 22:34:28 +00:00
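The dispatch idea in the commit above, an inline accessor in the base class that prefers a codegen'd function pointer and otherwise falls back to a virtual interpreted path, might look roughly like this. `ToyExpr`, `AddConstExpr` and `CompiledAddConst` are hypothetical stand-ins; the real ScalarExpr interface, its value types and the LLVM-generated functions are considerably more involved.

```cpp
#include <cstdint>

class ToyExpr {
 public:
  // Signature of a (pretend) JIT-compiled function for this expression.
  using ComputeFn = int64_t (*)(const ToyExpr* expr, int64_t row);

  virtual ~ToyExpr() = default;

  // Inline switch between the two execution paths. When the pointer is set,
  // no virtual call is made.
  int64_t GetVal(int64_t row) const {
    if (codegend_compute_fn_ != nullptr) return codegend_compute_fn_(this, row);
    return GetValInterpreted(row);
  }

  // In this sketch the pointer is set by hand; only "entry point"
  // expressions would get one.
  void SetCodegendComputeFn(ComputeFn fn) { codegend_compute_fn_ = fn; }

 protected:
  virtual int64_t GetValInterpreted(int64_t row) const = 0;

 private:
  ComputeFn codegend_compute_fn_ = nullptr;
};

// A trivial expression: row value plus a constant.
class AddConstExpr : public ToyExpr {
 public:
  explicit AddConstExpr(int64_t c) : c_(c) {}
  int64_t constant() const { return c_; }

 protected:
  int64_t GetValInterpreted(int64_t row) const override { return row + c_; }

 private:
  int64_t c_;
};

// Stand-in for generated code: an ordinary function with the same semantics.
inline int64_t CompiledAddConst(const ToyExpr* expr, int64_t row) {
  return row + static_cast<const AddConstExpr*>(expr)->constant();
}
```

Keeping the pointer in the base class is what lets any node in the tree, not only function-call nodes, be an entry point; leaving it null for most nodes reflects the commit's concern about generating too many entry points.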
Todd Lipcon	e2ead7f857	expr-test: use gtest parameterization Instead of running the tests three times with different flags from main(), this uses gtest's parameterization feature to accomplish the same. The advantage here is that we end up with different test names for each of the runs. Additionally, this moves the setup code into a proper setup method so that executing expr-test --gtest_list_tests doesn't waste time starting a cluster. This is prep work towards adding multi-threaded test execution for long-running tests. expr-test seems to currently be one of the worst offenders. Change-Id: Idc9fb24ad62b4aa2e120a99d74ae04bb221c034b Reviewed-on: http://gerrit.cloudera.org:8080/13289 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-05-11 01:28:58 +00:00
Zoltan Borok-Nagy	d423979866	IMPALA-5843: Use page index in Parquet files to skip pages This commit implements page filtering based on the Parquet page index. The read and evaluation of the page index is done by the HdfsParquetScanner. At first, we determine the row ranges we are interested in, and based on the row ranges we determine the candidate pages for each column that we are reading. We still issue one ScanRange per column chunk, but we specify sub-ranges that store the candidate pages, i.e. we don't read the whole column chunk, but only fractions of it. Pages are not aligned across column chunks, i.e. page #2 of column A might store completely different rows than page #2 of column B. It means we need to implement some kind of row-skipping logic when we read the data pages. This logic is implemented in BaseScalarColumnReader and ScalarColumnReader. Collection column readers know nothing about page filtering. Page filtering can be turned off by setting the query option 'read_parquet_page_index' to false. Testing: * added some unit tests for the row range and page selection logic * generated various Parquet files with Parquet-MR * enabled Page index writing and wrote selective queries against tables written by Impala. Current tests are likely to use page filtering transparently. Performance: * Measured locally, observed 3x to 20x speedup for selective queries. The speedup was proportional to the IO operations that needed to be done. * The TPCH benchmark didn't show a significant performance change. It is not a surprise since the data is not being sorted in any useful way. So the main goal was to not introduce perf regression. TODO: * measure performance for remote reads Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Reviewed-on: http://gerrit.cloudera.org:8080/12065 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-05-10 11:46:38 +00:00
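The step from row ranges to candidate pages described above can be sketched independently of the scanner. The `RowRange` struct and the `SelectCandidatePages` function are hypothetical names, and the input is simplified to a list of first-row indexes per page, which is an assumption about how the page index would be consumed rather than a copy of HdfsParquetScanner's logic.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive range of row indexes within a row group.
struct RowRange {
  int64_t first;
  int64_t last;
};

// 'page_first_rows[i]' is the index of the first row stored in page i, in
// ascending order; 'num_rows' is the row count of the column chunk.
// Returns the indexes of the pages that overlap at least one range.
inline std::vector<size_t> SelectCandidatePages(
    const std::vector<int64_t>& page_first_rows, int64_t num_rows,
    const std::vector<RowRange>& ranges) {
  std::vector<size_t> candidates;
  for (size_t i = 0; i < page_first_rows.size(); ++i) {
    const int64_t page_first = page_first_rows[i];
    // A page ends right before the next page starts, or at the end of the chunk.
    const int64_t page_last =
        (i + 1 < page_first_rows.size() ? page_first_rows[i + 1] : num_rows) - 1;
    for (const RowRange& r : ranges) {
      if (r.first <= page_last && page_first <= r.last) {
        candidates.push_back(i);
        break;  // One overlapping range is enough to keep the page.
      }
    }
  }
  return candidates;
}
```

For a range covering rows 210 to 260 in a 300-row chunk, a column with pages starting at rows 0, 100 and 200 needs only page 2, while a column with pages starting at 0, 50 and 250 needs pages 1 and 2. This is the misalignment that makes row skipping inside a page necessary.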
Todd Lipcon	1e49b6a6b4	IMPALA-2029. Implement our own getJNIEnv equivalent The libhdfs getJNIEnv function was made non-exported in Hadoop 2. For a while in CDH we were hacking around this with a vendor-specific patch that re-exported it. However, that was always a bit annoying to maintain our own patch each time we rebased to new versions, etc. Earlier attempts to solve this issue turned up strange bugs around coordinating whether we or libhdfs were responsible for attaching and detaching to the JVM/JNI environment. So, this patch takes a new approach: rather than directly creating/attaching to the JVM, we just look for an existing attached environment. If there isn't one, we call some simple libhdfs function which forces it to attach the current thread, and then try again. Performance is maintained (or maybe improved) by adding a thread-local cache of the attached JVM, with an inlined fast-path. I tested this with a CDP build of Hadoop which doesn't have the getJNIEnv workaround. Prior to this fix, I wasn't able to run Java tests against that build because it would fail to link getJNIEnv() at runtime. Now, they pass. Change-Id: I766bcfd70addb00e9fd8a860e89c2a1c5d4c71d5 Reviewed-on: http://gerrit.cloudera.org:8080/13275 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-05-10 01:50:06 +00:00
Attila Jeges	b5805de3e6	IMPALA-7368: Add initial support for DATE type DATE values describe a particular year/month/day in the form yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a time of day component. The range of values supported for the DATE type is 0000-01-01 to 9999-12-31. This initial DATE type support covers TEXT and HBASE fileformats only. 'DateValue' is used as the internal type to represent DATE values. The changes are as follows: - Support for DATE literal syntax. - Explicit casting between DATE and other types (note that invalid casts will fail with an error just like invalid DECIMAL_V2 casts, while failed casts to other types do not lead to a warning or error): - from STRING to DATE. The string value must be formatted as yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory, the time component is optional. If the time component is present, it will be truncated silently. - from DATE to STRING. The resulting string value is formatted as yyyy-MM-dd. - from TIMESTAMP to DATE. The source timestamp's time of day component is ignored. - from DATE to TIMESTAMP. The target timestamp's time of day component is set to 00:00:00. - Implicit casting between DATE and other types: - from STRING to DATE if the source string value is used in a context where a DATE value is expected. - from DATE to TIMESTAMP if the source date value is used in a context where a TIMESTAMP value is expected. - Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP implicit conversions are now all possible, the existing function overload resolution logic is not adequate anymore. For example, it resolves the if(false, '2011-01-01', DATE '1499-02-02') function call to the if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded function, instead of the if(BOOLEAN, DATE, DATE) version.
This is clearly wrong, so the function overload resolution logic had to be changed to resolve function calls to the best-fit overloaded function definition if there are multiple applicable candidates. An overloaded function definition is an applicable candidate for a function call if each actual parameter in the function call either matches the corresponding formal parameter's type (without casting) or is implicitly castable to that type. When looking for the best-fit applicable candidate, a parameter match score (i.e. the number of actual parameters in the function call that match their corresponding formal parameter's type without casting) is calculated and the applicable candidate with the highest parameter match score is chosen. There's one more issue that the new resolution logic has to address: if two applicable candidates have the same parameter match score and the only difference between the two is that the first one requires a STRING -> TIMESTAMP implicit cast for some of its parameters while the second one requires a STRING -> DATE implicit cast for the same parameters, then the first candidate has to be chosen so as not to break backward compatibility. E.g. the year('2019-02-15') function call must resolve to year(TIMESTAMP) instead of year(DATE). Note that year(DATE) is not implemented yet, so this is not an issue at the moment but it will be in the future. When the resolution algorithm considers overloaded function definitions, it first orders them lexicographically by the types in their parameter lists. To ensure the backward compatible behavior, the PrimitiveType.DATE enum value has to come after PrimitiveType.TIMESTAMP. - Codegen infrastructure changes for expression evaluation. - 'IS [NOT] NULL' and '[NOT] IN' predicates. - Common comparison operators (including the 'BETWEEN' operator). - Infrastructure changes for built-in functions. - Some built-in functions: conditional, aggregate, analytical and math functions. - C++ UDF/UDA support.
- Support partitioning and grouping by DATE. - Beeswax, HiveServer2 support. These items are tightly coupled and it makes sense to implement them in one change-set. Testing: - A new partitioned TEXT table 'functional.date_tbl' (and the corresponding HBASE table 'functional_hbase.date_tbl') was introduced for DATE-related tests. - BE and FE tests were extended to cover the DATE type. - E2E tests: - since the DATE type is supported for TEXT and HBASE fileformats only, most DATE tests were implemented separately in tests/query_test/test_date_queries.py. Note that this change-set is not a complete DATE type implementation, but it lays the foundation for future work: - Add date support to the random query generator. - Implement a complete set of built-in functions. - Add Parquet support. - Add Kudu support. - Optionally support Avro and ORC. For further details, see IMPALA-6169. Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f Reviewed-on: http://gerrit.cloudera.org:8080/12481 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-04-23 13:33:57 +00:00
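The best-fit overload resolution described in IMPALA-7368 can be sketched roughly as follows. This is a simplified C++ illustration, not Impala's frontend code: the `Type` enum, `ImplicitlyCastable`, `MatchScore`, and `Resolve` are hypothetical names, and only the three implicit casts named in the commit message are modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, reduced type enum. DATE is declared after TIMESTAMP so that
// TIMESTAMP overloads sort first, as the commit message requires.
enum class Type { BOOLEAN, STRING, TIMESTAMP, DATE };

using Signature = std::vector<Type>;

// Implicit casts modeled in this sketch (assumption: no others exist).
bool ImplicitlyCastable(Type from, Type to) {
  if (from == Type::STRING) return to == Type::TIMESTAMP || to == Type::DATE;
  if (from == Type::DATE) return to == Type::TIMESTAMP;
  return false;
}

// Returns the number of parameters that match without casting, or -1 if the
// candidate is not applicable to the call.
int MatchScore(const Signature& formal, const Signature& actual) {
  if (formal.size() != actual.size()) return -1;
  int score = 0;
  for (std::size_t i = 0; i < formal.size(); ++i) {
    if (formal[i] == actual[i]) {
      ++score;
    } else if (!ImplicitlyCastable(actual[i], formal[i])) {
      return -1;
    }
  }
  return score;
}

// Picks the applicable candidate with the highest match score. Candidates are
// first ordered lexicographically by parameter types, and only a strictly
// higher score replaces the current best, so ties go to the earlier candidate.
std::optional<Signature> Resolve(std::vector<Signature> candidates,
                                 const Signature& actual) {
  std::sort(candidates.begin(), candidates.end());
  std::optional<Signature> best;
  int best_score = -1;
  for (const Signature& candidate : candidates) {
    const int score = MatchScore(candidate, actual);
    if (score > best_score) {
      best_score = score;
      best = candidate;
    }
  }
  return best;
}
```

Under these assumptions, a call shaped like if(BOOLEAN, STRING, DATE) picks the (BOOLEAN, DATE, DATE) overload (two exact matches versus one), while year(STRING) ties at a score of zero and falls back to year(TIMESTAMP) because that signature sorts first.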
Todd Lipcon	209a350aae	Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp This re-lands commit `6e8c330f40` which was reverted in `d3428a58d8`. The revert was due to an assumption that this commit depended on the new version of re2 (which was correctly reverted due to a toolchain issue). In fact this commit does not depend on any toolchain changes. Original commit message follows -------------------------------- This changes the built-in regexp-related UDFs to use THREAD_LOCAL re2::RE instances instead of FRAGMENT_LOCAL. Although re2::RE is thread-safe, it achieves that thread safety through a certain amount of locking. Using thread-local regexps improves performance substantially. I ran a simple test query: select sum(l_linenumber) from item_20x where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0 on a table with three underlying parquet files (thus getting 3 scanner threads). Prior to this change, the query took ~60 seconds and burned 2m16sec CPU time. With this change, it took ~19sec and 43s CPU time. For a query with more scanner threads, the improvement should be even more dramatic. The only potential downside of this change is slightly increased memory consumption by having one RE instance per thread, but the REs themselves should be small relative to all of the other per-scanner-thread memory. Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e Reviewed-on: http://gerrit.cloudera.org:8080/12845 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-03-26 04:37:49 +00:00
Lars Volker	d3428a58d8	Revert "IMPALA-5393. Use THREAD_LOCAL state for regexp" This depends on a change which switches to a toolchain version that does not have packages for Ubuntu 18.04. Reverting both now to unblock everyone. This reverts commit `6e8c330f40`. Change-Id: Id35a90a58e3f775031a0f147b042ccd46d77e24b Reviewed-on: http://gerrit.cloudera.org:8080/12791 Reviewed-by: Lars Volker <lv@cloudera.com> Tested-by: Lars Volker <lv@cloudera.com>	2019-03-19 14:52:32 +00:00
Todd Lipcon	6e8c330f40	IMPALA-5393. Use THREAD_LOCAL state for regexp This changes the built-in regexp-related UDFs to use THREAD_LOCAL re2::RE instances instead of FRAGMENT_LOCAL. Although re2::RE is thread-safe, it achieves that thread safety through a certain amount of locking. Using thread-local regexps improves performance substantially. I ran a simple test query: select sum(l_linenumber) from item_20x where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0 on a table with three underlying parquet files (thus getting 3 scanner threads). Prior to this change, the query took ~60 seconds and burned 2m16sec CPU time. With this change, it took ~19sec and 43s CPU time. For a query with more scanner threads, the improvement should be even more dramatic. The only potential downside of this change is slightly increased memory consumption by having one RE instance per thread, but the REs themselves should be small relative to all of the other per-scanner-thread memory. Change-Id: Ibc331151a302e755701cb08adb3e6f289d54c3a6 Reviewed-on: http://gerrit.cloudera.org:8080/12772 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Todd Lipcon <todd@apache.org>	2019-03-18 23:55:57 +00:00
Philip Zeyliger	214f61a180	IMPALA-8250: Clean up JNI warnings. Using LIBHDFS_OPTS+="-Xcheck:jni" revealed a handful of warnings related to (a) checking for exceptions and (b) leaking local references. Checking for exceptions required sprinkling RETURN_ERROR_IF_EXC left and right. I chose not to expand the JniCall infrastructure to handle this more generally at the moment. The leaky local references are a bit harder. In the logs, they show up as "WARNING: JNI local refs: 2597, exceeds capacity: 35" or similar. A few of these errors seem to be not in our code. The ones that I've found in our code stemmed from HBaseTableScanner::GetRowKey(): this method uses local references and wasn't returning them. Using a JniLocalFrame seems to have taken care of the warnings. I have added code to skip test_large_strings when JNI checking is enabled. This test takes forever (presumably because JNI is checking bounds on strings very aggressively), and times out. The time out also causes some metric-related checks to fail (since a query is still in flight). Debugging this required customizing my JDK to give stack traces when these warnings occurred. The following diff facilitated this. diff -r 76a9c9cf14f1 src/share/vm/prims/jniCheck.cpp --- a/src/share/vm/prims/jniCheck.cpp Tue Jan 15 10:43:31 2019 +0000 +++ b/src/share/vm/prims/jniCheck.cpp Wed Feb 27 11:57:13 2019 -0800 @@ -143,11 +143,30 @@ static const char * fatal_instance_field_mismatch = "Field type (instance) mismatch in JNI get/set field operations"; static const char * fatal_non_string = "JNI string operation received a non-string"; +// thisone: whether to print every time, or maybe, depending on future +// how many future stacks we want printed (totally racy); helps catch +// missing exception handling if there's a way to tickle that code +// reliably. +static inline void dump_native_stack(JavaThread* thr, bool thisone, int future) { + static int fut_stacks = 0; // racy! 
+ if (fut_stacks > 0) { + thisone = true; + fut_stacks--; + } + if (future > 0) fut_stacks = future; + if (thisone) { + frame fr = os::current_frame(); + char buf[6000]; + tty->print_cr("Thread: %s %d", thr->get_thread_name(), thr->osthread()->thread_id()); + print_native_stack(tty, fr, thr, buf, sizeof(buf)); + } +} // When in VM state: static void ReportJNIWarning(JavaThread* thr, const char msg) { tty->print_cr("WARNING in native method: %s", msg); thr->print_stack(); + dump_native_stack(thr, true, 0); } // When in NATIVE state: @@ -199,11 +218,14 @@ tty->print_cr("WARNING in native method: JNI call made without checking exceptions when required to from %s", thr->get_pending_jni_exception_check()); thr->print_stack(); + dump_native_stack(thr, true, 10); ) thr->clear_pending_jni_exception_check(); // Just complain once } } + + /* * Add to the planned number of handles. I.e. plus current live & warning threshold */ @@ -254,9 +276,12 @@ tty->print_cr("WARNING: JNI local refs: %zu, exceeds capacity: %zu", live_handles, planned_capacity); thr->print_stack(); + dump_native_stack(thr, true, 0); ) // Complain just the once, reset to current + warn threshold add_planned_handle_capacity(handles, 0); + } else { + dump_native_stack(thr, false, 0); } } Change-Id: Idd1709f749a764c1d947704bc64306493863b45f Reviewed-on: http://gerrit.cloudera.org:8080/12660 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-03-08 03:35:09 +00:00
Philip Zeyliger	0b7c964545	Adding hostname to Disk I/O errors. I recently ran into some queries that failed like so: WARNINGS: Disk I/O error: Could not open file: /data/...: Error(5): Input/output error These warnings were in the profile, but I had to cross-reference impalad logs to figure out which machine had the broken disk. In this commit, I've sprinkled GetBackendString() to include it. Change-Id: Ib977d2c0983ef81ab1338de090239ed57f3efde2 Reviewed-on: http://gerrit.cloudera.org:8080/12402 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-02-20 23:14:36 +00:00
Andrew Sherman	7707eb0417	IMPALA-7657: Codegen IsNotEmptyPredicate and ValidTupleIdExpr. These two classes evaluate scalar expressions. Previously codegen was done by calling ScalarExpr::GetCodegendComputeFnWrapper which generates a static method that calls the scalar expression evaluation method. Make this more efficient by generating code which is customized using information available at codegen time. Add new cross-compiled files null-literal-ir.cc slot-ref-ir.cc IsNotEmptyPredicate works by getting a CollectionVal object from the single child Expr node, and counting its tuples. At codegen time we know the type and value of the child node. Generate a call to a node-specific non-virtual cross-compiled method to get the CollectionVal object from the child. Then generate a code that examines the CollectionVal and returns an IntVal. A ValidTupleIdExpr node contains a vector of tuple ids. It works by probing each row for the tuple ids in the vector to find a non-null tuple. At codegen time we know the vector of tuple ids. We unroll the loop through the tuple ids, generating code that evaluates if the tuple is non-null, and returns the tuple id if/when a non-null tuple is found. IMPALA-7657 also requests replacing GetCodegendComputeFnWrapper() in TupleIsNullPredicate. In the current Impala code this method is never called. This is because TupleIsNullPredicate is always wrapped in an IfExpr. This is always codegen'd by IfExpr's GetCodegendComputeFnWrapper() method. There is a separate Jira IMPALA-7655 to improve codegen of IfExpr. Minor corrections: Correct the link to llvm tutorial in LlvmCodegen. PERFORMANCE: I tested performance on a local mini-cluster. I wrote some pathological queries to test the new code. The new codegen'd code is very similar in performance. Both ValidTupleIdExpr and IsNotEmptyPredicate seem very slightly faster than the old code. Overall these changes are not purely for performance but to move away from GetCodegendComputeFnWrapper. 
TESTING: The changed scalar expressions are well exercised by current tests. Ran exhaustive end-to-end tests. Change-Id: Ifb87b9e3b879c278ce8638d97bcb320a7555a6b3 Reviewed-on: http://gerrit.cloudera.org:8080/12068 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-02-07 22:14:15 +00:00
poojanilangekar	ae96a9fb19	IMPALA-8151: Use sizeof() in HiveUdfCall to specify non-primitive type's size Previously, data type sizes were hardcoded in HiveUdfCall::Evaluate(). Since IMPALA-7367 removed the padding from STRING and VARCHAR types, it could read past the end of the actual value and cause a crash. This change replaces the hardcoded values with sizeof() calls to determine the size of non-primitive types (STRING, VARCHAR and TIMESTAMP) to avoid similar issues in the future. Testing: Ran test_udfs.py on an ASAN build. Added logs to manually verify the size of bytes copied. Change-Id: I919c330546fa86b474ab66245b20ceb1f5525b41 Reviewed-on: http://gerrit.cloudera.org:8080/12355 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-02-06 04:19:20 +00:00
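The idea behind the IMPALA-8151 fix can be illustrated with a small sketch. It is not the code from `HiveUdfCall::Evaluate()`: `PackedString` and `CopySlot` are hypothetical names, and `__attribute__((packed))` is a GCC/Clang extension assumed here to mimic a 12-byte slot on a 64-bit platform.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a packed string slot: pointer plus 32-bit length,
// with no trailing padding (12 bytes when pointers are 8 bytes wide).
struct __attribute__((packed)) PackedString {
  char* ptr;
  int32_t len;
};

// Copies one slot into an input buffer. The byte count comes from sizeof(T),
// so it follows the type if its layout changes; a hardcoded 16 would read
// four bytes past the end of a 12-byte PackedString.
template <typename T>
std::size_t CopySlot(const T& slot, uint8_t* dst) {
  std::memcpy(dst, &slot, sizeof(T));
  return sizeof(T);
}
```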
Attila Jeges	3338bae608	IMPALA-8043: Fix BE test failures related to SystemV timezones. This is a fix for the following issue: 1. Some BE tests (e.g. ExprTest.TimestampFunctions) use the system's local timezone but run against a test timezone db (instead of the system's timezone db). 2. On some Linux installations /usr/share/zoneinfo contains symlinks to files in the /usr/share/zoneinfo/SystemV directory (e.g. /usr/share/zoneinfo/America/Los_Angeles is a symlink to ../SystemV/PST8PDT). 3. The 'SystemV' directory is not part of the test timezone db, since it is obsolete and excluded by default. Consequently, if the system's local timezone is set to America/Los_Angeles, BE tests won't find the corresponding timezone file in the test timezone db. BE tests will default to UTC, which will break some of them. This change sets the local timezone explicitly for failing BE tests, so they don't depend on the system's local timezone. It also adds the 'SystemV' directory to the test timezone db to avoid similar issues in the future. Change-Id: I9288cd24c8af0c059e55d47c86bd92eaf0075681 Reviewed-on: http://gerrit.cloudera.org:8080/12199 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-01-15 17:04:55 +00:00
Tim Armstrong	928c5c261b	Fix some warnings on GCC7 I tried compiling with GCC7 to see what warnings popped up. Fix some ambiguous else warnings resulting from gtest macros. See https://github.com/google/googletest/issues/1119. Add a missing include that broke compilation on the release build. Fix some warnings that detect missing returns when there is a DCHECK (these warnings already occurred in release builds, but they now happen in gcc7 debug builds). Change-Id: I39a12bc5ed6957c147b7f0dba85c7687cc989439 Reviewed-on: http://gerrit.cloudera.org:8080/12132 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2019-01-03 22:07:56 +00:00
Paul Rogers	27577dd652	IMPALA-7902: NumericLiteral fixes, refactoring The work to clean up the rewriter logic must start with a stable AST, which must start with sprucing up some issues with the leaf nodes. This CR tackles the NumericLiteral used to hold numbers. IMPALA-7896: Literals should not need explicit analyze step Partial fix: removes the need to analyze a numeric literal: analyze() is a no-op. This eliminates the need to do a "fake" analysis with a null analyzer: numeric literals are now created analyzed. This is useful because the catalog module creates numeric literals outside of a query (and outside of an analyzer.) A literal is immutable except for type. Modified the constructor to set the type and cost, then mark the node as analyzed. A later call to analyze() has nothing to do. Code that created and dummy-analyzed numeric literals changed to use static create() methods, resulting in simpler literal creation, and eliminating the special "analyzer == null" checks in analyze(). IMPALA-7886: NumericLiteral constructor fails to round values to Decimal type IMPALA-7887: NumericLiteral fails to detect numeric overflow IMPALA-7888: Incorrect NumericLiteral overflow checks for FLOAT, DOUBLE IMPALA-7891: Analyzer does not detect numeric overflow in CAST IMPALA-7894: Parser does not catch double overflow These are all caused by the somewhat cluttered state of the numeric range check code after years of incremental changes. This patch centralizes all checks into a series of constants and methods for uniformity. All values are set in the constructor which now checks that the value is legal for the type. Cast operations verify that the cast is valid. Multiple semi-parallel versions of the same logic are replaced by calls to a single implementation. The numeric checks now follow the SQL standard which says that implementations should fail if a cast would truncate the most significant digits, but round when truncating the least significant.
IMPALA-7865: Repeated type widening of arithmetic expressions Partial fix. Replaces the "is explicit cast" flag in the numeric literal with the explicit type. This allows resetting an implicit type back to the explicit type if an arithmetic expression is analyzed multiple times. A later patch will feed this type information into the type inference mechanism to complete the fix. Finally, adds a set of new exceptions that begin to unify error reporting. These handle casts (SqlCastException), value validation (InvalidValueException) and unsupported features (UnsupportedFeatureException.) These all derive from AnalysisException for backward compatibility. Tests use the new exceptions to check for expected errors rather than parsing text strings (which tend to change.) Testing: * Added unit tests just for numeric literals. Refactored code to simplify the tests. * Added a test case for the obscure case in Decimal V1 of an implicit cast overflow. * The depth-check tests needed one extra level of nesting to trigger failure. * Ran all FE tests. Change-Id: I484600747b2871d3a6fe9153751973af9a8534f2 Reviewed-on: http://gerrit.cloudera.org:8080/12001 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-12-28 23:25:06 +00:00
Greg Rahn	ba9b78c103	IMPALA-7759: Add Levenshtein edit distance built-in function This patch adds new built-in functions to calculate Levenshtein edit distance. Implemented as levenshtein() to match PostgreSQL in both functionality and name, and also added the le_dst() alias for Netezza compatibility. Note, however, that levenshtein() differs in functionality: if either value is NULL or both values are NULL, levenshtein() returns NULL, whereas Netezza's le_dst() returns the length of the non-NULL value, or 0 if both values are NULL. Testing: - Added unit tests to expr-test.cc - Manual test on 966289 string pairs and results match PostgreSQL - Added changes to qgen tests for PostgreSQL comparison Change-Id: I549d33ab7cebfa10db2934461c8ec91e2cc1cdcb Reviewed-on: http://gerrit.cloudera.org:8080/11793 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-12-02 10:39:44 +00:00
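For readers unfamiliar with the metric in IMPALA-7759, here is a minimal sketch of Levenshtein distance and of the two NULL conventions the commit message contrasts. It is not the Impala implementation: `EditDistance`, `LevenshteinNullPropagating`, and `LeDstNetezzaStyle` are hypothetical names, SQL NULL is modeled with `std::optional`, and the distance is computed over bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Classic dynamic-programming edit distance using two rolling rows.
int EditDistance(const std::string& a, const std::string& b) {
  std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const int substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      // Minimum over deletion, insertion, and substitution (or match).
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, substitution});
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}

using NullableString = std::optional<std::string>;

// NULL-propagating behavior described for levenshtein(): any NULL input
// yields NULL.
std::optional<int> LevenshteinNullPropagating(const NullableString& a,
                                              const NullableString& b) {
  if (!a || !b) return std::nullopt;
  return EditDistance(*a, *b);
}

// Behavior the commit message attributes to Netezza's le_dst(): a NULL input
// acts like an empty string, so the result is the other value's length.
int LeDstNetezzaStyle(const NullableString& a, const NullableString& b) {
  return EditDistance(a.value_or(""), b.value_or(""));
}
```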
poojanilangekar	2a4835cfba	IMPALA-7367: Pack StringValue and CollectionValue slots This change packs StringValue and CollectionValue slots to ensure they now occupy 12 bytes instead of 16 bytes. This reduces the memory requirements and improves the performance. Since Kudu tuples are populated using a memcopy, 4 bytes of padding was added to StringSlots in Kudu tables. Testing: Ran core tests. Added static asserts to ensure the value sizes are as expected. Performance tests on TPCH-40 produced 3.96% improvement. Change-Id: I32f3b06622c087e4aa288e8db1bf4581b10d386a Reviewed-on: http://gerrit.cloudera.org:8080/11599 Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>	2018-11-19 17:27:13 +00:00
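The size reduction in IMPALA-7367 comes from removing tail padding. The sketch below shows the effect on a pointer-plus-length struct; `PaddedStringValue` and `PackedStringValue` are hypothetical stand-ins rather than Impala's actual classes, and `__attribute__((packed))` is a GCC/Clang extension assumed here.

```cpp
#include <cstdint>

// Naturally aligned layout: on a typical 64-bit ABI the compiler appends
// 4 bytes of tail padding so the struct size is a multiple of 8 (16 bytes).
struct PaddedStringValue {
  char* ptr;
  int32_t len;
};

// Packed layout: no tail padding, so 8 + 4 = 12 bytes on the same ABI.
struct __attribute__((packed)) PackedStringValue {
  char* ptr;
  int32_t len;
};

// Compile-time check in the spirit of the static asserts the commit mentions.
static_assert(sizeof(PackedStringValue) == sizeof(char*) + sizeof(int32_t),
              "packed slot must not contain padding");
```

The trade-off is that arrays of packed values are no longer naturally aligned, which is why consumers that copy whole tuples with a fixed layout (Kudu, in the commit message) need explicit padding.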
Jim Apple	067657aa7d	IMPALA-5031: prevent signed overflow in decimal This removes two signed integer overflows when using the 'conv' builtin. Signed integer overflow is undefined behavior according to the C++ standard. The interesting parts of the backtraces are: exprs/math-functions-ir.cc:405:13: runtime error: signed integer overflow: 4738381338321616896 * 36 cannot be represented in type 'long' exprs/math-functions-ir.cc:404:24: runtime error: signed integer overflow: 2 * 4738381338321616896 cannot be represented in type 'long' #0 MathFunctions::DecimalInBaseToDecimal(long, signed char, long) exprs/math-functions-ir.cc:404:24 #1 MathFunctions::ConvInt(impala_udf::FunctionContext, impala_udf::BigIntVal const&, impala_udf::TinyIntVal const&, impala_udf::TinyIntVal const&) exprs/math-functions-ir.cc:327:10 #2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:485:580 #3 ScalarFnCall::GetStringVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:599:44 #8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator> const&, TupleRow, unsigned char, RowBatch) exec/union-node-ir.cc:29:14 #9 UnionNode::GetNextConst(RuntimeState, RowBatch) exec/union-node.cc:263:5 #10 UnionNode::GetNext(RuntimeState, RowBatch, bool*) exec/union-node.cc:296:45 These were triggered in the backend test ExprTest.MathConversionFunctions. Change-Id: I0d97dfcf42072750c16e41175765cd9a468a3c39 Reviewed-on: http://gerrit.cloudera.org:8080/11876 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-11-14 22:32:12 +00:00
Csaba Ringhofer	60095a4c6b	IMPALA-5050: Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS from Parquet Changes: - parquet.thrift is updated to a newer version which contains the timestamp logical type. - INT64 columns with converted types TIMESTAMP_MILLIS and TIMESTAMP_MICROS can be read as TIMESTAMP. - If the logical type is timestamp, then the type will contain the information whether the UTC->local conversion is necessary. This feature is only supported for the new timestamp types, so INT96 timestamps must still use flag convert_legacy_hive_parquet_utc_timestamps. - Min/max stat filtering is enabled again for columns that need UTC->local conversion. This was disabled in IMPALA-7559 because it could incorrectly drop column chunks. - CREATE TABLE LIKE PARQUET converts these columns to TIMESTAMP - before the change, an error was returned instead. - Bulk of the Parquet column stat logic was moved to a new class called "ColumnStatsReader". Testing: - Added unit tests for timezone conversion (this needed a new public function in timezone_db.h and adding CET to tzdb_tiny). - Added parquet files (created with parquet-mr) with int64 timestamp columns. Change-Id: I4c7c01fffa31b3d2ca3480adf6ff851137dadac3 Reviewed-on: http://gerrit.cloudera.org:8080/11057 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-11-14 20:16:14 +00:00
Tim Armstrong	250d85e94e	IMPALA-7822: handle overflows in repeat() builtin We need to carefully check that the intermediate value fits in an int64_t and the final size fits in an int. If they don't we raise an error and fail the query. Testing: Added a couple of backend tests to exercise the overflow check code paths. Change-Id: I872ce77bc2cb29116881c27ca2a5216f722cdb2a Reviewed-on: http://gerrit.cloudera.org:8080/11889 Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-11-07 21:49:15 +00:00
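The two checks described in IMPALA-7822 can be sketched as follows. This is an illustration only, not the code in the repeat() builtin: `RepeatedLength` is a hypothetical helper, and `std::nullopt` stands in for raising an error and failing the query.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Returns the byte length of a `len`-byte string repeated `n` times, or
// std::nullopt if the product overflows int64_t or does not fit in an int.
std::optional<int> RepeatedLength(int len, int64_t n) {
  if (len <= 0 || n <= 0) return 0;  // empty result, nothing to check
  // Check the intermediate product before computing it, so the
  // multiplication itself can never overflow.
  if (n > std::numeric_limits<int64_t>::max() / len) return std::nullopt;
  const int64_t total = static_cast<int64_t>(len) * n;
  // Check that the final size fits in an int.
  if (total > std::numeric_limits<int>::max()) return std::nullopt;
  return static_cast<int>(total);
}
```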
Jim Apple	78b6f1db69	IMPALA-5031: Make UBSAN-friendly arithmetic generic ArithmeticUtil::AsUnsigned() makes it possible to do arithmetic on signed integers in a way that does not invoke undefined behavior, but it only works on integers. This patch adds ArithmeticUtil::Compute(), which dispatches (at compile time) to the normal arithmetic evaluation method if the type of the values is a floating point type, but uses AsUnsigned() if the type of the values is an integral type. Change-Id: I73bec71e59c5a921003d0ebca52a1d4e49bbef66 Reviewed-on: http://gerrit.cloudera.org:8080/11810 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-11-05 23:30:52 +00:00
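The compile-time dispatch described in this commit could look roughly like the sketch below. The function names follow the commit message, but the signatures are assumptions rather than Impala's actual `ArithmeticUtil` interface. The sketch needs C++17 for `if constexpr`, and before C++20 the conversion back to the signed type is implementation-defined (two's-complement wraparound on GCC and Clang) rather than undefined.

```cpp
#include <cstdint>
#include <functional>
#include <type_traits>

// Applies Op on the unsigned counterpart of T, where overflow wraps modulo
// 2^n instead of being undefined, then converts the result back to T.
// Caveat: types narrower than int are promoted to int inside Op, so this
// sketch is only meant for 32-bit and 64-bit integers.
template <template <typename> class Op, typename T>
T AsUnsigned(T x, T y) {
  static_assert(std::is_integral<T>::value, "integral types only");
  using U = typename std::make_unsigned<T>::type;
  return static_cast<T>(Op<U>()(static_cast<U>(x), static_cast<U>(y)));
}

// Dispatches at compile time: integral types go through AsUnsigned(), while
// floating-point types use the operator directly.
template <template <typename> class Op, typename T>
T Compute(T x, T y) {
  if constexpr (std::is_integral<T>::value) {
    return AsUnsigned<Op>(x, y);
  } else {
    return Op<T>()(x, y);
  }
}
```

A call such as `Compute<std::plus>(a, b)` then works unchanged for `int64_t` and `double` operands, which is the generic behavior the patch is after.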
Jim Apple	8fc702be0c	IMPALA-5031: fix signed overflows in decimal The standard says that overflow for signed arithmetic operations is undefined behavior; see [expr]: If during the evaluation of an expression, the result is not mathematically defined or not in the range of representable values for its type, the behavior is undefined. and [basic.fundamental]: Unsigned integers shall obey the laws of arithmetic modulo 2^n where n is the number of bits in the value representation of that particular size of integer. This implies that unsigned arithmetic does not overflow because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting unsigned integer type. All of the overflows fixed in this patch were tested with expr-test's DecimalArithmeticTest. Change-Id: Ibf882428931e4f4264be2fc8cd9d6b1fc89b8ace Reviewed-on: http://gerrit.cloudera.org:8080/11604 Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-10-13 06:23:17 +00:00
Csaba Ringhofer	d301600a85	Revert "IMPALA-7595: Revert "IMPALA-7521: Speed up sub-second unix time->TimestampValue conversions"" IMPALA-7595 added proper handling for invalid time-of-day values in Parquet, so the DCHECK mentioned in IMPALA-7595 will no longer be hit. This means that IMPALA-7521 can be committed again without causing problems. This reverts commit `f8b472ee64`. Change-Id: Ibab04bc6ad09db331220312ed21d90622fdfc41b Reviewed-on: http://gerrit.cloudera.org:8080/11573 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-10-04 03:40:44 +00:00
Jim Apple	20bde289eb	IMPALA-5031: null ptr errors in C calls in BE tests This patch fixes all remaining UBSAN "null pointer passed as argument" errors in the backend tests. These are undefined behavior according to "7.1.4 Use of library functions" in the C99 standard (which is included in C++14 in section [intro.refs]): If an argument to a function has an invalid value (such as a value outside the domain of the function, or a pointer outside the address space of the program, or a null pointer, or a pointer to non-modifiable storage when the corresponding parameter is not const-qualified) or a type (after promotion) not expected by a function with variable number of arguments, the behavior is undefined. The interesting parts of the backtraces for the errors fixed in this patch are below: exprs/string-functions-ir.cc:311:17: runtime error: null pointer passed as argument 2, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 StringFunctions::Replace(impala_udf::FunctionContext, impala_udf::StringVal const&, impala_udf::StringVal const&, impala_udf::StringVal const&) exprs/string-functions-ir.cc:311:5 #1 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:485:580 #2 ScalarFnCall::GetStringVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:599:44 #3 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const) exprs/scalar-expr-evaluator.cc:299:38 #4 ScalarExprEvaluator::GetValue(TupleRow const) exprs/scalar-expr-evaluator.cc:250:10 #5 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, ScalarExprEvaluator* const, MemPool, StringValue*, int, int) runtime/tuple.cc:222:27 #6 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, vector<ScalarExprEvaluator> const&, MemPool, vector<StringValue>, int) runtime/tuple.h:174:5 #7 
UnionNode::MaterializeExprs(vector<ScalarExprEvaluator> const&, TupleRow, unsigned char, RowBatch) exec/union-node-ir.cc:29:14 #8 UnionNode::GetNextConst(RuntimeState, RowBatch) exec/union-node.cc:263:5 #9 UnionNode::GetNext(RuntimeState, RowBatch, bool) exec/union-node.cc:296:45 #10 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59 #11 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14 #12 QueryState::ExecFInstance(FragmentInstanceState) runtime/query-state.cc:488:24 #13 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35 #20 thread_proxy (exprs/expr-test+0x55ca939) exprs/string-functions-ir.cc:868:15: runtime error: null pointer passed as argument 2, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 StringFunctions::ConcatWs(impala_udf::FunctionContext, impala_udf::StringVal const&, int, impala_udf::StringVal const) exprs/string-functions-ir.cc:868:3 #1 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:510:270 #2 ScalarFnCall::GetStringVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:599:44 #3 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const) exprs/scalar-expr-evaluator.cc:299:38 #4 ScalarExprEvaluator::GetValue(TupleRow const) exprs/scalar-expr-evaluator.cc:250:10 #5 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, ScalarExprEvaluator* const, MemPool, StringValue*, int, int) runtime/tuple.cc:222:27 #6 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, vector<ScalarExprEvaluator> const&, MemPool, vector<StringValue>, int) runtime/tuple.h:174:5 #7 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator> const&, TupleRow, unsigned char, RowBatch) exec/union-node-ir.cc:29:14 #8 UnionNode::GetNextConst(RuntimeState, RowBatch) exec/union-node.cc:263:5 #9 
UnionNode::GetNext(RuntimeState, RowBatch, bool) exec/union-node.cc:296:45 #10 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59 #11 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14 #12 QueryState::ExecFInstance(FragmentInstanceState) runtime/query-state.cc:488:24 #13 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35 #20 thread_proxy (exprs/expr-test+0x55ca939) exprs/string-functions-ir.cc:871:17: runtime error: null pointer passed as argument 2, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 StringFunctions::ConcatWs(impala_udf::FunctionContext, impala_udf::StringVal const&, int, impala_udf::StringVal const) exprs/string-functions-ir.cc:871:5 #1 StringFunctions::Concat(impala_udf::FunctionContext, int, impala_udf::StringVal const) exprs/string-functions-ir.cc:843:10 #2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:510:95 #3 ScalarFnCall::GetStringVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:599:44 #4 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const) exprs/scalar-expr-evaluator.cc:299:38 #5 ScalarExprEvaluator::GetValue(TupleRow const) exprs/scalar-expr-evaluator.cc:250:10 #6 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, ScalarExprEvaluator* const, MemPool, StringValue*, int, int) runtime/tuple.cc:222:27 #7 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, vector<ScalarExprEvaluator> const&, MemPool, vector<StringValue>, int) runtime/tuple.h:174:5 #8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator> const&, TupleRow, unsigned char, RowBatch) exec/union-node-ir.cc:29:14 #9 UnionNode::GetNextConst(RuntimeState, RowBatch) exec/union-node.cc:263:5 #10 UnionNode::GetNext(RuntimeState, RowBatch, bool) exec/union-node.cc:296:45 #11 
FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59 #12 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14 #13 QueryState::ExecFInstance(FragmentInstanceState) runtime/query-state.cc:488:24 #14 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35 #21 thread_proxy (exprs/expr-test+0x55ca939) exprs/string-functions-ir.cc:873:17: runtime error: null pointer passed as argument 2, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 StringFunctions::ConcatWs(impala_udf::FunctionContext, impala_udf::StringVal const&, int, impala_udf::StringVal const) exprs/string-functions-ir.cc:873:5 #1 StringFunctions::Concat(impala_udf::FunctionContext, int, impala_udf::StringVal const) exprs/string-functions-ir.cc:843:10 #2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:510:95 #3 ScalarFnCall::GetStringVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:599:44 #4 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const) exprs/scalar-expr-evaluator.cc:299:38 #5 ScalarExprEvaluator::GetValue(TupleRow const) exprs/scalar-expr-evaluator.cc:250:10 #6 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, ScalarExprEvaluator* const, MemPool, StringValue*, int, int) runtime/tuple.cc:222:27 #7 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, vector<ScalarExprEvaluator> const&, MemPool, vector<StringValue>, int) runtime/tuple.h:174:5 #8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator> const&, TupleRow, unsigned char, RowBatch) exec/union-node-ir.cc:29:14 #9 UnionNode::GetNextConst(RuntimeState, RowBatch) exec/union-node.cc:263:5 #10 UnionNode::GetNext(RuntimeState, RowBatch, bool) exec/union-node.cc:296:45 #11 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59 
#12 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14 #13 QueryState::ExecFInstance(FragmentInstanceState) runtime/query-state.cc:488:24 #14 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35 #21 thread_proxy (exprs/expr-test+0x55ca939) runtime/raw-value.cc:159:27: runtime error: null pointer passed as argument 2, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 RawValue::Write(void const, void, ColumnType const&, MemPool) runtime/raw-value.cc:159:9 #1 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, ScalarExprEvaluator const, MemPool, StringValue*, int, int) runtime/tuple.cc:225:7 #2 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, vector<ScalarExprEvaluator> const&, MemPool, vector<StringValue>, int) runtime/tuple.h:174:5 #3 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator> const&, TupleRow, unsigned char, RowBatch) exec/union-node-ir.cc:29:14 #4 UnionNode::GetNextConst(RuntimeState, RowBatch) exec/union-node.cc:263:5 #5 UnionNode::GetNext(RuntimeState, RowBatch, bool) exec/union-node.cc:296:45 #6 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59 #7 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14 #8 QueryState::ExecFInstance(FragmentInstanceState) runtime/query-state.cc:488:24 #9 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35 #16 thread_proxy (exprs/expr-test+0x55ca939) udf/udf.cc:521:24: runtime error: null pointer passed as argument 2, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 impala_udf::StringVal::CopyFrom(impala_udf::FunctionContext, unsigned char const, unsigned long) udf/udf.cc:521:5 #1 AnyValUtil::FromBuffer(impala_udf::FunctionContext, char const, int) exprs/anyval-util.h:241:12 #2 
StringFunctions::RegexpExtract(impala_udf::FunctionContext, impala_udf::StringVal const&, impala_udf::StringVal const&, impala_udf::BigIntVal const&) exprs/string-functions-ir.cc:726:10 #3 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:485:580 #4 ScalarFnCall::GetStringVal(ScalarExprEvaluator, TupleRow const) const exprs/scalar-fn-call.cc:599:44 #5 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const) exprs/scalar-expr-evaluator.cc:299:38 #6 ScalarExprEvaluator::GetValue(TupleRow const) exprs/scalar-expr-evaluator.cc:250:10 #7 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, ScalarExprEvaluator const, MemPool, StringValue*, int, int) runtime/tuple.cc:222:27 #8 void Tuple::MaterializeExprs<false, false>(TupleRow, TupleDescriptor const&, vector<ScalarExprEvaluator> const&, MemPool, vector<StringValue>, int) runtime/tuple.h:174:5 #9 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator> const&, TupleRow, unsigned char, RowBatch) exec/union-node-ir.cc:29:14 #10 UnionNode::GetNextConst(RuntimeState, RowBatch) exec/union-node.cc:263:5 #11 UnionNode::GetNext(RuntimeState, RowBatch, bool) exec/union-node.cc:296:45 #12 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59 #13 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14 #14 QueryState::ExecFInstance(FragmentInstanceState) runtime/query-state.cc:488:24 #15 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35 #22 thread_proxy (exprs/expr-test+0x55ca939) util/coding-util-test.cc:45:10: runtime error: null pointer passed as argument 1, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 TestUrl(string const&, string const&, bool) util/coding-util-test.cc:45:3 #1 UrlCodingTest_BlankString_Test::TestBody() util/coding-util-test.cc:88:3 #2 void 
testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) (util/coding-util-test+0x6630f42) #8 main util/coding-util-test.cc:123:192 util/decompress-test.cc:126:261: runtime error: null pointer passed as argument 1, which is declared to never be null /usr/include/string.h:66:58: note: nonnull attribute specified here #0 DecompressorTest::CompressAndDecompress(Codec, Codec, long, unsigned char) util/decompress-test.cc:126:254 #1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:84:9 #2 DecompressorTest_Default_Test::TestBody() util/decompress-test.cc:373:3 #3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) (util/decompress-test+0x6642bb2) #9 main util/decompress-test.cc:479:47 util/decompress-test.cc:148:261: runtime error: null pointer passed as argument 1, which is declared to never be null /usr/include/string.h:66:58: note: nonnull attribute specified here #0 DecompressorTest::CompressAndDecompress(Codec, Codec, long, unsigned char) util/decompress-test.cc:148:254 #1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:84:9 #2 DecompressorTest_Default_Test::TestBody() util/decompress-test.cc:373:3 #3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) (util/decompress-test+0x6642bb2) #9 main util/decompress-test.cc:479:47 util/decompress-test.cc:269:261: runtime error: null pointer passed as argument 1, which is declared to never be null /usr/include/string.h:66:58: note: nonnull attribute specified here #0 DecompressorTest::CompressAndDecompressNoOutputAllocated(Codec, Codec, long, unsigned char) util/decompress-test.cc:269:254 #1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:71:7 #2 DecompressorTest_LZ4_Test::TestBody() util/decompress-test.cc:381:3 #3 
void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) (util/decompress-test+0x6642bb2) #9 main util/decompress-test.cc:479:47 util/decompress-test.cc:221:329: runtime error: null pointer passed as argument 1, which is declared to never be null /usr/include/string.h:66:58: note: nonnull attribute specified here #0 DecompressorTest::StreamingDecompress(Codec, long, unsigned char, long, unsigned char, bool, long) util/decompress-test.cc:221:322 #1 DecompressorTest::CompressAndStreamingDecompress(Codec, Codec, long, unsigned char) util/decompress-test.cc:245:35 #2 DecompressorTest::RunTestStreaming(THdfsCompression::type) util/decompress-test.cc:104:5 #3 DecompressorTest_Gzip_Test::TestBody() util/decompress-test.cc:386:3 #4 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test, void (testing::Test::)(), char const) (util/decompress-test+0x6642bb2) #10 main util/decompress-test.cc:479:47 util/streaming-sampler.h:55:22: runtime error: null pointer passed as argument 2, which is declared to never be null /usr/include/string.h:43:45: note: nonnull attribute specified here #0 StreamingSampler<long, 64>::StreamingSampler(int, vector<long> const&) util/streaming-sampler.h:55:5 #1 RuntimeProfile::TimeSeriesCounter::TimeSeriesCounter(string const&, TUnit::type, int, vector<long> const&) util/runtime-profile-counters.h:401:53 #2 RuntimeProfile::Update(vector<TRuntimeProfileNode> const&, int) util/runtime-profile.cc:310:28 #3 RuntimeProfile::Update(TRuntimeProfileTree const&) util/runtime-profile.cc:245:3 #4 Coordinator::BackendState::InstanceStats::Update(TFragmentInstanceExecStatus const&, Coordinator::ExecSummary, ProgressUpdater) runtime/coordinator-backend-state.cc:473:13 #5 Coordinator::BackendState::ApplyExecStatusReport(TReportExecStatusParams const&, Coordinator::ExecSummary, ProgressUpdater*) runtime/coordinator-backend-state.cc:286:21 #6 
Coordinator::UpdateBackendExecStatus(TReportExecStatusParams const&) runtime/coordinator.cc:678:22 #7 ClientRequestState::UpdateBackendExecStatus(TReportExecStatusParams const&) service/client-request-state.cc:1253:18 #8 ImpalaServer::ReportExecStatus(TReportExecStatusResult&, TReportExecStatusParams const&) service/impala-server.cc:1343:18 #9 ImpalaInternalService::ReportExecStatus(TReportExecStatusResult&, TReportExecStatusParams const&) service/impala-internal-service.cc:87:19 #24 thread_proxy (exprs/expr-test+0x55ca939) Change-Id: I317ccc99549744a26d65f3e07242079faad0355a Reviewed-on: http://gerrit.cloudera.org:8080/11545 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-10-02 20:24:17 +00:00
stiga-huang	ddef2cb9b1	IMPALA-376: add built-in functions for parsing JSON This patch implements the same function as the Hive UDF get_json_object. We reuse RapidJson to parse the JSON string. To track the memory used by RapidJson, we wrap FunctionContext in an allocator. get_json_object accepts two parameters: a JSON string and a selector (JSON path). We parse the JSON string into a Document tree and then perform BFS according to the selector. For example, to process get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'), we first apply '$[*]' to extract all the items in the root array. This yields a queue consisting of {a:1},{a:2},{a:3}, and we then apply the '.a' selector to every value in the queue. The final result in the queue is 1,2,3. Because there are multiple results, they are wrapped in an array, so the output is the string '[1,2,3]'. More examples can be found in expr-test.cc. Test: * Add unit tests in expr-test * Add e2e tests in exprs.test * Add tests in test_alloc_fail.py to check handling of out of memory Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f Reviewed-on: http://gerrit.cloudera.org:8080/10950 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-09-29 11:59:03 +00:00
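The queue-based selector walk described in the IMPALA-376 message can be sketched in a few lines. This is only an illustration in Python, not the patch's C++/RapidJson code: the function names, the step grammar (".key", "[index]", and a wildcard array step spelled "[*]" here), and the single-result rendering are all assumptions made for the sketch.

```python
import json
import re
from collections import deque

# One selector step: ".key", "[index]" or "[*]" (assumed, simplified grammar).
_STEP_RE = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[([0-9]+|\*)\]")


def split_selector(selector):
    """Split a selector such as '$[*].a' into a list of (kind, argument) steps."""
    if not selector.startswith("$"):
        raise ValueError("selector must start with '$'")
    steps, pos = [], 1
    while pos < len(selector):
        m = _STEP_RE.match(selector, pos)
        if m is None:
            raise ValueError(f"bad selector near position {pos}")
        key, index = m.group(1), m.group(2)
        if key is not None:
            steps.append(("key", key))
        elif index == "*":
            steps.append(("wildcard", None))
        else:
            steps.append(("index", int(index)))
        pos = m.end()
    return steps


def get_json_object_sketch(json_text, selector):
    """Hypothetical helper: apply each selector step to every value in a queue."""
    queue = deque([json.loads(json_text)])
    for kind, arg in split_selector(selector):
        next_queue = deque()
        for value in queue:
            if kind == "key" and isinstance(value, dict) and arg in value:
                next_queue.append(value[arg])
            elif kind == "index" and isinstance(value, list) and arg < len(value):
                next_queue.append(value[arg])
            elif kind == "wildcard" and isinstance(value, list):
                next_queue.extend(value)  # one queue entry per array item
        queue = next_queue
    if not queue:
        return None  # nothing matched the selector
    if len(queue) > 1:
        # Multiple results are wrapped in an array, as the message describes.
        return json.dumps(list(queue), separators=(",", ":"))
    only = queue[0]
    # Assumption of this sketch: a lone string is returned without quotes.
    return only if isinstance(only, str) else json.dumps(only, separators=(",", ":"))


print(get_json_object_sketch('[{"a":1}, {"a":2}, {"a":3}]', "$[*].a"))
```

Processing one step across the whole queue before moving to the next step is what makes this a breadth-first walk; the real implementation additionally routes RapidJson allocations through FunctionContext so memory is tracked, which this sketch does not model.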
Attila Jeges	cb49371613	IMPALA-7492: Add support for DATE text parser/formatter This change is the first step in implementing support for DATE type (IMPALA-6169). The DATE parser/formatter is implemented by the new DateParser class. - The parser supports parsing both default and custom formatted DATE values. CCTZ is used to validate the parsed dates. - The formatter supports default and custom formatting of DATE values. In the future, DateParser will be used in the text scanner/writer and in the DATE <-> STRING cast functions. The DateParser class reuses some of the functionality already implemented in the TimestampParser class to minimize redundancy. To make code reuse easier, a new namespace (datetime_parse_util) was created and the common functionality was moved there. This change also adds a new class (DateValue) to represent a DATE value in memory. The DateParser and DateValue classes are used only in tests at the moment, therefore this patch doesn't change user-facing behavior. Testing: - Added BE-tests for DateParser and DateValue classes. - Re-ran parse-timestamp-benchmark to make sure that parser performance hasn't degraded. Change-Id: I1eec00f22502c4c67c6807c4b51384419ea8b831 Reviewed-on: http://gerrit.cloudera.org:8080/11450 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-09-28 18:14:56 +00:00
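The parse-then-validate split that the IMPALA-7492 message describes (tokenize the text, then let a calendar library reject impossible dates) might look roughly like the following. This is a loose Python sketch, not the DateParser class: the function names are hypothetical, the fixed yyyy-MM-dd layout is an assumed default format, and Python's datetime.date stands in for the CCTZ validation step.

```python
import re
from datetime import date

# Assumed default text layout: four-digit year, two-digit month and day.
_DEFAULT_DATE_RE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_date_sketch(text):
    """Return a date for well-formed, valid input, otherwise None."""
    m = _DEFAULT_DATE_RE.fullmatch(text)
    if m is None:
        return None  # wrong layout, rejected before any calendar check
    year, month, day = (int(g) for g in m.groups())
    try:
        # Calendar validation; stands in for the CCTZ check in the patch.
        return date(year, month, day)
    except ValueError:
        return None  # e.g. February 30th


def format_date_sketch(value):
    """Format a date back into the assumed default layout."""
    return f"{value.year:04d}-{value.month:02d}-{value.day:02d}"


print(parse_date_sketch("2018-09-28"))
print(parse_date_sketch("2018-02-30"))
```

Keeping the layout check separate from the calendar check mirrors the reuse the message mentions: the tokenizing part can be shared with a timestamp parser, while date validity is delegated to a library.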


724 Commits