mirror of
https://github.com/apache/impala.git
synced 2026-01-27 06:10:53 -05:00
2cd7a2b77acfa04094e91efbc6803d11fabcc0e9
724 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
da5b498c18 |
IMPALA-9373: more tactical IWYU fixes
This is a grab-bag of fixes made with a mix of manual
inspection and tool output. The techniques used were:
* Getting preprocessor output for a few files by modifying
command lines from compile_commands.json to include -E.
This is revealing because you see all the random unrelated
cruft that gets pulled in. A useful one-liner to extract
an (approximate) list of headers from preprocessor output is:
grep '^#.*h' be/src/util/CMakeFiles/Util.dir/os-info.cc.i | \
grep -o '".*"' | sort -u
* Looking at the IWYU recommendations for guidance on what
headers can be removed (and what need to be added).
* Grepping for includes of headers, especially in other headers
where they become viral. An example one-liner to find these:
git grep -l 'include.*<iostream>' | grep '\.h$'
Non-exhaustive list of changes made:
-----------------------------------
Unnest classes from TmpFileMgr so we can forward-declare them.
This lets us remove tmp-file-mgr.h from buffer-pool.h and
query-state.h, which are both widely included headers in the
codebase.
Also remove webserver.h from other headers, since it
pulls in openssl-util.h and consequently a lot of
openssl headers.
Avoid including runtime/multi-precision.h in other headers.
It pulls in a lot of boost multiprecision headers that
are only needed for internal implementations of math
and decimal operations. This required replacing some
references to int128_t with __int128_t, which I don't
think significantly hurts code readability.
Also remove references to decimal-util.h where they're
not needed, since it transitively pulls in
multi-precision.h
Reduce includes of boost/date_time modules, which are
transitively included in many places via timestamp-value.h.
Remove transitive dependencies of timestamp-value.h
to avoid pulling in remaining boost date_time headers
where not needed. Dependent headers are:
scalar-expr-evaluator.h, expr-value.h
Remove references to debug-util.h in other headers,
because it pulls in a lot of thread headers.
Remove references to llvm-codegen.h where possible,
because it pulls in many llvm headers.
Other opportunities:
--------------------
* boost/algorithm/string.hpp includes many string algorithms
and pulls in a lot of headers.
* util/string-parser.h is a giant header with many dependencies.
* There's lots of redundancy between boost and standard c++
headers. Both pull in vast numbers of utility headers for
C++ metaprogramming and similar things. If we reduced virality
of boost headers this would help a lot, and also if we switch
to equivalent standard headers where possible (e.g. unordered_map,
unordered_set, function, bind, etc).
Compile time with clang/ASAN:
-----------------------------
Before:
real 9m6.311s
user 62m25.006s
sys 2m44.798s
After:
real 8m17.073s
user 55m38.425s
sys 2m25.808s
Change-Id: I8de71866bdf3211e53560d9bfe930e7657c4d7f1
Reviewed-on: http://gerrit.cloudera.org:8080/15248
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
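For illustration, a minimal C++ sketch of the forward-declaration technique the commit describes; the class and header names are simplified stand-ins, not Impala's exact code:

```cpp
// before: buffer-pool.h did #include "runtime/tmp-file-mgr.h", dragging that
// header (and everything it includes) into every file including buffer-pool.h.
// after: forward declarations suffice, because this header only uses
// pointers/references to the types; the .cc file includes the real header.
namespace impala {

class TmpFileMgr;    // forward declaration - no definition needed here
class TmpFileGroup;  // unnesting it from TmpFileMgr makes this possible

class BufferPool {
 public:
  // A pointer parameter only needs the declaration, not the full definition.
  void RegisterTmpFiles(TmpFileGroup* files);

 private:
  TmpFileMgr* tmp_file_mgr_ = nullptr;
};

}  // namespace impala
```

A class nested inside TmpFileMgr cannot be forward-declared from outside its enclosing class, which is why the commit unnests the classes first.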
||
|
|
984f675e05 |
IMPALA-5904: (part 2) Fix more TSAN bugs
Fixes the following data races reported by TSAN:
data race be/src/runtime/krpc-data-stream-sender.cc:581:3 in KrpcDataStreamSender::Channel::SerializeAndSendBatch(impala::RowBatch*)
* Race condition when reading 'rpc_in_flight_batch_' outside of the 'lock_'
* Since this race condition is only triggered inside a DCHECK, added a suppression
data race be/src/util/stopwatch.h:183:9 in MonotonicStopWatch::RunningTime() const
* Race condition on BlockingJoinNode::built_probe_overlap_stop_watch_; changed it from a MonotonicStopWatch to a ConcurrentStopWatch
data race be/src/exec/kudu-scan-node.cc:211:13 in KuduScanNode::ProcessScanToken(impala::KuduScanner*, std::string const&)
* Some reads of KuduScanNode::done_ are racy, so I made 'done_' an AtomicBool; this has the added benefit that failed scans will be aborted as soon as 'done_' is set to true
data race be/src/service/client-request-state.h:220:29 in ClientRequestState::eos() const
* Race condition when reading / updating ClientRequestState::eos_; made 'eos_' an AtomicBool
data race be/src/exec/parquet/parquet-column-readers.cc:497:9 in bool ScalarColumnReader<...>::ReadValueBatch<false>(...)
* Race condition in SHOULD_TRIGGER_COL_READER_DEBUG_ACTION / parquet_column_reader_debug_count
data race be/src/service/impala-server.cc:817:20 in ImpalaServer::ArchiveQuery(impala::ClientRequestState const&)
* Race condition on some ClientRequestState fields when creating a QueryStateRecord
Fixes IMPALA-9313 ('TSAN data race in TmpFileMgr::File::Blacklist') and adds a suppression for IMPALA-9404 ('Instantiations/ExprTest.MathConversionFunctions fails in TSAN builds').
Testing:
* Ran core tests
* Re-ran TSAN tests and confirmed the data races have been fixed
Change-Id: I01feb40417dc5ea64ccb0c1044cfc3eed8508476
Reviewed-on: http://gerrit.cloudera.org:8080/15244
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
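A minimal sketch of the AtomicBool pattern applied to 'done_' above, with std::atomic<bool> standing in for Impala's AtomicBool (the class is illustrative):

```cpp
#include <atomic>
#include <thread>

// Sketch: 'done_' is read by scanner threads without holding a lock, so a
// plain bool is a data race under TSAN. An atomic flag makes the accesses
// well-defined and lets scans notice cancellation promptly.
class ScanNodeSketch {
 public:
  void ProcessScanToken() {
    while (!done_.load(std::memory_order_acquire)) {
      // ... materialize one batch, re-checking done_ between batches ...
    }
  }
  void SetDone() { done_.store(true, std::memory_order_release); }

 private:
  std::atomic<bool> done_{false};  // was: bool done_; (racy)
};

int main() {
  ScanNodeSketch node;
  std::thread scanner([&] { node.ProcessScanToken(); });
  node.SetDone();  // observed by the scanner thread without a lock
  scanner.join();
}
```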
||
|
|
3224839876 |
IMPALA-8759: Use double precision for HLL finalize function
The current HLL finalize function uses the single-precision float32 data type to calculate the estimate. It is not accurate for larger cardinalities beyond 1,000,000, since float32 only has 6-7 decimal digits of precision. This patch changes the single-precision data type to double precision for the HLL finalize function.
Testing:
- Passed all exhaustive tests.
- Benchmarked queries with NDV functions. The performance impact is negligible. See the following spreadsheet for the benchmark: https://docs.google.com/spreadsheets/d/1DIVOEs5C4MJL1b7O4MA_jkaM3Y-JSMFREjXCUHJ3eHc/edit#gid=0
Change-Id: I0c5a5229b682070b0bc14da287db5231159dbb3d
Reviewed-on: http://gerrit.cloudera.org:8080/15167
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
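A small demonstration of why float32 falls short here: a float has a 24-bit mantissa, so not every integer above 2^24 is exactly representable, and the estimator's arithmetic loses accuracy well within the cardinality ranges NDV is used for:

```cpp
#include <cstdio>

int main() {
  float f = 16777216.0f;   // 2^24: above this, not every integer fits in float32
  double d = 16777216.0;
  printf("%.1f\n", f + 1.0f);  // prints 16777216.0 - the increment is lost
  printf("%.1f\n", d + 1.0);   // prints 16777217.0 - double keeps it
}
```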
||
|
|
2c54dbe225 |
IMPALA-9385: Unix time conversion cleanup + ORC fix
The ORC scanner uses TimestampValue::FromUnixTimeNanos() to convert the sec + nano representation to Impala's TimestampValue (day + nano). FromUnixTimeNanos was affected by flag use_local_tz_for_unix_timestamp_conversions, while that global option should not affect ORC. By default there was no conversion, but if the flag is 1, then timestamps were interpreted as UTC and converted to local time. This could be solved by creating a UTC version of FromUnixTimeNanos, but I decided to change the interface in the hope of making To/From timestamp functions less confusing.
Changes:
- Fixed the bug by passing UTC as the timezone in the ORC scanner.
- Changed the interface of these TimestampValue functions to expect a timezone pointer, interpreting null as UTC and skipping conversion. It would also be possible to pass the actual UTC timezone and check for it in the functions, but I guess it is easier to optimize the inlined functions this way.
- Moved the checking of use_local_tz_for_unix_timestamp_conversions to RuntimeState and added property time_zone_for_unix_time_conversions() to return the timezone to use in Unix time conversions. This made TimestampValue's interface clearer and makes it easy to replace the flag with a query option if we want to.
- Changed RuntimeState and the Parquet scanner to skip timezone conversion if convert_legacy_hive_parquet_utc_timestamps=1 but the timezone is UTC. This allows users to avoid the performance penalty of this flag by setting query option timezone to UTC in their session (IMPALA-7557). CCTZ is not good at this: conversions are actually slower with fixed-offset timezones (including UTC) than with timezones that have DST/historical rule changes.
Postponed changes:
- Didn't remove the UTC versions of the functions yet, as that would require changing (and possibly rethinking) several BE tests and benchmarks (IMPALA-9409).
Tests:
- Added regression tests for ORC and other file formats to check that they are not affected by this flag.
- Extended test_hive_parquet_timestamp_conversion.py to cover the case when convert_legacy_hive_parquet_utc_timestamps=1 and timezone=UTC. Also did some cleanup there to use query option timezone instead of env var TZ.
Change-Id: I14e2a7e512ccd013d5d9fe480a5467ed4c46b76e
Reviewed-on: http://gerrit.cloudera.org:8080/15222
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
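A hedged sketch of the interface shape described above; the types are simplified stand-ins, not the actual Impala signatures:

```cpp
#include <cstdint>

struct Timezone;  // stand-in for the real timezone type (cctz::time_zone)
struct TimestampValue { int32_t days; int64_t nanos; };

// nullptr tz means "UTC": interpret the unix time directly and skip the
// (comparatively expensive) timezone conversion. The ORC scanner now takes
// this path regardless of use_local_tz_for_unix_timestamp_conversions.
TimestampValue FromUnixTimeNanos(int64_t secs, int64_t nanos, const Timezone* tz) {
  TimestampValue tv{0, 0};
  // ... split secs/nanos into days + time-of-day ...
  if (tz != nullptr) {
    // ... convert from UTC to 'tz' (e.g. the per-query timezone) ...
  }
  (void)secs; (void)nanos;
  return tv;
}

int main() {
  TimestampValue tv = FromUnixTimeNanos(0, 0, /*tz=*/nullptr);  // UTC, no conversion
  (void)tv;
}
```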
||
|
|
04fd9ae268 |
IMPALA-9373: Trial run of include-what-you-use
Implemented recommendations from IWYU in a subset of files, mostly in util. Did a few cleanups related to systematic problems that I noticed as a result:
- uid-util.h was pulling boost UUID headers into a lot of compilation units, so refactored it a little, including pulling the hash functions out into unique-id-hash.h and moving some inline functions into client-request-state-map.cc.
- Systematically replaced the general boost mutex header with the internal pthread-based one. This is equivalent for us, since we assume that boost::mutex is implemented by pthread_mutex_t, e.g. for the implementation of ConditionVariable.
- Switched include guards to pragma once just as general cleanup.
- Prefixed string with std:: consistently in headers so that they don't depend on "using" declarations pulled in from random headers.
- Looked at includes of C++ stream headers, including iostream and stringstream, and replaced them with iosfwd or removed them if possible.
Compile time: Measured a full ASAN build of the impalad binary on an 8 core machine with ccache enabled, but cleared. It used very slightly less CPU, probably because we are still pulling in most of the same system headers.
Before:
real 9m27.502s
user 64m39.775s
sys 2m49.002s
After:
real 9m26.561s
user 64m28.948s
sys 2m48.252s
So for the moment, the only significant wins are on incremental builds, where touching header files should not require as many recompilations. Compile times should start to drop meaningfully once we thin out more unnecessary includes - currently it seems like most compile units end up with large chunks of boost/std code included via transitive header dependencies.
Change-Id: I3450e0ffcb8b183e18ac59c8b33b9ecbd3f60e20
Reviewed-on: http://gerrit.cloudera.org:8080/15202
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
ba00551581 |
IMPALA-4080 [part 1]: Move codegen code from aggregation exec nodes to their plan nodes
Refactored code to move codegen code from aggregation exec nodes to their plan nodes. Added some TODOs that will be fixed in the next few patches.
Testing:
- Ran queries and confirmed manually that the codegened code works.
- Ran all e2e tests for agg nodes and partition joins.
Change-Id: I58f52a262ac7d0af259d5bcda72ada93a851d3b2
Reviewed-on: http://gerrit.cloudera.org:8080/15053
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
0936384271 |
IMPALA-9010: Add builtin mask functions
There are 6 builtin GenericUDFs for column masking in Hive:
mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar,
otherChar, numberChar)
mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar,
otherChar, numberChar)
mask_first_n(value, charCount, upperChar, lowerChar, digitChar,
otherChar, numberChar)
mask_last_n(value, charCount, upperChar, lowerChar, digitChar,
otherChar, numberChar)
mask_hash(value)
mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar,
dayValue, monthValue, yearValue)
Description of the parameters:
value - value to mask. Supported types: TINYINT, SMALLINT, INT,
BIGINT, STRING, VARCHAR, CHAR, DATE (only for mask()).
charCount - number of characters. Default value: 4
upperChar - character to replace upper-case characters with. Specify
-1 to retain original character. Default value: 'X'
lowerChar - character to replace lower-case characters with. Specify
-1 to retain original character. Default value: 'x'
digitChar - character to replace digit characters with. Specify -1
to retain original character. Default value: 'n'
otherChar - character to replace all other characters with. Specify
-1 to retain original character. Default value: -1
numberChar - character to replace digits in a number with. Valid
values: 0-9. Default value: '1'
dayValue - value to replace day field in a date with.
Specify -1 to retain original value. Valid values: 1-31.
Default value: 1
monthValue - value to replace month field in a date with. Specify -1
to retain original value. Valid values: 0-11. Default
value: 0
yearValue - value to replace year field in a date with. Specify -1
to retain original value. Default value: 1
In Hive, these functions accept a variable number of arguments with
unrestricted types:
mask_show_first_n(val)
mask_show_first_n(val, 8)
mask_show_first_n(val, 8, 'X', 'x', 'n')
mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2)
mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9')
The arguments of upperChar, lowerChar, digitChar, otherChar and
numberChar can be in string or numeric types.
Impala doesn't support Hive GenericUDFs, so we lack these mask
functions needed to support Ranger column masking policies. On the other hand,
we want the masking functions to be evaluated in the C++ builtin logic
rather than calling out to java UDFs for performance. This patch
introduces our builtin implementation of them.
We currently don't have a corresponding framework for GenericUDF
(IMPALA-9271), so we implement these by overloads. However, it may
require hundreds of overloads to cover all possible combinations. We
just implement some important overloads, including
- those used by Ranger default masking policies,
- those with simple arguments and may be useful for users,
- an overload with all arguments in int type for full functionality.
Char arguments need to be converted to their ASCII values.
Tests:
- Add BE tests in expr-test
Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664
Reviewed-on: http://gerrit.cloudera.org:8080/14963
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
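For illustration, a sketch of the per-character masking semantics with the default arguments documented above; this mirrors the described behavior, not Impala's implementation:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Default policy from the description above: upper -> 'X', lower -> 'x',
// digit -> 'n', other kept (-1 means "retain the original character").
static char MaskChar(char c, int upper = 'X', int lower = 'x', int digit = 'n',
                     int other = -1) {
  unsigned char uc = static_cast<unsigned char>(c);
  if (std::isupper(uc)) return upper == -1 ? c : static_cast<char>(upper);
  if (std::islower(uc)) return lower == -1 ? c : static_cast<char>(lower);
  if (std::isdigit(uc)) return digit == -1 ? c : static_cast<char>(digit);
  return other == -1 ? c : static_cast<char>(other);
}

// mask_show_first_n: keep the first char_count characters, mask the rest.
static std::string MaskShowFirstN(const std::string& val, int char_count = 4) {
  std::string out = val;
  for (size_t i = static_cast<size_t>(char_count < 0 ? 0 : char_count);
       i < out.size(); ++i) {
    out[i] = MaskChar(out[i]);
  }
  return out;
}

int main() {
  // "TestString-123" -> "TestXxxxxx-nnn" under the default policy.
  printf("%s\n", MaskShowFirstN("TestString-123").c_str());
}
```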
||
|
|
4c04e67738 |
IMPALA-8891: Fix non-standard null handling in concat_ws()
This patch fixes the non-standard null handling logic for
function 'concat_ws', while maintaining the original null
handling for function 'concat'.
Existing behavior:
For function concat_ws, any null string element in the array
argument 'strs' results in a null result, as shown below:
------------------------------------------------
select concat_ws('-','foo',null,'bar') as expr1;
+-------+
| expr1 |
+-------+
| NULL  |
+-------+
New behavior:
In this implementation, the function conforms to the Hive standard:
1. Joins all the non-null string objects into the result
2. If all string objects are null, returns an empty string
3. If the separator is null, returns null
Below is an example:
-------------------------------------------------
select concat_ws('-','foo',null,'bar') as expr1;
+---------+
| expr1   |
+---------+
| foo-bar |
+---------+
------------------------------------------------
Key changes:
* Reimplemented StringFunctions::ConcatWs to filter out null
values and process only the valid string values, based on the
original code structure.
* StringFunctions::Concat was also reimplemented, as it used to
call ConcatWs but should keep the original NULL handling.
Testing:
* Ran exhaustive tests.
Change-Id: I64cd3bfbb952e431a0cf52a5835ac05d2513d29b
Reviewed-on: http://gerrit.cloudera.org:8080/14885
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
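A compact sketch of the three rules above, with std::optional standing in for a nullable StringVal (not the actual Impala code):

```cpp
#include <cstdio>
#include <optional>
#include <string>
#include <vector>

using NullableStr = std::optional<std::string>;

static NullableStr ConcatWs(const NullableStr& sep,
                            const std::vector<NullableStr>& strs) {
  if (!sep) return std::nullopt;  // rule 3: NULL separator -> NULL result
  std::string result;
  bool first = true;
  for (const auto& s : strs) {
    if (!s) continue;             // rule 1: NULL elements are skipped
    if (!first) result += *sep;
    result += *s;
    first = false;
  }
  return result;                  // rule 2: all-NULL input yields ""
}

int main() {
  auto r = ConcatWs(std::string("-"), {std::string("foo"), std::nullopt,
                                       std::string("bar")});
  printf("%s\n", r->c_str());  // prints "foo-bar", matching the example above
}
```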
||
|
|
590da59a3c |
IMPALA-8706: ISO:SQL:2016 datetime patterns - Milestone 4
This patch adds ISO 8601 week-based date format tokens on top
of what was introduced in IMPALA-8703, IMPALA-8704 and
IMPALA-8705.
The ISO 8601 week-based date tokens may be used for both datetime
to string and string to datetime conversion.
The ISO 8601 week-based date tokens are as follows:
- IYYY: 4-digit ISO 8601 week-numbering year.
Week-numbering year is the year relating to the ISO
8601 week number (IW), which is the full week (Monday
to Sunday) which contains January 4 of the Gregorian
year.
Behaves similarly to YYYY in that for datetime to
string conversion, prefix digits for 1, 2, and 3-digit
inputs are obtained from current ISO 8601
week-numbering year.
- IYY: Last 3 digits of ISO 8601 week-numbering year.
Behaves similarly to YYY in that for datetime to string
conversion, prefix digit is obtained from current ISO
8601 week-numbering year and can accept 1 or 2-digit
input.
- IY: Last 2 digits of ISO 8601 week-numbering year.
Behaves similarly to YY in that for datetime to string
conversion, prefix digits are obtained from current ISO
8601 week-numbering year and can accept 1-digit input.
- I: Last digit of ISO 8601 week-numbering year.
Behaves similarly to Y in that for datetime to string
conversion, prefix digits are obtained from current ISO
8601 week-numbering year.
- IW: ISO 8601 week of year (1-53).
Begins on the Monday closest to January 1 of the year.
For string to datetime conversion, if the input ISO
8601 week does not exist in the input year, an error
will be thrown.
Note that IW is different from the other week-related
tokens WW and W (implemented in IMPALA-8705). With WW
and W weeks start with the first day of the
year/month. ISO 8601 weeks on the other hand always
start with Monday.
- ID: ISO 8601 day of week (1-7). 1 means Monday and 7 means
Sunday.
When doing string to datetime conversion, the ISO 8601 week-based
tokens are meant to be used together and not mixed with other ISO
SQL date tokens. E.g. 'YYYY-IW-ID' is an invalid format string.
The only exceptions are the day name tokens (DAY and DY) which
may be used instead of ID with the rest of the ISO 8601
week-based date tokens. E.g. 'IYYY-IW-DAY' is a valid format
string.
Change-Id: I89a8c1b98742391cb7b331840d216558dbca362b
Reviewed-on: http://gerrit.cloudera.org:8080/14852
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
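For illustration, a self-contained sketch of the ISO 8601 week computation that the IW/IYYY tokens above rely on; helper names are illustrative, not Impala's:

```cpp
#include <cstdio>

static bool IsLeap(int y) { return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0; }

// Day of week via Sakamoto's method, mapped to ISO numbering (1=Mon..7=Sun).
static int IsoWeekday(int y, int m, int d) {
  static const int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
  if (m < 3) y -= 1;
  int dow = (y + y / 4 - y / 100 + y / 400 + t[m - 1] + d) % 7;  // 0=Sunday
  return dow == 0 ? 7 : dow;
}

static int DayOfYear(int y, int m, int d) {
  static const int days[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  int doy = d;
  for (int i = 0; i < m - 1; ++i) doy += days[i];
  if (m > 2 && IsLeap(y)) doy += 1;
  return doy;
}

// Returns the IW value; *iso_year gets the IYYY week-numbering year, which
// can differ from the calendar year around the start/end of a year.
static int IsoWeek(int y, int m, int d, int* iso_year) {
  int week = (DayOfYear(y, m, d) - IsoWeekday(y, m, d) + 10) / 7;
  *iso_year = y;
  if (week < 1) {  // belongs to the last ISO week of the previous year
    *iso_year = y - 1;
    return (DayOfYear(y - 1, 12, 31) - IsoWeekday(y - 1, 12, 31) + 10) / 7;
  }
  // A year has 53 ISO weeks only if Jan 1 is a Thursday, or it is a leap
  // year starting on Wednesday; otherwise late December is in week 1.
  if (week == 53 && IsoWeekday(y, 1, 1) != 4 &&
      !(IsLeap(y) && IsoWeekday(y, 1, 1) == 3)) {
    *iso_year = y + 1;
    return 1;
  }
  return week;
}

int main() {
  int iy;
  int w = IsoWeek(2019, 12, 30, &iy);  // 2019-12-30 is Monday of ISO week 1
  printf("IYYY=%d IW=%d\n", iy, w);    // prints IYYY=2020 IW=1
}
```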
||
|
|
30c7a6a18c |
IMPALA-8705: ISO:SQL:2016 datetime patterns - Milestone 3
This patch adds additional datetime format tokens on top of Milestone 1 (IMPALA-8703) and Milestone 2 (IMPALA-8704). The tokens introduced:
- Full month name (MONTH, Month, month): In a string to datetime conversion this token can parse a textual month name into a datetime type. In a datetime to string conversion this token gives the textual representation of a month.
- Short month name (MON, Mon, mon): Similar to the full month name token but this works for 3-character month names like 'JAN'.
- Full day name (DAY, Day, day): In a datetime to string conversion this token gives the textual representation of a day like 'Tuesday'. Not supported in a string to datetime conversion.
- Short day name (DY, Dy, dy): Similar to the full day name token but this works for 3-character day names like 'TUE'. Not supported in a string to datetime conversion.
- Day of week (D): In a datetime to string conversion this gives a number in [1-7] where 1 represents Sunday. Not supported in a string to datetime conversion.
- Quarter of year (Q): In a datetime to string conversion this gives a number in [1-4] representing a quarter of the year. Not supported in a string to datetime conversion.
- Week of year (WW): In a datetime to string conversion this gives a number in [1-53] representing the week of the year, where the first week starts from the 1st of January. Not supported in a string to datetime conversion.
- Week of month (W): In a datetime to string conversion this gives a number in [1-5] representing the week of the month, where the first week starts from the first day of the month. Not supported in a string to datetime conversion.
Change-Id: Ic797f19a1311b54e5d00d01d0a7afe1f0f21fb8f
Reviewed-on: http://gerrit.cloudera.org:8080/14714
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
a862282811 |
IMPALA-8709: Add Damerau-Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate the restricted Damerau-Levenshtein edit distance (optimal string alignment), implemented as dle_dst() and damerau_levenshtein(). If either or both values are NULL, the functions return NULL. This differs from Netezza's dle_dst(), which returns the length of the non-NULL value, or 0 if both values are NULL; the NULL behavior here matches the existing levenshtein() function. Also cleans up the levenshtein tests.
Testing:
- Added unit tests to expr-test.cc
- Manual testing on over 1400 string pairs from http://marvin.cs.uidaho.edu/misspell.html; results match Netezza
Change-Id: Ib759817ec15e7075bf49d51e494e45c8af4db94d
Reviewed-on: http://gerrit.cloudera.org:8080/13794
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
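A sketch of the restricted Damerau-Levenshtein (optimal string alignment) recurrence behind dle_dst(): the standard Levenshtein DP plus one extra case for transposing adjacent characters. The NULL handling lives in the UDF wrapper and is omitted here:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

static int DleDst(const std::string& a, const std::string& b) {
  const int n = a.size(), m = b.size();
  std::vector<std::vector<int>> d(n + 1, std::vector<int>(m + 1, 0));
  for (int i = 0; i <= n; ++i) d[i][0] = i;
  for (int j = 0; j <= m; ++j) d[0][j] = j;
  for (int i = 1; i <= n; ++i) {
    for (int j = 1; j <= m; ++j) {
      int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                          d[i][j - 1] + 1,          // insertion
                          d[i - 1][j - 1] + cost}); // substitution
      if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
        d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);  // transposition
      }
    }
  }
  return d[n][m];
}

int main() {
  // "ca" -> "ac" is one transposition under OSA (plain Levenshtein says 2).
  printf("%d\n", DleDst("ca", "ac"));  // prints 1
}
```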
||
|
|
684a54a89e |
IMPALA-7368: Change supported year range for DATE values to 1..9999
Before this patch the supported year range for the DATE type started with year 0. This contradicts the ANSI SQL standard, which defines the valid DATE value range to be 0001-01-01 to 9999-12-31.
Change-Id: Iefdf1c036834763f52d44d0c39a25a1f04e41e07
Reviewed-on: http://gerrit.cloudera.org:8080/14349
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
983e3a66de |
IMPALA-2138: part 1: initial cleanup
This is a mixed bag of simplifications, debugging improvements and test fixes that came up in the projection work. I had to update some planner tests because some expressions now include their arguments. Various things in the planner tests were stale, so there are spurious changes in the expected output that are ignored by the plan verification.
Change-Id: I75d2c8cab79988300c1a9c6c23d6ccea53da7d23
Reviewed-on: http://gerrit.cloudera.org:8080/14265
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
bca1b43efb |
IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1
This enhancement introduces FORMAT clause for CAST() operator that is
applicable for casts between string types and timestamp types. Instead
of accepting SimpleDateFormat patterns the FORMAT clause supports
datetime patterns following the ISO:SQL:2016 standard.
Note, the CAST() operator without the FORMAT clause still uses
Impala's implementation of SimpleDateFormat handling. Similarly, the
existing conversion functions such as to_timestamp(), from_timestamp()
etc. remain unchanged and use SimpleDateFormat. Contrary to how these
functions work the FORMAT clause must specify a string literal and
cannot be used with any other kind of a string expression.
Milestone 1 contains all the format tokens covered by the SQL
standard. Further milestones will add more functionality on top of
this list to cover functionality provided by other RDBMSs.
List of tokens implemented by this change:
- YYYY, YYY, YY, Y: Year tokens
- RRRR, RR: Round year tokens
- MM: Month (1-12)
- DD: Day (1-31)
- DDD: Day of year (1-366)
- HH, HH12: Hour of day (1-12)
- HH24: Hour of day (0-23)
- MI: Minute (0-59)
- SS: Second (0-59)
- SSSSS: Second of day (0-86399)
- FF, FF1, ..., FF9: Fractional second
- AM, PM, A.M., P.M.: Meridiem indicators
- TZH: Timezone hour (-99 to +99)
- TZM: Timezone minute (0-99)
- Separators: - . / , ' ; : space
- ISO8601 date indicators (T, Z)
Some notes about the matching algorithm:
- The parsing algorithm uses these tokens in a case insensitive
manner.
- The separators are interchangeable with each other. For example a
'-' separator in the format will match with a '.' character in the
input.
- The length of the separator sequences is handled flexibly meaning
that a single separator character in the format for instance would
match with a multi-separator sequence in the input.
- In a string type to timestamp conversion the timezone offset tokens
are parsed, expected to match with the input but they don't adjust
the result as the input is already expected to be in UTC format.
Usage example:
SELECT CAST('01-02-2019' AS TIMESTAMP FORMAT 'MM-DD-YYYY');
SELECT CAST('2019.10.10 13:30:40.123456 +01:30' AS TIMESTAMP
FORMAT 'YYYY-MM-DD HH24:MI:SS.FF9 TZH:TZM');
SELECT CAST(timestamp_column as STRING
FORMAT "YYYY MM HH12 YY") from some_table;
Change-Id: I19d8d097a45ae6f103b6cd1b2d81aad38dfd9e23
Reviewed-on: http://gerrit.cloudera.org:8080/13722
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
151835116a |
IMPALA-7312: Non-blocking mode for Fetch() RPC
Adds the query option FETCH_ROWS_TIMEOUT_MS to control the client timeout when fetching rows. It is set to 10 seconds by default to avoid unnecessary fetch requests. The timeout applies whether result spooling is enabled or disabled.
When result spooling is disabled, the timeout controls how long the client thread will wait for a single RowBatch to be produced by the coordinator fragment. When result spooling is enabled, a client can fetch multiple RowBatches at a time, so the timeout controls the total time spent waiting for RowBatches to be produced.
The timeout applies both to waiting for rows to be sent by the fragment instance thread and to waiting for rows to be materialized (e.g. the time measured by RowMaterializationTimer).
Testing:
* Added new tests to test_fetch.py
* Ran core tests
Change-Id: I331acaba23a65dab43cca48e9dc0dc957b9c632d
Reviewed-on: http://gerrit.cloudera.org:8080/14157
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
1f904719e4 |
IMPALA-7770: SPLIT_PART to support negative indexes
The third parameter of SPLIT_PART (the nth field) now accepts negative values, which search the string backwards.
Testing:
* Added unit tests to expr-test.cc
Change-Id: I2db762989a90bd95661a59eb9c11a29eb2edfafb
Reviewed-on: http://gerrit.cloudera.org:8080/13880
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
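A sketch of the negative-index semantics; out-of-range handling is simplified here and is not necessarily Impala's exact behavior:

```cpp
#include <cstdio>
#include <string>
#include <vector>

static std::string SplitPart(const std::string& s, const std::string& delim,
                             int field) {
  // Split the string on the delimiter.
  std::vector<std::string> parts;
  size_t start = 0, pos;
  while ((pos = s.find(delim, start)) != std::string::npos) {
    parts.push_back(s.substr(start, pos - start));
    start = pos + delim.size();
  }
  parts.push_back(s.substr(start));
  int n = static_cast<int>(parts.size());
  // Positive fields are 1-based from the front; negative fields count from
  // the end (-1 is the last field).
  int idx = field > 0 ? field - 1 : n + field;
  if (idx < 0 || idx >= n) return "";  // simplified out-of-range handling
  return parts[idx];
}

int main() {
  printf("%s\n", SplitPart("a::b::c", "::", -1).c_str());  // prints "c"
  printf("%s\n", SplitPart("a::b::c", "::", 2).c_str());   // prints "b"
}
```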
||
|
|
8db7f27ddd |
IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function
The added functions return the Jaro/Jaro-Winkler similarity/distance of two strings. The algorithm calculates the Jaro similarity of the strings, then (for Jaro-Winkler) adds more weight to the result if there are common prefixes. For more detail, see: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
The algorithm is extended with another optional parameter, the boost threshold: the prefix weight is only applied if the Jaro similarity exceeds the given threshold. By default, its value is 0.7.
The new built-in functions are:
* jaro_distance, jaro_dst
* jaro_similarity, jaro_sim
* jaro_winkler_distance, jw_dst
* jaro_winkler_similarity, jw_sim
Testing:
* Added unit tests to expr-test.cc
* Manual testing over 1400 word pairs from http://marvin.cs.uidaho.edu/misspell.html; results match Apache commons
Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
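A sketch of the Jaro and Jaro-Winkler computations, including the boost threshold described above; this follows the standard textbook formulation, not Impala's code:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

static double JaroSim(const std::string& a, const std::string& b) {
  int n = a.size(), m = b.size();
  if (n == 0 && m == 0) return 1.0;
  int window = std::max(0, std::max(n, m) / 2 - 1);  // match window
  std::vector<bool> amatch(n, false), bmatch(m, false);
  int matches = 0;
  for (int i = 0; i < n; ++i) {
    int lo = std::max(0, i - window), hi = std::min(m - 1, i + window);
    for (int j = lo; j <= hi; ++j) {
      if (!bmatch[j] && a[i] == b[j]) {
        amatch[i] = true; bmatch[j] = true; ++matches; break;
      }
    }
  }
  if (matches == 0) return 0.0;
  int t = 0, k = 0;  // count transpositions among matched characters
  for (int i = 0; i < n; ++i) {
    if (!amatch[i]) continue;
    while (!bmatch[k]) ++k;
    if (a[i] != b[k]) ++t;
    ++k;
  }
  double mm = matches;
  return (mm / n + mm / m + (mm - t / 2.0) / mm) / 3.0;
}

static double JaroWinklerSim(const std::string& a, const std::string& b,
                             double boost_threshold = 0.7) {
  double jaro = JaroSim(a, b);
  if (jaro <= boost_threshold) return jaro;  // prefix weight only above threshold
  int prefix = 0;  // common prefix, capped at 4 characters
  for (size_t i = 0; i < std::min({a.size(), b.size(), size_t(4)}); ++i) {
    if (a[i] != b[i]) break;
    ++prefix;
  }
  return jaro + prefix * 0.1 * (1.0 - jaro);
}

int main() {
  printf("%.4f\n", JaroWinklerSim("MARTHA", "MARHTA"));  // prints 0.9611
}
```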
||
|
|
4cc3ff9c67 |
IMPALA-8176: Convert simple backend tests to the unified executable
This converts tests with trivial main() functions to the unified executable. This means that the code change is strictly removing main() functions and updating the CMakeLists.txt files. Any test that requires a change larger than that will be addressed separately. The only exceptions are:
- exec/incr-stats-util-test.cc requires naming changes to avoid conflicts with util/rle-test.cc
- runtime/decimal-test.cc simplified the naming to make the CMakeLists.txt arguments easier.
The new test libraries are marked STATIC, because they are linked into a single binary (unifiedbetests) and googletest has problems with tests in shared libraries. Converting this set of tests saves about 18GB of disk space for a debug build and saves a minute or two of link time. For any CMakeLists.txt that has unified tests, this adds a comment for each test that is not unified.
Testing:
- Ran backend tests in DEBUG and ASAN modes on Centos7
- Ran backend tests in DEBUG mode on Centos6
Change-Id: I840d0f9b70edb3a7195a2a33b21fd2874d4c52bd
Reviewed-on: http://gerrit.cloudera.org:8080/13515
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
c353cf7a64 |
IMPALA-8713: fix stack overflow in unhex()
Write the results into the output heap buffer instead of into a temporary stack buffer. No additional memory is used because AnyValUtil::FromBuffer() allocated a temporary buffer anyway.
Testing: Added a targeted test to expr-test that caused a crash before this fix.
Change-Id: Ie0c1760511a04c0823fc465cf6e529e9681b2488
Reviewed-on: http://gerrit.cloudera.org:8080/13743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
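A sketch of the fix's shape: decode directly into an appropriately sized output buffer rather than a fixed stack array, so a huge input cannot overflow the stack. Buffer management and error handling are simplified relative to the UDF machinery:

```cpp
#include <cstdio>
#include <string>

static int HexDigit(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

static std::string Unhex(const std::string& in) {
  std::string out(in.size() / 2, '\0');  // heap output, sized up front
  for (size_t i = 0; i + 1 < in.size(); i += 2) {
    int hi = HexDigit(in[i]), lo = HexDigit(in[i + 1]);
    if (hi < 0 || lo < 0) return "";     // reject invalid input (simplified)
    out[i / 2] = static_cast<char>(hi * 16 + lo);  // write straight to output
  }
  return out;
}

int main() {
  printf("%s\n", Unhex("696d70616c61").c_str());  // prints "impala"
}
```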
||
|
|
dbba52c77c |
IMPALA-8665: Include extra info in error message when date cast fails
This change extends the error message Impala yields when casting STRING
to DATE (explicitly or implicitly) fails. The new error message includes
the violating string value.
Testing:
- changes to date-partitioning.test & date.test
- query_test/test_date_queries.py passed
Example:
select cast('20' as date);
ERROR: UDF ERROR: String to Date parse failed. Invalid string val: "20"
Change-Id: If800b7696515cd61afee27220c55ff2440a86f04
Reviewed-on: http://gerrit.cloudera.org:8080/13680
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
f40935a30e |
IMPALA-7369: part 2: Add INTERVAL expr support and built-in functions for DATE
This change implements INTERVAL expression support for the DATE type and adds several DATE related built-in functions. The efficiency of the DateValue::ToYearMonthDay() function used in many of the built-in functions below was also improved.
The following functions are supported in Hive:
INT YEAR(DATE d)
Extracts the year of the 'd' date, returns it as an int in 0-9999 range.
INT MONTH(DATE d)
Extracts the month of the 'd' date and returns it as an int in 1-12 range.
INT DAY(DATE d), INT DAYOFMONTH(DATE d)
Extracts the day-of-month of the 'd' date and returns it as an int in 1-31 range.
INT QUARTER(DATE d)
Extracts the quarter of the 'd' date and returns it as an int in 1-4 range.
INT DAYOFWEEK(DATE d)
Extracts the day-of-week of the 'd' date and returns it as an int in 1-7 range. 1 is Sunday and 7 is Saturday.
INT DAYOFYEAR(DATE d)
Extracts the day-of-year of the 'd' date and returns it as an int in 1-366 range.
INT WEEKOFYEAR(DATE d)
Extracts the week-of-year of the 'd' date and returns it as an int in 1-53 range.
STRING DAYNAME(DATE d)
Returns the day field from the 'd' date, converted to the string corresponding to that day name. The range of return values is "Sunday" to "Saturday".
STRING MONTHNAME(DATE d)
Returns the month field from the 'd' date, converted to the string corresponding to that month name. The range of return values is "January" to "December".
DATE NEXT_DAY(DATE d, STRING weekday)
Returns the first date which is later than 'd' and named 'weekday'. 'weekday' is the 3-letter or full name of the day of the week.
DATE LAST_DAY(DATE d)
Returns the last day of the month which the 'd' date belongs to.
INT DATEDIFF(DATE d1, DATE d2)
Returns the number of days from the 'd1' date to the 'd2' date.
DATE CURRENT_DATE()
Returns the current date (in the local time zone).
INT INT_MONTHS_BETWEEN(DATE d1, DATE d2)
Returns the number of months between the 'd1' and 'd2' dates, as an int representing only the full months that passed. If 'd1' represents an earlier date than 'd2', the result is negative.
DOUBLE MONTHS_BETWEEN(DATE d1, DATE d2)
Returns the number of months between the 'd1' and 'd2' dates. Can include a fractional part representing extra days in addition to the full months between the dates. The fractional component is computed by dividing the difference in days by 31 (regardless of the month). If 'd1' represents an earlier date than 'd2', the result is negative.
DATE ADD_YEARS(DATE d, INT/BIGINT num_years), DATE SUB_YEARS(DATE d, INT/BIGINT num_years)
Adds/subtracts a specified number of years to/from a 'd' date value.
DATE ADD_MONTHS(DATE d, INT/BIGINT num_months), DATE SUB_MONTHS(DATE d, INT/BIGINT num_months)
Adds/subtracts a specified number of months to/from a date value. If 'd' is the last day of a month, the returned date will fall on the last day of the target month too.
DATE ADD_DAYS(DATE d, INT/BIGINT num_days), DATE SUB_DAYS(DATE d, INT/BIGINT num_days)
Adds/subtracts a specified number of days to/from a date value.
DATE ADD_WEEKS(DATE d, INT/BIGINT num_weeks), DATE SUB_WEEKS(DATE d, INT/BIGINT num_weeks)
Adds/subtracts a specified number of weeks to/from a date value.
The following function doesn't exist in Hive but is supported by Amazon Redshift:
INT DATE_CMP(DATE d1, DATE d2)
Compares the 'd1' and 'd2' dates. Returns:
1. NULL, if either 'd1' or 'd2' is NULL
2. -1 if d1 < d2
3. 1 if d1 > d2
4. 0 if d1 == d2
(https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_CMP.html)
Change-Id: If404bffdaf055c769e79ffa8f193bac415cfdd1a
Reviewed-on: http://gerrit.cloudera.org:8080/13648
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
45c6c46bf6 |
IMPALA-5031: signed overflow is undefined behavior
Fix remaining signed overflow undefined behaviors in end-to-end
tests. The interesting part of the backtraces:
exprs/aggregate-functions-ir.cc:464:25: runtime error: signed
integer overflow: 0x5a4728ca063b522c0b728f8000000000 +
0x3c2f7086aed236c807a1b50000000000 cannot be represented in
type '__int128'
#0 AggregateFunctions::DecimalAvgMerge(
impala_udf::FunctionContext*, impala_udf::StringVal const&,
impala_udf::StringVal*) exprs/aggregate-functions-ir.cc:464:25
#1 AggFnEvaluator::Update(TupleRow const*, Tuple*, void*)
exprs/agg-fn-evaluator.cc:327:7
#2 AggFnEvaluator::Add(TupleRow const*, Tuple*)
exprs/agg-fn-evaluator.h:257:3
#3 Aggregator::UpdateTuple(AggFnEvaluator**, Tuple*, TupleRow*, bool)
exec/aggregator.cc:167:24
#4 NonGroupingAggregator::AddBatchImpl(RowBatch*)
exec/non-grouping-aggregator-ir.cc:27:5
#5 NonGroupingAggregator::AddBatch(RuntimeState*, RowBatch*)
exec/non-grouping-aggregator.cc:124:45
#6 AggregationNode::Open(RuntimeState*)
exec/aggregation-node.cc:70:57
exprs/aggregate-functions-ir.cc:513:12: runtime error: signed
integer overflow: -8282081183197145958 + -4473782455107795527
cannot be represented in type 'long'
#0 void AggregateFunctions::SumUpdate<impala_udf::BigIntVal,
impala_udf::BigIntVal>(impala_udf::FunctionContext*,
impala_udf::BigIntVal const&, impala_udf::BigIntVal*)
exprs/aggregate-functions-ir.cc:513:12
#1 AggFnEvaluator::Update(TupleRow const*, Tuple*, void*)
exprs/agg-fn-evaluator.cc:327:7
#2 AggFnEvaluator::Add(TupleRow const*, Tuple*)
exprs/agg-fn-evaluator.h:257:3
#3 Aggregator::UpdateTuple(AggFnEvaluator**, Tuple*, TupleRow*,
bool) exec/aggregator.cc:167:24
#4 NonGroupingAggregator::AddBatchImpl(RowBatch*)
exec/non-grouping-aggregator-ir.cc:27:5
#5 NonGroupingAggregator::AddBatch(RuntimeState*, RowBatch*)
exec/non-grouping-aggregator.cc:124:45
#6 AggregationNode::Open(RuntimeState*)
exec/aggregation-node.cc:70:57
exprs/aggregate-functions-ir.cc:585:14: runtime error: signed
integer overflow: 0x5a4728ca063b522c0b728f8000000000 +
0x3c2f7086aed236c807a1b50000000000 cannot be represented in
type '__int128'
#0 AggregateFunctions::SumDecimalMerge(
impala_udf::FunctionContext*, impala_udf::DecimalVal const&,
impala_udf::DecimalVal*) exprs/aggregate-functions-ir.cc:585:14
#1 AggFnEvaluator::Update(TupleRow const*, Tuple*, void*)
exprs/agg-fn-evaluator.cc:327:7
#2 AggFnEvaluator::Add(TupleRow const*, Tuple*)
exprs/agg-fn-evaluator.h:257:3
#3 Aggregator::UpdateTuple(AggFnEvaluator**, Tuple*, TupleRow*, bool)
exec/aggregator.cc:167:24
#4 NonGroupingAggregator::AddBatchImpl(RowBatch*)
exec/non-grouping-aggregator-ir.cc:27:5
#5 NonGroupingAggregator::AddBatch(RuntimeState*, RowBatch*)
exec/non-grouping-aggregator.cc:124:45
#6 AggregationNode::Open(RuntimeState*)
exec/aggregation-node.cc:70:57
runtime/decimal-value.inline.h:145:12: runtime error: signed
integer overflow: 18 * 0x0785ee10d5da46d900f436a000000000 cannot
be represented in type '__int128'
#0 DecimalValue<__int128>::ScaleTo(int, int, int, bool*) const
runtime/decimal-value.inline.h:145:12
#1 DecimalOperators::ScaleDecimalValue(
impala_udf::FunctionContext*, DecimalValue<int> const&, int,
int, int) exprs/decimal-operators-ir.cc:132:41
#2 DecimalOperators::RoundDecimal(impala_udf::FunctionContext*,
impala_udf::DecimalVal const&, int, int, int, int,
DecimalOperators::DecimalRoundOp const&)
exprs/decimal-operators-ir.cc:465:16
#3 DecimalOperators::RoundDecimal(impala_udf::FunctionContext*,
impala_udf::DecimalVal const&, DecimalOperators::DecimalRoundOp
const&) exprs/decimal-operators-ir.cc:519:10
#4 DecimalOperators::CastToDecimalVal(
impala_udf::FunctionContext*, impala_udf::DecimalVal const&)
exprs/decimal-operators-ir.cc:529:10
#5 impala_udf::DecimalVal ScalarFnCall::InterpretEval
<impala_udf::DecimalVal>(ScalarExprEvaluator*, TupleRow const*)
const exprs/scalar-fn-call.cc:485:208
#6 ScalarFnCall::GetDecimalVal(ScalarExprEvaluator*, TupleRow
const*) const exprs/scalar-fn-call.cc:618:44
#7 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow
const*) exprs/scalar-expr-evaluator.cc:321:27
#8 ScalarExprEvaluator::GetValue(TupleRow const*)
exprs/scalar-expr-evaluator.cc:251:10
#9 Java_org_apache_impala_service_FeSupport_NativeEvalExprsWithoutRow
service/fe-support.cc:246:26
#10 (<unknown module>)
runtime/multi-precision.h:116:21: runtime error: negation of
0x80000000000000000000000000000000 cannot be represented in
type 'int128_t' (aka '__int128'); cast to an unsigned type to
negate this value to itself
#0 ConvertToInt128(boost::multiprecision::number
<boost::multiprecision::backends::cpp_int_backend<256u, 256u,
(boost::multiprecision::cpp_integer_type)1,
(boost::multiprecision::cpp_int_check_type)0, void>,
(boost::multiprecision::expression_template_option)0>,
__int128, bool*) runtime/multi-precision.h:116:21
#1 DecimalValue<__int128>
DecimalValue<__int128>::Multiply<__int128>(int,
DecimalValue<__int128> const&, int, int, int, bool, bool*) const
runtime/decimal-value.inline.h:438:16
#2 DecimalOperators::Multiply_DecimalVal_DecimalVal(
impala_udf::FunctionContext*, impala_udf::DecimalVal const&,
impala_udf::DecimalVal const&)
exprs/decimal-operators-ir.cc:859:3336
#3 impala_udf::DecimalVal ScalarFnCall::InterpretEval
<impala_udf::DecimalVal>(ScalarExprEvaluator*, TupleRow const*)
const exprs/scalar-fn-call.cc:485:376
#4 ScalarFnCall::GetDecimalVal(ScalarExprEvaluator*, TupleRow
const*) const exprs/scalar-fn-call.cc:618:44
#5 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow
const*) exprs/scalar-expr-evaluator.cc:321:27
#6 ScalarExprEvaluator::GetValue(TupleRow const*)
exprs/scalar-expr-evaluator.cc:251:10
#7 Java_org_apache_impala_service_FeSupport_NativeEvalExprsWithoutRow
service/fe-support.cc:246:26
#8 (<unknown module>)
util/runtime-profile-counters.h:194:24: runtime error: signed
integer overflow: -1263418397011577524 + -9223370798768111350
cannot be represented in type 'long'
#0 RuntimeProfile::AveragedCounter::UpdateCounter
(RuntimeProfile::Counter*)
util/runtime-profile-counters.h:194:24
#1 RuntimeProfile::UpdateAverage(RuntimeProfile*)
util/runtime-profile.cc:199:20
#2 RuntimeProfile::UpdateAverage(RuntimeProfile*)
util/runtime-profile.cc:245:14
#3 Coordinator::BackendState::UpdateExecStats
(vector<Coordinator::FragmentStats*,
allocator<Coordinator::FragmentStats*> > const&)
runtime/coordinator-backend-state.cc:429:22
#4 Coordinator::ComputeQuerySummary()
runtime/coordinator.cc:775:20
#5 Coordinator::HandleExecStateTransition(Coordinator::ExecState,
Coordinator::ExecState) runtime/coordinator.cc:567:3
#6 Coordinator::SetNonErrorTerminalState(Coordinator::ExecState)
runtime/coordinator.cc:484:3
#7 Coordinator::GetNext(QueryResultSet*, int, bool*)
runtime/coordinator.cc:657:53
#8 ClientRequestState::FetchRowsInternal(int, QueryResultSet*)
service/client-request-state.cc:943:34
#9 ClientRequestState::FetchRows(int, QueryResultSet*)
service/client-request-state.cc:835:36
#10 ImpalaServer::FetchInternal(TUniqueId const&, bool, int,
beeswax::Results*) service/impala-beeswax-server.cc:545:40
#11 ImpalaServer::fetch(beeswax::Results&, beeswax::QueryHandle
const&, bool, int) service/impala-beeswax-server.cc:178:19
#12 beeswax::BeeswaxServiceProcessor::process_fetch(int,
apache::thrift::protocol::TProtocol*,
apache::thrift::protocol::TProtocol*, void*)
generated-sources/gen-cpp/BeeswaxService.cpp:3398:13
#13 beeswax::BeeswaxServiceProcessor::dispatchCall
(apache::thrift::protocol::TProtocol*,
apache::thrift::protocol::TProtocol*, string const&, int,
void*) generated-sources/gen-cpp/BeeswaxService.cpp:3200:3
#14 ImpalaServiceProcessor::dispatchCall
(apache::thrift::protocol::TProtocol*,
apache::thrift::protocol::TProtocol*, string const&, int,
void*) generated-sources/gen-cpp/ImpalaService.cpp:1824:48
#15 apache::thrift::TDispatchProcessor::process
(boost::shared_ptr<apache::thrift::protocol::TProtocol>,
boost::shared_ptr<apache::thrift::protocol::TProtocol>, void*)
toolchain/thrift-0.9.3-p5/include/thrift/TDispatchProcessor.h:121:12
Change-Id: I73dd6802ec1023275d09a99a2950f3558313fc8e
Reviewed-on: http://gerrit.cloudera.org:8080/13437
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
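The general remedies for this class of undefined behavior, sketched below: overflow-checked compiler builtins or arithmetic in the unsigned domain. This is a sketch of the techniques, not necessarily what each call site above uses:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  int64_t a = INT64_MAX, b = 1, sum = 0;
  // Option 1: GCC/Clang overflow-checked builtin reports the overflow
  // instead of invoking undefined behavior.
  if (__builtin_add_overflow(a, b, &sum)) {
    printf("overflow detected\n");
  }
  // Option 2: do the arithmetic on unsigned types, where wraparound is
  // well-defined, then convert back (two's complement in practice, and
  // guaranteed from C++20 onward).
  int64_t wrapped = static_cast<int64_t>(
      static_cast<uint64_t>(a) + static_cast<uint64_t>(b));
  printf("%lld\n", static_cast<long long>(wrapped));  // INT64_MIN
}
```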
||
|
|
f0678b06e6 |
IMPALA-7369: part 1: Implement TRUNC, DATE_TRUNC, EXTRACT, DATE_PART functions for DATE
These functions are somewhat similar in that each of them takes a DATE
argument and a time unit to work with.
They work identically to the corresponding TIMESTAMP functions. The
only difference is that the DATE functions don't accept time-of-day
units.
TRUNC(DATE d, STRING unit)
Truncates a DATE value to the specified time unit. The 'unit' argument
is case insensitive. This argument string can be one of:
SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year.
Q: Quarter.
MONTH, MON, MM, RM: Month.
DDD, DD, J: Day.
DAY, DY, D: Starting day (Monday) of the week.
WW: Truncates to the most recent date, no later than 'd', which is
on the same day of the week as the first day of year.
W: Truncates to the most recent date, no later than 'd', which is on
the same day of the week as the first day of month.
The implementation mirrors Impala's TRUNC(TIMESTAMP ts, STRING unit)
function. Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions201.htm
.
DATE_TRUNC(STRING unit, DATE d)
Truncates a DATE value to the specified precision. The 'unit' argument
is case insensitive. This argument string can be one of: DAY, WEEK,
MONTH, YEAR, DECADE, CENTURY, MILLENNIUM.
The implementation mirrors Impala's DATE_TRUNC(STRING unit,
TIMESTAMP ts) function. Vertica has a similar function too.
Reference:
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/
SQLReferenceManual/Functions/Date-Time/DATE_TRUNC.htm
.
EXTRACT(DATE d, STRING unit), EXTRACT(unit FROM DATE d)
Returns one of the numeric date fields from a DATE value. The 'unit'
string can be one of YEAR, QUARTER, MONTH, DAY. This argument value is
case-insensitive.
The implementation mirrors that of Impala's EXTRACT(TIMESTAMP ts,
STRING unit). Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
.
DATE_PART(STRING unit, DATE date)
Similar to EXTRACT(), with the argument order reversed. Supports the
same date units as EXTRACT().
The implementation mirrors Impala's DATE_PART(STRING unit,
TIMESTAMP ts) function.
Change-Id: I843358a45eb5faa2c134994600546fc1d0a797c8
Reviewed-on: http://gerrit.cloudera.org:8080/13363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
95a1da2d32 |
IMPALA-8578: part 2: move metrics code to .cc files
This moves a lot of metric function definitions into .cc files, to reduce the size of compilation units and to reduce the frequency of recompilation required when making changes to metrics. This moves most of the large, non-perf-critical metric functions into .cc files. For template classes, this requires explicitly instantiating all combinations of template parameters that are used in Impala, including in tests.
Disabled the weak-template-vtables warning because of spurious warnings on template instantiations. See https://bugs.llvm.org/show_bug.cgi?id=18733
Change-Id: I78ad045ded6e6a7b7524711be9302c26115b97b9
Reviewed-on: http://gerrit.cloudera.org:8080/13501
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
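A minimal sketch of the explicit-instantiation pattern this relies on; the names are illustrative, not Impala's metric classes, and the snippet compiles as a single translation unit:

```cpp
#include <cstdint>

// --- metric.h (sketch) ---
template <typename T>
class SimpleMetric {
 public:
  void SetValue(const T& v);  // declared only; definition lives in the .cc
 private:
  T value_{};
};

// --- metric.cc (sketch) ---
template <typename T>
void SimpleMetric<T>::SetValue(const T& v) { value_ = v; }

// Explicit instantiations: every parameterization used anywhere in the
// codebase (including tests) must be listed here, or other translation
// units fail to link against SetValue().
template class SimpleMetric<int64_t>;
template class SimpleMetric<double>;
template class SimpleMetric<bool>;
```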
||
|
|
d4648e87b4 |
IMPALA-4356,IMPALA-7331: codegen all ScalarExprs
Based on initial draft patch by Pooja Nilangekar.
Codegen'd expressions can be executed in two ways - either by
being called directly from a fully codegend function, or from
interpreted code via a function pointer (previously
ScalarFnCall::scalar_fn_wrapper_).
This change moves the function pointer from ScalarFnCall to its
base class ScalarExpr, so the full expr tree can be codegen'd, not
just the ScalarFnCall subtrees. The key refactoring and improvements
are:
* ScalarExpr::Get*Val() switches between interpreted and the codegen'd
function pointer code paths in an inline function, avoiding a
virtual function call to ScalarFnCall::Get*Val().
* Boilerplate logic is moved to ScalarExpr::GetCodegendComputeFn(),
which calls a virtual function GetCodegenComputeFnImpl().
* ScalarFnCall's logic for deciding whether to interpret or codegen is
better abstracted and exposed to ScalarExpr as IsInterpretable()
and ShouldCodegen() methods.
* The ScalarExpr::codegend_compute_fn_ function pointer is only
populated for expressions that are "codegen entry points". These
include the roots of expr trees and non-root expressions
where the parent expression calls Get*Val() from the
pseudo-codegend GetCodegendComputeFnWrapper().
* ScalarFnCall is always initialised for interpreted execution.
Otherwise the function pointer is needed for non-root expressions,
e.g. to support ScalarExprEvaluator::GetConstantVal().
* Latent bugs/gaps for codegen of CollectionVal are fixed. CollectionVal
is modified to use the StringVal memory layout to allow code sharing
with StringVal. These fixes allowed simplification of
IsNotEmptyPredicate codegen (from IMPALA-7657).
I chose to tackle two problems in one change - adding support for
generating codegen'd function pointers for all ScalarExprs, and adding
the "entry point" concept - to avoid a blow-up in the number of
codegen'd entry points that could lead to longer codegen times and/or
worse code because of inlining changes.
IMPALA-7331 (CHAR codegen support functions) is also fixed because
it was simpler to enable CHAR codegen within ScalarExpr than to carry
forward the existing CHAR workarounds from ScalarFnCall. The
CHAR-specific codegen support required in the scalar expr subsystem is
very limited. StringVal intermediates are used everywhere. Only
SlotRef actually operates on the different tuple layout, and the
required codegen support for SlotRef already exists for UDA
intermediates anyway.
Testing:
* Ran exhaustive tests.
Perf:
* Ran a basic insert benchmark, which went from 10.1s to 7.6s
create table foo stored as parquet as
select case when l_orderkey % 2 = 0 then 'aaa' else 'bbb' end
from tpch30_parquet.lineitem;
* Ran a basic CHAR expr test:
set num_nodes=1;
set mt_dop=1;
select count(*) from lineitem
where cast(l_linestatus as CHAR(2)) = 'O ' and
cast(l_returnflag as CHAR(2)) = 'N '
The time spent in the scan went from 520ms to 220ms.
* Added perf regression test to tpcds-insert, similar to the manual
benchmark.
* Ran single-node TPC-H with large and small scale factors, to estimate
impact on execution perf and query startup time, respectively.
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 6.84 | -0.18% | 4.49 | -0.31% |
+----------+-----------------------+---------+------------+------------+----------------+
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCH(30) | TPCH-Q20 | parquet / none / none | 2.58 | 2.47 | +4.18% | 1.29% | 0.88% | 5 | +4.12% | 2.31 | 5.81 |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 4.81 | 4.61 | +4.33% | 2.18% | 2.15% | 5 | +3.91% | 1.73 | 3.09 |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 26.45 | 26.16 | +1.09% | 0.37% | 0.50% | 5 | +1.36% | 2.02 | 3.94 |
| TPCH(30) | TPCH-Q9 | parquet / none / none | 15.92 | 15.75 | +1.09% | 2.87% | 1.65% | 5 | +0.88% | 0.29 | 0.73 |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.38 | 2.35 | +1.12% | 1.64% | 1.11% | 5 | +0.80% | 1.15 | 1.26 |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.94 | 2.91 | +1.13% | 7.68% | 5.37% | 5 | -0.34% | -0.29 | 0.27 |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 18.10 | 18.02 | +0.42% | 2.70% | 0.56% | 5 | +0.28% | 0.29 | 0.34 |
| TPCH(30) | TPCH-Q8 | parquet / none / none | 4.72 | 4.72 | -0.04% | 1.20% | 1.65% | 5 | +0.05% | 0.00 | -0.04 |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.92 | 3.93 | -0.26% | 1.08% | 2.36% | 5 | +0.20% | 0.58 | -0.23 |
| TPCH(30) | TPCH-Q6 | parquet / none / none | 1.27 | 1.27 | -0.28% | 0.22% | 0.88% | 5 | +0.09% | 0.29 | -0.68 |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.64 | 2.65 | -0.45% | 1.65% | 0.65% | 5 | -0.24% | -0.58 | -0.57 |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 3.10 | 3.13 | -0.76% | 1.47% | 1.12% | 5 | -0.21% | -0.29 | -0.93 |
| TPCH(30) | TPCH-Q2 | parquet / none / none | 1.20 | 1.21 | -0.80% | 2.26% | 2.47% | 5 | -0.82% | -1.15 | -0.53 |
| TPCH(30) | TPCH-Q4 | parquet / none / none | 1.97 | 1.99 | -1.37% | 1.84% | 3.21% | 5 | -0.47% | -0.58 | -0.83 |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 11.53 | 11.63 | -0.91% | 0.46% | 0.49% | 5 | -0.95% | -2.02 | -3.08 |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 5.13 | 5.21 | -1.51% | 2.24% | 4.05% | 5 | -0.94% | -0.58 | -0.73 |
| TPCH(30) | TPCH-Q5 | parquet / none / none | 3.61 | 3.66 | -1.40% | 0.66% | 0.79% | 5 | -1.33% | -1.73 | -3.05 |
| TPCH(30) | TPCH-Q7 | parquet / none / none | 19.42 | 19.71 | -1.52% | 1.34% | 1.39% | 5 | -1.22% | -1.44 | -1.76 |
| TPCH(30) | TPCH-Q3 | parquet / none / none | 5.08 | 5.15 | -1.49% | 1.34% | 0.73% | 5 | -1.35% | -1.44 | -2.20 |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 3.42 | 3.49 | -1.92% | 0.93% | 1.47% | 5 | -1.53% | -1.15 | -2.49 |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.15 | 1.19 | -3.17% | 2.27% | 1.95% | 5 | -4.21% | -1.15 | -2.41 |
| TPCH(30) | TPCH-Q1 | parquet / none / none | 9.26 | 9.63 | -3.85% | 0.62% | 0.59% | 5 | -3.78% | -2.31 | -10.25 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
Cluster Name: UNKNOWN
Lab Run Info: UNKNOWN
Impala Version: impalad version 3.2.0-SNAPSHOT RELEASE ()
Baseline Impala Version: impalad version 3.2.0-SNAPSHOT RELEASE (2019-03-19)
+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(2) | parquet / none / none | 0.90 | -0.08% | 0.80 | -0.05% |
+----------+-----------------------+---------+------------+------------+----------------+
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(2) | TPCH-Q18 | parquet / none / none | 1.22 | 1.19 | +1.93% | 3.81% | 4.46% | 20 | +3.34% | 1.62 | 1.46 |
| TPCH(2) | TPCH-Q10 | parquet / none / none | 0.74 | 0.73 | +1.97% | 3.36% | 2.94% | 20 | +0.97% | 1.88 | 1.95 |
| TPCH(2) | TPCH-Q11 | parquet / none / none | 0.49 | 0.48 | +1.91% | 6.19% | 4.64% | 20 | +0.25% | 0.95 | 1.09 |
| TPCH(2) | TPCH-Q4 | parquet / none / none | 0.43 | 0.43 | +1.99% | 6.26% | 5.86% | 20 | +0.15% | 0.92 | 1.03 |
| TPCH(2) | TPCH-Q15 | parquet / none / none | 0.50 | 0.49 | +1.82% | 7.32% | 6.35% | 20 | +0.26% | 1.01 | 0.83 |
| TPCH(2) | TPCH-Q1 | parquet / none / none | 0.98 | 0.97 | +0.79% | 4.64% | 2.73% | 20 | +0.36% | 0.77 | 0.65 |
| TPCH(2) | TPCH-Q19 | parquet / none / none | 0.83 | 0.83 | +0.65% | 3.33% | 2.80% | 20 | +0.44% | 2.18 | 0.67 |
| TPCH(2) | TPCH-Q14 | parquet / none / none | 0.62 | 0.62 | +0.97% | 2.86% | 1.00% | 20 | +0.04% | 0.13 | 1.42 |
| TPCH(2) | TPCH-Q3 | parquet / none / none | 0.88 | 0.87 | +0.57% | 2.17% | 1.74% | 20 | +0.29% | 1.15 | 0.92 |
| TPCH(2) | TPCH-Q12 | parquet / none / none | 0.53 | 0.53 | +0.27% | 4.58% | 5.78% | 20 | +0.46% | 1.47 | 0.16 |
| TPCH(2) | TPCH-Q17 | parquet / none / none | 0.72 | 0.72 | +0.15% | 3.64% | 5.55% | 20 | +0.21% | 0.86 | 0.10 |
| TPCH(2) | TPCH-Q21 | parquet / none / none | 2.05 | 2.05 | +0.21% | 1.99% | 2.37% | 20 | +0.01% | 0.25 | 0.30 |
| TPCH(2) | TPCH-Q5 | parquet / none / none | 1.28 | 1.27 | +0.24% | 1.61% | 1.80% | 20 | -0.02% | -0.57 | 0.44 |
| TPCH(2) | TPCH-Q13 | parquet / none / none | 1.27 | 1.27 | -0.34% | 1.69% | 1.83% | 20 | -0.20% | -1.65 | -0.61 |
| TPCH(2) | TPCH-Q7 | parquet / none / none | 1.72 | 1.73 | -0.55% | 2.40% | 1.69% | 20 | -0.03% | -0.42 | -0.83 |
| TPCH(2) | TPCH-Q8 | parquet / none / none | 1.27 | 1.28 | -0.68% | 3.10% | 3.89% | 20 | -0.06% | -0.54 | -0.62 |
| TPCH(2) | TPCH-Q6 | parquet / none / none | 0.36 | 0.36 | -0.84% | 0.79% | 3.51% | 20 | -0.07% | -0.36 | -1.04 |
| TPCH(2) | TPCH-Q2 | parquet / none / none | 0.65 | 0.65 | -1.17% | 4.76% | 5.99% | 20 | -0.05% | -0.25 | -0.69 |
| TPCH(2) | TPCH-Q9 | parquet / none / none | 1.59 | 1.62 | -2.01% | 1.45% | 5.12% | 20 | -0.16% | -1.24 | -1.69 |
| TPCH(2) | TPCH-Q20 | parquet / none / none | 0.68 | 0.69 | -1.73% | 4.35% | 4.43% | 20 | -0.49% | -1.74 | -1.25 |
| TPCH(2) | TPCH-Q22 | parquet / none / none | 0.38 | 0.40 | -2.89% | 7.42% | 6.39% | 20 | -0.21% | -0.66 | -1.34 |
| TPCH(2) | TPCH-Q16 | parquet / none / none | 0.59 | 0.62 | -4.01% | 6.33% | 5.83% | 20 | -4.72% | -1.39 | -2.13 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
Change-Id: I839d7a3a2f5e1309c33a1f66013ef11628c5dc11
Reviewed-on: http://gerrit.cloudera.org:8080/12797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
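For illustration, a stripped-down sketch of the dispatch scheme: an inline Get*Val() that calls through the codegen'd function pointer when it was populated (a "codegen entry point") and falls back to the virtual interpreted path otherwise. Names and types are simplified stand-ins:

```cpp
#include <cstdio>

struct TupleRow;  // opaque here; only used via pointer

class ScalarExpr {
 public:
  using CodegendFn = long (*)(const ScalarExpr*, const TupleRow*);

  // Inline dispatch: avoids a virtual call when codegen'd code is available.
  long GetBigIntVal(const TupleRow* row) const {
    if (codegend_compute_fn_ != nullptr) return codegend_compute_fn_(this, row);
    return GetBigIntValInterpreted(row);  // virtual, interpreted fallback
  }

 protected:
  virtual long GetBigIntValInterpreted(const TupleRow* row) const = 0;
  CodegendFn codegend_compute_fn_ = nullptr;  // set only for entry points
};

class LiteralExpr : public ScalarExpr {
 protected:
  long GetBigIntValInterpreted(const TupleRow*) const override { return 42; }
};

int main() {
  LiteralExpr e;
  printf("%ld\n", e.GetBigIntVal(nullptr));  // interpreted path: prints 42
}
```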
||
|
|
e2ead7f857 |
expr-test: use gtest parameterization
Instead of running the tests three times with different flags from main(), this uses gtest's parameterization feature to accomplish the same. The advantage here is that we end up with different test names for each of the runs.
Additionally, this moves the setup code into a proper setup method so that executing expr-test --gtest_list_tests doesn't waste time starting a cluster.
This is prep work towards adding multi-threaded test execution for long-running tests. expr-test seems to currently be one of the worst offenders.
Change-Id: Idc9fb24ad62b4aa2e120a99d74ae04bb221c034b
Reviewed-on: http://gerrit.cloudera.org:8080/13289
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
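A minimal sketch of the gtest pattern (recent googletest macro names; the mode enum and fixture are illustrative, not expr-test's actual code):

```cpp
#include "gtest/gtest.h"

enum class CodegenMode { kInterpreted, kCodegen, kCodegenNoOptimizations };

class ExprTestSketch : public testing::TestWithParam<CodegenMode> {
 protected:
  void SetUp() override {
    // Per-instance setup runs here, so `--gtest_list_tests` alone does not
    // pay the cost of starting a cluster.
  }
};

TEST_P(ExprTestSketch, BasicArithmetic) {
  CodegenMode mode = GetParam();
  (void)mode;  // configure query options from 'mode', then evaluate exprs
  EXPECT_EQ(4, 2 + 2);
}

// Produces distinct names like Instantiations/ExprTestSketch.BasicArithmetic/0
// instead of one name run three times.
INSTANTIATE_TEST_SUITE_P(Instantiations, ExprTestSketch,
                         testing::Values(CodegenMode::kInterpreted,
                                         CodegenMode::kCodegen,
                                         CodegenMode::kCodegenNoOptimizations));
```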
||
|
|
d423979866 |
IMPALA-5843: Use page index in Parquet files to skip pages
This commit implements page filtering based on the Parquet page index.
The read and evaluation of the page index is done by the
HdfsParquetScanner. At first we determine the row ranges we are
interested in, and based on the row ranges we determine the candidate
pages for each column that we are reading. We still issue one
ScanRange per column chunk, but we specify sub-ranges that store the
candidate pages, i.e. we don't read the whole column chunk, but only
fractions of it.
Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B. This
means we need to implement some kind of row-skipping logic when we
read the data pages. This logic is implemented in
BaseScalarColumnReader and ScalarColumnReader. Collection column
readers know nothing about page filtering.
Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.
Testing:
* added some unit tests for the row range and page selection logic
* generated various Parquet files with Parquet-MR
* enabled page index writing and wrote selective queries against
  tables written by Impala. Current tests are likely to use page
  filtering transparently.
Performance:
* Measured locally, observed 3x to 20x speedup for selective queries.
  The speedup was proportional to the number of I/O operations that
  needed to be done.
* The TPCH benchmark didn't show a significant performance change.
  This is not a surprise, since the data is not sorted in any useful
  way, so the main goal was to not introduce a perf regression.
TODO:
* measure performance for remote reads
Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Reviewed-on: http://gerrit.cloudera.org:8080/12065
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
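To illustrate the row-skipping idea only (a standalone sketch; the real logic lives in HdfsParquetScanner and the scalar column readers, and SkipRows/ReadRows are hypothetical stand-ins):

#include <cstdint>
#include <cstdio>
#include <vector>

// A candidate row range selected via the page index (inclusive bounds).
struct RowRange { int64_t first_row; int64_t last_row; };

// Stand-ins for the column reader primitives.
static void SkipRows(int64_t n) { std::printf("skip %lld rows\n", (long long)n); }
static void ReadRows(int64_t a, int64_t b) {
  std::printf("read rows %lld..%lld\n", (long long)a, (long long)b);
}

// Pages are not aligned across columns, so each column reader walks the
// shared candidate row ranges and skips the gaps between them.
static void ReadColumnWithPageFiltering(const std::vector<RowRange>& ranges) {
  int64_t current_row = 0;
  for (const RowRange& r : ranges) {
    if (r.first_row > current_row) SkipRows(r.first_row - current_row);
    ReadRows(r.first_row, r.last_row);
    current_row = r.last_row + 1;
  }
}

int main() {
  ReadColumnWithPageFiltering({{10, 19}, {50, 59}});  // skips rows 0-9, 20-49
  return 0;
}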
||
|
|
1e49b6a6b4 |
IMPALA-2029. Implement our own getJNIEnv equivalent
The libhdfs getJNIEnv function was made non-exported in Hadoop 2. For
a while in CDH we were hacking around this with a vendor-specific
patch that re-exported it. However, it was always a bit annoying to
maintain our own patch each time we rebased to new versions, etc.
Earlier attempts to solve this issue turned up strange bugs around
coordinating whether we or libhdfs were responsible for attaching and
detaching to the JVM/JNI environment.
So, this patch takes a new approach: rather than directly
creating/attaching to the JVM, we just look for an existing attached
environment. If there isn't one, we call some simple libhdfs function
which forces it to attach the current thread, and then try again.
Performance is maintained (or maybe improved) by adding a thread-local
cache of the attached JVM, with an inlined fast path.
I tested this with a CDP build of Hadoop which doesn't have the
getJNIEnv workaround. Prior to this fix, I wasn't able to run Java
tests against that build because it would fail to link getJNIEnv() at
runtime. Now, they pass.
Change-Id: I766bcfd70addb00e9fd8a860e89c2a1c5d4c71d5
Reviewed-on: http://gerrit.cloudera.org:8080/13275
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
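A condensed sketch of the attach-then-retry pattern the message describes; ForceAttachViaLibhdfs() is a hypothetical stand-in for the "simple libhdfs function", and the real helper lives in Impala's JNI utilities:

#include <jni.h>

JavaVM* g_jvm = nullptr;  // assumed to be initialized at process startup

// Stand-in: some cheap libhdfs call whose side effect is attaching the
// current thread to the JVM.
static void ForceAttachViaLibhdfs() { /* e.g. a trivial hdfs API call */ }

JNIEnv* GetJNIEnv() {
  // Fast path: thread-local cache of the attached environment.
  static thread_local JNIEnv* tls_env = nullptr;
  if (tls_env != nullptr) return tls_env;

  // Slow path: look for an existing attached environment...
  JNIEnv* env = nullptr;
  if (g_jvm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_8) == JNI_OK) {
    return tls_env = env;
  }
  // ...and if there is none, let libhdfs attach this thread, then retry.
  ForceAttachViaLibhdfs();
  if (g_jvm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_8) == JNI_OK) {
    tls_env = env;
  }
  return tls_env;  // nullptr signals failure to the caller
}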
||
|
|
b5805de3e6 |
IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.
This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.
The changes are as follows:
- Support for DATE literal syntax.
- Explicit casting between DATE and other types (note that invalid
casts will fail with an error just like invalid DECIMAL_V2 casts,
while failed casts to other types do not lead to a warning or error):
- from STRING to DATE. The string value must be formatted as
yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
the time component is optional. If the time component is
present, it will be truncated silently.
- from DATE to STRING. The resulting string value is formatted as
yyyy-MM-dd.
- from TIMESTAMP to DATE. The source timestamp's time of day
component is ignored.
- from DATE to TIMESTAMP. The target timestamp's time of day
component is set to 00:00:00.
- Implicit casting between DATE and other types:
- from STRING to DATE if the source string value is used in a
context where a DATE value is expected.
- from DATE to TIMESTAMP if the source date value is used in a
context where a TIMESTAMP value is expected.
- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
implicit conversions are now all possible, the existing function
overload resolution logic is not adequate anymore.
For example, it resolves the
if(false, '2011-01-01', DATE '1499-02-02') function call to the
if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
function, instead of the if(BOOLEAN, DATE, DATE) version.
This is clearly wrong, so the function overload resolution logic had
to be changed to resolve function calls to the best-fit overloaded
function definition if there are multiple applicable candidates.
An overloaded function definition is an applicable candidate for a
function call if each actual parameter in the function call either
matches the corresponding formal parameter's type (without casting)
or is implicitly castable to that type.
When looking for the best-fit applicable candidate, a parameter
match score (i.e. the number of actual parameters in the function
call that match their corresponding formal parameter's type without
casting) is calculated and the applicable candidate with the highest
parameter match score is chosen.
There's one more issue that the new resolution logic has to address:
if two applicable candidates have the same parameter match score and
the only difference between the two is that the first one requires a
STRING -> TIMESTAMP implicit cast for some of its parameters while
the second one requires a STRING -> DATE implicit cast for the same
parameters then the first candidate has to be chosen not to break
backward compatibility.
E.g: year('2019-02-15') function call must resolve to
year(TIMESTAMP) instead of year(DATE). Note that year(DATE) is not
implemented yet, so this is not an issue at the moment but it will
be in the future.
When the resolution algorithm considers overloaded function
definitions, first it orders them lexicographically by the types in
their parameter lists. To ensure backward-compatible behavior, the
PrimitiveType.DATE enum value has to come after
PrimitiveType.TIMESTAMP.
- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.
These items are tightly coupled and it makes sense to implement them
in one change-set.
Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
corresponding HBASE table 'functional_hbase.date_tbl') was
introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
- since DATE type is supported for TEXT and HBASE fileformats
only, most DATE tests were implemented separately in
tests/query_test/test_date_queries.py.
Note that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.
Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
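The best-fit resolution described in the commit above, reduced to its core (a sketch only; Impala's actual resolution lives in the Java frontend, and these types and castability rules are simplified):

#include <optional>
#include <string>
#include <vector>

enum class Type { STRING, DATE, TIMESTAMP, BOOLEAN };

// Simplified implicit-cast rule covering the cases in this commit.
static bool ImplicitlyCastable(Type from, Type to) {
  return from == to
      || (from == Type::STRING && (to == Type::DATE || to == Type::TIMESTAMP))
      || (from == Type::DATE && to == Type::TIMESTAMP);
}

struct Candidate { std::string name; std::vector<Type> params; };

// Pick the applicable candidate with the most exact parameter matches.
// Ties are broken by candidate order, which is why TIMESTAMP overloads
// must sort before DATE overloads for backward compatibility.
static std::optional<Candidate> ResolveBestFit(
    const std::vector<Candidate>& candidates, const std::vector<Type>& args) {
  std::optional<Candidate> best;
  int best_score = -1;
  for (const Candidate& c : candidates) {
    if (c.params.size() != args.size()) continue;
    int score = 0;
    bool applicable = true;
    for (size_t i = 0; i < args.size(); ++i) {
      if (args[i] == c.params[i]) {
        ++score;  // exact match: no cast needed
      } else if (!ImplicitlyCastable(args[i], c.params[i])) {
        applicable = false;  // this overload cannot accept the call
        break;
      }
    }
    if (applicable && score > best_score) {
      best_score = score;
      best = c;
    }
  }
  return best;
}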
||
|
|
209a350aae |
Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
This re-lands commit |
||
|
|
d3428a58d8 |
Revert "IMPALA-5393. Use THREAD_LOCAL state for regexp"
This depends on a change which switches to a toolchain version that does not have packages for Ubuntu 18.04. Reverting both now to unblock everyone.
This reverts commit
|
||
|
|
6e8c330f40 |
IMPALA-5393. Use THREAD_LOCAL state for regexp
This changes the built-in regexp-related UDFs to use THREAD_LOCAL
re2::RE2 instances instead of FRAGMENT_LOCAL. Although re2::RE2 is
thread-safe, it achieves that thread safety through a certain amount
of locking. Using thread-local regexps improves performance
substantially.
I ran a simple test query:
select sum(l_linenumber) from item_20x
where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0
on a table with three underlying parquet files (thus getting 3
scanner threads). Prior to this change, the query took ~60 seconds
and burned 2m16s of CPU time. With this change, it took ~19s and 43s
of CPU time. For a query with more scanner threads, the improvement
should be even more dramatic.
The only potential downside of this change is slightly increased
memory consumption from having one RE2 instance per thread, but the
RE2 objects themselves should be small relative to all of the other
per-scanner-thread memory.
Change-Id: Ibc331151a302e755701cb08adb3e6f289d54c3a6
Reviewed-on: http://gerrit.cloudera.org:8080/12772
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
|
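The gist of the change, as a minimal sketch (Impala actually stores the compiled pattern via FunctionContext's THREAD_LOCAL state; a function-local thread_local merely illustrates the effect):

#include <string>
#include <re2/re2.h>

// One compiled RE2 per thread: RE2 is thread-safe, but concurrent use
// of a single shared instance serializes on its internal locks.
bool ExtractAll(const std::string& text, std::string* out) {
  static thread_local re2::RE2 re("(.*)");  // pattern fixed here for brevity
  if (!re.ok()) return false;
  return re2::RE2::PartialMatch(text, re, out);
}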
||
|
|
214f61a180 |
IMPALA-8250: Clean up JNI warnings.
Using LIBHDFS_OPTS+="-Xcheck:jni" revealed a handful of warnings related to
(a) checking for exceptions and (b) leaking local references.
Checking for exceptions required sprinkling RETURN_ERROR_IF_EXC
left and right. I chose not to expand the JniCall infrastructure
to handle this more generally at the moment.
The leaky local references are a bit harder. In the logs, they show up
as "WARNING: JNI local refs: 2597, exceeds capacity: 35" or similar. A
few of these errors seem not to be in our code. The ones that I've
found in our code stemmed from HBaseTableScanner::GetRowKey(): this
method uses local references and wasn't returning them. Using a
JniLocalFrame seems to have taken care of the warnings.
I have added code to skip test_large_strings when JNI checking is
enabled. This test takes forever (presumably because JNI is checking
bounds on strings very aggressively), and times out. The time out also
causes some metric-related checks to fail (since a query is still in
flight).
Debugging this required customizing my JDK to give stack traces
when these warnings occurred. The following diff facilitated
this.
diff -r 76a9c9cf14f1 src/share/vm/prims/jniCheck.cpp
--- a/src/share/vm/prims/jniCheck.cpp Tue Jan 15 10:43:31 2019 +0000
+++ b/src/share/vm/prims/jniCheck.cpp Wed Feb 27 11:57:13 2019 -0800
@@ -143,11 +143,30 @@
static const char * fatal_instance_field_mismatch = "Field type (instance) mismatch in JNI get/set field operations";
static const char * fatal_non_string = "JNI string operation received a non-string";
+// thisone: whether to print every time, or maybe, depending on future
+// how many future stacks we want printed (totally racy); helps catch
+// missing exception handling if there's a way to tickle that code
+// reliably.
+static inline void dump_native_stack(JavaThread* thr, bool thisone, int future) {
+ static int fut_stacks = 0; // racy!
+ if (fut_stacks > 0) {
+ thisone = true;
+ fut_stacks--;
+ }
+ if (future > 0) fut_stacks = future;
+ if (thisone) {
+ frame fr = os::current_frame();
+ char buf[6000];
+ tty->print_cr("Thread: %s %d", thr->get_thread_name(), thr->osthread()->thread_id());
+ print_native_stack(tty, fr, thr, buf, sizeof(buf));
+ }
+}
// When in VM state:
static void ReportJNIWarning(JavaThread* thr, const char *msg) {
tty->print_cr("WARNING in native method: %s", msg);
thr->print_stack();
+ dump_native_stack(thr, true, 0);
}
// When in NATIVE state:
@@ -199,11 +218,14 @@
tty->print_cr("WARNING in native method: JNI call made without checking exceptions when required to from %s",
thr->get_pending_jni_exception_check());
thr->print_stack();
+ dump_native_stack(thr, true, 10);
)
thr->clear_pending_jni_exception_check(); // Just complain once
}
}
+
+
/**
* Add to the planned number of handles. I.e. plus current live & warning threshold
*/
@@ -254,9 +276,12 @@
tty->print_cr("WARNING: JNI local refs: %zu, exceeds capacity: %zu",
live_handles, planned_capacity);
thr->print_stack();
+ dump_native_stack(thr, true, 0);
)
// Complain just the once, reset to current + warn threshold
add_planned_handle_capacity(handles, 0);
+ } else {
+ dump_native_stack(thr, false, 0);
}
}
Change-Id: Idd1709f749a764c1d947704bc64306493863b45f
Reviewed-on: http://gerrit.cloudera.org:8080/12660
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
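For context, the JniLocalFrame fix mentioned above boils down to JNI's push/pop local-frame pattern; a simplified RAII sketch (not Impala's actual wrapper):

#include <jni.h>

// Reserves capacity for local references on construction and frees all
// of them on destruction.
class LocalFrameGuard {
 public:
  LocalFrameGuard(JNIEnv* env, jint capacity) : env_(env) {
    ok_ = env_->PushLocalFrame(capacity) == 0;
  }
  ~LocalFrameGuard() {
    if (ok_) env_->PopLocalFrame(nullptr);  // nullptr: keep no reference
  }
  bool ok() const { return ok_; }

 private:
  JNIEnv* env_;
  bool ok_ = false;
};

// Without the frame, each GetObjectArrayElement() call leaks one local
// reference until the native method returns, tripping the JNI checker.
void ProcessRows(JNIEnv* env, jobjectArray rows, jsize num_rows) {
  for (jsize i = 0; i < num_rows; ++i) {
    LocalFrameGuard frame(env, 8);
    if (!frame.ok()) return;
    jobject row = env->GetObjectArrayElement(rows, i);
    (void)row;  // ... use 'row'; released when 'frame' goes out of scope
  }
}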
||
|
|
0b7c964545 |
Adding hostname to Disk I/O errors.
I recently ran into some queries that failed like so:
WARNINGS: Disk I/O error: Could not open file: /data/...: Error(5): Input/output error
These warnings were in the profile, but I had to cross-reference
impalad logs to figure out which machine had the broken disk. In this
commit, I've sprinkled GetBackendString() into these errors to include
the hostname.
Change-Id: Ib977d2c0983ef81ab1338de090239ed57f3efde2
Reviewed-on: http://gerrit.cloudera.org:8080/12402
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
7707eb0417 |
IMPALA-7657: Codegen IsNotEmptyPredicate and ValidTupleIdExpr.
These two classes evaluate scalar expressions. Previously codegen was
done by calling ScalarExpr::GetCodegendComputeFnWrapper, which
generates a static method that calls the scalar expression evaluation
method. Make this more efficient by generating code which is
customized using information available at codegen time.
Add new cross-compiled files null-literal-ir.cc and slot-ref-ir.cc.
IsNotEmptyPredicate works by getting a CollectionVal object from the
single child Expr node and counting its tuples. At codegen time we
know the type and value of the child node. Generate a call to a
node-specific non-virtual cross-compiled method to get the
CollectionVal object from the child, then generate code that examines
the CollectionVal and returns an IntVal.
A ValidTupleIdExpr node contains a vector of tuple ids. It works by
probing each row for the tuple ids in the vector to find a non-null
tuple. At codegen time we know the vector of tuple ids. We unroll the
loop through the tuple ids, generating code that evaluates whether
each tuple is non-null and returns the tuple id if/when a non-null
tuple is found.
IMPALA-7657 also requests replacing GetCodegendComputeFnWrapper() in
TupleIsNullPredicate. In the current Impala code this method is never
called, because TupleIsNullPredicate is always wrapped in an IfExpr,
which is always codegen'd by IfExpr's GetCodegendComputeFnWrapper()
method. There is a separate Jira, IMPALA-7655, to improve codegen of
IfExpr.
Minor corrections: correct the link to the llvm tutorial in
LlvmCodegen.
PERFORMANCE:
I tested performance on a local mini-cluster. I wrote some
pathological queries to test the new code. The new codegen'd code is
very similar in performance; both ValidTupleIdExpr and
IsNotEmptyPredicate seem very slightly faster than the old code.
Overall these changes are not purely for performance but to move away
from GetCodegendComputeFnWrapper.
TESTING:
The changed scalar expressions are well exercised by current tests.
Ran exhaustive end-to-end tests.
Change-Id: Ifb87b9e3b879c278ce8638d97bcb320a7555a6b3
Reviewed-on: http://gerrit.cloudera.org:8080/12068
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
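For context, the interpreted semantics that the codegen'd IsNotEmptyPredicate specializes are roughly as follows (a sketch with simplified stand-ins for Impala's UDF value types):

#include <cstdint>

struct CollectionVal {
  bool is_null;
  int64_t num_tuples;
};
struct BooleanVal {
  bool is_null;
  bool val;
  static BooleanVal Null() { return {true, false}; }
};

// NULL in, NULL out; otherwise true iff the collection has at least one
// tuple. The codegen'd version inlines the child evaluation (known at
// codegen time) instead of a virtual call to fetch the CollectionVal.
BooleanVal IsNotEmpty(const CollectionVal& children) {
  if (children.is_null) return BooleanVal::Null();
  return {false, children.num_tuples > 0};
}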
||
|
|
ae96a9fb19 |
IMPALA-8151: Use sizeof() in HiveUdfCall to specify non-primitive type's size
Previously, data type sizes were hardcoded in HiveUdfCall::Evaluate().
Since IMPALA-7367 removed the padding from the STRING and VARCHAR
types, the hardcoded sizes could read past the end of the actual
value and cause a crash. This change replaces the hardcoded values
with sizeof() calls to determine the size of non-primitive types
(STRING, VARCHAR and TIMESTAMP) and avoid similar issues in the
future.
Testing: Ran test_udfs.py on an ASAN build. Added logs to manually
verify the number of bytes copied.
Change-Id: I919c330546fa86b474ab66245b20ceb1f5525b41
Reviewed-on: http://gerrit.cloudera.org:8080/12355
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
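The shape of the fix, as a sketch with simplified stand-ins for the runtime value layouts; the point is that sizeof() tracks layout changes such as IMPALA-7367's padding removal automatically, where a hardcoded 16 does not:

#include <cstddef>
#include <cstdint>

// Simplified stand-ins for Impala's runtime value layouts.
struct StringValue { char* ptr; int32_t len; };
struct TimestampValue { int64_t time_of_day; int32_t date; };

enum class PrimitiveType { TYPE_STRING, TYPE_VARCHAR, TYPE_TIMESTAMP, TYPE_INT };

// Bytes to copy for a slot of the given type: derived from the actual
// struct layout rather than a literal, so a layout change cannot
// silently desynchronize the copy size from the value's real extent.
std::size_t SlotByteSize(PrimitiveType t) {
  switch (t) {
    case PrimitiveType::TYPE_STRING:
    case PrimitiveType::TYPE_VARCHAR:
      return sizeof(StringValue);
    case PrimitiveType::TYPE_TIMESTAMP:
      return sizeof(TimestampValue);
    default:
      return sizeof(int32_t);
  }
}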
||
|
|
3338bae608 |
IMPALA-8043: Fix BE test failures related to SystemV timezones.
This is a fix for the following issue:
1. Some BE tests (e.g. ExprTest.TimestampFunctions) use the system's
   local timezone but run against a test timezone db (instead of the
   system's timezone db).
2. On some Linux installations /usr/share/zoneinfo contains symlinks
   to files in the /usr/share/zoneinfo/SystemV directory (e.g.
   /usr/share/zoneinfo/America/Los_Angeles is a symlink to
   ../SystemV/PST8PDT).
3. The 'SystemV' directory is not part of the test timezone db, since
   it is obsolete and excluded by default.
Consequently, if the system's local timezone is set to
America/Los_Angeles, BE tests won't find the corresponding timezone
file in the test timezone db. BE tests will default to UTC, which
will break some of them.
This change sets the local timezone explicitly for failing BE tests,
so they don't depend on the system's local timezone. It also adds the
'SystemV' directory to the test timezone db to avoid similar issues
in the future.
Change-Id: I9288cd24c8af0c059e55d47c86bd92eaf0075681
Reviewed-on: http://gerrit.cloudera.org:8080/12199
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
928c5c261b |
Fix some warnings on GCC7
I tried compiling with GCC7 to see what warnings popped up.
Fix some ambiguous-else warnings resulting from gtest macros; see
https://github.com/google/googletest/issues/1119.
Add a missing include that broke compilation on the release build.
Fix some warnings that detect missing returns when there is a DCHECK
(these warnings already occurred in release builds, but they now
happen in gcc7 debug builds).
Change-Id: I39a12bc5ed6957c147b7f0dba85c7687cc989439
Reviewed-on: http://gerrit.cloudera.org:8080/12132
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
27577dd652 |
IMPALA-7902: NumericLiteral fixes, refactoring
The work to clean up the rewriter logic must start with a stable AST,
which must start with sprucing up some issues with the leaf nodes. This
CR tackles the NumericLiteral used to hold numbers.
IMPALA-7896: Literals should not need explicit analyze step
Partial fix: removes the need to analyze a numeric literal: analyze() is
a no-op. This eliminates the need to do a "fake" analysis with a null
analyzer: numeric literals are now created analyzed. This is useful
because the catalog module creates numeric literals outside of a query
(and outside of an analyzer.)
A literal is immutable except for type. Modified the constructor to set
the type and cost, then mark the node as analyzed. A later call to
analyze() has nothing to do.
Code that created and dummy-analyzed numeric literals changed to use
static create() methods, resulting in simpler literal creation and
eliminating the special "analyzer == null" checks in analyze().
IMPALA-7886: NumericLiteral constructor fails to round values to
Decimal type
IMPALA-7887: NumericLiteral fails to detect numeric overflow
IMPALA-7888: Incorrect NumericLiteral overflow checks for FLOAT,
DOUBLE
IMPALA-7891: Analyzer does not detect numeric overflow in CAST
IMPALA-7894: Parser does not catch double overflow
These are all caused by the somewhat cluttered state of the numeric
range check code after years of incremental changes. This patch
centralizes all checks into a series of constants and methods for
uniformity. All values are set in the constructor which now checks
that the value is legal for the type. Cast operations verify that the
cast is valid. Multiple semi-parallel versions of the same logic is
replaced by calls to a single implementation.
The numeric checks now follow the SQL standard which says that
implementations should fail if a cast would truncate the most significant
digits, but round when truncating the least significant.
IMPALA-7865: Repeated type widening of arithmetic expressions
Partial fix. Replaces the "is explicit cast" flag in the numeric literal
with the explicit type. This allows resetting an implicit type back to
the explicit type if an arithmetic expression is analyzed multiple
times. A later patch will feed this type information into the type
inference mechanism to complete the fix.
Finally, adds a set of new exceptions that begin to unify error
reporting. These handle casts (SqlCastException), value validation
(InvalidValueException) and unsupported features
(UnsupportedFeatureException.) These all derive from AnalysisException
for backward compatibility. Tests use the new exceptions to check for
expected errors rather than parsing text strings (which tend to
change.)
Testing:
* Added unit tests just for numeric literals. Refactored code to
simplify the tests.
* Added a test case for the obscure case in Decimal V1 of an implicit
cast overflow.
* The depth-check tests needed one extra level of nesting to trigger
failure.
* Ran all FE tests.
Change-Id: I484600747b2871d3a6fe9153751973af9a8534f2
Reviewed-on: http://gerrit.cloudera.org:8080/12001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
ba9b78c103 |
IMPALA-7759: Add Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate Levenshtein edit
distance. Implemented as levenshtein() to match PostgreSQL in both
functionality and name; also added an le_dst() alias for Netezza
compatibility. Note that levenshtein() differs in functionality in
that if either or both values are NULL, levenshtein() returns NULL,
whereas Netezza's le_dst() returns the length of the non-NULL value,
or 0 if both values are NULL.
Testing:
- Added unit tests to expr-test.cc
- Manual test on 966289 string pairs; results match PostgreSQL
- Added changes to qgen tests for PostgreSQL comparison
Change-Id: I549d33ab7cebfa10db2934461c8ec91e2cc1cdcb
Reviewed-on: http://gerrit.cloudera.org:8080/11793
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
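The classic two-row dynamic program behind a levenshtein() built-in, as a standalone sketch (the actual built-in additionally handles NULL inputs and argument length limits):

#include <algorithm>
#include <string>
#include <vector>

// Levenshtein edit distance: minimum number of single-character
// insertions, deletions, and substitutions turning 's' into 't'.
int LevenshteinDistance(const std::string& s, const std::string& t) {
  const size_t m = s.size(), n = t.size();
  std::vector<int> prev(n + 1), cur(n + 1);
  for (size_t j = 0; j <= n; ++j) prev[j] = static_cast<int>(j);
  for (size_t i = 1; i <= m; ++i) {
    cur[0] = static_cast<int>(i);
    for (size_t j = 1; j <= n; ++j) {
      int substitution_cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
      cur[j] = std::min({prev[j] + 1,                       // deletion
                         cur[j - 1] + 1,                    // insertion
                         prev[j - 1] + substitution_cost}); // substitution
    }
    std::swap(prev, cur);
  }
  return prev[n];
}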
||
|
|
2a4835cfba |
IMPALA-7367: Pack StringValue and CollectionValue slots
This change packs StringValue and CollectionValue slots to ensure they
now occupy 12 bytes instead of 16 bytes. This reduces memory
requirements and improves performance. Since Kudu tuples are populated
using a memcpy, 4 bytes of padding were added to string slots in Kudu
tables.
Testing: Ran core tests. Added static asserts to ensure the value
sizes are as expected. Performance tests on TPCH-40 produced a 3.96%
improvement.
Change-Id: I32f3b06622c087e4aa288e8db1bf4581b10d386a
Reviewed-on: http://gerrit.cloudera.org:8080/11599
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
|
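The layout change in miniature, assuming 64-bit pointers (a sketch; Impala uses its own packing attributes rather than this #pragma, plus static asserts like the one below):

#include <cstdint>

#pragma pack(push, 1)
// 8-byte pointer + 4-byte length = 12 bytes once the compiler is told
// not to pad the struct out to 16 for pointer alignment.
struct PackedStringValue {
  char* ptr;
  int32_t len;
};
#pragma pack(pop)

static_assert(sizeof(PackedStringValue) == 12,
              "string slots are expected to occupy 12 bytes");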
||
|
|
067657aa7d |
IMPALA-5031: prevent signed overflow in decimal
This removes two signed integer overflows when using the 'conv'
builtin. Signed integer overflow is undefined behavior according to
the C++ standard. The interesting parts of the backtraces are:
exprs/math-functions-ir.cc:405:13: runtime error: signed integer overflow: 4738381338321616896 * 36 cannot be represented in type 'long'
exprs/math-functions-ir.cc:404:24: runtime error: signed integer overflow: 2 * 4738381338321616896 cannot be represented in type 'long'
#0 MathFunctions::DecimalInBaseToDecimal(long, signed char, long*) exprs/math-functions-ir.cc:404:24
#1 MathFunctions::ConvInt(impala_udf::FunctionContext*, impala_udf::BigIntVal const&, impala_udf::TinyIntVal const&, impala_udf::TinyIntVal const&) exprs/math-functions-ir.cc:327:10
#2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:485:580
#3 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
#8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
#9 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
#10 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
These were triggered in the backend test
ExprTest.MathConversionFunctions.
Change-Id: I0d97dfcf42072750c16e41175765cd9a468a3c39
Reviewed-on: http://gerrit.cloudera.org:8080/11876
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
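One portable way to make an accumulation like decimal = decimal * base + digit overflow-safe is the checked-arithmetic builtins available in GCC and Clang; a sketch of the pattern (not necessarily the exact form of the fix above):

#include <cstdint>

// Accumulate one digit in the given base, reporting overflow instead of
// invoking undefined behavior. Returns false if the result does not fit
// in an int64_t.
bool AccumulateDigit(int64_t decimal, int64_t base, int64_t digit,
                     int64_t* result) {
  int64_t scaled;
  // The builtins compute in infinite precision and report truncation.
  if (__builtin_mul_overflow(decimal, base, &scaled)) return false;
  if (__builtin_add_overflow(scaled, digit, result)) return false;
  return true;
}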
||
|
|
60095a4c6b |
IMPALA-5050: Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS from Parquet
Changes:
- parquet.thrift is updated to a newer version which contains the
  timestamp logical type.
- INT64 columns with converted types TIMESTAMP_MILLIS and
  TIMESTAMP_MICROS can be read as TIMESTAMP.
- If the logical type is timestamp, then the type contains the
  information about whether the UTC->local conversion is necessary.
  This feature is only supported for the new timestamp types, so
  INT96 timestamps must still use the flag
  convert_legacy_hive_parquet_utc_timestamps.
- Min/max stat filtering is enabled again for columns that need
  UTC->local conversion. This was disabled in IMPALA-7559 because it
  could incorrectly drop column chunks.
- CREATE TABLE LIKE PARQUET converts these columns to TIMESTAMP;
  before the change, an error was returned instead.
- The bulk of the Parquet column stat logic was moved to a new class
  called "ColumnStatsReader".
Testing:
- Added unit tests for timezone conversion (this needed a new public
  function in timezone_db.h and adding CET to tzdb_tiny).
- Added parquet files (created with parquet-mr) with int64 timestamp
  columns.
Change-Id: I4c7c01fffa31b3d2ca3480adf6ff851137dadac3
Reviewed-on: http://gerrit.cloudera.org:8080/11057
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
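The core arithmetic when reading an INT64 timestamp is a floored split into seconds and a sub-second remainder, which needs care for pre-epoch (negative) values; a standalone sketch:

#include <cstdint>

constexpr int64_t MICROS_PER_SEC = 1000 * 1000;

// Split a TIMESTAMP_MICROS value into whole seconds since the epoch and
// a non-negative nanosecond remainder. Plain '/' and '%' truncate
// toward zero, which would mishandle pre-epoch values.
void SplitMicros(int64_t micros, int64_t* seconds, int64_t* nanos) {
  *seconds = micros / MICROS_PER_SEC;
  int64_t remainder = micros % MICROS_PER_SEC;
  if (remainder < 0) {  // floor instead of truncate
    *seconds -= 1;
    remainder += MICROS_PER_SEC;
  }
  *nanos = remainder * 1000;
}
// Example: -1 micros => seconds = -1, nanos = 999999000.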
||
|
|
250d85e94e |
IMPALA-7822: handle overflows in repeat() builtin
We need to carefully check that the intermediate value fits in an
int64_t and the final size fits in an int. If they don't, we raise an
error and fail the query.
Testing: Added a couple of backend tests to exercise the overflow
check code paths.
Change-Id: I872ce77bc2cb29116881c27ca2a5216f722cdb2a
Reviewed-on: http://gerrit.cloudera.org:8080/11889
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
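The guard amounts to checking the product before allocating; a sketch (RepeatResultSize is a hypothetical helper, not the builtin's actual signature):

#include <cstdint>
#include <limits>

// Validate repeat(str, n): the intermediate byte count must fit in an
// int64_t and the final length must fit in an int (the string value's
// length field). Returns false when the result would overflow.
bool RepeatResultSize(int64_t str_len, int64_t n, int* result_len) {
  if (str_len == 0 || n <= 0) { *result_len = 0; return true; }
  // Division avoids computing an overflowing product.
  if (n > std::numeric_limits<int64_t>::max() / str_len) return false;
  int64_t total = str_len * n;
  if (total > std::numeric_limits<int>::max()) return false;
  *result_len = static_cast<int>(total);
  return true;
}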
||
|
|
78b6f1db69 |
IMPALA-5031: Make UBSAN-friendly arithmetic generic
ArithmeticUtil::AsUnsigned() makes it possible to do arithmetic on
signed integers in a way that does not invoke undefined behavior, but
it only works on integers. This patch adds ArithmeticUtil::Compute(),
which dispatches (at compile time) to the normal arithmetic
evaluation method if the type of the values is a floating point type,
but uses AsUnsigned() if the type of the values is an integral type.
Change-Id: I73bec71e59c5a921003d0ebca52a1d4e49bbef66
Reviewed-on: http://gerrit.cloudera.org:8080/11810
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
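A condensed version of the dispatch idea (signatures simplified from Impala's ArithmeticUtil, and using C++17's if constexpr for brevity):

#include <functional>
#include <type_traits>

// For integral T, perform the operation on the unsigned counterpart,
// where wraparound is well defined, and convert back; for floating
// point T, evaluate directly. Chosen at compile time, so the integral
// path adds no runtime cost.
template <template <typename> class Op, typename T>
T Compute(T a, T b) {
  if constexpr (std::is_integral_v<T>) {
    using U = std::make_unsigned_t<T>;
    return static_cast<T>(Op<U>()(static_cast<U>(a), static_cast<U>(b)));
  } else {
    return Op<T>()(a, b);
  }
}

// Usage: Compute<std::plus>(int64_t{INT64_MAX}, int64_t{1}) wraps
// around instead of invoking undefined behavior.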
||
|
|
8fc702be0c |
IMPALA-5031: fix signed overflows in decimal
The standard says that overflow for signed arithmetic operations is
undefined behavior; see [expr]:
If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable values
for its type, the behavior is undefined.
and [basic.fundamental]:
Unsigned integers shall obey the laws of arithmetic modulo 2^n
where n is the number of bits in the value representation of that
particular size of integer. This implies that unsigned arithmetic
does not overflow because a result that cannot be represented by
the resulting unsigned integer type is reduced modulo the number
that is one greater than the largest value that can be represented
by the resulting unsigned integer type.
All of the overflows fixed in this patch were tested with expr-test's
DecimalArithmeticTest.
Change-Id: Ibf882428931e4f4264be2fc8cd9d6b1fc89b8ace
Reviewed-on: http://gerrit.cloudera.org:8080/11604
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
d301600a85 |
Revert "IMPALA-7595: Revert "IMPALA-7521: Speed up sub-second unix time->TimestampValue conversions""
IMPALA-7595 added proper handling for invalid time-of-day values
in Parquet, so the DCHECK mentioned in IMPALA-7595 will no longer
be hit. This means that IMPALA-7521 can be committed again without
causing problems.
This reverts commit
|
||
|
|
20bde289eb |
IMPALA-5031: null ptr errors in C calls in BE tests
This patch fixes all remaining UBSAN "null pointer passed as argument"
errors in the backend tests. These are undefined behavior according to
"7.1.4 Use of library functions" in the C99 standard (which is
included in C++14 in section [intro.refs]):
If an argument to a function has an invalid value (such as a value
outside the domain of the function, or a pointer outside the
address space of the program, or a null pointer, or a pointer to
non-modifiable storage when the corresponding parameter is not
const-qualified) or a type (after promotion) not expected by a
function with variable number of arguments, the behavior is
undefined.
The interesting parts of the backtraces for the errors fixed in this
patch are below:
exprs/string-functions-ir.cc:311:17: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 StringFunctions::Replace(impala_udf::FunctionContext*, impala_udf::StringVal const&, impala_udf::StringVal const&, impala_udf::StringVal const&) exprs/string-functions-ir.cc:311:5
#1 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:485:580
#2 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
#3 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
#4 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
#5 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
#6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
#7 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
#8 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
#9 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
#10 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
#11 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
#12 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
#13 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
#20 thread_proxy (exprs/expr-test+0x55ca939)
exprs/string-functions-ir.cc:868:15: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 StringFunctions::ConcatWs(impala_udf::FunctionContext*, impala_udf::StringVal const&, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:868:3
#1 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:510:270
#2 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
#3 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
#4 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
#5 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
#6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
#7 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
#8 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
#9 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
#10 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
#11 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
#12 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
#13 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
#20 thread_proxy (exprs/expr-test+0x55ca939)
exprs/string-functions-ir.cc:871:17: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 StringFunctions::ConcatWs(impala_udf::FunctionContext*, impala_udf::StringVal const&, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:871:5
#1 StringFunctions::Concat(impala_udf::FunctionContext*, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:843:10
#2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:510:95
#3 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
#4 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
#5 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
#6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
#7 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
#8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
#9 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
#10 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
#11 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
#12 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
#13 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
#14 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
#21 thread_proxy (exprs/expr-test+0x55ca939)
exprs/string-functions-ir.cc:873:17: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 StringFunctions::ConcatWs(impala_udf::FunctionContext*, impala_udf::StringVal const&, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:873:5
#1 StringFunctions::Concat(impala_udf::FunctionContext*, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:843:10
#2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:510:95
#3 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
#4 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
#5 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
#6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
#7 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
#8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
#9 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
#10 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
#11 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
#12 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
#13 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
#14 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
#21 thread_proxy (exprs/expr-test+0x55ca939)
runtime/raw-value.cc:159:27: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 RawValue::Write(void const*, void*, ColumnType const&, MemPool*) runtime/raw-value.cc:159:9
#1 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:225:7
#2 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
#3 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
#4 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
#5 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
#6 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
#7 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
#8 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
#9 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
#16 thread_proxy (exprs/expr-test+0x55ca939)
udf/udf.cc:521:24: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 impala_udf::StringVal::CopyFrom(impala_udf::FunctionContext*, unsigned char const*, unsigned long) udf/udf.cc:521:5
#1 AnyValUtil::FromBuffer(impala_udf::FunctionContext*, char const*, int) exprs/anyval-util.h:241:12
#2 StringFunctions::RegexpExtract(impala_udf::FunctionContext*, impala_udf::StringVal const&, impala_udf::StringVal const&, impala_udf::BigIntVal const&) exprs/string-functions-ir.cc:726:10
#3 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:485:580
#4 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
#5 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
#6 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
#7 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
#8 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
#9 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
#10 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
#11 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
#12 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
#13 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
#14 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
#15 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
#22 thread_proxy (exprs/expr-test+0x55ca939)
util/coding-util-test.cc:45:10: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 TestUrl(string const&, string const&, bool) util/coding-util-test.cc:45:3
#1 UrlCodingTest_BlankString_Test::TestBody() util/coding-util-test.cc:88:3
#2 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/coding-util-test+0x6630f42)
#8 main util/coding-util-test.cc:123:192
util/decompress-test.cc:126:261: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
#0 DecompressorTest::CompressAndDecompress(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:126:254
#1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:84:9
#2 DecompressorTest_Default_Test::TestBody() util/decompress-test.cc:373:3
#3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
#9 main util/decompress-test.cc:479:47
util/decompress-test.cc:148:261: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
#0 DecompressorTest::CompressAndDecompress(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:148:254
#1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:84:9
#2 DecompressorTest_Default_Test::TestBody() util/decompress-test.cc:373:3
#3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
#9 main util/decompress-test.cc:479:47
util/decompress-test.cc:269:261: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
#0 DecompressorTest::CompressAndDecompressNoOutputAllocated(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:269:254
#1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:71:7
#2 DecompressorTest_LZ4_Test::TestBody() util/decompress-test.cc:381:3
#3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
#9 main util/decompress-test.cc:479:47
util/decompress-test.cc:221:329: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
#0 DecompressorTest::StreamingDecompress(Codec*, long, unsigned char*, long, unsigned char*, bool, long*) util/decompress-test.cc:221:322
#1 DecompressorTest::CompressAndStreamingDecompress(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:245:35
#2 DecompressorTest::RunTestStreaming(THdfsCompression::type) util/decompress-test.cc:104:5
#3 DecompressorTest_Gzip_Test::TestBody() util/decompress-test.cc:386:3
#4 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
#10 main util/decompress-test.cc:479:47
util/streaming-sampler.h:55:22: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
#0 StreamingSampler<long, 64>::StreamingSampler(int, vector<long> const&) util/streaming-sampler.h:55:5
#1 RuntimeProfile::TimeSeriesCounter::TimeSeriesCounter(string const&, TUnit::type, int, vector<long> const&) util/runtime-profile-counters.h:401:53
#2 RuntimeProfile::Update(vector<TRuntimeProfileNode> const&, int*) util/runtime-profile.cc:310:28
#3 RuntimeProfile::Update(TRuntimeProfileTree const&) util/runtime-profile.cc:245:3
#4 Coordinator::BackendState::InstanceStats::Update(TFragmentInstanceExecStatus const&, Coordinator::ExecSummary*, ProgressUpdater*) runtime/coordinator-backend-state.cc:473:13
#5 Coordinator::BackendState::ApplyExecStatusReport(TReportExecStatusParams const&, Coordinator::ExecSummary*, ProgressUpdater*) runtime/coordinator-backend-state.cc:286:21
#6 Coordinator::UpdateBackendExecStatus(TReportExecStatusParams const&) runtime/coordinator.cc:678:22
#7 ClientRequestState::UpdateBackendExecStatus(TReportExecStatusParams const&) service/client-request-state.cc:1253:18
#8 ImpalaServer::ReportExecStatus(TReportExecStatusResult&, TReportExecStatusParams const&) service/impala-server.cc:1343:18
#9 ImpalaInternalService::ReportExecStatus(TReportExecStatusResult&, TReportExecStatusParams const&) service/impala-internal-service.cc:87:19
#24 thread_proxy (exprs/expr-test+0x55ca939)
Change-Id: I317ccc99549744a26d65f3e07242079faad0355a
Reviewed-on: http://gerrit.cloudera.org:8080/11545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
||
|
|
ddef2cb9b1 |
IMPALA-376: add built-in functions for parsing JSON
This patch implements the same function as the Hive UDF
get_json_object. We reuse RapidJSON to parse the JSON string. In
order to track the memory used by RapidJSON, we wrap FunctionContext
into an allocator.
get_json_object accepts two parameters: a JSON string and a selector
(JSON path). We parse the JSON string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first perform '$[*]' to extract all the items in the root array.
Then we get a queue consisting of {a:1},{a:2},{a:3} and perform the
'.a' selector on all values in the queue. The final result is 1,2,3
in the queue. As there are multiple results, they are encapsulated
into an array, so the output is the string '[1,2,3]'.
More examples can be found in expr-test.cc.
Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory
Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
|
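A stripped-down illustration of the selector walk for simple '$.key.key' paths using RapidJSON directly (no '[*]' wildcard handling, which needs the BFS queue described above, and no FunctionContext-backed allocator):

#include <cstdio>
#include <string>
#include <vector>

#include "rapidjson/document.h"

// Resolve a dotted path like {"a", "b"} against a parsed document.
// Returns nullptr if any step is missing, mirroring get_json_object's
// NULL result for non-matching selectors.
const rapidjson::Value* ResolvePath(const rapidjson::Document& doc,
                                    const std::vector<std::string>& path) {
  const rapidjson::Value* current = &doc;
  for (const std::string& key : path) {
    if (!current->IsObject() || !current->HasMember(key.c_str())) return nullptr;
    current = &(*current)[key.c_str()];
  }
  return current;
}

int main() {
  rapidjson::Document doc;
  doc.Parse(R"({"a": {"b": 42}})");
  if (doc.HasParseError()) return 1;
  const rapidjson::Value* v = ResolvePath(doc, {"a", "b"});
  if (v != nullptr && v->IsInt()) std::printf("%d\n", v->GetInt());  // prints 42
  return 0;
}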
||
|
|
cb49371613 |
IMPALA-7492: Add support for DATE text parser/formatter
This change is the first step in implementing support for DATE type (IMPALA-6169). The DATE parser/formatter is implemented by the new DateParser class. - The parser supports parsing both default and custom formatted DATE values. CCTZ is used to validate the parsed dates. - The formatter supports default and custom formatting of DATE values. In the future, DateParser will be used in the text scanner/writer and in the DATE <-> STRING cast functions. The DateParser class reuses some of the functionality already implemented in the TimestampParser class to minimize redundancy. To make code reuse easier, a new namespace (datetime_parse_util) was created and the common functionality was moved there. This change also adds a new class (DateValue) to represent a DATE value in-memory. The DateParser and DateValue classes are used only in tests at the moment, therefore this patch doesn't change user facing behavior. Testing: - Added BE-tests for DateParser and DateValue classes. - Re-run parse-timestamp-benchmark to make sure that parser performance hasn't degraded. Change-Id: I1eec00f22502c4c67c6807c4b51384419ea8b831 Reviewed-on: http://gerrit.cloudera.org:8080/11450 Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com> |