724 Commits

Author SHA1 Message Date
Tim Armstrong
da5b498c18 IMPALA-9373: more tactical IWYU fixes
This is a grab-bag of fixes that I made using a mix of manual
inspection techniques. The techniques used were:
* Getting preprocessor output for a few files by modifying
  command lines from compile_commands.json to include -E.
  This is revealing because you see all the random unrelated
  cruft that gets pulled in. A useful one-liner to extract
  an (approximate) list of headers from preprocessor output is:
  grep '^#.*h' be/src/util/CMakeFiles/Util.dir/os-info.cc.i | \
      grep -o '".*"' | sort -u
* Looking at the IWYU recommendations for guidance on which
  headers can be removed (and which need to be added).
* Grepping for includes of headers, especially in other headers
  where they become viral. An example one-liner to find these:
  git grep -l 'include.*<iostream>' | grep '\.h$'
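
For files where the grep one-liner above is too coarse, the same idea
can be sketched in Python. This is only an illustration, not a script
from the Impala tree: the function name is hypothetical, and it assumes
GCC/Clang-style linemarkers (# <line> "<path>" <flags>) and keeps only
paths ending in .h or .hpp.

```python
import re

def headers_from_preprocessor_output(text):
    """Return a sorted list of unique header paths named in linemarkers."""
    headers = set()
    for line in text.splitlines():
        # Linemarkers look like: # 1 "/usr/include/stdio.h" 1 3 4
        match = re.match(r'#\s*\d+\s+"([^"]+\.h(?:pp)?)"', line)
        if match:
            headers.add(match.group(1))
    return sorted(headers)
```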

Non-exhaustive list of changes made:
-----------------------------------
Unnest classes from TmpFileMgr so we can forward-declare them.
This lets us remove tmp-file-mgr.h from buffer-pool.h and
query-state.h, which are both widely included headers in the
codebase.

Also remove webserver.h from other headers, since it
pulls in openssl-util.h and consequently a lot of
openssl headers.

Avoid including runtime/multi-precision.h in other headers.
It pulls in a lot of boost multiprecision headers that
are only needed for internal implementations of math
and decimal operations. This required replacing some
references to int128_t with __int128_t, which I don't
think significantly hurts code readability.

Also remove references to decimal-util.h where they're
not needed, since it transitively pulls in
multi-precision.h

Reduce includes of boost/date_time modules, which are
transitively included in many places via timestamp-value.h.

Remove transitive dependencies of timestamp-value.h
to avoid pulling in remaining boost date_time headers
where not needed. Dependent headers are:
scalar-expr-evaluator.h, expr-value.h

Remove references to debug-util.h in other headers,
because it pulls in a lot of thread headers.

Remove references to llvm-codegen.h where possible,
because it pulls in many llvm headers.

Other opportunities:
--------------------
* boost/algorithm/string.hpp includes many string algorithms
  and pulls in a lot of headers.
* util/string-parser.h is a giant header with many dependencies.
* There's lots of redundancy between boost and standard c++
  headers. Both pull in vast numbers of utility headers for
  C++ metaprogramming and similar things. If we reduced virality
  of boost headers this would help a lot, and also if we switch
  to equivalent standard headers where possible (e.g. unordered_map,
  unordered_set, function, bind, etc).

Compile time with clang/ASAN:
-----------------------------
Before:
real    9m6.311s
user    62m25.006s
sys     2m44.798s

After:
real    8m17.073s
user    55m38.425s
sys     2m25.808s

Change-Id: I8de71866bdf3211e53560d9bfe930e7657c4d7f1
Reviewed-on: http://gerrit.cloudera.org:8080/15248
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-25 03:37:32 +00:00
Sahil Takiar
984f675e05 IMPALA-5904: (part 2) Fix more TSAN bugs
Fixes the following data races reported by TSAN:

data race be/src/runtime/krpc-data-stream-sender.cc:581:3 in
KrpcDataStreamSender::Channel::SerializeAndSendBatch(impala::RowBatch*)
* Race condition when reading 'rpc_in_flight_batch_' outside of the
  'lock_'
* Since this race condition is only triggered inside a DCHECK, added a
  suppression

data race be/src/util/stopwatch.h:183:9 in
MonotonicStopWatch::RunningTime() const
* Race condition on BlockingJoinNode::built_probe_overlap_stop_watch_;
  changed from a MonotonicStopWatch to a ConcurrentStopWatch

data race be/src/exec/kudu-scan-node.cc:211:13 in
KuduScanNode::ProcessScanToken(impala::KuduScanner*, std::string const&)
* Some reads on KuduScanNode::done_ are racy, so I made 'done_' an
  AtomicBool; this has the added benefit that failed scans will be
  aborted as soon as 'done_' is set to false

data race be/src/service/client-request-state.h:220:29 in
ClientRequestState::eos() const
* Race condition when reading / updating ClientRequestState::eos_; made
  'eos_' an AtomicBool

data race be/src/exec/parquet/parquet-column-readers.cc:497:9 in
bool ScalarColumnReader<...>::ReadValueBatch<false>(...)
* Race condition in SHOULD_TRIGGER_COL_READER_DEBUG_ACTION /
  parquet_column_reader_debug_count

data race be/src/service/impala-server.cc:817:20 in
ImpalaServer::ArchiveQuery(impala::ClientRequestState const&)
* Race condition on some ClientRequestState fields when creating a
  QueryStateRecord

Fixes IMPALA-9313: 'TSAN data race in TmpFileMgr::File::Blacklist' and
adds a suppression for IMPALA-9404:
'Instantiations/ExprTest.MathConversionFunctions fails in TSAN builds'.

Testing:
* Ran core tests
* Re-ran TSAN tests and confirmed the data races have been fixed

Change-Id: I01feb40417dc5ea64ccb0c1044cfc3eed8508476
Reviewed-on: http://gerrit.cloudera.org:8080/15244
Reviewed-by: Sahil Takiar <stakiar@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-03-02 09:01:14 +00:00
wzhou-code
3224839876 IMPALA-8759: Use double precision for HLL finalize function
The current HLL finalize function uses the single-precision float32
data type to calculate the estimate. This is not accurate for
cardinalities beyond 1,000,000, since float32 has only 6-7 decimal
digits of precision.
This patch changes the single-precision data type to a double-precision
type in the HLL finalize function.
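
The precision limit can be illustrated with a short Python sketch. It
is not the HLL code itself; the helper to_float32 is a hypothetical
name that rounds a double to the nearest float32 via the struct module,
which is enough to show large counts being silently rounded.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

# float32 has a 24-bit significand, so near 1.2e8 adjacent values are
# 8 apart and the exact count cannot be stored.
exact = 123456789.0
rounded = to_float32(exact)       # 123456792.0
error = rounded - exact           # 3.0

# A double has a 53-bit significand and holds the same count exactly.
assert float(123456789) == exact
```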

Testing:
 - Passed all exhaustive tests.
 - Benchmarked queries with NDV functions. The performance
   impact is negligible.
   See the following spreadsheet for the benchmark:
   https://docs.google.com/spreadsheets/d/1DIVOEs5C4MJL1b7O4MA_jkaM3Y-JSMFREjXCUHJ3eHc/edit#gid=0

Change-Id: I0c5a5229b682070b0bc14da287db5231159dbb3d
Reviewed-on: http://gerrit.cloudera.org:8080/15167
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-26 11:06:12 +00:00
Csaba Ringhofer
2c54dbe225 IMPALA-9385: Unix time conversion cleanup + ORC fix
ORC scanner uses TimestampValue::FromUnixTimeNanos() to convert
sec + nano representation to Impala's TimestampValue (day + nano).
FromUnixTimeNanos was affected by flag
use_local_tz_for_unix_timestamp_conversions, while that global option
should not affect ORC. By default there was no conversion, but if the
flag is 1, then timestamps were interpreted as UTC and converted to
local time.

This could be solved by creating a UTC version of FromUnixTimeNanos,
but I decided to change the interface in the hope of making To/From
timestamp functions less confusing.

Changes:
- Fixed the bug by passing UTC as timezone in the ORC scanner.
- Changed the interface of these TimestampValue functions to expect
  a timezone pointer, interpret null as UTC and skip conversion. It
  would be also possible to pass the actual UTC timezone and check
  for this in the functions, but I guess it is easier to optimize
  the inlined functions this way.
- Moved the checking of use_local_tz_for_unix_timestamp_conversions to
  RuntimeState and added property time_zone_for_unix_time_conversions()
  to return the timezone to use in Unix time conversions. This made
  TimestampValue's interface clearer and makes it easy to replace the
  flag with a query option if we want to.
- Changed RuntimeState and the Parquet scanner to skip timezone
  conversion if convert_legacy_hive_parquet_utc_timestamps=1 but the
  timezone is UTC. This allows users to avoid the performance penalty
  of this flag by setting query option timezone to UTC in their
  session (IMPALA-7557). CCTZ is not good at this; conversions are
  actually slower with fixed-offset timezones (including UTC)
  than with timezones that have DST/historical rule changes.
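
The "null timezone means UTC, skip conversion" convention can be
sketched as follows. This is a Python illustration of the interface
idea only, not the C++ TimestampValue code; the function name and the
use of naive datetime values as the result are assumptions.

```python
from datetime import datetime, timedelta, timezone

def from_unix_time(seconds, tz=None):
    """Convert Unix seconds to a naive timestamp.

    tz=None is interpreted as UTC, so no timezone conversion is done.
    """
    utc = datetime.fromtimestamp(seconds, tz=timezone.utc)
    if tz is None:
        return utc.replace(tzinfo=None)
    # Only callers that pass a timezone pay for the conversion.
    return utc.astimezone(tz).replace(tzinfo=None)
```

A caller such as the ORC scanner would pass no timezone, while callers
that honor the flag would pass the timezone picked by the runtime state.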

Postponed changes:
- Didn't remove the UTC versions of the functions yet, as that would
  require changing (and possibly rethinking) several BE tests and
  benchmarks (IMPALA-9409).

Tests:
- Added regression tests for ORC and other file formats to
  check that they are not affected by this flag.
- Extended test_hive_parquet_timestamp_conversion.py to cover the case
  when convert_legacy_hive_parquet_utc_timestamps=1 and timezone=UTC.
  Also did some cleanup there to use query option timezone instead of
  env var TZ.

Change-Id: I14e2a7e512ccd013d5d9fe480a5467ed4c46b76e
Reviewed-on: http://gerrit.cloudera.org:8080/15222
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-22 02:02:56 +00:00
Tim Armstrong
04fd9ae268 IMPALA-9373: Trial run of include-what-you-use
Implemented recommendations from IWYU in a subset of
files, mostly in util. Did a few cleanups related to
systematic problems that I noticed as a result.

I noticed that uid-util.h was pulling in boost UUID headers
to a lot of compilation units, so refactored that a little
bit, including pulling out the hash functions into
unique-id-hash.h and moving some inline functions into
client-request-state-map.cc.

Systematically replaced the general boost mutex header with the
internal pthread-based one. This is equivalent for us, since
we assume that boost::mutex is implemented by pthread_mutex_t,
e.g. for the implementation of ConditionVariable.

Switch include guards to pragma once just as general cleanup.

Prefix string with std:: consistently in headers so that they
don't depend on "using" declarations pulled in from random
headers.

Look at includes of C++ stream headers, including iostream and
stringstream, and replaced them with iosfwd or removed them
if possible.

Compile time:
Measured a full ASAN build of the impalad binary on an 8 core
machine with ccache enabled, but cleared. It used very slightly
less CPU, probably because we are still pulling in most of the
same system headers.

Before:
real    9m27.502s
user    64m39.775s
sys     2m49.002s

After:
real    9m26.561s
user    64m28.948s
sys     2m48.252s

So for the moment, the only significant wins are on incremental
builds, where touching header files should not require as many
recompilations. Compile times should start to drop meaningfully
once we thin out more unnecessary includes - currently it seems
like most compile units end up with large chunks of boost/std
code included via transitive header dependencies.

Change-Id: I3450e0ffcb8b183e18ac59c8b33b9ecbd3f60e20
Reviewed-on: http://gerrit.cloudera.org:8080/15202
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-02-19 05:57:11 +00:00
Bikramjeet Vig
ba00551581 IMPALA-4080 [part 1]: Move codegen code from aggregation exec nodes to
their plan nodes

Refactored code to move codegen code from aggregation exec nodes to
their plan nodes. Added some TODOs that will be fixed in the next few
patches.

Testing:
- Ran queries and confirmed manually that the codegened code works.
- Ran all e2e tests for agg nodes and partition joins.

Change-Id: I58f52a262ac7d0af259d5bcda72ada93a851d3b2
Reviewed-on: http://gerrit.cloudera.org:8080/15053
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-31 03:06:24 +00:00
stiga-huang
0936384271 IMPALA-9010: Add builtin mask functions
There are 6 builtin GenericUDFs for column masking in Hive:
  mask_show_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_show_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_first_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_last_n(value, charCount, upperChar, lowerChar, digitChar,
      otherChar, numberChar)
  mask_hash(value)
  mask(value, upperChar, lowerChar, digitChar, otherChar, numberChar,
      dayValue, monthValue, yearValue)

Description of the parameters:
   value      - value to mask. Supported types: TINYINT, SMALLINT, INT,
                BIGINT, STRING, VARCHAR, CHAR, DATE (only for mask()).
   charCount  - number of characters. Default value: 4
   upperChar  - character to replace upper-case characters with. Specify
                -1 to retain original character. Default value: 'X'
   lowerChar  - character to replace lower-case characters with. Specify
                -1 to retain original character. Default value: 'x'
   digitChar  - character to replace digit characters with. Specify -1
                to retain original character. Default value: 'n'
   otherChar  - character to replace all other characters with. Specify
                -1 to retain original character. Default value: -1
   numberChar - character to replace digits in a number with. Valid
                values: 0-9. Default value: '1'
   dayValue   - value to replace day field in a date with.
                Specify -1 to retain original value. Valid values: 1-31.
                Default value: 1
   monthValue - value to replace month field in a date with. Specify -1
                to retain original value. Valid values: 0-11. Default
                value: 0
   yearValue  - value to replace year field in a date with. Specify -1
                to retain original value. Default value: 1
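
As an illustration of the parameters above, here is a minimal Python
sketch of mask_show_first_n for string values. It is not the builtin
implementation: it handles only strings, assumes ASCII input, and the
helper name _mask_char is hypothetical. The defaults follow the list
above, with -1 meaning "retain the original character".

```python
def _mask_char(ch, upper, lower, digit, other):
    """Pick the replacement for one character; -1 keeps it unchanged."""
    if ch.isupper():
        repl = upper
    elif ch.islower():
        repl = lower
    elif ch.isdigit():
        repl = digit
    else:
        repl = other
    return ch if repl == -1 else repl

def mask_show_first_n(value, char_count=4, upper="X", lower="x",
                      digit="n", other=-1):
    """Leave the first char_count characters visible and mask the rest."""
    head, tail = value[:char_count], value[char_count:]
    masked = "".join(_mask_char(c, upper, lower, digit, other) for c in tail)
    return head + masked
```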

In Hive, these functions accept a variable number of arguments of
unrestricted types:
   mask_show_first_n(val)
   mask_show_first_n(val, 8)
   mask_show_first_n(val, 8, 'X', 'x', 'n')
   mask_show_first_n(val, 8, 'x', 'x', 'x', 'x', 2)
   mask_show_first_n(val, 8, 'x', -1, 'x', 'x', '9')
The arguments of upperChar, lowerChar, digitChar, otherChar and
numberChar can be in string or numeric types.

Impala doesn't support Hive GenericUDFs, so we lack these mask
functions to support Ranger column masking policies. On the other hand,
we want the masking functions to be evaluated in the C++ builtin logic
rather than calling out to java UDFs for performance. This patch
introduces our builtin implementation of them.

We currently don't have a corresponding framework for GenericUDF
(IMPALA-9271), so we implement these by overloads. However, it may
require hundreds of overloads to cover all possible combinations. We
just implement some important overloads, including
 - those used by Ranger default masking policies,
 - those with simple arguments that may be useful for users,
 - an overload with all arguments in int type for full functionality.
   Char arguments need to be converted to their ASCII values.

Tests:
 - Add BE tests in expr-test

Change-Id: Ica779a1bf63a085d51f3b533f654cbaac102a664
Reviewed-on: http://gerrit.cloudera.org:8080/14963
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-01-17 15:34:34 +00:00
jchen
4c04e67738 IMPALA-8891: Fix non-standard null handling in concat_ws()
This patch fixes the non-standard null handling logic for
function 'concat_ws', while maintaining the original null
handling for function 'concat'.

Existing behavior:
For function concat_ws, any null string element in the array
argument 'strs' produces a null result, as shown below:
------------------------------------------------
select concat_ws('-','foo',null,'bar') as expr1;
+-------+
| expr1 |
+-------+
| NULL  |
+-------+

New behavior:
In this implementation, the function conforms to Hive's behavior:
1. Joins all the non-null strings as the result.
2. If all strings are null, returns an empty string.
3. If the separator is null, returns null.
Below is an example:
-------------------------------------------------
select concat_ws('-','foo',null,'bar') as expr1;
+----------+
|  expr1   |
+----------+
| foo-bar  |
+----------+
------------------------------------------------
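
The three rules above can be summarized in a small Python sketch, with
None standing in for SQL NULL. It models the semantics only and is not
the StringFunctions::ConcatWs code.

```python
def concat_ws(sep, *strs):
    """Join the non-null strings with sep, following the rules above."""
    if sep is None:
        # Rule 3: a null separator gives a null result.
        return None
    # Rules 1 and 2: skip nulls; no remaining strings gives ''.
    return sep.join(s for s in strs if s is not None)
```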

Key changes:
* Reimplement function StringFunctions::ConcatWs to filter out
  null values and process only the valid string values, based on
  the original code structure.
* StringFunctions::Concat was also reimplemented, as it used to
  call ConcatWs but should keep the original NULL handling.

Testing:
* Ran exhaustive tests.

Change-Id: I64cd3bfbb952e431a0cf52a5835ac05d2513d29b
Reviewed-on: http://gerrit.cloudera.org:8080/14885
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-12-17 07:11:20 +00:00
Attila Jeges
590da59a3c IMPALA-8706: ISO:SQL:2016 datetime patterns - Milestone 4
This patch adds ISO 8601 week-based date format tokens on top
of what was introduced in IMPALA-8703, IMPALA-8704 and
IMPALA-8705.

The ISO 8601 week-based date tokens may be used for both datetime
to string and string to datetime conversion.

The ISO 8601 week-based date tokens are as follows:
  - IYYY: 4-digit ISO 8601 week-numbering year.
          Week-numbering year is the year relating to the ISO
          8601 week number (IW), which is the full week (Monday
          to Sunday) which contains January 4 of the Gregorian
          year.
          Behaves similarly to YYYY in that for datetime to
          string conversion, prefix digits for 1, 2, and 3-digit
          inputs are obtained from current ISO 8601
          week-numbering year.

  - IYY:  Last 3 digits of ISO 8601 week-numbering year.
          Behaves similarly to YYY in that for datetime to string
          conversion, prefix digit is obtained from current ISO
          8601 week-numbering year and can accept 1 or 2-digit
          input.

  - IY:   Last 2 digits of ISO 8601 week-numbering year.
          Behaves similarly to YY in that for datetime to string
          conversion, prefix digits are obtained from current ISO
          8601 week-numbering year and can accept 1-digit input.

  - I:    Last digit of ISO 8601 week-numbering year.
          Behaves similarly to Y in that for datetime to string
          conversion, prefix digits are obtained from current ISO
          8601 week-numbering year.

  - IW:   ISO 8601 week of year (1-53).
          Begins on the Monday closest to January 1 of the year.
          For string to datetime conversion, if the input ISO
          8601 week does not exist in the input year, an error
          will be thrown.

          Note that IW is different from the other week-related
          tokens WW and W (implemented in IMPALA-8705). With WW
          and W weeks start with the first day of the
          year/month. ISO 8601 weeks on the other hand always
          start with Monday.

  - ID:   ISO 8601 day of week (1-7). 1 means Monday and 7 means
          Sunday.
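
For reference, the IYYY, IW and ID values can be cross-checked against
Python's standard library, which implements the same ISO 8601 week
calendar. This is only a sketch for sanity-checking expected values,
not the parser added by this patch, and the function names are
hypothetical.

```python
from datetime import date

def iso_week_fields(d):
    """Return (IYYY, IW, ID) for a Gregorian date."""
    iso_year, iso_week, iso_day = d.isocalendar()
    return iso_year, iso_week, iso_day

def from_iso_week_fields(iyyy, iw, id_):
    """Inverse mapping; raises ValueError if the week does not exist."""
    return date.fromisocalendar(iyyy, iw, id_)

# 2019-12-30 is a Monday that already belongs to ISO year 2020, week 1.
fields = iso_week_fields(date(2019, 12, 30))   # (2020, 1, 1)
```

Note that from_iso_week_fields(2019, 53, 1) raises ValueError because
ISO year 2019 has only 52 weeks, which mirrors the error case described
for IW above.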

When doing string to datetime conversion, the ISO 8601 week-based
tokens are meant to be used together and not mixed with other ISO
SQL date tokens. E.g. 'YYYY-IW-ID' is an invalid format string.

The only exceptions are the day name tokens (DAY and DY) which
may be used instead of ID with the rest of the ISO 8601
week-based date tokens. E.g. 'IYYY-IW-DAY' is a valid format
string.

Change-Id: I89a8c1b98742391cb7b331840d216558dbca362b
Reviewed-on: http://gerrit.cloudera.org:8080/14852
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Gabor Kaszab <gaborkaszab@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-12-13 20:02:16 +00:00
Gabor Kaszab
30c7a6a18c IMPALA-8705: ISO:SQL:2016 datetime patterns - Milestone 3
This patch adds additional datetime format tokens on top of
Milestone 1 (IMPALA-8703) and Milestone 2 (IMPALA-8704).

The tokens introduced:
- Full month name (MONTH, Month, month): In a string to datetime
  conversion this token can parse textual month name into a datetime
  type. In a datetime to string conversion this token gives the
  textual representation of a month.
- Short month name (MON, Mon, mon): Similar to the full month name
  token but this works for 3-character month names like 'JAN'.
- Full day name (DAY, Day, day): In a datetime to string conversion
  this token gives the textual representation of a day like
  'Tuesday'. Not supported in a string to datetime conversion.
- Short day name (DY, Dy, dy): Similar to full day name token but
  this works for 3-character day names like 'TUE'. Not supported in
  a string to datetime conversion.
- Day of week (D): In a datetime to string conversion this gives a
  number in [1-7] where 1 represents Sunday. Not supported in a
  string to datetime conversion.
- Quarter of year (Q): In a datetime to string conversion this gives
  a number in [1-4] representing a quarter of the year. Not supported
  in a string to datetime conversion.
- Week of year (WW): In a datetime to string conversion this gives a
  number in [1-53] to represent the week of year where the first week
  starts from 1st of January. Not supported in a string to datetime
  conversion.
- Week of month (W): In a datetime to string conversion this gives a
  number in [1-5] to represent the week of month where the first week
  starts from the first day of the month. Not supported in a string
  to datetime conversion.
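
The numeric tokens above (D, Q, WW, W) reduce to simple arithmetic.
The following Python sketch shows one reading of the descriptions; the
function names are hypothetical, and the formulas assume that weeks are
plain 7-day blocks counted from the first day of the year or month.

```python
from datetime import date

def day_of_week(d):
    """D: 1-7, where 1 is Sunday."""
    return d.isoweekday() % 7 + 1

def quarter(d):
    """Q: 1-4."""
    return (d.month - 1) // 3 + 1

def week_of_year(d):
    """WW: 1-53, the first week starts on January 1."""
    return (d.timetuple().tm_yday - 1) // 7 + 1

def week_of_month(d):
    """W: 1-5, the first week starts on the first day of the month."""
    return (d.day - 1) // 7 + 1
```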

Change-Id: Ic797f19a1311b54e5d00d01d0a7afe1f0f21fb8f
Reviewed-on: http://gerrit.cloudera.org:8080/14714
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-12-05 14:19:41 +00:00
norbert.luksa
a862282811 IMPALA-8709: Add Damerau-Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate restricted
Damerau-Levenshtein edit distance (optimal string alignment).
Implemented as dle_dst() and damerau_levenshtein(). If either or both
values are NULL, the functions return NULL. This differs from Netezza's
dle_dst(), which returns the length of the non-NULL value, or 0 if both
values are NULL. The NULL behavior matches the existing levenshtein()
function.
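
For readers unfamiliar with the restricted variant, here is a textbook
Python sketch of optimal string alignment distance with the NULL
handling described above (None stands in for NULL). It is not the
builtin's C++ code and the function name is hypothetical.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    if a is None or b is None:
        return None
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

The restriction shows up on inputs such as 'ca' and 'abc': this variant
gives 3, whereas unrestricted Damerau-Levenshtein gives 2, because no
substring may be edited more than once.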

Also cleans up levenshtein tests.

Testing:
- Added unit tests to expr-test.cc
- Manual testing on over 1400 string pairs from
  http://marvin.cs.uidaho.edu/misspell.html and results match Netezza

Change-Id: Ib759817ec15e7075bf49d51e494e45c8af4db94d
Reviewed-on: http://gerrit.cloudera.org:8080/13794
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-11-22 21:39:21 +00:00
Attila Jeges
684a54a89e IMPALA-7368: Change supported year range for DATE values to 1..9999
Before this patch the supported year range for DATE type started with
year 0. This contradicts the ANSI SQL standard that defines the valid
DATE value range to be 0001-01-01 to 9999-12-31.

Change-Id: Iefdf1c036834763f52d44d0c39a25a1f04e41e07
Reviewed-on: http://gerrit.cloudera.org:8080/14349
Reviewed-by: Attila Jeges <attilaj@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-10-04 18:36:22 +00:00
Tim Armstrong
983e3a66de IMPALA-2138: part 1: initial cleanup
This is a mixed bag of simplifications, debugging improvements
and test fixes that came up in the projection work.

I had to update some planner tests because some expressions
now include their arguments. Various things in the planner
tests were stale, so there are spurious changes in the
expected output that are ignored by the plan verification.

Change-Id: I75d2c8cab79988300c1a9c6c23d6ccea53da7d23
Reviewed-on: http://gerrit.cloudera.org:8080/14265
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-20 22:15:41 +00:00
Gabor Kaszab
bca1b43efb IMPALA-8703: ISO:SQL:2016 datetime patterns - Milestone 1
This enhancement introduces FORMAT clause for CAST() operator that is
applicable for casts between string types and timestamp types. Instead
of accepting SimpleDateFormat patterns the FORMAT clause supports
datetime patterns following the ISO:SQL:2016 standard.
Note, the CAST() operator without the FORMAT clause still uses
Impala's implementation of SimpleDateFormat handling. Similarly, the
existing conversion functions such as to_timestamp(), from_timestamp()
etc. remain unchanged and use SimpleDateFormat. Contrary to how these
functions work the FORMAT clause must specify a string literal and
cannot be used with any other kind of a string expression.

Milestone 1 contains all the format tokens covered by the SQL
standard. Further milestones will add more functionality on top of
this list to cover functionality provided by other RDBMS systems.

List of tokens implemented by this change:
- YYYY, YYY, YY, Y: Year tokens
- RRRR, RR: Round year tokens
- MM: Month (1-12)
- DD: Day (1-31)
- DDD: Day of year (1-366)
- HH, HH12: Hour of day (1-12)
- HH24: Hour of day (0-23)
- MI: Minute (0-59)
- SS: Second (0-59)
- SSSSS: Second of day (0-86399)
- FF, FF1, ..., FF9: Fractional second
- AM, PM, A.M., P.M.: Meridiem indicators
- TZH: Timezone hour (-99-+99)
- TZM: Timezone minute (0-99)
- Separators: - . / , ' ; : space
- ISO8601 date indicators (T, Z)

Some notes about the matching algorithm:
- The parsing algorithm uses these tokens in a case insensitive
  manner.
- The separators are interchangeable with each other. For example a
  '-' separator in the format will match with a '.' character in the
  input.
- The length of the separator sequences is handled flexibly meaning
  that a single separator character in the format for instance would
  match with a multi-separator sequence in the input.
- In a string type to timestamp conversion the timezone offset tokens
  are parsed, expected to match with the input but they don't adjust
  the result as the input is already expected to be in UTC format.

Usage example:
SELECT CAST('01-02-2019' AS TIMESTAMP FORMAT 'MM-DD-YYYY');
SELECT CAST('2019.10.10 13:30:40.123456 +01:30' AS TIMESTAMP
    FORMAT 'YYYY-MM-DD HH24:MI:SS.FF9 TZH:TZM');
SELECT CAST(timestamp_column as STRING
    FORMAT "YYYY MM HH12 YY") from some_table;

Change-Id: I19d8d097a45ae6f103b6cd1b2d81aad38dfd9e23
Reviewed-on: http://gerrit.cloudera.org:8080/13722
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-19 18:46:19 +00:00
Sahil Takiar
151835116a IMPALA-7312: Non-blocking mode for Fetch() RPC
Adds the query option FETCH_ROWS_TIMEOUT_MS to control the client
timeout when fetching rows. Set to 10 seconds by default to avoid
unnecessary fetch requests. The timeout applies whether result spooling
is enabled or disabled.

When result spooling is disabled, the timeout controls how long the
client thread will wait for a single RowBatch to be produced by the
coordinator fragment. When result spooling is enabled, a client can
fetch multiple RowBatches at a time, so the timeout controls the total
time spent waiting for RowBatches to be produced.

The timeout applies to both waiting for rows to be sent by the fragment
instance thread, and waiting for rows to be materialized (e.g. the time
measured by RowMaterializationTimer).
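
The "total time across multiple batches" behaviour can be modelled with
a single deadline shared by all waits. The Python sketch below is an
analogy only: a queue.Queue stands in for the producer side, and the
function and parameter names are hypothetical, not Impala's.

```python
import queue
import time

def fetch_batches(source, max_batches, timeout_s):
    """Collect up to max_batches items, waiting at most timeout_s in total."""
    deadline = time.monotonic() + timeout_s
    batches = []
    while len(batches) < max_batches:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            # Each wait is bounded by what is left of the shared budget.
            batches.append(source.get(timeout=remaining))
        except queue.Empty:
            break
    # May hold fewer batches than requested; the caller fetches again.
    return batches
```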

Testing:
* Added new tests to test_fetch.py
* Ran core tests

Change-Id: I331acaba23a65dab43cca48e9dc0dc957b9c632d
Reviewed-on: http://gerrit.cloudera.org:8080/14157
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-09-10 05:56:57 +00:00
norbertluksa
1f904719e4 IMPALA-7770: SPLIT_PART to support negative indexes
The third parameter of SPLIT_PART (nth field) now accepts
negative values, in which case the string is searched backwards.
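
A minimal Python sketch of the intended indexing follows. It models the
semantics only; treating index 0 as an error and returning an empty
string for out-of-range indexes are assumptions here, not statements
about the builtin.

```python
def split_part(s, delimiter, n):
    """Return the nth field (1-based); negative n counts from the end."""
    if n == 0:
        raise ValueError("field index must be non-zero")  # assumed behavior
    fields = s.split(delimiter)
    index = n - 1 if n > 0 else len(fields) + n
    if 0 <= index < len(fields):
        return fields[index]
    return ""  # assumed result for an out-of-range index
```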

Testing:
 * Added unit tests to expr-test.cc

Change-Id: I2db762989a90bd95661a59eb9c11a29eb2edfafb
Reviewed-on: http://gerrit.cloudera.org:8080/13880
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-15 11:06:49 +00:00
luksan47
8db7f27ddd IMPALA-8752: Added Jaro-Winkler edit distance and similarity built-in function
The added functions return the Jaro/Jaro-Winkler similarity/distance
of two strings. The algorithm calculates the Jaro-Similarity of the
strings, then adds more weight to the result if there are
common prefixes. (Jaro-Winkler)
For more detail, see:
https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Extended the algorithm with another optional parameter: boost threshold
The prefix weight will only be applied if the Jaro-similarity
exceeds the given threshold. By default, its value is 0.7.
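
The algorithm and the boost threshold can be sketched in Python as
follows. This is a textbook version, not the code added by this patch:
the scaling factor of 0.1 and the prefix cap of 4 characters are the
usual Jaro-Winkler constants and are assumptions here; only the 0.7
default threshold comes from the description above.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i in range(len1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s1[i] == s2[j]:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler_similarity(s1, s2, scaling=0.1, boost_threshold=0.7):
    """Add the prefix bonus only when Jaro similarity exceeds the threshold."""
    jaro = jaro_similarity(s1, s2)
    if jaro <= boost_threshold:
        return jaro
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1.0 - jaro)
```

The distance variants are then 1 minus the corresponding similarity.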

The new built-in functions are:
 * jaro_distance, jaro_dst
 * jaro_similarity, jaro_sim
 * jaro_winkler_distance, jw_dst
 * jaro_winkler_similarity, jw_sim

Testing:
 * Added unit tests to expr-test.cc
 * Manual testing over 1400 word pairs from
   http://marvin.cs.uidaho.edu/misspell.html
   Results match Apache commons

Change-Id: I64d7f461516c5e66cc27d62612bc8cc0e8f0178c
Reviewed-on: http://gerrit.cloudera.org:8080/13870
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-08-13 18:25:32 +00:00
Joe McDonnell
4cc3ff9c67 IMPALA-8176: Convert simple backend tests to the unified executable
This converts tests with trivial main() functions to the unified
executable. This means that the code change is strictly removing
main() functions and updating the CMakeLists.txt files. Any test
that requires a change larger than that will be addressed
separately. The only exceptions are:
 - exec/incr-stats-util-test.cc requires naming changes to avoid
   conflicts with util/rle-test.cc
 - runtime/decimal-test.cc simplified the naming to make the
   CMakeLists.txt arguments easier.

The new test libraries are marked STATIC, because they are linked
into a single binary (unifiedbetests) and googletest has problems
with tests in shared libraries.

Converting this set of tests saves about 18GB of disk
space for a debug build and saves a minute or two of link time.

For any CMakeLists.txt that has unified tests, this adds a comment
for each test that is not unified.

Testing:
 - Ran backend tests in DEBUG and ASAN modes on Centos7
 - Ran backend tests in DEBUG mode on Centos6

Change-Id: I840d0f9b70edb3a7195a2a33b21fd2874d4c52bd
Reviewed-on: http://gerrit.cloudera.org:8080/13515
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-07-19 22:01:00 +00:00
Tim Armstrong
c353cf7a64 IMPALA-8713: fix stack overflow in unhex()
Write the results into the output heap buffer
instead of into a temporary stack buffer.

No additional memory is used because
AnyValUtil::FromBuffer() allocated a temporary
buffer anyway.

Testing:
Added a targeted test to expr-test that caused
a crash before this fix.

Change-Id: Ie0c1760511a04c0823fc465cf6e529e9681b2488
Reviewed-on: http://gerrit.cloudera.org:8080/13743
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-28 03:02:01 +00:00
Jiawei Wang
dbba52c77c IMPALA-8665: Include extra info in error message when date cast fails
This change extends the error message Impala yields when casting STRING
to DATE (explicitly or implicitly) fails. The new error message includes
the violating string value.

Testing:
Updated date-partitioning.test and date.test.
The query_test/test_date_queries.py test passed.

Example:
select cast('20' as date);
ERROR: UDF ERROR: String to Date parse failed. Invalid string val: "20"

Change-Id: If800b7696515cd61afee27220c55ff2440a86f04
Reviewed-on: http://gerrit.cloudera.org:8080/13680
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-27 14:39:46 +00:00
Attila Jeges
f40935a30e IMPALA-7369: part 2: Add INTERVAL expr support and built-in functions for DATE
This change implements INTERVAL expression support for DATE type and
adds several DATE related built-in functions. The efficiency of the
DateValue::ToYearMonthDay() function used in many of the built-in
functions below was also improved.

The following functions are supported in Hive:

  INT YEAR(DATE d)
  Extracts year of the 'd' date, returns it as an int in 0-9999 range.

  INT MONTH(DATE d)
  Extracts month of the 'd' date and returns it as an int in 1-12
  range.

  INT DAY(DATE d), INT DAYOFMONTH(DATE d)
  Extracts day-of-month of the 'd' date and returns it as an int in
  1-31 range.

  INT QUARTER(DATE d)
  Extracts quarter of the 'd' date and returns it as an int in 1-4
  range.

  INT DAYOFWEEK(DATE d)
  Extracts day-of-week of the 'd' date and returns it as an int in
  1-7 range. 1 is Sunday and 7 is Saturday.

  INT DAYOFYEAR(DATE d)
  Extracts day-of-year of the 'd' date and returns it as an int in
  1-366 range.

  INT WEEKOFYEAR(DATE d)
  Extracts week-of-year of the 'd' date and returns it as an int in
  1-53 range.

  STRING DAYNAME(DATE d)
  Returns the day field from a 'd' date, converted to the string
  corresponding to that day name. The range of return values is
  "Sunday" to "Saturday".

  STRING MONTHNAME(DATE d)
  Returns the month field from a 'd' date, converted to the string
  corresponding to that month name. The range of return values is
  "January" to "December".

  DATE NEXT_DAY(DATE d, STRING weekday)
  Returns the first date which is later than 'd' and falls on
  'weekday'. 'weekday' is the 3-letter abbreviation or the full name
  of a day of the week.

  DATE LAST_DAY(DATE d)
  Returns the last day of the month which the 'd' date belongs to.

  INT DATEDIFF(DATE d1, DATE d2)
  Returns the number of days from 'd1' date to 'd2' date.

  DATE CURRENT_DATE()
  Returns the current date (in the local time zone).

  INT INT_MONTHS_BETWEEN(DATE d1, DATE d2)
  Returns the number of months between 'd1' and 'd2' dates, as an int
  representing only the full months that passed.
  If 'd1' represents an earlier date than 'd2', the result is
  negative.

  DOUBLE MONTHS_BETWEEN(DATE d1, DATE d2)
  Returns the number of months between 'd1' and 'd2' dates. Can
  include a fractional part representing extra days in addition to the
  full months between the dates. The fractional component is computed
  by dividing the difference in days by 31 (regardless of the month).
  If 'd1' represents an earlier date than 'd2', the result is
  negative.

  DATE ADD_YEARS(DATE d, INT/BIGINT num_years),
  DATE SUB_YEARS(DATE d, INT/BIGINT num_years)
  Adds/subtracts a specified number of years to a 'd' date value.

  DATE ADD_MONTHS(DATE d, INT/BIGINT num_months),
  DATE SUB_MONTHS(DATE d, INT/BIGINT num_months)
  Adds/subtracts a specified number of months to a date value.
  If 'd' is the last day of a month, the returned date will fall on
  the last day of the target month too.

  DATE ADD_DAYS(DATE d, INT/BIGINT num_days),
  DATE SUB_DAYS(DATE d, INT/BIGINT num_days)
  Adds/subtracts a specified number of days to a date value.

  DATE ADD_WEEKS(DATE d, INT/BIGINT num_weeks),
  DATE SUB_WEEKS(DATE d, INT/BIGINT num_weeks)
  Adds/subtracts a specified number of weeks to a date value.
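
The last-day rule described for ADD_MONTHS/SUB_MONTHS can be sketched as
follows. This is an illustration only: YmdSketch and AddMonthsSketch are
hypothetical names, range and overflow checks are left out, and clamping
a day that does not exist in the target month (e.g. Jan 30 + 1 month) is
an assumption that the text above does not spell out.

```cpp
#include <cassert>

struct YmdSketch {
  int year;
  int month;  // 1-12
  int day;    // 1-31
};

inline bool IsLeapYear(int year) {
  return (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
}

inline int DaysInMonth(int year, int month) {
  static const int kDays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  assert(month >= 1 && month <= 12);
  return (month == 2 && IsLeapYear(year)) ? 29 : kDays[month - 1];
}

// Adds 'num_months' (may be negative, which models SUB_MONTHS) to 'd'.
inline YmdSketch AddMonthsSketch(YmdSketch d, long long num_months) {
  // Count months from year 0 so the year/month split is a single division.
  const long long total = d.year * 12LL + (d.month - 1) + num_months;
  assert(total >= 0);  // results before year 0 are not handled here
  const int year = static_cast<int>(total / 12);
  const int month = static_cast<int>(total % 12) + 1;
  const int last_day = DaysInMonth(year, month);
  const bool was_last_day = (d.day == DaysInMonth(d.year, d.month));
  // The last day of a month maps to the last day of the target month;
  // otherwise the day is kept, clamped to the target month (assumption).
  const int day = (was_last_day || d.day > last_day) ? last_day : d.day;
  return YmdSketch{year, month, day};
}
```

With this sketch, 2019-02-28 plus one month gives 2019-03-31, because
February 28 is the last day of February 2019.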

The following function doesn't exist in Hive but is supported by Amazon
Redshift:

  INT DATE_CMP(DATE d1, DATE d2)
  Compares 'd1' and 'd2' dates. Returns:
  1. NULL, if either 'd1' or 'd2' is NULL
  2. -1 if d1 < d2
  3. 1 if d1 > d2
  4. 0 if d1 == d2
  (https://docs.aws.amazon.com/redshift/latest/dg/r_DATE_CMP.html)
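
The four return cases above map onto a small three-way comparison. The
sketch below is illustrative only: DateArgSketch and IntResultSketch are
hypothetical stand-ins for nullable argument and result values, and
representing a date as a day count is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical nullable DATE argument, stored as a day count.
struct DateArgSketch {
  bool is_null;
  int32_t days;
};

// Hypothetical nullable INT result.
struct IntResultSketch {
  bool is_null;
  int32_t val;
};

inline IntResultSketch DateCmpSketch(const DateArgSketch& d1,
                                     const DateArgSketch& d2) {
  // Case 1: NULL if either input is NULL.
  if (d1.is_null || d2.is_null) return IntResultSketch{true, 0};
  // Cases 2-4: -1, 1 or 0 depending on the ordering of the two dates.
  if (d1.days < d2.days) return IntResultSketch{false, -1};
  if (d1.days > d2.days) return IntResultSketch{false, 1};
  return IntResultSketch{false, 0};
}
```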

Change-Id: If404bffdaf055c769e79ffa8f193bac415cfdd1a
Reviewed-on: http://gerrit.cloudera.org:8080/13648
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-25 23:06:25 +00:00
Jim Apple
45c6c46bf6 IMPALA-5031: signed overflow is undefined behavior
Fix remaining signed overflow undefined behaviors in end-to-end
tests. The interesting part of the backtraces:

    exprs/aggregate-functions-ir.cc:464:25: runtime error: signed
       integer overflow: 0x5a4728ca063b522c0b728f8000000000 +
       0x3c2f7086aed236c807a1b50000000000 cannot be represented in
       type '__int128'
    #0 AggregateFunctions::DecimalAvgMerge(
       impala_udf::FunctionContext*, impala_udf::StringVal const&,
       impala_udf::StringVal*) exprs/aggregate-functions-ir.cc:464:25
    #1 AggFnEvaluator::Update(TupleRow const*, Tuple*, void*)
       exprs/agg-fn-evaluator.cc:327:7
    #2 AggFnEvaluator::Add(TupleRow const*, Tuple*)
       exprs/agg-fn-evaluator.h:257:3
    #3 Aggregator::UpdateTuple(AggFnEvaluator**, Tuple*, TupleRow*, bool)
       exec/aggregator.cc:167:24
    #4 NonGroupingAggregator::AddBatchImpl(RowBatch*)
       exec/non-grouping-aggregator-ir.cc:27:5
    #5 NonGroupingAggregator::AddBatch(RuntimeState*, RowBatch*)
       exec/non-grouping-aggregator.cc:124:45
    #6 AggregationNode::Open(RuntimeState*)
       exec/aggregation-node.cc:70:57

    exprs/aggregate-functions-ir.cc:513:12: runtime error: signed
       integer overflow: -8282081183197145958 + -4473782455107795527
       cannot be represented in type 'long'
    #0 void AggregateFunctions::SumUpdate<impala_udf::BigIntVal,
       impala_udf::BigIntVal>(impala_udf::FunctionContext*,
       impala_udf::BigIntVal const&, impala_udf::BigIntVal*)
       exprs/aggregate-functions-ir.cc:513:12
    #1 AggFnEvaluator::Update(TupleRow const*, Tuple*, void*)
       exprs/agg-fn-evaluator.cc:327:7
    #2 AggFnEvaluator::Add(TupleRow const*, Tuple*)
       exprs/agg-fn-evaluator.h:257:3
    #3 Aggregator::UpdateTuple(AggFnEvaluator**, Tuple*, TupleRow*,
       bool) exec/aggregator.cc:167:24
    #4 NonGroupingAggregator::AddBatchImpl(RowBatch*)
       exec/non-grouping-aggregator-ir.cc:27:5
    #5 NonGroupingAggregator::AddBatch(RuntimeState*, RowBatch*)
       exec/non-grouping-aggregator.cc:124:45
    #6 AggregationNode::Open(RuntimeState*)
       exec/aggregation-node.cc:70:57

    exprs/aggregate-functions-ir.cc:585:14: runtime error: signed
       integer overflow: 0x5a4728ca063b522c0b728f8000000000 +
       0x3c2f7086aed236c807a1b50000000000 cannot be represented in
       type '__int128'
    #0 AggregateFunctions::SumDecimalMerge(
       impala_udf::FunctionContext*, impala_udf::DecimalVal const&,
       impala_udf::DecimalVal*) exprs/aggregate-functions-ir.cc:585:14
    #1 AggFnEvaluator::Update(TupleRow const*, Tuple*, void*)
       exprs/agg-fn-evaluator.cc:327:7
    #2 AggFnEvaluator::Add(TupleRow const*, Tuple*)
       exprs/agg-fn-evaluator.h:257:3
    #3 Aggregator::UpdateTuple(AggFnEvaluator**, Tuple*, TupleRow*, bool)
       exec/aggregator.cc:167:24
    #4 NonGroupingAggregator::AddBatchImpl(RowBatch*)
       exec/non-grouping-aggregator-ir.cc:27:5
    #5 NonGroupingAggregator::AddBatch(RuntimeState*, RowBatch*)
       exec/non-grouping-aggregator.cc:124:45
    #6 AggregationNode::Open(RuntimeState*)
       exec/aggregation-node.cc:70:57

    runtime/decimal-value.inline.h:145:12: runtime error: signed
       integer overflow: 18 * 0x0785ee10d5da46d900f436a000000000 cannot
       be represented in type '__int128'
    #0 DecimalValue<__int128>::ScaleTo(int, int, int, bool*) const
       runtime/decimal-value.inline.h:145:12
    #1 DecimalOperators::ScaleDecimalValue(
      impala_udf::FunctionContext*, DecimalValue<int> const&, int,
      int, int) exprs/decimal-operators-ir.cc:132:41
    #2 DecimalOperators::RoundDecimal(impala_udf::FunctionContext*,
       impala_udf::DecimalVal const&, int, int, int, int,
       DecimalOperators::DecimalRoundOp const&)
       exprs/decimal-operators-ir.cc:465:16
    #3 DecimalOperators::RoundDecimal(impala_udf::FunctionContext*,
       impala_udf::DecimalVal const&, DecimalOperators::DecimalRoundOp
       const&) exprs/decimal-operators-ir.cc:519:10
    #4 DecimalOperators::CastToDecimalVal(
       impala_udf::FunctionContext*, impala_udf::DecimalVal const&)
       exprs/decimal-operators-ir.cc:529:10
    #5 impala_udf::DecimalVal ScalarFnCall::InterpretEval
       <impala_udf::DecimalVal>(ScalarExprEvaluator*, TupleRow const*)
       const exprs/scalar-fn-call.cc:485:208
    #6 ScalarFnCall::GetDecimalVal(ScalarExprEvaluator*, TupleRow
       const*) const exprs/scalar-fn-call.cc:618:44
    #7 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow
       const*) exprs/scalar-expr-evaluator.cc:321:27
    #8 ScalarExprEvaluator::GetValue(TupleRow const*)
       exprs/scalar-expr-evaluator.cc:251:10
    #9 Java_org_apache_impala_service_FeSupport_NativeEvalExprsWithoutRow
       service/fe-support.cc:246:26
    #10 (<unknown module>)

    runtime/multi-precision.h:116:21: runtime error: negation of
       0x80000000000000000000000000000000 cannot be represented in
       type 'int128_t' (aka '__int128'); cast to an unsigned type to
       negate this value to itself
    #0 ConvertToInt128(boost::multiprecision::number
       <boost::multiprecision::backends::cpp_int_backend<256u, 256u,
       (boost::multiprecision::cpp_integer_type)1,
       (boost::multiprecision::cpp_int_check_type)0, void>,
       (boost::multiprecision::expression_template_option)0>,
       __int128, bool*) runtime/multi-precision.h:116:21
    #1 DecimalValue<__int128>
       DecimalValue<__int128>::Multiply<__int128>(int,
       DecimalValue<__int128> const&, int, int, int, bool, bool*) const
       runtime/decimal-value.inline.h:438:16
    #2 DecimalOperators::Multiply_DecimalVal_DecimalVal(
       impala_udf::FunctionContext*, impala_udf::DecimalVal const&,
       impala_udf::DecimalVal const&)
       exprs/decimal-operators-ir.cc:859:3336
    #3 impala_udf::DecimalVal ScalarFnCall::InterpretEval
       <impala_udf::DecimalVal>(ScalarExprEvaluator*, TupleRow const*)
       const exprs/scalar-fn-call.cc:485:376
    #4 ScalarFnCall::GetDecimalVal(ScalarExprEvaluator*, TupleRow
       const*) const exprs/scalar-fn-call.cc:618:44
    #5 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow
       const*) exprs/scalar-expr-evaluator.cc:321:27
    #6 ScalarExprEvaluator::GetValue(TupleRow const*)
       exprs/scalar-expr-evaluator.cc:251:10
    #7 Java_org_apache_impala_service_FeSupport_NativeEvalExprsWithoutRow
       service/fe-support.cc:246:26
    #8 (<unknown module>)

    util/runtime-profile-counters.h:194:24: runtime error: signed
       integer overflow: -1263418397011577524 + -9223370798768111350
       cannot be represented in type 'long'
    #0 RuntimeProfile::AveragedCounter::UpdateCounter
       (RuntimeProfile::Counter*)
       util/runtime-profile-counters.h:194:24
    #1 RuntimeProfile::UpdateAverage(RuntimeProfile*)
       util/runtime-profile.cc:199:20
    #2 RuntimeProfile::UpdateAverage(RuntimeProfile*)
       util/runtime-profile.cc:245:14
    #3 Coordinator::BackendState::UpdateExecStats
       (vector<Coordinator::FragmentStats*,
       allocator<Coordinator::FragmentStats*> > const&)
       runtime/coordinator-backend-state.cc:429:22
    #4 Coordinator::ComputeQuerySummary()
       runtime/coordinator.cc:775:20
    #5 Coordinator::HandleExecStateTransition(Coordinator::ExecState,
       Coordinator::ExecState) runtime/coordinator.cc:567:3
    #6 Coordinator::SetNonErrorTerminalState(Coordinator::ExecState)
       runtime/coordinator.cc:484:3
    #7 Coordinator::GetNext(QueryResultSet*, int, bool*)
       runtime/coordinator.cc:657:53
    #8 ClientRequestState::FetchRowsInternal(int, QueryResultSet*)
       service/client-request-state.cc:943:34
    #9 ClientRequestState::FetchRows(int, QueryResultSet*)
       service/client-request-state.cc:835:36
    #10 ImpalaServer::FetchInternal(TUniqueId const&, bool, int,
        beeswax::Results*) service/impala-beeswax-server.cc:545:40
    #11 ImpalaServer::fetch(beeswax::Results&, beeswax::QueryHandle
        const&, bool, int) service/impala-beeswax-server.cc:178:19
    #12 beeswax::BeeswaxServiceProcessor::process_fetch(int,
        apache::thrift::protocol::TProtocol*,
        apache::thrift::protocol::TProtocol*, void*)
        generated-sources/gen-cpp/BeeswaxService.cpp:3398:13
    #13 beeswax::BeeswaxServiceProcessor::dispatchCall
        (apache::thrift::protocol::TProtocol*,
        apache::thrift::protocol::TProtocol*, string const&, int,
        void*) generated-sources/gen-cpp/BeeswaxService.cpp:3200:3
    #14 ImpalaServiceProcessor::dispatchCall
        (apache::thrift::protocol::TProtocol*,
        apache::thrift::protocol::TProtocol*, string const&, int,
        void*) generated-sources/gen-cpp/ImpalaService.cpp:1824:48
    #15 apache::thrift::TDispatchProcessor::process
        (boost::shared_ptr<apache::thrift::protocol::TProtocol>,
        boost::shared_ptr<apache::thrift::protocol::TProtocol>, void*)
        toolchain/thrift-0.9.3-p5/include/thrift/TDispatchProcessor.h:121:12
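
For readers unfamiliar with this class of bug, here is a minimal sketch
of two common ways to avoid signed-overflow undefined behavior. It is
not taken from this patch: WrappingAdd, WrappingNegate and CheckedAdd
are hypothetical names, and which approach fits depends on whether
wraparound or an error is the desired result.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Adds with two's-complement wraparound by doing the arithmetic in the
// unsigned type, where overflow is well defined. The conversion back to
// int64_t is modular on GCC and Clang (and required to be since C++20).
inline int64_t WrappingAdd(int64_t a, int64_t b) {
  return static_cast<int64_t>(static_cast<uint64_t>(a) +
                              static_cast<uint64_t>(b));
}

// Negates via the unsigned type, so the minimum value maps to itself
// instead of triggering undefined behavior.
inline int64_t WrappingNegate(int64_t a) {
  return static_cast<int64_t>(uint64_t{0} - static_cast<uint64_t>(a));
}

// Returns false, leaving '*out' untouched, if a + b is not representable.
inline bool CheckedAdd(int64_t a, int64_t b, int64_t* out) {
  const int64_t kMax = std::numeric_limits<int64_t>::max();
  const int64_t kMin = std::numeric_limits<int64_t>::min();
  if ((b > 0 && a > kMax - b) || (b < 0 && a < kMin - b)) return false;
  *out = a + b;
  return true;
}
```

The operands from the SumUpdate report above,
-8282081183197145958 and -4473782455107795527, make CheckedAdd() return
false, while WrappingAdd() returns a wrapped value without invoking
undefined behavior.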

Change-Id: I73dd6802ec1023275d09a99a2950f3558313fc8e
Reviewed-on: http://gerrit.cloudera.org:8080/13437
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-10 08:34:27 +00:00
Attila Jeges
f0678b06e6 IMPALA-7369: part 1: Implement TRUNC, DATE_TRUNC, EXTRACT, DATE_PART functions for DATE
These functions are somewhat similar in that each of them takes a DATE
argument and a time unit to work with.

They work identically to the corresponding TIMESTAMP functions. The
only difference is that the DATE functions don't accept time-of-day
units.

TRUNC(DATE d, STRING unit)
Truncates a DATE value to the specified time unit. The 'unit' argument
is case insensitive. This argument string can be one of:
  SYYYY, YYYY, YEAR, SYEAR, YYY, YY, Y: Year.
  Q: Quarter.
  MONTH, MON, MM, RM: Month.
  DDD, DD, J: Day.
  DAY, DY, D: Starting day (Monday) of the week.
  WW: Truncates to the most recent date, no later than 'd', which is
      on the same day of the week as the first day of year.
  W: Truncates to the most recent date, no later than 'd', which is on
     the same day of the week as the first day of month.

The implementation mirrors Impala's TRUNC(TIMESTAMP ts, STRING unit)
function. Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions201.htm
.

DATE_TRUNC(STRING unit, DATE d)
Truncates a DATE value to the specified precision. The 'unit' argument
is case insensitive. This argument string can be one of: DAY, WEEK,
MONTH, YEAR, DECADE, CENTURY, MILLENNIUM.

The implementation mirrors Impala's DATE_TRUNC(STRING unit,
TIMESTAMP ts) function. Vertica has a similar function too.
Reference:
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/
    SQLReferenceManual/Functions/Date-Time/DATE_TRUNC.htm
.

EXTRACT(DATE d, STRING unit), EXTRACT(unit FROM DATE d)
Returns one of the numeric date fields from a DATE value. The 'unit'
string can be one of YEAR, QUARTER, MONTH, DAY. This argument value is
case-insensitive.

The implementation mirrors Impala's EXTRACT(TIMESTAMP ts,
STRING unit). Hive and Oracle SQL have a similar function too.
Reference:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
.

DATE_PART(STRING unit, DATE date)
Similar to EXTRACT(), with the argument order reversed. Supports the
same date units as EXTRACT().

The implementation mirrors Impala's DATE_PART(STRING unit,
TIMESTAMP ts) function.

Change-Id: I843358a45eb5faa2c134994600546fc1d0a797c8
Reviewed-on: http://gerrit.cloudera.org:8080/13363
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-05 14:23:23 +00:00
Tim Armstrong
95a1da2d32 IMPALA-8578: part 2: move metrics code to .cc files
This moves a lot of metric function definitions into .cc files,
to reduce the size of compilation units and to reduce the
frequency of recompilation required when making changes
to metrics.

This moves most of the large, non-perf-critical metric
functions into .cc files. For template classes, this
requires explicitly instantiating all combinations of
template parameters that are used in impala, including
in tests.
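
As an illustration of the pattern (not the actual metrics code), the
sketch below keeps a small accessor inline, defines a larger member out
of line as it would be in a .cc file, and explicitly instantiates the
template for the parameter types in use. SimpleMetricSketch and its
members are hypothetical names; a real split would put the two marked
parts into separate files.

```cpp
#include <cassert>
#include <cstdint>

// ---- Would live in the header ----
template <typename T>
class SimpleMetricSketch {
 public:
  explicit SimpleMetricSketch(T initial) : value_(initial) {}
  // Small accessor: stays inline in the header.
  T GetValue() const { return value_; }
  // Declared only; the definition is not visible to includers.
  void Increment(T delta);

 private:
  T value_;
};

// ---- Would live in the .cc file ----
template <typename T>
void SimpleMetricSketch<T>::Increment(T delta) {
  value_ += delta;
}

// Explicit instantiations: every combination used anywhere (including
// tests) must be listed, or callers fail to link.
template class SimpleMetricSketch<int64_t>;
template class SimpleMetricSketch<double>;
```

The trade-off is that adding a new template argument elsewhere in the
codebase now requires a matching instantiation line in the .cc file.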

Disable weak-template-vtables warning because of
spurious warnings on template instantiations. See
https://bugs.llvm.org/show_bug.cgi?id=18733

Change-Id: I78ad045ded6e6a7b7524711be9302c26115b97b9
Reviewed-on: http://gerrit.cloudera.org:8080/13501
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-06-05 03:29:47 +00:00
Tim Armstrong
d4648e87b4 IMPALA-4356,IMPALA-7331: codegen all ScalarExprs
Based on initial draft patch by Pooja Nilangekar.

Codegen'd expressions can be executed in two ways - either by
being called directly from a fully codegen'd function, or from
interpreted code via a function pointer (previously
ScalarFnCall::scalar_fn_wrapper_).

This change moves the function pointer from ScalarFnCall to its
base class ScalarExpr, so the full expr tree can be codegen'd, not
just the ScalarFnCall subtrees. The key refactoring and improvements
are:
* ScalarExpr::Get*Val() switches between interpreted and the codegen'd
  function pointer code paths in an inline function, avoiding a
  virtual function call to ScalarFnCall::Get*Val().
* Boilerplate logic is moved to ScalarExpr::GetCodegendComputeFn(),
  which calls a virtual function GetCodegenComputeFnImpl().
* ScalarFnCall's logic for deciding whether to interpret or codegen is
  better abstracted and exposed to ScalarExpr as IsInterpretable()
  and ShouldCodegen() methods.
* The ScalarExpr::codegend_compute_fn_ function pointer is only
  populated for expressions that are "codegen entry points". These
  include the roots of expr trees and non-root expressions
  where the parent expression calls Get*Val() from the
  pseudo-codegend GetCodegendComputeFnWrapper().
* ScalarFnCall is always initialised for interpreted execution.
  Otherwise the function pointer is needed for non-root expressions,
  e.g. to support ScalarExprEvaluator::GetConstantVal().
* Latent bugs/gaps for codegen of CollectionVal are fixed. CollectionVal
  is modified to use the StringVal memory layout to allow code sharing
  with StringVal. These fixes allowed simplification of
  IsNotEmptyPredicate codegen (from IMPALA-7657).
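
The dispatch described in the first bullet can be sketched roughly as
below. This is a simplified illustration, not the real class:
ExprSketch, AddOneSketch, compiled_fn_ and the int64_t-only signature
are hypothetical, and a plain function pointer stands in for a
JIT-compiled function.

```cpp
#include <cassert>
#include <cstdint>

class ExprSketch {
 public:
  // Signature of the (stand-in) compiled function.
  using ComputeFn = int64_t (*)(const ExprSketch*, int64_t row);

  virtual ~ExprSketch() = default;

  // Non-virtual and inlinable: takes the function-pointer path when a
  // compiled function is present, otherwise the interpreted path.
  int64_t GetVal(int64_t row) const {
    if (compiled_fn_ != nullptr) return compiled_fn_(this, row);
    return GetValInterpreted(row);
  }

  // Only "entry point" expressions would have this populated.
  void SetCompiledFn(ComputeFn fn) { compiled_fn_ = fn; }

 protected:
  // Interpreted implementation supplied by each expression type.
  virtual int64_t GetValInterpreted(int64_t row) const = 0;

 private:
  ComputeFn compiled_fn_ = nullptr;
};

// Example expression: returns row + 1 when interpreted.
class AddOneSketch : public ExprSketch {
 protected:
  int64_t GetValInterpreted(int64_t row) const override { return row + 1; }
};
```

Keeping the pointer in the base class means callers never need to know
which concrete expression type they hold in order to reach compiled
code.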

I chose to tackle two problems in one change - adding support for
generating codegen'd function pointers for all ScalarExprs, and adding
the "entry point" concept - to avoid a blow-up in the number of
codegen'd entry points that could lead to longer codegen times and/or
worse code because of inlining changes.

IMPALA-7331 (CHAR codegen support functions) is also fixed because
it was simpler to enable CHAR codegen within ScalarExpr than to carry
forward the existing CHAR workarounds from ScalarFnCall. The
CHAR-specific codegen support required in the scalar expr subsystem is
very limited. StringVal intermediates are used everywhere. Only
SlotRef actually operates on the different tuple layout, and the
required codegen support for SlotRef already exists for UDA
intermediates anyway.

Testing:
* Ran exhaustive tests.

Perf:
* Ran a basic insert benchmark, which went from 10.1s to 7.6s
  create table foo stored as parquet as
  select case when l_orderkey % 2 = 0 then 'aaa' else 'bbb' end
  from tpch30_parquet.lineitem;
* Ran a basic CHAR expr test:
  set num_nodes=1;
  set mt_dop=1;
  select count(*) from lineitem
  where cast(l_linestatus as CHAR(2)) = 'O ' and
        cast(l_returnflag as CHAR(2)) = 'N '
  The time spent in the scan went from 520ms to 220ms.
* Added perf regression test to tpcds-insert, similar to the manual
  benchmark.
* Ran single-node TPC-H with large and small scale factors, to estimate
  impact on execution perf and query startup time, respectively.

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 6.84    | -0.18%     | 4.49       | -0.31%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCH(30) | TPCH-Q20 | parquet / none / none | 2.58   | 2.47        |   +4.18%   |   1.29%   |   0.88%        | 5     |   +4.12%       | 2.31    | 5.81   |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 4.81   | 4.61        |   +4.33%   |   2.18%   |   2.15%        | 5     |   +3.91%       | 1.73    | 3.09   |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 26.45  | 26.16       |   +1.09%   |   0.37%   |   0.50%        | 5     |   +1.36%       | 2.02    | 3.94   |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 15.92  | 15.75       |   +1.09%   |   2.87%   |   1.65%        | 5     |   +0.88%       | 0.29    | 0.73   |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.38   | 2.35        |   +1.12%   |   1.64%   |   1.11%        | 5     |   +0.80%       | 1.15    | 1.26   |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.94   | 2.91        |   +1.13%   |   7.68%   |   5.37%        | 5     |   -0.34%       | -0.29   | 0.27   |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 18.10  | 18.02       |   +0.42%   |   2.70%   |   0.56%        | 5     |   +0.28%       | 0.29    | 0.34   |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 4.72   | 4.72        |   -0.04%   |   1.20%   |   1.65%        | 5     |   +0.05%       | 0.00    | -0.04  |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.92   | 3.93        |   -0.26%   |   1.08%   |   2.36%        | 5     |   +0.20%       | 0.58    | -0.23  |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.27   | 1.27        |   -0.28%   |   0.22%   |   0.88%        | 5     |   +0.09%       | 0.29    | -0.68  |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.64   | 2.65        |   -0.45%   |   1.65%   |   0.65%        | 5     |   -0.24%       | -0.58   | -0.57  |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 3.10   | 3.13        |   -0.76%   |   1.47%   |   1.12%        | 5     |   -0.21%       | -0.29   | -0.93  |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 1.20   | 1.21        |   -0.80%   |   2.26%   |   2.47%        | 5     |   -0.82%       | -1.15   | -0.53  |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 1.97   | 1.99        |   -1.37%   |   1.84%   |   3.21%        | 5     |   -0.47%       | -0.58   | -0.83  |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 11.53  | 11.63       |   -0.91%   |   0.46%   |   0.49%        | 5     |   -0.95%       | -2.02   | -3.08  |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 5.13   | 5.21        |   -1.51%   |   2.24%   |   4.05%        | 5     |   -0.94%       | -0.58   | -0.73  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 3.61   | 3.66        |   -1.40%   |   0.66%   |   0.79%        | 5     |   -1.33%       | -1.73   | -3.05  |
| TPCH(30) | TPCH-Q7  | parquet / none / none | 19.42  | 19.71       |   -1.52%   |   1.34%   |   1.39%        | 5     |   -1.22%       | -1.44   | -1.76  |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 5.08   | 5.15        |   -1.49%   |   1.34%   |   0.73%        | 5     |   -1.35%       | -1.44   | -2.20  |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 3.42   | 3.49        |   -1.92%   |   0.93%   |   1.47%        | 5     |   -1.53%       | -1.15   | -2.49  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.15   | 1.19        |   -3.17%   |   2.27%   |   1.95%        | 5     |   -4.21%       | -1.15   | -2.41  |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 9.26   | 9.63        |   -3.85%   |   0.62%   |   0.59%        | 5     |   -3.78%       | -2.31   | -10.25 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

Cluster Name: UNKNOWN
Lab Run Info: UNKNOWN
Impala Version:          impalad version 3.2.0-SNAPSHOT RELEASE ()
Baseline Impala Version: impalad version 3.2.0-SNAPSHOT RELEASE (2019-03-19)

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(2)  | parquet / none / none | 0.90    | -0.08%     | 0.80       | -0.05%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(2)  | TPCH-Q18 | parquet / none / none | 1.22   | 1.19        |   +1.93%   |   3.81%   |   4.46%        | 20    |   +3.34%       | 1.62    | 1.46  |
| TPCH(2)  | TPCH-Q10 | parquet / none / none | 0.74   | 0.73        |   +1.97%   |   3.36%   |   2.94%        | 20    |   +0.97%       | 1.88    | 1.95  |
| TPCH(2)  | TPCH-Q11 | parquet / none / none | 0.49   | 0.48        |   +1.91%   |   6.19%   |   4.64%        | 20    |   +0.25%       | 0.95    | 1.09  |
| TPCH(2)  | TPCH-Q4  | parquet / none / none | 0.43   | 0.43        |   +1.99%   |   6.26%   |   5.86%        | 20    |   +0.15%       | 0.92    | 1.03  |
| TPCH(2)  | TPCH-Q15 | parquet / none / none | 0.50   | 0.49        |   +1.82%   |   7.32%   |   6.35%        | 20    |   +0.26%       | 1.01    | 0.83  |
| TPCH(2)  | TPCH-Q1  | parquet / none / none | 0.98   | 0.97        |   +0.79%   |   4.64%   |   2.73%        | 20    |   +0.36%       | 0.77    | 0.65  |
| TPCH(2)  | TPCH-Q19 | parquet / none / none | 0.83   | 0.83        |   +0.65%   |   3.33%   |   2.80%        | 20    |   +0.44%       | 2.18    | 0.67  |
| TPCH(2)  | TPCH-Q14 | parquet / none / none | 0.62   | 0.62        |   +0.97%   |   2.86%   |   1.00%        | 20    |   +0.04%       | 0.13    | 1.42  |
| TPCH(2)  | TPCH-Q3  | parquet / none / none | 0.88   | 0.87        |   +0.57%   |   2.17%   |   1.74%        | 20    |   +0.29%       | 1.15    | 0.92  |
| TPCH(2)  | TPCH-Q12 | parquet / none / none | 0.53   | 0.53        |   +0.27%   |   4.58%   |   5.78%        | 20    |   +0.46%       | 1.47    | 0.16  |
| TPCH(2)  | TPCH-Q17 | parquet / none / none | 0.72   | 0.72        |   +0.15%   |   3.64%   |   5.55%        | 20    |   +0.21%       | 0.86    | 0.10  |
| TPCH(2)  | TPCH-Q21 | parquet / none / none | 2.05   | 2.05        |   +0.21%   |   1.99%   |   2.37%        | 20    |   +0.01%       | 0.25    | 0.30  |
| TPCH(2)  | TPCH-Q5  | parquet / none / none | 1.28   | 1.27        |   +0.24%   |   1.61%   |   1.80%        | 20    |   -0.02%       | -0.57   | 0.44  |
| TPCH(2)  | TPCH-Q13 | parquet / none / none | 1.27   | 1.27        |   -0.34%   |   1.69%   |   1.83%        | 20    |   -0.20%       | -1.65   | -0.61 |
| TPCH(2)  | TPCH-Q7  | parquet / none / none | 1.72   | 1.73        |   -0.55%   |   2.40%   |   1.69%        | 20    |   -0.03%       | -0.42   | -0.83 |
| TPCH(2)  | TPCH-Q8  | parquet / none / none | 1.27   | 1.28        |   -0.68%   |   3.10%   |   3.89%        | 20    |   -0.06%       | -0.54   | -0.62 |
| TPCH(2)  | TPCH-Q6  | parquet / none / none | 0.36   | 0.36        |   -0.84%   |   0.79%   |   3.51%        | 20    |   -0.07%       | -0.36   | -1.04 |
| TPCH(2)  | TPCH-Q2  | parquet / none / none | 0.65   | 0.65        |   -1.17%   |   4.76%   |   5.99%        | 20    |   -0.05%       | -0.25   | -0.69 |
| TPCH(2)  | TPCH-Q9  | parquet / none / none | 1.59   | 1.62        |   -2.01%   |   1.45%   |   5.12%        | 20    |   -0.16%       | -1.24   | -1.69 |
| TPCH(2)  | TPCH-Q20 | parquet / none / none | 0.68   | 0.69        |   -1.73%   |   4.35%   |   4.43%        | 20    |   -0.49%       | -1.74   | -1.25 |
| TPCH(2)  | TPCH-Q22 | parquet / none / none | 0.38   | 0.40        |   -2.89%   |   7.42%   |   6.39%        | 20    |   -0.21%       | -0.66   | -1.34 |
| TPCH(2)  | TPCH-Q16 | parquet / none / none | 0.59   | 0.62        |   -4.01%   |   6.33%   |   5.83%        | 20    |   -4.72%       | -1.39   | -2.13 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+

Change-Id: I839d7a3a2f5e1309c33a1f66013ef11628c5dc11
Reviewed-on: http://gerrit.cloudera.org:8080/12797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-15 22:34:28 +00:00
Todd Lipcon
e2ead7f857 expr-test: use gtest parameterization
Instead of running the tests three times with different flags from
main(), this uses gtest's parameterization feature to accomplish the
same.

The advantage here is that we end up with different test names for each
of the runs. Additionally, this moves the setup code into a proper setup
method so that executing expr-test --gtest_list_tests doesn't waste time
starting a cluster.

This is prep work towards adding multi-threaded test execution for
long-running tests. expr-test seems to currently be one of the worst
offenders.

Change-Id: Idc9fb24ad62b4aa2e120a99d74ae04bb221c034b
Reviewed-on: http://gerrit.cloudera.org:8080/13289
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-11 01:28:58 +00:00
Zoltan Borok-Nagy
d423979866 IMPALA-5843: Use page index in Parquet files to skip pages
This commit implements page filtering based on the Parquet page index.

The read and evaluation of the page index is done by the
HdfsParquetScanner. First, we determine the row ranges we are
interested in, and based on the row ranges we determine the candidate
pages for each column that we are reading.

We still issue one ScanRange per column chunk, but we specify
sub-ranges that store the candidate pages, i.e. we don't read
the whole column chunk, but only fractions of it.

Pages are not aligned across column chunks, i.e. page #2 of column A
might store completely different rows than page #2 of column B.
It means we need to implement some kind of row-skipping logic
when we read the data pages. This logic is implemented in
BaseScalarColumnReader and ScalarColumnReader. Collection column
readers know nothing about page filtering.
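
To make the page selection step concrete, here is a small sketch that
maps row ranges to candidate pages for one column, given the index of
the first row in each page. RowRangeSketch and CandidatePagesSketch are
hypothetical names and the linear scan is chosen for clarity; the
scanner's actual data structures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive range of row indexes that the query is interested in.
struct RowRangeSketch {
  int64_t first;
  int64_t last;
};

// 'page_first_rows[i]' is the index of the first row stored in page i,
// in ascending order. Page i ends just before page i + 1 starts; the
// last page ends at 'num_rows' - 1. Returns the indexes of all pages
// that overlap at least one range.
inline std::vector<int> CandidatePagesSketch(
    const std::vector<int64_t>& page_first_rows, int64_t num_rows,
    const std::vector<RowRangeSketch>& ranges) {
  std::vector<int> result;
  for (std::size_t i = 0; i < page_first_rows.size(); ++i) {
    const int64_t page_first = page_first_rows[i];
    const int64_t page_last = (i + 1 < page_first_rows.size())
                                  ? page_first_rows[i + 1] - 1
                                  : num_rows - 1;
    for (const RowRangeSketch& r : ranges) {
      if (r.first <= page_last && page_first <= r.last) {
        result.push_back(static_cast<int>(i));
        break;  // one overlapping range is enough to keep the page
      }
    }
  }
  return result;
}
```

For a 300-row row group where column A has pages starting at rows 0,
100 and 200 and column B has pages starting at rows 0 and 250, the row
range [120, 130] selects page 1 of A but page 0 of B, which is why the
column readers need row-skipping logic.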

Page filtering can be turned off by setting the query option
'read_parquet_page_index' to false.

Testing:
 * added some unit tests for the row range and
   page selection logic
 * generated various Parquet files with Parquet-MR
 * enabled Page index writing and wrote selective queries against
   tables written by Impala. Current tests are likely to use page
   filtering transparently.

Performance:
 * Measured locally, observed 3x to 20x speedup for selective queries.
   The speedup was proportional to the IO operations that need to be done.

 * The TPCH benchmark didn't show a significant performance change. This
   is not a surprise since the data is not sorted in any useful way, so
   the main goal was to avoid introducing a perf regression.

TODO:
   * measure performance for remote reads

Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Reviewed-on: http://gerrit.cloudera.org:8080/12065
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-10 11:46:38 +00:00
Todd Lipcon
1e49b6a6b4 IMPALA-2029. Implement our own getJNIEnv equivalent
The libhdfs getJNIEnv function was made non-exported in Hadoop 2. For a
while in CDH we were hacking around this with a vendor-specific patch
that re-exported it. However, it was always a bit annoying to maintain
our own patch each time we rebased to new versions, etc.

Earlier attempts to solve this issue turned up strange bugs around
coordinating whether we or libhdfs were responsible for attaching and
detaching to the JVM/JNI environment. So, this patch takes a new
approach: rather than directly creating/attaching to the JVM, we just
look for an existing attached environment. If there isn't one, we call
some simple libhdfs function which forces it to attach the current
thread, and then try again.

Performance is maintained (or maybe improved) by adding a thread-local
cache of the attached JVM, with an inlined fast-path.
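
The thread-local cache with an inlined fast path might look roughly like the sketch below. It deliberately avoids <jni.h> so that it stands alone: FakeJniEnv, LookUpOrAttachEnvSlow and GetEnvCached are hypothetical names, and the slow path here only stands in for "find an existing attached environment, or make libhdfs attach the thread and retry".

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the JNI environment type; real code would use JNIEnv.
struct FakeJniEnv {
  int unused;
};

// Counts slow-path invocations so the caching effect is observable.
static std::atomic<int> g_slow_path_calls{0};

// Hypothetical slow path. In the real patch this is where the existing
// attached environment is looked up (attaching via libhdfs if needed).
FakeJniEnv* LookUpOrAttachEnvSlow() {
  g_slow_path_calls.fetch_add(1);
  static thread_local FakeJniEnv env{0};
  return &env;
}

// Fast path: a single thread-local pointer check, cheap enough to inline.
inline FakeJniEnv* GetEnvCached() {
  static thread_local FakeJniEnv* cached = nullptr;
  if (cached != nullptr) return cached;
  cached = LookUpOrAttachEnvSlow();
  assert(cached != nullptr);
  return cached;
}
```

After the first call on a given thread, later calls return the cached pointer without touching the slow path.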

I tested this with a CDP build of Hadoop which doesn't have the
getJNIEnv workaround. Prior to this fix, I wasn't able to run Java tests
against that build because it would fail to link getJNIEnv() at runtime.
Now, they pass.

Change-Id: I766bcfd70addb00e9fd8a860e89c2a1c5d4c71d5
Reviewed-on: http://gerrit.cloudera.org:8080/13275
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-10 01:50:06 +00:00
Attila Jeges
b5805de3e6 IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.

This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.

- Explicit casting between DATE and other types (note that invalid
  casts will fail with an error just like invalid DECIMAL_V2 casts,
  while failed casts to other types do not lead to a warning or error):
    - from STRING to DATE. The string value must be formatted as
      yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
      the time component is optional. If the time component is
      present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as
      yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time of day
      component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time of day
      component is set to 00:00:00.

- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a
      context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a
      context where a TIMESTAMP value is expected.

- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
  implicit conversions are now all possible, the existing function
  overload resolution logic is not adequate anymore.
  For example, it resolves the
  if(false, '2011-01-01', DATE '1499-02-02') function call to the
  if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
  function, instead of the if(BOOLEAN, DATE, DATE) version.

  This is clearly wrong, so the function overload resolution logic had
  to be changed to resolve function calls to the best-fit overloaded
  function definition if there are multiple applicable candidates.

  An overloaded function definition is an applicable candidate for a
  function call if each actual parameter in the function call either
  matches the corresponding formal parameter's type (without casting)
  or is implicitly castable to that type.

  When looking for the best-fit applicable candidate, a parameter
  match score (i.e. the number of actual parameters in the function
  call that match their corresponding formal parameter's type without
  casting) is calculated and the applicable candidate with the highest
  parameter match score is chosen.

  There's one more issue that the new resolution logic has to address:
  if two applicable candidates have the same parameter match score and
  the only difference between the two is that the first one requires a
  STRING -> TIMESTAMP implicit cast for some of its parameters while
  the second one requires a STRING -> DATE implicit cast for the same
  parameters then the first candidate has to be chosen not to break
  backward compatibility.
  E.g.: the year('2019-02-15') function call must resolve to
  year(TIMESTAMP) instead of year(DATE). Note that year(DATE) is not
  implemented yet, so this is not an issue at the moment but it will
  be in the future.
  When the resolution algorithm considers overloaded function
  definitions, first it orders them lexicographically by the types in
  their parameter lists. To ensure backward-compatible behavior, the
  PrimitiveType.DATE enum value has to come after
  PrimitiveType.TIMESTAMP.

- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
  math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them
in one change-set.
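
The best-fit overload resolution described above can be sketched as follows. This is a simplified illustration, not the frontend implementation: the Type enum, ImplicitlyCastable and ResolveOverload are hypothetical, and only the implicit casts named in this message are modeled. Ties are broken by visiting candidates in lexicographic order of their parameter types, with DATE ordered after TIMESTAMP.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Declaration order matters: DATE comes after TIMESTAMP.
enum class Type { BOOLEAN, STRING, TIMESTAMP, DATE };

using Signature = std::vector<Type>;

// Hypothetical implicit-cast table covering only the casts named above.
bool ImplicitlyCastable(Type from, Type to) {
  if (from == Type::STRING) return to == Type::TIMESTAMP || to == Type::DATE;
  if (from == Type::DATE) return to == Type::TIMESTAMP;
  return false;
}

// Returns the index of the best-fit candidate in 'candidates', or -1 if no
// candidate is applicable.
int ResolveOverload(const std::vector<Signature>& candidates, const Signature& args) {
  std::vector<int> order(candidates.size());
  std::iota(order.begin(), order.end(), 0);
  // Visit candidates in lexicographic order of their parameter types.
  std::sort(order.begin(), order.end(),
      [&](int a, int b) { return candidates[a] < candidates[b]; });
  int best = -1;
  int best_score = -1;
  for (int idx : order) {
    const Signature& formal = candidates[idx];
    if (formal.size() != args.size()) continue;
    bool applicable = true;
    int score = 0;  // number of parameters that match without casting
    for (std::size_t i = 0; i < args.size(); ++i) {
      if (args[i] == formal[i]) {
        ++score;
      } else if (!ImplicitlyCastable(args[i], formal[i])) {
        applicable = false;
        break;
      }
    }
    // Strictly greater: on a tie the earlier (TIMESTAMP) candidate wins.
    if (applicable && score > best_score) {
      best = idx;
      best_score = score;
    }
  }
  return best;
}
```

With these rules, if(BOOLEAN, STRING, DATE) picks the (BOOLEAN, DATE, DATE) overload because it has two exact matches, while a lone STRING argument picks the TIMESTAMP overload on the tie.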

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
  corresponding HBASE table 'functional_hbase.date_tbl') was
  introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
    - since DATE type is supported for TEXT and HBASE fileformats
      only, most DATE tests were implemented separately in
      tests/query_test/test_date_queries.py.

Note that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-23 13:33:57 +00:00
Todd Lipcon
209a350aae Re-land IMPALA-5393. Use THREAD_LOCAL state for regexp
This re-lands commit 6e8c330f40 which was
reverted in d3428a58d8. The revert was due
to an assumption that this commit depended on the new version of re2
(which was correctly reverted due to a toolchain issue). In fact this
commit does not depend on any toolchain changes.

Original commit message follows
--------------------------------

This changes the built-in regexp-related UDFs to use THREAD_LOCAL
re2::RE instances instead of FRAGMENT_LOCAL.

Although re2::RE is thread-safe, it achieves that thread safety through
a certain amount of locking. Using thread-local regexps improves
performance substantially.

I ran a simple test query:

select sum(l_linenumber) from item_20x where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0

on a table with three underlying parquet files (thus getting 3 scanner
threads). Prior to this change, the query took ~60 seconds and burned
2m16sec CPU time. With this change, it took ~19sec and 43s CPU time. For
a query with more scanner threads, the improvement should be even more
dramatic.

The only potential downside of this change is slightly increased memory
consumption by having one RE instance per thread, but the REs themselves
should be small relative to all of the other per-scanner-thread memory.

Change-Id: I9ae0703efeb2429813b2a712f1accf1b0a4a409e
Reviewed-on: http://gerrit.cloudera.org:8080/12845
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-26 04:37:49 +00:00
Lars Volker
d3428a58d8 Revert "IMPALA-5393. Use THREAD_LOCAL state for regexp"
This depends on a change which switches to a toolchain version that does not have packages for Ubuntu 18.04. Reverting both now to unblock everyone.

This reverts commit 6e8c330f40.

Change-Id: Id35a90a58e3f775031a0f147b042ccd46d77e24b
Reviewed-on: http://gerrit.cloudera.org:8080/12791
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Lars Volker <lv@cloudera.com>
2019-03-19 14:52:32 +00:00
Todd Lipcon
6e8c330f40 IMPALA-5393. Use THREAD_LOCAL state for regexp
This changes the built-in regexp-related UDFs to use THREAD_LOCAL
re2::RE instances instead of FRAGMENT_LOCAL.

Although re2::RE is thread-safe, it achieves that thread safety through
a certain amount of locking. Using thread-local regexps improves
performance substantially.

I ran a simple test query:

select sum(l_linenumber) from item_20x where length(regexp_extract(l_shipinstruct, '.*', 0)) > 0

on a table with three underlying parquet files (thus getting 3 scanner
threads). Prior to this change, the query took ~60 seconds and burned
2m16sec CPU time. With this change, it took ~19sec and 43s CPU time. For
a query with more scanner threads, the improvement should be even more
dramatic.

The only potential downside of this change is slightly increased memory
consumption by having one RE instance per thread, but the REs themselves
should be small relative to all of the other per-scanner-thread memory.
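
The idea of one compiled pattern per thread can be illustrated with the sketch below. It uses std::regex purely as a stand-in (it is not RE2 and does not show Impala's THREAD_LOCAL function-state mechanism); ThreadLocalPattern and LooksLikeJiraId are hypothetical names.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Each thread lazily builds its own compiled pattern on first use, so
// matching never touches state shared with other threads.
const std::regex& ThreadLocalPattern() {
  static thread_local const std::regex re("[A-Z]+-[0-9]+");
  return re;
}

// Matches strings such as "IMPALA-5393" against the per-thread pattern.
bool LooksLikeJiraId(const std::string& s) {
  return std::regex_match(s, ThreadLocalPattern());
}
```

The trade-off is the one noted above: every thread pays for its own compiled instance in exchange for avoiding contention.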

Change-Id: Ibc331151a302e755701cb08adb3e6f289d54c3a6
Reviewed-on: http://gerrit.cloudera.org:8080/12772
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Todd Lipcon <todd@apache.org>
2019-03-18 23:55:57 +00:00
Philip Zeyliger
214f61a180 IMPALA-8250: Clean up JNI warnings.
Using LIBHDFS_OPTS+="-Xcheck:jni" revealed a handful of warnings related to
(a) checking for exceptions and (b) leaking local references.

Checking for exceptions required sprinkling RETURN_ERROR_IF_EXC
left and right. I chose not to expand the JniCall infrastructure
to handle this more generally at the moment.

The leaky local references are a bit harder. In the logs, they show up
as "WARNING: JNI local refs: 2597, exceeds capacity: 35" or similar. A
few of these errors seem to be not in our code.  The ones that I've
found in our code stemmed from HBaseTableScanner::GetRowKey(): this
method uses local references and wasn't returning them. Using a
JniLocalFrame seems to have taken care of the warnings.

I have added code to skip test_large_strings when JNI checking is
enabled. This test takes forever (presumably because JNI is checking
bounds on strings very aggressively), and times out. The timeout also
causes some metric-related checks to fail (since a query is still in
flight).

Debugging this required customizing my JDK to give stack traces
when these warnings occurred. The following diff facilitated
this.

  diff -r 76a9c9cf14f1 src/share/vm/prims/jniCheck.cpp
  --- a/src/share/vm/prims/jniCheck.cpp	Tue Jan 15 10:43:31 2019 +0000
  +++ b/src/share/vm/prims/jniCheck.cpp	Wed Feb 27 11:57:13 2019 -0800
  @@ -143,11 +143,30 @@
   static const char * fatal_instance_field_mismatch = "Field type (instance) mismatch in JNI get/set field operations";
   static const char * fatal_non_string = "JNI string operation received a non-string";

  +// thisone: whether to print every time, or maybe, depending on future
  +// how many future stacks we want printed (totally racy); helps catch
  +// missing exception handling if there's a way to tickle that code
  +// reliably.
  +static inline void dump_native_stack(JavaThread* thr, bool thisone, int future) {
  +  static int fut_stacks = 0; // racy!
  +  if (fut_stacks > 0) {
  +    thisone = true;
  +    fut_stacks--;
  +  }
  +  if (future > 0) fut_stacks = future;
  +  if (thisone) {
  +    frame fr = os::current_frame();
  +    char buf[6000];
  +    tty->print_cr("Thread: %s %d", thr->get_thread_name(), thr->osthread()->thread_id());
  +    print_native_stack(tty, fr, thr, buf, sizeof(buf));
  +  }
  +}

   // When in VM state:
   static void ReportJNIWarning(JavaThread* thr, const char *msg) {
     tty->print_cr("WARNING in native method: %s", msg);
     thr->print_stack();
  +  dump_native_stack(thr, true, 0);
   }

   // When in NATIVE state:
  @@ -199,11 +218,14 @@
         tty->print_cr("WARNING in native method: JNI call made without checking exceptions when required to from %s",
           thr->get_pending_jni_exception_check());
         thr->print_stack();
  +      dump_native_stack(thr, true, 10);
       )
       thr->clear_pending_jni_exception_check(); // Just complain once
     }
   }

  +
  +
   /**
    * Add to the planned number of handles. I.e. plus current live & warning threshold
    */
  @@ -254,9 +276,12 @@
         tty->print_cr("WARNING: JNI local refs: %zu, exceeds capacity: %zu",
             live_handles, planned_capacity);
         thr->print_stack();
  +      dump_native_stack(thr, true, 0);
       )
       // Complain just the once, reset to current + warn threshold
       add_planned_handle_capacity(handles, 0);
  +  } else {
  +    dump_native_stack(thr, false, 0);
     }
   }

Change-Id: Idd1709f749a764c1d947704bc64306493863b45f
Reviewed-on: http://gerrit.cloudera.org:8080/12660
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-03-08 03:35:09 +00:00
Philip Zeyliger
0b7c964545 Adding hostname to Disk I/O errors.
I recently ran into some queries that failed like so:

  WARNINGS: Disk I/O error: Could not open file: /data/...: Error(5): Input/output error

These warnings were in the profile, but I had to cross-reference impalad
logs to figure out which machine had the broken disk.

In this commit, I've sprinkled GetBackendString() to include it.

Change-Id: Ib977d2c0983ef81ab1338de090239ed57f3efde2
Reviewed-on: http://gerrit.cloudera.org:8080/12402
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-20 23:14:36 +00:00
Andrew Sherman
7707eb0417 IMPALA-7657: Codegen IsNotEmptyPredicate and ValidTupleIdExpr.
These two classes evaluate scalar expressions. Previously codegen was
done by calling ScalarExpr::GetCodegendComputeFnWrapper which generates
a static method that calls the scalar expression evaluation method. Make
this more efficient by generating code which is customized using
information available at codegen time.

Add new cross-compiled files null-literal-ir.cc and slot-ref-ir.cc.

IsNotEmptyPredicate works by getting a CollectionVal object from the
single child Expr node, and counting its tuples. At codegen time we know
the type and value of the child node. Generate a call to a node-specific
non-virtual cross-compiled method to get the CollectionVal object from
the child. Then generate code that examines the CollectionVal and
returns an IntVal.

A ValidTupleIdExpr node contains a vector of tuple ids. It works by
probing each row for the tuple ids in the vector to find a non-null
tuple. At codegen time we know the vector of tuple ids. We unroll the
loop through the tuple ids, generating code that evaluates if the tuple
is non-null, and returns the tuple id if/when a non-null tuple is found.
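
As a rough illustration of the unrolling, the sketch below contrasts a run-time loop with the straight-line code that becomes possible once the tuple ids are known. It is plain C++ rather than generated IR, and several things are assumptions: the Row alias, using the tuple id directly as the index into the row, and returning -1 when every tuple is NULL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a row: one pointer per tuple, nullptr meaning a
// NULL tuple. The tuple id doubles as the index into the row.
using Row = std::vector<const void*>;

// Interpreted form: loop over the tuple ids at run time.
int FirstValidTupleId(const Row& row, const std::vector<int>& tuple_ids) {
  for (int id : tuple_ids) {
    assert(id >= 0 && static_cast<std::size_t>(id) < row.size());
    if (row[id] != nullptr) return id;
  }
  return -1;  // assumed sentinel when no tuple is non-null
}

// Unrolled form for the ids {0, 2}, which are known at codegen time:
// no loop, no vector access, just one null check per id.
int FirstValidTupleIdUnrolled_0_2(const Row& row) {
  if (row[0] != nullptr) return 0;
  if (row[2] != nullptr) return 2;
  return -1;
}
```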

IMPALA-7657 also requests replacing GetCodegendComputeFnWrapper() in
TupleIsNullPredicate. In the current Impala code this method is never
called. This is because TupleIsNullPredicate is always wrapped in an
IfExpr. This is always codegen'd by IfExpr's
GetCodegendComputeFnWrapper() method. There is a separate Jira
IMPALA-7655 to improve codegen of IfExpr.

Minor corrections:
  Correct the link to llvm tutorial in LlvmCodegen.

PERFORMANCE:
  I tested performance on a local mini-cluster. I wrote some
  pathological queries to test the new code. The new codegen'd code is
  very similar in performance. Both ValidTupleIdExpr and
  IsNotEmptyPredicate seem very slightly faster than the old code.
  Overall these changes are not purely for performance but to move away
  from GetCodegendComputeFnWrapper.

TESTING:
  The changed scalar expressions are well exercised by current tests.
  Ran exhaustive end-to-end tests.

Change-Id: Ifb87b9e3b879c278ce8638d97bcb320a7555a6b3
Reviewed-on: http://gerrit.cloudera.org:8080/12068
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-07 22:14:15 +00:00
poojanilangekar
ae96a9fb19 IMPALA-8151: Use sizeof() in HiveUdfCall to specify non-primitive type's size
Previously, data type sizes were hardcoded in
HiveUdfCall::Evaluate(). Since IMPALA-7367 removed the padding
from STRING and VARCHAR types, it could read past the end of the
actual value and cause a crash. This change replaces the hardcoded
values with sizeof() calls to determine the size of non-primitive
types (STRING, VARCHAR and TIMESTAMP) to avoid similar issues in
the future.

Testing:
Ran test_udfs.py on an ASAN build.
Added logs to manually verify the size of bytes copied.

Change-Id: I919c330546fa86b474ab66245b20ceb1f5525b41
Reviewed-on: http://gerrit.cloudera.org:8080/12355
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-02-06 04:19:20 +00:00
Attila Jeges
3338bae608 IMPALA-8043: Fix BE test failures related to SystemV timezones.
This is a fix for the following issue:

1. Some BE tests (e.g. ExprTest.TimestampFunctions) use the system's
   local timezone but run against a test timezone db (instead of the
   system's timezone db).
2. On some Linux installations /usr/share/zoneinfo contains symlinks
   to files in the /usr/share/zoneinfo/SystemV directory
   (e.g /usr/share/zoneinfo/America/Los_Angeles is a symlink to
   ../SystemV/PST8PDT).
3. The 'SystemV' directory is not part of the test timezone db, since
   it is obsolete and excluded by default.

Consequently, if the system's local timezone is set to
America/Los_Angeles, BE tests won't find the corresponding timezone
file in the test timezone db. BE tests will default to UTC, which will
break some of them.

This change sets local timezone explicitly for failing BE tests, so
they don't depend on the system's local timezone.
It also adds 'SystemV' directory to the test timezone db to avoid
similar issues in the future.

Change-Id: I9288cd24c8af0c059e55d47c86bd92eaf0075681
Reviewed-on: http://gerrit.cloudera.org:8080/12199
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-15 17:04:55 +00:00
Tim Armstrong
928c5c261b Fix some warnings on GCC7
I tried compiling with GCC7 to see what warnings popped up.

Fix some ambiguous else warnings resulting from gtest macros. See
https://github.com/google/googletest/issues/1119.

Add a missing include that broke compilation on the release build.

Fix some warnings that detect missing returns when there is a DCHECK
(these warnings already occurred in release builds, but they now
happen in gcc7 debug builds).

Change-Id: I39a12bc5ed6957c147b7f0dba85c7687cc989439
Reviewed-on: http://gerrit.cloudera.org:8080/12132
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-01-03 22:07:56 +00:00
Paul Rogers
27577dd652 IMPALA-7902: NumericLiteral fixes, refactoring
The work to clean up the rewriter logic must start with a stable AST,
which must start with sprucing up some issues with the leaf nodes. This
CR tackles the NumericLiteral used to hold numbers.

IMPALA-7896: Literals should not need explicit analyze step

Partial fix: removes the need to analyze a numeric literal: analyze() is
a no-op. This eliminates the need to do a "fake" analysis with a null
analyzer: numeric literals are now created analyzed. This is useful
because the catalog module creates numeric literals outside of a query
(and outside of an analyzer.)

A literal is immutable except for type. Modified the constructor to set
the type and cost, then mark the node as analyzed. A later call to
analyze() has nothing to do.

Code that created and dummy-analyzed numeric literals changed to use
static create() methods resulting in simpler literal creation, and
eliminates the special "analyzer == null" checks in analyze().

IMPALA-7886: NumericLiteral constructor fails to round values to
             Decimal type
IMPALA-7887: NumericLiteral fails to detect numeric overflow
IMPALA-7888: Incorrect NumericLiteral overflow checks for FLOAT,
             DOUBLE
IMPALA-7891: Analyzer does not detect numeric overflow in CAST
IMPALA-7894: Parser does not catch double overflow

These are all caused by the somewhat cluttered state of the numeric
range check code after years of incremental changes. This patch
centralizes all checks into a series of constants and methods for
uniformity.  All values are set in the constructor which now checks
that the value is legal for the type. Cast operations verify that the
cast is valid. Multiple semi-parallel versions of the same logic are
replaced by calls to a single implementation.

The numeric checks now follow the SQL standard which says that
implementations should fail if a cast would truncate the most significant
digits, but round when truncating the least significant.

IMPALA-7865: Repeated type widening of arithmetic expressions

Partial fix. Replaces the "is explicit cast" flag in the numeric literal
with the explicit type. This allows resetting an implicit type back to
the explicit type if an arithmetic expression is analyzed multiple
times. A later patch will feed this type information into the type
inference mechanism to complete the fix.

Finally, adds a set of new exceptions that begin to unify error
reporting.  These handle casts (SqlCastException), value validation
(InvalidValueException) and unsupported features
(UnsupportedFeatureException.) These all derive from AnalysisException
for backward compatibility. Tests use the new exceptions to check for
expected errors rather than parsing text strings (which tend to
change.)

Testing:

* Added unit tests just for numeric literals. Refactored code to
  simplify the tests.
* Added a test case for the obscure case in Decimal V1 of an implicit
  cast overflow.
* The depth-check tests needed one extra level of nesting to trigger
  failure.
* Ran all FE tests.

Change-Id: I484600747b2871d3a6fe9153751973af9a8534f2
Reviewed-on: http://gerrit.cloudera.org:8080/12001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-12-28 23:25:06 +00:00
Greg Rahn
ba9b78c103 IMPALA-7759: Add Levenshtein edit distance built-in function
This patch adds new built-in functions to calculate Levenshtein edit
distance. It is implemented as levenshtein() to match PostgreSQL in
both functionality and name, with an le_dst() alias added for Netezza
compatibility. Note, however, that levenshtein() differs in functionality in
that if either value is NULL or both values are NULL, levenshtein()
returns NULL, where Netezza's le_dst() returns the length of the not
NULL value or 0 if both values are NULL.
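
The two NULL behaviors can be compared with a small sketch. This is not the code added to the backend: EditDistance is a textbook two-row dynamic-programming implementation, and NullableInt, LevenshteinSql and NetezzaStyleLeDst are hypothetical names, with a null pointer standing in for SQL NULL.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Classic Levenshtein distance using two rolling rows.
int EditDistance(const std::string& a, const std::string& b) {
  std::vector<int> prev(b.size() + 1), cur(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    cur[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      // Deletion, insertion, substitution.
      cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost});
    }
    std::swap(prev, cur);
  }
  return prev[b.size()];
}

struct NullableInt {
  bool is_null;
  int value;
};

// levenshtein() semantics described above: NULL if either input is NULL.
NullableInt LevenshteinSql(const std::string* a, const std::string* b) {
  if (a == nullptr || b == nullptr) return {true, 0};
  return {false, EditDistance(*a, *b)};
}

// Netezza's le_dst() semantics as described above: length of the non-NULL
// value, or 0 if both values are NULL.
int NetezzaStyleLeDst(const std::string* a, const std::string* b) {
  if (a == nullptr && b == nullptr) return 0;
  if (a == nullptr) return static_cast<int>(b->size());
  if (b == nullptr) return static_cast<int>(a->size());
  return EditDistance(*a, *b);
}
```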

Testing:
- Added unit tests to expr-test.cc
- Manual test on 966289 string pairs and results match PostgreSQL
- Added changes to qgen tests for PostgreSQL comparison

Change-Id: I549d33ab7cebfa10db2934461c8ec91e2cc1cdcb
Reviewed-on: http://gerrit.cloudera.org:8080/11793
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-12-02 10:39:44 +00:00
poojanilangekar
2a4835cfba IMPALA-7367: Pack StringValue and CollectionValue slots
This change packs StringValue and CollectionValue slots to ensure
they now occupy 12 bytes instead of 16 bytes. This reduces the
memory requirements and improves the performance. Since Kudu
tuples are populated using a memcopy, 4 bytes of padding was
added to StringSlots in Kudu tables.
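
The size difference comes from trailing padding, which the sketch below makes visible. The struct names are hypothetical and the packed attribute is the GCC/Clang spelling; the actual StringValue definition may differ in detail.

```cpp
#include <cassert>
#include <cstdint>

// Natural layout: the compiler pads the struct up to pointer alignment,
// giving 16 bytes on a typical 64-bit target.
struct UnpackedString {
  char* ptr;
  int32_t len;
};

// Packed layout (GCC/Clang attribute): pointer plus length with no padding,
// giving 12 bytes on a typical 64-bit target.
struct __attribute__((packed)) PackedString {
  char* ptr;
  int32_t len;
};

// Compile-time checks in the spirit of the static asserts mentioned below.
static_assert(sizeof(PackedString) == sizeof(char*) + sizeof(int32_t),
    "packed string slot should have no padding");
static_assert(sizeof(UnpackedString) >= sizeof(PackedString),
    "packing must not grow the slot");
```

Code that copies such slots with a hardcoded 16-byte size would read past the end of a packed value, which is the kind of bug a sizeof()-based copy avoids.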

Testing:
Ran core tests.
Added static asserts to ensure the value sizes are as expected.
Performance tests on TPCH-40 produced a 3.96% improvement.

Change-Id: I32f3b06622c087e4aa288e8db1bf4581b10d386a
Reviewed-on: http://gerrit.cloudera.org:8080/11599
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-11-19 17:27:13 +00:00
Jim Apple
067657aa7d IMPALA-5031: prevent signed overflow in decimal
This removes two signed integer overflows when using the 'conv'
builtin. Signed integer overflow is undefined behavior according to
the C++ standard. The interesting parts of the backtraces are:

    exprs/math-functions-ir.cc:405:13: runtime error: signed integer overflow: 4738381338321616896 * 36 cannot be represented in type 'long'
    exprs/math-functions-ir.cc:404:24: runtime error: signed integer overflow: 2 * 4738381338321616896 cannot be represented in type 'long'

    #0 MathFunctions::DecimalInBaseToDecimal(long, signed char, long*) exprs/math-functions-ir.cc:404:24
    #1 MathFunctions::ConvInt(impala_udf::FunctionContext*, impala_udf::BigIntVal const&, impala_udf::TinyIntVal const&, impala_udf::TinyIntVal const&) exprs/math-functions-ir.cc:327:10
    #2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:485:580
    #3 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
    #8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
    #9 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
    #10 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45

These were triggered in the backend test
ExprTest.MathConversionFunctions.

Change-Id: I0d97dfcf42072750c16e41175765cd9a468a3c39
Reviewed-on: http://gerrit.cloudera.org:8080/11876
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-14 22:32:12 +00:00
Csaba Ringhofer
60095a4c6b IMPALA-5050: Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS from Parquet
Changes:
- parquet.thrift is updated to a newer version which contains the
  timestamp logical type.
- INT64 columns with converted types TIMESTAMP_MILLIS and
  TIMESTAMP_MICROS can be read as TIMESTAMP.
- If the logical type is timestamp, then the type will contain the
  information whether the UTC->local conversion is necessary. This
  feature is only supported for the new timestamp types, so INT96
  timestamps must still use flag
  convert_legacy_hive_parquet_utc_timestamps.
- Min/max stat filtering is enabled again for columns that need
  UTC->local conversion. This was disabled in IMPALA-7559 because
  it could incorrectly drop column chunks.
- CREATE TABLE LIKE PARQUET converts these columns to
  TIMESTAMP - before the change, an error was returned instead.
- Bulk of the Parquet column stat logic was moved to a new class
  called "ColumnStatsReader".

Testing:
- Added unit tests for timezone conversion (this needed a new public
  function in timezone_db.h and adding CET to tzdb_tiny).
- Added parquet files (created with parquet-mr) with int64 timestamp
  columns.

Change-Id: I4c7c01fffa31b3d2ca3480adf6ff851137dadac3
Reviewed-on: http://gerrit.cloudera.org:8080/11057
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-14 20:16:14 +00:00
Tim Armstrong
250d85e94e IMPALA-7822: handle overflows in repeat() builtin
We need to carefully check that the intermediate value fits in an
int64_t and the final size fits in an int. If they don't we
raise an error and fail the query.
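
A minimal sketch of the two checks, assuming a hypothetical helper RepeatedLength that reports overflow through its return value instead of raising a query error:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Computes the length of a string of 'str_len' bytes repeated 'n' times.
// Returns false if the intermediate product does not fit in int64_t or the
// final length does not fit in int; '*out' is only written on success.
bool RepeatedLength(int str_len, int64_t n, int* out) {
  assert(str_len >= 0 && out != nullptr);
  if (n <= 0 || str_len == 0) {
    *out = 0;
    return true;
  }
  // Check the int64_t product before multiplying, since signed overflow is
  // undefined behavior.
  if (n > std::numeric_limits<int64_t>::max() / str_len) return false;
  const int64_t total = static_cast<int64_t>(str_len) * n;
  if (total > std::numeric_limits<int>::max()) return false;
  *out = static_cast<int>(total);
  return true;
}
```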

Testing:
Added a couple of backend tests to exercise the
overflow check code paths.

Change-Id: I872ce77bc2cb29116881c27ca2a5216f722cdb2a
Reviewed-on: http://gerrit.cloudera.org:8080/11889
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-07 21:49:15 +00:00
Jim Apple
78b6f1db69 IMPALA-5031: Make UBSAN-friendly arithmetic generic
ArithmeticUtil::AsUnsigned() makes it possible to do arithmetic on
signed integers in a way that does not invoke undefined behavior, but
it only works on integers. This patch adds ArithmeticUtil::Compute(),
which dispatches (at compile time) to the normal arithmetic evaluation
method if the type of the values is a floating point type, but uses
AsUnsigned() if the type of the values is an integral type.
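
The compile-time dispatch can be sketched with two overloads selected by std::enable_if. This is an illustration of the technique, not the ArithmeticUtil source: the free function Compute below and its template parameters are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <limits>
#include <type_traits>

// Integral case: do the arithmetic in the corresponding unsigned type, where
// wraparound is well defined, then convert back. (Types narrower than int
// are still subject to integer promotion inside Op.)
template <template <typename> class Op, typename T>
typename std::enable_if<std::is_integral<T>::value, T>::type Compute(T x, T y) {
  using U = typename std::make_unsigned<T>::type;
  return static_cast<T>(Op<U>()(static_cast<U>(x), static_cast<U>(y)));
}

// Floating-point case: ordinary evaluation, no conversion needed.
template <template <typename> class Op, typename T>
typename std::enable_if<std::is_floating_point<T>::value, T>::type Compute(T x, T y) {
  return Op<T>()(x, y);
}
```

A call such as Compute<std::plus>(a, b) then works for both int64_t and double operands. Converting the out-of-range unsigned result back to the signed type is implementation-defined before C++20 rather than undefined, and wraps on the usual two's-complement targets.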

Change-Id: I73bec71e59c5a921003d0ebca52a1d4e49bbef66
Reviewed-on: http://gerrit.cloudera.org:8080/11810
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-05 23:30:52 +00:00
Jim Apple
8fc702be0c IMPALA-5031: fix signed overflows in decimal
The standard says that overflow for signed arithmetic operations is
undefined behavior; see [expr]:

    If during the evaluation of an expression, the result is not
    mathematically defined or not in the range of representable values
    for its type, the behavior is undefined.

and [basic.fundamental]:

    Unsigned integers shall obey the laws of arithmetic modulo 2^n
    where n is the number of bits in the value representation of that
    particular size of integer. This implies that unsigned arithmetic
    does not overflow because a result that cannot be represented by
    the resulting unsigned integer type is reduced modulo the number
    that is one greater than the largest value that can be represented
    by the resulting unsigned integer type.

All of the overflows fixed in this patch were tested with expr-test's
DecimalArithmeticTest.

Change-Id: Ibf882428931e4f4264be2fc8cd9d6b1fc89b8ace
Reviewed-on: http://gerrit.cloudera.org:8080/11604
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-13 06:23:17 +00:00
Csaba Ringhofer
d301600a85 Revert "IMPALA-7595: Revert "IMPALA-7521: Speed up sub-second unix time->TimestampValue conversions""
IMPALA-7595 added proper handling for invalid time-of-day values
in Parquet, so the DCHECK mentioned in IMPALA-7595 will no longer
be hit. This means that IMPALA-7521 can be committed again without
causing problems.

This reverts commit f8b472ee64.

Change-Id: Ibab04bc6ad09db331220312ed21d90622fdfc41b
Reviewed-on: http://gerrit.cloudera.org:8080/11573
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-04 03:40:44 +00:00
Jim Apple
20bde289eb IMPALA-5031: null ptr errors in C calls in BE tests
This patch fixes all remaining UBSAN "null pointer passed as argument"
errors in the backend tests. These are undefined behavior according to
"7.1.4 Use of library functions" in the C99 standard (which is
included in C++14 in section [intro.refs]):

    If an argument to a function has an invalid value (such as a value
    outside the domain of the function, or a pointer outside the
    address space of the program, or a null pointer, or a pointer to
    non-modifiable storage when the corresponding parameter is not
    const-qualified) or a type (after promotion) not expected by a
    function with variable number of arguments, the behavior is
    undefined.
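
One common way to avoid this class of error is to skip the library call when there is nothing to copy. The wrapper below is a hypothetical sketch (SafeMemcpy is not a name taken from this patch):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// memcpy() is declared with non-null arguments, so passing a null pointer is
// undefined even when the length is zero. Returning early for n == 0 keeps
// empty (possibly null) buffers away from the library call.
inline void* SafeMemcpy(void* dst, const void* src, std::size_t n) {
  if (n == 0) return dst;
  assert(dst != nullptr && src != nullptr);
  return std::memcpy(dst, src, n);
}
```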

The interesting parts of the backtraces for the errors fixed in this
patch are below:

exprs/string-functions-ir.cc:311:17: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 StringFunctions::Replace(impala_udf::FunctionContext*, impala_udf::StringVal const&, impala_udf::StringVal const&, impala_udf::StringVal const&) exprs/string-functions-ir.cc:311:5
    #1 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:485:580
    #2 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
    #3 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
    #4 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
    #5 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
    #6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
    #7 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
    #8 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
    #9 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
    #10 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
    #11 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
    #12 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
    #13 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
    #20 thread_proxy (exprs/expr-test+0x55ca939)

exprs/string-functions-ir.cc:868:15: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 StringFunctions::ConcatWs(impala_udf::FunctionContext*, impala_udf::StringVal const&, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:868:3
    #1 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:510:270
    #2 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
    #3 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
    #4 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
    #5 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
    #6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
    #7 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
    #8 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
    #9 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
    #10 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
    #11 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
    #12 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
    #13 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
    #20 thread_proxy (exprs/expr-test+0x55ca939)

exprs/string-functions-ir.cc:871:17: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 StringFunctions::ConcatWs(impala_udf::FunctionContext*, impala_udf::StringVal const&, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:871:5
    #1 StringFunctions::Concat(impala_udf::FunctionContext*, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:843:10
    #2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:510:95
    #3 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
    #4 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
    #5 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
    #6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
    #7 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
    #8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
    #9 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
    #10 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
    #11 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
    #12 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
    #13 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
    #14 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
    #21 thread_proxy (exprs/expr-test+0x55ca939)

exprs/string-functions-ir.cc:873:17: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 StringFunctions::ConcatWs(impala_udf::FunctionContext*, impala_udf::StringVal const&, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:873:5
    #1 StringFunctions::Concat(impala_udf::FunctionContext*, int, impala_udf::StringVal const*) exprs/string-functions-ir.cc:843:10
    #2 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:510:95
    #3 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
    #4 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
    #5 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
    #6 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
    #7 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
    #8 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
    #9 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
    #10 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
    #11 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
    #12 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
    #13 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
    #14 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
    #21 thread_proxy (exprs/expr-test+0x55ca939)

runtime/raw-value.cc:159:27: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 RawValue::Write(void const*, void*, ColumnType const&, MemPool*) runtime/raw-value.cc:159:9
    #1 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:225:7
    #2 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
    #3 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
    #4 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
    #5 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
    #6 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
    #7 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
    #8 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
    #9 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
    #16 thread_proxy (exprs/expr-test+0x55ca939)

udf/udf.cc:521:24: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 impala_udf::StringVal::CopyFrom(impala_udf::FunctionContext*, unsigned char const*, unsigned long) udf/udf.cc:521:5
    #1 AnyValUtil::FromBuffer(impala_udf::FunctionContext*, char const*, int) exprs/anyval-util.h:241:12
    #2 StringFunctions::RegexpExtract(impala_udf::FunctionContext*, impala_udf::StringVal const&, impala_udf::StringVal const&, impala_udf::BigIntVal const&) exprs/string-functions-ir.cc:726:10
    #3 impala_udf::StringVal ScalarFnCall::InterpretEval<impala_udf::StringVal>(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:485:580
    #4 ScalarFnCall::GetStringVal(ScalarExprEvaluator*, TupleRow const*) const exprs/scalar-fn-call.cc:599:44
    #5 ScalarExprEvaluator::GetValue(ScalarExpr const&, TupleRow const*) exprs/scalar-expr-evaluator.cc:299:38
    #6 ScalarExprEvaluator::GetValue(TupleRow const*) exprs/scalar-expr-evaluator.cc:250:10
    #7 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, ScalarExprEvaluator* const*, MemPool*, StringValue**, int*, int*) runtime/tuple.cc:222:27
    #8 void Tuple::MaterializeExprs<false, false>(TupleRow*, TupleDescriptor const&, vector<ScalarExprEvaluator*> const&, MemPool*, vector<StringValue*>*, int*) runtime/tuple.h:174:5
    #9 UnionNode::MaterializeExprs(vector<ScalarExprEvaluator*> const&, TupleRow*, unsigned char*, RowBatch*) exec/union-node-ir.cc:29:14
    #10 UnionNode::GetNextConst(RuntimeState*, RowBatch*) exec/union-node.cc:263:5
    #11 UnionNode::GetNext(RuntimeState*, RowBatch*, bool*) exec/union-node.cc:296:45
    #12 FragmentInstanceState::ExecInternal() runtime/fragment-instance-state.cc:310:59
    #13 FragmentInstanceState::Exec() runtime/fragment-instance-state.cc:95:14
    #14 QueryState::ExecFInstance(FragmentInstanceState*) runtime/query-state.cc:488:24
    #15 QueryState::StartFInstances()::$_0::operator()() const runtime/query-state.cc:416:35
    #22 thread_proxy (exprs/expr-test+0x55ca939)

util/coding-util-test.cc:45:10: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 TestUrl(string const&, string const&, bool) util/coding-util-test.cc:45:3
    #1 UrlCodingTest_BlankString_Test::TestBody() util/coding-util-test.cc:88:3
    #2 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/coding-util-test+0x6630f42)
    #8 main util/coding-util-test.cc:123:192

util/decompress-test.cc:126:261: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
    #0 DecompressorTest::CompressAndDecompress(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:126:254
    #1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:84:9
    #2 DecompressorTest_Default_Test::TestBody() util/decompress-test.cc:373:3
    #3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
    #9 main util/decompress-test.cc:479:47

util/decompress-test.cc:148:261: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
    #0 DecompressorTest::CompressAndDecompress(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:148:254
    #1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:84:9
    #2 DecompressorTest_Default_Test::TestBody() util/decompress-test.cc:373:3
    #3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
    #9 main util/decompress-test.cc:479:47

util/decompress-test.cc:269:261: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
    #0 DecompressorTest::CompressAndDecompressNoOutputAllocated(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:269:254
    #1 DecompressorTest::RunTest(THdfsCompression::type) util/decompress-test.cc:71:7
    #2 DecompressorTest_LZ4_Test::TestBody() util/decompress-test.cc:381:3
    #3 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
    #9 main util/decompress-test.cc:479:47

util/decompress-test.cc:221:329: runtime error: null pointer passed as argument 1, which is declared to never be null
/usr/include/string.h:66:58: note: nonnull attribute specified here
    #0 DecompressorTest::StreamingDecompress(Codec*, long, unsigned char*, long, unsigned char*, bool, long*) util/decompress-test.cc:221:322
    #1 DecompressorTest::CompressAndStreamingDecompress(Codec*, Codec*, long, unsigned char*) util/decompress-test.cc:245:35
    #2 DecompressorTest::RunTestStreaming(THdfsCompression::type) util/decompress-test.cc:104:5
    #3 DecompressorTest_Gzip_Test::TestBody() util/decompress-test.cc:386:3
    #4 void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (util/decompress-test+0x6642bb2)
    #10 main util/decompress-test.cc:479:47

util/streaming-sampler.h:55:22: runtime error: null pointer passed as argument 2, which is declared to never be null
/usr/include/string.h:43:45: note: nonnull attribute specified here
    #0 StreamingSampler<long, 64>::StreamingSampler(int, vector<long> const&) util/streaming-sampler.h:55:5
    #1 RuntimeProfile::TimeSeriesCounter::TimeSeriesCounter(string const&, TUnit::type, int, vector<long> const&) util/runtime-profile-counters.h:401:53
    #2 RuntimeProfile::Update(vector<TRuntimeProfileNode> const&, int*) util/runtime-profile.cc:310:28
    #3 RuntimeProfile::Update(TRuntimeProfileTree const&) util/runtime-profile.cc:245:3
    #4 Coordinator::BackendState::InstanceStats::Update(TFragmentInstanceExecStatus const&, Coordinator::ExecSummary*, ProgressUpdater*) runtime/coordinator-backend-state.cc:473:13
    #5 Coordinator::BackendState::ApplyExecStatusReport(TReportExecStatusParams const&, Coordinator::ExecSummary*, ProgressUpdater*) runtime/coordinator-backend-state.cc:286:21
    #6 Coordinator::UpdateBackendExecStatus(TReportExecStatusParams const&) runtime/coordinator.cc:678:22
    #7 ClientRequestState::UpdateBackendExecStatus(TReportExecStatusParams const&) service/client-request-state.cc:1253:18
    #8 ImpalaServer::ReportExecStatus(TReportExecStatusResult&, TReportExecStatusParams const&) service/impala-server.cc:1343:18
    #9 ImpalaInternalService::ReportExecStatus(TReportExecStatusResult&, TReportExecStatusParams const&) service/impala-internal-service.cc:87:19
    #24 thread_proxy (exprs/expr-test+0x55ca939)

Change-Id: I317ccc99549744a26d65f3e07242079faad0355a
Reviewed-on: http://gerrit.cloudera.org:8080/11545
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-10-02 20:24:17 +00:00
stiga-huang
ddef2cb9b1 IMPALA-376: add built-in functions for parsing JSON
This patch implements the same functionality as the Hive UDF
get_json_object. We reuse RapidJSON to parse the JSON string. To track
the memory used by RapidJSON, we wrap FunctionContext in an allocator.

get_json_object accepts two parameters: a JSON string and a selector
(JSON path). We parse the JSON string into a Document tree and then
perform BFS according to the selector. For example, to process
    get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first apply '$[*]' to extract all the items in the root array.
This gives a queue consisting of {a:1},{a:2},{a:3}, and we then apply
the '.a' selector to every value in the queue. The final results in the
queue are 1,2,3. As there are multiple results, they are encapsulated
in an array, so the output is the string '[1,2,3]'.
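
The queue-based traversal can be sketched as follows. This is only an
illustration: the Json struct, Step, Select and FormatResults are
hypothetical stand-ins (the patch itself works on RapidJSON documents),
the selector is assumed to be already split into steps, and returning
"NULL" for no match is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy JSON value for illustration only.
struct Json {
  enum Kind { kInt, kArray, kObject };
  Kind kind = kInt;
  long num = 0;
  std::vector<std::string> keys;  // member names when kind == kObject
  std::vector<Json> children;     // array items or object member values
};

Json Int(long n) { Json j; j.num = n; return j; }
Json Arr(std::vector<Json> items) {
  Json j; j.kind = Json::kArray; j.children = std::move(items); return j;
}
Json Obj(std::string key, Json value) {
  Json j; j.kind = Json::kObject;
  j.keys.push_back(std::move(key));
  j.children.push_back(std::move(value));
  return j;
}

// One selector step: "[*]" (wildcard) or ".key".
struct Step {
  bool wildcard;
  std::string key;
};

// Applies each step to every value currently in the queue (BFS by level).
std::vector<Json> Select(const Json& root, const std::vector<Step>& path) {
  std::vector<Json> queue(1, root);
  for (const Step& step : path) {
    std::vector<Json> next;
    for (const Json& v : queue) {
      if (step.wildcard) {
        if (v.kind != Json::kArray) continue;
        next.insert(next.end(), v.children.begin(), v.children.end());
      } else if (v.kind == Json::kObject) {
        for (std::size_t i = 0; i < v.keys.size(); ++i) {
          if (v.keys[i] == step.key) next.push_back(v.children[i]);
        }
      }
    }
    queue = std::move(next);
  }
  return queue;
}

std::string Render(const Json& v) {
  if (v.kind == Json::kInt) return std::to_string(v.num);
  const bool is_array = v.kind == Json::kArray;
  std::string out = is_array ? "[" : "{";
  for (std::size_t i = 0; i < v.children.size(); ++i) {
    if (i > 0) out += ",";
    if (!is_array) out += "\"" + v.keys[i] + "\":";
    out += Render(v.children[i]);
  }
  return out + (is_array ? "]" : "}");
}

// A single result is returned as-is; multiple results are wrapped in an array.
std::string FormatResults(const std::vector<Json>& results) {
  if (results.empty()) return "NULL";  // Assumption: no match yields NULL.
  if (results.size() == 1) return Render(results[0]);
  return Render(Arr(results));
}
```

For the example above, the root array with steps '[*]' and '.a' yields
the queue 1,2,3, which FormatResults renders as '[1,2,3]'.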

More examples can be found in expr-test.cc.

Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory

Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-29 11:59:03 +00:00
Attila Jeges
cb49371613 IMPALA-7492: Add support for DATE text parser/formatter
This change is the first step in implementing support for DATE type
(IMPALA-6169).

The DATE parser/formatter is implemented by the new DateParser class.
- The parser supports both default and custom-formatted DATE values.
CCTZ is used to validate the parsed dates.
- The formatter supports default and custom formatting of DATE values.
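
To illustrate what validating a parsed date involves, here is a rough
sketch. It is not the DateParser implementation: the patch delegates
validation to CCTZ, whereas this stand-in checks the calendar rules by
hand. The names SimpleDate and ParseDate are hypothetical, and the
"YYYY-MM-DD" layout is assumed to be the default format.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical in-memory representation used only by this sketch.
struct SimpleDate {
  int year;
  int month;
  int day;
};

bool IsLeapYear(int y) { return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0; }

int DaysInMonth(int y, int m) {
  static const int kDays[] = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
  return (m == 2 && IsLeapYear(y)) ? 29 : kDays[m - 1];
}

// Parses "YYYY-MM-DD" and returns true only for a real calendar date.
// No supported year range is enforced here.
bool ParseDate(const std::string& s, SimpleDate* out) {
  if (s.size() != 10 || s[4] != '-' || s[7] != '-') return false;
  int fields[3] = {0, 0, 0};  // year, month, day
  int field = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (i == 4 || i == 7) { ++field; continue; }
    if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
    fields[field] = fields[field] * 10 + (s[i] - '0');
  }
  const int y = fields[0], m = fields[1], d = fields[2];
  if (m < 1 || m > 12) return false;
  if (d < 1 || d > DaysInMonth(y, m)) return false;
  out->year = y;
  out->month = m;
  out->day = d;
  return true;
}
```

Under these assumptions "2016-02-29" is accepted, while "2018-02-29"
and "2018-13-01" are rejected.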

In the future, DateParser will be used in the text scanner/writer and
in the DATE <-> STRING cast functions.

The DateParser class reuses some of the functionality already
implemented in the TimestampParser class to minimize redundancy. To
make code reuse easier, a new namespace (datetime_parse_util) was
created and the common functionality was moved there.

This change also adds a new class (DateValue) to represent a DATE
value in memory. The DateParser and DateValue classes are used only in
tests at the moment, so this patch doesn't change user-facing
behavior.

Testing:
- Added BE-tests for DateParser and DateValue classes.
- Re-ran parse-timestamp-benchmark to make sure that parser
  performance hasn't degraded.

Change-Id: I1eec00f22502c4c67c6807c4b51384419ea8b831
Reviewed-on: http://gerrit.cloudera.org:8080/11450
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-09-28 18:14:56 +00:00