Commit Graph

13 Commits

Author SHA1 Message Date
Attila Jeges
b5805de3e6 IMPALA-7368: Add initial support for DATE type
DATE values describe a particular year/month/day in the form
yyyy-MM-dd. For example: DATE '2019-02-15'. DATE values do not have a
time of day component. The range of values supported for the DATE type
is 0000-01-01 to 9999-12-31.

This initial DATE type support covers TEXT and HBASE fileformats only.
'DateValue' is used as the internal type to represent DATE values.

The changes are as follows:
- Support for DATE literal syntax.

- Explicit casting between DATE and other types (note that invalid
  casts will fail with an error just like invalid DECIMAL_V2 casts,
  while failed casts to other types do no lead to warning or error):
    - from STRING to DATE. The string value must be formatted as
      yyyy-MM-dd HH:mm:ss.SSSSSSSSS. The date component is mandatory,
      the time component is optional. If the time component is
      present, it will be truncated silently.
    - from DATE to STRING. The resulting string value is formatted as
      yyyy-MM-dd.
    - from TIMESTAMP to DATE. The source timestamp's time of day
      component is ignored.
    - from DATE to TIMESTAMP. The target timestamp's time of day
      component is set to 00:00:00.

- Implicit casting between DATE and other types:
    - from STRING to DATE if the source string value is used in a
      context where a DATE value is expected.
    - from DATE to TIMESTAMP if the source date value is used in a
      context where a TIMESTAMP value is expected.

- Since STRING -> DATE, STRING -> TIMESTAMP and DATE -> TIMESTAMP
  implicit conversions are now all possible, the existing function
  overload resolution logic is not adequate anymore.
  For example, it resolves the
  if(false, '2011-01-01', DATE '1499-02-02') function call to the
  if(BOOLEAN, TIMESTAMP, TIMESTAMP) version of the overloaded
  function, instead of the if(BOOLEAN, DATE, DATE) version.

  This is clearly wrong, so the function overload resolution logic had
  to be changed to resolve function calls to the best-fit overloaded
  function definition if there are multiple applicable candidates.

  An overloaded function definition is an applicable candidate for a
  function call if each actual parameter in the function call either
  matches the corresponding formal parameter's type (without casting)
  or is implicitly castable to that type.

  When looking for the best-fit applicable candidate, a parameter
  match score (i.e. the number of actual parameters in the function
  call that match their corresponding formal parameter's type without
  casting) is calculated and the applicable candidate with the highest
  parameter match score is chosen.

  There's one more issue that the new resolution logic has to address:
  if two applicable candidates have the same parameter match score and
  the only difference between the two is that the first one requires a
  STRING -> TIMESTAMP implicit cast for some of its parameters while
  the second one requires a STRING -> DATE implicit cast for the same
  parameters then the first candidate has to be chosen not to break
  backward compatibility.
  E.g: year('2019-02-15') function call must resolve to
  year(TIMESTAMP) instead of year(DATE). Note, that year(DATE) is not
  implemented yet, so this is not an issue at the moment but it will
  be in the future.
  When the resolution algorithm considers overloaded function
  definitions, first it orders them lexicographically by the types in
  their parameter lists. To ensure the backward compatible behavior
  Primitivetype.DATE enum value has to come after
  PrimitiveType.TIMESTAMP.

- Codegen infrastructure changes for expression evaluation.
- 'IS [NOT] NULL' and '[NOT] IN' predicates.
- Common comparison operators (including the 'BETWEEN' operator).
- Infrastructure changes for built-in functions.
- Some built-in functions: conditional, aggregate, analytical and
  math functions.
- C++ UDF/UDA support.
- Support partitioning and grouping by DATE.
- Beeswax, HiveServer2 support.

These items are tightly coupled and it makes sense to implement them
in one change-set.

Testing:
- A new partitioned TEXT table 'functional.date_tbl' (and the
  corresponding HBASE table 'functional_hbase.date_tbl') was
  introduced for DATE-related tests.
- BE and FE tests were extended to cover DATE type.
- E2E tests:
    - since DATE type is supported for TEXT and HBASE fileformats
      only, most DATE tests were implemented separately in
      tests/query_test/test_date_queries.py.

Note, that this change-set is not a complete DATE type implementation,
but it lays the foundation for future work:
- Add date support to the random query generator.
- Implement a complete set of built-in functions.
- Add Parquet support.
- Add Kudu support.
- Optionally support Avro and ORC.
For further details, see IMPALA-6169.

Change-Id: Iea8155ef09557e0afa2f8b2d0b2dc9d0896dc30f
Reviewed-on: http://gerrit.cloudera.org:8080/12481
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-23 13:33:57 +00:00
Yongjun Zhang
ba27b03814 IMPALA-6202. mod and % should be equivalent.
Currently in DECIMAL V2 mode, typeof(9.9 % 3) is DECIMAL(2,1) and typeof(mod(9.9, 3)) is
DECIMAL(4,1), while both are expected to be DECIMAL(2,1). This jira fixes V2 mode by
replacing "mod" with "%" at parser stage thus they share the same code path afterwards.

Testing:
Added unit tests and done real cluster testing.

Change-Id: Ib0067da04083859ffbf662a29007154461bea2ba
Reviewed-on: http://gerrit.cloudera.org:8080/11443
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2018-09-27 20:41:29 +00:00
Taras Bobrovytsky
d0f838b66a IMPALA-6340,IMPALA-6518: Check that decimal types are compatible in FE
In this patch we implement strict decimal type checking in the FE in
various situations when DECIMAL_V2 is enabled. What is affected:
- Union. If we union two decimals and it is not possible to come up
  with a decimal that will be able to contain all the digits, an error
  is thrown. For example, the union(decimal(20, 10), decimal(20, 20))
  returns decimal(30, 20). However, for union(decimal(38, 0),
  decimal(38, 38)) the ideal return type would be decimal(76,38), but
  this is too large, so an error is thrown.
- Insert. If we are inserting a decimal value into a column where we are
  not guaranteed that all digits will fit, an error is thrown. For
  example, inserting a decimal(38,0) value into a decimal(38,38) column.
- Functions such as coalesce(). If we are unable to determine the output
  type that guarantees that all digits will fit from all the arguments,
  an error is thrown. For example,
  coalesce(decimal(38,38), decimal(38,0)) will throw an error.
- Hash Join. When joining on two decimals, if a type cannot be
  determined that both columns can be cast to, we throw an error.
  For example, join on decimal(38,0) and decimal(38,38) will result
  in an error.

To avoid these errors, you need to use CAST() on some of the decimals.

In this patch we also change the output decimal calculation of decimal
round, truncate and related functions. If these functions are a no-op,
the resulting decimal type is the same as the input type.

Testing:
- Ran a core build which passed.

Change-Id: Id406f4189e01a909152985fabd5cca7a1527a568
Reviewed-on: http://gerrit.cloudera.org:8080/9930
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-04-28 03:33:02 +00:00
Taras Bobrovytsky
b0027575cb IMPALA-6405: Error when string to decimal cast overflows
Before this patch, when there was an error when converting a string to
a decimal, a NULL was returned. In this patch, we change this behavior
so that an error is returned if decimal_v2 is enabled. We also add a
warning if there is an underflow.

The reasoning is that we want stricter behavior in decimal_v2.

Testing:
- Added some EE tests.
- Ran an exhaustive build, which passed.

Change-Id: Icffccac1c1c2361447ae4b0de9b6c2ec7de071db
Reviewed-on: http://gerrit.cloudera.org:8080/9339
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2018-03-06 03:29:47 +00:00
Zoltan Borok-Nagy
545e60f832 IMPALA-5191: Standardize column alias behavior
We should not perform alias substitution in the
subexpressions of GROUP BY, HAVING, and ORDER BY
to be more standard conformant.

=== Allowed ===
SELECT int_col / 2 AS x
FROM functional.alltypes
GROUP BY x;

SELECT int_col / 2 AS x
FROM functional.alltypes
ORDER BY x;

SELECT NOT bool_col AS nb
FROM functional.alltypes
GROUP BY nb
HAVING nb;

=== Not allowed ===
SELECT int_col / 2 AS x
FROM functional.alltypes
GROUP BY x / 2;

SELECT int_col / 2 AS x
FROM functional.alltypes
ORDER BY -x;

SELECT int_col / 2 AS x
FROM functional.alltypes
GROUP BY x
HAVING x > 3;

Some extra checks were added to AnalyzeExprsTest.java.

I had to update other tests to make them pass
since the new behavior is more restrictive.

I added alias.test to the end-to-end tests.

Cherry-picks: not for 2.x.

Change-Id: I0f82483b486acf6953876cfa672b0d034f3709a8
Reviewed-on: http://gerrit.cloudera.org:8080/8801
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-24 22:47:18 +00:00
Taras Bobrovytsky
c86b0a9736 IMPALA-5014: Part 2: Round when casting decimal to timestamp
When there are too many digits to the right of the dot in a decimal, we
would always truncate when casting to timestamp. In this patch we change
the behavior to round instead of truncating when decimal_v2 is enabled.

Testing:
- Added some EE tests, ran BE tests on my machine.

Change-Id: I8fb3a7d976ab980b8572d7e9524850572bad57da
Reviewed-on: http://gerrit.cloudera.org:8080/8969
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-10 05:47:23 +00:00
Taras Bobrovytsky
575b5a20e6 IMPALA-5017: Error on decimal overflow
Before this patch, decimal operations would either silently overflow (in
the case of sum() and avg()), or produce a warning.

In this patch, the behaviour is changed so that an error is produced in
the case of overflow when DECIMAL_v2 is enabled. Decimal v1 behaviour is
unchanged.

We introduce overflow checks when computing sum() and avg(). This
results in a ~30% performance regression when we are in decimal v2 mode
compared to decimal v1.

Benchmarks:
  Query:
  select sum(dec_38_19) from decimal_tbl
    Decimal v1: 11.57s
    Decimal v2: 16.58s

  Query:
  select avg(dec_38_19) from decimal_tbl
    Decimal v1: 12.08s
    Decimal v2: 17.08s

The performance regression is not as bad if we are computing the sum or
average of decimal column with a lower precision:

  Query:
  select sum(dec_9_5) from decimal_tbl
    Decimal v1: 11.06s
    Decimal v2: 13.08s

  Query:
  select avg(dec_9_5) from decimal_tbl
    Decimal v1: 11.56s
    Decimal v2: 13.57s

Testing:
- Added several end to end tests.
- Updated Expr tests to check for error in case of overflow.

Change-Id: Id98a92c9a9469ec8cf14e518c741a2dab7053019
Reviewed-on: http://gerrit.cloudera.org:8080/8404
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
2017-12-01 23:23:01 +00:00
Zoltan Borok-Nagy
4f11bed407 IMPALA-5936: operator '%' overflows on large decimals
Suppose we have a large decimal number, which is greater
than INT_MAX. We want to calculate the modulo of this
number by 3:
BIG_DECIMAL % 3

The result of this calculation can be 0, 1, or 2.
This can fit into a decimal with precision 1.

The in-memory representation of such small decimals are
stored in int32_t in the backend. Let's call this int32_t
the result type. The backend had the invalid assumption
that it can do the calculation as well using the result type.
This assumption is true for multiplying or adding numbers,
but not for modulo.

Now the backend selects the biggest type of ['return type',
'1st operand type', '2nd operand type'] to do the calculation.

Change-Id: I2b06c8acd5aa490943e84013faf2eaac7c26ceb4
Reviewed-on: http://gerrit.cloudera.org:8080/8574
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-28 21:11:45 +00:00
Taras Bobrovytsky
d1b92c8b52 IMPALA-6183: Fix Decimal to Double conversion
When converting a decimal to a double, we incorrectly used the powf()
function in the backend, which returns a float instead of a double.
This caused us to lose precision.

We fix the problem by replacing the powf() function with a pow()
function, which returns a double.

Testing:
- Added an EE test.

Change-Id: I9bf81d039e5037f22c64a32b328832235aafe9e3
Reviewed-on: http://gerrit.cloudera.org:8080/8547
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-11-15 02:54:53 +00:00
Taras Bobrovytsky
5ebea0ec4d IMPALA-5018: Error on decimal modulo or divide by zero
Before this patch, decimal operations would never produce an error.
Division by and modulo zero would result in a NULL.

In this patch, we change this behavior so that we raise an error
instead of returning a NULL. We also modify the format of the decimal
expr tests format to also include an error field.

Testing:
- Added several expr and end to end tests.

Change-Id: If7a7131e657fcdd293ade78d62f851dac0f1e3eb
Reviewed-on: http://gerrit.cloudera.org:8080/8344
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-25 00:44:34 +00:00
Zach Amsden
c87ab35af1 IMPALA-4813: Round on divide and multiply
Address rounding on divide and multiply when results are truncated.

Testing: Manually ran some divides that should overflow, then added
the results to the test.  Made the decimal-test use rounding behavior
by default, and now the error margin of the test has decreased.

Initial perf results:

Multiply is totall uninteresting so far, all implementations
return the same values in the same time:

+-------------------------+-----------------------------------+
| sum(l_quantity * l_tax) | sum(l_extendedprice * l_discount) |
+-------------------------+-----------------------------------+
| 61202493.3700           | 114698450836.4234                 |
+-------------------------+-----------------------------------+
Fetched 1 row(s) in 1.13s

Divide shows no regression from prior with DECIMAL_V2 off:

+-----------------------------+-----------------------------------+
| sum(l_quantity / l_tax)     | sum(l_extendedprice / l_discount) |
+-----------------------------+-----------------------------------+
| 46178777464.523809516381723 | 61076151920731.010714279183910    |
+-----------------------------+-----------------------------------+
before:  Fetched 1 row(s) in 13.08s
after:   Fetched 1 row(s) in 13.06s

And with DECIMAL_V2 on:

+-----------------------------+-----------------------------------+
| sum(l_quantity / l_tax)     | sum(l_extendedprice / l_discount) |
+-----------------------------+-----------------------------------+
| 46178777464.523809523847285 | 61076151920731.010714285714202    |
+-----------------------------+-----------------------------------+
Fetched 1 row(s) in 16.06s

So the performance regression is not as bad as expected.  Still,
divide performance could use some work.

Change-Id: Ie6bfcbe37555b74598d409c6f84f06b0ae5c4312
Reviewed-on: http://gerrit.cloudera.org:8080/6132
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-03-02 20:12:05 +00:00
Michael Ho
637cc3e447 IMPALA-4821: Update AVG() for DECIMAL_V2
This change implements the DECIMAL_V2's behavior for AVG().
The differences with DECIMAL_V1 are:

1. The output type has a minimum scale of 6. This is similar
to MS SQL's behavior which takes the max of 6 and the input
type's scale. We deviate from MS SQL in the output's precision
which is always set to 38. We use the smallest precision which
can store the output. A key insight is that the output of AVG()
is no wider than the inputs. Precision only needs to be adjusted
when the scale is augmented. Using a smaller precision avoids
potential loss of precision in subsequent decimal operations
(e.g. division) if AVG() is a subexpression. Please note that
the output type is different from SUM()/COUNT() as the latter
can have a much larger scale.

2. Due to a minimum of 6 decimal places for the output,
AVG() for decimal values whose whole number part exceeds 32
decimal places (e.g. DECIMAL(38,4), DECIMAL(33,0)) will
always overflow as the scale is augmented to 6. Certain
decimal types which work with AVG() in DECIMAL_V1 no longer
work in DECIMAL_V2.

Change-Id: I28f5ef0370938440eb5b1c6d29b2f24e6f88499f
Reviewed-on: http://gerrit.cloudera.org:8080/6038
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-22 06:31:14 +00:00
Dan Hecht
a53eeb2068 IMPALA-4370: Divide and modulo result types for DECIMAL version V2
Implement the new DECIMAL return type rules for divide and modulo
expressions, active when query option DECIMAL_V2=1. See the comment
in the code for more details. A couple of examples that show why new
return type rules for divide are desirable.

For modulo, the return types are actually equivalent, though the
rules are expressed differently to have consistency with how
precision fixups are handled for each version.

DECIMAL Version 1:

+-------------------------------------------------------+
| cast(1 as decimal(20,0)) / cast(3 as decimal(20,0)) |
+-----------------------------------------------------+
| 0                                                   |
+-------------------------------------------------------+

DECIMAL Version 2:

+-------------------------------------------------------+
| cast(1 as decimal(20,0)) / cast(3 as decimal(20,0)) |
+-----------------------------------------------------+
| 0.333333333333333333                                |
+-------------------------------------------------------+

DECIMAL Version 1:

+-------------------------------------------------------+
| cast(1 as decimal(6,0)) / cast(0.1 as decimal(38,38)) |
+-------------------------------------------------------+
| NULL                                                  |
+-------------------------------------------------------+
WARNINGS: UDF WARNING: Expression overflowed, returning NULL

DECIMAL Version 2:

+-------------------------------------------------------+
| cast(1 as decimal(6,0)) / cast(0.1 as decimal(38,38)) |
+-------------------------------------------------------+
| 10.000000                                             |
+-------------------------------------------------------+

Change-Id: I83e7f7787edfa4b4bddc25945090542a0e90881b
Reviewed-on: http://gerrit.cloudera.org:8080/5952
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-14 18:40:54 +00:00