Before this patch, decimal operations would either silently overflow (in
the case of sum() and avg()), or produce a warning.
In this patch, the behaviour is changed so that an error is produced in
the case of overflow when DECIMAL_v2 is enabled. Decimal v1 behaviour is
unchanged.
We introduce overflow checks when computing sum() and avg(). This
results in a ~30% performance regression when we are in decimal v2 mode
compared to decimal v1.
Benchmarks:
Query:
select sum(dec_38_19) from decimal_tbl
Decimal v1: 11.57s
Decimal v2: 16.58s
Query:
select avg(dec_38_19) from decimal_tbl
Decimal v1: 12.08s
Decimal v2: 17.08s
The performance regression is not as bad if we are computing the sum or
average of decimal column with a lower precision:
Query:
select sum(dec_9_5) from decimal_tbl
Decimal v1: 11.06s
Decimal v2: 13.08s
Query:
select avg(dec_9_5) from decimal_tbl
Decimal v1: 11.56s
Decimal v2: 13.57s
Testing:
- Added several end to end tests.
- Updated Expr tests to check for error in case of overflow.
Change-Id: Id98a92c9a9469ec8cf14e518c741a2dab7053019
Reviewed-on: http://gerrit.cloudera.org:8080/8404
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
Suppose we have a large decimal number, which is greater
than INT_MAX. We want to calculate the modulo of this
number by 3:
BIG_DECIMAL % 3
The result of this calculation can be 0, 1, or 2.
This can fit into a decimal with precision 1.
The in-memory representation of such small decimals are
stored in int32_t in the backend. Let's call this int32_t
the result type. The backend had the invalid assumption
that it can do the calculation as well using the result type.
This assumption is true for multiplying or adding numbers,
but not for modulo.
Now the backend selects the biggest type of ['return type',
'1st operand type', '2nd operand type'] to do the calculation.
Change-Id: I2b06c8acd5aa490943e84013faf2eaac7c26ceb4
Reviewed-on: http://gerrit.cloudera.org:8080/8574
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
When converting a decimal to a double, we incorrectly used the powf()
function in the backend, which returns a float instead of a double.
This caused us to lose precision.
We fix the problem by replacing the powf() function with a pow()
function, which returns a double.
Testing:
- Added an EE test.
Change-Id: I9bf81d039e5037f22c64a32b328832235aafe9e3
Reviewed-on: http://gerrit.cloudera.org:8080/8547
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Before this patch, decimal operations would never produce an error.
Division by and modulo zero would result in a NULL.
In this patch, we change this behavior so that we raise an error
instead of returning a NULL. We also modify the format of the decimal
expr tests format to also include an error field.
Testing:
- Added several expr and end to end tests.
Change-Id: If7a7131e657fcdd293ade78d62f851dac0f1e3eb
Reviewed-on: http://gerrit.cloudera.org:8080/8344
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Address rounding on divide and multiply when results are truncated.
Testing: Manually ran some divides that should overflow, then added
the results to the test. Made the decimal-test use rounding behavior
by default, and now the error margin of the test has decreased.
Initial perf results:
Multiply is totall uninteresting so far, all implementations
return the same values in the same time:
+-------------------------+-----------------------------------+
| sum(l_quantity * l_tax) | sum(l_extendedprice * l_discount) |
+-------------------------+-----------------------------------+
| 61202493.3700 | 114698450836.4234 |
+-------------------------+-----------------------------------+
Fetched 1 row(s) in 1.13s
Divide shows no regression from prior with DECIMAL_V2 off:
+-----------------------------+-----------------------------------+
| sum(l_quantity / l_tax) | sum(l_extendedprice / l_discount) |
+-----------------------------+-----------------------------------+
| 46178777464.523809516381723 | 61076151920731.010714279183910 |
+-----------------------------+-----------------------------------+
before: Fetched 1 row(s) in 13.08s
after: Fetched 1 row(s) in 13.06s
And with DECIMAL_V2 on:
+-----------------------------+-----------------------------------+
| sum(l_quantity / l_tax) | sum(l_extendedprice / l_discount) |
+-----------------------------+-----------------------------------+
| 46178777464.523809523847285 | 61076151920731.010714285714202 |
+-----------------------------+-----------------------------------+
Fetched 1 row(s) in 16.06s
So the performance regression is not as bad as expected. Still,
divide performance could use some work.
Change-Id: Ie6bfcbe37555b74598d409c6f84f06b0ae5c4312
Reviewed-on: http://gerrit.cloudera.org:8080/6132
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This change implements the DECIMAL_V2's behavior for AVG().
The differences with DECIMAL_V1 are:
1. The output type has a minimum scale of 6. This is similar
to MS SQL's behavior which takes the max of 6 and the input
type's scale. We deviate from MS SQL in the output's precision
which is always set to 38. We use the smallest precision which
can store the output. A key insight is that the output of AVG()
is no wider than the inputs. Precision only needs to be adjusted
when the scale is augmented. Using a smaller precision avoids
potential loss of precision in subsequent decimal operations
(e.g. division) if AVG() is a subexpression. Please note that
the output type is different from SUM()/COUNT() as the latter
can have a much larger scale.
2. Due to a minimum of 6 decimal places for the output,
AVG() for decimal values whose whole number part exceeds 32
decimal places (e.g. DECIMAL(38,4), DECIMAL(33,0)) will
always overflow as the scale is augmented to 6. Certain
decimal types which work with AVG() in DECIMAL_V1 no longer
work in DECIMAL_V2.
Change-Id: I28f5ef0370938440eb5b1c6d29b2f24e6f88499f
Reviewed-on: http://gerrit.cloudera.org:8080/6038
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Implement the new DECIMAL return type rules for divide and modulo
expressions, active when query option DECIMAL_V2=1. See the comment
in the code for more details. A couple of examples that show why new
return type rules for divide are desirable.
For modulo, the return types are actually equivalent, though the
rules are expressed differently to have consistency with how
precision fixups are handled for each version.
DECIMAL Version 1:
+-------------------------------------------------------+
| cast(1 as decimal(20,0)) / cast(3 as decimal(20,0)) |
+-----------------------------------------------------+
| 0 |
+-------------------------------------------------------+
DECIMAL Version 2:
+-------------------------------------------------------+
| cast(1 as decimal(20,0)) / cast(3 as decimal(20,0)) |
+-----------------------------------------------------+
| 0.333333333333333333 |
+-------------------------------------------------------+
DECIMAL Version 1:
+-------------------------------------------------------+
| cast(1 as decimal(6,0)) / cast(0.1 as decimal(38,38)) |
+-------------------------------------------------------+
| NULL |
+-------------------------------------------------------+
WARNINGS: UDF WARNING: Expression overflowed, returning NULL
DECIMAL Version 2:
+-------------------------------------------------------+
| cast(1 as decimal(6,0)) / cast(0.1 as decimal(38,38)) |
+-------------------------------------------------------+
| 10.000000 |
+-------------------------------------------------------+
Change-Id: I83e7f7787edfa4b4bddc25945090542a0e90881b
Reviewed-on: http://gerrit.cloudera.org:8080/5952
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins