This introduces KuduScanNodeMt, the single-threaded version
of KuduScanNode that materializes the tuples in GetNext().
KuduScanNodeMt is enabled by the same condition as
HdfsScanNodeMt: mt_dop is greater than or equal to 1.
To share code between the two implementations, KuduScanNode
and KuduScanNodeMt are now subclasses of KuduScanNodeBase,
which implements the shared code. The KuduScanner is
minimally impacted, as it already had the required GetNext
interface.
Since the KuduClient is a heavy-weight object, it is now
shared at the QueryState level. We try to share the
KuduClient as much as possible, but there are times when
the KuduClient cannot be shared. Each Kudu table has
master addresses stored in the Hive Metastore. We only
share KuduClients for tables that have an identical value
for the master addresses. In the ideal case, every Kudu
table will have the same value, but there is no explicit
guarantee of this.
The testing for this is a modified version of
kudu-scan-node.test run with various mt_dop values.
Change-Id: I6e4593300e376bc508b78acaea64ffdd2c73a67a
Reviewed-on: http://gerrit.cloudera.org:8080/6312
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
The tests for dictionary filtering look at how many row groups are
processed and how many are filtered by matching text in the profile.
However, the number of row groups processed and filtered by any
individual fragment depends on how the work is split and how many
impalads are running. This causes variability in the test output.
To fix this, the test needs a way to aggregate the results across
fragments. This fix introduces the following syntax for specifying
these aggregates:
aggregate(function_name, field_name): expected_value
This searches the runtime profile for lines that contain
'field_name: number'. It skips the averaged fragment, as this is
derived from all the other fragments.
Currently, only SUM is implemented, and the expected_value is
required to be an integer. It should be easy to implement other
interesting functions like COUNT and MIN/MAX. It would also be
possible to extend it to floats.
Switching the dictionary filtering tests over to this new syntax
eliminates the variability in the tests.
Change-Id: I6b7b84d973b3ac678a24e82900f2637d569158bb
Reviewed-on: http://gerrit.cloudera.org:8080/6301
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Summary of changes:
Introduces a new query option PARQUET_ARRAY_RESOLUTION to
control the path-resolution behavior for Parquet files
with nested arrays. The values are:
- THREE_LEVEL
Assumes arrays are encoded with the 3-level representation.
Also resolves arrays encoded with a single level.
Does not attempt a 2-level resolution.
- TWO_LEVEL
Assumes arrays are encoded with the 2-level representation.
Also resolves arrays encoded with a single level.
Does not attempt a 3-level resolution.
- TWO_LEVEL_THEN_THREE_LEVEL
First tries to resolve assuming the 2-level representation,
and if unsuccessful, tries the 3-level representation.
Also resolves arrays encoded with a single level.
This is the current Impala behavior and is used as the
default value for compatibility.
Note that 'failure' to resolve a schema path with a given
array-resolution policy does not necessarily mean a warning or
error is returned by the query. A mismatch might be treated
like a missing field which is necessary to support schema
evolution. There is no way to reliably distinguish the
'bad resolution' and 'legitimately missing field' cases.
The new query option is independent of and can be combined
with the existing PARQUET_FALLBACK_SCHEMA_RESOLUTION.
Background:
Arrays can be represented in several ways in Parquet:
- Three Level Encoding (standard)
- Two Level Encoding (legacy)
- One Level Encoding (legacy)
More details are in the "Lists" section of the spec:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
Unfortunately, there is no reliable metadata within Parquet files
to indicate which encoding was used. There is even the possibility
of having mixed encodings within the same file if there are multiple
arrays.
As a result, Impala currently tries to auto-detect the file encoding
when resolving a schema path in a Parquet file using the
TWO_LEVEL_THEN_THREE_LEVEL policy.
However, regardless of whether a Parquet data file uses the 2-level
or 3-level encoding, the index-based resolution may return incorrect
results if the representation in the Parquet file does not
exactly match the attempted array-resoution policy. Intuitively,
when attempting a 2-level resolution on a 3-level file, the matched
schema node may not be deep enough in the schema tree, but could still
be a scalar node with expected type. Similarly, when attempting a
3-level resolution on a 2-level file a level may be incorrectly
skipped.
The name-based policy generally does not have this problem because it
avoids traversing incorrect schema paths. However, the index-based
resoution allows a different set of schema-evolution operations,
so just using name-based resolution is not an acceptable workaround
in all cases.
Testing:
- Added new Parquet data files that show how incorrect results
can be returned with a mismatched file encoding and resolution
policy. Added both 2-level and 3-level versions of the data.
- Added a new test in test_nested_types.py that shows the behavior
with the new PARQUET_ARRAY_RESOLUTION query option.
- Locally ran test_scanners.py and test_nested_types.py on core.
Change-Id: I4f32e19ec542d4d485154c9d65d0f5e3f9f0a907
Reviewed-on: http://gerrit.cloudera.org:8080/6250
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Here is a basic summary of the changes:
Frontend looks for conjuncts that operate on a single slot and pass a
map from slot id to the conjunct index through thrift to the backend.
The conjunct indices are the indices into the normal PlanNode conjuncts list.
The conjuncts need to satisfy certain conditions:
1. They are bound on a single slot
2. They are deterministic (no random functions)
3. They evaluate to FALSE on a NULL input. This is because the dictionary
does not include NULLs, so any condition that evaluates to TRUE on NULL
cannot be evaluated by looking only at the dictionary.
The backend converts the indices into ExprContexts. These are cloned in
the scanner threads.
The dictionary read codepath has been removed from ReadDataPage into its
own function, InitDictionary. This has also been turned into its own step
in row group initialization. ReadDataPage will not see any dictionary
pages unless the parquet file is invalid.
For dictionary filtering, we initialize dictionaries only as needed to evaluate
the conjuncts. The Parquet scanner evaluates the dictionary filter conjuncts on the
dictionary to see if any dictionary entry passes. If no entry passes, the row
group is eliminated. If the row group passes the dictionary filtering, then we
initialize all remaining dictionaries.
Dictionary filtering is controlled by a new query option,
parquet_dictionary_filtering, which is on by default.
Since column chunks can have a mixture of encodings, dictionary filtering
uses three tests to determine whether this is purely dictionary encoded:
1. If the encoding_stats is in the parquet file, then use it to determine if
there are only dictionary encoded pages (i.e. there are no data pages with
an encoding other than PLAIN_DICTIONARY).
-OR-
2. If the encoding stats are not present, then look at the encodings. The column
is purely dictionary encoded if:
a) PLAIN_DICTIONARY is present
AND
b) Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed
-OR-
3. If this file was written by an older version of Impala, then we know that
dictionary failover happens when the dictionary reaches 40,000 values.
Dictionary filtering can proceed as long as the dictionary is smaller than
that.
parquet-mr writes the encoding list correctly in the current version in our
environment (1.5.0). This means that check #2 works on some existing files
(potentially most existing parquet-mr files).
parquet-mr writes the encoding stats starting in 1.9.0. This is the version
where check #1 will start working.
Impala's parquet writer now implements both, so either check above will work.
Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Reviewed-on: http://gerrit.cloudera.org:8080/5904
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
The query option PARQUET_FALLBACK_SCHEMA_RESOLUTION
allows matching of Parquet fields by name instead of by
index (the default).
Parquet column names are case sensitive, but Impala treats
db/table/column/field names as case-insensitive. Today,
there is no way today to select Parquet columns with mixed
casing via SQL using the name-based field resolution policy.
This patch changes the matching of Parquet fields to be
case-insensitive.
Testing:
- Modified the data files backing complextypestbl
to contain fields with mixed casing.
- Several existing tests run against this table,
including the test for name-based resolution.
- I confirmed that without this fix, the existing
name-based resolution tests fail on the modified
data files.
- I locally ran test_scanners.py and test_nested_types.py
on exhaustive with this fix.
Change-Id: I87395f84ba29b4c3d8e41be1ea4e89e500b8a9f4
Reviewed-on: http://gerrit.cloudera.org:8080/5891
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
- test_parquet_stats.py was missing and the tests weren't run during
GVO.
- The tests in parquet_stats.test assume that the queries were executed
in a single fragment, so they now run with 'num_nodes = 1'.
- Parquet columns are now resolved correctly.
- Parquet files with missing columns are now handled correctly.
- Predicates with implicit casts can now be evaluated against
parquet::Statistics.
- This change also cleans up some old friend declarations I came across.
Change-Id: I54c205fad7afc4a0b0a7d0f654859de76db29a02
Reviewed-on: http://gerrit.cloudera.org:8080/6147
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
The string parsing code already errors if the decimal column either
overflows or underflows (i.e. loses scale). Let's just add a test
case.
Change-Id: Idd66c0fb5a4d201919d39f73dea08b87339d6469
Reviewed-on: http://gerrit.cloudera.org:8080/6150
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Adds support for missing Kudu column options in ALTER TABLE
(it was there in CREATE TABLE already):
* encoding
* compression
* block_size
Also adds support for adding nullable columns with default
values.
All of the above was not originally implemented due to
limitations in the Kudu client, but have since been fixed:
KUDU-1746, KUDU-1747
Testing: Updates and adds relevant test cases.
Change-Id: I96a0fce7f6bc0c086b259d3119daa72c94b0af7b
Reviewed-on: http://gerrit.cloudera.org:8080/6220
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
Address rounding on divide and multiply when results are truncated.
Testing: Manually ran some divides that should overflow, then added
the results to the test. Made the decimal-test use rounding behavior
by default, and now the error margin of the test has decreased.
Initial perf results:
Multiply is totall uninteresting so far, all implementations
return the same values in the same time:
+-------------------------+-----------------------------------+
| sum(l_quantity * l_tax) | sum(l_extendedprice * l_discount) |
+-------------------------+-----------------------------------+
| 61202493.3700 | 114698450836.4234 |
+-------------------------+-----------------------------------+
Fetched 1 row(s) in 1.13s
Divide shows no regression from prior with DECIMAL_V2 off:
+-----------------------------+-----------------------------------+
| sum(l_quantity / l_tax) | sum(l_extendedprice / l_discount) |
+-----------------------------+-----------------------------------+
| 46178777464.523809516381723 | 61076151920731.010714279183910 |
+-----------------------------+-----------------------------------+
before: Fetched 1 row(s) in 13.08s
after: Fetched 1 row(s) in 13.06s
And with DECIMAL_V2 on:
+-----------------------------+-----------------------------------+
| sum(l_quantity / l_tax) | sum(l_extendedprice / l_discount) |
+-----------------------------+-----------------------------------+
| 46178777464.523809523847285 | 61076151920731.010714285714202 |
+-----------------------------+-----------------------------------+
Fetched 1 row(s) in 16.06s
So the performance regression is not as bad as expected. Still,
divide performance could use some work.
Change-Id: Ie6bfcbe37555b74598d409c6f84f06b0ae5c4312
Reviewed-on: http://gerrit.cloudera.org:8080/6132
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
IMPALA-2328 added support for skipping row groups based on
parquet::Statistics. This change adds a test for root-level
scalar columns of parquet files with nested types.
Change-Id: If81c8a1ecea937794885d4e5e7bf765bd238f5fb
Reviewed-on: http://gerrit.cloudera.org:8080/6130
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
This change adds support for skipping row groups based on Parquet row
group statistics. With this change we only support reading statistics
from Parquet files for numerical types (bool, integer, floating point)
and for simple predicates of the forms <slot> <op> <constant> or
<constant> <op> <slot>, where <op> is LT, LE, GE, GT, and EQ.
Change-Id: I39b836165756fcf929c801048d91c50c8fdcdae4
Reviewed-on: http://gerrit.cloudera.org:8080/6032
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins
Impala incorrectly returned NULLs in the "Max Size" column of the SHOW
COLUMN STATS result when executed through the HS2 interface. The issue
was that the column was specified to be type INT in the result schema,
but the actual type of the contents that we inserted into it was
"long". The reason why this is not an issue in Impala shell is because
we stringify the contents without inspecting the metadata for beeswax
results.
The issue was fixed by changing the type from INT to BIGINT.
Change-Id: I419657744635dfdc2e1562fe60a597617fff446e
Reviewed-on: http://gerrit.cloudera.org:8080/6109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This change implements the DECIMAL_V2's behavior for AVG().
The differences with DECIMAL_V1 are:
1. The output type has a minimum scale of 6. This is similar
to MS SQL's behavior which takes the max of 6 and the input
type's scale. We deviate from MS SQL in the output's precision
which is always set to 38. We use the smallest precision which
can store the output. A key insight is that the output of AVG()
is no wider than the inputs. Precision only needs to be adjusted
when the scale is augmented. Using a smaller precision avoids
potential loss of precision in subsequent decimal operations
(e.g. division) if AVG() is a subexpression. Please note that
the output type is different from SUM()/COUNT() as the latter
can have a much larger scale.
2. Due to a minimum of 6 decimal places for the output,
AVG() for decimal values whose whole number part exceeds 32
decimal places (e.g. DECIMAL(38,4), DECIMAL(33,0)) will
always overflow as the scale is augmented to 6. Certain
decimal types which work with AVG() in DECIMAL_V1 no longer
work in DECIMAL_V2.
Change-Id: I28f5ef0370938440eb5b1c6d29b2f24e6f88499f
Reviewed-on: http://gerrit.cloudera.org:8080/6038
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The bug: Compute incremental stats used to always do a
full stats recomputation for tables with complex types.
The logic for detecting schema changes (e.g. an added
column) did not take into consideration that columns
with complex types are ignored in the stats computation,
and should therefore not be recognized as a new column
that does not yet have stats.
Testing:
- Added a new regression test
- Locally ran test_compute_stats.py and the FE tests
Change-Id: I6e0335048d688ee25ff55c6628d0f6f8ecc1dd8a
Reviewed-on: http://gerrit.cloudera.org:8080/6033
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
This turned out to be slightly non-trivial as REPLACE is already a
keyword, and thus the parser needs to be tweaked to allow this,
since function names act as bare identifiers.
It was difficult to get this to match performance of regexp_replace.
For expanding patterns, the fact that regexp_replace copies the
expansion inline means that it may in fact win on large strings
with sparse matches that are > dcache size apart. Let's leave
optimizing that for later.
Testing: Added a full test for maximum size strings and got most
of the boundary conditions I could identify. Manually ran queries
on TPC-H dataset in impala to verify both performance and correctness.
Added large string and exprs.test test clauses and ran the tests to
verify they work as expected.
Change-Id: I1780a7d8fee6d0db9dad148217fb6eb10f773329
Reviewed-on: http://gerrit.cloudera.org:8080/5776
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Implement the new DECIMAL return type rules for divide and modulo
expressions, active when query option DECIMAL_V2=1. See the comment
in the code for more details. A couple of examples that show why new
return type rules for divide are desirable.
For modulo, the return types are actually equivalent, though the
rules are expressed differently to have consistency with how
precision fixups are handled for each version.
DECIMAL Version 1:
+-------------------------------------------------------+
| cast(1 as decimal(20,0)) / cast(3 as decimal(20,0)) |
+-----------------------------------------------------+
| 0 |
+-------------------------------------------------------+
DECIMAL Version 2:
+-------------------------------------------------------+
| cast(1 as decimal(20,0)) / cast(3 as decimal(20,0)) |
+-----------------------------------------------------+
| 0.333333333333333333 |
+-------------------------------------------------------+
DECIMAL Version 1:
+-------------------------------------------------------+
| cast(1 as decimal(6,0)) / cast(0.1 as decimal(38,38)) |
+-------------------------------------------------------+
| NULL |
+-------------------------------------------------------+
WARNINGS: UDF WARNING: Expression overflowed, returning NULL
DECIMAL Version 2:
+-------------------------------------------------------+
| cast(1 as decimal(6,0)) / cast(0.1 as decimal(38,38)) |
+-------------------------------------------------------+
| 10.000000 |
+-------------------------------------------------------+
Change-Id: I83e7f7787edfa4b4bddc25945090542a0e90881b
Reviewed-on: http://gerrit.cloudera.org:8080/5952
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
A few tests were added with IMPALA-1670 that made use of HDFS caching.
This patch moves these tests to a new file and only executes them
when the default filesystem is HDFS.
There was also a bug where the tests used absolute locations instead
of locations relative to the table they were in which could easily
collide with locations of other tables if they raced. That has been
fixed too.
Also added a testcase for testing alter table ADD multiple PARTITIONS
for non-HDFS filesystems.
Change-Id: Iefe61556bc28ae320f3f41fdc930d37b258d970a
Reviewed-on: http://gerrit.cloudera.org:8080/5972
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
Currently, codegen supports converting type attributes (e.g.
decimal type's precision and scale, type's size) obtained via
calls to FunctionContextImpl::GetFnAttr() (previously
Expr::GetConstantInt()) in cross-compiled code to runtime constants.
This change extends this support for the query option DECIMAL_V2.
To test this change, this change also handles a subset of IMPALA-2020:
casting between decimal values is updated to support rounding (instead
of truncation) when decimal_v2 is true.
This change also cleans up the existing code by moving the codegen
logic Expr::InlineConstant() to the codegen module and the type
related logic in Expr::GetConstantInt() to FunctionContextImpl.
Change-Id: I2434d240f65b81389b8a8ba027f980a0e1d1f981
Reviewed-on: http://gerrit.cloudera.org:8080/5950
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
This uses the existing infrastructure for codegening builtin UDAs and
for codegening calls to UDFs. GetUdf() is refactored to support both
UDFs and UDAs.
IR UDAs are still not allowed by the frontend. It's unclear if we want
to enable them going forward because of the difficulties in testing and
supporting IR UDFs/UDAs.
This also fixes some bugs with the Get*Type() methods of
FunctionContext. GetArgType() and related methods now always return the
logical input types of the aggregate function. Getting the tests to pass
required fixing IMPALA-4878 because they called GetIntermediateType().
Testing:
test_udfs.py tests UDAs with codegen enabled and disabled.
Added some asserts to test UDAs to check that the correct types are
passed in via the FunctionContext.
Change-Id: Id1708eaa96eb76fb9bec5eeabf209f81c88eec2f
Reviewed-on: http://gerrit.cloudera.org:8080/5161
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
In SelectList.reset(), we call reset() on each select list item's
expr. reset() is supposed to remove implicit casts, by returning
the reset expr with implicit cast exprs removed from the tree.
Previously SelectList.reset() ignored the return value of the calls
to Expr.reset(), meaning that if the top-most expr of the select list
item is an implicit cast, it won't actually get removed, which causes
problems with analysis since implicit casts are always treated as
pre-analyzed.
The solution is to set the select list item's exprs to the return
value of reset().
Testing:
- Added a regression test to exprs.test
Change-Id: I16ff88716b185e1d72d2bc603a42bd06c60ec18e
Reviewed-on: http://gerrit.cloudera.org:8080/5917
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
In calculating the STDDEV_SAMP/VARIANCE of N rows a divion by
N-1 rows is involved. Hence STDDEV_SAMP/VARIANCE for a single
row involves a division by 0. This change returns a NULL instead
of a 0 when calculating STDDEV_SAMP/VARIANCE for a single row.
STDDEV_POP/VARIANCE_POP for single row will still return a 0 since
this does not involve a division by 0. This matches the postgres
behavior.
Change-Id: Ide8af752cd8a2e554a2cd5a1ec948967a80de1fe
Reviewed-on: http://gerrit.cloudera.org:8080/5800
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
In the partitioned hash join node, if a spilled partition has no probe
rows, building the hash table is unnecessary.
For some build types (right outer, right anti, and full outer), we still
need to process the build side to output unmatched rows (in this case, all
rows since there were no probe rows to match).
Testing: Added some cases to spilling.test. Manually tested these cases
for performance, and they all show around a 6% improvement.
Change-Id: I175b32dd9031e51218b38c37693ac3e31dfab47b
Reviewed-on: http://gerrit.cloudera.org:8080/5389
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
ADD PARTITION
Just like Hive, Implala should support multiple partitions in ALTER
TABLE ADD PARTITION statements. The syntax is as follows:
ALTER TABLE table_name ADD [IF NOT EXISTS]
PARTITION partition_spec1 [location_spec1] [cache_spec1]
PARTITION partition_spec2 [location_spec2] [cache_spec2]
...
Grammar was modified to handle the new syntax. Introduced PartitionDef
class to capture the repeatable part of the statement. TPartitionDef
is the name of the corresponding thrift class.
AlterTableAddPartitionStmt and CatalogOpExecutor classes were also
modified to work with a list of partitions. Duplicate partition specs
are rejected in AlterTableAddPartitionStmt.analyze().
Added FE, E2E and integration tests.
Change-Id: Iddbc951f2931f488f7048c9780260f6b49100750
Reviewed-on: http://gerrit.cloudera.org:8080/4144
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes expr-test.cc to work with codegen as it's
originally intended. Fixing it uncovers a couple of bugs fixed
in this patch:
IMPALA-4705: When an IR function is materialized, its
function body is parsed to find all its callee functions
to be materialized too. However, the old code doesn't
detect callee fnctions referenced indirectly (e.g. a
callee function passed as argument to another function).
This change fixes the problem above inspecting the use
lists of llvm::Function objects. When parsing the bitcode
module into memory, LLVM already establishes a use list
for each llvm::Value object which llvm::Function is a
subclass of. A use list contains all the locations in
the module in which the Value is referenced. For a
llvm::Function object, that would be its call sites and
constant expressions referencing the functions. By using
the use lists of llvm::Function in the module, a global
map is established at Impala initialization time to map
functions to their corresponding callee functions. This
map is then used when materializing a function to ensure
all its callee functions are also materialized recursively.
IMPALA-4779: conditional function isfalse(), istrue(),
isnotfalse(), isnotrue() aren't cross-compiled so they
will lead to unexpected query failure when codegen is enabled.
This change will cross-compile these functions.
IMPALA-4780: next_day() always returns NULL when codegen
is enabled. The bound checks for next_day() use some class
static variables initialized in the global constructors
(@llvm.global_ctors). However, we never execute the global
constructors before calling the JIT compiled functions.
This causes these variables to remain as zero, causing all
executions of next_day() to fail the bound checks. The reason
why these class static variables aren't compiled as global
constants in LLVM IR is that TimestampFunctions::MIN_YEAR is
not a compile time constant. This change fixes the problem
above by setting TimestampFunctions::MIN_YEAR to a known constant
value. A DCHECK is added to verify that it matches the value
defined in the boost library.
Change-Id: I40fdb035a565ae2f9c9fbf4db48a548653ef7608
Reviewed-on: http://gerrit.cloudera.org:8080/5732
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
When there are conditionals with constant values of TRUE or
FALSE we can simplify them during analysis using the ExprRewriter.
This patch introduces the SimplifyConditionalsRule with covers IF,
OR, AND, CASE, and DECODE.
It also introduces NormalizeExprsRule which normalizes AND and OR
such that if either child is a BoolLiteral, then the left child is a
BoolLiteral.
Testing:
- Added unit tests to ExprRewriteRulesTest.
- Added functional tests to expr.test
- Ran FE planner tests and BE expr-test.
Change-Id: Id70aaf9fd99f64bd98175b7e2dbba28f350e7d3b
Reviewed-on: http://gerrit.cloudera.org:8080/5585
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
HIVE-15653 is a Hive Metastore bug that results in ALTER TABLE
commands wiping the table stats of unpartitioned tables.
Until the Hive bug is fixed, this patch adds a workaround
to Impala that forces the Metastore to preserve the table stats.
Testing: Private core/hdfs run passed.
Change-Id: Ic191c765f73624bc716badadd7215c8dca9d6b1f
Reviewed-on: http://gerrit.cloudera.org:8080/5731
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Previously Impala was inconsistent about whether the year 10000 was
supported, as a result of inconsistency in boost, which reported the
maximum year as 9999 but sometimes allowed 10000. This meant that
Impala sometimes accepted the year 10000 and sometimes not.
Use the patched boost version and update tests accordingly.
Testing:
Ran an exhaustive build.
Change-Id: Iaf23b40833017789d879e5da7bb10384129e2d10
Reviewed-on: http://gerrit.cloudera.org:8080/5665
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The DECODE constructor in CaseExpr uses the same decodeExpr object when
building the BinaryPredicates that compare the decodeExpr to each 'when'
of the DECODE. This causes problems when different BinaryPredicates try
to cast the same decodeExpr object to different types during analysis,
in this case leading to a Precondition check failure.
The solution is to clone the decodeExpr in the DECODE constructor in
CaseExpr for each generated BinaryPredicate.
Testing:
- Added a regression test to exprs.test
Change-Id: I4de9ed7118c8d18ec3f02ff74c9cca211c716e51
Reviewed-on: http://gerrit.cloudera.org:8080/5631
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
For a table that has both a table comment and a partition specified,
"show create table" incorrectly outputs the comment before the partition.
This is not the correct order, and it results in an invalid SQL.
This transaction fixes the ordering (partition comes before comment) and
adds tests for this case.
Change-Id: I29a33cfd142b473997fdc3acfe3f0966bc7ed784
Reviewed-on: http://gerrit.cloudera.org:8080/5648
Tested-by: Impala Public Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
The bug was that expr rewrite rules such as ExtractCommonConjunctRule
analyzed their own output, which doesn't work for syntactic elements
that allow column aliases, such as the HAVING clause.
The fix was to remove the analysis step (the re-analysis happens anyway
in AnalysisCtx).
Change-Id: Ife74c61f549f620c42f74928f6474e8a5a7b7f00
Reviewed-on: http://gerrit.cloudera.org:8080/5662
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Impala Public Jenkins
Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into
TmpFileMgr, in anticipation of it being shared with BufferPool.
TmpFileMgr now handles:
* Scratch space allocation and recycling
* Read and write I/O
The interface is also greatly changed so that it is built around Write()
and Read() calls, abstracting away the details of temporary file
allocation from clients. This means the TmpFileMgr::File class can
be hidden from clients.
Write error recovery:
Also implement write error recovery in TmpFileMgr.
If an error occurs while writing to scratch and we have multiple
scratch directories, we will try one of the other directories
before cancelling the query. File-level blacklisting is used to
prevent excessive repeated attempts to resize a scratch file during
a single query. Device-level blacklisting is not implemented because
it is problematic to permanently take a scratch directory out of use.
To reduce the number of error paths, all I/O errors are now handled
asynchronously. Previously errors creating or extending the file were
returned synchronously from WriteUnpinnedBlock(). This required
modifying DiskIoMgr to create the file if not present when opened.
Also set the default max_errors value in the thrift definition file,
so that it is in effect for backend tests.
Future Work:
* Support for recycling variable-length scratch file ranges. I omitted
this to avoid making the patch even large.
Testing:
Updated BufferedBlockMgr unit test to reflect changes in behaviour:
* Scratch space is no longer permanently associated with a block, and
is remapped every time a new block is written to disk .
* Files are now blacklisted - updated existing tests and enable the
disable blacklisting test.
Added some basic testing of recycling of scratch file ranges in
the TmpFileMgr unit test.
I also manually tested the code in two ways. First by removing permissions
for /tmp/impala-scratch and ensuring that a spilling query fails cleanly.
Second, by creating a tiny ramdisk (16M) and running with two scratch
directories: one on /tmp and one on the tiny ramdisk. When spilling, an
out of space error is encountered for the tiny ramdisk and impala spills
the remaining data (72M) to /tmp.
Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17
Reviewed-on: http://gerrit.cloudera.org:8080/5141
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This commit makes ADD PARTITION operations treat string partition-key
values as case sensitive in consistent with other related partition DDL
operations.
Change-Id: I6fbe67d99df8a50a16a18456fde85d03d622c7a1
Reviewed-on: http://gerrit.cloudera.org:8080/5535
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change introduces the sortby() query plan hint for insert
statements. When specified, sortby(a, b) will add an additional sort
step to the plan to order data by columns a, b before inserting it into
the target table.
Change-Id: I37a3ffab99aaa5d5a4fd1ac674b3e8b394a3c4c0
Reviewed-on: http://gerrit.cloudera.org:8080/5051
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
This patch ensures that setting the query option
enable_expr_rewrites=false will disable both constant folding in the
frontend (which it did already) and constant caching in the backend
(which is enabled in this patch). This gives a way for users to revert
to the old behaviour of non-deterministic UDFs before these
optimisations were added in Impala 2.8.
Before this patch, the backend would cache values based on IsConstant().
This meant that there was no way to override caching of values of
non-deterministic UDFs, e.g. with enable_expr_rewrites.
After this patch, we only cache literal values in the backend. This
offers the same performance as before in the common case where the
frontend will constant fold the expressions anyway.
Also rename some functions to more cleanly separate the backend concepts
of "constant" expressions and expressions that can be evaluated without
a TupleRow. In a future change (IMPALA-4617) we should remove the
IsConstant() analysis logic from the backend entirely and pass the
information from the frontend. We should also fix isConstant() in the
frontend so that it only returns true when it is safe to constant-fold
the expression (IMPALA-4606). Once that is done, we could revert back
to using IsConstant() instead of IsLiteral().
Testing:
Added targeted test to test constant folding of UDFs: we expect
different results depending on whether constant folding is enabled.
Also run TestUdfs with expr rewrites enabled and disabled, since this
can exercise different code paths. Refactored test_udfs somewhat to
avoid running uninteresting combinations of query options for
targeted tests and removed some 'drop * if not exists' statements
that aren't necessary when using unique_database.
This change revealed flakiness in test_mem_limit, which seems
to have only worked by coincidence. Updated TrackAllocation() to
actually set the query status when a memory limit is exceeded.
Looped this test for a while to make sure it isn't flaky any
more.
Also fix other test bugs where the vector argument is modified
in-place, which can leak out to other tests.
Change-Id: I0c76e3c8a8d92749256c312080ecd7aac5d99ce7
Reviewed-on: http://gerrit.cloudera.org:8080/5391
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This commit fixes an issue where an error is thrown if the default value
for a Kudu column is set to NULL.
Change-Id: Ida27ce56f1dd7603485a69c680db3bcea6702aff
Reviewed-on: http://gerrit.cloudera.org:8080/5405
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
In a recent change (IMPALA-4363) we introduced a change where all file
paths in .test files should be replaced with '__HDFS_FILENAME__'. This
caused problems for tests on non-HDFS file systems and we also lost some
test coverage. This patch fixes the problem by allowing the $DATABASE
template in the catch section of the .test file.
Change-Id: If0f6ae8dea7ac4cdaf0c61ebd8f0c589c353a96e
Reviewed-on: http://gerrit.cloudera.org:8080/5372
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
For LLVM IR UDF, Impalad will link an external LLVM module
in which the IR UDF is defined with the main module. If it
happens that a symbol is defined in both modules, LLVM may
choose to discard the one defined in the external module.
The discarded function and its callee will not be present
in the linked module.
In IMPALA-4595, udf-sample.cc was compiled without any
optimization. Duplicated definition such as StringVal::null()
may have different inlining level between the external module
and the main module. When the duplicated definition in
the external module is discarded, some of its callee
functions (which are not inlined) may not be defined in the
main module so they can no longer be located in the linked
module. This trips up some code in the LlvmCodegen::LinkModule().
In particular, when parsing for functions in external module
which are materialized during linking, certain functions may
not be present due to the reason above. Impalad will hit
a DCHECK in debug build or crash due to null pointer access
in release build.
This change fixes the problem above by taking into account
that certain functions may not be defined anymore after linking.
This change also fixes two incorrect status propagation in
fe-support.cc.
Change-Id: Iaa056a0c888bfcc95b412e1bc1063bb607b58ab7
Reviewed-on: http://gerrit.cloudera.org:8080/5384
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins
Bug: Commit 6f31c7 fixed a crash when setting Avro schemas for
tables with storage altered to Avro file format. However the
fix was incomplete for partitioned/multi file format tables since
'hasAvroData_' is not set for all code paths that load the
partitioned tables (For example: HdfsTable#loadAllPartitions()).
Fix: Moved the code for setting 'hasAvroData_' to addPartition()
which is the common logic for all code paths adding new partitions.
Also fixed the test coverage gap by adding a new test for partitioned
tables altered to Avro format.
Change-Id: I7854ff002b2277ec4a5388216218a1d5ad142de8
Reviewed-on: http://gerrit.cloudera.org:8080/5388
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This commit also removes the now unused `DISTRIBUTE`, `SPLIT`, and
`BUCKETS` keywords that were going to be newly released in Impala 2.6,
but are now unused. Additionally, a few remaining uses of the
`DISTRIBUTE BY` syntax has been switched to `PARTITION BY`.
Change-Id: I32fdd5ef26c532f7a30220db52bdfbf228165922
Reviewed-on: http://gerrit.cloudera.org:8080/5382
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Bumps the toolchain version to get a newer Kudu build.
Also fixes test failures resulting from changes in Kudu.
Notably error strings have changed (IMPALA-4590) and the
number of replicas must be odd (IMPALA-4589).
Note: The toolchain binaries starting with this build are
now using the toolchain binutils rather than the system
binutils.
Testing: private exhaustive build.
Change-Id: If1912f058c240fbe82b06f77e31add7755289be1
Reviewed-on: http://gerrit.cloudera.org:8080/5369
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
A recent change (IMPALA-1788) lead UUID() to be constant folded,
and therefore, produce the same value for every invocation across
rows. Similar issues might also occur due to the BE optimizing
UUID() during codegen of scalar-fn-call.h/cc.
The fix is to not treat UUID() like a constant expr in both
the FE and BE.
Discussion:
The fix in this patch is rather blunt, but minimally invasive to
reduce the risk of adding new bugs. Ideally, the constness of an
Expr should be determined in one place and the FE and BE should agree
on which Exprs are constant. I considered the following alternatives
but concluded they were too risky:
1. Pass a flag from FE to BE for ever Expr indicating its constness.
This simple solution would populate a thrift field with the
result of Expr.isConstant() for every Expr in an Expr tree.
There are several issues. Calling isConstant() for every Expr
in an Expr tree is rather expensive due to repeated traversals
of the tree. That could be mitigated by populating an isConstant
flag during Expr.analyze() to avoid re-computing the constness
repeatedly. This requires changes to analyze(), clone(), reset(),
and possibly other places for many Exprs. There is potential
for missing a place and adding a new bug.
2. The above solution could be limited to only FunctionCallExpr.
However, the BE expr type FUNCTION_CALL which maps to
scalar-fn-call.h/cc is created from various FE Exprs, not just
FunctionCallExpr. So adding a flag only to scalar-fn-call.h/cc
would be confusing because it would only sometimes be set
in a meaningful way. This seems more confusing than the current
straightforward solution.
Testing: Added FE and EE tests.
Change-Id: If2499f5f6ecdcb098623202c8e6dc2d02727194a
Reviewed-on: http://gerrit.cloudera.org:8080/5324
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This commit changes the behavior of alter table operations on Kudu
tables from asynchronous to synchronous. With this change, alter table
operations return when either the operations complete successfully or
a timeout is reached.
Change-Id: I385bce66691ae9040e72f97557e1bba31009e36b
Reviewed-on: http://gerrit.cloudera.org:8080/5364
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
This commit also fixes an issue where an error is thrown if a default
value is set for a boolean column on a Kudu table.
Change-Id: I25b66275d29d1cf21df14e78ab58f625a83b0725
Reviewed-on: http://gerrit.cloudera.org:8080/5337
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Impala Public Jenkins
The bugs was that the functions did not check whether the conversion
pushed the value out of range. The fix is to use boost's validation
immediately to check the validity of the timestamp and catch any
exceptions thrown.
It would be preferable to avoid the exceptions, but Boost does not
provide a straightforward way to disable the exceptions or extract
potentially-invalid values from a date object.
Testing:
Added expression tests that exercise out-of-range cases. Also
added additional tests to confirm that date addition and subtraction
weren't affected by similar bugs.
Change-Id: Idc427b06ac33ec874a05cb98d01c00e970d3dde6
Reviewed-on: http://gerrit.cloudera.org:8080/5251
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This commit fixes an issue where a SHOW CREATE VIEW statement throws an
analysis error if the view contains a subquery.
Change-Id: I4a89e46a022f0ccec198b6e3e2b30230103831ce
Reviewed-on: http://gerrit.cloudera.org:8080/5333
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Before this patch, we would simply read the INT96 Parquet timestamp
representation and assume that it's valid. However, not all bit
permutations represent a valid timestamp. One of the boost functions
raised an exception (that we didn't catch) when passed an invalid
boost date object, which resulted in a crash. This patch fixes
problem by validating that the date falls into 1400..9999 year
range as we are scanning Parquet.
Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac
Reviewed-on: http://gerrit.cloudera.org:8080/5343
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This commit reverts the behavior introduced by IMPALA-3719 which used
the Kudu default behavior for column nullability if none was specified
in the CREATE TABLE statement. With this commit, non-key columns of Kudu
tables that are created from Impala are by default nullable unless
specified otherwise.
Change-Id: I950d9a9c64e3851e11a641573617790b340ece94
Reviewed-on: http://gerrit.cloudera.org:8080/5259
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins