The bug was that the functions did not check whether the conversion
pushed the value out of range. The fix is to use boost's validation
immediately to check the validity of the timestamp and catch any
exceptions thrown.
It would be preferable to avoid the exceptions, but Boost does not
provide a straightforward way to disable the exceptions or extract
potentially-invalid values from a date object.
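A minimal sketch of the validate-and-catch approach, assuming boost::gregorian and an
illustrative helper name (not Impala's actual function):

#include <exception>
#include <boost/date_time/gregorian/gregorian.hpp>

// Sketch only: let boost validate the date and catch the exception it throws
// for out-of-range values (e.g. bad_year derives from std::out_of_range).
static bool IsValidDate(int year, int month, int day) {
  try {
    boost::gregorian::date d(year, month, day);
    return !d.is_special();
  } catch (const std::exception&) {
    return false;
  }
}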
Testing:
Added expression tests that exercise out-of-range cases. Also
added additional tests to confirm that date addition and subtraction
weren't affected by similar bugs.
Change-Id: Idc427b06ac33ec874a05cb98d01c00e970d3dde6
Reviewed-on: http://gerrit.cloudera.org:8080/5251
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This commit fixes an issue where a SHOW CREATE VIEW statement throws an
analysis error if the view contains a subquery.
Change-Id: I4a89e46a022f0ccec198b6e3e2b30230103831ce
Reviewed-on: http://gerrit.cloudera.org:8080/5333
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Before this patch, we would simply read the INT96 Parquet timestamp
representation and assume that it's valid. However, not all bit
permutations represent a valid timestamp. One of the boost functions
raised an exception (that we didn't catch) when passed an invalid
boost date object, which resulted in a crash. This patch fixes the
problem by validating that the date falls into the 1400..9999 year
range as we scan Parquet.
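A rough sketch of the range check, assuming the Julian-day portion of the INT96 value has
already been extracted; the names are illustrative, not the actual scanner code:

#include <cstdint>
#include <boost/date_time/gregorian/gregorian.hpp>

// Sketch only: precompute the Julian day numbers for the supported range and
// reject values outside it before constructing a boost date object.
static bool IsDateInSupportedRange(uint32_t julian_day) {
  using boost::gregorian::date;
  static const uint32_t kMinSupportedDay = date(1400, 1, 1).julian_day();
  static const uint32_t kMaxSupportedDay = date(9999, 12, 31).julian_day();
  return julian_day >= kMinSupportedDay && julian_day <= kMaxSupportedDay;
}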
Change-Id: Ieaab5d33e6f0df831d0e67e1d318e5416ffb90ac
Reviewed-on: http://gerrit.cloudera.org:8080/5343
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This commit reverts the behavior introduced by IMPALA-3719 which used
the Kudu default behavior for column nullability if none was specified
in the CREATE TABLE statement. With this commit, non-key columns of Kudu
tables that are created from Impala are by default nullable unless
specified otherwise.
Change-Id: I950d9a9c64e3851e11a641573617790b340ece94
Reviewed-on: http://gerrit.cloudera.org:8080/5259
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Fix a test bug where we need to skip nested types tests for the old aggs
and joins.
Fix a product bug where *eos is not initialised by the MT scan node.
This causes incorrect results when the calling ExecNode does not
initialise the eos variable, e.g. the sort node and the old agg and join
nodes.
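A self-contained sketch of the pattern behind the fix, with illustrative types rather
than Impala's ExecNode interface: the callee must set *eos on every call instead of
relying on the caller to pre-initialise it.

#include <cstddef>
#include <vector>

// Sketch only: a GetNext()-style function that always initialises its *eos
// out-parameter. Leaving it untouched lets a stale value from the caller's
// stack leak into the control flow, which is the class of bug described above.
struct RowBatch { std::vector<int> rows; };

static void GetNext(const std::vector<int>& input, std::size_t* pos, RowBatch* batch,
                    bool* eos) {
  *eos = false;  // Always set the out-parameter; do not rely on the caller.
  batch->rows.clear();
  while (*pos < input.size() && batch->rows.size() < 4) {
    batch->rows.push_back(input[(*pos)++]);
  }
  if (*pos == input.size()) *eos = true;
}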
Testing:
Added a test that reproduces the incorrect results with the sort node
when run under ASAN.
Tested the mt_dop tests locally with old aggs and joins to ensure they
pass.
Change-Id: I48c50c8aa0c23710eb099fba252bc3c0cb74b313
Reviewed-on: http://gerrit.cloudera.org:8080/5302
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The bug was that HdfsScanNodeMt::Close() did not properly
clean up all in-flight resources when called through the
query cancellation path.
The main change is to clean up all resources when passing
a NULL batch into HdfsParquetScanner::Close(), which also
needed similar changes in the scanner context.
Testing: Ran test_cancellation.py, test_scanners.py and
test_nested_types.py with MT_DOP=3. Added a test query
with a limit that was failing before.
A regular private hdfs/core test run succeeded.
Change-Id: Ib32f87b3289ed9e8fc2db0885675845e11207438
Reviewed-on: http://gerrit.cloudera.org:8080/5274
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
During slot substitution, the type of the child of a CastExpr can
change. If the previous child type matched the CastExpr, then the cast
was flagged as noOp_. During substitution and subsequent re-analysis
the noOp_ flag was not revisited so that no cast was performed, even
after it had become necessary.
The fix is to always set noOp_ to the correct value in
CastExpr.analyze().
Change-Id: I7f29cdc359558fad6df455b8eec0e0eaed00e996
Reviewed-on: http://gerrit.cloudera.org:8080/5267
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
With this commit, we add support for additional ALTER TABLE statements
against Kudu tables. The new supported ALTER TABLE operations for Kudu are:
- ADD/DROP range partitions. Syntax:
ALTER TABLE <tbl_name> ADD [IF NOT EXISTS] RANGE <kudu_partition_spec>
ALTER TABLE <tbl_name> DROP [IF EXISTS] RANGE <kudu_partition_spec>
- ADD/DROP/RENAME column. Syntax:
ALTER TABLE <tbl_name> ADD COLUMNS (col_spec, [col_spec, ...])
ALTER TABLE <tbl_name> DROP COLUMN <col_name>
ALTER TABLE <tbl_name> CHANGE COLUMN <old> <new_name> <type>
- Rename a Kudu table using the 'kudu.table_name' table property. Example:
ALTER TABLE <tbl_name> SET TBLPROPERTIES ('kudu.table_name'='<new_name>')
will change the underlying Kudu table name to <new_name>.
- Renaming the HMS/Catalog table entry of a Kudu table is supported using the
existing ALTER TABLE <tbl_name> RENAME TO <new_tbl_name> syntax.
Not supported:
- ALTER TABLE <tbl_name> REPLACE COLUMNS
Change-Id: I04bc87e04e05da5cc03edec79d13cedfd2012896
Reviewed-on: http://gerrit.cloudera.org:8080/5136
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
The error path in OptimizeLlvmModule() has not worked correctly for a
long time because various places in the code assume that codegen'd
function pointers will be filled in (e.g. ScalarFnCall). Since the
recent change "IMPALA-4397,IMPALA-3259: reduce codegen time and memory",
queries are more likely to go down this path.
The errors that occur on this path (memory limit exceeded, internal
codegen bugs, and corrupt IR UDFs) are all cases where it is not correct
or safe to continue executing the query, so we should just fail the
query.
Testing:
Add a test where codegen reliably fails with memory limit exceeded.
Change-Id: Ib38d0a44b54c47617cad1b971244f477d344d505
Reviewed-on: http://gerrit.cloudera.org:8080/5211
Reviewed-by: Michael Ho <kwho@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Previously, CopyStringVal() mistakenly copied a null
StringVal as an empty string (i.e. a non-null string
with zero length). This change fixes the problem by
distinguishing between these two cases in CopyStringVal()
and handling them properly. Also added a test case for it.
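A sketch of the distinction being made, using the public UDF types (FunctionContext,
StringVal); the helper below is illustrative, not the actual Impala implementation:

#include <cstdint>
#include <cstring>
#include "udf/udf.h"  // impala_udf::FunctionContext, impala_udf::StringVal

using impala_udf::FunctionContext;
using impala_udf::StringVal;

// Sketch only: preserve the difference between a NULL StringVal and an empty
// (non-null, zero-length) one when making a deep copy.
static void CopyStringValSketch(FunctionContext* ctx, const StringVal& src,
                                StringVal* dst) {
  if (src.is_null) {
    *dst = StringVal::null();  // Previously this collapsed into an empty string.
    return;
  }
  uint8_t* buf = ctx->Allocate(src.len);
  if (buf == nullptr) return;  // Allocation failure is reported via the context.
  if (src.len > 0) memcpy(buf, src.ptr, src.len);
  *dst = StringVal(buf, src.len);
}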
This problem only started showing up recently due to
commit 51268c053f, which
calls CopyStringVal() in OffsetFnInit(). All other
pre-existing callers of CopyStringVal() before that
commit checked whether 'src' is null before calling it, so
the problem never showed up. In that sense, this is
a latent bug exposed by the aforementioned commit.
Change-Id: I3a5b9349dd08556eba5cfedc8c0063cc59f5be03
Reviewed-on: http://gerrit.cloudera.org:8080/5198
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Adds a new ExprRewriteRule for replacing constant expressions
with their literal equivalent via BE evaluation. Applies the
new rule together with the existing ones on the parse tree,
after analysis.
Limitations:
- Constant folding is applied on the unresolved expressions.
As a result, it only works for expressions that are constant
within a single query block, as opposed to expressions that
may become constant after fully substituting inline-view exprs.
- Exprs are not normalized, so some opportunities for constant
folding are missed for certain expr-tree shapes.
This patch includes the following interesting changes:
- Introduces a timestamp literal that can only be produced
by constant folding (not expressible directly via SQL).
- To make sure that rewrites have no user-visible effect,
the original result types and column labels of the top-level
statement are restored after the rewrites are performed.
- Does not fold exprs if their evaluation resulted in a
warning or error, or if the resulting value is not
representable by a corresponding FE LiteralExpr.
- Fixes an existing issue with converting strings between
the FE/BE. Strings produced in the BE that have characters
with a value > 127 are not correctly deserialized into a
Java String via thrift. We detect this case during constant
folding and abandon folding of such exprs.
- Fixes several issues with detecting/reporting errors in
NativeEvalConstExprs().
- Cleans up ExprContext::GetValue() into
ExprContext::GetConstantValue(), which clarifies that its only
use is evaluating constant exprs from the FE.
Testing:
- Modifies expr-test.cc to run all tests through the constant
folding path.
- Adds basic planner and rewrite rule tests.
- Exhaustive test run passed
Change-Id: If672b703db1ba0bfc26e5b9130161798b40a69e9
Reviewed-on: http://gerrit.cloudera.org:8080/5109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
A handful of fixes to codegen memory usage:
* Delete the IR module when we're done with it (it can be fairly large)
* Track the compiled code size (typically not that large, but it can add
up if there are many fragments).
* Estimate optimisation memory requirements and track them in the memory
tracker. This is very crude but much better than not tracking them.
A handful of fixes to improve codegen time/cost, particularly targeted
at compute stats workloads:
* Avoid over-inlining when there are many aggregate functions,
conjuncts, etc. by adding "NoInline" attributes (see the sketch
after this list).
* Don't codegen non-grouping merge aggregations. They will only process
one row per Impala daemon, so codegen is not worth it.
* Make the Hll algorithm more efficient by specialising the hash function
based on decimal width.
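The sketch referenced above: marking a generated function with LLVM's NoInline
attribute. This is an illustrative use of the LLVM C++ API, not Impala's codegen module.

#include <llvm/IR/Attributes.h>
#include <llvm/IR/BasicBlock.h>
#include <llvm/IR/DerivedTypes.h>
#include <llvm/IR/Function.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>

// Sketch only: create a trivial generated function and mark it NoInline so the
// inliner leaves its call sites alone when a query has many such helpers.
llvm::Function* MakeNoInlineStub(llvm::Module* module, llvm::LLVMContext& ctx) {
  llvm::FunctionType* fn_type =
      llvm::FunctionType::get(llvm::Type::getVoidTy(ctx), /*isVarArg=*/false);
  llvm::Function* fn = llvm::Function::Create(
      fn_type, llvm::Function::ExternalLinkage, "agg_update_stub", module);
  fn->addFnAttr(llvm::Attribute::NoInline);  // Keep this helper out of inlining.
  llvm::IRBuilder<> builder(llvm::BasicBlock::Create(ctx, "entry", fn));
  builder.CreateRetVoid();
  return fn;
}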
Limitations:
* This doesn't tackle over-inlining of large expr trees, but a similar
approach will be used there in a follow-on patch.
Perf:
Compute stats on functional_parquet.widetable_1000_cols goes from 1min+
of codegen to ~ 5s codegen on my machine. Local perf runs of tpc-h
and targeted perf showed no regressions and some moderate improvements
(1-2%).
Also did an experiment to understand the perf consequences of disabling
inlining. I manually set CODEGEN_INLINE_EXPRS_THRESHOLD to 0, and ran:
drop stats tpch_20_parquet.lineitem
compute stats tpch_20_parquet.lineitem;
There was no difference in time spent in the agg node: 30.7s with
inlining, 30.5s without.
Change-Id: Id10015b49da182cb181a653ac8464b4a18b71091
Reviewed-on: http://gerrit.cloudera.org:8080/4956
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.
This change also adds/modifies tests in several ways:
- clustered insert tests switch from selecting all rows from
alltypessmall to alltypes. Together with varying settings for
batch_size, this results in a larger number of row batches being
written.
- clustered insert tests select from alltypes instead of
functional.alltypes to make sure we also select from various input
formats.
- clustered insert tests have been added to select from alltypestiny to
create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
16, 0 (meaning default, 1024). This is limited to uncompressed parquet
files, to maintain a reasonable runtime. On my machine execution of
test.insert took 1778 seconds, compared to 1002 seconds with just the
default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
that insertion over several row batches only creates one file per
partition.
- It renames the test_insert method to make it unique in the file and
allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.
Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Reviewed-on: http://gerrit.cloudera.org:8080/4863
Reviewed-by: Lars Volker <lv@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This is because that test uses 'set cached' and 'set uncached' which
are not supported on non-HDFS filesystems. This patch creates a
separate test file for non-HDFS filesystems with only supported
queries and invokes the right file based on the filesystem.
Change-Id: I8606aa427cb6e50be3395cdde246abb53db5172c
Reviewed-on: http://gerrit.cloudera.org:8080/5164
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This commit adds support for Kudu-specific column options in CREATE
TABLE statements. The syntax is:
CREATE TABLE tbl_name ([col_name type [PRIMARY KEY] [option [...]]] [, ....])
where option is:
| NULL
| NOT NULL
| ENCODING encoding_val
| COMPRESSION compression_algorithm
| DEFAULT expr
| BLOCK_SIZE num
The output of the SHOW CREATE TABLE statement was altered to include all the specified
column options for Kudu tables.
Change-Id: I727b9ae1b7b2387db752b58081398dd3f3449c02
Reviewed-on: http://gerrit.cloudera.org:8080/5026
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Before this patch, an unclear error message was returned if DATE or
DATETIME appeared in the select list after a star expansion. This was
because the DATE and DATETIME PrimitiveTypes were serialized as INVALID_TYPE.
This is fixed by serializing them correctly.
Change-Id: I9019b4bfd219f94e554c795befd3ff5e39706ea9
Reviewed-on: http://gerrit.cloudera.org:8080/4859
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The results in the test files were verified by hand.
This patch also introduces a new test section 'DML_RESULTS', which
takes the name of a table as a comment and the contents of the
table as its body and then verifies that the body matches the
actual contents of the table. This makes it easy to check that a
DML operation has the desired effect on the contents of a table,
rather than always having to add another test case that runs a
select on the table. For now, this section cannot be used in a
test along with the RESULTS or ERRORS sections.
TODO: Refactor the DML test case handling (IMPALA-4471)
Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6
Reviewed-on: http://gerrit.cloudera.org:8080/4953
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
A recent change caused 'describe formatted' to display the types
in all upper case, but we want 'describe formatted' to match Hive's
'describe' output, which displays the types in lower case.
This patch also fixes several problems with test_describe_formatted,
which was encountering an error but reporting success.
Change-Id: I274b97d4d1247244247fb38a5ca7f4c10bba8d22
Reviewed-on: http://gerrit.cloudera.org:8080/4861
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
This commit handles partition related DDL in a more general way. We can
now use compound predicates to specify a list of partitions in
statements like ALTER TABLE DROP PARTITION and COMPUTE INCREMENTAL
STATS, etc. It will also make sure some statements only accept one
partition at a time, such as PARTITION SET LOCATION and LOAD DATA. ALTER
TABLE ADD PARTITION continues to use the old PartitionKeyValue logic.
The changed partition related DDLs are as follows,
Table: p (i int) partitioned by (j int, k string)
Partitions:
+-------+---+-------+--------+------+--------------+-------------------+
| j | k | #Rows | #Files | Size | Bytes Cached | Cache Replication |
+-------+---+-------+--------+------+--------------+-------------------+
| 1 | a | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| 1 | b | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| 1 | c | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| 2 | d | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| 2 | e | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| 2 | f | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| Total | | -1 | 0 | 0B | 0B | |
+-------+---+-------+--------+------+--------------+-------------------+
1. show files in p partition (j<2, k='a');
2. alter table p partition (j<2, k in ("b","c")) set cached in 'testPool';
// j can appear more than once,
3.1. alter table p partition (j<2, j>0, k<>"d") set uncached;
// it is the same as
3.2. alter table p partition (j<2 and j>0, not k="e") set uncached;
// we can also do 'or'
3.3. alter table p partition (j<2 or j>0, k like "%") set uncached;
// missing 'k' matches all values of k
4. alter table p partition (j<2) set fileformat textfile;
5. alter table p partition (k rlike ".*") set serdeproperties ("k"="v");
6. alter table p partition (j is not null) set tblproperties ("k"="v");
7. alter table p drop partition (j<2);
8. compute incremental stats p partition(j<2);
The remaining old partition related DDLs are as follows,
1. load data inpath '/path/from' into table p partition (j=2, k="d");
2. alter table p add partition (j=2, k="g");
3. alter table p partition (j=2, k="g") set location '/path/to';
4. insert into p partition (j=2, k="g") values (1), (2), (3);
General partition expressions or partially specified partition specs
allow partition predicates to return an empty partition set, regardless of
whether 'IF EXISTS' is specified.
Examples:
[localhost.localdomain:21000] >
alter table p drop partition (j=2, k="f");
Query: alter table p drop partition (j=2, k="f")
+-------------------------+
| summary |
+-------------------------+
| Dropped 1 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.78s
[localhost.localdomain:21000] >
alter table p drop partition (j=2, k<"f");
Query: alter table p drop partition (j=2, k<"f")
+-------------------------+
| summary |
+-------------------------+
| Dropped 2 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.41s
[localhost.localdomain:21000] >
alter table p drop partition (k="a");
Query: alter table p drop partition (k="a")
+-------------------------+
| summary |
+-------------------------+
| Dropped 1 partition(s). |
+-------------------------+
Fetched 1 row(s) in 0.25s
[localhost.localdomain:21000] > show partitions p;
Query: show partitions p
+-------+---+-------+--------+------+--------------+-------------------+
| j | k | #Rows | #Files | Size | Bytes Cached | Cache Replication |
+-------+---+-------+--------+------+--------------+-------------------+
| 1 | b | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| 1 | c | -1 | 0 | 0B | NOT CACHED | NOT CACHED |
| Total | | -1 | 0 | 0B | 0B | |
+-------+---+-------+--------+------+--------------+-------------------+
Fetched 3 row(s) in 0.01s
Change-Id: I2c9162fcf9d227b8daf4c2e761d57bab4e26408f
Reviewed-on: http://gerrit.cloudera.org:8080/3942
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Second part of IMPALA-3710, which removed the IGNORE DML
option and changed the following errors on Kudu DML
operations to be ignored:
1) INSERT where the PK already exists
2) UPDATE/DELETE where the PK doesn't exist
This changes other data-related errors to be ignored as
well:
3) NULLs in non-nullable columns, i.e. null constraint
violations.
4) Rows with PKs that are in an 'uncovered range'.
It became clear that we can't differentiate between (3) and
(4) because both return a Kudu 'NotFound' error code. The
Impala error codes have been simplified as well: we just
report a generic KUDU_NOT_FOUND error in these cases.
This also adds some metadata to the thrift report sent to
the coordinator from sinks so the total number of rows with
errors can be added to the profile. Note that this does not
include a breakdown of error counts by type/code because we
cannot differentiate between all of these cases yet.
An upcoming change will add this new info to the beeswax
interface and show it in the shell output (IMPALA-3713).
Testing: Updated kudu_crud tests to check the number of rows
with errors.
Change-Id: I4eb1ad91dc355ea51de261c3a14df0f9d28c879c
Reviewed-on: http://gerrit.cloudera.org:8080/4985
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This change enables codegen for all builtin aggregate functions,
e.g. timestamp functions and group_concat.
There are several parts to the change:
* Adding support for generic UDAs. Previously, the codegen code did not
handle multiple input arguments or NULL return values.
* Defaulting to using the UDA interface when there is not a special
codegen path (we have implementations of all builtin aggregate
functions for the interpreted path).
* Removing all the logic to disable codegen for the special cases that
are now supported.
Also fix the generation of code to get/set NULL bits since I needed
to add functionality there anyway.
Testing:
Add tests that check that codegen was enabled for builtin aggregate
functions. Also fix some gaps in the preexisting tests.
Also add tests for UDAs that check input/output nulls are handled
correctly, in anticipation of enabling codegen for arbitrary UDAs.
The tests are run with both codegen enabled and disabled. To avoid
flaky tests, we switch the UDF tests to use "unique_database".
Perf:
Ran local TPC-H and targeted perf. Spent a lot of time on TPC-H Q1,
since my original approach regressed it ~5%. In the end the problem was
to do with the ordering of loads/stores to the slot and null bit in the
generated code: the previous version of the code exploited some
properties of the particular aggregate function. I ended up replicating
this behaviour to avoid regressing perf.
Change-Id: Id9dc21d1d676505d3617e1e4f37557397c4fb260
Reviewed-on: http://gerrit.cloudera.org:8080/4655
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Removes the non-standard IGNORE syntax that was allowed for
DML into Kudu tables to indicate that certain errors should
be ignored, i.e. the query should continue rather than fail. However,
because there is no way to 'roll back' mutations that
occurred before an error occurs, tables are left in an
inconsistent state and it's difficult to know which rows were
successfully modified and which were not. Instead, this
change makes it so that we always 'ignore' these conflicts,
i.e. a 'best effort'. In the future, when Kudu provides
the mechanisms Impala needs to offer a notion of isolation
levels, Impala will be able to provide options for more
traditional semantics.
After this change, the following errors are ignored:
* INSERT where the PK already exists
* UPDATE/DELETE where the PK doesn't exist
Another follow-up patch will change other violations to be
handled in this way as well, e.g. nulls inserted in
non-nullable cols.
Reporting:
The number of rows inserted is reported to the coordinator,
which makes the aggregate available to the shell and via the
profile.
TODO: Return rows modified for INSERT via HS2 (IMPALA-1789).
TODO: Return rows modified for other CRUD (beeswax+hs2) (IMPALA-3713).
TODO: Return error counts for specific warnings (IMPALA-4416).
Testing:
Updated tests. Ran all functional tests. More tests will be
needed when other conflicts are handled in the same way.
Change-Id: I83b5beaa982d006da4997a2af061ef7c22cad3f1
Reviewed-on: http://gerrit.cloudera.org:8080/4911
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
The memory management of string results was wrong: strings returned from
Exprs must live until the next time FreeLocalAllocations() is called.
Otherwise the buffer holding the string is freed or reused by the next
UDF call. The fix is to copy string values into a buffer with the
right lifetime.
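A self-contained illustration of the lifetime issue and the copy-based fix, using
stand-in types rather than Impala's Expr machinery:

#include <cstring>
#include <string>
#include <vector>

// Sketch only: a result pointing into a reusable scratch buffer must be deep
// copied into storage with the right lifetime before the buffer is reused,
// otherwise the next call clobbers it.
struct Scratch {
  std::vector<char> buf;
  char* Alloc(std::size_t n) { buf.assign(n, '\0'); return buf.data(); }  // Reuses storage.
};

int main() {
  Scratch scratch;
  char* r1 = scratch.Alloc(5);
  memcpy(r1, "hello", 5);
  std::string copied(r1, 5);          // Deep copy with the right lifetime.
  char* r2 = scratch.Alloc(5);
  memcpy(r2, "world", 5);             // May reuse the same storage as r1.
  return copied == "hello" ? 0 : 1;   // The copy is still intact.
}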
Testing:
Added a regression test based on Bharath's example that reproduced the
bug reliably.
Change-Id: I705d271814cb1143f67d8a12f4fd87bab7a8e161
Reviewed-on: http://gerrit.cloudera.org:8080/4941
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch fixes two issues around handling of constant expr args.
The patches are combined because they touch some of the same code
and depend on some of the same memory management cleanup.
First, it fixes IMPALA-2379, where constant expr args were not visible
to UDAFs. The issue is that the input exprs need to be opened before
calling the UDAF Init() function.
Second, it avoids overhead from repeated evaluation of constant
arguments for ScalarFnCall expressions on both the codegen'd and
interpreted paths. A common example is an IN predicate with a
long list of constant values.
The interpreted path was inefficient because it always evaluated all
child expressions. Instead, in this patch constant args are
evaluated once and cached. The memory management of the AnyVal*
objects was somewhat nebulous; this patch adjusts it so that they're
allocated from ExprContext::mem_pool_, which has the correct lifetime.
The codegen'd path was inefficient only with varargs: with fixed
arguments the LLVM optimiser is able to infer after inlining that
the expressions are constant and remove all evaluation. However,
for varargs it stores the vararg values into a heap-allocated buffer.
The LLVM optimiser is unable to remove these stores because they
have a side-effect that is visible to code outside the function.
The codegen'd path is improved by evaluating varargs into an automatic
buffer that can be optimised out. We also make a small related change
to bake the string constants into the codegen'd code.
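A simplified sketch of the interpreted-path change, with illustrative stand-in types:
constant children are evaluated once when the expression is opened and the cached value
is reused per row.

#include <functional>
#include <optional>
#include <vector>

// Sketch only: cache the value of constant child expressions at open time so
// the per-row path does not re-evaluate them.
struct ChildExpr {
  bool is_constant;
  std::function<int()> eval;      // Stand-in for evaluating the child expr.
  std::optional<int> cached_val;  // Filled once for constant children.
};

static void OpenChildren(std::vector<ChildExpr>* children) {
  for (ChildExpr& c : *children) {
    if (c.is_constant) c.cached_val = c.eval();  // Evaluate once and cache.
  }
}

static int EvalChild(ChildExpr& c) {
  return c.cached_val ? *c.cached_val : c.eval();  // Hot path reuses the cache.
}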
Testing:
Ran exhaustive build.
Added regression test for IMPALA-2379 and MemPool test for aligned
allocation. Added a test for in predicates with constant strings.
Perf:
Added a targeted query that demonstrates the improvement. Also manually
validated the non-codegend perf. Also ran TPC-H and targeted perf
queries locally - didn't see any significant changes.
+--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| Workload | Query | File Format | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Num Clients | Iters |
+--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
| TARGETED-PERF(_20) | primitive_filter_in_predicate | parquet / none / none | 1.19 | 9.82 | I -87.85% | 3.82% | 0.71% | 1 | 10 |
+--------------------+-------------------------------+-----------------------+--------+-------------+------------+-----------+----------------+-------------+-------+
(I) Improvement: TARGETED-PERF(_20) primitive_filter_in_predicate [parquet / none / none] (9.82s -> 1.19s [-87.85%])
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+
| Operator | % of Query | Avg | Base Avg | Delta(Avg) | StdDev(%) | Max | Base Max | Delta(Max) | #Hosts | #Rows | Est #Rows |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+
| 01:AGGREGATE | 14.39% | 155.88ms | 214.61ms | -27.37% | 2.68% | 163.38ms | 227.53ms | -28.19% | 1 | 1 | 1 |
| 00:SCAN HDFS | 85.60% | 927.46ms | 9.43s | -90.16% | 4.49% | 1.01s | 9.50s | -89.42% | 1 | 13.77K | 14.05K |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+--------+-----------+
Change-Id: I45c3ed8c9d7a61e94a9b9d6c316e8a53d9ff6c24
Reviewed-on: http://gerrit.cloudera.org:8080/4838
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Hive expects types for column stats to be specified as all lower
case. For some reason, it doesn't check this when the stats are
first written, but it does check when performing an 'alter table'.
This causes it to drop stats that Impala wrote because we specify
type names in upper case.
This patch converts the types that Impala sends to Hive for the
column stats to all lower case and adds a regression test.
I also filed HIVE-15061 to track the issue from the Hive end.
Change-Id: Ia373ec917efa7ab9f2a59b8a870b7ebc30175dda
Reviewed-on: http://gerrit.cloudera.org:8080/4845
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
The Kudu client timeout was too low for Impala usage. This
sets the default timeout to 3 minutes and exposes it as a
gflag.
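A sketch of wiring such a flag into the Kudu client builder; the flag name and default
below are illustrative assumptions, not necessarily the exact flag this change adds.

#include <string>
#include <gflags/gflags.h>
#include "kudu/client/client.h"
#include "kudu/util/monotime.h"

// Sketch only: expose the timeout as a gflag and apply it to both the admin
// operation and RPC timeouts when building the Kudu client.
DEFINE_int32(kudu_operation_timeout_ms, 3 * 60 * 1000,
             "Timeout (in ms) applied to Kudu operations.");

kudu::Status CreateKuduClient(
    const std::string& master_addr,
    kudu::client::sp::shared_ptr<kudu::client::KuduClient>* client) {
  kudu::client::KuduClientBuilder b;
  b.add_master_server_addr(master_addr);
  b.default_admin_operation_timeout(
      kudu::MonoDelta::FromMilliseconds(FLAGS_kudu_operation_timeout_ms));
  b.default_rpc_timeout(
      kudu::MonoDelta::FromMilliseconds(FLAGS_kudu_operation_timeout_ms));
  return b.Build(client);
}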
New timeout tests were added.
Change-Id: Iad95e8e38aad4f76d21bac6879db6c02b3c3e045
Reviewed-on: http://gerrit.cloudera.org:8080/4849
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces a new query statement, UPSERT, for Kudu
tables, which operates like an INSERT and uses all of the same
analysis, planning, and execution machinery as INSERT, except that
a primary key collision results in an update being performed rather
than an error being returned.
New syntax:
[with_clause] UPSERT INTO [TABLE] table_name [(column list)]
{
query_stmt
| VALUES (value [, value...]) [, (value [, (value...)]) ...]
}
where column list must contain all of the key columns in
table_name, if specified, and table_name must be a Kudu table.
This patch also improves the behavior of INSERTing into Kudu
tables without specifying all of the key columns - this now
results in an analysis exception, rather than attempting the
INSERT and receiving an error back from Kudu.
Change-Id: I8df5cea36b642e267f85ff6b163f3dd96b8386e9
Reviewed-on: http://gerrit.cloudera.org:8080/4047
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This commit adds support for non-covering range partitions in Kudu
tables. The SPLIT ROWS clause is now deprecated and no longer supported.
The following new syntax provides more flexibility in creating range
partitions; it supports bounded and unbounded ranges as well as single-value
partitions. Multi-column range partitions are supported as well.
The new syntax is:
DISTRIBUTE BY RANGE (col_list)
(
PARTITION lower_1 <[=] VALUES <[=] upper_1,
PARTITION lower_2 <[=] VALUES <[=] upper_2,
....
PARTITION lower_n <[=] VALUES <[=] upper_n,
PARTITION VALUE = val_1,
....
PARTITION VALUE = val_n
)
Multi-column range partitions are specified as follows:
DISTRIBUTE BY RANGE (col1, col2,..., coln)
(
PARTITION VALUE = (col1_val, col2_val, ..., coln_val),
....
PARTITION VALUE = (col1_val, col2_val, ..., coln_val)
)
Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06
Reviewed-on: http://gerrit.cloudera.org:8080/4856
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
The first fix for IMPALA-4379 went in before all comments
were addressed. First commit: 9b507b6.
This addresses some follow-up comments about how to handle
ALTER TABLE setting the storage_handler table property,
which doesn't really make sense to ever allow.
Change-Id: I93d04a04483af598b392c28874363e3b0202e1f3
Reviewed-on: http://gerrit.cloudera.org:8080/4894
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
This change adds test coverage for the fixes committed for
IMPALA-2399 in commit 9ed3b685a1.
It uses the table nulltable in the workload functional-query
to verify the materialization and counting of NULL and empty-
valued columns. The test can be run on any supported storage
and compression combination.
Change-Id: I23923f95f43d67977ee1520a1fc09ce297548b3f
Reviewed-on: http://gerrit.cloudera.org:8080/4755
Tested-by: Internal Jenkins
Reviewed-by: Jim Apple <jbapple@cloudera.com>
The bug was that we cast the result exprs of operands before
unnesting them. If we unnested an operand, casts were missing
on those unnested operands' result exprs.
The fix is to first unnest operands and then cast result exprs.
Also clarifies the use of resultExprs vs. baseTblResultExprs.
Change-Id: I5e3ab7349df7d67d0d9c2baf4a56342d3f04e76d
Reviewed-on: http://gerrit.cloudera.org:8080/4815
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Kudu does not allocate null bytes if all projected columns are
non-nullable. Otherwise, Kudu allocates a null bit for all columns,
even the non-nullable ones. The bug was that Impala's memory layout
did not match the first requirement.
Change-Id: I762ad9d5cc4198922ea4b5218c504fde355c49a5
Reviewed-on: http://gerrit.cloudera.org:8080/4892
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Creating Kudu tables shouldn't allow types not supported by
Kudu (e.g. VARCHAR/CHAR, DECIMAL, TIMESTAMP, collection types).
The behavior was inconsistent: for some types the catalog threw an
error, while VARCHAR/CHAR columns silently became strings. This changes
the behavior so that all of these cases fail during analysis. Analysis tests
were added.
Similarly, external tables cannot contain Kudu types that
Impala doesn't support (e.g. UNIXTIME_MICROS, BINARY). Tests
were added to validate this behavior. Note that this
required upgrading the python Kudu client.
This also fixes a small corner case with ALTER TABLE:
ALTER TABLE shouldn't allow Kudu tables to change the
storage descriptor tblproperty, otherwise the table metadata
gets into an inconsistent state.
Tests were added for all of the above.
Change-Id: I475273cbbf4110db8d0f78ddf9a56abfc6221e3e
Reviewed-on: http://gerrit.cloudera.org:8080/4857
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
When HdfsParquetScanner::Open() failed, we used to hit a DCHECK
when trying to access HdfsParquetScanner::batch(), which is
only valid to call for non-MT scan nodes.
Change-Id: Ifbfdde505dbbd2742e7ab79a2415ff317a9bfa2f
Reviewed-on: http://gerrit.cloudera.org:8080/4851
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This change introduces a clustered/noclustered hint for insert
statements. Specifying this hint adds an additional sort node to the
plan, just before the table sink. This has the effect that data is
clustered by partition before it reaches the sink, so partitions
can be written sequentially.
Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14
Reviewed-on: http://gerrit.cloudera.org:8080/4745
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change implements support for TYPE_TIMESTAMP for
HashTableCtx::CodegenAssignNullValue(). TimestampValue itself
is 16 bytes in size. To match RawValue::Write() in the
interpreted path, CodegenAssignNullValue() emits code to assign
HashUtil::FNV_SEED to both the upper and lower 64-bit of the
destination value. This change also fixes the handling of 128-bit
Decimal16Value in CodegenAssignNullValue() so the emitted code
matches the behavior of the interpreted path.
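A small illustration of what the interpreted path does for a 16-byte value; the layout
and seed constant below are illustrative of the description above, not Impala's actual
classes.

#include <cstdint>

// Sketch only: write the hash seed into both 64-bit halves of a 16-byte value
// when assigning the "NULL" placeholder, so codegen and the interpreted path
// produce identical bytes.
static const uint64_t kFnvSeed = 0xcbf29ce484222325ULL;  // 64-bit FNV offset basis.

struct SixteenByteValue {
  uint64_t lo;
  uint64_t hi;
};

static void AssignNullValue(SixteenByteValue* v) {
  v->lo = kFnvSeed;
  v->hi = kFnvSeed;
}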
Change-Id: I0211d38cbef46331e0006fa5ed0680e6e0867bc8
Reviewed-on: http://gerrit.cloudera.org:8080/4794
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Michael Ho <kwho@cloudera.com>
1.) IMPALA-4134: Use Kudu AUTO FLUSH
Improves performance of writes to Kudu by up to 4.2x in
bulk data loading tests (loading 200 million rows from
lineitem); see the session-setup sketch after this list.
2.) IMPALA-3704: Improve errors on PK conflicts
The Kudu client reports an error for every PK conflict,
and all errors were being returned in the error status.
As a result, inserts/updates/deletes could return errors
with thousands errors reported. This changes the error
handling to log all reported errors as warnings and
return only the first error in the query error status.
3.) Improve the DataSink reporting of the insert stats.
The per-partition stats returned by the data sink weren't
useful for Kudu sinks. Firstly, the number of appended rows
was not being displayed in the profile. Secondly, the
'stats' field isn't populated for Kudu tables and thus was
confusing in the profile, so it is no longer printed if it
is not set in the thrift struct.
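The session-setup sketch referenced in item (1): switch the Kudu session to background
flushing and size its mutation buffer. This is an illustrative use of the Kudu client
API; the 100MB value is the default mentioned in the testing notes below.

#include "kudu/client/client.h"

// Sketch only: configure a session for AUTO_FLUSH_BACKGROUND with a large
// mutation buffer instead of flushing synchronously per batch.
kudu::Status ConfigureSession(
    kudu::client::KuduClient* client,
    kudu::client::sp::shared_ptr<kudu::client::KuduSession>* session) {
  *session = client->NewSession();
  kudu::Status s = (*session)->SetFlushMode(
      kudu::client::KuduSession::AUTO_FLUSH_BACKGROUND);
  if (!s.ok()) return s;
  return (*session)->SetMutationBufferSpace(100 * 1024 * 1024);
}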
Testing: Ran local tests, including new tests to verify
the query profile insert stats. Manual cluster testing was
conducted of the AUTO FLUSH functionality, and that testing
informed the default mutation buffer value of 100MB which
was found to provide good results.
Change-Id: I5542b9a061b01c543a139e8722560b1365f06595
Reviewed-on: http://gerrit.cloudera.org:8080/4728
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
IMPALA-4258: The problem was that there was a reference to
HdfsScanner::batch_ hidden inside WriteEmptyTuples(). The batch_
reference is NULL when the scanner is run with MT_DOP > 1.
IMPALA-4286: When there are no scan ranges, HdfsScanNodeBase::Open()
exits early without initializing the reader context. This led to
a DCHECK in IoMgr::GetNextRange() called from HdfsScanNodeMt.
The fix is to remove that unnecessary short-circuit in Open().
I combined these two bugfixes because the new basic test covers
both cases.
Testing: Added a new test_mt_dop.py test. A private code/hdfs
run passed.
Change-Id: I79c0f6fd2aeb4bc6fa5f87219a485194fef2db1b
Reviewed-on: http://gerrit.cloudera.org:8080/4767
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change fixes a memory management problem with LEAD()/LAG()
analytic functions which led to incorrect results. In particular,
the update functions specified for these analytic functions only
make a shallow copy of StringVal (i.e. copying only the pointer
and the length of the string) without copying the string itself.
This may lead to problems if the string is created by a UDF
that does local allocations whose buffer may be freed and reused
before the result tuple is copied out. This change fixes the problem
above by allocating a buffer in the Init() functions of these
analytic functions to track the intermediate value. In addition,
when the value is copied out in GetValue(), it will be copied into
the MemPool belonging to the AnalyticEvalNode and attached to the
outgoing row batches. This change also fixes a missing free of
local allocations in QueryMaintenance().
Change-Id: I85bb1745232d8dd383a6047c86019c6378ab571f
Reviewed-on: http://gerrit.cloudera.org:8080/4740
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.
Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU
Changes:
1) Remove the requirement to specify table properties such as key
columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
time of creation in Impala.
4) Disallow table properties that could conflict with an existing
table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
addresses. The flag is used as the default value for the table
property kudu_master_addresses but it can still be overridden
using TBLPROPERTIES.
7) Fix a post-merge issue (IMPALA-3178) where DROP DATABASE CASCADE
wasn't implemented for Kudu tables and was silently ignored. The Kudu
tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
Kudu); the existence of the other delegate and the use of delegates in
general has led to confusion. The Kudu delegate only exists to provide
functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
standard. When used at the column level, only one column can be
marked as a key. When used at the table level, multiple columns can
be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
The old "kudu.key_columns" table property is no longer accepted
though it is still used internally. "PRIMARY" is now a keyword.
The ident style declaration is used for "KEY" because it is also used
for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
The table property "kudu.table_name" is optional for managed tables
and is required for external tables. If for a managed table a Kudu
table name is not provided, a table name will be generated based
on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
of HMS when a table is loaded or refreshed. Table/column metadata
are cached in the catalog and are stored in HMS in order to be
able to use table and column statistics.
Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
If the table format is changed by an ALTER TABLE statement, the
default partition in partitioned tables did not get updated. This
caused a problem because INSERT picks up the file format for new
partitions from the default partition. This patch fixes the problem by
calling addDefaultPartition().
Also removed "drop table if exists" statements in tests in alter-table.test
because we already have the unique_database fixture.
Change-Id: I59bf21caa5c5e7867d07d87cda0c0a5b4b994859
Reviewed-on: http://gerrit.cloudera.org:8080/4750
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This patch is a mostly mechanical move of codegen-related logic
from each exec node's Prepare() to its Codegen() function.
After this change, code generation will no longer happen in
Prepare(). Instead, it will happen after Prepare() completes in
PlanFragmentExecutor. This is an intermediate step towards the
final goal of sharing compiled code among fragment instances in
multi-threading.
As part of the clean up, this change also removes the logic for
lazy codegen object creation. In other words, if codegen is enabled,
the codegen object will always be created. This simplifies some
of the logic in ScalarFnCall::Prepare() and various Codegen()
functions by reducing error checking needed. This change also
removes the logic added for tackling IMPALA-1755 as it's not
needed anymore after the clean up.
The clean up also rectifies a poorly documented situation.
Previously, even if a user explicitly set DISABLE_CODEGEN to true,
we might still codegen a UDF if it was written in LLVM IR or if it
had more than 8 arguments. This patch enforces the query option
by failing the query in both cases. To run the query, the user
must enable codegen. This change also extends the number of
arguments supported in the interpretation path of ScalarFn to 20.
Change-Id: I207566bc9f4c6a159271ecdbc4bbdba3d78c6651
Reviewed-on: http://gerrit.cloudera.org:8080/4651
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Adds code comments and issues a warning for Parquet files
with num_rows=0 but at least one non-empty row group.
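A sketch of the check being described, against the Thrift-generated Parquet metadata
types; the logging and include path are illustrative, not the scanner's actual code.

#include <iostream>
#include <string>
#include "gen-cpp/parquet_types.h"  // Thrift-generated parquet::FileMetaData

// Sketch only: warn when the footer claims zero rows but a row group is
// non-empty, instead of silently trusting num_rows.
static void WarnOnInconsistentRowCount(const parquet::FileMetaData& footer,
                                       const std::string& filename) {
  if (footer.num_rows != 0) return;
  for (const parquet::RowGroup& rg : footer.row_groups) {
    if (rg.num_rows > 0) {
      std::cerr << "Parquet file " << filename << " has num_rows=0 in its "
                << "footer but contains a non-empty row group." << std::endl;
      return;
    }
  }
}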
Change-Id: I72ccf00191afddb8583ac961f1eaf11e5eb28791
Reviewed-on: http://gerrit.cloudera.org:8080/4696
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
This change removes some of the occurrences of the strings 'CDH'/'cdh'
from the Impala repository. References to Cloudera-internal Jiras have
been replaced with upstream Jira issues on issues.cloudera.org.
For several categories of occurrences (e.g. pom.xml files,
DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to
remove the occurrences left after this change.
Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8
Reviewed-on: http://gerrit.cloudera.org:8080/4187
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
For Parquet files with no row groups but with num_rows=0 in the
file footer, the Parquet scanner returns an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions which used to accept such files.
This patch restores the previous behavior and adds tests.
Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Reviewed-on: http://gerrit.cloudera.org:8080/4693
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
There was a missing break in a switch statement, leading to an unintended
fallthrough. An existing test had expected the incorrect results; the test
now expects the correct results, so the bug is covered.
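The general shape of the bug, as a self-contained illustration (not the actual Impala
switch):

#include <cstdio>

// Sketch only: without the 'break' after case A, control falls through into
// case B and the function quietly returns the wrong value.
enum Kind { A, B, OTHER };

static int Describe(Kind k) {
  int result = 0;
  switch (k) {
    case A:
      result = 1;
      break;  // This was the missing 'break'; without it, A fell through to B.
    case B:
      result = 2;
      break;
    default:
      result = 3;
      break;
  }
  return result;
}

int main() { printf("%d\n", Describe(A)); return 0; }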
Change-Id: I5340c2eda813afc032ba72203bd59eb3f2c4f482
Reviewed-on: http://gerrit.cloudera.org:8080/4585
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins