This affects Java UDFs. Previously it was possible that the length of
the string returned from a Java UDF didn't match the actual data. Per the
Text.getBytes() documentation "... only data up to getLength() is
valid.". Impala just needs to use copyBytes() which is a convenience
function for this situation. The same should be done for BytesWritable.
Before:
Query: select length(echo('12345678901234567890'))
+-------------------------------------------+
| length(java.echo('12345678901234567890')) |
+-------------------------------------------+
| 22 |
+-------------------------------------------+
After:
Query: select length(echo('12345678901234567890'))
+-------------------------------------------------+
| length(functional.echo('12345678901234567890')) |
+-------------------------------------------------+
| 20 |
+-------------------------------------------------+
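A minimal Python sketch of the pitfall, assuming a writable that reuses its
backing buffer so that the buffer can be longer than the valid data. The class
ReusableText and its methods are hypothetical stand-ins for illustration, not
the Hadoop API:

```python
class ReusableText:
    """Hypothetical stand-in for a writable that reuses its backing buffer."""

    def __init__(self):
        self._buf = bytearray()
        self._length = 0

    def set(self, data: bytes) -> None:
        # Grow the buffer only when needed; stale bytes past the new
        # length are left in place.
        if len(data) > len(self._buf):
            self._buf.extend(bytes(len(data) - len(self._buf)))
        self._buf[:len(data)] = data
        self._length = len(data)

    def get_length(self) -> int:
        return self._length

    def get_bytes(self) -> bytes:
        # Whole backing buffer: only the first get_length() bytes are valid.
        return bytes(self._buf)

    def copy_bytes(self) -> bytes:
        # Exactly the valid prefix, which is what the caller wants.
        return bytes(self._buf[:self._length])


text = ReusableText()
text.set(b"x" * 22)                  # an earlier, longer value
text.set(b"12345678901234567890")    # the current value, 20 bytes
```

Taking the length of get_bytes() here would count the stale tail as well,
while copy_bytes() returns only the valid data.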
Change-Id: If9671278df8abf7529d3bc470c5f9d037ac3da1b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4897
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
This commit fixes the issue where querying a view produces incorrect
results if the view definition statement references the same column
multiple times in the select list. In that case, a predicate referencing
the same slot is generated when equivalences among view slots are
enforced, thereby causing null values to be rejected.
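A small Python sketch of why a predicate that compares a slot with itself
rejects nulls under SQL three-valued logic. The helper names are hypothetical
and only model the semantics, not the planner code:

```python
def sql_eq(a, b):
    """SQL-style equality: a comparison involving NULL (None) is unknown (None)."""
    if a is None or b is None:
        return None
    return a == b


def apply_filter(rows, predicate):
    # A WHERE-style filter keeps a row only if the predicate is exactly TRUE.
    return [r for r in rows if predicate(r) is True]


rows = [{"c": 1}, {"c": None}, {"c": 3}]
# Predicate of the form c = c, as generated when two view slots map to the
# same underlying column.
filtered = apply_filter(rows, lambda r: sql_eq(r["c"], r["c"]))
```

The row with a NULL in c is dropped even though the predicate looks like a
tautology.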
Change-Id: I3d13656141fb41d232ddd38562cbde277f2a1264
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5031
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
The sequence writer test had an issue with zlib on certain cluster machines, making
this a flaky test. This has passed several times locally and in private builds. This
re-enables the test because the failures could not be reproduced in private builds.
Change-Id: I0aeea3a2d000e711e5a84427a7b40592e1eef75b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5077
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
This enables the existing subquery rewrite rules to rewrite UNION
statements. UNION rewriting is easily done by simply calling the
rewriter for each operand in the UNION. At least one TPC-DS query
requires this functionality (IMPALA-1365).
The more difficult case of a UNION within a subquery is still not
supported.
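The per-operand approach can be sketched in Python as follows. The classes and
the rewrite() function are hypothetical simplifications; the `rewritten` flag
stands in for applying the actual subquery rewrite rules:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SelectStmt:
    sql: str
    rewritten: bool = False


@dataclass
class UnionStmt:
    operands: List[object] = field(default_factory=list)


def rewrite(stmt):
    """Apply the rewrite to a statement; for a UNION, recurse into each operand."""
    if isinstance(stmt, UnionStmt):
        for operand in stmt.operands:
            rewrite(operand)
    else:
        stmt.rewritten = True  # placeholder for the real rewrite rules


a, b, c = SelectStmt("select 1"), SelectStmt("select 2"), SelectStmt("select 3")
rewrite(UnionStmt([a, UnionStmt([b, c])]))
```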
Change-Id: I7f83eed0eb8ae81565e629f09f6918a4ba86ee13
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4859
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: jenkins
This patch improves the performance of the planning phase of a query
querying HBase tables. It removes an unnecessary second call to compute
stats and adds a new version for estimating the row count in a table.
This patch adds an incremental version to estimate the number of rows
for a set of regions. This incremental version will start querying up to
five regions to calculate the average row size and use this value to
estimate the row count based on the size of the regions on disk. Only if
the standard deviation from the average is larger than 15% will it query
additional regions to calculate an average with more confidence.
If the data is balanced it will not be necessary to retrieve data from
all regions but only from a subset. In the worst case, all regions are
queried.
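A rough Python sketch of the incremental estimate, under the assumption that
the 15% threshold is applied to the standard deviation relative to the mean of
the sampled row sizes. The function, its parameters, and the sampling callback
are hypothetical and do not mirror the actual implementation:

```python
import statistics


def estimate_row_count(region_sizes, sample_avg_row_size,
                       initial_regions=5, max_rel_stddev=0.15):
    """Estimate the table row count from on-disk region sizes.

    region_sizes: on-disk size in bytes of each region.
    sample_avg_row_size: hypothetical callback returning the (positive)
        average row size in bytes observed when sampling region i.
    Returns (estimated_rows, regions_queried).
    """
    samples = []
    for i in range(len(region_sizes)):
        samples.append(sample_avg_row_size(i))
        # Always look at the initial batch of regions first.
        if len(samples) < min(initial_regions, len(region_sizes)):
            continue
        mean = sum(samples) / len(samples)
        # Stop once the samples agree closely enough; otherwise query one more.
        if statistics.pstdev(samples) <= max_rel_stddev * mean:
            break
    if not samples:
        return 0, 0
    mean = sum(samples) / len(samples)
    return round(sum(region_sizes) / mean), len(samples)


balanced = estimate_row_count([1000] * 8, lambda i: 100)
skewed_sizes = [100, 100, 100, 100, 300, 100, 100, 100]
skewed = estimate_row_count([1000] * 8, lambda i: skewed_sizes[i])
```

With balanced data only the first five regions are sampled; with one outlier
among the samples the sketch ends up querying every region.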
Change-Id: Idcb3bea81b11cb08da6d9329ba66c86aca23e170
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5258
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Impala qualifies all paths stored in the metastore except for the
DataSource jar path. Use a qualified path here as well, which will
allow datasources to live on the non-default FS.
In CreateDataSrcStmt, use the post-analyzed qualified path rather than
the user passed string. Then, fix CreateTableDataSrcStmt so that it
doesn't strip out the scheme://authority portion of the URI, but instead
uses the qualified path string directly.
Note that the metastore may still contain unqualified paths in
DataSource tables' properties that were generated by previous versions.
That's okay though since the backend won't assume all paths are
qualified in case other components generate (or have in the past)
metadata with unqualified paths.
Change-Id: I905d8f6a7bf1793cfccf720b6ab5dc845d7dd5fa
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5201
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 86c75be01d0f5654291acdbc1c68f5a76915028c)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5239
This commit fixes the issue where additional predicates that alter the
meaning of a query are generated when an inline view contains an outer
join. Such predicates are derived from slot equivalences without taking
into account the directionality of the value transfer graph.
Change-Id: I0a3390d39a4f2039a8b114a7659980aa444d35c0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5109
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5186
- Add number of files in table to query plan
- Add number of remote scan ranges to runtime profile
- Clean up logging in ClientCache
Change-Id: I0580fe435ac0a52548aedb4e0ffa875ce9b9dede
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5166
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
This commit fixes the issue where a CTAS statement inserts a wrong
number of rows if the associated select statement contains an inline
view with a limit clause. The limit clause of the nested query was not
taken into consideration during planning, resulting in the generation
of a wrong distributed plan.
Change-Id: Ib3ad50199d95d2d6b9ad0aa3b2031a002cbcca44
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5057
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5063
The analysis of a with clause should have its own global state so the
local view(s) can be analyzed without polluting the global state of the
parent QueryStmt. This might not always matter, but in a complex query
involving a with clause that contained a subquery, re-analysis of the
WithClause after the subquery rewrite resulted in an invalid Exists
conjunct being registered in the parent analyzer's global state. The
Exists conjunct was assigned to a scan node which then failed a
pre-condition check.
Change-Id: Ib020787b2e1ff202d96fe1b92bd9740897ab32a0
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4825
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 629a8652c5a290054a8e582cc5cb5768a3ee67a8)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5038
This patch modifies the abs() built-in function so that it
retains the type of the input argument for the return type
in the same way as Postgres does.
Change-Id: I1750237b85bedbc3ce9d52330ac4d458b0aada3a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4980
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 424b359ab0a4f621f2865844c3293f2c80e0867f)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4996
Removing the test case for IMPALA-1312 to unblock exhaustive runs. This query was
previously hitting a DCHECK failure in the BufferedTupleStream where the number of
pinned blocks wasn't being updated properly. With codegen enabled, this query took
~70sec. Without codegen, it took so long that the exhaustive runs would fail; I
found it took ~35min on my local machine.
IMPALA-1414 tracks investigating why this query is so slow.
Change-Id: I2bf8a8c51fc7ded0026e334636f9b2cc859ffdb2
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4931
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f8b7320e035549da4e4a6a99b87da97bc18be0ad)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4941
columns
This commit fixes the issue where Impala hangs when querying an HBase
table with a large number (>500) of columns. The issue was triggered by a
large memory allocation of a tuple buffer during the first GetNext call
of the HBase scanner that was causing an infinite loop where each iteration
was allocating a significant amount of memory. The fix is to dynamically
set the mem limit of a row batch based on the corresponding row size and to
dynamically set the maximum size of the tuple buffer so that it does not
exceed that limit.
Change-Id: Ia64f98b229772b50658af952fc641bf00f54f450
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4871
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4933
For certain queries, order by with ordinals would not work
properly. This is the case for all "select *" type queries. Until
now, the ordinal substitution was based on the values from the select
list. However, these expressions are not expanded in the case of "*";
rather, the list of result expressions and column labels is filled.
This patch simply changes the lookup of the expression from the select
list to the result list because only ordinals from the result can be
used as a sorting field.
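A minimal Python sketch of the lookup change, with hypothetical names and
plain strings standing in for expressions:

```python
def resolve_order_by_ordinal(ordinal, exprs):
    """Map a 1-based ORDER BY ordinal to the expression at that position."""
    if not 1 <= ordinal <= len(exprs):
        raise ValueError("ORDER BY ordinal %d is out of range" % ordinal)
    return exprs[ordinal - 1]


# For "select * from t order by 2" the select list holds a single
# unexpanded item, while the result list holds the expanded columns.
select_list = ["*"]
result_exprs = ["id", "name", "price"]  # hypothetical expansion of "*"

sort_expr = resolve_order_by_ordinal(2, result_exprs)
```

Resolving the same ordinal against select_list would fail, because the
unexpanded list has only one entry.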
Change-Id: I21d3c3da837307cae04f8a4be02ca31bdcfbcbdb
(cherry picked from commit 1b62c08552c19f1b0c2220d1568804e2eba7efac)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4920
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This commit fixes an issue where an error is thrown during planning
when an insert-select statement contains an analytic function. The issue
was caused by a missing mapping step of logical to physical tuples for
the case of insert statements.
Change-Id: I68d856b1fda4dd0a7345648459e466d90d95201f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4911
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4915
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
This patch adds partition filters to tpcds-q89 to account for the lack of dynamic
partition pruning. Additionally, it re-enables running tpcds-q47, which was blocked
by IMPALA-1238.
Change-Id: Ied05d80565ebb29cd06b3c38d76bd31f0285028e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4453
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Correct elimination of redundant join predicates relies on slot
equivalences being enforced at the lowest possible plan node
possibly by generating new predicates. Previously, we only
enforced such equivalences at scan and aggregation nodes which
is insufficient because joins materialize a new tuple combination
which may also require construction of new predicates to establish
known slot equivalences.
This patch generalizes the existing helper function for constructing
the minimum spanning tree to cover known slot equivalences for each
equivalence class. The function is intended to be called during
bottom-up plan generation at nodes that change the tuple composition
(scans, joins, aggs, etc.)
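The idea can be sketched in Python with a union-find over the slots of one
equivalence class. This is an unweighted spanning tree built under simplifying
assumptions (slots are strings, predicates are pairs); all names are
hypothetical and the real helper may differ:

```python
def spanning_predicates(equiv_class, available, existing):
    """Return new equality predicates that, together with `existing`,
    connect all slots of `equiv_class` that are available at this node."""
    slots = [s for s in equiv_class if s in available]
    parent = {s: s for s in slots}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False
        parent[ra] = rb
        return True

    # Equivalences already enforced by existing predicates need no new ones.
    for a, b in existing:
        if a in parent and b in parent:
            union(a, b)

    new_preds = []
    for s in slots[1:]:
        if union(slots[0], s):
            new_preds.append((slots[0], s))
    return new_preds


preds = spanning_predicates(
    ["t1.a", "t2.b", "t3.c", "t4.d"],      # one equivalence class
    {"t1.a", "t2.b", "t3.c"},              # slots materialized at this node
    [("t1.a", "t2.b")])                    # predicate already present
```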
Change-Id: I73880310553c63296486b2f77a51618738005167
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4781
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4794
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
This fixes the incorrect pushing of predicates into Unions for which at least one operand contains
an analytic expr.
It also adds a TupleDescriptor.debugName_ member variable that makes it easier to read the output
of DescriptorTable.debugString().
Change-Id: Icd50220e711851b8174fdfb53c6b2cd03ca3dcde
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4586
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Since we only make one NULL-aware stream per NAAJ (as opposed to one per partition),
we do not care about the memory footprint on this tuple stream. For simplicity,
this will always use io-sized buffers.
Also, improving error handling in PHJ::ProcessProbeBatch(), as status_ was not being
set properly.
Disabling the regression test for this bug, as it takes too long to run. Need to find
a simpler query.
Change-Id: I7572f607199f38b1bc30ae208ece2832522342a1
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4770
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Conflicts:
be/src/exec/partitioned-hash-join-node.cc
Correct elimination of redundant join predicates relies on slot
equivalences being enforced at the lowest possible plan node
possibly by generating new predicates. Previously, we only
enforced such equivalences at scan and aggregation nodes which
is insufficient, explained as follows. Equivalences between
slots of an inline view may not be correctly enforced in the
scans/aggs of the inline-view plan if those inline-view slots
point to complex expressions in the underlying view stmt.
We currently cannot reason about equivalences of complex
expressions. As a result, it is possible that inline-view slots
are known to be equivalent but the underlying expressions are
thought to be non-equivalent.
This patch adds enforcement of equivalent inline-view slots
by generating new predicates that are migrated into the
inline-view plan. This way, our existing expression substitution
logic can be used to indirectly reason about the equivalences
of complex expressions that are part of an inline view.
Change-Id: Id38115c90e2c47d65463380a6f8cb1d0f21134b7
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4755
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Conflicts:
fe/src/main/java/com/cloudera/impala/planner/Planner.java
Fixes two issues that can occur when generating the plan for a
stmt with an empty result set (e.g. due to limit 0 or constant
predicates that evaluate to false):
1) Unions with an inline view that produces an empty result set
do not create the EmptySetNode for the correct stmt.
2) An EmptySetNode may contain non-materialized tuples which
will fail a precondition check when generating the thrift
plan.
Change-Id: I1511c755be3a59fdb8934624fd08250323266d27
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4744
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Union statements were sometimes losing necessary casts during
expression substitution, causing the backend union node to receive
slot refs that did not have the same types as the result tuple. Add a
flag to Expr.Substitute() to preserve the root expr types, which adds
back the casts after substitution.
Currently only the union node sets this flag to true, but there may be
other places that are incorrect.
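A simplified Python sketch of the flag's effect. The classes and the
substitute() signature are hypothetical and only model a root-level
substitution; they are not the actual Expr.Substitute() interface:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRef:
    name: str
    type: str


@dataclass(frozen=True)
class Cast:
    child: object
    type: str


def substitute(expr, smap, preserve_root_type=False):
    """Replace expr via smap; optionally cast back to the original type."""
    result = smap.get(expr, expr)
    if preserve_root_type and result.type != expr.type:
        result = Cast(result, expr.type)
    return result


# The union operand was cast to BIGINT to match the result tuple, but the
# substitution map points at an INT slot.
operand = Cast(SlotRef("a", "INT"), "BIGINT")
smap = {operand: SlotRef("v.a", "INT")}

lossy = substitute(operand, smap)
preserved = substitute(operand, smap, preserve_root_type=True)
```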
Change-Id: I1b4d9846860ef9694ff0c089f79654b1746d687d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4777
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
Because we add 'total' to the last row in SHOW PARTITIONS, we set the
partition key columns to be string. At least, that's what the comment
said, but we didn't do that in fact.
This patch also corrects the column type for max width, which should be INT.
Change-Id: I787ab17be27f45107340119017e528c58a3daad3
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4678
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
The fix is to only register aggregates for STRING, not for CHAR or VARCHAR. The CHAR
and VARCHAR types are implicitly cast to STRING for aggregation.
Also, fixed aggregate fn builtins that should not ignore distinct.
Change-Id: If4c1a2c6127360c2c8127a5c02949df74fafc85a
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4717
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Also modified the text of the analysis exception for lengths that are too long or
short because John said they were unclear.
Change-Id: I9427d5c39298aa8207672e50e10fe527c5076599
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4698
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
The bug was that changes to table refs were done in-place for join inversion,
and not reverted when a particular join re-ordering attempt was unsuccessful.
Subsequent join re-ordering attempts with a different left-most table ref
should use the original unmodified table refs.
To achieve the above, this patch reverts the state changes made to table refs
for unsuccessful join ordering attempts.
TODO: The cleaner fix is to clone all table refs for each new join re-ordering
attempt. However, implementing a state-preserving clone() for table refs is a
more involved change.
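The revert-on-failure approach can be sketched in Python as below. Table refs
are modeled as plain dicts and the `attempt` callback is a hypothetical
stand-in for one join re-ordering attempt that may modify refs in place:

```python
def try_join_orders(table_refs, attempt):
    """Try each table ref as the left-most one, undoing in-place changes
    whenever an attempt fails."""
    for leftmost in range(len(table_refs)):
        snapshot = [dict(ref) for ref in table_refs]
        plan = attempt(table_refs, leftmost)
        if plan is not None:
            return plan
        # Unsuccessful attempt: restore the original state of every ref.
        for ref, saved in zip(table_refs, snapshot):
            ref.clear()
            ref.update(saved)
    return None


seen = []


def attempt(refs, leftmost):
    seen.append([r["join_op"] for r in refs])
    refs[leftmost]["join_op"] = "INVERTED"  # in-place change, as in join inversion
    return "plan" if leftmost == 1 else None


refs = [{"name": "a", "join_op": "LEFT OUTER"}, {"name": "b", "join_op": "INNER"}]
plan = try_join_orders(refs, attempt)
```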
Change-Id: Ife0121f0e15441a5c0a23f75054c683c05b1ecac
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4715
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Problem: hash table assumed all raw values were at most 16 bytes. This maximum was
increased to support up to 128 bytes for CHARs.
Change-Id: I107c58b9a013d5db46ff5586bcdceee3961346e9
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4701
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
When generating the plan for a partitioned hash join we place the join into a
partition-compatible input fragment, if possible. During this optimization one
must ensure that the hash exchange sending to the new join (which was placed
into a compatible fragment) is compatible with the hash exprs used for sending
to the fragment containing the join node.
In particular, we had two bugs:
1. The number of hash exprs could be different, possibly because of redundant
exprs on one or both sides
2. The order of the hash exprs could be different, causing two rows with the
same hash-expr values to be sent to different nodes
The fix is to enforce the above two properties, and revert to exchanging
both sides if no compatible hash exprs can be constructed.
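As an illustration only, the two properties can be checked with a small Python
helper. It assumes equi-join conjuncts given as (lhs, rhs) pairs and an input
fragment already hash-partitioned on a list of lhs exprs; the function name
and data model are hypothetical:

```python
def compatible_hash_exprs(join_conjuncts, input_partition_exprs):
    """Return the hash exprs for the other join input, aligned one-to-one
    and in the same order as input_partition_exprs, or None if that is
    not possible."""
    lhs_to_rhs = {}
    for lhs, rhs in join_conjuncts:
        # Ignore redundant conjuncts on the same lhs expr.
        lhs_to_rhs.setdefault(lhs, rhs)
    aligned = []
    for expr in input_partition_exprs:
        if expr not in lhs_to_rhs:
            return None  # caller falls back to exchanging both sides
        aligned.append(lhs_to_rhs[expr])
    return aligned


conjuncts = [("a.x", "b.x"), ("a.y", "b.y"), ("a.x", "b.z")]
aligned = compatible_hash_exprs(conjuncts, ["a.y", "a.x"])
fallback = compatible_hash_exprs(conjuncts, ["a.w"])
```

The aligned list has the same number of exprs as the existing partition, in
the same order, so rows with equal hash-expr values go to the same node.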
Change-Id: Id155fb8094ed1694f7bc038ed2f9685f4d645fbe
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4639
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Small buffers introduced an issue that is exacerbated by the large fanout. A stream can
only be appended to forever once it has grabbed the initial io sized buffer. With small
buffers, we don't grab that at the beginning anymore and, before this patch, it is
grabbed when the stream first needs it. This means when one stream needs it, another
stream could have already grabbed it (meaning this stream is pinned with multiple
buffers).
This patch has all the streams grab an IO buffer as soon as the first stream needs an
io buffer. This guarantees that all streams get 1 before any get 2.
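A toy Python model of this policy, with hypothetical names and a simple counter
of free IO-sized buffers in place of the real buffer management:

```python
class Stream:
    def __init__(self, name):
        self.name = name
        self.io_buffers = 0


def need_io_buffer(requesting, streams, free_buffers):
    """Handle a stream's request for an IO-sized buffer.

    Returns the number of free buffers left afterwards."""
    if requesting.io_buffers == 0:
        # First request by any stream: give every stream its initial buffer.
        pending = [s for s in streams if s.io_buffers == 0]
        if len(pending) > free_buffers:
            raise MemoryError("not enough IO buffers for all streams")
        for s in pending:
            s.io_buffers = 1
        return free_buffers - len(pending)
    if free_buffers == 0:
        raise MemoryError("no IO buffers left")
    requesting.io_buffers += 1
    return free_buffers - 1


streams = [Stream("s0"), Stream("s1"), Stream("s2")]
free = need_io_buffer(streams[0], streams, 4)
counts_after_first = [s.io_buffers for s in streams]
free = need_io_buffer(streams[0], streams, free)
```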
Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
The issue is that the aggregation node needed to use IsVarLen; previously
it assumed TYPE_STRING was the only variable length type.
Change-Id: I9545e8d405937a47b25c9042f97854851a448c6e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4690
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
There is an issue related to IMPALA-1322. The expression list when laying out memory
was being improperly indexed.
Change-Id: I2eef84a812b451d87ecb8afd304e765aff1f5a6b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4675
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
with complex exprs
This commit fixes the issue where, for the case of scalar subqueries,
complex exprs in a correlated predicate may result in a wrong subquery
rewrite.
Change-Id: Ib6f14a37ca7a74e25daf3b31f86766ff9032d7fd
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4674
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Update PA/PHJ to use small (< io-sized) buffers initially. Without this we would
not be able to run at the QPS that we need just due to the buffering requirements
of these operators.
Change-Id: Ic8a777d147893567c9590fbab17f561eadb6ee19
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4623
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>