560 Commits

Author SHA1 Message Date
Vlad Berindei
ece7fed421 IMPALA-2316: Add RESTRICT to DROP DATABASE
Change-Id: Iffad73175b49160ae049911bd33c110a830f932b
Reviewed-on: http://gerrit.cloudera.org:8080/796
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-09-11 20:37:27 +00:00
Alex Behm
e9e43488cf IMPALA-2297: Handle collection types in ExprContext::GetValue().
Change-Id: I6af780791e392c0431efdf5a513e4b1cb60d14cf
Reviewed-on: http://gerrit.cloudera.org:8080/749
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-10 17:46:21 +00:00
Alex Behm
deb9c6f8e6 Nested Types: Poor man's projection for collection-typed slots.
Collection-typed slots are expensive to copy, e.g., during data
exchanges or when writing into a buffered-tuple-stream. Even worse,
such slots could be duplicated many times after unnesting in a
subplan. To alleviate this problem, this patch implements a
poor man's projection where collection-typed slots are set to NULL
inside the SubplanNode that flattens them.

The FE guarantees that the contents of an array-typed slot are never
referenced outside of the single UnnestNode that accesses them, so when
returning eos in UnnestNode::GetNext() we also set the unnested array
slot to NULL to avoid those expensive copies in downstream exec nodes.

The FE provides that guarantee by creating a new slot in the parent
scan for every relative CollectionTableRef. For example, for a table
't' with a collection-typed column 'c' the following query would have
two separate slots in the tuple of 't', one for 'c1' and one for 'c2':

select * from t, t.c c1, t.c c2

Change-Id: I90e5b86463019c9ed810c299945c831c744ff563
Reviewed-on: http://gerrit.cloudera.org:8080/763
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-10 05:44:55 +00:00
Alex Behm
361da01152 Fail queries that require a SubplanNode when using legacy joins and aggs.
We will not provide full nested types support if any of these options
are set:

--enable_partitioned_aggregation=false
--enable_partitioned_hash_join=false

Change-Id: I0f8607914faf9691d5f7b1a4327609fefba22e56
Reviewed-on: http://gerrit.cloudera.org:8080/792
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-09-10 04:50:31 +00:00
Henry Robinson
8809567e82 IMPALA-2290: Fix btrim() thread-safety.
By not using THREAD_LOCAL for its state, btrim() invocations in
multi-threaded contexts (i.e. pushed to the scanner) would have threads
trampling over each other's bitset used to check for trimmed characters.

Testing:

See new test in expr.test:

select count(*) from functional.alltypes where btrim(string_col, string_col) != ""

.. should give 0 results, but would give > 0 with this bug.

Change-Id: I595e25b1d4fb7c76b846fce837b4ec140f47d43c
Reviewed-on: http://gerrit.cloudera.org:8080/748
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
2015-09-09 04:15:30 +00:00
Tim Armstrong
5ac55f24cc IMPALA-2296: missing DeepCopy array support
Implement Tuple-to-Tuple DeepCopy for collections. Add query test
that uses the TOP-N node, which deep copies tuples in this way.
Confirmed that the query test failed before this fix.

Change-Id: I3fea860d8251038d7b5eb85c77973939abe9dbf8
Reviewed-on: http://gerrit.cloudera.org:8080/757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-08 23:40:53 +00:00
Tim Armstrong
d73683b320 Fix nested types tpch test formatting
Invalid test file format caused tpch tests to fail.

Change-Id: Ibf523d071bb14db72689e39645fd1724897543c7
Reviewed-on: http://gerrit.cloudera.org:8080/766
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-08 21:57:52 +00:00
Alex Behm
662bc24c79 IMPALA-2100: Exclude explain header from expected results of test_partitioning.py.
HDFS acknowledges writes when the first replica is written.
As a result, the estimated memory requirements for an Impala
query may vary depending on how many replicas existed at the
time of table loading. This racy behavior caused a few tests
to sometimes fail due to different actual and expected memory
requirements.

The fix is to exclude the explain header from the expected results.

Change-Id: Ifb13de937a104a48960d35745df521de66596837
Reviewed-on: http://gerrit.cloudera.org:8080/762
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-08 19:57:55 +00:00
aacalfa
5e733e8d62 IMPALA-2190: Complete conversion functions between timestamp, unixtime, and string dates
Change-Id: I48a446f19c7634477f175d0defa8779dd70a392f
Reviewed-on: http://gerrit.cloudera.org:8080/654
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-09-07 07:07:20 +00:00
Dimitris Tsirogiannis
f647b36e58 IMPALA-2289: Properly set eos_ in the BlockingJoinNode when the probe side
is exhausted

This commit fixes an issue where BlockingJoinNode will incorrectly set
eos_ flag to true when the probe side is exhausted without considering
the join mode that is executed. This would cause the NestedLoopJoinNode to
sometimes return wrong results when a right-outer, right-anti or
full-outer join mode is used. This issue appeared in nested TPC-H Q22.

Change-Id: I01f2118d4db3d8739201d5c3f475f5b7e328555a
Reviewed-on: http://gerrit.cloudera.org:8080/753
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-06 05:29:22 +00:00
Alex Behm
d48ec4b8b3 IMPALA-2289: Properly handle AtCapacity() in SubplanNode.
After this patch we get correct results for nested TPCH Q13.

The bug: Since we were not properly handling AtCapacity() of the output
batch in SubplanNode, we sometimes passed a row batch that was already
at capacity into GetNext() on the second child of the SubplanNode.
In this particular case, that batch was passed into the NestedLoopJoinNode
which may return incomplete results if the output batch is already
at capacity (e.g., ProcessUnmatchedBuildRows() was not called).

The fix is to return from SubplanNode::GetNext() if the output batch
is at capacity due to resources being transferred to it from the input
batch used to fetch from the first child.

Change-Id: Ib97821e8457867dc0d00fd37149a3f0a75872297
Reviewed-on: http://gerrit.cloudera.org:8080/742
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-04 20:26:52 +00:00
Vlad Berindei
cfc3952a83 IMPALA-898: Support explicit column names in WITH-clause views.
Example:
WITH t(c1, c2) AS (SELECT int_col, bool_col FROM functional.alltypes)
SELECT * FROM t

This will create a local view with the 'int_col' and 'bool_col' columns labeled as 'c1'
and 'c2'. If the number of labels is less than the number of columns, then the remaining
columns in the local view will be labeled as the corresponding columns in the query
statement. Therefore, this is also a valid query (only 'int_col' will be labeled as
'c1'):

WITH t(c1) AS (SELECT int_col, bool_col FROM functional.alltypes)
SELECT * FROM t
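
The labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this message, not the FE code; the function name label_view_columns is hypothetical, and rejecting a label list longer than the column list is an assumption.

```python
from typing import List


def label_view_columns(explicit_labels: List[str], query_columns: List[str]) -> List[str]:
    """Return the column labels of a WITH-clause view.

    Explicit labels are applied left to right; any remaining columns keep
    the labels they have in the view's query statement.
    """
    if len(explicit_labels) > len(query_columns):
        # Assumption: more labels than columns is treated as an error.
        raise ValueError("more column labels than columns in the query")
    return explicit_labels + query_columns[len(explicit_labels):]


# WITH t(c1, c2) AS (SELECT int_col, bool_col ...)
print(label_view_columns(["c1", "c2"], ["int_col", "bool_col"]))
# WITH t(c1) AS (SELECT int_col, bool_col ...)
print(label_view_columns(["c1"], ["int_col", "bool_col"]))
```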

Change-Id: Ie3a559ca9eaf95c6980c5695a49f02010c42899b
Reviewed-on: http://gerrit.cloudera.org:8080/717
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-09-03 01:19:43 +00:00
Skye Wanderman-Milne
bcc73a36da Nested types: read and materialize nested types in Parquet scanner
This patch modifies the Parquet scanner to resolve nested schemas, and
read and materialize collection types. The high-level modification is
to create a CollectionColumnReader that recursively materializes map-
and array-type slots.

This patch also adds many tests, most of which query a new table
called complextypestbl. This table contains hand-generated data that
is meant to expose edge cases in the scanner. The tests mostly test
the scanner, with a few tests of other functionality (e.g. array
serialization).

I ran a local benchmark comparing this scanner code to the original
scanner code on an expanded version of tpch_parquet.lineitem with
48009720 rows. My benchmark involved selecting different numbers of
columns with a single scanner thread, and I looked at the HDFS scan
node time in the query profiles. This code introduces a 10%-20%
regression in single-threaded scan time.

Change-Id: Id27fb728934e8346444f61752c9278d8010e5f3a
Reviewed-on: http://gerrit.cloudera.org:8080/576
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-09-02 19:23:54 +00:00
Dimitris Tsirogiannis
f6985772dc IMPALA-2275: S3: authorization.test_grant_revoke failure due to stale
grant_revoke_no_insert.test

This commit updates the test file of grant/revoke statements running
against S3 to include column-level privileges.

Change-Id: Ia21595740fd37c88040d9a692444c6009591a188
Reviewed-on: http://gerrit.cloudera.org:8080/735
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-09-02 04:29:41 +00:00
Juan Yu
c66785be4a IMPALA-2227: S3:query_test.test_queries.TestQueries.test_exprs failure
Use select query instead of insert query to verify constant expression
on partition column.

Change-Id: I442111225e8df29bcc5fe89500d023559bb1c1fb
Reviewed-on: http://gerrit.cloudera.org:8080/707
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-08-29 00:40:41 +00:00
Dimitris Tsirogiannis
fdb90ed753 CDH-23206: Impala support for column-level authorization (part 1)
This commit adds partial support for column-level authorization in
Impala using the Sentry Service. The following changes are included:
* Added support for parsing and analyzing GRANT/REVOKE statements with column-level
  privileges. The supported syntax is:
  - GRANT SELECT (<col_names>) ON TABLE <table_name>
    TO [ROLE] <role_name> [WITH GRANT OPTION]
  - REVOKE [GRANT OPTION FROM] SELECT (<col_names>) ON
    TABLE <table_name> FROM [ROLE] <role_name>
* Added support for storing column-level privileges in the Catalog Service and updating
  the Sentry Service when GRANT/REVOKE statements are executed.
* Modified the SHOW GRANT ROLE statement to include information about
  column-level privileges.

Subsequent patches will add support for enforcing column-level
privileges in SQL queries and other statements.

Change-Id: I0fd9daa92cc5147cb6f4b25eb9651aab8bf3049f
Reviewed-on: http://gerrit.cloudera.org:8080/607
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-28 23:58:36 +00:00
Juan Yu
d42ecb310a IMPALA-1756: Add test case for partition insert query
Change-Id: I4879d8fe7221b551898fa9fa94076bb9b0804f06
Reviewed-on: http://gerrit.cloudera.org:8080/696
Reviewed-by: Juan Yu <jyu@cloudera.com>
Tested-by: Internal Jenkins
2015-08-27 18:50:58 +00:00
Martin Grund
60c5140ea7 IMPALA-1983: Warn if table stats are potentially corrupt.
When the `numRows` parameter stored in the table properties is
erroneously set to 0 and a number of non-empty files are present,
the table statistics are considered to be corrupt.

To hint that there might be a problem, the explain statement will emit
an additional warning if it detects potentially corrupt table stats like
in the following example:

  Estimated Per-Host Requirements: Memory=42.00MB VCores=1
  WARNING: The following tables have potentially corrupt table and/or
  column statistics.
  compute_stats_db.corrupted

  03:AGGREGATE [FINALIZE]
  |  output: count:merge(*)
  |
  02:EXCHANGE [UNPARTITIONED]
  |
  01:AGGREGATE
  |  output: count(*)
  |
  00:SCAN HDFS [compute_stats_db.corrupted]
     partitions=1/2 files=1 size=24B

In addition, the small query optimization is disabled for such queries.
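
The condition described at the top of this message can be written down as a small Python predicate. This is a sketch of the stated rule only; the function name and the representation of files as a list of byte sizes are hypothetical.

```python
from typing import Sequence


def stats_potentially_corrupt(num_rows: int, file_sizes: Sequence[int]) -> bool:
    """Flag table stats as suspicious when numRows is 0 but data files exist.

    num_rows   -- the numRows value stored in the table properties
    file_sizes -- sizes in bytes of the table's files (hypothetical input shape)
    """
    non_empty_files = sum(1 for size in file_sizes if size > 0)
    return num_rows == 0 and non_empty_files > 0


# One 24-byte file with numRows=0, as in the explain output above.
print(stats_potentially_corrupt(0, [24]))
```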

Change-Id: I0fa911f5132aa62195b854248663a94dcd8b14de
Reviewed-on: http://gerrit.cloudera.org:8080/689
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
2015-08-26 22:19:33 +00:00
Sailesh Mukil
1a9fc47295 IMPALA-2227: S3: query_test.test_queries.TestQueries.test_exprs failure
The test file testdata/workloads/functional-query/queries/QueryTest/exprs.test had INSERT
statements in it, which are not supported on S3. This commit gets rid of those statements
and rewrites them with SELECT [...] FROM VALUES(...) so that the tests are compatible on
S3.

Change-Id: I25faacf9fae3780f627afee86dc8c1ede7f6e2a2
Reviewed-on: http://gerrit.cloudera.org:8080/670
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-08-26 00:36:51 +00:00
Vlad Berindei
452ebee59d IMPALA-1906: PARQUET_FILE_SIZE query option overflows for values >= 2GB.
The value of PARQUET_FILE_SIZE overflows when RoundUp() is called because this function
returns an int32. Even with this change, this value will still overflow when calling the
HDFS API since it is passed to hdfsOpenFile() as blocksize, which is an int32 parameter
(see HDFS-8949).

Changes:
- Return an error if PARQUET_FILE_SIZE is set to a value greater than or equal to 2GB.
  - If PARQUET_FILE_SIZE is set in an Impala session to a value greater than or equal to
    2GB, then every query will fail with an error message.
  - If PARQUET_FILE_SIZE is changed to a value greater than or equal to 2GB as an impalad
    argument, impalad will not start and log an error.
- Ceil(), RoundUp(), RoundDown() return int64.
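
The overflow can be reproduced with a short Python sketch that truncates the rounded value to 32 bits. The helper names, the rounding factor of 4096 and the error text are hypothetical; only the int32 versus int64 return width and the 2GB limit come from this message.

```python
import ctypes

GB = 1024 ** 3


def round_up_int32(value: int, factor: int) -> int:
    """Round value up to a multiple of factor, truncating the result to int32."""
    rounded = (value + factor - 1) // factor * factor
    return ctypes.c_int32(rounded).value  # wraps around like a C int32


def round_up_int64(value: int, factor: int) -> int:
    """Same rounding, but with an int64 result as after this change."""
    rounded = (value + factor - 1) // factor * factor
    return ctypes.c_int64(rounded).value


def validate_parquet_file_size(size_bytes: int) -> None:
    """Reject sizes of 2GB or more, which would still overflow in the HDFS API."""
    if size_bytes >= 2 * GB:
        raise ValueError("PARQUET_FILE_SIZE must be less than 2GB")


print(round_up_int32(2 * GB, 4096))  # negative: the value wrapped around
print(round_up_int64(2 * GB, 4096))  # 2147483648
```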

Change-Id: Ie4f2551b72954e2a57db5594e4789e3f7434d578
Reviewed-on: http://gerrit.cloudera.org:8080/678
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-08-25 23:28:13 +00:00
Alex Behm
6f0b255c5a Address several shortcomings with respect to the usability of Avro tables.
Addressed JIRAs: IMPALA-1947 and IMPALA-1813

New Feature:
Adds support for creating an Avro table without an explicit
Avro schema with the following syntax.

CREATE TABLE <table_name> column_defs STORED AS AVRO

Fixes and Improvements:
This patch fixes and unifies the logic for reconciling differences between
an Avro table's Avro Schema and its column definitions. This reconciliation
logic is executed during Impala's CREATE TABLE and when loading a table's
metadata. Impala generally performs the schema reconciliation during table
creation, but Hive does not. In many cases, Hive's CREATE TABLE stores the
original column definitions in the HMS (in the StorageDescriptor) instead
of the reconciled column definitions.

The reconciliation logic considers the field/column names and follows this
conflict resolution policy which is similar to Hive's:

Mismatched number of columns -> Prefer Avro columns.
Mismatched name/type -> Prefer Avro column, except:
  A CHAR/VARCHAR column definition maps to an Avro STRING, and is preserved
  as a CHAR/VARCHAR in the reconciled schema.

Behavior for TIMESTAMP:
A TIMESTAMP column definition maps to an Avro STRING and is presented as a STRING
in the reconciled schema, because Avro has no binary TIMESTAMP representation.
As a result, no Avro table may have a TIMESTAMP column (existing behavior).
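
The conflict resolution policy above can be sketched roughly as follows. This is a simplified Python model, not the FE implementation: columns are (name, type) string pairs, the function names are hypothetical, and applying the CHAR/VARCHAR exception only when the names match is an assumption.

```python
from typing import List, Tuple

Column = Tuple[str, str]  # (name, type), e.g. ("c", "VARCHAR(10)")


def _is_char_or_varchar(type_name: str) -> bool:
    return type_name.upper().startswith(("CHAR", "VARCHAR"))


def reconcile(col_defs: List[Column], avro_cols: List[Column]) -> List[Column]:
    """Reconcile column definitions with the columns of the Avro schema."""
    # Mismatched number of columns -> prefer the Avro columns.
    if len(col_defs) != len(avro_cols):
        return list(avro_cols)
    result = []
    for (def_name, def_type), (avro_name, avro_type) in zip(col_defs, avro_cols):
        # Exception: CHAR/VARCHAR over an Avro STRING is preserved.
        # Assumption: this only applies when the names agree.
        if (def_name == avro_name and _is_char_or_varchar(def_type)
                and avro_type.upper() == "STRING"):
            result.append((def_name, def_type))
        else:
            # Matching columns are identical either way; on a mismatch
            # the Avro column wins (this also covers TIMESTAMP -> STRING).
            result.append((avro_name, avro_type))
    return result


print(reconcile(
    [("a", "INT"), ("b", "VARCHAR(10)"), ("ts", "TIMESTAMP")],
    [("a", "INT"), ("b", "STRING"), ("ts", "STRING")],
))
```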

Change-Id: I8457354568b6049b2dd2794b65fadc06e619d648
Reviewed-on: http://gerrit.cloudera.org:8080/550
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-25 09:52:18 +00:00
Taras Bobrovytsky
75691156be IMPALA-2239: update misc.test to match the new .test file format
Change-Id: Ia5b9925628b415c306f320ef186246179e38f73b
Reviewed-on: http://gerrit.cloudera.org:8080/684
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-25 00:12:52 +00:00
Alex Behm
ae9fd52c51 IMPALA-2089: Retain eq predicates bound by grouping slots with complex grouping exprs.
The bug: When enforcing slot equivalences at an aggregation node, we used to
incorrectly assume that equivalences among grouping slots must have already been
enforced below the aggregation (e.g., in a scan). This assumption is correct if the
grouping slots are produced by simple SlotRef grouping exprs, because then there is
certainly a value transfer between the grouping slot and another slot below the
aggregation. However, for grouping slots with complex grouping exprs this assumption
is not correct, and as a result, we would incorrectly remove eq predicates bound by
grouping slots with complex grouping exprs because we assumed they were redundant.

The fix is to enforce slot equivalences among grouping slots with complex grouping
exprs as usual, and not assume that they have already been enforced below the agg.

Change-Id: Idcd44acccb9326a35c9121025dc88c2c70c7c7c7
Reviewed-on: http://gerrit.cloudera.org:8080/656
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-23 04:43:37 +00:00
Alex Behm
14a8cadcf6 Nested Types: Pretty print complex types in DESCRIBE.
The current DESCRIBE prints the column type as a single string without
whitespace. As a result, the DESCRIBE output for tables with complex types
is basically unreadable/unusable, e.g., from the Impala shell.

This patch adds a prettyPrint() function to the FE Type and uses that
for generating a nicely formatted DESCRIBE output.

The output of DESCRIBE FORMATTED is intentionally not modified because
exact Hive-compatibility has been and presumably continues to be very
important to our users.

Change-Id: Ida810facdffd970948b837b83a60f9ddcd95f44d
Reviewed-on: http://gerrit.cloudera.org:8080/633
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 09:26:35 +00:00
Taras Bobrovytsky
b8b7930377 Add nested types support to Create Table Like File
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.

Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 01:46:26 +00:00
Vlad Berindei
e4c42fa8bf IMPALA-595: Add CASCADE to DROP DATABASE and use it in cleanup_db
Change-Id: Idfa5b6943bc797e10d542487c31b8f1b527d8c97
Reviewed-on: http://gerrit.cloudera.org:8080/635
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-08-20 03:34:31 +00:00
Skye Wanderman-Milne
7906ed44ac IMPALA-2015: Add support for nested loop join
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.

Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.

Change-Id: I238ec7dc0080f661847e5e1b84e30d61c3b0bb5c
Reviewed-on: http://gerrit.cloudera.org:8080/652
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-19 08:40:14 +00:00
Tim Armstrong
5350d49f8c IMPALA-1829: UDAs with different intermediate type
Previously the frontend rejected UDAs with different intermediate and
result type. The backend supports these, so this change enables support
in the frontend and adds tests.

This patch adds a test UDA function with different intermediate type and
a simple end-to-end test that exercises it. It modifies an existing
unused test UDA that used a currently unsupported intermediate type -
BufferVal.

Change-Id: I5675ec7f275ea698c24ea8e92de7f469a950df83
Reviewed-on: http://gerrit.cloudera.org:8080/655
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-08-19 04:37:39 +00:00
Sailesh Mukil
1c46cab5c6 IMPALA-2084: SPLIT_PART and REGEXP_LIKE functions for Tableau pushdown
Added the SPLIT_PART and the REGEXP_LIKE builtin functions and tests for both.
The REGEXP_LIKE has an optional third parameter which if used, uses a different
'prepare' function (RegexpLikePrepare in like-predicate.cc) so that the appropriate
options can be set in the RE2 library.
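
For readers unfamiliar with the two builtins, here is a rough Python model of their behavior. It is a sketch under assumptions, not the Impala implementation: the 1-based field index, the empty-string result for an out-of-range field, and the option letters 'i' and 'n' are all assumed here, and Python's re module stands in for RE2.

```python
import re


def split_part(text: str, delimiter: str, field: int) -> str:
    """Return the field-th part of text split on delimiter (1-based, assumed)."""
    if field < 1:
        raise ValueError("field index must be >= 1")
    parts = text.split(delimiter)
    # Assumption: an out-of-range field yields an empty string.
    return parts[field - 1] if field <= len(parts) else ""


# Hypothetical option letters mapped to regex flags.
_OPTION_FLAGS = {"i": re.IGNORECASE, "n": re.DOTALL}


def regexp_like(text: str, pattern: str, options: str = "") -> bool:
    """Return True if pattern matches anywhere in text, honoring the options."""
    flags = 0
    for letter in options:
        if letter not in _OPTION_FLAGS:
            raise ValueError("unsupported option: " + letter)
        flags |= _OPTION_FLAGS[letter]
    return re.search(pattern, text, flags) is not None


print(split_part("a,b,c", ",", 2))
print(regexp_like("a\nb", "a.b"), regexp_like("a\nb", "a.b", "n"))
```

The 'n' case corresponds to the 'dot matches all' behavior that the RE2 patch mentioned below exposes.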

Added a patch for the RE2 library so that the 'dot matches all' option is exposed
via the RE2 class.

Fixed a bug where proper cleanup wasn't guaranteed in certain edge cases when the
function evaluated for the WHERE clause operates on constants.

Change-Id: Ia2a8de9eeb2854100a2d949f612cfaba317c5a7b
Reviewed-on: http://gerrit.cloudera.org:8080/501
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2015-08-18 09:07:34 +00:00
Alex Behm
f9d26fb896 IMPALA-2203: Set an InsertStmt's result exprs from the source statement's result exprs.
This patch fixes an issue where incorrect results are produced by a CTAS or IAS
that is fed from a QueryStmt that has outer-joined inline views with constants or
conditionals in the select list. The regression was introduced in this commit:
b8f642710ea9d311a7aca32611eaa7cac6cd86df

Now that the final expression substitution with TupleIsNullPredicate() wrapping
is performed in planning, the InsertStmt's result expressions should be taken
from the feeding QueryStmt's result expressions, and not the QueryStmt's
(already substituted) base table result expressions.

Change-Id: Iae29683638df01f140d0f74976cca8ca9ba0852d
Reviewed-on: http://gerrit.cloudera.org:8080/637
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-18 01:44:45 +00:00
Casey Ching
cf60967b7e IMPALA-1675: Avoid overflow when adding large intervals to TIMESTAMPs
It turns out there is a variety of cases where boost incorrectly adds
intervals if the interval is at (or beyond) an edge case value. This
change defines a max interval and returns NULL if the user supplies
an interval beyond the max.
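
The guard can be sketched in Python as follows, with None standing in for NULL. The limit MAX_INTERVAL_DAYS is a hypothetical value chosen for illustration, and the extra OverflowError handling reflects Python's datetime range rather than anything in this patch.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical limit; the actual maximum interval is not stated in this message.
MAX_INTERVAL_DAYS = 3_650_000


def add_days(ts: datetime, days: int) -> Optional[datetime]:
    """Add a day interval to ts, returning None (NULL) for oversized intervals."""
    if abs(days) > MAX_INTERVAL_DAYS:
        return None
    try:
        return ts + timedelta(days=days)
    except OverflowError:
        # Python-specific: the result fell outside datetime's supported range.
        return None


print(add_days(datetime(2015, 8, 16), 1))
print(add_days(datetime(2015, 8, 16), 10**9))
```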

Change-Id: I4fb6869be22ab06089b66eeffaea04b0c0880080
Reviewed-on: http://gerrit.cloudera.org:8080/492
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-08-16 12:09:24 +00:00
Christopher Channing
9ea5caf0ef IMPALA-2199: Row count not set for empty partition when spec is used with compute incremental stats
This patch resolves an issue where row count is not set to 0 when a partition spec is
used with 'compute incremental stats' on a partition that contains no data. The fix is
to populate the partition 'expected list' in the frontend with the partition spec, while the
backend keeps track of which partitions had statistics generated. In the scenario where
no statistics are generated for a partition, the backend will fall back to the
'expected list' to zero out the statistics.
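
The fallback can be illustrated with a small Python sketch. Partition names as strings and row counts as a dict are hypothetical representations chosen for the example; only the idea of zeroing out expected partitions that produced no statistics comes from this message.

```python
from typing import Dict, Iterable


def finalize_row_counts(expected_partitions: Iterable[str],
                        computed_row_counts: Dict[str, int]) -> Dict[str, int]:
    """Merge computed stats with the expected list.

    Any expected partition for which no statistics were generated
    (for example, because it contains no data) gets a row count of 0.
    """
    result = dict(computed_row_counts)
    for partition in expected_partitions:
        result.setdefault(partition, 0)
    return result


# An empty partition produces no computed stats, so it falls back to 0.
print(finalize_row_counts(["year=2015/month=8"], {}))
```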

Change-Id: If4aac131dbe44e14a0477afa58e980da9e235d6b
Reviewed-on: http://gerrit.cloudera.org:8080/627
Reviewed-by: Christopher Channing <cchanning@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 09:38:30 +00:00
Dimitris Tsirogiannis
47c5ae405a Revert "IMPALA-2015: Add support for nested loop join"
This reverts commit 6837cdec7f6a7e1c7e8157e323f3ab68277689aa.

Change-Id: I2fd6424c553a701fcbfd425b4486af7280820b23
Reviewed-on: http://gerrit.cloudera.org:8080/636
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-13 02:20:07 +00:00
Sailesh Mukil
8f11fbdd5c IMPALA-2081: Add PERCENT_RANK, NTILE, CUME_DIST analytic window functions
These functions are implemented as rewrites in the analysis stage. They are rewritten as
different arithmetic expressions and make use of the existing analytic functions such as
'rank', 'count' and 'row_number' to compute the final results.
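
As an illustration of how these functions reduce to 'rank', 'count' and 'row_number', the Python sketch below applies the standard SQL definitions to one already-sorted partition. The exact expressions generated by the rewrite are not shown in this message, so the formulas here are the textbook ones and the function names are hypothetical.

```python
from typing import List, Sequence


def ranks(sorted_values: Sequence[int]) -> List[int]:
    """rank(): 1 + number of rows ordered strictly before the current row."""
    return [1 + sum(1 for v in sorted_values if v < x) for x in sorted_values]


def ranks_desc(sorted_values: Sequence[int]) -> List[int]:
    """rank() computed over the descending order."""
    return [1 + sum(1 for v in sorted_values if v > x) for x in sorted_values]


def percent_rank(sorted_values: Sequence[int]) -> List[float]:
    count = len(sorted_values)
    return [(r - 1) / (count - 1) if count > 1 else 0.0
            for r in ranks(sorted_values)]


def cume_dist(sorted_values: Sequence[int]) -> List[float]:
    count = len(sorted_values)
    return [(count - r + 1) / count for r in ranks_desc(sorted_values)]


def ntile(sorted_values: Sequence[int], buckets: int) -> List[int]:
    """Bucket number per row; the first (count % buckets) buckets get one extra row."""
    count = len(sorted_values)
    small, extra = divmod(count, buckets)
    rows_in_big_buckets = extra * (small + 1)
    result = []
    for row_number in range(1, count + 1):
        if row_number <= rows_in_big_buckets:
            result.append((row_number - 1) // (small + 1) + 1)
        else:
            result.append(extra + (row_number - 1 - rows_in_big_buckets) // small + 1)
    return result


values = [10, 20, 20, 30]
print(percent_rank(values), cume_dist(values), ntile(values, 3))
```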

TODO: IMPALA-2171: NTILE() currently takes only constant expressions. We need to modify
it to take non-constant expressions as well in a future patch.

Change-Id: I8773df8ceefff27ab66a41169dc4ac0927465191
Reviewed-on: http://gerrit.cloudera.org:8080/584
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2015-08-07 04:57:37 +00:00
Skye Wanderman-Milne
f000758ca8 IMPALA-2015: Add support for nested loop join
Implement nested-loop join in Impala with support for multiple join
modes, including inner, outer, semi and anti joins. Null-aware left
anti-join is not currently supported.

Summary of changes:
Introduced the NestedLoopJoinNode class in the FE that represents the nested
loop join. Common functionality between NestedLoopJoinNode and HashJoinNode
(e.g. cardinality estimation) was moved to the JoinNode class.
In the BE, introduced the NestedLoopJoinNode class that implements the nested-loop
join execution strategy.

Change-Id: Id65a1aae84335bba53f06339bdfa64a1b0be079e
Reviewed-on: http://gerrit.cloudera.org:8080/457
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2015-08-07 02:47:32 +00:00
Alex Behm
480a56e3a0 IMPALA-1737: Substitute an InsertStmt's partition key exprs with the root node's smap.
The bug was that we were not substituting the partition key exprs of an InsertStmt
with the root plan node's output smap during single-node planning.

Change-Id: I16eff4bab0b1d95c7f30fd89b14af2628d6f865f
Reviewed-on: http://gerrit.cloudera.org:8080/580
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-03 19:31:51 +00:00
Alex Behm
c908ba1b7e IMPALA-1136: Support loading Avro tables without an explicit Avro schema
Hive allows creating Avro tables without an explicit Avro schema since 0.14.0.
For such tables, the Avro schema is inferred from the column definitions,
and not stored in the metadata at all (no Avro schema literal or Avro schema file).

This patch adds support for loading the metadata of such tables, although Impala
currently cannot create such tables (expect a follow-on patch).

Change-Id: I9e66921ffbeff7ce6db9619bcfb30278b571cd95
Reviewed-on: http://gerrit.cloudera.org:8080/538
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-07-31 12:13:37 +00:00
Tim Armstrong
e151ebaa71 IMPALA-1001: Bit and byte manipulation functions
Bit and byte functions for compatibility with Teradata: bitand, bitor, bitxor, bitnot,
countset, getbit, setbit, shiftleft, shiftright, rotateleft, rotateright.
Interfaces and behavior follow Teradata documentation.

All bit* functions are compatible with DB2. Only bitand is compatible with Oracle.
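
To give a feel for a few of these functions, here is a Python sketch on unsigned fixed-width values. It is an approximation under assumptions: bit position 0 is taken to be the least significant bit, the default width of 8 bits is arbitrary, and the handling of signed types in Impala is not modeled.

```python
def _mask(width: int) -> int:
    return (1 << width) - 1


def countset(value: int, width: int = 8) -> int:
    """Number of bits set to 1 within the given width."""
    return bin(value & _mask(width)).count("1")


def getbit(value: int, pos: int) -> int:
    """Bit at position pos (assumption: 0 is the least significant bit)."""
    return (value >> pos) & 1


def setbit(value: int, pos: int, bit: int = 1) -> int:
    """Set the bit at pos to 1, or clear it when bit is 0."""
    return value | (1 << pos) if bit else value & ~(1 << pos)


def rotateleft(value: int, n: int, width: int = 8) -> int:
    """Rotate left by n; bits shifted out on the left re-enter on the right."""
    n %= width
    value &= _mask(width)
    return ((value << n) | (value >> (width - n))) & _mask(width)


def rotateright(value: int, n: int, width: int = 8) -> int:
    """Rotate right by n; bits shifted out on the right re-enter on the left."""
    n %= width
    value &= _mask(width)
    return ((value >> n) | (value << (width - n))) & _mask(width)


print(countset(0b1011), getbit(0b100, 2), setbit(0, 3))
print(bin(rotateleft(0b10000001, 1)), bin(rotateright(0b00000011, 1)))
```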

Change-Id: Idba3fb7beb029de493b602e6279aa68e32688df3
2015-07-28 08:11:01 -07:00
Sailesh Mukil
8a01527bad IMPALA-2141: UnionNode::GetNext() doesn't check for query errors
When a UDF with constant parameters in the select list calls SetError(), it does not fail
the query. This is because UnionNode::GetNext() does not check for errors after
UnionNode::EvalAndMaterializeExprs() evaluates the expression, which itself does not
report the error.

Change-Id: I8850cf1a603e320bb23f4a9a4d47600d14590f3a
2015-07-27 22:09:19 -07:00
Alex Behm
3ac341287c IMPALA-2088: Fix planning of empty union operands with analytics.
The check for ignoring empty union operands was simply misplaced.
This misplacement resulted in empty union operands not being
dropped if the containing UnionStmt had analytic functions.

Change-Id: I3dad546c0c31a495e5f30d97c3e49465fcc2ebb3
Reviewed-on: http://gerrit.cloudera.org:8080/554
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-07-27 15:46:41 -07:00
Tim Armstrong
822cb8f5e2 IMPALA-1660: Netezza compatibility - factorial
Implements the suffix n! operator for factorial and the factorial function.

Slightly refactor operators in fe to share code between unary operators.

Based partially on work by Arthur Peng <arthur.peng@intel.com>.

Change-Id: I71b6c824c59fc5305f16b8c4457805126a1da93b
Reviewed-on: http://gerrit.cloudera.org:8080/531
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-07-27 19:03:48 +00:00
Sailesh Mukil
c21c080a46 IMPALA-1756: Constant expressions not checked for errors, no state cleanup on exception.
Changed the way the function context error message is returned. Also, changed the
exception thrown in SingleNodePlanner from IllegalStateException to AnalysisException
in case of an exception in registerConjuncts().

This commit follows from:

d497ba6cef

This is a new commit since the previous one was closed before making these changes.

Change-Id: Ifa9b7c0884d76b6d7911d8cd80355a8ba13c4c18
Reviewed-on: http://gerrit.cloudera.org:8080/560
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-07-24 19:04:38 +00:00
Tim Armstrong
5990b43fe2 IMPALA-1898: Explicit aliases + ordinals analysis bug
Analysis errors occurred with select queries that combined ordinals
in the group by/order by clauses with select list aliases that
had the same name as a column in one of the underlying tables.

The root cause was a double substitution: e.g. the ordinal 1 in
a GROUP BY clause was replaced with the corresponding select list expression,
then a reference to column 'x' in an underlying table was replaced erroneously
with the select list expression with alias 'x'.

Change-Id: I0f298290c58f18239e1ff83f0388d037c311f5fb
Reviewed-on: http://gerrit.cloudera.org:8080/542
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2015-07-22 21:23:36 +00:00
Sailesh Mukil
6d7bb76e87 IMPALA-1756: Constant filter expressions are not checked for errors and state cleanup is
not done before throwing exception.

When a builtin has an error (in the constant case), it is checked for but the state
cleanup isn't taken care of which results in a DCHECK. When a UDF has an error (in the
constant case), the error does not propagate back up the stack due to a lack of error
checking in ScalarFnCall::Open() after it calls GetConstVal().

Change-Id: Ib500c84a41df574690369f124044991ed8c82cc1
Reviewed-on: http://gerrit.cloudera.org:8080/537
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-07-21 04:01:39 +00:00
Casey Ching
a6d534682b IMPALA-2086, IMPALA-2090: Avoid boost year/month interval logic
Boost handles a couple of edge cases differently than other databases
such as Postgres and MySQL when adding year/month intervals to
timestamps. This change makes Impala consistent with those databases.
The performance difference was not noticeable (<5% if any).
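
This message does not list the edge cases. One well-known difference of this kind, assumed here for illustration, is end-of-month handling: Postgres and MySQL clamp the day to the last day of the target month, whereas Boost's month arithmetic snaps a last-day-of-month input to the last day of the result month. A minimal Python sketch of the clamping behavior, with a hypothetical function name:

```python
import calendar
from datetime import datetime


def add_months(ts: datetime, months: int) -> datetime:
    """Add a month interval, clamping the day to the target month's last day."""
    month_index = ts.year * 12 + (ts.month - 1) + months
    year, month0 = divmod(month_index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return ts.replace(year=year, month=month, day=min(ts.day, last_day))


print(add_months(datetime(2015, 1, 31), 1))  # day clamped to the end of February
print(add_months(datetime(2015, 2, 28), 1))  # stays on the 28th, no snap to the 31st
```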

Change-Id: Icb02a06281b53753938cab88e0d28f20709fee06
Reviewed-on: http://gerrit.cloudera.org:8080/489
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
2015-07-20 10:16:54 +00:00
Shant Hovsepian
6d87fe090c Improve Hll estimate for small cardinalities.
Based on Google's HyperLogLog++ paper. Uses a bias correcting
interpolation as a sub algorithm for Hll estimates within a specific
range.

Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2015-07-16 19:38:17 +00:00
Ippokratis Pandis
7e9f8478e1 Removing duplicate query test
Change-Id: Ia8b33ca2a2eadae288acea4bd2111a1a974bc484
Reviewed-on: http://gerrit.cloudera.org:8080/526
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-07-15 03:28:36 +00:00
Ippokratis Pandis
e99c68fe52 IMPALA-2130: Wrong verification of Parquet file version
This patch corrects a mistake in the Parquet file magic number verification
and adds a test for it. Note that with this patch Impala may fail to read
Parquet files with a wrong magic number that it used to read before.
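
For context, the Parquet format places the 4-byte magic "PAR1" at both the start and the end of a file. The Python sketch below checks both positions on an in-memory byte string; it illustrates the kind of verification involved and is not the scanner code touched by this patch. The function name is hypothetical.

```python
PARQUET_MAGIC = b"PAR1"


def has_valid_parquet_magic(data: bytes) -> bool:
    """Check that data starts and ends with the Parquet magic number."""
    if len(data) < 2 * len(PARQUET_MAGIC):
        return False
    return (data[:len(PARQUET_MAGIC)] == PARQUET_MAGIC
            and data[-len(PARQUET_MAGIC):] == PARQUET_MAGIC)


print(has_valid_parquet_magic(b"PAR1" + b"\x00" * 8 + b"PAR1"))
print(has_valid_parquet_magic(b"PAR1" + b"\x00" * 8 + b"PARX"))
```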

Change-Id: Iff31accda1e1d541946ef1f750e38886ce4cb8d5
Reviewed-on: http://gerrit.cloudera.org:8080/515
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-07-14 02:52:02 +00:00
Martin Grund
51aa077448 IMPALA-2133: Properly unescape string value for HBase filters
This patch fixes the problem that the Frontend would simply pass the
escaped value to the backend as an HBase filter and not the unescaped
one. Now queries including an escaped character will work as well.

Change-Id: I96e544973b523f3ef1abdec86ea1ec5596d9bee9
Reviewed-on: http://gerrit.cloudera.org:8080/520
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Internal Jenkins
2015-07-13 18:38:39 +00:00
Ippokratis Pandis
4951f895e7 Nested Types: Reset() for partitioned hash join node
TODO: Need to modify Reset()'s functionality in case of NAAJs.

Change-Id: I7d0ea0dabd0b3404957e228bbaa51781c5fc34c0
Reviewed-on: http://gerrit.cloudera.org:8080/490
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: Internal Jenkins
2015-07-08 01:51:09 +00:00