Commit Graph

805 Commits

Author SHA1 Message Date
Dimitris Tsirogiannis
d802f321b2 IMPALA-3724: Support Kudu non-covering range partitions
This commit adds support for non-covering range partitions in Kudu
tables. The SPLIT ROWS clause is now deprecated and no longer supported.
The following new syntax provides more flexibility in creating range
partitions and it supports bounded and unbounded ranges as well as single value
partitions; multi-column range partitions are supported as well.

The new syntax is:
DISTRIBUTE BY RANGE (col_list)
(
 PARTITION lower_1 <[=] VALUES <[=] upper_1,
 PARTITION lower_2 <[=] VALUES <[=] upper_2,
             ....
 PARTITION lower_n <[=] VALUES <[=] upper_n,
 PARTITION VALUE = val_1,
             ....
 PARTITION VALUE = val_n
)

Multi-column range partitions are specified as follows:
DISTRIBUTE BY RANGE (col1, col2,..., coln)
(
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val),
                     ....
 PARTITION VALUE = (col1_val, col2_val, ..., coln_val)
)
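
The practical meaning of "non-covering" is that a key may fall into no partition at all. The following is a minimal Python sketch of that lookup for a single integer column; the class and function names are hypothetical and are not part of Impala or Kudu.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RangePartition:
    """One PARTITION clause; a bound of None means that side is unbounded."""
    lower: Optional[int] = None
    upper: Optional[int] = None
    lower_inclusive: bool = False
    upper_inclusive: bool = False

    def contains(self, value: int) -> bool:
        if self.lower is not None:
            if value < self.lower or (value == self.lower and not self.lower_inclusive):
                return False
        if self.upper is not None:
            if value > self.upper or (value == self.upper and not self.upper_inclusive):
                return False
        return True


def find_partition(partitions, value):
    """Return the index of the first partition holding value, or None if uncovered."""
    for index, partition in enumerate(partitions):
        if partition.contains(value):
            return index
    return None


partitions = [
    RangePartition(lower=0, upper=10, lower_inclusive=True),    # 0 <= VALUES < 10
    RangePartition(lower=20, upper=30, lower_inclusive=True),   # 20 <= VALUES < 30
    RangePartition(lower=50, upper=50, lower_inclusive=True,
                   upper_inclusive=True),                       # VALUE = 50
]
```

With these partitions, the keys 10 through 19 belong to no partition, which is exactly the gap a covering scheme would not allow.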

Change-Id: I6799c01a37003f0f4c068d911a13e3f060110a06
Reviewed-on: http://gerrit.cloudera.org:8080/4856
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
2016-11-04 22:02:22 +00:00
Matthew Jacobs
32294220c4 IMPALA-4379: Fix and test Kudu table type checking, follow up
The first fix for IMPALA-4379 went in before all comments
were addressed. First commit: 9b507b6.

This addresses some follow-up comments about how to handle
ALTER TABLE setting the storage_handler table property,
which doesn't really make sense to ever allow.

Change-Id: I93d04a04483af598b392c28874363e3b0202e1f3
Reviewed-on: http://gerrit.cloudera.org:8080/4894
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-11-04 06:54:18 +00:00
Laszlo Gaal
9af59bfe2b IMPALA-4153: Fix count(*) on all blank('') columns - test
This change adds test coverage for the fixes committed for
IMPALA-2399 in commit 9ed3b685a1.
It uses the table nulltable in the workload functional-query
to verify the materialization and counting of NULL and empty-
valued columns. The test can be run on any supported storage
and compression combination.

Change-Id: I23923f95f43d67977ee1520a1fc09ce297548b3f
Reviewed-on: http://gerrit.cloudera.org:8080/4755
Tested-by: Internal Jenkins
Reviewed-by: Jim Apple <jbapple@cloudera.com>
2016-11-03 23:08:56 +00:00
Alex Behm
6a16e44a2a Add functional tests for compute stats with mt_dop > 0.
Change-Id: Icd4e7e44f9f23f66e59ad1fb298e13da76ad817a
Reviewed-on: http://gerrit.cloudera.org:8080/4879
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-03 11:59:07 +00:00
Alex Behm
795c085fa3 IMPALA-4336: Cast exprs after unnesting union operands.
The bug was that we cast the result exprs of operands before
unnesting them. If we unnested an operand, casts were missing
on those unnested operands' result exprs.

The fix is to first unnest operands and then cast result exprs.

Also clarifies the use of resultExprs vs. baseTblResultExprs.

Change-Id: I5e3ab7349df7d67d0d9c2baf4a56342d3f04e76d
Reviewed-on: http://gerrit.cloudera.org:8080/4815
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-11-03 08:59:45 +00:00
Alex Behm
4918b20ac0 IMPALA-4408: Omit null bytes for Kudu scans with no nullable slots.
Kudu does not allocate null bytes if all projected columns are
non-nullable. Otherwise, Kudu allocates a null bit for all columns,
even the non-nullable ones. The bug was that Impala's memory layout
did not match the first requirement.
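
The layout rule described above can be sketched in Python as follows. The function name is hypothetical, and rounding the null bits up to whole bytes is an assumption that the commit message does not state.

```python
def num_null_bytes(column_nullability):
    """column_nullability holds one bool per projected column (True = nullable)."""
    if not any(column_nullability):
        # No nullable column is projected: no null bytes are allocated at all.
        return 0
    # Otherwise every projected column gets a null bit, nullable or not.
    # Assumption: the bits are packed and rounded up to whole bytes.
    return (len(column_nullability) + 7) // 8
```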

Change-Id: I762ad9d5cc4198922ea4b5218c504fde355c49a5
Reviewed-on: http://gerrit.cloudera.org:8080/4892
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-11-01 01:47:30 +00:00
Matthew Jacobs
9b507b6ed6 IMPALA-4379: Fix and test Kudu table type checking
Creating Kudu tables shouldn't allow types not supported by
Kudu (e.g. VARCHAR/CHAR, DECIMAL, TIMESTAMP, collection types).
The behavior is inconsistent: for some types it throws in
the catalog, for VARCHAR/CHAR these become strings. This changes
behavior so that all fail during analysis. Analysis tests
were added.

Similarly, external tables cannot contain Kudu types that
Impala doesn't support (e.g. UNIXTIME_MICROS, BINARY). Tests
were added to validate this behavior. Note that this
required upgrading the python Kudu client.

This also fixes a small corner case with ALTER TABLE:
ALTER TABLE shouldn't allow Kudu tables to change the
storage descriptor tblproperty, otherwise the table metadata
gets in an inconsistent state.

Tests were added for all of the above.

Change-Id: I475273cbbf4110db8d0f78ddf9a56abfc6221e3e
Reviewed-on: http://gerrit.cloudera.org:8080/4857
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-10-31 16:03:54 +00:00
Alex Behm
f7d71950e3 IMPALA-4369: Avoid DCHECK in Parquet scanner with MT_DOP > 0.
When HdfsParquetScanner::Open() failed we used to hit a DCHECK
when trying to access HdfsParquetScanner::batch() which is
only valid to call for non-MT scan nodes.

Change-Id: Ifbfdde505dbbd2742e7ab79a2415ff317a9bfa2f
Reviewed-on: http://gerrit.cloudera.org:8080/4851
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-10-26 22:21:19 +00:00
Lars Volker
c24e9da914 IMPALA-2521: Add clustered hint to insert statements
This change introduces a clustered/noclustered hint for insert
statements. Specifying this hint adds an additional sort node to the
plan, just before the table sink. This has the effect that data will be
clustered by its partition prior to writing partitions, which therefore
can be written sequentially.
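
The effect of the extra sort can be illustrated with a short Python sketch. It is not the planner or sink code; the function name and the row layout (partition key in the first field) are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def write_clustered(rows, partition_key=itemgetter(0)):
    """Sort rows by partition key, then emit one (key, rows) group per partition."""
    written = []
    ordered = sorted(rows, key=partition_key)
    for key, group in groupby(ordered, key=partition_key):
        # Each partition is visited exactly once, so it can be written sequentially.
        written.append((key, list(group)))
    return written


rows = [("2016-11", "a"), ("2016-10", "b"), ("2016-11", "c"), ("2016-10", "d")]
clustered = write_clustered(rows)
```

Without the sort, rows for the two partitions would interleave and the writer would need both partitions open at the same time.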

Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14
Reviewed-on: http://gerrit.cloudera.org:8080/4745
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-26 04:56:14 +00:00
Michael Ho
13455b5a24 IMPALA-3884: Support TYPE_TIMESTAMP for HashTableCtx::CodegenAssignNullValue()
This change implements support for TYPE_TIMESTAMP for
HashTableCtx::CodegenAssignNullValue(). TimestampValue itself
is 16 bytes in size. To match RawValue::Write() in the
interpreted path, CodegenAssignNullValue() emits code to assign
HashUtil::FNV_SEED to both the upper and lower 64-bit of the
destination value. This change also fixes the handling of 128-bit
Decimal16Value in CodegenAssignNullValue() so the emitted code
matches the behavior of the interpreted path.

Change-Id: I0211d38cbef46331e0006fa5ed0680e6e0867bc8
Reviewed-on: http://gerrit.cloudera.org:8080/4794
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Michael Ho <kwho@cloudera.com>
2016-10-25 05:52:33 +00:00
Matthew Jacobs
99ed6dc67a IMPALA-4134,IMPALA-3704: Kudu INSERT improvements
1.) IMPALA-4134: Use Kudu AUTO FLUSH
Improves performance of writes to Kudu up to 4.2x in
bulk data loading tests (load 200 million rows from
lineitem).

2.) IMPALA-3704: Improve errors on PK conflicts
The Kudu client reports an error for every PK conflict,
and all errors were being returned in the error status.
As a result, inserts/updates/deletes could return errors
with thousands of errors reported. This changes the error
handling to log all reported errors as warnings and
return only the first error in the query error status.

3.) Improve the DataSink reporting of the insert stats.
The per-partition stats returned by the data sink weren't
useful for Kudu sinks. Firstly, the number of appended rows
was not being displayed in the profile. Secondly, the
'stats' field isn't populated for Kudu tables and thus was
confusing in the profile, so it is no longer printed if it
is not set in the thrift struct.
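
The error handling described in item 2 can be sketched in Python as below. The function and logger names are hypothetical, and the sketch treats errors as plain strings rather than Kudu client error objects.

```python
import logging


def summarize_errors(errors, logger=logging.getLogger("kudu_sink_sketch")):
    """Log every reported error as a warning; return only the first one."""
    for error in errors:
        logger.warning("%s", error)
    # The query error status carries just the first error (None means success).
    return errors[0] if errors else None
```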

Testing: Ran local tests, including new tests to verify
the query profile insert stats. Manual cluster testing was
conducted of the AUTO FLUSH functionality, and that testing
informed the default mutation buffer value of 100MB which
was found to provide good results.

Change-Id: I5542b9a061b01c543a139e8722560b1365f06595
Reviewed-on: http://gerrit.cloudera.org:8080/4728
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-10-25 02:06:10 +00:00
Alex Behm
ff6b450ad3 IMPALA-4285/IMPALA-4286: Fixes for Parquet scanner with MT_DOP > 0.
IMPALA-4285: The problem was that there was a reference to
HdfsScanner::batch_ hidden inside WriteEmptyTuples(). The batch_
reference is NULL when the scanner is run with MT_DOP > 1.

IMPALA-4286: When there are no scan ranges HdfsScanNodeBase::Open()
exits early without initializing the reader context. This led to
a DCHECK in IoMgr::GetNextRange() called from HdfsScanNodeMt.
The fix is to remove that unnecessary short-circuit Open().

I combined these two bugfixes because the new basic test covers
both cases.

Testing: Added a new test_mt_dop.py test. A private code/hdfs
run passed.

Change-Id: I79c0f6fd2aeb4bc6fa5f87219a485194fef2db1b
Reviewed-on: http://gerrit.cloudera.org:8080/4767
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-22 10:24:24 +00:00
Michael Ho
51268c053f IMPALA-4120: Incorrect results with LEAD() analytic function
This change fixes a memory management problem with LEAD()/LAG()
analytic functions which led to incorrect results. In particular,
the update functions specified for these analytic functions only
make a shallow copy of StringVal (i.e. copying only the pointer
and the length of the string) without copying the string itself.
This may lead to problems if the string is created from some UDFs
which do local allocations whose buffer may be freed and reused
before the result tuple is copied out. This change fixes the problem
above by allocating a buffer at the Init() functions of these
analytic functions to track the intermediate value. In addition,
when the value is copied out in GetValue(), it will be copied into
the MemPool belonging to the AnalyticEvalNode and attached to the
outgoing row batches. This change also fixes a missing free of
local allocations in QueryMaintenance().
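
The difference between the shallow and the deep copy can be shown with a small Python analogy. This is only an illustration of the failure mode, not the StringVal or MemPool code: a memoryview stands in for a pointer-plus-length copy, and a bytearray stands in for a UDF's local allocation.

```python
buffer = bytearray(b"abc")        # stands in for a UDF's local allocation

shallow = memoryview(buffer)      # pointer and length only, no bytes copied
deep = bytes(buffer)              # private copy of the string bytes

# The allocation is reused for another value before the result is read.
for index, byte in enumerate(b"xyz"):
    buffer[index] = byte

shallow_result = bytes(shallow)   # now reflects the reused buffer
deep_result = deep                # still holds the original value
```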

Change-Id: I85bb1745232d8dd383a6047c86019c6378ab571f
Reviewed-on: http://gerrit.cloudera.org:8080/4740
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-10-22 07:39:37 +00:00
Dimitris Tsirogiannis
041fa6d946 IMPALA-3719: Simplify CREATE TABLE statements with Kudu tables
With this commit we simplify the syntax and handling of CREATE TABLE
statements for both managed and external Kudu tables.

Syntax example:
CREATE TABLE foo(a INT, b STRING, PRIMARY KEY (a, b))
DISTRIBUTE BY HASH (a) INTO 3 BUCKETS,
RANGE (b) SPLIT ROWS (('abc', 'def'))
STORED AS KUDU

Changes:
1) Remove the requirement to specify table properties such as key
   columns in tblproperties.
2) Read table schema (column definitions, primary keys, and distribution
   schemes) from Kudu instead of the HMS.
3) For external tables, the Kudu table is now required to exist at the
   time of creation in Impala.
4) Disallow table properties that could conflict with an existing
   table. Ex: key_columns cannot be specified.
5) Add KUDU as a file format.
6) Add a startup flag to impalad to specify the default Kudu master
   addresses. The flag is used as the default value for the table
   property kudu_master_addresses but it can still be overridden
   using TBLPROPERTIES.
7) Fix a post merge issue (IMPALA-3178) where DROP DATABASE CASCADE
   wasn't implemented for Kudu tables and silently ignored. The Kudu
   tables wouldn't be removed in Kudu.
8) Remove DDL delegates. There was only one functional delegate (for
   Kudu); the existence of the other delegate and the use of delegates in
   general have led to confusion. The Kudu delegate only exists to provide
   functionality missing from Hive.
9) Add PRIMARY KEY at the column and table level. This syntax is fairly
   standard. When used at the column level, only one column can be
   marked as a key. When used at the table level, multiple columns can
   be used as a key. Only Kudu tables are allowed to use PRIMARY KEY.
   The old "kudu.key_columns" table property is no longer accepted
   though it is still used internally. "PRIMARY" is now a keyword.
   The ident style declaration is used for "KEY" because it is also used
   for nested map types.
10) For managed tables, infer a Kudu table name if none was given.
   The table property "kudu.table_name" is optional for managed tables
   and is required for external tables. If for a managed table a Kudu
   table name is not provided, a table name will be generated based
   on the HMS database and table name.
11) Use Kudu master as the source of truth for table metadata instead
   of HMS when a table is loaded or refreshed. Table/column metadata
   are cached in the catalog and are stored in HMS in order to be
   able to use table and column statistics.

Change-Id: I7b9d51b2720ab57649abdb7d5c710ea04ff50dc1
Reviewed-on: http://gerrit.cloudera.org:8080/4414
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-21 10:52:25 +00:00
Taras Bobrovytsky
bf1d9677fc IMPALA-4155: Update default partition when table is altered
If the table format is changed by the Alter Table statement, the
default partition in partitioned tables used to not get updated. This
caused a problem because Insert picks up the file format for new
partitions from the default partition. This patch fixes the problem by
calling addDefaultPartition().

Also removed "drop table if not exists" in tests in alter-table.test
because we already have the unique_database fixture.

Change-Id: I59bf21caa5c5e7867d07d87cda0c0a5b4b994859
Reviewed-on: http://gerrit.cloudera.org:8080/4750
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-20 23:47:52 +00:00
Michael Ho
b15d992abe IMPALA-4080, IMPALA-3638: Introduce ExecNode::Codegen()
This patch is mostly a mechanical move of codegen related logic
from each exec node's Prepare() to its Codegen() function.
After this change, code generation will no longer happen in
Prepare(). Instead, it will happen after Prepare() completes in
PlanFragmentExecutor. This is an intermediate step towards the
final goal of sharing compiled code among fragment instances in
multi-threading.

As part of the clean up, this change also removes the logic for
lazy codegen object creation. In other words, if codegen is enabled,
the codegen object will always be created. This simplifies some
of the logic in ScalarFnCall::Prepare() and various Codegen()
functions by reducing error checking needed. This change also
removes the logic added for tackling IMPALA-1755 as it's not
needed anymore after the clean up.

The clean up also rectifies a not so well documented situation.
Previously, even if a user explicitly sets DISABLE_CODEGEN to true,
we may still codegen a UDF if it was written in LLVM IR or if it
has more than 8 arguments. This patch enforces the query option
by failing the query in both cases. To run the query, the user
must enable codegen. This change also extends the number of
arguments supported in the interpretation path of ScalarFn to 20.

Change-Id: I207566bc9f4c6a159271ecdbc4bbdba3d78c6651
Reviewed-on: http://gerrit.cloudera.org:8080/4651
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-10-19 08:18:37 +00:00
Alex Behm
2a04b0e21a IMPALA-3943: Address post-merge comments.
Adds code comments and issues a warning for Parquet files
with num_rows=0 but at least one non-empty row group.

Change-Id: I72ccf00191afddb8583ac961f1eaf11e5eb28791
Reviewed-on: http://gerrit.cloudera.org:8080/4696
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-14 05:41:22 +00:00
Lars Volker
ef4c9958d0 IMPALA-4047: Remove occurrences of 'CDH'/'cdh' from repo
This change removes some of the occurrences of the strings 'CDH'/'cdh'
from the Impala repository. References to Cloudera-internal Jiras have
been replaced with upstream Jira issues on issues.cloudera.org.

For several categories of occurrences (e.g. pom.xml files,
DOWNLOAD_CDH_COMPONENTS) I also created a list of follow-up Jiras to
remove the occurrences left after this change.

Change-Id: Icb37e2ef0cd9fa0e581d359c5dd3db7812b7b2c8
Reviewed-on: http://gerrit.cloudera.org:8080/4187
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-13 00:40:41 +00:00
Alex Behm
0449b5beab IMPALA-3943: Do not throw scan errors for empty Parquet files.
For Parquet files with no row groups but with num_rows=0 in the
file footer the Parquet scanner returns an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions which used to accept such files.

This patch restores the previous behavior and adds tests.

Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Reviewed-on: http://gerrit.cloudera.org:8080/4693
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-12 09:22:57 +00:00
Alex Behm
d9dc9090ea IMPALA-4237: Fix materialization of 4 byte decimals in data source scan node.
There was a missing break in a switch statement leading to bad fallthrough.

An existing test already expected the incorrect results. It now covers
the bug by expecting the correct results.

Change-Id: I5340c2eda813afc032ba72203bd59eb3f2c4f482
Reviewed-on: http://gerrit.cloudera.org:8080/4585
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-10-07 03:36:43 +00:00
Bharath Vissapragada
64c394827a IMPALA-4196: Cross compile bit-byte-functions
Change-Id: I5a1291bfd202b500405a884e4a62f0ca2447244a
Reviewed-on: http://gerrit.cloudera.org:8080/4557
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-10-01 01:42:21 +00:00
Michael Ho
2a31fbdbfa IMPALA-4180: Synchronize accesses to RuntimeState::reader_contexts_
HdfsScanNodeBase::Close() may add its outstanding DiskIO context to
RuntimeState::reader_contexts_ to be unregistered later when the
fragment is closed. In a plan fragment with multiple HDFS scan nodes,
it's possible for HdfsScanNodeBase::Close() to be called concurrently.
To allow safe concurrent accesses, this change adds a SpinLock to
synchronize accesses to 'reader_contexts_' in RuntimeState.
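
The locking pattern can be sketched in Python as follows. The class and function names are hypothetical, and a `threading.Lock` is swapped in for the SpinLock used in the C++ change; the point is only that every access to the shared list goes through the same lock.

```python
import threading


class RuntimeStateSketch:
    """Holds reader contexts that several scan nodes may add concurrently."""

    def __init__(self):
        self._lock = threading.Lock()
        self.reader_contexts = []

    def add_reader_context(self, context):
        # All mutations of the shared list happen while holding the lock.
        with self._lock:
            self.reader_contexts.append(context)


def close_scan_nodes(state, num_nodes):
    """Simulate num_nodes scan nodes closing at the same time."""
    threads = [
        threading.Thread(target=state.add_reader_context, args=(node_id,))
        for node_id in range(num_nodes)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


state = RuntimeStateSketch()
close_scan_nodes(state, 8)
```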

Change-Id: I911fda526a99514b12f88a3e9fb5952ea4fe1973
Reviewed-on: http://gerrit.cloudera.org:8080/4558
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-09-30 01:21:05 +00:00
Thomas Tauber-Marshall
b2c2fe7813 IMPALA-3786: Replace "cloudera" with "apache" (part 2)
As part of the ASF transition, we need to replace references to
Cloudera in Impala with references to Apache. This primarily means
changing Java package names from com.cloudera.impala.* to
org.apache.impala.*

A prior patch renamed all the files as necessary, and this patch
performs the actual code changes. Most of the changes in this patch
were generated with some commands of the form:

find . | grep "\.java\|\.py\|\.h\|\.cc" | \
  xargs sed -i 's/com\(.\)cloudera\(\.\)impala/org\1apache\2impala/g'

along with some manual fixes.
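
For readers less familiar with sed's basic regular expressions, the same substitution can be written in Python as below. This is an equivalent sketch for illustration, not the tool used for the patch; the function name is hypothetical.

```python
import re

# First group: any single separator character. Second group: a literal dot.
PACKAGE_PATTERN = re.compile(r"com(.)cloudera(\.)impala")


def rename_package(text):
    """Rewrite com<sep>cloudera.impala to org<sep>apache.impala, keeping separators."""
    return PACKAGE_PATTERN.sub(r"org\1apache\2impala", text)
```

References such as com.cloudera.kudu do not match the pattern and are left alone, which is consistent with the remaining categories listed in the message.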

After this patch, the remaining references to Cloudera in the repo
mostly fall into the categories:
- External components that have cloudera in their own package names,
  eg. com.cloudera.kudu/llama
- URLs, eg. https://repository.cloudera.com/

Change-Id: I0d35fa6602a7fc0c212b2ef5e2b3322b77dde7e2
Reviewed-on: http://gerrit.cloudera.org:8080/3937
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2016-09-29 21:14:13 +00:00
Alex Behm
3aa4351625 IMPALA-4170: Fix identifier quoting in COMPUTE INCREMENTAL STATS.
The SQL statements generated from COMPUTE INCREMENTAL STATS
did not properly quote identifiers when incrementally updating
the stats for newly added partitions.

Our existing tests did not catch this case because the code paths
for doing the initial stats computation and the incremental stats
computation are different, in particular, the code for generating
the SQL statements.

Change-Id: I63adcc45dc964ce769107bf4139fc4566937bb96
Reviewed-on: http://gerrit.cloudera.org:8080/4479
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-09-21 01:24:53 +00:00
Matthew Jacobs
c7fa03286b IMPALA-3718: Support subset of functional-query for Kudu
Adds initial support for the functional-query test workload
for Kudu tables.

There are a few issues that make loading the functional
schema difficult on Kudu:
 1) Kudu tables must have one or more columns that together
    constitute a unique primary key.
   a) Primary key columns must currently be the first columns
      in the table definition (KUDU-1271).
   b) Primary key columns cannot be nullable (KUDU-1570).
 2) Kudu tables must be specified with distribution
    parameters.

(1) limits the tables that can be loaded without ugly
workarounds. This patch only includes important tables that
are used for relevant tests, most notably the alltypes*
family. In particular, alltypesagg is important but it does
not have a set of columns that are non-nullable and form a unique
primary key. As a result, that table is created in Kudu with
a different name and an additional BIGINT column for a PK
that is a unique index and is generated at data loading time
using the ROW_NUMBER analytic function. A view is then
wrapped around the underlying table that matches the
alltypesagg schema exactly. When KUDU-1570 is resolved, this
can be simplified.

(2) requires some additional considerations and custom
syntax. As a result, the DDL to create the tables is
explicitly specified in CREATE_KUDU sections in the
functional_schema_constraints.csv, and an additional
DEPENDENT_LOAD_KUDU section was added to specify custom data
loading DML that differs from the existing DEPENDENT_LOAD.

TODO: IMPALA-4005: generate_schema_statements.py needs refactoring

Tests that are not relevant or not yet supported have been
marked with xfail and a skip where appropriate.

TODO: Support remaining functional tables/tests when possible.

Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc
Reviewed-on: http://gerrit.cloudera.org:8080/4175
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:11:04 +00:00
Alex Behm
d379875806 IMPALA-3491: Use unique db in test_scanners.py and test_aggregation.py
Testing: Ran the tests locally in a loop on exhaustive.
Did a private debug/exhaustive run.

Change-Id: Ided0848c138bdc1d43694a12222010c48e23ee1c
Reviewed-on: http://gerrit.cloudera.org:8080/4339
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-13 21:57:36 +00:00
Alex Behm
8c37bf3543 IMPALA-3491: Use unique database fixture in test_partitioning.py
Testing: Ran the test locally in a loop on exhaustive. Did a private
debug/exhaustive/hdfs test run.

Change-Id: Ib1b33d9977a98894288662a711805e9a54329ec8
Reviewed-on: http://gerrit.cloudera.org:8080/4316
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-08 04:31:27 +00:00
Alex Behm
c8f3d40efc IMPALA-3491: Use unique database fixture in test_nested_types.py
Testing: Ran the tests locally in a loop. Did a core/debug/hdfs
private build.

Change-Id: I0c56df0c6a5f771222dedb69353f8bebe01d5a90
Reviewed-on: http://gerrit.cloudera.org:8080/4302
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-03 00:39:07 +00:00
Yuanhao Luo
052d3cc8dd IMPALA-4056: Fix toSql() of DistributeParam
This commit fixes two issues in toSql() of DistributeParam:
1. string literals were not quoted
2. range partition split rows were not printed.
Besides, this commit fixes a small issue in run-hive-server.sh

Change-Id: I984a63a24f02670347b0e1efceb864d265d1f931
Reviewed-on: http://gerrit.cloudera.org:8080/4195
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 20:11:27 +00:00
Alex Behm
ab9e54bc42 IMPALA-3491: Use unique database fixture in test_ddl.py.
Adds new parametrization to the unique database fixture:
- num_dbs: allows creating multiple unique databases at once;
  the 2nd, 3rd, etc. database name is generated by appending
  "2", "3", etc., to the first database name
- sync_ddl: allows creating the database(s) with sync_ddl
  which is needed by most tests in test_ddl.py
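
The naming rule for num_dbs can be sketched as a small Python helper. The function name is hypothetical and the fixture's real implementation may differ; only the suffix rule is taken from the description above.

```python
def unique_database_names(first_name, num_dbs=1):
    """Return num_dbs names: the first as given, then first_name + "2", "3", ..."""
    names = [first_name]
    for suffix in range(2, num_dbs + 1):
        names.append(f"{first_name}{suffix}")
    return names
```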

Testing: I ran debug/core and debug/exhaustive on HDFS and
core/debug on S3. Also ran the test locally in a loop on
exhaustive.

Change-Id: Idf667dd5e960768879c019e2037cf48ad4e4241b
Reviewed-on: http://gerrit.cloudera.org:8080/4155
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 02:47:02 +00:00
Alex Behm
16f1c8d8de IMPALA-4054: Remove serial test workarounds for IMPALA-2479.
The underlying issue IMPALA-2479 has been fixed, so it
should be safe to execute these tests in parallel again:
- test_runtime_filters.py (all tests)
- test_scanners.py::TestParquet::test_multiple_blocks
- test_scanners.py::TestParquet::test_multiple_blocks_one_row_group

Testing: Ran the tests locally in a loop. Did a private core/hdfs run.

Change-Id: I8f046e67eb1de1c6ff87980f906870ec9816f551
Reviewed-on: http://gerrit.cloudera.org:8080/4291
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 02:19:52 +00:00
Jim Apple
20ef3b016e IMPALA-4058: ByteSwap256 assumed memory was 16-byte aligned.
This changes the code to use the lddqu and movdqu instructions (via
Intel intrinsics) to allow unaligned memory access.

Change-Id: I39b2b47bb717d5ac9727512a24fcf8a8a6a8dcc6
Reviewed-on: http://gerrit.cloudera.org:8080/4205
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-09-02 01:47:08 +00:00
Henry Robinson
24869d40fd IMPALA-3610: Account for memory used by filters in the coordinator
Before this patch, Impala would not account for the memory used to
aggregate runtime filters together in the coordinator. Impala's memory
could therefore be silently overcommitted.

This patch accounts for aggregated filter memory in a new filter
memtracker that is attached to the coordinator's query_mem_tracker(). If
the query memory limit is exceeded when a filter update arrives, that
update is discarded. If the filter is from a partitioned join, the
entire filter can therefore be discarded immediately (to alleviate
memory pressure) and a dummy 'always true' filter is sent to backends to
unblock them.
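
The accounting described above can be sketched in Python as follows. All names are hypothetical and the sketch ignores concurrency; it only shows an update being discarded and replaced by an "always true" filter once the limit would be exceeded.

```python
ALWAYS_TRUE = "always_true"
AGGREGATED = "aggregated"


class MemTrackerSketch:
    """Tracks consumed bytes against a fixed limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.consumed_bytes = 0

    def try_consume(self, num_bytes):
        if self.consumed_bytes + num_bytes > self.limit_bytes:
            return False
        self.consumed_bytes += num_bytes
        return True


def handle_filter_update(tracker, update_bytes):
    """Return what the coordinator does with one partitioned-join filter update."""
    if not tracker.try_consume(update_bytes):
        # Over the limit: drop the update and unblock backends with a dummy filter.
        return ALWAYS_TRUE
    return AGGREGATED


tracker = MemTrackerSketch(limit_bytes=100)
first = handle_filter_update(tracker, 60)
second = handle_filter_update(tracker, 60)
```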

If the filter is from a broadcast join, no aggregation is done, so there
is no tracking. The Thrift input and output filter data structures are
not tracked (as we generally don't track RPC objects, but plan to in the
future). The filter payload is moved from the input request structure to
the output broadcast structure without copying.

Memory that is added to a memtracker must always be released. To do
this, we need to signal to the coordinator that it is finished, and that
there is no point trying to process any future updates that might arrive
concurrently. This patch adds Coordinator::Done() which is called from
QueryExecState::Done(), and which releases memory from all in-process
runtime filters.

Finally, this patch increases the upper limit for runtime filters to
512MB. This allows testing on very large datasets. The default maximum
is still 16MB, per RUNTIME_FILTER_MAX_SIZE.

Testing: Added a new test that triggers the OOM condition on the
coordinator. All existing runtime filter tests pass.

Change-Id: I3c52c8a1c2e79ef370c77bf264885fc859678d1b
Reviewed-on: http://gerrit.cloudera.org:8080/4066
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
2016-09-01 02:35:41 +00:00
Tim Armstrong
1350c34763 IMPALA-4049: fix empty batch handling NLJ build side
Memory from the build side of a nested loop join is
referenced by its output batches, so accumulated build-side
memory resources must be transferred to the caller.
Special-cased handling of empty batches did not transfer
the memory. The fix is to accumulate empty batches and
transfer their resources in the same way as non-empty
batches. The iterator required changes to handle empty
batches in the list.

Testing:
Added a unit test that exercises the bug in RowBatchList.
Added a query test that causes a crash in the ASAN build
and incorrect results in the debug build.

Change-Id: I3cb19e536b87bbb4d4ae82d1636ba1463a422789
Reviewed-on: http://gerrit.cloudera.org:8080/4182
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 21:20:29 +00:00
Alex Behm
df830901de IMPALA-3491: Use unique database fixture in test_join_queries.py.
Testing: Ran the core/exhaustive on hdfs.

Change-Id: Ib639ff8a37dbf64840606f88badff8f2590587b6
Reviewed-on: http://gerrit.cloudera.org:8080/4169
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 03:12:30 +00:00
Alex Behm
12496c7fbf IMPALA-1657: Rework detection and reporting of corrupt table stats.
1. Minor fixes for cardinality estimation of unpartitioned tables.
2. Reworks handling of corrupt table stats as follows:
   The stats of a table or partition are reported as corrupt if the
   numRows < -1, or if numRows == 0 but the table size is positive.
3. Removes the Preconditions check reported in IMPALA-1657 in favor
   of issuing a corrupt table stats warning.
4. Fixes a few tests to set numRows together with
   STATS_GENERATED_VIA_STATS_TASK so that the numRows is definitely
   set in the HMS.
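
The rule in item 2 can be written as a one-line predicate. The Python function below is a sketch with a hypothetical name; the conditions themselves are taken directly from the description.

```python
def has_corrupt_stats(num_rows, size_bytes):
    """True if numRows < -1, or numRows == 0 while the table size is positive."""
    return num_rows < -1 or (num_rows == 0 and size_bytes > 0)
```

Note that -1 itself is not reported as corrupt under this rule.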

Change-Id: I1d3305791d96e1c23a901af7b7c109af9352bb44
Reviewed-on: http://gerrit.cloudera.org:8080/4166
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-08-31 00:58:03 +00:00
Thomas Tauber-Marshall
d72353d0c9 IMPALA-2932: Extend DistributedPlanner to account for hash table build cost
When deciding between a broadcast or repartition join, Impala calculates
the cost of each join as the total amount of data that is sent over the
network. This ignores some relevant costs, and can lead to bad plans.

One such relevant cost is the work to create the hash table used in the
join. This patch accounts for this by adding the amount of data inserted
into the hash table (the size of the right side of the join) to the
previous cost.

This generally increases the estimated cost of broadcast joins relative
to repartitioning joins, as the broadcast join must build the hash table
on each node the data was broadcast to, so its effect will be to make
repartitioning joins more likely to be chosen, especially in large
clusters.
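
A rough Python sketch of the comparison follows. The network terms are assumptions made for illustration (the right side is sent to every node for a broadcast join, and both sides are shuffled once for a repartitioning join); only the added hash table build term comes from this message. All names are hypothetical.

```python
def broadcast_join_cost(rhs_bytes, num_nodes):
    network = rhs_bytes * num_nodes       # assumption: right side sent to every node
    hash_build = rhs_bytes * num_nodes    # every node builds the full hash table
    return network + hash_build


def repartition_join_cost(lhs_bytes, rhs_bytes):
    network = lhs_bytes + rhs_bytes       # assumption: both sides shuffled once
    hash_build = rhs_bytes                # the right side is inserted once overall
    return network + hash_build


def choose_join(lhs_bytes, rhs_bytes, num_nodes):
    broadcast = broadcast_join_cost(rhs_bytes, num_nodes)
    repartition = repartition_join_cost(lhs_bytes, rhs_bytes)
    return "broadcast" if broadcast <= repartition else "repartition"
```

Under these assumptions, a 1000-byte left side and a 100-byte right side pick a broadcast join on 3 nodes but a repartitioning join on 10 nodes, whereas the network-only cost would still have favored broadcast on 10 nodes.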

This patch has not yet been performance tested.

Change-Id: I03a0f56f69c8deae68d48dfdb9dc95b71aec11f1
Reviewed-on: http://gerrit.cloudera.org:8080/4098
Tested-by: Internal Jenkins
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2016-08-29 16:44:22 +00:00
Lars Volker
0e886618e2 IMPALA-3776: fix 'describe formatted' for Avro tables
For Avro tables the column information in the underlying database of the
Hive metastore can be different from what is specified in the avro
schema. HIVE-6308 aimed to improve upon this, but for older tables the
two don't necessarily align.

There are two possible cases:

1) Hive's underlying database contains a column which is not present in
the Avro schema file. In this case we encounter a NullPointerException
in DescribeResultFactory.java#L189 when trying to look up the column in
the internal table object.

2) The Avro schema contains a column, which is not present in the
underlying database. In this case the column will not be displayed in
describe formatted.

In addition to the automatic tests I verified this manually by creating
an Avro table with an external schema file in Hive. This populated the
underlying database with the column information. I then either removed
a column from the Avro schema file (case 1) or cleared the column
information from the "COLUMNS_V2" table in the underlying database
(case 2) and verified that the change fixed both cases.

Change-Id: Ieb69d3678e662465d40aee80ba23132ea13871a0
Reviewed-on: http://gerrit.cloudera.org:8080/4126
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-by: Jim Apple <jbapple@cloudera.com>
2016-08-26 17:20:10 +00:00
Henry Robinson
34b5f1c416 IMPALA-(3895,3859): Don't log file data on parse errors
Logging file or table data is a bad idea, and doing it by default is
particularly bad. This patch changes HdfsScanNode::LogRowParseError() to
log a file and offset only.

Testing: See rewritten tests.

To support testing this change, we also fix IMPALA-3895, by introducing
a canonical string __HDFS_FILENAME__ that all Hadoop filenames in the ERROR
output are replaced with before comparing with the expected
results. This fixes a number of issues with the old way of matching
filenames which purported to be a regex, but really wasn't. In
particular, we can now match the rest of an ERROR line after the
filename, which was not possible before.

In some cases, we don't want to substitute filenames because the ERROR
output is looking for a very specific output. In that case we can write:

$NAMENODE/<filename>

and this patch will not perform _any_ filename substitutions on ERROR
sections that contain the $NAMENODE string.
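
The substitution rule can be sketched roughly as follows in Python. The
regular expression for HDFS paths and the function name are
assumptions made for illustration; the test framework's real matching
logic may differ.

```python
import re

# Hypothetical pattern for an HDFS path; stops at whitespace or a comma.
HDFS_PATH = re.compile(r"hdfs://[^\s,]+")
PLACEHOLDER = "__HDFS_FILENAME__"


def canonicalize_actual(actual_lines, expected_lines):
    """Replace HDFS filenames in the actual ERROR output with the
    canonical placeholder, unless the expected section opts out by
    mentioning $NAMENODE anywhere."""
    if any("$NAMENODE" in line for line in expected_lines):
        return list(actual_lines)  # no substitutions at all
    return [HDFS_PATH.sub(PLACEHOLDER, line) for line in actual_lines]
```

Because only the filename is replaced, the text that follows it on the
same line still takes part in the comparison.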

Finally, this patch fixes a bug where a test that had an ERRORS section
but no RESULTS section would silently pass without testing anything.

Change-Id: I5a604f8784a9ff7b4bf878f82ee7f56697df3272
Reviewed-on: http://gerrit.cloudera.org:8080/4020
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2016-08-25 10:20:36 +00:00
Attila Jeges
211f60d831 IMPALA-1731,IMPALA-3868: Float values are not parsed correctly
Fixed StringToFloatInternal() not to parse strings like "1.23inf"
and "infinite" with leading/trailing garbage as Infinity. These
strings are now rejected with PARSE_FAILURE.
Only "inf" and "infinity" are accepted; parsing is case-insensitive.

"NaN" values are handled similarly: strings with leading/trailing
garbage like "nana" are rejected; parsing is case-insensitive.

Other changes:
- StringToFloatInternal() was cleaned up a bit. Parsing inf and NaN
strings was moved out of the main loop.
- Use std::numeric_limits<T>::infinity() instead of INFINITY macro
and std::numeric_limits<T>::quiet_NaN() instead of NAN macro.
- Fixed another minor bug: multiple dots were allowed when parsing
float values (e.g. "2.1..6" was interpreted as 2.16).
- New BE and E2E tests were added.
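
The accepted and rejected inputs above can be illustrated with a small
Python sketch. It is not a port of StringToFloatInternal(): the
function names are made up, None stands in for PARSE_FAILURE, the
optional sign handling is an assumption, and exponents and whitespace
are not modelled.

```python
import math


def parse_special(s: str):
    """Return +/-inf or nan for an exact special token, else None."""
    text = s.lower()  # matching is case-insensitive
    sign = 1.0
    if text and text[0] in "+-":  # assumption: a single leading sign is allowed
        if text[0] == "-":
            sign = -1.0
        text = text[1:]
    if text in ("inf", "infinity"):
        return sign * math.inf
    if text == "nan":
        return math.nan
    return None  # anything else, including "infinite" or "nana"


def parse_float(s: str):
    """Parse a decimal string; None stands in for PARSE_FAILURE."""
    special = parse_special(s)  # handled before the digit checks
    if special is not None:
        return special
    body = s[1:] if s[:1] in ("+", "-") else s
    digits = body.replace(".", "", 1)  # at most one dot is tolerated
    if body.count(".") > 1 or not (digits.isascii() and digits.isdigit()):
        return None
    return float(s)
```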

Change-Id: I9e17d0f051b300a22a520ce34e276c2d4460d35e
Reviewed-on: http://gerrit.cloudera.org:8080/3791
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-08-24 03:34:01 +00:00
Tim Armstrong
f613dcd02d Add functional and targeted perf tests for joins with empty builds
I wrote these tests for my IMPALA-3987 patch, but other issues block
that optimisation. These tests exercise an interesting corner case
so I split them out into a separate patch.

The functional tests exercise every join mode for nested loop join and
hash join with an empty build side. The perf test exercises hash join
with an empty build side.

Testing:
Made sure the tests passed with both partitioned and non-partitioned
hash join implementations. Ran the targeted perf query through the
single node perf run script to make sure it worked.

Change-Id: I0a68cafec32011a47c569b254979601237e7f2a5
Reviewed-on: http://gerrit.cloudera.org:8080/4051
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 06:04:18 +00:00
Matthew Jacobs
d113205cee IMPALA-3650: DISTRIBUTE BY required for managed Kudu tables
As of Kudu 0.9, DISTRIBUTE BY is now required when creating
a new Kudu table. Create table analysis, data loading, and
tests are updated to reflect this.

This also bumps the Kudu version to 0.10.0.

Change-Id: Ieb15110b10b28ef6dd8ec136c2522b5f44dca43e
Reviewed-on: http://gerrit.cloudera.org:8080/3987
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-08-19 02:14:39 +00:00
Tim Armstrong
5afd9f7df7 IMPALA-3764,3914: fuzz test HDFS scanners and fix parquet bugs found
This adds a test that performs some simple fuzz testing of HDFS
scanners. It creates a copy of a given HDFS table, with each
file in the table corrupted in a random way: either a single
byte is set to a random value, or the file is truncated to a
random length. It then runs a query that scans the whole table
with several different batch_size settings. I made some effort
to make the failures reproducible by explicitly seeding the
random number generator, and providing a mechanism to override
the seed.
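
A minimal Python sketch of the corruption step and the seeding scheme
follows. The helper names and the 50/50 choice between the two
corruption modes are assumptions for illustration, not the test's
actual code.

```python
import random


def make_rng(seed=None):
    """Return (seed, rng). A fresh seed is drawn unless one is supplied,
    so a failing run can be replayed by passing the reported seed back."""
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    return seed, random.Random(seed)


def corrupt(data: bytes, rng: random.Random) -> bytes:
    """Corrupt one file image: overwrite a single byte with a random
    value, or truncate the file to a random length."""
    if not data:
        return data
    if rng.random() < 0.5:  # assumption: both modes are equally likely
        out = bytearray(data)
        out[rng.randrange(len(data))] = rng.randrange(256)
        return bytes(out)
    return data[: rng.randrange(len(data))]
```

Logging the seed returned by make_rng() alongside any failure is what
makes a failing corruption reproducible.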

The fuzzer has found crashes resulting from corrupted or truncated
input files for RCFile, SequenceFile, Parquet, and Text LZO so far.
Avro only had a small buffer read overrun detected by ASAN.

Includes fixes for Parquet crashes found by the fuzzer, a small
buffer overrun in Avro, and a DCHECK in MemPool.

Initially it is only enabled for Avro, Parquet, and uncompressed
text. As follow-up work we should fix the bugs in the other scanners
and enable the test for them.

We also don't implement abort_on_error=0 correctly in Parquet:
for some file formats, corrupt headers result in the query being
aborted, so an exception will xfail the test.

Testing:
Ran the test with exploration_strategy=exhaustive in a loop locally
with both DEBUG and ASAN builds for a couple of days over a weekend.
Also ran exhaustive private build.

Change-Id: I50cf43195a7c582caa02c85ae400ea2256fa3a3b
Reviewed-on: http://gerrit.cloudera.org:8080/3833
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-08-11 08:42:41 +00:00
Michael Ho
276376acac IMPALA-3674: Lazy materialization of LLVM module bitcode.
Previously, each fragment using dynamic code generation would
parse the bitcode module and populate the LLVM data structures
for all the functions and their bodies in the bitcode module.
This is wasteful as we may only use a few functions out of all
the functions parsed. We rely on dead code elimination to
delete most of the unused functions so we won't waste time
compiling them.

This change implements lazy materialization of the functions'
bodies. On the initial parse of the bitcode module, we just
create the Function objects for each function in the module.
The functions' bodies will be materialized on demand from the
bitcode module when they are actually referenced in the query.
This ensures that the prepare time during codegen is proportional
to the number of IR functions referenced by the query instead
of being proportional to the total number of IR functions in
the module.
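
The idea can be pictured with a language-neutral Python analogy. This
is not the LLVM API; the class and its loader callables are
hypothetical stand-ins for function declarations and their bitcode
bodies.

```python
class LazyModule:
    """Knows every function name up front, but loads a body only on
    first reference."""

    def __init__(self, loaders):
        self._loaders = dict(loaders)  # name -> zero-argument callable
        self._bodies = {}              # bodies materialized so far

    def names(self):
        """Cheap: listing declarations never loads a body."""
        return list(self._loaders)

    def body(self, name):
        """Materialize the body on demand and cache it."""
        if name not in self._bodies:
            self._bodies[name] = self._loaders[name]()
        return self._bodies[name]

    def materialized_count(self):
        return len(self._bodies)
```

The work done grows with the number of bodies actually requested
rather than with the number of entries in the module.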

This change also stops cross-compiling BufferedTupleStream::GetTupleRow()
as there isn't much benefit in doing so. In addition, move the ctors
and dtors of LikePredicate to the header file to avoid an unnecessary
alias in the IR module.

For TPCH-Q2, a fragment which only codegens 9 functions used to spend
146ms in codegen. It now goes down to 35ms, a 76% reduction.

      CodeGen:(Total: 146.041ms, non-child: 146.041ms, % non-child: 100.00%)
         - CodegenTime: 0.000ns
         - CompileTime: 2.003ms
         - LoadTime: 0.000ns
         - ModuleBitcodeSize: 2.12 MB (2225304)
         - NumFunctions: 9 (9)
         - NumInstructions: 129 (129)
         - OptimizationTime: 29.019ms
         - PrepareTime: 114.651ms

      CodeGen:(Total: 35.288ms, non-child: 35.288ms, % non-child: 100.00%)
         - CodegenTime: 0.000ns
         - CompileTime: 1.880ms
         - LoadTime: 0.000ns
         - ModuleBitcodeSize: 2.12 MB (2221276)
         - NumFunctions: 9 (9)
         - NumInstructions: 129 (129)
         - OptimizationTime: 5.101ms
         - PrepareTime: 28.044ms

Change-Id: I6ed7862fc5e86005ecea83fa2ceb489e737d66b2
Reviewed-on: http://gerrit.cloudera.org:8080/3220
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-20 18:30:25 -07:00
Tim Armstrong
bc8c55afcd IMPALA-3729: batch_size=1 coverage for avro scanner
Also fix a stale comment in the avro scanner header.

The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.

The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:

---- RESULTS
====

vs
---- RESULTS

====

both got resolved to [''].
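
One way such a conflation can arise, and one way to avoid it, is
sketched below in Python. Both functions are hypothetical and take the
raw text between the '---- RESULTS' line and the '====' line; the
naive version is only an illustration, not the parser's original code.

```python
def parse_results_naive(body: str):
    """Strips first, so 'no lines' and 'one blank line' look the same."""
    return body.strip().split("\n")


def parse_results(body: str):
    """Keeps the distinction: no lines gives [], one blank line gives ['']."""
    return body.splitlines()
```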

Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2016-07-19 23:30:02 -07:00
Thomas Tauber-Marshall
343bdad866 IMPALA-3210: last/first_value() support for IGNORE NULLS
Added support for the 'ignore nulls' keyword to the last_value and
first_value analytic functions, e.g. 'last_value(col ignore nulls)',
which would return the last value from the window that is not null,
or null if all of the values in the window are null.
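
The intended semantics over a single window can be sketched in Python,
with None standing in for SQL NULL. The function signatures are
illustrative only and say nothing about how the backend evaluates
analytic functions.

```python
def last_value(window, ignore_nulls=False):
    """Last value in the window; with ignore_nulls, the last non-null one."""
    if not ignore_nulls:
        return window[-1] if window else None
    for value in reversed(window):
        if value is not None:
            return value
    return None  # every value in the window was null


def first_value(window, ignore_nulls=False):
    """First value in the window; with ignore_nulls, the first non-null one."""
    if not ignore_nulls:
        return window[0] if window else None
    for value in window:
        if value is not None:
            return value
    return None
```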

We handle 'ignore nulls' in the FE in the same way that we handle
'distinct' - by adding isIgnoreNulls as a field in FunctionParams.

To avoid affecting performance when 'ignore nulls' is not used, and
to avoid having to special case 'ignore nulls' on the backend, this
patch adds 'last_value_ignore_nulls' and 'first_value_ignore_nulls'
builtin analytic functions that wrap 'last_value' and 'first_value'
respectively.

Change-Id: Ic27525e2237fb54318549d2674f1610884208e9b
Reviewed-on: http://gerrit.cloudera.org:8080/3328
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Internal Jenkins
2016-07-18 08:28:09 -07:00
Michael Ho
f129dfd202 IMPALA-3018: Don't return NULL on zero length allocations.
FunctionContext::Allocate() and FunctionContext::AllocateLocal()
used to return NULL for zero length allocations. This makes
it hard to distinguish between allocation failures and zero
length allocations. Such confusion may lead to DCHECK failure
in the macro RETURN_IF_NULL() in debug builds or access to NULL
pointers in non-debug builds.

This change fixes the problem above by returning NULL only if
there is allocation failure. Zero-length allocations will always
return a dummy non-NULL pointer.

Change-Id: Id8c3211f4d9417f44b8018ccc58ae182682693da
Reviewed-on: http://gerrit.cloudera.org:8080/3601
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-14 19:04:45 +00:00
Michael Ho
df2fc08d22 IMPALA-3206: Enable codegen for AVRO_DECIMAL
This change adds the missing switch statement in
CodegenReadScalar() for AVRO_DECIMAL so that we will
also codegen if an avro table contains AVRO_DECIMAL.
With this change, the following query improves by 37.5%,
going from 8s to 5s:

select count(distinct l_linenumber), avg(l_extendedprice), max(l_discount), min(l_tax) from tpch15_avro.lineitem;

This change also un-inlines BitUtil::ByteSwap() as the
third argument 'len' is not compilation constant for
all call sites.

Change-Id: I51adf0c1ba76e055f31ccb0034a0d23ea2afb30e
Reviewed-on: http://gerrit.cloudera.org:8080/3489
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-07-14 19:04:44 +00:00
Huaisi Xu
c6ce32b3b6 IMPALA-3687: Prefer Avro field name during schema reconciliation
Since it is possible to create an Avro table with both column
definitions and an Avro schema, Impala attempts to reconcile
inconsistencies in the two schema definitions, generally preferring the
Avro schema. The only exception to this rule was with
CHAR/VARCHAR/STRING columns, where the column definition was preferred
in order to support tables with CHAR/VARCHAR columns although Avro only
supports STRING. This exception is confusing because the name for such a
column will be taken from the column definition (and not from the Avro
schema).

This patch prefers the name and comment from the Avro schema definition and
uses the column type from the column definition for CHAR/VARCHAR/STRING
columns.
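
The resulting rule for a single column can be sketched in Python. The
dict layout and helper names are hypothetical, and the sketch assumes
the column-definition type is kept only when the Avro field is a
STRING; the frontend's actual reconciliation code may differ.

```python
STRING_TYPES = {"CHAR", "VARCHAR", "STRING"}


def base_type(type_name: str) -> str:
    """'VARCHAR(10)' -> 'VARCHAR'."""
    return type_name.split("(", 1)[0].strip().upper()


def reconcile_column(col_def: dict, avro_field: dict) -> dict:
    """Each argument has 'name', 'type' and 'comment' keys."""
    keep_col_type = (
        base_type(col_def["type"]) in STRING_TYPES
        and base_type(avro_field["type"]) == "STRING"  # assumption
    )
    return {
        "name": avro_field["name"],        # always from the Avro schema
        "comment": avro_field["comment"],  # always from the Avro schema
        "type": col_def["type"] if keep_col_type else avro_field["type"],
    }
```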

Change-Id: Ia3e43b2885853c2b4f207a45a873c9d7f31379cd
Reviewed-on: http://gerrit.cloudera.org:8080/3331
Reviewed-by: Huaisi Xu <hxu@cloudera.com>
Tested-by: Internal Jenkins
2016-07-14 19:04:43 +00:00
Huaisi Xu
c1da1409ba IMPALA-3711: Remove unnecessary privilege checks in getDbsMetadata()
Previously all code paths using getDbsMetadata() suffered
unnecessary privilege checks:
1. Impala checked privileges on all databases and tables before
applying user-provided JDBC pattern filters.
2. Impala passed a null pattern to getDbsMetadata() when
user did not provide one. However, null pattern is treated
as "%", which matches everything thereby causing unnecessary
privilege checks for catalog objects that are not in the
result set.

This patch creates PatternMatcher early so that a user-specified
null pattern is respected when calling getDbsMetadata().

Change-Id: I17d8c5b9fb12483e4b01b819fba48b6849311a14
Reviewed-on: http://gerrit.cloudera.org:8080/3371
Reviewed-by: Huaisi Xu <hxu@cloudera.com>
Tested-by: Huaisi Xu <hxu@cloudera.com>
2016-07-07 10:41:29 -07:00