Commit Graph

40 Commits

Author SHA1 Message Date
Lars Volker
b5570da405 IMPALA-1740: Add support for skip.header.line.count.
HIVE-5795 introduced a parameter skip.header.line.count to skip header
lines from input files. This change introduces the capability to skip
an arbitrary number of header lines from csv input files on hdfs. The
size of the total file header must be smaller than
max_scan_range_length, otherwise an error will be reported. This is
necessary because scan ranges are not read in disk order, so there is
no way of identifying header lines except by counting from the start
of the first scan range.

[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='1');
Query: alter table t1 set tblproperties('skip.header.line.count'='1')
[localhost:21000] > select * from t1;
Query: select * from t1
+----+----+
| c1 | c2 |
+----+----+
| 1  | 1  |
| 2  | 2  |
| 3  | 3  |
+----+----+
Fetched 3 row(s) in 0.32s
[localhost:21000] > alter table t1 set
tblproperties('skip.header.line.count'='0');
Query: alter table t1 set tblproperties('skip.header.line.count'='0')
[localhost:21000] > select * from t1;
Query: select * from t1
+------+------+
| c1   | c2   |
+------+------+
| NULL | NULL |
| 1    | 1    |
| 2    | 2    |
| 3    | 3    |
+------+------+
WARNINGS: Error converting column: 0 TO INT (Data is: num1)
Error converting column: 1 TO DOUBLE (Data is: num2)
file: hdfs://localhost:20500/test-warehouse/t1/test.txt
record: num1,num2

Fetched 4 row(s) in 0.41s

Change-Id: I595f01a165d41499ca1956fe748ba3840a6eb543
Reviewed-on: http://gerrit.cloudera.org:8080/2110
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-05-12 14:17:46 -07:00
Tim Armstrong
52362d4079 IMPALA-3047: separate create table test with nested types
We need to skip queries that select from tables wiht nested types is
running with the old aggs and joins. To achieve this, move the failing
test to a separate test and use the skip decorator.

Change-Id: Iaf1351c711b524be66a99084657926909425cbff
Reviewed-on: http://gerrit.cloudera.org:8080/2272
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2016-02-24 13:31:00 -08:00
Alex Behm
8b32cbb904 IMPALA-2820: Support unquoted keywords as struct-field names.
After this patch structs can be parsed/created with field names
that are regular identifiers or keywords, even if unquoted.
This fix is needed for parsing type strings stored in the
Hive Metastore which could contain unquoted identifiers that
correspond to Impala keywords.

The parser changes required an upgrade of Cup and its Maven plugin.
In the old version, the generated parser would not compile because
of a giant method that exceeded the JVM maximum allowed size for a
single method.

Change-Id: Ic989c7afd034216f6db4c8f9f3901c025cceb524
Reviewed-on: http://gerrit.cloudera.org:8080/2249
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-02-22 20:16:24 -08:00
Bharath Vissapragada
ef0dac661c IMPALA-2843: Persist hive udfs across catalog restarts
This commit adds a new feature to persist hive/java udfs across
catalog restarts. IMPALA-1748 already added this for non-java
udfs by storing them in parameters map of the Db object and
reading them back at catalog startup. However we follow a
different approach for hive udfs by converting them to Hive's
function format and adding them as hive functions to the metastore.
This makes it possible to share udfs between hive and Impala as the
udfs added from one service are accessible to other. This commit
takes care of format conversions between hive and impala and user
can just add function once in either of the services.

Background: Hive and impala treat udfs differently. Hive resolves the
evaluate function in the udf class at runtime depending on the data
types of the input arguments. So user can add one function by name and
can pass any arguments to it as long as there is a compatible evaluate
function in the udf class. However Impala takes the input types of the
udf as a part of function definition (that maps to only one evaluate
function) and loads the function only for those set of input argument
types. If we have multiple 'evaluate' methods, we need to add multiple
functions one for each of them.

This commit adds new variants of CREATE | DROP FUNCTIONS  to Impala which
lets the user to create and drop hive/java udfs without input argument
types or return types. Catalog takes care of loading/dropping the udf
signatures corresponding to each "evaluate" method in the udf symbol
class. The syntax is as follows,

CREATE FUNCTION [IF NOT EXISTS] <function name> <function_opts>
DROP FUNCTION [IF EXISTS] <function name>

Examples:

CREATE FUNCTION IF NOT EXISTS foo location '/path/to/jar' SYMBOL='TestUdf';
CREATE FUNCTION bar location '/path/to/jar' SYMBOL='TestUdf2';
DROP FUNCTION foo;
DROP FUNCTION IF EXISTS bar;

The older way of creating hive/java udfs with specific signature is still supported,
however they are *not* persisted across restarts. So a restart of catalog can
wipe them out. Additionally this commit also loads all the compatible java udfs
added outside of Impala and they needn't be separately loaded. One thing
to note here is that the functions added using the new CREATE FUNCTION
can only be dropped using the new DROP FUNCTION syntax (without
signature). The same rule applies for the java udfs added using the old
CREATE FUNCTION syntax (with signature).

Change-Id: If31ed3d5ac4192e3bc2d57610a9a0bbe1f62b42d
Reviewed-on: http://gerrit.cloudera.org:8080/2250
Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com>
Tested-by: Internal Jenkins
2016-02-19 23:04:03 -08:00
Dimitris Tsirogiannis
6b3c9cac45 IMPALA-2870: Fix failing metadata.test_ddl.TestDdlStatements.test_create_table test
This commit fixes a DDL test that was failing because some newly added test cases were
using a database that had been dropped by another test case. The temporary fix is to use
fully qualified names for the specified tables.

Change-Id: I3bb022e2497283faeb84c85f922cda95beca2a32
Reviewed-on: http://gerrit.cloudera.org:8080/1909
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2016-01-27 01:51:42 +00:00
Michael Ho
f3e7274342 IMPALA-2711: Fix memory leak in Rand().
MathFunctions::RandPrepare() allocates a 4-bytes seed and
stores it in the FunctionContext's thread local state.
However, it was never freed. This change fixes the problem
by adding a close function for Rand() so it has a chance to
free the seed. A new test is also added to verify the fix.

Change-Id: Ibcc2e1ca0d052b86defe80aad471f9fdaac5a453
Reviewed-on: http://gerrit.cloudera.org:8080/1855
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
2016-01-26 11:53:38 +00:00
Lars Volker
82a1aef91b IMPALA-1687: Expand CTAS to allow partition clauses
This changes implements support for PARTITIONED BY clauses in CTAS
statements. The syntax and semantics follow the PARTITION feature of
insert from select statements: inside the PARTITIONED BY (...) column
list the user must specify names of the columns to partition by. These
column names must appear in that particular order at the end of the
select statement. A remapping between columns of the source and
destination tables is not possible, because the destination table does
not yet exist. Specifying static values for the partition columns is
also not possible, as their type needs to be deduced from columns in the
select statement. Example:

CREATE TABLE t (a DOUBLE, b INT);
INSERT INTO t VALUES (1.5, 3);
CREATE TABLE p PARTITIONED BY (b) AS SELECT a, b FROM t;

This change also contains a fix for setting the PYTHONPATH environment
variable correctly, so you can run single python tests from the command
line.

Change-Id: I5f61854d36d1ee30cfcd1c6b2b3eb971f6cf4b2f
Reviewed-on: http://gerrit.cloudera.org:8080/1740
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
2016-01-18 16:55:45 +00:00
Tim Armstrong
a280b93a37 IMPALA-2070: include the database comment when showing databases
As part of change, refactor catalog and frontend functions to return
TDatabase/Db objects instead of just the string names of databases -
this required a lot of method/variable renamings.

Add test for creating database with comment. Modify existing tests
that assumed only a single column in SHOW DATABASES results.

Change-Id: I400e99b0aa60df24e7f051040074e2ab184163bf
Reviewed-on: http://gerrit.cloudera.org:8080/620
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-12-04 01:40:41 +00:00
Jim Apple
19b6bf0201 IMPALA-2226: Throw AnalysisError if table properties are too large (for the Hive metastore)
This only enforced the defaults in Hive. Users how manually choose to
change the schema in Hive may trigger these new analysis exceptions in
this commit unnecessarily. The Hive issue tracking the length
restrictions is

https://issues.apache.org/jira/browse/HIVE-9815

Change-Id: Ia30f286193fe63e51a10f0c19f12b848c4b02f34
Reviewed-on: http://gerrit.cloudera.org:8080/721
Reviewed-by: Jim Apple <jbapple@cloudera.com>
Tested-by: Internal Jenkins
2015-10-29 19:39:25 +00:00
Alex Behm
85c02cb144 Nested Types: Use a fixed-size comment for CREATE TABLE LIKE FILE.
Our CREATE TABLE LIKE FILE used to put the Parquet schema equivalent
into the comment of each column. Since the Parquet data model is very
verbose for deeply nested types compared to the Impala/Hive type, it
was easy to hit comment length limit in the Hive Metastore.

After this patch we populate the comment with a fixed-size message.
We are still limited by the Hive Metastore constraint on the length
of a type string. This patch adds checks and tests for exceeding
the type and comment length during a CREATE TABLE.

Change-Id: I83518e167cdd56c9e3f99e146269c37dc7b8e22a
Reviewed-on: http://gerrit.cloudera.org:8080/889
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
2015-09-22 10:58:33 -07:00
Vlad Berindei
ece7fed421 IMPALA-2316: Add RESTRICT to DROP DATABASE
Change-Id: Iffad73175b49160ae049911bd33c110a830f932b
Reviewed-on: http://gerrit.cloudera.org:8080/796
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-09-11 20:37:27 +00:00
Alex Behm
6f0b255c5a Address several shortcomings with respect to the usability of Avro tables.
Addressed JIRAs: IMPALA-1947 and IMPALA-1813

New Feature:
Adds support for creating an Avro table without an explicit
Avro schema with the following syntax.

CREATE TABLE <table_name> column_defs STORED AS AVRO

Fixes and Improvements:
This patch fixes and unifies the logic for reconciling differences between
an Avro table's Avro Schema and its column definitions. This reconciliation
logic is executed during Impala's CREATE TABLE and when loading a table's
metadata. Impala generally performs the schema reconciliation during table
creation, but Hive does not. In many cases, Hive's CREATE TABLE stores the
original column definitions in the HMS (in the StorageDescriptor) instead
of the reconciled column definitions.

The reconciliation logic considers the field/column names and follows this
conflict resolution policy which is similar to Hive's:

Mismatched number of columns -> Prefer Avro columns.
Mismatched name/type -> Prefer Avro column, except:
  A CHAR/VARCHAR column definition maps to an Avro STRING, and is preserved
  as a CHAR/VARCHAR in the reconciled schema.

Behavior for TIMESTAMP:
A TIMESTAMP column definition maps to an Avro STRING and is presented as a STRING
in the reconciled schema, because Avro has no binary TIMESTAMP representation.
As a result, no Avro table may have a TIMESTAMP column (existing behavior).

Change-Id: I8457354568b6049b2dd2794b65fadc06e619d648
Reviewed-on: http://gerrit.cloudera.org:8080/550
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-25 09:52:18 +00:00
Alex Behm
14a8cadcf6 Nested Types: Pretty print complex types in DESCRIBE.
The current DESCRIBE prints the column type as a single string without
whitespace. As a result, the DESCRIBE output for tables with complex types
is basically unreadable/unusable, e.g., from the Impala shell.

This patch adds a prettyPrint() function to the FE Type and uses that
for generating a nicely formatted DESCRIBE output.

The output of DESCRIBE FORMATTED is intentionally not modified because
exact Hive-compatibility has been and presumably continues to be very
important to our users.

Change-Id: Ida810facdffd970948b837b83a60f9ddcd95f44d
Reviewed-on: http://gerrit.cloudera.org:8080/633
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 09:26:35 +00:00
Taras Bobrovytsky
b8b7930377 Add nested types support to Create Table Like File
Add support for creating a table based on a parquet file which contains arrays,
structs and/or maps.

Change-Id: I56259d53a3d9b82f318228e864c783b48a03f9ae
Reviewed-on: http://gerrit.cloudera.org:8080/582
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-22 01:46:26 +00:00
Vlad Berindei
e4c42fa8bf IMPALA-595: Add CASCADE to DROP DATABASE and use it in cleanup_db
Change-Id: Idfa5b6943bc797e10d542487c31b8f1b527d8c97
Reviewed-on: http://gerrit.cloudera.org:8080/635
Reviewed-by: Vlad Berindei <vlad.berindei@cloudera.com>
Tested-by: Internal Jenkins
2015-08-20 03:34:31 +00:00
Alex Behm
f9d26fb896 IMPALA-2203: Set an InsertStmt's result exprs from the source statement's result exprs.
This patch fixes an issue where incorrect results are produced by a CTAS or IAS
that is fed from a QueryStmt that has outer-joined inline views with constants or
conditionals in the select list. The regression was introduced in this commit:
b8f642710ea9d311a7aca32611eaa7cac6cd86df

Now that the final expression substitution with TupleIsNullPredicate() wrapping
is performed in planning, the InsertStmt's result expressions should be taken
from the feeding QueryStmt's result expressions, and not the QueryStmt's
(already substituted) base table result expressions.

Change-Id: Iae29683638df01f140d0f74976cca8ca9ba0852d
Reviewed-on: http://gerrit.cloudera.org:8080/637
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
2015-08-18 01:44:45 +00:00
Henry Robinson
f22b8659fd IMPALA-1595: Add 'location' to SHOW [TABLE STATS|PARTITIONS] for HDFS tables
This patch adds a 'location' column to the output of SHOW TABLE STATS /
SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE
SET LOCATION commands, particularly for partitions, and is easier to
identify than the output of DESCRIBE FORMATTED.

Some existing tests in alter-table.test have been updated to include
checking the location output before and after a SET LOCATION
command. The tests in show.test have also been updated to check for the
location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a
generic regex to avoid overly verbose tests.

Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb
Reviewed-on: http://gerrit.cloudera.org:8080/316
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2015-04-21 19:27:50 +00:00
Dan Hecht
c8fb10f50a S3: Some more work toward enabling additional S3 test coverage
Add skip markers for S3 that can be used to categorize the tests that
are skipped against S3 to help see what coverage is missing.  Soon
we'll be reworking some tests and/or adding new tests to get back the
important gaps.

Also, add a mechanism to parameterize paths in the .test files, and
start using these new variables.  This is a step toward enabling some
more tests against S3.

Finally, a fix for buildall.sh to stop the minicluster before applying
the metastore snapshot. Otherwise, this fails since the ms db is in
use.

Change-Id: I142434ed67bed407e61d7b2c90f825734fc0dce0
Reviewed-on: http://gerrit.cloudera.org:8080/127
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-03-03 08:29:13 +00:00
Martin Grund
cee1e84c1e IMPALA-1587: Extending caching directives for multiple replicas
This patch adds the possibility to specify the number of replicas that
should be cached in main memory. This can be useful in high QPS
scenarios as the majority of the load is no longer the single cached
replica, but a set of cached replicas. While the cache replication
factor can be larger than the block replication factor on disk, the
difference will be ignored by HDFS until more replicas become
available.

This extends the current syntax for specifying the cache pool in the
following way:

   cached in 'poolName'

is extended with the optional replication factor

   cached in 'poolName' with replication = XX

By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent independent of the cache pool
that is used to cache the partition.

To review changes and status of the replication factor for tables and
partitions the replication factor is part of output of the "show
partitions" command.

Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-01-26 20:30:59 -08:00
Henry Robinson
44f57e5fb6 IMPALA-1122: Compute stats with partition granularity
This patch adds the ability to compute and drop column and table
statistics at partition granularity.

The following commands are added. Detail about the implementation
follows.

COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]

This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats was never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.

DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>

This variant of DROP stats removes the incremental statistics for the
given table. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.

--------

This is achieved by adapting the existing HLL UDA via swapping its
finalize method for a new one which returns the intermediate HLL
buckets, rather than aggregating and then disposing of them. This
intermediate state is then returned to Impala's catalog-op-executor.cc,
which then passes the intermediate state back to the frontend to be
ultimately stored in the HMS.

This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).

At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.

Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables: these structures could be zipped before
encoding for some space savings. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.

Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
2014-11-25 09:13:37 -08:00
Alex Behm
3c650c3ba5 IMPALA-1104: Fixes for compute stats on Avro tables. Create Avro tables without col defs.
This patch addresses the following issues:
1. Allow creating Avro tables without col defs in Impala. Compute stats works on them.
2. Handle table creation with inconsistent col defs and Avro schema as follows:
   The table creation will succeed and ignore the col defs in favor of the Avro schema.
   A warning is issued that the col defs and the Avro schema are inconsistent.
   Compute stats works on such tables.

This patch does not address the issue of compute stats after Avro schema evolution.

Change-Id: Iea6b737d238d81491dc2097012ebc149a89d03ba
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4182
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4250
Tested-by: jenkins
2014-09-11 20:50:05 -07:00
Alex Behm
19bab59854 Create/alter/describe tables with complex types.
This patch adds parsing of complex types and tests for using complex
types in various exprs and create/alter/describe stmts.

Change-Id: Ibc211a560c889f5ccfb616813700b923c89d8245
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3577
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3594
2014-07-23 17:26:14 -07:00
Victor Bittorf
2d7f2e19b2 IMPALA 938: Infer schema from Parquet file
Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
Run testdata/bin/create-load-data.sh after pulling this patch.

Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
2014-06-20 17:38:01 -07:00
anusha
ffc334a735 IMPALA-834: Fix for Create Table like Views
Change-Id: Ied1f706c48a1106e1d6fc2aa73e57746f52ea333
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2939
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3014
Reviewed-by: Anusha Dasarakothapalli <anusha.dasarakothapalli@cloudera.com>
2014-06-12 22:13:30 -07:00
Lenni Kuff
e97a1b52e0 Remove flaky verification in ALTER/CREATE table tests
This fixes the flaky ALTER/CREATE tests by removing a verification step that
didn't add value and was non-deterministic. The verficiation step that was
removed verified that CREATE/ALTER set the appropriate file format by
changing the format to something that didn't match the underlying data files,
then attempting to read the data. This is already covered by the positive
test case where the file format is changed to match the underlying data.

Change-Id: I66f485405234f472f3b83f3e776bf7f2c10de874
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1379
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1382
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-28 16:03:02 -08:00
Alex Behm
dd0409e9d6 IMPALA-509: Minimal type promotion for arithmetic exprs.
Change-Id: I576fe9baf3bae7d46ee08e29ececc4adda97e9df
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1078
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:54:30 -08:00
Lenni Kuff
76fa3b2ded Update DDL to support 'STORED AS PARQUET' and 'STORED AS AVRO' syntax
This change updates our DDL syntax support to allow for using 'STORED AS PARQUET'
as well as 'STORED AS PARQUETFILE'. Moving forward we should prefer the new syntax,
but continue to support the old.  I made the same change for 'AVROFILE', but since
we have not yet documented the 'AVROFILE' syntax I left out support for the old syntax.

Change-Id: I10c73a71a94ee488c9ae205485777b58ab8957c9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1053
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:18 -08:00
Lenni Kuff
39f77b8b8f Add support for cluster-synchronized catalog operations
This change adds support for cluster-synchronized catalog operations. This provides the
guaranteethat after a catalog op completes, all other subscribers to the catalog topic have
also processed that update. This is useful when load balancing, because a common workflow
is to target a different impalad for each statement executed.
For example if each of the following were executed sequentially, but targeting
a different node:
1) CREATE TABLE Foo
2) INSERT INTO Foo
3) SELECT * FROM Foo
4) INSERT INTO Foo ....

Since both the INSERT and the CREATE update the catalog, it would not work as expected
without this patch. The user might either get a "table not found" error or would be
missing partition information from the INSERT.

The downside is that this approach to DDL takes a bit longer because we need to wait
until all subscribers have processed an update. If all nodes are healthy, this overhead
should not be significantly longer than the current DDL time. However, a single bad node
might slow down or completely block the completion of all DDL operations. By default
this feature is disabled, but it can be enabled using a new query option: SYNCED_DDL=1

To test this, the base test suite was updated to support selecting a random impalad
to execute each query section in a query test file. This is currently only enabled
for the insert and DDL tests, but could be leveraged by more tests in the future.

TODO: Add additional failure tests around this functionality.
TODO: Add an explicit "sync" statement so users do not need to run all their DDL
in this mode (since it is slower).

Change-Id: I45e757a931bf2a4740cc0cdd1e76ce49a1e22b83
Reviewed-on: http://gerrit.ent.cloudera.com:8080/899
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:58 -08:00
Nong Li
c031cd4e96 Update RLE encoding to pad literal groups to 8.
Change-Id: I77cb2b80b888b569ff715c583f16aea4e39fe680
Reviewed-on: http://gerrit.ent.cloudera.com:8080/644
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-01-08 10:53:17 -08:00
Alex Behm
39f9a067fa IMPALA-444: Fixed accuracy of string to double conversion. Falling back to strod for scientific notation.
Change-Id: I9a5d948620907d34601ef041e58b1c9bb2172f71
Reviewed-on: http://gerrit.ent.cloudera.com:8080/507
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:56 -08:00
Nong Li
af90c8a133 Fix memory usage tracking.
Changes MemLimit to MemTracker:
- the limit is optional
- it also records a label and an optional parent
- Consume() and Release() also update the ancestors and there's also a new
  AnyLimitExceeded(), which also checks the ancestors
- the consumption counter is a HighwaterMarkCounter and can optionally be created
  as part of a profile

Each fragment instance now has a MemTracker that is part of a 3-level
hierarchy: process, query, fragment instance.

Change-Id: I5f580f4956fdf07d70bd9a6531032439aaf0fd07
Reviewed-on: http://gerrit.ent.cloudera.com:8080/339
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:36 -08:00
Lenni Kuff
d66d3bfce3 IMPALA-161: Add Impala support for CREATE TABLE AS SELECT
This adds support for CREATE TABLE AS SELECT to Impala. It supports all functionality a
regular CREATE TABLE statement includes, except it does not allow for for specifying
partition columns. Hive also has this limitation and it wouldn't be too hard to support
in the future.

Change-Id: I4ca3c3b8f1576441b8bb5ed9dc521d7dfa96ab74
Reviewed-on: http://gerrit.ent.cloudera.com:8080/157
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:17 -08:00
ishaan
e9e23bff5d Fix build because of a change in parquetfile.
This changes QueryTest/create.test to unblock the builds.

Change-Id: If91ac43e349c2f81034ba7504c27890781f33260
Reviewed-on: http://gerrit.ent.cloudera.com:8080/255
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:16 -08:00
Nong Li
fd53edbbe4 Fix parquet writer bug with not setting dictionary metadata.
Change-Id: Ia5c0886497678d31b82cb5052e06df437bb201be
Reviewed-on: http://gerrit.ent.cloudera.com:8080/114
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:02 -08:00
Skye Wanderman-Milne
e8344bb0d0 Dictionary encoding/decoding 2014-01-08 10:51:15 -08:00
Nong Li
68e4c14527 Fix parquet incompatibilities. 2014-01-08 10:50:22 -08:00
Alex Behm
5db3f2cdf5 IMPALA-227: SELECT * on partitioned table returns columns in different order than Hive. 2014-01-08 10:49:48 -08:00
Lenni Kuff
e218721386 IMPALA-198: Support setting file format, table comment in CREATE TABLE LIKE statements 2014-01-08 10:49:31 -08:00
Lenni Kuff
15f0313283 Add analysis checks for length of RowFormat strings, fix escaping of row format values 2014-01-08 10:49:21 -08:00
Lenni Kuff
ca0d23a844 IMPALA-157: Support CREATE TABLE LIKE DDL 2014-01-08 10:48:55 -08:00