132 Commits

Author SHA1 Message Date
Nong Li
a0bf45a0b4 Add udf type.
Change-Id: Ic5f52c127750cc9c847a3e34d3fdcfc78bee5a8a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/454
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:48 -08:00
Alex Behm
33000b8c15 Fixed codegen of floating-point modulo.
Change-Id: Idd28c6a71a659471aa632a6e26d970557daeb3bf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/385
Tested-by: jenkins
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:46 -08:00
Nong Li
308650f208 Fix create function ddl test setup issue.
Change-Id: I30c9a4342efbdb17bd53fb14bdcee172506cdadb
Reviewed-on: http://gerrit.ent.cloudera.com:8080/447
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:44 -08:00
Nong Li
8eb727b585 UDF ddl cleanup
Change-Id: I381fed277b5809727d2d8bf430258c01d2d0ae1f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/436
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:43 -08:00
Nong Li
b22d1f41a7 Change all "Status Close()" to "void Close()"
This ensures we don't bail early on the Close path, since bailing early
there is rarely the right thing to do. The change uncovered a few places
where we were not doing proper cleanup because of early returns.

Change-Id: Ie663c68398c14589b5cbc1bd980644b0b10fd865
Reviewed-on: http://gerrit.ent.cloudera.com:8080/373
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:38 -08:00
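
The motivation behind the `void Close()` commit above can be illustrated with a small sketch. It is written in Python rather than the C++ of the codebase, and `Resource`, `close_all_bailing` and `close_all` are hypothetical names, not Impala code. The idea: when a close routine returns a status and the caller stops at the first failure, later resources are never cleaned up; a close routine that returns nothing leaves no early exit to take.

```python
class Resource:
    """Hypothetical resource that records whether it was closed."""

    def __init__(self, name, fails=False):
        self.name = name
        self.fails = fails
        self.closed = False

    def close_with_status(self):
        # Old style: report success or failure to the caller.
        self.closed = True
        return not self.fails

    def close(self):
        # New style: always clean up, return nothing.
        self.closed = True


def close_all_bailing(resources):
    """Stops at the first failed close, skipping everything after it."""
    for r in resources:
        if not r.close_with_status():
            return False
    return True


def close_all(resources):
    """With no status to check, every resource is always closed."""
    for r in resources:
        r.close()


def make():
    return [Resource("scanner", fails=True), Resource("buffer pool")]


leaky = make()
close_all_bailing(leaky)
leaked = [r.name for r in leaky if not r.closed]

clean = make()
close_all(clean)
still_open = [r.name for r in clean if not r.closed]
```

In this sketch `leaked` ends up containing the second resource, while `still_open` is empty.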
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Nong Li
af90c8a133 Fix memory usage tracking.
Changes MemLimit to MemTracker:
- the limit is optional
- it also records a label and an optional parent
- Consume() and Release() also update the ancestors, and a new
  AnyLimitExceeded() checks the ancestors as well
- the consumption counter is a HighwaterMarkCounter and can optionally be created
  as part of a profile

Each fragment instance now has a MemTracker that is part of a 3-level
hierarchy: process, query, fragment instance.

Change-Id: I5f580f4956fdf07d70bd9a6531032439aaf0fd07
Reviewed-on: http://gerrit.ent.cloudera.com:8080/339
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:36 -08:00
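
The tracker hierarchy described in the MemTracker commit above might be modelled roughly as follows. This is a Python sketch under stated assumptions, not the actual C++ class: the method names are lowercased analogues of `Consume()`, `Release()` and `AnyLimitExceeded()`, the `peak` field stands in for the high-water-mark counter, and treating "exceeded" as strictly greater than the limit is an assumption.

```python
class MemTracker:
    """Hypothetical model of a hierarchical memory tracker."""

    def __init__(self, label, limit=None, parent=None):
        self.label = label
        self.limit = limit        # None means "no limit"
        self.parent = parent      # None for the root tracker
        self.consumption = 0
        self.peak = 0             # high-water mark of consumption

    def _chain(self):
        # Yields this tracker followed by all of its ancestors.
        t = self
        while t is not None:
            yield t
            t = t.parent

    def consume(self, nbytes):
        for t in self._chain():
            t.consumption += nbytes
            t.peak = max(t.peak, t.consumption)

    def release(self, nbytes):
        for t in self._chain():
            t.consumption -= nbytes

    def any_limit_exceeded(self):
        # Assumption: a limit is exceeded when consumption is above it.
        return any(
            t.limit is not None and t.consumption > t.limit
            for t in self._chain()
        )


# Three levels, as in the commit message: process, query, fragment instance.
process = MemTracker("process", limit=100)
query = MemTracker("query", limit=60, parent=process)
fragment = MemTracker("fragment instance", parent=query)

fragment.consume(70)
exceeded_after_consume = fragment.any_limit_exceeded()
fragment.release(30)
exceeded_after_release = fragment.any_limit_exceeded()
```

The fragment tracker has no limit of its own here, yet `any_limit_exceeded()` reports the query-level limit being crossed, which is the point of walking the ancestors.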
Nong Li
2394ae2e66 UDF parsing and analysis.
Change-Id: If8058c1cb66bf5e9c7049d4b78f5882b46c03fc1
Reviewed-on: http://gerrit.ent.cloudera.com:8080/318
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:32 -08:00
Aaron Davidson
cafb7b72f8 External sorting
This is an experimental implementation of external sorting. This patch includes the following additions:
(1) creation and implementation of the Sorter interface, which can sort Impala Tuples,
(2) normalization of Tuples to allow memcmp-able sorting,
(3) a testing framework for the Sorter,
(4) a benchmark to compare the current state of the Sorter with other sorts,
(5) an implementation of a Vector which can store data whose size is only known at runtime,
(6) a sorting algorithm (basically a dumbed down STL sort) which can operate over such a vector,
(7) implementation of a simple in-memory Merger, and
(8) logic to stream blocks of memory in and out of memory for the actual external merging.

I have a local branch for experimental optimizations and benchmarking -- this should be considered
a "basic", working sort.

The following optimizations have been implemented:
(i)   Optionally extracting keys instead of writing them in place.
(ii)  Optionally and opportunistically parallelizing run building (sorting and preparing for output).
(iii) Maximize disk IO and minimize buffer recycling by writing buffers out, but also keeping
      them in memory until right when they're needed.
(iv)  Prepare auxiliary data backwards so the buffers can be released as we go, and still
      go out in an order which preserves the first buffers of the run.
(v)   Always merge maximum number of runs at a time, taking from the next merge level if
      available.

Change-Id: I1d7304d54d73152da929b1efffc1e851e5fb8fd4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/126
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Aaron Davidson <aaron.davidson@cloudera.com>
2014-01-08 10:52:27 -08:00
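
Two ideas from the external sorting commit above, memcmp-able key normalization and merging of sorted runs, can be sketched in a few lines. This is an illustrative Python sketch, not the Sorter from the patch: all function names are hypothetical, keys are assumed to be signed 32-bit integers, and the runs stay in memory instead of being streamed to and from disk.

```python
import heapq
import struct


def normalize_key(value):
    """Encodes a signed 32-bit int so that byte order equals numeric order.

    Adding 2**31 (equivalent to flipping the sign bit) and storing the
    result big-endian makes a plain bytes comparison, the Python analogue
    of memcmp, agree with integer order.
    """
    return struct.pack(">I", (value + 2**31) & 0xFFFFFFFF)


def denormalize_key(key):
    return struct.unpack(">I", key)[0] - 2**31


def build_runs(values, run_size):
    """Sorts fixed-size chunks; each sorted chunk stands in for a spilled run."""
    runs = []
    for start in range(0, len(values), run_size):
        chunk = [normalize_key(v) for v in values[start:start + run_size]]
        chunk.sort()  # compares raw bytes only
        runs.append(chunk)
    return runs


def merge_runs(runs):
    """Performs a k-way merge over the sorted runs."""
    return [denormalize_key(k) for k in heapq.merge(*runs)]


data = [5, -3, 12, 0, -40, 7, 7, 2**31 - 1, -2**31]
runs = build_runs(data, run_size=4)
result = merge_runs(runs)
```

With nine values and a run size of four, the sketch builds three runs and merges them into one fully sorted sequence.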
Aaron Davidson
00275ce3a9 (IMPALA-422) Add string concatenation function
Implements a group_concat() function which concatenates all the values in a group together.

The format is group_concat(str_col, [separator]). The default separator is ', '. NULLs
are ignored.

Change-Id: If152df6f528401117dba81d66ef691bfb548cc7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/117
Reviewed-by: Aaron Davidson <aaron.davidson@cloudera.com>
Tested-by: Aaron Davidson <aaron.davidson@cloudera.com>
2014-01-08 10:52:21 -08:00
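
The semantics stated in the group_concat() commit above (default separator ', ', NULLs ignored) can be modelled with a short Python function. This is only a sketch of the described behaviour: `None` stands in for SQL NULL, and returning `None` for a group with no non-NULL values is an assumption the commit message does not confirm.

```python
def group_concat(values, separator=", "):
    """Joins the non-NULL values of one group; None stands in for SQL NULL."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Assumption: an empty or all-NULL group yields NULL.
        return None
    return separator.join(non_null)


default_sep = group_concat(["a", None, "b", "c"])
custom_sep = group_concat(["a", None, "b"], separator="|")
```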
Lenni Kuff
d66d3bfce3 IMPALA-161: Add Impala support for CREATE TABLE AS SELECT
This adds support for CREATE TABLE AS SELECT to Impala. It supports all functionality a
regular CREATE TABLE statement includes, except it does not allow specifying
partition columns. Hive also has this limitation and it wouldn't be too hard to support
in the future.

Change-Id: I4ca3c3b8f1576441b8bb5ed9dc521d7dfa96ab74
Reviewed-on: http://gerrit.ent.cloudera.com:8080/157
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:17 -08:00
ishaan
e9e23bff5d Fix build because of a change in parquetfile.
This changes QueryTest/create.test to unblock the builds.

Change-Id: If91ac43e349c2f81034ba7504c27890781f33260
Reviewed-on: http://gerrit.ent.cloudera.com:8080/255
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:16 -08:00
Nong Li
a3bc1ce133 Some parquet encoder/decoder refactoring. Added dictionary to other types.
Split out the encoder/type for parquet reader/writer. I think this puts us
in a better place to support future encodings.

On the tpch lineitem table, the results are:
Before:
  BytesWritten: 236.45 MB
  Per Column Sizes:
    l_comment: 75.71 MB
    l_commitdate: 8.64 MB
    l_discount: 11.19 MB
    l_extendedprice: 33.02 MB
    l_linenumber: 4.56 MB
    l_linestatus: 869.98 KB
    l_orderkey: 8.99 MB
    l_partkey: 27.02 MB
    l_quantity: 11.58 MB
    l_receiptdate: 8.65 MB
    l_returnflag: 1.40 MB
    l_shipdate: 8.65 MB
    l_shipinstruct: 1.45 MB
    l_shipmode: 2.17 MB
    l_suppkey: 21.91 MB
    l_tax: 10.68 MB
After:
  BytesWritten: 198.63 MB           (84%)
  Per Column Sizes:
    l_comment: 75.71 MB             (100%)
    l_commitdate: 8.64 MB           (100%)
    l_discount: 2.89 MB             (25.8%)
    l_extendedprice: 33.13 MB       (100.33%)
    l_linenumber: 1.50 MB           (32.89%)
    l_linestatus: 870.26 KB         (100.032%)
    l_orderkey: 9.18 MB             (102.11%)
    l_partkey: 27.10 MB             (100.29%)
    l_quantity: 4.32 MB             (37.31%)
    l_receiptdate: 8.65 MB          (100%)
    l_returnflag: 1.40 MB           (100%)
    l_shipdate: 8.65 MB             (100%)
    l_shipinstruct: 1.45 MB         (100%)
    l_shipmode: 2.17 MB             (100%)
    l_suppkey: 10.11 MB             (46.14%)
    l_tax: 2.89 MB                  (27.06%)

The table is overall 84% as big (i.e. 16% smaller). A few columns got marginally
bigger. If the file filled the 1 GB, I'd expect the overhead to decrease even
more.

The restructuring to use a virtual call doesn't seem to change things much and
will go away when we codegen the scanner.

Here are the query times with this patch (note this is on the before data files,
so only string cols are dictionary encoded).

Before query times:
  Insert Time: 8.5 sec
  select *: 2.3 sec
  select avg(l_orderkey): .33 sec

After query times:
  Insert Time: 9.5 sec                  <-- Longer due to doing dictionary encoding
  select *: 2.4 sec                     <-- kind of noisy, possibly a slight slowdown
  select avg(l_orderkey): .33 sec

Change-Id: I213fdca1bb972cc200dc0cd9fb14b77a8d36d9e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/238
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:16 -08:00
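
To make the numbers in the parquet commit above easier to follow, here is a minimal Python sketch of dictionary encoding together with the percentage arithmetic behind the "(84%)" and "(25.8%)" annotations. The functions are hypothetical illustrations, not the parquet writer's code, and the sample column values are made up; only the MB figures come from the commit message.

```python
def dictionary_encode(values):
    """Returns (dictionary, indices): each distinct value is stored once."""
    dictionary = []
    position = {}
    indices = []
    for v in values:
        if v not in position:
            position[v] = len(dictionary)
            dictionary.append(v)
        indices.append(position[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    return [dictionary[i] for i in indices]


def percent_of(after, before):
    """Size after the change as a percentage of the size before."""
    return 100.0 * after / before


# Made-up low-cardinality column: few distinct values, many repeats.
column = [0.05, 0.10, 0.05, 0.00, 0.10, 0.05]
dictionary, indices = dictionary_encode(column)
round_trip = dictionary_decode(dictionary, indices)

total_pct = percent_of(198.63, 236.45)      # BytesWritten, MB
discount_pct = percent_of(2.89, 11.19)      # l_discount, MB
```

Columns with few distinct values benefit because small indices replace repeated full values, while a column of mostly distinct values gains little and pays for the dictionary itself, which is consistent with the columns that got marginally bigger.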
Skye Wanderman-Milne
b9ea32e9b7 Fix IMPALA-129, IMPALA-534, and other scanner bugs.
Change-Id: Idbd29af3fcc35b9e1173d08ac55b5780751c5938
Reviewed-on: http://gerrit.ent.cloudera.com:8080/196
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:14 -08:00
Alex Behm
9a201645cd IMPALA-496: Fix escaping of field delimiter and escape character in inserts
Change-Id: I49c36ae9823b35dcb9e92d1a13bef270657e36f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/163
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:09 -08:00
Alex Behm
f0e2d539fc IMPALA-495: Views Sometimes Not Utilizing Partition Pruning.
Change-Id: I65daebbe8c4b72b956a409fe28edd3773fda7cb7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/128
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:04 -08:00
Alex Behm
c9965e5a5c Fix build break due to views defined by a constant select.
Change-Id: I5deeeb03469494f5ba6ed7a911354bbdd6c98195
Reviewed-on: http://gerrit.ent.cloudera.com:8080/149
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
2014-01-08 10:52:04 -08:00
Alex Behm
2b427208e5 IMPALA-507: Creating a VIEW that does not reference a table fails with IllegalStateException.
Change-Id: I11470ba919bbfced76730adae2a46647c4ef110b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/146
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:04 -08:00
Alex Behm
52c9d26d16 IMPALA-475: Impala should avoid the use of c_# style autogenerated column aliases unless necessary.
Change-Id: I959e35bcee1698ebc35534dc4f390c5c2c7dc919
Reviewed-on: http://gerrit.ent.cloudera.com:8080/141
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:03 -08:00
Alex Behm
9754f5bf52 IMPALA-504: Right and full outer joins do not return row with NULL value for rhs table.
Change-Id: Ia3f8d474fb30189b36fb587b2920d7b9b224ea71
Reviewed-on: http://gerrit.ent.cloudera.com:8080/129
Tested-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:03 -08:00
Skye Wanderman-Milne
6e7406df8b IMPALA-502: Impala does not return NULL for case where table has extra string column and data does not (it returns an empty string)
Change-Id: I0cfe5ce5fc279d46610a3cc191a501ccbc335296
Reviewed-on: http://gerrit.ent.cloudera.com:8080/127
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Skye Wanderman-Milne <skye@cloudera.com>
2014-01-08 10:52:02 -08:00
Nong Li
fd53edbbe4 Fix parquet writer bug with not setting dictionary metadata.
Change-Id: Ia5c0886497678d31b82cb5052e06df437bb201be
Reviewed-on: http://gerrit.ent.cloudera.com:8080/114
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:02 -08:00
Lenni Kuff
faeb7f5fa3 Add scanner test case for scenario where data and table schema do not match
Change-Id: I16f007ad1cb2caac47506914512c5665fc3d5f56
Reviewed-on: http://gerrit.ent.cloudera.com:8080/98
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:52:01 -08:00
Skye Wanderman-Milne
3fecdeb793 IMPALA-441: support default values for Avro tables 2014-01-08 10:51:39 -08:00
Alex Behm
8ad15fabcf IMPALA-372: Added CREATE/DROP/ALTER VIEW. 2014-01-08 10:51:35 -08:00
Alex Behm
3bba336bbf IMPALA-359: Return proper tuple id of inline view with distinct aggregation. 2014-01-08 10:51:26 -08:00
Alan Choi
254ee6ef89 IMPALA-434 Support binary hbase encoding 2014-01-08 10:51:18 -08:00
Skye Wanderman-Milne
e8344bb0d0 Dictionary encoding/decoding 2014-01-08 10:51:15 -08:00
Lenni Kuff
c2cfc7e2a3 IMPALA-373: Add support for 'LOAD DATA' statements
This change adds Impala support for LOAD DATA statements. This allows the user
to load one or more files into a table or partition from a given HDFS location. The
load operation only moves files; it does not convert data to match the target
table/partition's file format.
2014-01-08 10:51:02 -08:00
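
The "only moves files" behaviour described in the LOAD DATA commit above can be pictured with a local-filesystem analogue. This Python sketch is an assumption-laden illustration: it uses temporary local directories instead of HDFS, and `load_data`, the directory names and the file contents are all hypothetical.

```python
import os
import shutil
import tempfile


def load_data(source_dir, table_dir):
    """Moves every file from source_dir into table_dir without rewriting it."""
    moved = []
    for name in sorted(os.listdir(source_dir)):
        shutil.move(os.path.join(source_dir, name), os.path.join(table_dir, name))
        moved.append(name)
    return moved


root = tempfile.mkdtemp()
staging = os.path.join(root, "staging")
table = os.path.join(root, "table")
os.makedirs(staging)
os.makedirs(table)

payload = b"1,alice\n2,bob\n"
with open(os.path.join(staging, "part-0.csv"), "wb") as f:
    f.write(payload)

moved = load_data(staging, table)
with open(os.path.join(table, "part-0.csv"), "rb") as f:
    loaded = f.read()
remaining = os.listdir(staging)
shutil.rmtree(root)  # clean up the temporary directories
```

The bytes in the table directory are identical to what was staged, so a file whose format does not match the table's declared format would still be moved as-is.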
Alex Behm
045038e479 IMPALA-374: Added WITH clause without recursion. 2014-01-08 10:51:00 -08:00
Henry Robinson
79b36a5eb3 IMPALA-375: Add column permutation clause to INSERT statement 2014-01-08 10:50:59 -08:00
Nong Li
ce092065be Fix bug with how exec sets if the conjuncts are thread safe. 2014-01-08 10:50:53 -08:00
Alan Choi
b1de018298 IMPALA-31 Support EXPLAIN <query>
Hue is moving to HiveServer2 but HiveServer2 does not have an "explain" RPC
call. To support "explain", I added it to the language.

An "explain" statement will return a result set: one row per explain line.
2014-01-08 10:50:32 -08:00
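
The result-set shape described in the EXPLAIN commit above, one row per explain line, amounts to splitting the plan text on newlines. The Python sketch below is illustrative only: `explain_result_set` is a hypothetical helper and the plan text is made up rather than real Impala output.

```python
def explain_result_set(plan_text):
    """Turns a multi-line plan into rows of one string column each."""
    return [(line,) for line in plan_text.splitlines()]


# Made-up plan text, only for illustration.
plan = "AGGREGATE\n  SCAN HDFS table=example"
rows = explain_result_set(plan)
```

Returning the plan as ordinary rows lets a client that only knows how to fetch result sets display it without a dedicated RPC.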
Alex Behm
937a44f9f8 IMPALA-68: Support Values() statement. 2014-01-08 10:50:31 -08:00
Alex Behm
c7819f4db7 IMPALA-87: Support INSERT from SELECT without FROM. 2014-01-08 10:50:30 -08:00
Alex Behm
9ff09cd3f4 IMPALA-70: Respect tbl properties to allow empty strings to be treated as NULL 2014-01-08 10:50:28 -08:00
Lenni Kuff
627e74a068 Fix insert test failure by cleaning up table before executing query 2014-01-08 10:50:27 -08:00
Lenni Kuff
e0507e192b Fix unstable alter table test 2014-01-08 10:50:26 -08:00
Nong Li
261119b91f Forgot to update the test in previous commit. 2014-01-08 10:50:23 -08:00
Nong Li
8af35425e6 Fix unstable ordering with nans. 2014-01-08 10:50:22 -08:00
Nong Li
68e4c14527 Fix parquet incompatibilities. 2014-01-08 10:50:22 -08:00
Henry Robinson
ead69d377f IMPALA-249, IMPALA-252: Fixes for static partition keys. 2014-01-08 10:50:14 -08:00
Alex Behm
861ba05989 IMPALA-197: Outer join on constant expressions returns incorrect results. 2014-01-08 10:50:09 -08:00
Alex Behm
c9040aee22 IMPALA-111: COUNT(DISTINCT col) returns wrong results -- does not ignore NULLs. 2014-01-08 10:50:09 -08:00
Alex Behm
14557c7bab IMPALA-297: Remove distinction between value_expr and expr in parser. 2014-01-08 10:50:08 -08:00
Skye Wanderman-Milne
0c343913fa IMPALA-266: Round() does not output the right precision 2014-01-08 10:50:02 -08:00
Henry Robinson
7d2c47ad72 IMPALA-258: Make partition key string encoding Hive-compatible 2014-01-08 10:49:54 -08:00
Alex Behm
abafcf81ff IMPALA-287: Full outer join is missing results. 2014-01-08 10:49:54 -08:00
Alex Behm
4c45bc06c4 IMPALA-84: Predicates not evaluated if select exprs are constant. 2014-01-08 10:49:53 -08:00
Alex Behm
dbe3127383 IMPALA-285: Multiple outer joins with nesting crash impalad 2014-01-08 10:49:53 -08:00