323 Commits

Author SHA1 Message Date
Dan Hecht
1fee56cb26 IMPALA-1080: Implement "SET <query_option>" as SQL statement.
Also add support for "SET", which returns a table of query options and
their respective values.

The front-end parses the option into a (key, value) pair and then the
existing backend logic is used to set the option, or return the result
sets.
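
The (key, value) split described above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Impala front-end parser; the function name `parse_set_statement` and the accepted syntax are assumptions made for the example.

```python
from typing import Optional, Tuple


def parse_set_statement(sql: str) -> Optional[Tuple[str, str]]:
    """Hypothetical helper (not Impala code).

    Returns (key, value) for 'SET key=value', or None for a bare 'SET',
    which per the commit message returns the table of all query options.
    """
    text = sql.strip().rstrip(";").strip()
    parts = text.split(None, 1)
    if not parts or parts[0].upper() != "SET":
        raise ValueError("not a SET statement")
    if len(parts) == 1:
        return None  # bare SET: caller lists every option
    key, sep, value = parts[1].partition("=")
    if not sep or not key.strip():
        raise ValueError("expected SET <key>=<value>")
    return key.strip(), value.strip()
```

The resulting pair would then be handed to whatever existing logic sets the option, which is the division of labour the message describes.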

Change-Id: I40dbd98537e2a73bdd5b27d8b2575a2fe6f8295b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3582
Reviewed-by: Daniel Hecht <dhecht@cloudera.com>
Tested-by: jenkins
(cherry picked from commit aa0f6a2fc1d3fe21f22cc7bc56887e1fdb02250b)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3614
2014-07-25 10:25:09 -07:00
Alex Behm
e9864d5f78 Introduce type hierarchy and add complex types.
This patch replaces ColumnType with a hierarchy of types that models
the existing scalar types as well as the new complex types ARRAY, MAP,
and STRUCT.

Change-Id: Ia895f41153e99febb0c35412acac12689c3c2064
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3491
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3538
2014-07-21 20:00:46 -07:00
Paden Tomasello
879a40913c Implemented UDFs for timestamp functions.
FromUtc and ToUtc use third-party libraries that use inline asm, which
isn't currently supported with JIT. The UDFs are included in this
commit, but the function symbols were not changed in
impala_functions.py

Change-Id: I0824a434d4a26a39abf29bc6e47d51b5ad7991d6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3390
Reviewed-by: Paden Tomasello <paden.tomasello@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8e149ccd78010b7a22d6fff1b0de5614848b02ac)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3548
2014-07-21 15:27:46 -07:00
Lenni Kuff
7157f54bbe Support DROP STATS <table name>
Adds support for dropping all table and column stats from a table. Once incremental
stats are supported, this will provide the user a way to force a recompute of all
stats.

Change-Id: I27e03d5986b64eb91852bfc3417ffa971d432d6b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3533
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f1f074f24bfdc77c4cef147fe9d26f27df80ab81)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3551
2014-07-21 10:28:16 -07:00
Abdullah Yousufi
f4d1afe0ce IMPALA-921: Change EXPLAIN_LEVEL value from 0 to 1 in impala-shell for SET command
Change-Id: I2bfcefb5c8143d4cb4d74157c5309cd9445bac02
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3383
Reviewed-by: Abdullah Yousufi <abdullah.yousufi@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3499
2014-07-15 12:32:43 -07:00
Henry Robinson
dd4c1c32dc Add optional RM reservation limit to memtrackers
If RM and per-query memory limits were enabled at the same time, the
per-query limit would be ignored if RM wanted to expand the memory
allocation. This change adds an optional reservation limit to a
memtracker. The original limit goes back to being a hard limit -
i.e. any attempt to consume more than that amount results in
failure. The RM reservation limit is the RM-allocated memory limit. If
that is exceeded it triggers the ExpandRmReservation() method, which tries
to retrieve more memory as long as the hard limit is observed.

The net effect is that per-query memory limits have the intended,
hard-limit effect, while the RM limits coexist nicely and can expand
with more memory as required.

At the same time, we change the precedence of various ways of suggesting
an initial reservation size so that the user can change the reservation
size via a query option (MEM_RESERVATION_SIZE).
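
The interplay between the hard limit and the expandable RM reservation can be sketched as a toy model. This is only an illustration in Python under stated assumptions: the class, its fields, and the `expand_rm_reservation` callback are hypothetical stand-ins for the real C++ memtracker and its ExpandRmReservation() method.

```python
from typing import Callable, Optional


class ToyMemTracker:
    """Toy model of a tracker with a hard limit and an RM reservation limit."""

    def __init__(self, hard_limit: int, rm_reservation: int,
                 expand_rm_reservation: Optional[Callable[[int], int]] = None):
        self.hard_limit = hard_limit          # never exceeded
        self.rm_reservation = rm_reservation  # may grow on demand
        # Hypothetical callback: asks the RM for `n` more bytes and
        # returns how many bytes were actually granted.
        self.expand_rm_reservation = expand_rm_reservation
        self.consumed = 0

    def try_consume(self, nbytes: int) -> bool:
        new_total = self.consumed + nbytes
        # The hard limit is checked first: exceeding it always fails.
        if new_total > self.hard_limit:
            return False
        # Exceeding the reservation triggers an expansion attempt.
        if new_total > self.rm_reservation:
            needed = new_total - self.rm_reservation
            if self.expand_rm_reservation is None:
                return False
            granted = self.expand_rm_reservation(needed)
            if granted < needed:
                return False
            self.rm_reservation += granted
        self.consumed = new_total
        return True
```

In this model a request that fits under the hard limit but not under the reservation succeeds only if the expansion is granted, while a request over the hard limit fails without ever contacting the RM.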

Change-Id: I41bfa4eb1336810a8a5946f6be3472111a052144
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3134
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2014-07-01 18:08:47 -07:00
Lenni Kuff
ad933ec765 Switch terminology of 'impersonated user' to 'delegated user'
This is to help ensure naming is consistent across the platform and
also avoid confusion with HS2 "impersonation" which is something very
different.

Change-Id: I48c1b76dff75b92b11ddc7aab0eb9a3a5d20e489
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3315
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 931f6a66c0d8dff25b746d127dc1f36e96b12f98)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3326
2014-06-28 20:46:06 -07:00
Henry Robinson
2a374e5893 Prepare resource broker for cancellation changes
This patch anticipates the changes to Llama that allow a
client-specified resource ID to be returned with every reservation or
expansion request. Doing this allows us to remove the tricky
coordination logic between WaitForNotification() and AMNotification()
when we don't know which side will access the rendezvous data structures
first. Now we can guarantee that the consumer-side will be set-up before
the notification is received.

Change-Id: I908b1dae8d074a84b0465e3a444d6651f126efd7
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3093
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
2014-06-21 00:08:19 -07:00
Victor Bittorf
2d7f2e19b2 IMPALA-938: Infer schema from Parquet file
Syntax is "CREATE TABLE name LIKE fileformat '/path/to/file'".
Supports all options that CREATE TABLE does. Currently only PARQUET is supported.
Run testdata/bin/create-load-data.sh after pulling this patch.

Change-Id: Ibb9fbb89dbde6acceb850b914c48d12f22b33f55
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2720
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3158
2014-06-20 17:38:01 -07:00
Victor Bittorf
778fba232c Adding MADlib vector operations.
These are basic UDFs which the user will need to interact with MADlib.

Change-Id: Iadcec2376e29d2f73f1bb04d5e695c58d9381952
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2368
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 2eac4ec1287e354c26c2b8916515307fe793fb93)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3160
2014-06-19 14:48:59 -07:00
Paden Tomasello
dca53ce023 Changes to row-batch.cc and Data.thrift interface.
This change will allow row-batch.cc to use LZ4 codec.
It will be implemented in a following patch.

Change-Id: I9302da1b72c83fcf8420724138d40ad0d82c554b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3030
Reviewed-by: Paden Tomasello <paden.tomasello@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3155
2014-06-19 12:53:41 -07:00
Alex Behm
ef6705d7e0 Rename MergeNode to UnionNode.
Change-Id: I9e3675a103757db1345b04bd1d102d2719efddd0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3128
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3154
Tested-by: Alex Behm <alex.behm@cloudera.com>
2014-06-19 12:44:21 -07:00
Alex Behm
677062be3d Rework planning of unions s.t. a UnionStmt produces a single MergeNode.
This patch changes the planning of a UnionStmt s.t. it always produces a single fragment
with a MergeNode connecting all child fragments as its root.
The data partition of the returned fragment and how the child fragments are merged
depends on the data partitions of the child fragments:
- All child fragments are unpartitioned or all are partitioned: The returned fragment
  has an UNPARTITIONED or RANDOM data partition, respectively. The MergeNode absorbs
  the plan trees of all child fragments.
- Mixed partitioned/unpartitioned child fragments: The returned fragment is
  RANDOM partitioned. The plan trees of all partitioned child fragments are absorbed
  into the MergeNode. All unpartitioned child fragments are connected to the
  MergeNode via a RANDOM exchange, and remain unchanged otherwise.

Also adds support for random partitioned data exchanges.
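
The partitioning rules listed above can be written down as a small decision function. This is a sketch in Python for illustration only; the function name `plan_union` and the representation of child fragments as booleans are assumptions, not the planner's actual Java code.

```python
from typing import List, Tuple


def plan_union(children_partitioned: List[bool]) -> Tuple[str, List[int], List[int]]:
    """Hypothetical model of the rules in the commit message.

    children_partitioned[i] is True if child fragment i is partitioned.
    Returns (data partition of the returned fragment,
             indexes of children absorbed into the MergeNode,
             indexes of children connected via a RANDOM exchange).
    """
    if not children_partitioned:
        raise ValueError("a union needs at least one child fragment")
    everyone = list(range(len(children_partitioned)))
    if not any(children_partitioned):
        # All unpartitioned: absorb everything, stay UNPARTITIONED.
        return "UNPARTITIONED", everyone, []
    if all(children_partitioned):
        # All partitioned: absorb everything, result is RANDOM.
        return "RANDOM", everyone, []
    # Mixed: absorb the partitioned children, exchange the rest.
    absorbed = [i for i, p in enumerate(children_partitioned) if p]
    exchanged = [i for i, p in enumerate(children_partitioned) if not p]
    return "RANDOM", absorbed, exchanged
```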

Change-Id: I82b2d12c104d98c4e7133234653ee1b67658ef7a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2876
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3143
2014-06-19 00:56:58 -07:00
Alex Behm
9dc883b140 IMPALA-1005: Print consistent plan fragment ids in explain plan and runtime profile.
Change-Id: I63b59a896dc9dc0c9ed1d5e889f7b5626ba61202
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3037
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3124
2014-06-18 15:44:43 -07:00
Paden Tomasello
0326f17bb3 Adding Lz4 Codec.
Change-Id: I037d4e0de3b2cd2b8582caea058c8e1f2f880ff3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3027
Reviewed-by: Paden Tomasello <paden.tomasello@cloudera.com>
Tested-by: jenkins
2014-06-16 14:20:34 -07:00
Matthew Jacobs
dbe1b534ed IMPALA-1050: NPE error when pool placement policy cannot map user to pool
Change-Id: I53ed823ee55bee96269f4119af7da2dab25d4a7c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3028
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 569bd5d4a8e30a907a33551c58a3ab80849b8dc9)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3061
2014-06-15 13:38:20 -07:00
Nong Li
5bbf006d19 Update parquet spec to 2.0 and add decimal logical type.
Change-Id: I1a4cbe73a2494f8b2dd09f44bfcc0a019e710344
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3034
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-06-13 11:18:00 -07:00
Alex Behm
0251e5215c Allow MergeNode with constant selects to run correctly on multiple fragment instances.
Change-Id: I0b1ff27f591366b960aa944fadabbb4b35f4b9b4
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2832
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3002
2014-06-12 16:39:55 -07:00
Skye Wanderman-Milne
1cc628d32d IMPALA-950: Skip computing stats for decimal columns.
This patch also adds a mechanism to return analysis warnings to
client, which is used to log skipped decimal columns.

Change-Id: I30c246044a68ec8861cd5bed072bd54e65a079e6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2822
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit fc77422acef7e6f93fdeb5448309414b905f0725)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2984
2014-06-11 19:16:34 -07:00
Nong Li
5d903efca3 ExecSummary
The runtime profile as we present it is not very useful and I think the structure of
it makes it hard to consume. This patch adds a new client facing schemed set of
counters that are collected from the runtime profiles. For example, with this structure
it would be easy to have the shell get the stats of a running query and print a useful
progress report or to check the most relevant metrics for diagnosing issues.

Here's an example of the output for one of the tpch queries:
Operator              #Hosts   Avg Time   Max Time    #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
------------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE        1   79.738us   79.738us        5           5         0        -1.00 B  UNPARTITIONED
05:TOP-N                   3   84.693us   88.810us        5           5  12.00 KB       120.00 B
04:AGGREGATE               3    5.263ms    6.432ms        5           5  44.00 KB       10.00 MB  MERGE FINALIZE
08:AGGREGATE               3   16.659ms   27.444ms   52.52K     600.12K   3.20 MB       15.11 MB  MERGE
07:EXCHANGE                3    2.644ms      5.1ms   52.52K     600.12K         0              0  HASH(o_orderpriority)
03:AGGREGATE               3  342.913ms  966.291ms   52.52K     600.12K  10.80 MB       15.11 MB
02:HASH JOIN               3    2s165ms    2s171ms  144.87K     600.12K  13.63 MB      941.01 KB  INNER JOIN, BROADCAST
|--06:EXCHANGE             3    8.296ms    8.692ms   57.22K      15.00K         0              0  BROADCAST
|  01:SCAN HDFS            2    1s412ms    1s978ms   57.22K      15.00K  24.21 MB      176.00 MB  tpch.orders o
00:SCAN HDFS               3    8s032ms    8s558ms    3.79M     600.12K  32.29 MB      264.00 MB  tpch.lineitem l

Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-06-11 03:10:11 -07:00
Srinath Shankar
5755b0bdee Order by without limit for Impala
Enable order-by without limit
Added BufferedBlockMgr to allocate buffers and spill to disk.
Added Sorter for the external sort implementation
Added new SortNode execution node that completely sorts its input
Changes to enable writing in IoMgr went in a separate patch.

Reviewed-on: http://gerrit.ent.cloudera.com:8080/1539
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins

Conflicts:

	testdata/workloads/functional-planner/queries/PlannerTest/tpcds-all.test

Change-Id: I3ece32affe5b006f53bbdfcc03ded01471e818ac
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2900
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
2014-06-09 16:58:08 -07:00
Henry Robinson
60cbe1b0e1 IMPALA-741: Support partitions with non-existent HDFS locations
If a partition had a location that did not exist in HDFS, Impala would
refuse to load its metadata. This meant a typo could render a table
unloadable. We fix this problem by removing the existence check from the
frontend, and by inheriting access from the first extant parent of the
partition directory.

Fixing this exposed a second issue, where Impala wouldn't create
directories for partitions in the right place after an INSERT if the
partition location had been changed. To get this right we have to plumb
the partition ID through to Coordinator::FinalizeSuccessfulInsert(), so
that the coordinator can look up the partition's location from the
query-wide descriptor table. As a by-product, this patch rationalises
the per-partition, per-fragment statistics gathering a little bit by
putting almost all the per-partition stats into TInsertPartitionStatus.

Change-Id: I9ee0a1a1ef62cf28f55be3249e8142c362083163
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2851
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
2014-06-08 18:44:45 -07:00
Skye Wanderman-Milne
a618d34f17 More decimal builtins.
Change-Id: Ie5b89ad7d1fc80fa646f7cf5f520db13b25b9565
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2764
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 6e994ce7712047000d3a12b5eb677b5470687370)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2830
2014-06-06 19:42:45 -07:00
Nong Li
8f4dc0f2f0 IMPALA-974: Switch from FloatLiteral to DecimalLiteral.
Float/Doubles are lossy so using those as the default literal type
is problematic.

Change-Id: I5a619dd931d576e2e6cd7774139e9bafb9452db9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2758
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-31 22:19:06 -07:00
Nong Li
26ca559f38 Add decimal builtins: abs/round/ceil/floor/truncate.
Change-Id: I4fe0ee69475ff56d3dc0cd69ea21f677714ae8bc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2748
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-30 11:53:06 -07:00
Lenni Kuff
c45e9a70d9 [CDH5] Add DDL support for HDFS caching
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED

When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored within the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.

When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.

It is desirable to know which cache pools exist early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).

Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from accessing the table during this period, we wait for the cache
requests to complete in the background and once they have finished the table metadata
will be automatically refreshed.

Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-27 16:47:15 -07:00
Dimitris Tsirogiannis
ca86e470de IMPALA-887: Improve partition pruning time
This commit is the first step in improving the performance of partition
pruning. Currently, Impala can prune approximately 10K partitions per
sec, thereby introducing significant overhead for huge tables with a
large number of partitions. With this commit we reduce that overhead by
3X by batching the partition pruning calls to the backend.

Change-Id: I3303bfc7fb6fe014790f58a5263adeea94d0fe7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2608
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2687
2014-05-26 13:10:12 -07:00
Nong Li
723f583b4d Allow adding predicates after processing build table.
Change-Id: I4c845d9f08f0be29e548eceac3912871acd0270f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2658
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-22 13:09:51 -07:00
Lenni Kuff
83e239723f Add TRole/TPrivilege structs to Thrift CatalogObjects
These are used as our internal representation of the authorization policy metadata
(as opposed to directly using the Sentry Thrift structs). Versioned/managed in the
same way as other TCatalogObjects.

Change-Id: Ia1ed9bd4e25e9072849edebcae7c2d3a7aed660d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2545
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c89431775fcca19cdbeddba635b83fd121d39b04)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2646
2014-05-21 15:51:24 -07:00
Henry Robinson
e87c0eb22a [CDH5] Detect pseudo-distributed Llama cluster
Since we're no longer using the MiniLlama, we need to explicitly set
whether or not the cluster is pseudo-distributed. Impala needs this
information to correctly translate datanode addresses to a format that
Llama understands.

This change (adapted from one made by Casey) adds a method to the
frontend (callable via JNI) to get a configuration value from the Hadoop
configuration. We'll set that configuration value for local RM testing.

Change-Id: Ifd51db98a993ac0270dac2b832babbc394483c1a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2549
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-05-20 21:24:33 -07:00
Matthew Jacobs
f9c9a7ca13 Add SHOW DATA SOURCES
Change-Id: Ieeb0df107f45a58b8a99f717e96453da93ee7270
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2529
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit b2392c5bfe9fc928ad19af6ff6737e6dc6324e63)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2614
2014-05-19 17:52:27 -07:00
Victor Bittorf
0bb66ef327 Adding aliases ADD_MONTHS and SUB_MONTHS
This is a request for consistency with Oracle.

Change-Id: I463a66694a068cd773532d8f6f853a4b089b918a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2400
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 1f0b643789596f96c54580b8c5262fada4dfc958)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2502
2014-05-09 17:35:29 -07:00
Matthew Jacobs
fb49706ec8 Add additional types to TColumnValue and fix field names
Adds 8 and 16 byte integer values and a binary value to TColumnValue
and fixes the field names.

Change-Id: Ie318fe7dad43b0cc0032b65b6b04c3fe173ae9b8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2418
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 68c476822402d27d985ed78fa5d14a843b681082)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2493
2014-05-08 17:38:54 -07:00
Matthew Jacobs
ebc6c5894e External Data Source: Frontend and catalog changes
Initial frontend and catalog changes for external data sources.

Change-Id: Ia0e61ef97cfd7a4e138ef555c17f2e45bbf08c18
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2224
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit dfa14c828957f751db9c89bae0bdc040ce6f648c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2485
2014-05-08 14:56:19 -07:00
Henry Robinson
38befd2126 IMPALA-724: Support infinite / nan values in text files
This patch allows the text scanner to read 'inf' or 'Infinity' from a
row and correctly translate it into floating-point infinity. It also
adds is_inf() and is_nan() builtins.

Finally, we change the text table writer to write Infinity and NaN for
compatibility with Hive.
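
A rough Python sketch of the read and write behaviour described above follows. The function names are hypothetical, and the case-insensitive matching, the handling of a leading sign, and the "-Infinity" spelling for negative infinity are assumptions beyond what the message states.

```python
import math


def parse_float_field(field: str) -> float:
    """Hypothetical text-scanner helper: accepts 'inf', 'Infinity', 'nan'."""
    token = field.strip().lower()  # case-insensitivity is an assumption
    sign = 1.0
    if token.startswith("-"):
        sign, token = -1.0, token[1:]
    elif token.startswith("+"):
        token = token[1:]
    if token in ("inf", "infinity"):
        return sign * math.inf
    if token == "nan":
        return math.nan
    return float(field)


def format_float_field(value: float) -> str:
    """Hypothetical text-writer helper: emits 'Infinity' and 'NaN'."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        # '-Infinity' for the negative case is an assumption.
        return "Infinity" if value > 0 else "-Infinity"
    return repr(value)
```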

In the future, we might consider adding nan / inf literals to our
grammar (postgres has this, see:
http://www.postgresql.org/docs/9.3/static/datatype-numeric.html).

Change-Id: I796f2852b3c6c3b72e9aae9dd5ad228d188a6ea3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2393
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 58091355142cadd2b74874d9aa7c8ab6bf3efe2f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2483
2014-05-08 12:28:53 -07:00
Matthew Jacobs
61b36a42bd External Data Source: Few small API changes
* Rename getStats() to prepare()
* Adds TRowBatch.num_rows to indicate number of rows when no cols are
  materialized
* Changes api and sample poms to produce source jars

Change-Id: I02dcc89e27716978708386cfc3f7940ee5dbc023
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2406
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 2d7fcba8b7442b54a388f8b994d0cfa08940bbd7)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2434
2014-05-02 17:10:25 -07:00
Nong Li
03e5665e56 Decimal: Read/Write to parquet.
This adds support for the FIXED_LENGTH_BYTE_ARRAY parquet type and
encoding for decimals.

Change-Id: I9d5780feb4530989b568ec8d168cbdc32b7039bd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1727
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2432
2014-05-02 16:38:35 -07:00
Henry Robinson
35d986cc8d Add max / min_*int() builtins
It would have been convenient today to know the largest values that
Impala accepts for its integer types. This patch adds max and min
builtins for our numeric types.

[localhost:21000] > select max_bigint(), max_int(), max_smallint(),
max_tinyint();
Query: select max_bigint(), max_int(), max_smallint(), max_tinyint()
+---------------------+------------+----------------+---------------+
| max_bigint()        | max_int()  | max_smallint() | max_tinyint() |
+---------------------+------------+----------------+---------------+
| 9223372036854775807 | 2147483647 | 32767          | 127           |
+---------------------+------------+----------------+---------------+
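
The values in the table are the usual two's-complement maxima for 8-, 4-, 2-, and 1-byte signed integers. A short Python check of that arithmetic (the helper name `signed_int_range` is made up for this example and is not an Impala function):

```python
def signed_int_range(num_bytes: int) -> tuple:
    """Return (min, max) of a two's-complement signed integer of this width."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Storage widths in bytes of Impala's integer types.
INT_WIDTHS = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}

for name, width in INT_WIDTHS.items():
    lo, hi = signed_int_range(width)
    print(f"min_{name}() = {lo}, max_{name}() = {hi}")
```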

Change-Id: I6df6df2728197529c6375dbb1b7d3c9ddb9833d2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2381
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2398
2014-04-29 16:54:31 -07:00
Matthew Jacobs
1f07f2d7ee External Data Source: Thrift structure changes
A few changes to the external data source thrift types:
* Change RowBatch to return entire columns. Adds Data.TColumnData to
  represent an entire column.
* Makes all fields in ExternalDataSource (except for status fields on
  the result structures) optional in case fields become deprecated in
  the future.
* Adds a limit parameter to the TOpenParams structure in case the
  data source needs to apply the limit itself.

Change-Id: I62db68bfb64d2190dfdd0c84be5925ad5db031ef
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2345
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
(cherry picked from commit faf220d628359be1368f898493900fc2e2913c53)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2385
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2014-04-27 12:57:13 -07:00
Victor Bittorf
46151dc7dd Adding EXTRACT builtin.
Change-Id: I6de20f336ecdfa3acd8d3a9166cff4a062baaacc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2247
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f233955020ffbd1023f2d6adbbfb22e267986305)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2370
2014-04-25 15:38:51 -07:00
Matthew Jacobs
25c0ebf58c External Data Source: Public API
Adds the thrift structures for the public external data source API
and a new maven project containing the Java ExternalDataSource
interface and the generated Java thrift classes.

The ExternalDataSource.thrift structures can evolve in a backward
compatible way. The ExternalDataSource Java interface will always
contain a version number in the namespace (e.g.
com.cloudera.impala.extdatasource.v1 for V1) so we can potentially
make breaking changes to the interface in the future but still
support older versions.

A trivial implementation of the ExternalDataSource API is also
added for testing purposes.
TODO: Make the sample data source implementation realistic.

Change-Id: I827d6420a87ed7a2bce34e050362ca98ddc5dbcc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2241
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f29814e9ede9d4c889f2648606fcf511feeb47ae)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2313
2014-04-22 18:34:48 -07:00
Victor Bittorf
c414c91931 Adding TRUNC builtin.
Includes additions to builtin UDF registration to support prepare/close.

Change-Id: I22668fa7ee033b3fa37050b7bccee935571ac453
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2243
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-04-22 13:17:12 -07:00
Nong Li
1cab95066d Add the return type as a column for SHOW FUNCTIONS.
Also includes some misc pattern matching cleanup.

Change-Id: I6c9ec78b094a73864b4d669afbd75a48c9bf9585
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2199
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2271
2014-04-17 17:58:13 -07:00
Matthew Jacobs
d0c353a9b4 IMPALA-922: Return helpful errors with Yarn group rules
When the -fair_scheduler_allocation_path is configured with a policy that uses
the "primaryGroup" Yarn queue allocation rule, Yarn throws an error if the user
is not on the local OS. Currently the user will get an error message that says:
"java.io.IOException: No groups found for user <username>". We now return a more
helpful error message.

Change-Id: I014ac15ef607e473957752f23af94d0cc4efec0f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2078
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 3cf37dc4e91afe887ada988f256b7008983580d2)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2244
2014-04-15 15:32:05 -07:00
Nong Li
87295a4e06 Decimal implementation.
This patch implements decimal support for text based formats.

Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238
Tested-by: jenkins
2014-04-14 21:07:32 -07:00
Henry Robinson
99c37aac37 IMPALA-827: Add an option for directories created by INSERT to inherit
their parent's permissions

This patch adds --insert_inherit_permissions. If true, all
new partition directories created by INSERT will inherit their
permissions from their parent. When false, the directories are created
with the default permissions.
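
As a local-filesystem analogy only (the real change operates on HDFS directories), the behaviour of the flag might be pictured like this in Python. The function `create_partition_dir` and its parameter are hypothetical.

```python
import os
import stat


def create_partition_dir(path: str, inherit_permissions: bool) -> None:
    """Create a directory, optionally copying the parent's permission bits."""
    os.mkdir(path)  # created with the default permissions (subject to umask)
    if inherit_permissions:
        parent = os.path.dirname(os.path.abspath(path))
        parent_mode = stat.S_IMODE(os.stat(parent).st_mode)
        os.chmod(path, parent_mode)
```

With `inherit_permissions=False` the directory keeps whatever the default creation mode yields, which mirrors the "false" case in the message.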

Change-Id: Ib2b4c251e51ea5048387169678e8dde34ecfe5f6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1917
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2014-04-04 10:25:20 -07:00
Lenni Kuff
fd174a5e69 [CDH5] Remove duplication of network addresses in HdfsTable (updated for HDFS caching)
This is a port of 0b9134a from CDH4, but required some adjustments to work on CDH5 due to
the HDFS caching work. The differences between CDH4 and CDH5 are mainly in
HdfsTable/HdfsPartition. I added a new BlockReplica class to represent a single block
with info on the host index + caching info.

This removes duplication of TNetworkAddresses in the block location metadata of HdfsTable.
Each HdfsTable now contains a list of TNetworkAddress and the BlockLocations
just reference an index in this list to specify the host, rather than duplicating the
TNetworkAddress.

For a table with 100K blocks, this reduces the size of the THdfsTable struct by an additional
~50+% (on top of the duplicate file path changes). This takes the total size of the table
from:
21.1MB -> 9.4MB (file path duplication) -> 4.2MB (host duplication) = ~80% total improvement.
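
The index-based representation described above can be illustrated with a small Python sketch. The function name and the use of (hostname, port) tuples in place of TNetworkAddress are assumptions for the example; the real structures are Thrift and Java types.

```python
from typing import Dict, List, Tuple

Address = Tuple[str, int]  # stand-in for TNetworkAddress


def dedupe_hosts(block_hosts: List[List[Address]]) -> Tuple[List[Address], List[List[int]]]:
    """Replace per-block address copies with indexes into one shared host list."""
    host_list: List[Address] = []
    host_index: Dict[Address, int] = {}
    indexed_blocks: List[List[int]] = []
    for replicas in block_hosts:
        indexes = []
        for addr in replicas:
            if addr not in host_index:
                host_index[addr] = len(host_list)
                host_list.append(addr)
            indexes.append(host_index[addr])
        indexed_blocks.append(indexes)
    return host_list, indexed_blocks
```

Each distinct address is stored once, and every block replica carries only a small integer. The quoted total is consistent with the sizes given: (21.1 - 4.2) / 21.1 is roughly 0.80.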

Change-Id: If7f11764dc0961376f9648779d253829f4cd83a2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1367
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1887
Reviewed-by: Nong Li <nong@cloudera.com>
2014-03-14 14:30:08 -07:00
Alex Behm
7fcd7cd64e Add list of tables missing stats to explain header and mem-limit exceeded error.
Change-Id: Ibe8f329d5513ae84a8134b9ddb3645fa174d8a66
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1501
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1880
2014-03-12 21:15:22 -07:00
Matthew Jacobs
e817c3742c Admission controller: fix a number of TODOs
* Remove requirement that fair scheduler and Llama conf files be on the classpath if
  specified as relative paths. Now they can be specified as any relative or absolute
  path.
* Add flags to disable all per-pool max requests limits or mem limits.
* Rename RequestPoolUtils to RequestPoolService
* Make it more clear RequestPoolService is a singleton by putting it in ExecEnv
* FileWatchService: use Executors.newScheduledThreadPool instead of a thread
* Moved MEGABYTE (and related constants) to new Constants class (frontend)
* Test RequestPoolService: Removed AllocationFileLoaderServiceHelper, replaced with
  reflection

Change-Id: Iadf79cf77a7894a469c3587d0019a6d0bee7e58f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1787
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit b9a167f6fdb4ab2595aca6035e1f9d926b909d94)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1858
2014-03-12 14:23:54 -07:00
Srinath Shankar
23d394c572 IMPALA-1238: Impala supports length() but not char_length()
Adds aliases char_length and character_length for the length()
 builtin. These are valid for ASCII.

Change-Id: I934d997f2c6d372ed12e7221efc1a574d68e01f3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1802
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1827
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
2014-03-08 16:53:49 -08:00