301 Commits

Author SHA1 Message Date
Skye Wanderman-Milne
a618d34f17 More decimal builtins.
Change-Id: Ie5b89ad7d1fc80fa646f7cf5f520db13b25b9565
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2764
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 6e994ce7712047000d3a12b5eb677b5470687370)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2830
2014-06-06 19:42:45 -07:00
Nong Li
8f4dc0f2f0 IMPALA-974: Switch from FloatLiteral to DecimalLiteral.
Float/Doubles are lossy so using those as the default literal type
is problematic.

Change-Id: I5a619dd931d576e2e6cd7774139e9bafb9452db9
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2758
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-31 22:19:06 -07:00
Nong Li
26ca559f38 Add decimal builtins: abs/round/ceil/floor/truncate.
Change-Id: I4fe0ee69475ff56d3dc0cd69ea21f677714ae8bc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2748
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-30 11:53:06 -07:00
Lenni Kuff
c45e9a70d9 [CDH5] Add DDL support for HDFS caching
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED

When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored within the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.

When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.

It is desirable to know which cache pools exist early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).

Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from accessing the table during this period we will wait for the cache
requests to complete in the background and once they have finished the table metadata
will be automatically refreshed.
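
The bookkeeping described above can be sketched as follows. This is a minimal in-memory model, not Impala's catalog code: only the 'cache_directive_id' property key comes from the message, while the function names and the dict-based table representation are hypothetical.

```python
# Hypothetical model: tables and partitions are plain dicts of properties.
CACHE_DIRECTIVE_KEY = "cache_directive_id"  # property key quoted in the commit message


def mark_cached(properties, request_id):
    """Record the HDFS cache request ID in a table's or partition's properties."""
    properties[CACHE_DIRECTIVE_KEY] = str(request_id)


def directives_to_drop(table_properties, partition_properties):
    """Collect every cache request ID that must be dropped along with the table."""
    ids = []
    for props in [table_properties, *partition_properties]:
        if CACHE_DIRECTIVE_KEY in props:
            ids.append(int(props[CACHE_DIRECTIVE_KEY]))
    return ids


table = {}
partitions = [{}, {}, {}]
mark_cached(partitions[0], 101)
mark_cached(partitions[2], 103)
to_drop = directives_to_drop(table, partitions)
```

Dropping the table in this model means uncaching requests 101 and 103; the uncached partition contributes nothing.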

Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-27 16:47:15 -07:00
Dimitris Tsirogiannis
ca86e470de IMPALA-887: Improve partition pruning time
This commit is the first step in improving the performance of partition
pruning. Currently, Impala can prune approximately 10K partitions per
sec, thereby introducing significant overhead for huge tables with a
large number of partitions. With this commit we reduce that overhead by
3X by batching the partition pruning calls to the backend.
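
A rough sketch of the batching idea, assuming a backend that can evaluate a predicate over many partitions in one call. FakeBackend, eval_batch and the batch size are illustrative stand-ins, not Impala's actual frontend/backend interface.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


class FakeBackend:
    """Stand-in for the backend; counts how many calls cross the boundary."""

    def __init__(self):
        self.calls = 0

    def eval_batch(self, predicate, partition_values):
        self.calls += 1
        return [predicate(value) for value in partition_values]


def prune(backend, predicate, partitions, batch_size):
    """Keep the partitions that satisfy the predicate, one backend call per batch."""
    kept = []
    for batch in chunked(partitions, batch_size):
        results = backend.eval_batch(predicate, batch)
        kept.extend(p for p, keep in zip(batch, results) if keep)
    return kept


backend = FakeBackend()
partitions = list(range(10_000))
kept = prune(backend, lambda year: year >= 9_990, partitions, batch_size=1_000)
```

With 10,000 partitions and a batch size of 1,000, the backend is called 10 times instead of 10,000.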

Change-Id: I3303bfc7fb6fe014790f58a5263adeea94d0fe7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2608
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2687
2014-05-26 13:10:12 -07:00
Nong Li
723f583b4d Allow adding predicates after processing build table.
Change-Id: I4c845d9f08f0be29e548eceac3912871acd0270f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2658
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-05-22 13:09:51 -07:00
Lenni Kuff
83e239723f Add TRole/TPrivilege structs to Thrift CatalogObjects
These are used as our internal representation of the authorization policy metadata
(as opposed to directly using the Sentry Thrift structs). Versioned/managed in the
same way as other TCatalogObjects.

Change-Id: Ia1ed9bd4e25e9072849edebcae7c2d3a7aed660d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2545
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c89431775fcca19cdbeddba635b83fd121d39b04)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2646
2014-05-21 15:51:24 -07:00
Henry Robinson
e87c0eb22a [CDH5] Detect pseudo-distributed Llama cluster
Since we're no longer using the MiniLlama, we need to explicitly set
whether or not the cluster is pseudo-distributed. Impala needs this
information to correctly translate datanode addresses to a format that
Llama understands.

This change (adapted from one made by Casey) adds a method to the
frontend (callable via JNI) to get a configuration value from the Hadoop
configuration. We'll set that configuration value for local RM testing.

Change-Id: Ifd51db98a993ac0270dac2b832babbc394483c1a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2549
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-05-20 21:24:33 -07:00
Matthew Jacobs
f9c9a7ca13 Add SHOW DATA SOURCES
Change-Id: Ieeb0df107f45a58b8a99f717e96453da93ee7270
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2529
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit b2392c5bfe9fc928ad19af6ff6737e6dc6324e63)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2614
2014-05-19 17:52:27 -07:00
Victor Bittorf
0bb66ef327 Adding aliases ADD_MONTHS and SUB_MONTHS
This is a request for consistency with Oracle.

Change-Id: I463a66694a068cd773532d8f6f853a4b089b918a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2400
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 1f0b643789596f96c54580b8c5262fada4dfc958)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2502
2014-05-09 17:35:29 -07:00
Matthew Jacobs
fb49706ec8 Add additional types to TColumnValue and fix field names
Adds 8 and 16 byte integer values and a binary value to TColumnValue
and fixes the field names.

Change-Id: Ie318fe7dad43b0cc0032b65b6b04c3fe173ae9b8
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2418
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 68c476822402d27d985ed78fa5d14a843b681082)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2493
2014-05-08 17:38:54 -07:00
Matthew Jacobs
ebc6c5894e External Data Source: Frontend and catalog changes
Initial frontend and catalog changes for external data sources.

Change-Id: Ia0e61ef97cfd7a4e138ef555c17f2e45bbf08c18
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2224
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit dfa14c828957f751db9c89bae0bdc040ce6f648c)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2485
2014-05-08 14:56:19 -07:00
Henry Robinson
38befd2126 IMPALA-724: Support infinite / nan values in text files
This patch allows the text scanner to read 'inf' or 'Infinity' from a
row and correctly translate it into floating-point infinity. It also
adds is_inf() and is_nan() builtins.

Finally, we change the text table writer to write Infinity and NaN for
compatibility with Hive.
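
A small sketch of the read and write behaviour described above. The message only names 'inf' and 'Infinity' as accepted input; case-insensitive matching, the negative forms and the 'nan' token are assumptions made here, and the Python function names merely mirror the builtins.

```python
import math


def parse_float_field(text):
    """Parse one text field, accepting infinity/NaN spellings (partly assumed)."""
    token = text.strip().lower()
    if token in ("inf", "infinity"):
        return math.inf
    if token in ("-inf", "-infinity"):  # assumption: negative forms are accepted
        return -math.inf
    if token == "nan":                  # assumption: NaN is read back as written
        return math.nan
    return float(token)


def is_inf(value):
    return math.isinf(value)


def is_nan(value):
    return math.isnan(value)


def format_float_field(value):
    """Write Infinity/NaN spellings, as the message says the text writer does."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return repr(value)
```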

In the future, we might consider adding nan / inf literals to our
grammar (postgres has this, see:
http://www.postgresql.org/docs/9.3/static/datatype-numeric.html).

Change-Id: I796f2852b3c6c3b72e9aae9dd5ad228d188a6ea3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2393
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 58091355142cadd2b74874d9aa7c8ab6bf3efe2f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2483
2014-05-08 12:28:53 -07:00
Matthew Jacobs
61b36a42bd External Data Source: Few small API changes
* Rename getStats() to prepare()
* Adds TRowBatch.num_rows to indicate number of rows when no cols are
  materialized
* Changes api and sample poms to produce source jars

Change-Id: I02dcc89e27716978708386cfc3f7940ee5dbc023
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2406
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 2d7fcba8b7442b54a388f8b994d0cfa08940bbd7)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2434
2014-05-02 17:10:25 -07:00
Nong Li
03e5665e56 Decimal: Read/Write to parquet.
This adds support for the FIXED_LENGTH_BYTE_ARRAY parquet type and
encoding for decimals.

Change-Id: I9d5780feb4530989b568ec8d168cbdc32b7039bd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1727
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2432
2014-05-02 16:38:35 -07:00
Henry Robinson
35d986cc8d Add max / min_*int() builtins
It would have been convenient today to know the largest values that
Impala accepts for its integer types. This patch adds max and min
builtins for our numeric types.

[localhost:21000] > select max_bigint(), max_int(), max_smallint(),
max_tinyint();
Query: select max_bigint(), max_int(), max_smallint(), max_tinyint()
+---------------------+------------+----------------+---------------+
| max_bigint()        | max_int()  | max_smallint() | max_tinyint() |
+---------------------+------------+----------------+---------------+
| 9223372036854775807 | 2147483647 | 32767          | 127           |
+---------------------+------------+----------------+---------------+
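
The values in the table follow from two's-complement ranges. The sketch below assumes the usual widths of 8, 16, 32 and 64 bits for TINYINT, SMALLINT, INT and BIGINT; the helper name is illustrative.

```python
def signed_range(bits):
    """Return (min, max) for a two's-complement signed integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


# Assumed bit widths of the integer types.
WIDTHS = {"tinyint": 8, "smallint": 16, "int": 32, "bigint": 64}
limits = {name: signed_range(bits) for name, bits in WIDTHS.items()}
```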

Change-Id: I6df6df2728197529c6375dbb1b7d3c9ddb9833d2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2381
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2398
2014-04-29 16:54:31 -07:00
Matthew Jacobs
1f07f2d7ee External Data Source: Thrift structure changes
A few changes to the external data source thrift types:
* Change RowBatch to return entire columns. Adds Data.TColumnData to
  represent an entire column.
* Makes all fields in ExternalDataSource (except for status fields on
  the result structures) optional in case fields become deprecated in
  the future.
* Adds a limit parameter to the TOpenParams structure in case the
  data source needs to apply the limit itself.

Change-Id: I62db68bfb64d2190dfdd0c84be5925ad5db031ef
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2345
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
(cherry picked from commit faf220d628359be1368f898493900fc2e2913c53)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2385
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
2014-04-27 12:57:13 -07:00
Victor Bittorf
46151dc7dd Adding EXTRACT builtin.
Change-Id: I6de20f336ecdfa3acd8d3a9166cff4a062baaacc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2247
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f233955020ffbd1023f2d6adbbfb22e267986305)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2370
2014-04-25 15:38:51 -07:00
Matthew Jacobs
25c0ebf58c External Data Source: Public API
Adds the thrift structures for the public external data source API
and a new maven project containing the Java ExternalDataSource
interface and the generated Java thrift classes.

The ExternalDataSource.thrift structures can evolve in a backward
compatible way. The ExternalDataSource Java interface will always
contain a version number in the namespace (e.g.
com.cloudera.impala.extdatasource.v1 for V1) so we can potentially
make breaking changes to the interface in the future but still
support older versions.

A trivial implementation of the ExternalDataSource API is also
added for testing purposes.
TODO: Make the sample data source implementation realistic.

Change-Id: I827d6420a87ed7a2bce34e050362ca98ddc5dbcc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2241
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f29814e9ede9d4c889f2648606fcf511feeb47ae)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2313
2014-04-22 18:34:48 -07:00
Victor Bittorf
c414c91931 Adding TRUNC builtin.
Includes additions to builtin UDF registration to support prepare/close.

Change-Id: I22668fa7ee033b3fa37050b7bccee935571ac453
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2243
Reviewed-by: Victor Bittorf <victor.bittorf@cloudera.com>
Tested-by: jenkins
2014-04-22 13:17:12 -07:00
Nong Li
1cab95066d Add the return type as a column for SHOW FUNCTIONS.
Also includes some misc pattern matching cleanup.

Change-Id: I6c9ec78b094a73864b4d669afbd75a48c9bf9585
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2199
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2271
2014-04-17 17:58:13 -07:00
Matthew Jacobs
d0c353a9b4 IMPALA-922: Return helpful errors with Yarn group rules
When the -fair_scheduler_allocation_path is configured with a policy that uses
the "primaryGroup" Yarn queue allocation rule, Yarn throws an error if the user
is not on the local OS. Currently the user will get an error message that says:
"java.io.IOException: No groups found for user <username>". We now return a more
helpful error message.

Change-Id: I014ac15ef607e473957752f23af94d0cc4efec0f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2078
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 3cf37dc4e91afe887ada988f256b7008983580d2)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2244
2014-04-15 15:32:05 -07:00
Nong Li
87295a4e06 Decimal implementation.
This patch implements decimal support for text based formats.

Change-Id: I8e2c9e512ed149fe965216a72cb21fffd4f18e75
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1669
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2238
Tested-by: jenkins
2014-04-14 21:07:32 -07:00
Henry Robinson
99c37aac37 IMPALA-827: Add an option for directories created by INSERT to inherit
their parent's permissions

This patch adds --insert_inherit_permissions. If true, all
new partition directories created by INSERT will inherit their
permissions from their parent. When false, the directories are created
with the default permissions.
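
The behaviour of the flag can be illustrated on a local POSIX filesystem standing in for HDFS. The function name and the boolean parameter below are hypothetical; only the inherit-from-parent semantics come from the message.

```python
import os
import stat
import tempfile


def make_partition_dir(parent, name, inherit_permissions):
    """Create parent/name; optionally copy the parent's permission bits onto it."""
    path = os.path.join(parent, name)
    os.mkdir(path)  # created with the default permissions (subject to umask)
    if inherit_permissions:
        parent_mode = stat.S_IMODE(os.stat(parent).st_mode)
        os.chmod(path, parent_mode)
    return path
```

For example, calling make_partition_dir(root, "year=2014", True) under a parent with mode 0o750 leaves the new directory with mode 0o750 as well.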

Change-Id: Ib2b4c251e51ea5048387169678e8dde34ecfe5f6
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1917
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2014-04-04 10:25:20 -07:00
Lenni Kuff
fd174a5e69 [CDH5] Remove duplication of network addresses in HdfsTable (updated for HDFS caching)
This is a port of 0b9134a from CDH4, but required some adjustments to work on CDH5 due to
the HDFS caching work. The differences between CDH4 and CDH5 are mainly in
HdfsTable/HdfsPartition. I added a new BlockReplica class to represent a single block
with info on the host index + caching info.

This removes duplication of TNetworkAddresses in the block location metadata of HdfsTable.
Each HdfsTable now contains a list of TNetworkAddress and the BlockLocations
just reference an index in this list to specify the host, rather than duplicating the
TNetworkAddress.

For a table with 100K blocks, this reduces the size of the THdfsTable struct by an additional
~50+% (on top of the duplicate file path changes). This takes the total size of the table
from:
21.1MB -> 9.4MB (file path duplication) -> 4.2MB (host duplication) = ~80% total improvement.
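
The index-based representation can be sketched like this. Tuples stand in for TNetworkAddress and the function name is hypothetical; the last line rechecks the ~80% figure from the sizes quoted above.

```python
def intern_hosts(block_hosts):
    """Replace each replica's host with an index into one shared host list."""
    host_list = []
    index_of = {}
    indexed_blocks = []
    for replicas in block_hosts:
        indexes = []
        for host in replicas:
            if host not in index_of:
                index_of[host] = len(host_list)
                host_list.append(host)
            indexes.append(index_of[host])
        indexed_blocks.append(indexes)
    return host_list, indexed_blocks


# (hostname, port) tuples stand in for TNetworkAddress.
blocks = [
    [("dn1", 50010), ("dn2", 50010)],
    [("dn2", 50010), ("dn3", 50010)],
    [("dn1", 50010), ("dn3", 50010)],
]
hosts, indexed = intern_hosts(blocks)

# Total reduction from the sizes in the message: 21.1MB down to 4.2MB.
total_reduction = (21.1 - 4.2) / 21.1
```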

Change-Id: If7f11764dc0961376f9648779d253829f4cd83a2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1367
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1887
Reviewed-by: Nong Li <nong@cloudera.com>
2014-03-14 14:30:08 -07:00
Alex Behm
7fcd7cd64e Add list of tables missing stats to explain header and mem-limit exceeded error.
Change-Id: Ibe8f329d5513ae84a8134b9ddb3645fa174d8a66
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1501
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1880
2014-03-12 21:15:22 -07:00
Matthew Jacobs
e817c3742c Admission controller: fix a number of TODOs
* Remove requirement that fair scheduler and Llama conf files be on the classpath if
  specified as relative paths. Now they can be specified as any relative or absolute
  path.
* Add flags to disable all per-pool max requests limits or mem limits.
* Rename RequestPoolUtils to RequestPoolService
* Make it more clear RequestPoolService is a singleton by putting it in ExecEnv
* FileWatchService: use Executors.newScheduledThreadPool instead of a thread
* Moved MEGABYTE (and related constants) to new Constants class (frontend)
* Test RequestPoolService: Removed AllocationFileLoaderServiceHelper, replaced with
  reflection

Change-Id: Iadf79cf77a7894a469c3587d0019a6d0bee7e58f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1787
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit b9a167f6fdb4ab2595aca6035e1f9d926b909d94)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1858
2014-03-12 14:23:54 -07:00
Srinath Shankar
23d394c572 IMPALA-1238: Impala supports length() but not char_length()
Adds aliases char_length and character_length for the length()
builtin. These are valid for ASCII.

Change-Id: I934d997f2c6d372ed12e7221efc1a574d68e01f3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1802
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1827
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
2014-03-08 16:53:49 -08:00
Srinath Shankar
985d90146e IMPALA-856:NULLIFZERO and ZEROIFNULL
Both nullifzero and zeroifnull apply only to numeric types.
nullifzero(arg) - Return NULL if arg == 0, arg otherwise
zeroifnull(arg) - Return 0 if arg is NULL, arg otherwise
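
The two definitions above, written out as a sketch with Python's None standing in for SQL NULL:

```python
def nullifzero(arg):
    """Return None (NULL) if arg == 0, arg otherwise."""
    return None if arg == 0 else arg


def zeroifnull(arg):
    """Return 0 if arg is None (NULL), arg otherwise."""
    return 0 if arg is None else arg
```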

Change-Id: I41260de1edca2f9fcf50594fe137ca1f68f76056
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1805
Reviewed-by: Srinath Shankar <sshankar@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1826
2014-03-08 13:09:33 -08:00
Henry Robinson
da1c7d37ff Add memory and VCPU expansion to RM-enabled queries
* Each node has one QueryResourceMgr per query it is running fragments
  for. A QueryResourceMgr handles creating expansion RPC requests, and
  monitoring the thread:VCPU ratio for each query (and requesting more
  VCPUs from YARN if oversubscribed).
* MemTrackers now have an ExpandLimit API which does nothing unless they
  have a QueryResourceMgr. This method blocks for now, but when the IO
  manager changes its API to use TryConsume(), we'll need to issue these
  asynchronously to avoid keeping hold of a thread.
* ResourceBroker etc. got updated to support the Expansion API.

Change-Id: Ia3c4635497f0563cfc5cd0e330e5f1f586577200
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1800
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
2014-03-07 08:58:05 -08:00
Matthew Jacobs
989830186f Remove RM pool configuration and yarn_pool query option/profile property
Admission control adds support for configuring pools via a fair scheduler
allocation configuration, so the pool configuration mechanism is no longer
needed. This also renames the "yarn_pool" query option to the more general
"request_pool" as it can also be used to configure the admission controller
when RM/Yarn is not used. Similarly, the query profile shows the pool as
"Request Pool" rather than "Yarn Pool".

Change-Id: Id2cefb77ccec000e8df954532399d27eb18a2309
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1668
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8d59416fb519ec357f23b5267949fd9682c9d62f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1759
2014-03-06 14:46:09 -08:00
Matthew Jacobs
41d90312fa Admission controller: user to pool resolution, authorization, and pool configs
Adds RequestPoolUtils which exposes user to pool resolution, authorization,
and relevant pool configurations by wrapping Yarn classes that provide that
functionality. (To support CDH4, those Yarn classes will come from
thirdparty/cdh4-extras.) RequestPoolUtils is created once by the backend and
the instance lives for the duration of the process.

Change-Id: I53db075555578614356d33f9d939c5378b9ec797
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1566
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8e385bdb54ed97e567c672a76723936c24cfe45f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1758
2014-03-06 14:21:31 -08:00
Skye Wanderman-Milne
6ceed1e632 UDF API additions
This patch introduces the ability to specify a prepare and close
function for a UDF, as well as FunctionContext methods for maintaining
state across UDF invocations within a query. Many of the changes are
related to adding an Expr::Open() function which calls the UDF's
prepare function, if specified (it has to be called in Open() since
the LLVM module must be compiled first).

Change-Id: I581d90d03dff71f7ff5d4a6bef839ba6bc46b443
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1693
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 8e2ed7fb9051d98f89327715fdebd6f5ed22d6ee)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1757
2014-03-05 07:32:34 -08:00
Nong Li
f0a67153d3 Decimal analysis changes.
Change-Id: Ib7d6a6a7650cc9058ff1486fc7546ab66c698d46
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1734
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-03-03 21:15:00 -08:00
Matthew Jacobs
b879b4c2e4 Admission controller: Separate TPoolStats mem_usage and mem_estimate
Change-Id: I521de3a99faca3aaf10e3900a4a12b0d2fa7a0f3
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1704
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit b8fa9c0bf7b555d36180be42c89cd4d7f6b8ec7b)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1737
2014-03-03 19:44:51 -08:00
Nong Li
309ab4df0d Update backend to support hdfs caching.
Change-Id: I22761c8893c8fd222564d4e2a97bfba1284cd741
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1724
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-03-02 00:36:33 -08:00
Lenni Kuff
d6cbd3dc44 Ensure db/table names are always case insensitive in catalog topic entry keys
This fixes a bug that can happen with 'invalidate metadata <table name>' if the following
sequence of events happens:

1) Table is created in Impala (table names are always treated as lower case)
2) Table is dropped and re-created in Hive, using the same name but different casing
3) invalidate metadata <table name> is run in Impala, which will update the existing
   table with the version from the Hive metastore.

When building the next statestore update, the catalog server will send an update out
thinking that the table from 1) was dropped and the table from 3) was added because
the topic entry key is case sensitive. This may incorrectly remove the table from
an impalad's catalog. The fix is to always treat db/table names as case insensitive.
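
The failure mode and the fix can be sketched with plain string keys. The "TABLE:db.table" key format and the helper names are hypothetical; the point is only that keys are lowercased before the delta is computed.

```python
def topic_entry_key(db, table):
    """Build a topic entry key that ignores the casing of db and table names."""
    return f"TABLE:{db.lower()}.{table.lower()}"


def dropped_keys(previous, current):
    """Keys present in the previous update but absent from the current one."""
    return set(previous) - set(current)


# Table created in Impala as 'orders', then re-created in Hive as 'Orders'.
before = {topic_entry_key("sales", "orders")}
after = {topic_entry_key("sales", "Orders")}
dropped = dropped_keys(before, after)

# With case-sensitive keys the same sequence looks like a drop plus an add.
case_sensitive_dropped = dropped_keys({"TABLE:sales.orders"}, {"TABLE:sales.Orders"})
```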

Change-Id: Ib59edc403989781bf12e0405c0ccd37b8e41ee41
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1634
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1637
2014-02-23 00:20:16 -08:00
Matthew Jacobs
af84be67dd Admission controller: add memory limits in addition to number of requests
Adds the ability to set per-pool memory limits. Each impalad tracks the memory
used by queries in each pool; a per-pool memory tracker is added between the
per-query trackers and the process memory tracker. The current memory usage
is disseminated via statestore heartbeats (along with the other per-pool stats)
and a cluster-wide estimate of the pool memory usage is updated when topic
updates are received. The admission controller will not admit incoming
requests if the total memory usage is already over the configured pool limit.
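
A minimal sketch of the memory check, assuming each impalad reports its per-pool usage in bytes. The data layout and function names are illustrative, and treating usage exactly equal to the limit as admissible is an assumption.

```python
def cluster_pool_mem(mem_by_impalad, pool):
    """Estimate cluster-wide pool usage by summing the per-impalad reports."""
    return sum(usage.get(pool, 0) for usage in mem_by_impalad.values())


def can_admit(mem_by_impalad, pool, pool_mem_limit):
    """Refuse new requests once the pool's estimated usage is over its limit."""
    return cluster_pool_mem(mem_by_impalad, pool) <= pool_mem_limit


GB = 1024 ** 3
mem_by_impalad = {
    "impalad-1": {"etl": 6 * GB},
    "impalad-2": {"etl": 5 * GB},
}
```

Here the "etl" pool is estimated at 11 GB, so a 16 GB limit still admits requests while a 10 GB limit does not.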

Change-Id: Ie9bc82d99643352ba77fb91b6c25b42938b1f745
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1508
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 64a137930a318e56a7090a317e6aa5df67ea72cd)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1623
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
2014-02-20 14:19:34 -08:00
Henry Robinson
bb028a54b7 IMPALA-809: Concurrently received statestore heartbeats are no longer an error
This patch fixes a problem observed when a subscriber was processing a
heartbeat, and while doing so tried to re-register with the
statestore. The statestore would schedule a heartbeat for the new
registration, but the subscriber would return an error, thinking that it
was still re-registering (see UpdateState() for the try_lock logic that
gave rise to this error). The statestore, upon receiving the error,
would update its failure detector and eventually mark the subscriber as
failed, unnecessarily forcing a re-registration loop.

This only regularly happens when UpdateState() takes a long time,
i.e. when a subscriber callback takes a while. This patch also adds
metrics to measure the amount of time callbacks take.

Change-Id: I157cdfd550279a6942e7ca54fe622520c8ad5dcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1574
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
(cherry picked from commit bc0a8819e754623bc9e5e5ab805369ad8381e5b9)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1610
2014-02-19 17:34:37 -08:00
Nong Li
0d2919fe7f Refactor scalar and aggregate function analysis and execution.
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.

The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).

This removes the opcode registry and all of the functionality is subsumed by
the catalog, most of which was already duplicated there anyway.

This also introduces the concept of a system database: a database that the
user cannot modify and that is populated automatically on startup.

Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
2014-02-18 18:40:08 -08:00
Lenni Kuff
95404d4888 Support prioritized background table loading
The overall goal of this change is to allow table metadata to be loaded in the background
but also to allow prioritization of loading on an as-needed basis. As part of analysis,
any tables that are not loaded are tracked and if analysis fails the Impalad will make
an RPC to the CatalogServer to request that the metadata loading of these tables be
prioritized and analysis will be restarted.

To support this, the CatalogServer now has a deque of the tables to load. For
background loading, tables to load are added to the tail of the deque. However, a new
CatalogServer RPC was added that can prioritize the loading of one or more tables in
which case they will get added to the head of the deque. The next table to load is
always taken from the head. This helps prioritize loading but is admittedly not the most
fair approach.
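
The deque discipline described above can be sketched as follows. The class and method names are hypothetical, and removing an already-queued table before moving it to the head is an assumption rather than something the message states.

```python
from collections import deque


class TableLoadQueue:
    """Background loads join the tail; prioritized loads jump to the head."""

    def __init__(self):
        self._tables = deque()

    def add_background(self, table):
        self._tables.append(table)

    def prioritize(self, tables):
        # Walk in reverse so the requested order is preserved at the head.
        for table in reversed(tables):
            if table in self._tables:
                self._tables.remove(table)  # assumption: avoid loading twice
            self._tables.appendleft(table)

    def next_table(self):
        """The next table to load is always taken from the head."""
        return self._tables.popleft()


queue = TableLoadQueue()
for name in ["a", "b", "c", "d"]:
    queue.add_background(name)
queue.prioritize(["c", "d"])
order = [queue.next_table() for _ in range(4)]
```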

To support the prioritized loading, some changes had to be made on the Impalad side during
analysis:
- During analysis, any tables that are missing metadata are tracked.
- Analysis now runs in a loop. If it fails due to an AnalysisException AND at least 1
  table/view was missing metadata, these tables missing metadata are requested to be
  loaded by calling the CatalogServer.
- The impalad waits until the required tables are received (it is notified each
  time there is a call to updateCatalog()) and defers analysis until all tables
  are available. Once the tables are available, analysis will restart.
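
The analysis loop in the list above can be sketched like this. AnalysisError, analyze and the request_prioritized_load callback are hypothetical stand-ins; in particular the callback collapses the CatalogServer RPC and the wait on updateCatalog() into one synchronous step.

```python
class AnalysisError(Exception):
    """Analysis failure that remembers which tables were missing metadata."""

    def __init__(self, message, missing_tables=()):
        super().__init__(message)
        self.missing_tables = list(missing_tables)


def analyze(tables_needed, loaded_tables):
    missing = [t for t in tables_needed if t not in loaded_tables]
    if missing:
        raise AnalysisError("tables missing metadata", missing)
    return "analyzed"


def analyze_with_retry(tables_needed, loaded_tables, request_prioritized_load):
    """Re-run analysis until it succeeds or fails for an unrelated reason."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return analyze(tables_needed, loaded_tables), attempts
        except AnalysisError as err:
            if not err.missing_tables:
                raise  # a genuine analysis error, not a metadata gap
            # Stand-in for the RPC plus waiting for the tables to arrive.
            request_prioritized_load(err.missing_tables)
```

A real implementation would also need a bound or timeout on the wait, which this sketch leaves out.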

This change also introduces two new flags:

--load_catalog_in_background (bool). When this is true (the default) the catalog server
will run a periodic background thread to queue all unloaded tables for loading. This is
generally the desired behavior, but there may be some cases (very large metastores) where
this may need to be disabled.

--num_metadata_loading_threads (int32). The number of threads to use when loading catalog
metadata (degree of parallelism). The default is 16, but it can be increased to improve
performance at the cost of stressing the Hive metastore/HDFS.

Change-Id: Ib94dbbf66ffcffea8c490f50f5c04d19fb2078ad
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1476
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1538
2014-02-13 23:43:06 -08:00
Nong Li
80d4fd958e IMPALA-786: Drop function should clear library cache.
We were previously only clearing the cache in the catalog service
update loop so the impalad the drop was issued to was not doing the
right thing.

Change-Id: I6bee228e8c0d565cea4ea61cbf64240d83a45a7d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1511
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
2014-02-10 18:51:39 -08:00
Matthew Jacobs
a7ff8da1a5 Query admission controller, part 1
This adds a simple admission control mechanism which is able to make
localized admission decisions and can queue some number of queries
that are not able to execute immediately. In this change, there is
a single pool and the maximum number of concurrent queries and the
maximum number of queued queries are configurable via flags, but the
data structures all support multiple pools so that we can later add
support for getting pools and per-pool configs from Yarn and Llama
configs (i.e. fair-scheduler.xml and llama-site.xml).

Each impalad keeps track of how many queries it has executing and
how many are queued, per pool. The statestore is used to disseminate
local pool statistics to all other impalads. When topic updates are
received, a new cluster-wide total number of currently executing
queries and total number of queued queries are calculated for each
pool. Those totals are used to make localized admission decisions in
the AdmissionController.
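
A sketch of how the cluster-wide totals could drive a local decision. The stats layout, the function names and the exact admit/queue/reject rule are assumptions; the message only says that the totals and the two configured maximums are used.

```python
def cluster_totals(stats_by_impalad, pool):
    """Sum the per-impalad running and queued counts for one pool."""
    running = sum(s.get(pool, {}).get("running", 0) for s in stats_by_impalad.values())
    queued = sum(s.get(pool, {}).get("queued", 0) for s in stats_by_impalad.values())
    return running, queued


def admission_decision(stats_by_impalad, pool, max_running, max_queued):
    """Admit if there is room to run, otherwise queue if there is room to wait."""
    running, queued = cluster_totals(stats_by_impalad, pool)
    if running < max_running:
        return "admit"
    if queued < max_queued:
        return "queue"
    return "reject"


stats = {
    "impalad-1": {"default": {"running": 3, "queued": 0}},
    "impalad-2": {"default": {"running": 2, "queued": 1}},
}
```

Because each impalad decides from totals that arrive via topic updates, the decision is only as fresh as the last update it received.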

There are a number of per-pool metrics which are used in automated
testing (in a separate commit) and there are many assertions and
debugging logging which I've been using to verify manual testing.

Change-Id: I68f92c789108336fca33c2148a4e14534c77e9f0
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1347
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: jenkins
(cherry picked from commit e52b528b8a2fa23585510eab916ecb41da82d24b)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1302
2014-02-07 10:12:24 -08:00
Lenni Kuff
5646156382 Reduce duplicate/unused string data from FileDescriptor/FileBlock thrift structs
Made the following changes:

1) Removed filePath from FileBlock. It wasn't used anywhere.
2) Update FileDescriptor to use file name rather than file path. The full path to the file
   can be found by prepending the parent partition directory to the file name.
3) Removed fileLength from FileBlock. This was the same length used in FileDescriptor.
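
Change 2) relies on the full path being recoverable from the partition directory. A minimal sketch, with a hypothetical helper name and made-up paths:

```python
import posixpath


def full_path(partition_dir, file_name):
    """Rebuild a file's full path from its parent partition directory."""
    return posixpath.join(partition_dir, file_name)


partition_dir = "/warehouse/sales/year=2014"
file_names = ["part-00000", "part-00001", "part-00002"]
paths = [full_path(partition_dir, name) for name in file_names]

# Characters stored when every file carries its full path vs. only its name.
chars_with_paths = sum(len(p) for p in paths)
chars_with_names = len(partition_dir) + sum(len(n) for n in file_names)
```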

Tested these changes on two tables on a 20 node cluster. One table had ~100K blocks
and one table had ~15K blocks.

In both cases I saw the following improvements:

* After making change 1) the total serialized size of the table dropped by ~30%
* After making changes 1) and 2) the total serialized size of the table dropped by ~52%
* After making changes 1), 2), and 3) the total serialized size of the table dropped by ~55%

Change-Id: Ic85b3cbcf775569f69b7303bec4adc52593fc35c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1351
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1480
2014-02-07 00:02:09 -08:00
Lenni Kuff
7a6892dcbe Fix race when invalidating catalog metadata and loading a new table
There was a race when the catalog was invalidated at the same time a table
was being loaded. This is because an uninitialized Table was being returned
unexpectedly to the impalad due to the concurrent invalidate.

This fixes the problem by updating the CatalogObjectCache to load when
a catalog object is uninitialized, rather than load when null. New items can
now be added in an initialized or uninitialized state; uninitialized objects
are loaded on access.
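
A rough sketch of the "load when uninitialized" behavior, assuming a
simple loader callback. The class and method names here are illustrative
and do not mirror the real CatalogObjectCache API.

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class CatalogObject:
    name: str
    initialized: bool = False
    payload: Any = None


class ObjectCache:
    """Cache that loads an entry on access if it is present but uninitialized."""

    def __init__(self, loader: Callable[[str], CatalogObject]):
        self._loader = loader
        self._lock = threading.Lock()
        self._items: Dict[str, CatalogObject] = {}

    def add(self, obj: CatalogObject) -> None:
        # Items may be added in either an initialized or uninitialized state.
        with self._lock:
            self._items[obj.name] = obj

    def get(self, name: str) -> Optional[CatalogObject]:
        with self._lock:
            obj = self._items.get(name)
            if obj is None:
                return None  # unknown name: nothing to load
            if not obj.initialized:
                # Simplification: load while holding the lock, so callers
                # never observe the uninitialized placeholder.
                obj = self._loader(name)
                self._items[name] = obj
            return obj
```

With this shape, a concurrent invalidate can replace an entry with an
uninitialized placeholder, and the next get() reloads it instead of
returning the placeholder to the caller.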

Also adds a stress test for invalidate metadata/invalidate metadata <table>/refresh

In addition, it cleans up the locking in the Catalog to make it more
straightforward. The top-level catalogLock_ is now only in CatalogServiceCatalog
and this lock is used to protect the catalogVersion_. Operations that need to
perform an atomic bulk catalog operation can use this lock (such as when the
CatalogServer needs to take a snapshot of the catalog to calculate what delta to send
to the statestore). Otherwise, the lock is not needed and objects are protected by the
synchronization at each level in the object hierarchy (Db->[Function/Table]). That is,
Dbs are synchronized by the Db cache, each Db has a Table Cache which is synchronized
independently.

Change-Id: I9e542cd39cdbef26ddf05499470c0d96bb888765
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1355
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1418
2014-01-31 16:16:32 -08:00
Alex Behm
6b769d011d Adds limited support for the FETCH_FIRST fetch orientation in HS2 client requests.
Adds a bounded query-result cache that clients can enable by setting
an 'impala.resultset.cache.size' option in the HS2 confOverlay map of the HS2 exec request.
Impala permits FETCH_FIRST for a particular stmt iff result caching is enabled.
FETCH_FIRST will succeed as long as all previously fetched rows fit into the bounded
result cache. Regardless of whether a FETCH_FIRST succeeds or not, clients may
always resume fetching with FETCH_NEXT.
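
The semantics above can be sketched roughly as follows. This is an
illustrative model only: ResultStream and its methods are hypothetical
names, and the cache size is counted in rows here as a simplifying
assumption.

```python
class ResultStream:
    """Models FETCH_NEXT / FETCH_FIRST over a bounded result cache."""

    def __init__(self, rows, cache_size):
        self._source = iter(rows)
        self._cache_size = cache_size
        # cache_size == 0 models a client that did not enable result caching.
        self._cache = [] if cache_size > 0 else None
        self._replay_pos = None  # index into the cache while replaying

    def fetch_next(self, n):
        out = []
        if self._replay_pos is not None:
            # Serve previously fetched rows from the cache first.
            out = self._cache[self._replay_pos:self._replay_pos + n]
            self._replay_pos += len(out)
            if self._replay_pos >= len(self._cache):
                self._replay_pos = None
        while len(out) < n:
            try:
                row = next(self._source)
            except StopIteration:
                break
            out.append(row)
            if self._cache is not None:
                if len(self._cache) < self._cache_size:
                    self._cache.append(row)
                else:
                    # Fetched rows no longer fit: FETCH_FIRST is now disabled.
                    self._cache = None
        return out

    def fetch_first(self, n):
        if self._cache is None:
            raise RuntimeError("FETCH_FIRST unavailable: caching off or cache exceeded")
        self._replay_pos = 0
        return self.fetch_next(n)
```

A failed fetch_first() leaves the stream position untouched, so the
client can keep calling fetch_next() from where it left off.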

The FETCH_FIRST feature is intended to allow HUE users to export an entire
result set (to Excel, CSV, etc.) after browsing through a few pages of results,
without having to re-run the query from scratch.

Change-Id: I71ab4794ddef30842594c5e1f7bc94724d6ce89f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1356
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1406
2014-01-30 14:58:46 -08:00
Lenni Kuff
b4f5c1edcf Enable lazy loading of table metadata for the CatalogService/Impalad
This change adds support for lazy loading of table metadata to the
CatalogService/Impalad. The way this works is that the CatalogService initially
sends out an update with only the databases and table names (wrapped as
IncompleteTables). When an Impalad encounters one of these tables, it will contact
the catalog service to get the metadata, possibly triggering a metadata load if the
catalog server has not yet loaded this table.

With these changes the catalog server starts up in just seconds, even for large
metastores since it only needs to call into the metastore to get the list of tables
and databases. The performance of "invalidate metadata" also improves for the same reason.

I also picked up the catalog cleanup patch I had to make the APIs a bit more consistent and
remove the need for using a LoadingCache for databases.

This also fixes up the FE tests to run in a more realistic fashion. The FE tests now run
against catalog objects received from the catalog server. This actually turned up some bugs
in our previous test configuration where we were not running with the correct column stats
(we were always running with avgSerializedSize = slotSize).  This changed some plans so the
planner tests needed to be updated.

Still TODO:
This does not include the changes to perform background metadata loading. I will send
that out as a separate patch on top of this.

Change-Id: Ied16f8a7f3a3393e89d6bfea78f0ba708d0ddd0e

Saving changes

Change-Id: I48c34408826b7396004177f5fc61a9523e664acc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1328
Tested-by: jenkins
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1338
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-21 21:43:29 -08:00
Henry Robinson
44f3e3448f Remove dead 'user' field from TQueryExecRequest, plus some
user-oriented cleanup

Also change QueryExecState::user() to use TQueryCtxt.user (not the
session), and rename QueryExecState::parent_session_ to just session_
because 'parent' made me wonder where the child session was.

We have two kinds of username - the 'connected' user is the one who
actually originated a session, and the 'impersonated' user is the one
that the connected user optionally chooses to execute operations
as. This patch makes it clearer which user is being referred to.

In particular, this adds the impersonated user as a separate field to
TSessionState, and pushes responsibility for deciding which username to
use for authorisation to the frontend, via
TSessionStateUtils.getEffectiveUser().
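
One plausible shape for that decision, sketched here as an assumption
(the body of getEffectiveUser() is not shown in this message, and the
field names are illustrative): prefer the impersonated user when one is
set, otherwise fall back to the connected user.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionState:
    connected_user: str                      # user who originated the session
    impersonated_user: Optional[str] = None  # optional user to act as


def get_effective_user(session: SessionState) -> str:
    """Return the username that authorisation checks should be made against."""
    # An unset or empty impersonated user means "no impersonation".
    return session.impersonated_user or session.connected_user
```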

Change-Id: Id63b4deaa44ac0eaa98b08595b795c129013fd58
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1322
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
(cherry picked from commit e66b9dd069095e6ec6d6f95171f29098dc1c2c93)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1332
Tested-by: Henry Robinson <henry@cloudera.com>
2014-01-21 19:23:26 -08:00
Nong Li
69fe1c6c10 Change FE to use ColumnType instead of PrimitiveType.
PrimitiveType is an enum and cannot be used for more complex types. The change
touches a lot of files but very mechanically.

A similar change needs to be done in the BE which will be a subsequent patch.

The version as I have it breaks rolling upgrade due to the thrift changes. If
this is not okay, we can work around that but it will be annoying.

Change-Id: If3838bb27377bfc436afd6d90a327de2ead0af54
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1287
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1304
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
2014-01-17 14:32:55 -08:00
Henry Robinson
22b98b36c1 Add 'user' field to reservation requests
Change-Id: I34594aeaf4ef03fbeb10f50faf2d6824437d32cc
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1289
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Henry Robinson <henry@cloudera.com>
(cherry picked from commit 7cc19a36a6435c35cbd8c7f1c95449fed4f0494f)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1291
2014-01-15 21:16:39 -08:00