Test results can be verified using regular expressions. The extraction
of the regular expression substring from the expected test results had a
bug where only the first character of an expression was considered. This
led to wrong but undetected test results.
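A minimal sketch of this class of bug, not the actual test framework code.
The `regex:` prefix and all function names here are assumptions made for
illustration only.

```python
import re

REGEX_PREFIX = "regex:"  # hypothetical marker, assumed for illustration


def extract_regex_buggy(expected):
    """Reproduces the class of bug: indexing keeps a single character."""
    if expected.startswith(REGEX_PREFIX):
        return expected[len(REGEX_PREFIX)]  # one character, not a slice
    return None


def extract_regex(expected):
    """Returns the full expression after the prefix, or None if absent."""
    if expected.startswith(REGEX_PREFIX):
        return expected[len(REGEX_PREFIX):].strip()
    return None


def matches(expected, actual):
    """Uses a full regex match when the prefix is present, else equality."""
    pattern = extract_regex(expected)
    if pattern is None:
        return expected == actual
    return re.fullmatch(pattern, actual) is not None
```

With the buggy variant, an expected value such as `regex:\d+ rows` would be
reduced to a one-character pattern, so the comparison no longer checks what
the test author intended.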
Change-Id: Ia670da6e0758455a86dc44744b96b9465d890af3
Reviewed-on: http://gerrit.cloudera.org:8080/1818
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
This change adds more analysis checks to verify the location
of the table or partition specified in a "CREATE/ALTER TABLE
... CACHED IN ..." statement can actually be cached. Caching
is only supported for HDFS locations.
If table-wide caching is enabled for a table, adding a partition
at an uncacheable location will be disallowed for that table
unless the attribute "UNCACHED" is explicitly specified.
Enabling table-wide caching for a table at an uncacheable
location or a table with partitions at uncacheable locations
will also be disallowed. However, caching can still be enabled
for individual partitions whose underlying locations
support caching.
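The rules above can be sketched roughly as follows. This is an illustration
under simplifying assumptions, not the analyzer code: the function names and
the `AnalysisError` class are hypothetical, and a location counts as
cacheable only if its URI scheme is literally `hdfs` (paths without a scheme
are not resolved against a default filesystem).

```python
from urllib.parse import urlparse


class AnalysisError(Exception):
    """Raised when a statement fails an analysis-time check (hypothetical)."""


def is_cacheable(location):
    """Only HDFS locations are treated as cacheable in this sketch."""
    return urlparse(location).scheme == "hdfs"


def check_add_partition(table_is_cached, partition_location, uncached=False):
    """Rejects a partition at an uncacheable location on a cached table
    unless the statement explicitly specifies UNCACHED."""
    if table_is_cached and not uncached and not is_cacheable(partition_location):
        raise AnalysisError(
            "Location '%s' cannot be cached; specify UNCACHED" % partition_location)


def check_cache_table(table_location, partition_locations):
    """Rejects table-wide caching if the table or any partition is uncacheable."""
    for location in [table_location] + list(partition_locations):
        if not is_cacheable(location):
            raise AnalysisError("Location '%s' cannot be cached" % location)
```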
Change-Id: I2299c9285126f4b035360f2ef902147188ccd5f1
Reviewed-on: http://gerrit.cloudera.org:8080/1373
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
This patch adds a 'location' column to the output of SHOW TABLE STATS /
SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE
SET LOCATION commands, particularly for partitions, and is easier to
identify than the output of DESCRIBE FORMATTED.
Some existing tests in alter-table.test have been updated to include
checking the location output before and after a SET LOCATION
command. The tests in show.test have also been updated to check for the
location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a
generic regex to avoid overly verbose tests.
Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb
Reviewed-on: http://gerrit.cloudera.org:8080/316
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
When a table is loaded in the catalog, we will now perform a check to
verify that the cache directive ID and cache replication factor are still
valid and the data is current.
If the cache directive no longer exists, we issue an error message
and mark the table / partition as uncached. Furthermore, the replication
factor is updated with the information from the actual cache directive.
Insert statements are a special case: the catalog update happens
synchronously and may try to access cache directive information that is
stale. In this insert path, we therefore catch the possible not-found
exception and reset the caching information.
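A rough sketch of the validation at load time, under stated assumptions:
`lookup_directive`, `DirectiveNotFound` and the `cache_replication` property
key are hypothetical names for illustration; only `cache_directive_id` is
taken from the existing table properties.

```python
class DirectiveNotFound(Exception):
    """Stands in for the not-found error of a cache directive lookup."""


def validate_caching(params, lookup_directive, log=print):
    """Reconciles stored caching metadata with the actual cache directive.

    params: dict of table or partition properties, updated in place.
    lookup_directive: callable mapping a directive ID to its replication
        factor; raises DirectiveNotFound if the directive is gone.
    Returns True if the table or partition is still cached.
    """
    directive_id = params.get("cache_directive_id")
    if directive_id is None:
        return False
    try:
        replication = lookup_directive(int(directive_id))
    except DirectiveNotFound:
        log("cache directive %s no longer exists; marking as uncached"
            % directive_id)
        # Reset the caching information.
        params.pop("cache_directive_id", None)
        params.pop("cache_replication", None)
        return False
    # Trust the actual directive over the stored value.
    params["cache_replication"] = str(replication)
    return True
```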
Change-Id: I882041ce5395b8a3d17e9fc2750053393340df65
Reviewed-on: http://gerrit.cloudera.org:8080/40
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: Internal Jenkins
This patch makes it possible to specify the number of replicas that
should be cached in main memory. This can be useful in high-QPS
scenarios because the load is spread over a set of cached replicas
instead of falling mostly on a single cached replica. The cache
replication factor can be larger than the block replication factor on
disk, but HDFS ignores the difference until more replicas become
available.
This extends the current syntax for specifying the cache pool in the
following way:
  cached in 'poolName'
is extended with the optional replication factor
  cached in 'poolName' with replication = XX
By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent independent of the cache pool
that is used to cache the partition.
To make changes to the replication factor and its current status easy to
review, the replication factor of tables and partitions is part of the
output of the "show partitions" command.
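The inheritance rules for the replication factor can be modelled roughly as
below. This is a toy model, not the catalog implementation: the `Table` and
`Partition` classes and their methods are hypothetical, and caching a
partition of an uncached table is assumed to fall back to the default.

```python
DEFAULT_CACHE_REPLICATION = 1  # default replication factor stated above


class Partition:
    def __init__(self, name):
        self.name = name
        self.pool = None
        self.replication = None


class Table:
    def __init__(self):
        self.pool = None
        self.replication = None
        self.partitions = []

    def set_cached(self, pool, replication=DEFAULT_CACHE_REPLICATION):
        """Caches the table; every existing partition takes the table's
        replication factor, overwriting any custom value."""
        self.pool, self.replication = pool, replication
        for part in self.partitions:
            part.pool, part.replication = pool, replication

    def add_partition(self, name, pool=None):
        """A new partition inherits the table's replication factor, even if
        it is cached in a different pool."""
        part = Partition(name)
        target_pool = pool or self.pool
        if target_pool is not None:
            part.pool = target_pool
            part.replication = self.replication or DEFAULT_CACHE_REPLICATION
        self.partitions.append(part)
        return part
```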
Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED
When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored within the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.
When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.
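The request bookkeeping described above can be sketched as follows. The
`FakeCacheService` class is an in-memory stand-in invented for this example
and does not reflect the HDFS client API; only the `cache_directive_id`
property key comes from this change.

```python
import itertools


class FakeCacheService:
    """In-memory stand-in for the HDFS caching API, for illustration only."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.directives = {}

    def add_directive(self, path, pool):
        directive_id = next(self._ids)
        self.directives[directive_id] = (path, pool)
        return directive_id

    def remove_directive(self, directive_id):
        self.directives.pop(directive_id, None)


def set_cached(props, path, pool, service):
    """Submits a cache request and records its ID in the properties."""
    props["cache_directive_id"] = str(service.add_directive(path, pool))


def set_uncached(props, service):
    """Drops the cache request, if any, and clears the stored ID."""
    directive_id = props.pop("cache_directive_id", None)
    if directive_id is not None:
        service.remove_directive(int(directive_id))


def drop_table(table_props, partition_props_list, service):
    """Uncaches the table and every cached partition before the drop."""
    for props in [table_props] + list(partition_props_list):
        set_uncached(props, service)
```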
It is desirable to know which cache pools exist early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence without going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).
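As an illustration of the analysis-time check, a local view of the known
pools might look like the sketch below. The `LocalCatalog` class, its
methods and the error text are hypothetical and only model the idea of
checking against pools received in catalog updates.

```python
class AnalysisError(Exception):
    """Raised when a statement fails an analysis-time check (hypothetical)."""


class LocalCatalog:
    """Holds the cache pools most recently received in a catalog update."""

    def __init__(self):
        self._cache_pools = frozenset()

    def apply_update(self, pool_names):
        """Replaces the known pools with those carried by the update."""
        self._cache_pools = frozenset(pool_names)

    def check_pool_exists(self, pool):
        """Fails during analysis, without contacting HDFS, if the pool is
        not among the known pools."""
        if pool not in self._cache_pools:
            raise AnalysisError(
                "The specified cache pool does not exist: " + pool)
```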
Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from accessing the table during this period, we wait for the cache
requests to complete in the background; once they have finished, the table metadata
is refreshed automatically.
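The background wait can be pictured with the following sketch. It is an
assumption-laden illustration using a polling thread: the function name, the
`is_cached` and `refresh_metadata` callbacks, and the interval and timeout
parameters are all hypothetical.

```python
import threading
import time


def wait_for_caching_async(is_cached, refresh_metadata,
                           poll_interval_s=0.01, timeout_s=5.0):
    """Polls in a daemon thread and refreshes metadata once caching is done.

    Returns the started thread so that callers may join it if they wish;
    the caller itself is never blocked.
    """
    def worker():
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_cached():
                refresh_metadata()
                return
            time.sleep(poll_interval_s)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

Giving up silently after the timeout is a simplification of this sketch; how
an incomplete cache request is handled is not specified here.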
Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins