Commit Graph

13 Commits

Author SHA1 Message Date
Alex Behm
19bab59854 Create/alter/describe tables with complex types.
This patch adds parsing of complex types and tests for using complex
types in various exprs and create/alter/describe stmts.

Change-Id: Ibc211a560c889f5ccfb616813700b923c89d8245
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3577
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3594
2014-07-23 17:26:14 -07:00
Lenni Kuff
7157f54bbe Support DROP STATS <table name>
Adds support for dropping all table and column stats from a table. Once incremental
stats are supported, this will provide the user a way to force a recompute of all
stats.

Change-Id: I27e03d5986b64eb91852bfc3417ffa971d432d6b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3533
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
(cherry picked from commit f1f074f24bfdc77c4cef147fe9d26f27df80ab81)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3551
2014-07-21 10:28:16 -07:00
Ippokratis Pandis
e1ae5fe95a IMPALA-1068: COMPUTE STATS should place -1 in #NULLs
With IMPALA-1033 we disabled the counting of the number of NULLs in each column,
and that gave a 2x speed-up in the computation. But erroneously the value 0 was
being placed in the number of NULLs, instead of the correct -1 that indicates
'unknown'.

Change-Id: Ib882eb2a87e7e2469f606081cb2881461b441a45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3377
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3378
2014-07-07 15:13:25 -07:00
Alex Behm
96722da3fe Fix misplaced comment in testfile.
Change-Id: I55dc7d0e8e74a4f8c9a99e9601b2578ef6b0390d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3303
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3317
2014-06-30 10:17:26 -07:00
Ippokratis Pandis
6026f1ebe1 IMPALA-1055: Compute stats query statements don't quote DB and table names
The compute stats statement was not quoting the DB and table names. If those names
were aliasing with keywords, then the compute stats would not execute due to a syntax
error.

Change-Id: Ie08421246bb54a63a44eaf19d0d835da780b7033
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3170
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3198
2014-06-20 09:32:52 -07:00
Lenni Kuff
745c091fcc [CDH5] Update SHOW TABLE STATS to include per-partition HDFS caching stats
Change-Id: I71b01f84bbd308108d775e78c644e867b48e05be
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2621
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-28 08:54:54 -07:00
Lenni Kuff
c45e9a70d9 [CDH5] Add DDL support for HDFS caching
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED

When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored with in the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.

When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.

It is desirable to know which cache pools exists early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).

Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from access the time during this period we will wait for the cache
requests to complete in the background and once they have finished the table metadata
will be automatically refreshed.

Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-27 16:47:15 -07:00
Alex Behm
ce40134ad0 IMPALA-867: Fail COMPUTE STATS in analysis for Avro tables affected by HIVE-6308.
Avro tables that were not created with a column-definition list do not have
their columns properly populated in the Metastore backend DB (HIVE-6308).
For such tables COMPUTE STATS and Hive's ANALYZE TABLE cannot succeed.
This patch fails COMPUTE STATS in analysis for such broken Avro tables
and adds tests for Avro tables with mismatched a column-definition list
and Avro schema.

Change-Id: I561ecea944ae2f83d69950b7a1ab9edaa89bdcea
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1892
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1920
2014-03-14 23:24:55 -07:00
Alex Behm
3f68be2caa IMP-1227: Ignore columns of unsupported types in compute stats. Enclose identifiers that are Impala keywords in quotes.
Change-Id: Ie7fa6da2869090428c9229c44b973ecccbb49e8e
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1357
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1368
2014-01-28 17:18:17 -08:00
Alex Behm
74164e8f99 IMPALA-688: Fix column stats computation for HBase row key. Use regex to fix flaky tests.
Change-Id: I1d3fb915921bbc5366da0ee51608fd54aa237777
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1135
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:54:33 -08:00
Alex Behm
e4ad086dee Added max/avg length for string columns in COMPUTE STATS.
Change-Id: I6f61de2323ee12681642684ec633ed4bb7506de2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1079
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:30 -08:00
Nong Li
ab21dde002 Update compute stats test to use regex for parquet/hbase file size.
The parquet file stores the application version that wrote it so
is different between our c4 and c5 branches.

HBase storage is also not guaranteed to be identical across versions.

Change-Id: I02984a55e0678756e50c1fff6db22c43788d3916
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1028
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:17 -08:00
Alex Behm
93e5b262c2 Added COMPUTE STATS command for gathering table and column stats.
A compute stats command computes the table and column stats for a given
table and persists them in the metastore.
The table stats consist of the per-partition and per-table row count.
The column stats are computed on a per-table basis and consist of the
number of distinct values and the number of NULLs per column.

This patch introduces a new 'child query' concept that
compute stats utilizes. Child queries are cancelled
if the parent query is cancelled. A compute stats stmt is
executed by the following query hirarchy:
parent: compute stats query (DDL)
- child: compute table stats query (QUERY)
- child: compute column stats query (QUERY)

The new child query concept is necessary to decouple child query fetches
from parent query fetches, i.e., we could not execute a child query as
part of the original compute stats query, because then a client could
fetch the results we need for updating the Metastore statistics. The
reason why our existing CTAS works without this decoupling
is that its insert 'child query' is not fetchable.

Change-Id: I560533e3cb09bcbbdb3eea7fcf0b460bc6b36dcd
Reviewed-on: http://gerrit.ent.cloudera.com:8080/873
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:14 -08:00