Commit Graph

18 Commits

Author SHA1 Message Date
Taras Bobrovytsky
d07580c171 IMPALA-4962: Fix SHOW COLUMN STATS for HS2
Impala incorrectly returned NULLs in the "Max Size" column of the SHOW
COLUMN STATS result when executed through the HS2 interface. The issue
was that the column was specified to be type INT in the result schema,
but the actual type of the contents that we inserted into it was
"long". The reason why this is not an issue in Impala shell is because
we stringify the contents without inspecting the metadata for beeswax
results.

The issue was fixed by changing the type from INT to BIGINT.

Change-Id: I419657744635dfdc2e1562fe60a597617fff446e
Reviewed-on: http://gerrit.cloudera.org:8080/6109
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
2017-02-22 23:10:34 +00:00
Alex Behm
a9f7cf51f4 IMPALA-3491: Use unique_database fixture in test_metadata_query_statements.py.
Testing: Ran the test locally on exhaustive in a loop 10 times.
Ran a private exhaustive build on hdfs.

Change-Id: Ia0af1dc6534234508bd0fed03531f7fe8ff556aa
Reviewed-on: http://gerrit.cloudera.org:8080/3103
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
2016-06-07 09:34:30 -07:00
Shant Hovsepian
6d87fe090c Improve Hll estimate for small cardinalities.
Based on Google's HyperLogLog++ paper. Uses a bias correcting
interpolation as a sub algorithm for Hll estimates within a specific
range.

Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9
Reviewed-on: http://gerrit.cloudera.org:8080/415
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
2015-07-16 19:38:17 +00:00
Henry Robinson
f22b8659fd IMPALA-1595: Add 'location' to SHOW [TABLE STATS|PARTITIONS] for HDFS tables
This patch adds a 'location' column to the output of SHOW TABLE STATS /
SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE
SET LOCATION commands, particularly for partitions, and is easier to
identify than the output of DESCRIBE FORMATTED.

Some existing tests in alter-table.test have been updated to include
checking the location output before and after a SET LOCATION
command. The tests in show.test have also been updated to check for the
location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a
generic regex to avoid overly verbose tests.

Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb
Reviewed-on: http://gerrit.cloudera.org:8080/316
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
2015-04-21 19:27:50 +00:00
ishaan
8369c3b13b Remove explicit references to functional_hbase tables from .test files.
Additionally, this patch also disabled the hbase/none test dimension if the
TARGET_FILESYSTEM environment variable is set to either s3 of isilon.

Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb
Reviewed-on: http://gerrit.cloudera.org:8080/74
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2015-02-23 23:32:41 +00:00
Martin Grund
cee1e84c1e IMPALA-1587: Extending caching directives for multiple replicas
This patch adds the possibility to specify the number of replicas that
should be cached in main memory. This can be useful in high QPS
scenarios as the majority of the load is no longer the single cached
replica, but a set of cached replicas. While the cache replication
factor can be larger than the block replication factor on disk, the
difference will be ignored by HDFS until more replicas become
available.

This extends the current syntax for specifying the cache pool in the
following way:

   cached in 'poolName'

is extended with the optional replication factor

   cached in 'poolName' with replication = XX

By default, the cache replication factor is set to 1. As this value is
not yet configurable in HDFS it's defined as a constant in the JniCatalog
thrift specification. If a partitioned table is cached, all its child
partitions inherit this cache replication factor. If child partitions
have a custom cache replication factor, changing the cache replication
factor on the partitioned table afterwards will overwrite this custom
value. If a new partition is added to the table, it will again inherit
the cache replication factor of the parent independent of the cache pool
that is used to cache the partition.

To review changes and status of the replication factor for tables and
partitions the replication factor is part of output of the "show
partitions" command.

Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533
Tested-by: jenkins
Reviewed-by: Martin Grund <mgrund@cloudera.com>
2015-01-26 20:30:59 -08:00
Henry Robinson
44f57e5fb6 IMPALA-1122: Compute stats with partition granularity
This patch adds the ability to compute and drop column and table
statistics at partition granularity.

The following commands are added. Detail about the implementation
follows.

COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>]

This variant of COMPUTE STATS will, ultimately, do the same thing as the
traditional COMPUTE STATS statement, but does so by caching the
intermediate state of the computation for each partition in the Hive
MetaStore. If the PARTITION clause is added, the computation is
performed for only that partition. If the PARTITION clause is omitted,
incremental stats are updated only for those partitions with missing
incremental stats (e.g. one column does not have stats, or incremental
stats was never computed for this partition). In this patch, incremental
stats are only invalidated when a DROP STATS variant is executed. Future
patches can automatically invalidate the statistics after REFRESH or
INSERT queries, etc.

DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec>

This variant of DROP stats removes the incremental statistics for the
given table. It does *not* recalculate the statistics for the whole
table, so this should be used only to invalidate the intermediate state
for a partition which will shortly be subject to COMPUTE INCREMENTAL
STATS. The point of this variant is to allow users to notify Impala when
they believe a partition has changed significantly enough to warrant
recomputation of its statistics. It is not necessary for new partitions;
Impala will detect that they do not have any valid statistics.

--------

This is achieved by adapting the existing HLL UDA via swapping its
finalize method for a new one which returns the intermediate HLL
buckets, rather than aggregating and then disposing of them. This
intermediate state is then returned to Impala's catalog-op-executor.cc,
which then passes the intermediate state back to the frontend to be
ultimately stored in the HMS.

This intermediate state is computed on a per-partition basis by grouping
the input to the UDA by partition. Thus, the incremental computation
produces one row for each partition selected (the set of which might be
quite small, if there are few partitions without valid incremental
stats: this is the point of the new commands).

At the same time, the query coordinator aggregates the output of the UDA
to produce table-level statistics. This computation incorporates any
existing (and not re-computed) intermediate partition state which is
passed to the coordinator by the frontend. The resulting statistics are
saved to the table as normal.

Intermediate statistics are serialised to the HMS by writing a Thrift
structure's serialised form to the partition's 'parameters' map. There
is a schema-imposed limit of 4000 characters to the serialised string,
which is exacerbated by the fact that the Thrift representation must
first be base-64 encoded to avoid type errors in the HMS. The current
patch breaks the encoded structure into 4k chunks, and then recombines
them on read. The alltypes table (11 columns) takes about three of these
chunks. This may mean that incremental stats are not suitable for
particularly wide tables: these structures could be zipped before
encoding for some space savings. In the meantime, the NDV estimates are
run-length encoded (since they are generally sparse); this can result in
substantial space savings.

Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475
Tested-by: jenkins
Reviewed-by: Henry Robinson <henry@cloudera.com>
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408
2014-11-25 09:13:37 -08:00
Henry Robinson
6af7c8fe4a IMPALA-1330: Fix column types for SHOW {table, partition} STATS
Because we add 'total' to the last row in SHOW PARTITIONS, we set the
partition key columns to be string. At least, that's what the comment
said, but we didn't do that in fact.

This patch also corrects the column type for max width, which should be INT.

Change-Id: I787ab17be27f45107340119017e528c58a3daad3
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4678
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: jenkins
2014-10-06 15:16:56 -07:00
Alex Behm
19bab59854 Create/alter/describe tables with complex types.
This patch adds parsing of complex types and tests for using complex
types in various exprs and create/alter/describe stmts.

Change-Id: Ibc211a560c889f5ccfb616813700b923c89d8245
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3577
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3594
2014-07-23 17:26:14 -07:00
Ippokratis Pandis
e1ae5fe95a IMPALA-1068: COMPUTE STATS should place -1 in #NULLs
With IMPALA-1033 we disabled the counting of the number of NULLs in each column,
and that gave a 2x speed-up in the computation. But erroneously the value 0 was
being placed in the number of NULLs, instead of the correct -1 that indicates
'unknown'.

Change-Id: Ib882eb2a87e7e2469f606081cb2881461b441a45
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3377
Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3378
2014-07-07 15:13:25 -07:00
Dimitris Tsirogiannis
5a6f53db16 Add partition pruning tests
The following changes are included in this commit:
1. Modified the alltypesagg table to include an additional partition key
that has nulls.
2. Added a number of tests in hdfs.test that exercise the partition
pruning logic (see IMPALA-887).
3. Modified all the tests that are affected by the change in alltypesagg.

Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236
2014-06-24 02:14:27 -07:00
Alex Behm
70d7ff07af CDH-19856: Disable Hive's stats autogathering.
Change-Id: I04e91f91d29b7863848a750e362c9d94469df7f2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3156
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3169
2014-06-19 16:48:34 -07:00
Lenni Kuff
745c091fcc [CDH5] Update SHOW TABLE STATS to include per-partition HDFS caching stats
Change-Id: I71b01f84bbd308108d775e78c644e867b48e05be
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2621
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-28 08:54:54 -07:00
Lenni Kuff
c45e9a70d9 [CDH5] Add DDL support for HDFS caching
This change adds DDL support for HDFS caching. The DDL allows the user to indicate a
table or partition should be cached and which pool to cache the data into:
* Create a cached table: CREATE TABLE ... CACHED IN 'poolName'
* Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName'
* Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED

When a table/partition is marked as cached, a new HDFS caching request is submitted
to cache the location (HDFS path) of the table/partition and the ID of that request
is stored with in the table metadata (in the table properties). This is stored as:
'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS
and persisted across HDFS restarts.

When a cached table or partition is dropped it is important to uncache the cached data
(drop the associated cache request). For partitioned tables, this means dropping all
cache requests from all cached partitions in the table.
Likewise, if a partitioned table is created as cached, new partitions should be marked
as cached by default.

It is desirable to know which cache pools exists early on (in analysis) so the query
will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To
support this, a new cache pool catalog object type was introduced. The catalog server
caches the known pools (periodically refreshing the cache) and sends the known pools out
in catalog updates. This allows impalads to perform analysis checks on cache pool
existence going to HDFS. It would be easy to use this to add basic cache pool management
in the future (ADD/DROP/SHOW CACHE POOL).

Waiting for the table/partition to become cached may take a long time. Instead of
blocking the user from access the time during this period we will wait for the cache
requests to complete in the background and once they have finished the table metadata
will be automatically refreshed.

Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
2014-05-27 16:47:15 -07:00
Lenni Kuff
9e2dd7e049 Add support for SHOW PARTITIONS <table name>
This statement returns info on all partitions for the given table. It is implemented as
an alias for SHOW TABLE STATS, with some extended analysis checks (such as throwing if
the statement targets an unpartitioned table).

Change-Id: I19154a9d90314de18f86ba355aa5dbed808f147f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2145
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2179
Tested-by: jenkins
2014-04-10 12:15:39 -07:00
Alex Behm
74164e8f99 IMPALA-688: Fix column stats computation for HBase row key. Use regex to fix flaky tests.
Change-Id: I1d3fb915921bbc5366da0ee51608fd54aa237777
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1135
Tested-by: jenkins
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:54:33 -08:00
Alex Behm
e4ad086dee Added max/avg length for string columns in COMPUTE STATS.
Change-Id: I6f61de2323ee12681642684ec633ed4bb7506de2
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1079
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
2014-01-08 10:54:30 -08:00
Lenni Kuff
bfb16ff552 Disable SHOW STATS tests because results are unstable (IMPALA-688)
Change-Id: Ib4b4fe3a29d3bd0e3c7ece8b5b21c4ec4b5eb289
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1060
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Lenni Kuff <lskuff@cloudera.com>
2014-01-08 10:54:22 -08:00