impala

mirror of https://github.com/apache/impala.git synced 2026-01-07 09:02:19 -05:00

Author	SHA1	Message	Date
Taras Bobrovytsky	d07580c171	IMPALA-4962: Fix SHOW COLUMN STATS for HS2 Impala incorrectly returned NULLs in the "Max Size" column of the SHOW COLUMN STATS result when executed through the HS2 interface. The issue was that the column was specified to be type INT in the result schema, but the actual type of the contents that we inserted into it was "long". The reason why this is not an issue in Impala shell is because we stringify the contents without inspecting the metadata for beeswax results. The issue was fixed by changing the type from INT to BIGINT. Change-Id: I419657744635dfdc2e1562fe60a597617fff446e Reviewed-on: http://gerrit.cloudera.org:8080/6109 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-22 23:10:34 +00:00
Alex Behm	a9f7cf51f4	IMPALA-3491: Use unique_database fixture in test_metadata_query_statements.py. Testing: Ran the test locally on exhaustive in a loop 10 times. Ran a private exhaustive build on hdfs. Change-Id: Ia0af1dc6534234508bd0fed03531f7fe8ff556aa Reviewed-on: http://gerrit.cloudera.org:8080/3103 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Alex Behm <alex.behm@cloudera.com>	2016-06-07 09:34:30 -07:00
Shant Hovsepian	6d87fe090c	Improve Hll estimate for small cardinalities. Based on Google's HyperLogLog++ paper. Uses a bias correcting interpolation as a sub algorithm for Hll estimates within a specific range. Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9 Reviewed-on: http://gerrit.cloudera.org:8080/415 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2015-07-16 19:38:17 +00:00
Henry Robinson	f22b8659fd	IMPALA-1595: Add 'location' to SHOW [TABLE STATS\|PARTITIONS] for HDFS tables This patch adds a 'location' column to the output of SHOW TABLE STATS / SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE SET LOCATION commands, particularly for partitions, and is easier to identify than the output of DESCRIBE FORMATTED. Some existing tests in alter-table.test have been updated to include checking the location output before and after a SET LOCATION command. The tests in show.test have also been updated to check for the location; all other tests that use SHOW [TABLE STATS\|PARTITIONS] use a generic regex to avoid overly verbose tests. Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb Reviewed-on: http://gerrit.cloudera.org:8080/316 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2015-04-21 19:27:50 +00:00
ishaan	8369c3b13b	Remove explicit references to functional_hbase tables from .test files. Additionally, this patch also disabled the hbase/none test dimension if the TARGET_FILESYSTEM environment variable is set to either s3 of isilon. Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb Reviewed-on: http://gerrit.cloudera.org:8080/74 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-02-23 23:32:41 +00:00
Martin Grund	cee1e84c1e	IMPALA-1587: Extending caching directives for multiple replicas This patch adds the possibility to specify the number of replicas that should be cached in main memory. This can be useful in high QPS scenarios as the majority of the load is no longer the single cached replica, but a set of cached replicas. While the cache replication factor can be larger than the block replication factor on disk, the difference will be ignored by HDFS until more replicas become available. This extends the current syntax for specifying the cache pool in the following way: cached in 'poolName' is extended with the optional replication factor cached in 'poolName' with replication = XX By default, the cache replication factor is set to 1. As this value is not yet configurable in HDFS it's defined as a constant in the JniCatalog thrift specification. If a partitioned table is cached, all its child partitions inherit this cache replication factor. If child partitions have a custom cache replication factor, changing the cache replication factor on the partitioned table afterwards will overwrite this custom value. If a new partition is added to the table, it will again inherit the cache replication factor of the parent independent of the cache pool that is used to cache the partition. To review changes and status of the replication factor for tables and partitions the replication factor is part of output of the "show partitions" command. Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-01-26 20:30:59 -08:00
Henry Robinson	44f57e5fb6	IMPALA-1122: Compute stats with partition granularity This patch adds the ability to compute and drop column and table statistics at partition granularity. The following commands are added. Detail about the implementation follows. COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>] This variant of COMPUTE STATS will, ultimately, do the same thing as the traditional COMPUTE STATS statement, but does so by caching the intermediate state of the computation for each partition in the Hive MetaStore. If the PARTITION clause is added, the computation is performed for only that partition. If the PARTITION clause is omitted, incremental stats are updated only for those partitions with missing incremental stats (e.g. one column does not have stats, or incremental stats was never computed for this partition). In this patch, incremental stats are only invalidated when a DROP STATS variant is executed. Future patches can automatically invalidate the statistics after REFRESH or INSERT queries, etc. DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec> This variant of DROP stats removes the incremental statistics for the given table. It does not recalculate the statistics for the whole table, so this should be used only to invalidate the intermediate state for a partition which will shortly be subject to COMPUTE INCREMENTAL STATS. The point of this variant is to allow users to notify Impala when they believe a partition has changed significantly enough to warrant recomputation of its statistics. It is not necessary for new partitions; Impala will detect that they do not have any valid statistics. -------- This is achieved by adapting the existing HLL UDA via swapping its finalize method for a new one which returns the intermediate HLL buckets, rather than aggregating and then disposing of them. This intermediate state is then returned to Impala's catalog-op-executor.cc, which then passes the intermediate state back to the frontend to be ultimately stored in the HMS. This intermediate state is computed on a per-partition basis by grouping the input to the UDA by partition. Thus, the incremental computation produces one row for each partition selected (the set of which might be quite small, if there are few partitions without valid incremental stats: this is the point of the new commands). At the same time, the query coordinator aggregates the output of the UDA to produce table-level statistics. This computation incorporates any existing (and not re-computed) intermediate partition state which is passed to the coordinator by the frontend. The resulting statistics are saved to the table as normal. Intermediate statistics are serialised to the HMS by writing a Thrift structure's serialised form to the partition's 'parameters' map. There is a schema-imposed limit of 4000 characters to the serialised string, which is exacerbated by the fact that the Thrift representation must first be base-64 encoded to avoid type errors in the HMS. The current patch breaks the encoded structure into 4k chunks, and then recombines them on read. The alltypes table (11 columns) takes about three of these chunks. This may mean that incremental stats are not suitable for particularly wide tables: these structures could be zipped before encoding for some space savings. In the meantime, the NDV estimates are run-length encoded (since they are generally sparse); this can result in substantial space savings. Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475 Tested-by: jenkins Reviewed-by: Henry Robinson <henry@cloudera.com> Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408	2014-11-25 09:13:37 -08:00
Henry Robinson	6af7c8fe4a	IMPALA-1330: Fix column types for SHOW {table, partition} STATS Because we add 'total' to the last row in SHOW PARTITIONS, we set the partition key columns to be string. At least, that's what the comment said, but we didn't do that in fact. This patch also corrects the column type for max width, which should be INT. Change-Id: I787ab17be27f45107340119017e528c58a3daad3 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4678 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: jenkins	2014-10-06 15:16:56 -07:00
Alex Behm	19bab59854	Create/alter/describe tables with complex types. This patch adds parsing of complex types and tests for using complex types in various exprs and create/alter/describe stmts. Change-Id: Ibc211a560c889f5ccfb616813700b923c89d8245 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3577 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3594	2014-07-23 17:26:14 -07:00
Ippokratis Pandis	e1ae5fe95a	IMPALA-1068: COMPUTE STATS should place -1 in #NULLs With IMPALA-1033 we disabled the counting of the number of NULLs in each column, and that gave a 2x speed-up in the computation. But erroneously the value 0 was being placed in the number of NULLs, instead of the correct -1 that indicates 'unknown'. Change-Id: Ib882eb2a87e7e2469f606081cb2881461b441a45 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3377 Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3378	2014-07-07 15:13:25 -07:00
Dimitris Tsirogiannis	5a6f53db16	Add partition pruning tests The following changes are included in this commit: 1. Modified the alltypesagg table to include an additional partition key that has nulls. 2. Added a number of tests in hdfs.test that exercise the partition pruning logic (see IMPALA-887). 3. Modified all the tests that are affected by the change in alltypesagg. Change-Id: I1a769375aaa71273341522eb94490ba5e4c6f00d Reviewed-on: http://gerrit.ent.cloudera.com:8080/2874 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3236	2014-06-24 02:14:27 -07:00
Alex Behm	70d7ff07af	CDH-19856: Disable Hive's stats autogathering. Change-Id: I04e91f91d29b7863848a750e362c9d94469df7f2 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3156 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3169	2014-06-19 16:48:34 -07:00
Lenni Kuff	745c091fcc	[CDH5] Update SHOW TABLE STATS to include per-partition HDFS caching stats Change-Id: I71b01f84bbd308108d775e78c644e867b48e05be Reviewed-on: http://gerrit.ent.cloudera.com:8080/2621 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-05-28 08:54:54 -07:00
Lenni Kuff	c45e9a70d9	[CDH5] Add DDL support for HDFS caching This change adds DDL support for HDFS caching. The DDL allows the user to indicate a table or partition should be cached and which pool to cache the data into: * Create a cached table: CREATE TABLE ... CACHED IN 'poolName' * Cache a table/partition: ALTER TABLE ... [partitionSpec] SET CACHED IN 'poolName' * Uncache a table/partition: ALTER TABLE ... [partitionSpec] SET UNCACHED When a table/partition is marked as cached, a new HDFS caching request is submitted to cache the location (HDFS path) of the table/partition and the ID of that request is stored with in the table metadata (in the table properties). This is stored as: 'cache_directive_id'='<requestId>'. The cache requests and IDs are managed by HDFS and persisted across HDFS restarts. When a cached table or partition is dropped it is important to uncache the cached data (drop the associated cache request). For partitioned tables, this means dropping all cache requests from all cached partitions in the table. Likewise, if a partitioned table is created as cached, new partitions should be marked as cached by default. It is desirable to know which cache pools exists early on (in analysis) so the query will fail without hitting HDFS/CatalogServer if a non-existent pool is specified. To support this, a new cache pool catalog object type was introduced. The catalog server caches the known pools (periodically refreshing the cache) and sends the known pools out in catalog updates. This allows impalads to perform analysis checks on cache pool existence going to HDFS. It would be easy to use this to add basic cache pool management in the future (ADD/DROP/SHOW CACHE POOL). Waiting for the table/partition to become cached may take a long time. Instead of blocking the user from access the time during this period we will wait for the cache requests to complete in the background and once they have finished the table metadata will be automatically refreshed. Change-Id: I1de9c6e25b2a3bdc09edebda5510206eda3dd89b Reviewed-on: http://gerrit.ent.cloudera.com:8080/2310 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: jenkins	2014-05-27 16:47:15 -07:00
Lenni Kuff	9e2dd7e049	Add support for SHOW PARTITIONS <table name> This statement returns info on all partitions for the given table. It is implemented as an alias for SHOW TABLE STATS, with some extended analysis checks (such as throwing if the statement targets an unpartitioned table). Change-Id: I19154a9d90314de18f86ba355aa5dbed808f147f Reviewed-on: http://gerrit.ent.cloudera.com:8080/2145 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com> Reviewed-on: http://gerrit.ent.cloudera.com:8080/2179 Tested-by: jenkins	2014-04-10 12:15:39 -07:00
Alex Behm	74164e8f99	IMPALA-688: Fix column stats computation for HBase row key. Use regex to fix flaky tests. Change-Id: I1d3fb915921bbc5366da0ee51608fd54aa237777 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1135 Tested-by: jenkins Reviewed-by: Alex Behm <alex.behm@cloudera.com>	2014-01-08 10:54:33 -08:00
Alex Behm	e4ad086dee	Added max/avg length for string columns in COMPUTE STATS. Change-Id: I6f61de2323ee12681642684ec633ed4bb7506de2 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1079 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-01-08 10:54:30 -08:00
Lenni Kuff	bfb16ff552	Disable SHOW STATS tests because results are unstable (IMPALA-688) Change-Id: Ib4b4fe3a29d3bd0e3c7ece8b5b21c4ec4b5eb289 Reviewed-on: http://gerrit.ent.cloudera.com:8080/1060 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Reviewed-by: Nong Li <nong@cloudera.com> Tested-by: Lenni Kuff <lskuff@cloudera.com>	2014-01-08 10:54:22 -08:00

18 Commits