impala

mirror of https://github.com/apache/impala.git synced 2026-01-05 12:01:11 -05:00

Author	SHA1	Message	Date
Alex Behm	b3d8a507cb	IMPALA-5310: Add COMPUTE STATS TABLESAMPLE. Adds the TABLESAMPLE clause for COMPUTE STATS. Syntax: COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)] Computes and replaces the table-level row count and total file size, as well as all table-level column statistics. Existing partition-level row counts are not modified. The TABLESAMPLE clause can be used to limit the scanned data volume to a desired percentage. When sampling, the unmodified results of the COMPUTE STATS queries are sent to the CatalogServer. There, the stats are extrapolated before storing them into the HMS so as not to confuse other engines like Hive/SparkSQL which may rely on the shared HMS fields being accurate. Limitations - Only works for HDFS tables - TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS - TABLESAMPLE requires --enable_stats_extrapolation=true Changes to EXPLAIN The stored statistics from the HMS are more clearly displayed under a 'stored statistics' section. Example: 00:SCAN HDFS [functional.alltypes, RANDOM] partitions=24/24 files=24 size=478.45KB stored statistics: table: rows=7300 size=478.45KB partitions: 24/24 rows=7300 columns: all Testing: - added new functional tests - core/hdfs run passed Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7 Reviewed-on: http://gerrit.cloudera.org:8080/8136 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-29 22:37:01 +00:00
Bharath Vissapragada	2c0fc30628	IMPALA-5615: Fix compute incremental stats for general partition exprs The fix for IMPALA-1654 has broken the compute incremental stats child query generation logic for general partition expressions. This commit fixes it and also adds new queries to fix the test gap. These tests fail consistently without the patch. Change-Id: I227fc06f580eb9174f60ad0f515a3641cec19268 Reviewed-on: http://gerrit.cloudera.org:8080/7379 Reviewed-by: Bharath Vissapragada <bharathv@cloudera.com> Tested-by: Impala Public Jenkins	2017-07-25 03:31:19 +00:00
Taras Bobrovytsky	d07580c171	IMPALA-4962: Fix SHOW COLUMN STATS for HS2 Impala incorrectly returned NULLs in the "Max Size" column of the SHOW COLUMN STATS result when executed through the HS2 interface. The issue was that the column was specified to be type INT in the result schema, but the actual type of the contents that we inserted into it was "long". The reason why this is not an issue in Impala shell is because we stringify the contents without inspecting the metadata for beeswax results. The issue was fixed by changing the type from INT to BIGINT. Change-Id: I419657744635dfdc2e1562fe60a597617fff446e Reviewed-on: http://gerrit.cloudera.org:8080/6109 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-22 23:10:34 +00:00
Alex Behm	d845413ab8	IMPALA-4854: Fix incremental stats with complex types. The bug: Compute incremental stats used to always do a full stats recomputation for tables with complex types. The logic for detecting schema changes (e.g. an added column) did not take into consideration that columns with complex types are ignored in the stats computation, and should therefore not be recognized as a new column that does not yet have stats. Testing: - Added a new regression test - Locally ran test_compute_stats.py and the FE tests Change-Id: I6e0335048d688ee25ff55c6628d0f6f8ecc1dd8a Reviewed-on: http://gerrit.cloudera.org:8080/6033 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-02-17 06:02:48 +00:00
Alex Behm	3aa4351625	IMPALA-4170: Fix identifier quoting in COMPUTE INCREMENTAL STATS. The SQL statements generated from COMPUTE INCREMENTAL STATS did not properly quote identifiers when incrementally updating the stats for newly added partitions. Our existing tests did not catch this case because the code paths for doing the initial stats computation and the incremental stats computation are different, in particular, the code for generating the SQL statements. Change-Id: I63adcc45dc964ce769107bf4139fc4566937bb96 Reviewed-on: http://gerrit.cloudera.org:8080/4479 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2016-09-21 01:24:53 +00:00
Alex Behm	a41710a0c8	Use unique_database fixture in test_compute_stats.py. This patch makes it a little easier to use the unique_database fixture with .test files. The RESULTS section can now contain $DATABASE which is replaced with the current database by the test framework. Testing: - ran the test locally on exhaustive - ran the test on hdfs and the local filesystem on Jenkins Change-Id: I8655eb769003f88c0e1ec1b254118e4ec3353b48 Reviewed-on: http://gerrit.cloudera.org:8080/2947 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2016-05-12 14:17:50 -07:00
Christopher Channing	9ea5caf0ef	IMPALA-2199: Row count not set for empty partition when spec is used with compute incremental stats This patch resolves an issue where the row count is not set to 0 when a partition spec is used with 'compute incremental stats' on a partition that contains no data. The fix is to populate the partition 'expected list' in the frontend with the partition spec; the backend keeps track of which partitions had statistics generated. In the scenario where no statistics are generated for a partition, the backend will fall back to the 'expected list' to zero out the statistics. Change-Id: If4aac131dbe44e14a0477afa58e980da9e235d6b Reviewed-on: http://gerrit.cloudera.org:8080/627 Reviewed-by: Christopher Channing <cchanning@cloudera.com> Tested-by: Internal Jenkins	2015-08-13 09:38:30 +00:00
Shant Hovsepian	6d87fe090c	Improve Hll estimate for small cardinalities. Based on Google's HyperLogLog++ paper. Uses a bias correcting interpolation as a sub algorithm for Hll estimates within a specific range. Change-Id: If4fe692b4308f6a57aea6167e9bc00db11eaaab9 Reviewed-on: http://gerrit.cloudera.org:8080/415 Tested-by: Internal Jenkins Reviewed-by: Henry Robinson <henry@cloudera.com>	2015-07-16 19:38:17 +00:00
Henry Robinson	f22b8659fd	IMPALA-1595: Add 'location' to SHOW [TABLE STATS|PARTITIONS] for HDFS tables This patch adds a 'location' column to the output of SHOW TABLE STATS / SHOW PARTITIONS. This helps users understand the effects of ALTER TABLE SET LOCATION commands, particularly for partitions, and is easier to identify than the output of DESCRIBE FORMATTED. Some existing tests in alter-table.test have been updated to include checking the location output before and after a SET LOCATION command. The tests in show.test have also been updated to check for the location; all other tests that use SHOW [TABLE STATS|PARTITIONS] use a generic regex to avoid overly verbose tests. Change-Id: I9d276f7b133c38c9319e0906397ca1c31cec95bb Reviewed-on: http://gerrit.cloudera.org:8080/316 Reviewed-by: Henry Robinson <henry@cloudera.com> Tested-by: Internal Jenkins	2015-04-21 19:27:50 +00:00
Henry Robinson	146fe64a26	IMPALA-1615: Don't drop row count during DROP INCREMENTAL STATS Change-Id: I1ae23ca9d70eeb58a3c7c8c59fb633832edcff58 Reviewed-on: http://gerrit.cloudera.org:8080/148 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-03-05 20:22:49 +00:00
ishaan	8369c3b13b	Remove explicit references to functional_hbase tables from .test files. Additionally, this patch also disabled the hbase/none test dimension if the TARGET_FILESYSTEM environment variable is set to either s3 or isilon. Change-Id: I63aecaa478d2ba9eb68de729e9640071359a2eeb Reviewed-on: http://gerrit.cloudera.org:8080/74 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2015-02-23 23:32:41 +00:00
Alex Behm	ad6b9364c0	IMPALA-1629: Compute stats properly updates CHAR/VARCHAR column stats. The problem was that VARCHAR was not present in a few switch statements for updating/populating column stats in various places. Change-Id: I0b2b316b734d27a7ff08701b0986014be2473443 Reviewed-on: http://gerrit.cloudera.org:8080/65 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Internal Jenkins	2015-02-19 22:53:42 +00:00
Martin Grund	cee1e84c1e	IMPALA-1587: Extending caching directives for multiple replicas This patch adds the possibility to specify the number of replicas that should be cached in main memory. This can be useful in high QPS scenarios as the majority of the load is no longer the single cached replica, but a set of cached replicas. While the cache replication factor can be larger than the block replication factor on disk, the difference will be ignored by HDFS until more replicas become available. This extends the current syntax for specifying the cache pool in the following way: cached in 'poolName' is extended with the optional replication factor cached in 'poolName' with replication = XX By default, the cache replication factor is set to 1. As this value is not yet configurable in HDFS it's defined as a constant in the JniCatalog thrift specification. If a partitioned table is cached, all its child partitions inherit this cache replication factor. If child partitions have a custom cache replication factor, changing the cache replication factor on the partitioned table afterwards will overwrite this custom value. If a new partition is added to the table, it will again inherit the cache replication factor of the parent independent of the cache pool that is used to cache the partition. To review changes and status of the replication factor for tables and partitions the replication factor is part of output of the "show partitions" command. Change-Id: I2aee63258d6da14fb5ce68574c6b070cf948fb4d Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5533 Tested-by: jenkins Reviewed-by: Martin Grund <mgrund@cloudera.com>	2015-01-26 20:30:59 -08:00
Henry Robinson	13b9cdd6b0	More test coverage for incremental stats Change-Id: I17778dcf019c2a219baa678211f221b7e04813bb Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5446 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins (cherry picked from commit 2f5ec47e0b5dd26bc9dfe884481d3a316201be2d) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5460 Reviewed-by: Henry Robinson <henry@cloudera.com>	2014-12-01 17:26:39 -08:00
Henry Robinson	98064c4da8	Fix crash when columns are dropped between compute stats calls FinalizePartitionedColumnStats() should have iterated over the list of columns present in the table, rather than the list in the existing stats data structure. If a column was dropped, but still persisted in the old structure, it was possible that we could index off the end of an array. Change-Id: Ib1ab7690ffae05afff826b9d1a15871337691739 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5437 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins (cherry picked from commit cee8305cd2878c8f00622d39ddd43b7a5dfbbc0d) Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5447 Reviewed-by: Henry Robinson <henry@cloudera.com>	2014-11-26 23:25:11 -08:00
Henry Robinson	44f57e5fb6	IMPALA-1122: Compute stats with partition granularity This patch adds the ability to compute and drop column and table statistics at partition granularity. The following commands are added. Detail about the implementation follows. COMPUTE INCREMENTAL STATS <tbl_name> [PARTITION <partition_spec>] This variant of COMPUTE STATS will, ultimately, do the same thing as the traditional COMPUTE STATS statement, but does so by caching the intermediate state of the computation for each partition in the Hive MetaStore. If the PARTITION clause is added, the computation is performed for only that partition. If the PARTITION clause is omitted, incremental stats are updated only for those partitions with missing incremental stats (e.g. one column does not have stats, or incremental stats was never computed for this partition). In this patch, incremental stats are only invalidated when a DROP STATS variant is executed. Future patches can automatically invalidate the statistics after REFRESH or INSERT queries, etc. DROP INCREMENTAL STATS <tbl_name> PARTITION <part_spec> This variant of DROP stats removes the incremental statistics for the given table. It does not recalculate the statistics for the whole table, so this should be used only to invalidate the intermediate state for a partition which will shortly be subject to COMPUTE INCREMENTAL STATS. The point of this variant is to allow users to notify Impala when they believe a partition has changed significantly enough to warrant recomputation of its statistics. It is not necessary for new partitions; Impala will detect that they do not have any valid statistics. -------- This is achieved by adapting the existing HLL UDA via swapping its finalize method for a new one which returns the intermediate HLL buckets, rather than aggregating and then disposing of them. 
This intermediate state is then returned to Impala's catalog-op-executor.cc, which then passes the intermediate state back to the frontend to be ultimately stored in the HMS. This intermediate state is computed on a per-partition basis by grouping the input to the UDA by partition. Thus, the incremental computation produces one row for each partition selected (the set of which might be quite small, if there are few partitions without valid incremental stats: this is the point of the new commands). At the same time, the query coordinator aggregates the output of the UDA to produce table-level statistics. This computation incorporates any existing (and not re-computed) intermediate partition state which is passed to the coordinator by the frontend. The resulting statistics are saved to the table as normal. Intermediate statistics are serialised to the HMS by writing a Thrift structure's serialised form to the partition's 'parameters' map. There is a schema-imposed limit of 4000 characters to the serialised string, which is exacerbated by the fact that the Thrift representation must first be base-64 encoded to avoid type errors in the HMS. The current patch breaks the encoded structure into 4k chunks, and then recombines them on read. The alltypes table (11 columns) takes about three of these chunks. This may mean that incremental stats are not suitable for particularly wide tables: these structures could be zipped before encoding for some space savings. In the meantime, the NDV estimates are run-length encoded (since they are generally sparse); this can result in substantial space savings. Change-Id: If82cf4753d19eb532265acb556f798b95fbb0f34 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4475 Tested-by: jenkins Reviewed-by: Henry Robinson <henry@cloudera.com> Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5408	2014-11-25 09:13:37 -08:00
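
The IMPALA-1122 message describes storing intermediate stats by run-length encoding the sparse NDV buckets, base-64 encoding the serialised structure, and splitting the result into chunks of at most 4000 characters that are recombined on read. The Python sketch below illustrates that idea only. It is not Impala's implementation: the (count, value) run-length layout, the `stats_chunk` key names, and every function name are hypothetical, and plain bytes stand in for the serialised Thrift structure.

```python
import base64
from typing import Dict

# Per-value character limit cited in the commit message.
CHUNK_SIZE = 4000


def rle_encode(buckets: bytes) -> bytes:
    """Run-length encode as (count, value) byte pairs; runs are capped at 255.

    Assumed layout for illustration, not Impala's actual encoding.
    """
    out = bytearray()
    i = 0
    while i < len(buckets):
        run = 1
        while (i + run < len(buckets) and run < 255
               and buckets[i + run] == buckets[i]):
            run += 1
        out += bytes((run, buckets[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    for j in range(0, len(encoded), 2):
        out += bytes([encoded[j + 1]]) * encoded[j]
    return bytes(out)


def to_chunks(payload: bytes, prefix: str = "stats_chunk") -> Dict[str, str]:
    """Base-64 encode payload and split it into a parameters-style map.

    The key names are hypothetical; one extra key records the chunk count.
    """
    text = base64.b64encode(payload).decode("ascii")
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    params = {f"{prefix}_num": str(len(chunks))}
    for n, chunk in enumerate(chunks):
        params[f"{prefix}_{n}"] = chunk
    return params


def from_chunks(params: Dict[str, str], prefix: str = "stats_chunk") -> bytes:
    """Recombine the chunks in order and decode them back to bytes."""
    count = int(params[f"{prefix}_num"])
    text = "".join(params[f"{prefix}_{n}"] for n in range(count))
    return base64.b64decode(text)
```

A round trip would call `to_chunks(rle_encode(buckets))` on write and `rle_decode(from_chunks(params))` on read. Storing an explicit chunk count is one simple way to make reassembly order unambiguous; the commit message does not say how the real patch tracks it.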

16 Commits