impala

mirror of https://github.com/apache/impala.git synced 2026-01-10 00:00:16 -05:00

Author	SHA1	Message	Date
Alex Behm	b3d8a507cb	IMPALA-5310: Add COMPUTE STATS TABLESAMPLE. Adds the TABLESAMPLE clause for COMPUTE STATS. Syntax: COMPUTE STATS <table> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)] Computes and replaces the table-level row count and total file size, as well as all table-level column statistics. Existing partition-level row counts are not modified. The TABLESAMPLE clause can be used to limit the scanned data volume to a desired percentage. When sampling, the unmodified results of the COMPUTE STATS queries are sent to the CatalogServer. There, the stats are extrapolated before storing them into the HMS so as not to confuse other engines like Hive/SparkSQL which may rely on the shared HMS fields being accurate. Limitations - Only works for HDFS tables - TABLESAMPLE is not supported for COMPUTE INCREMENTAL STATS - TABLESAMPLE requires --enable_stats_extrapolation=true Changes to EXPLAIN The stored statistics from the HMS are more clearly displayed under a 'stored statistics' section. Example: 00:SCAN HDFS [functional.alltypes, RANDOM] partitions=24/24 files=24 size=478.45KB stored statistics: table: rows=7300 size=478.45KB partitions: 24/24 rows=7300 columns: all Testing: - added new functional tests - core/hdfs run passed Change-Id: I7f3e72471ac563adada4a4156033a85852b7c8b7 Reviewed-on: http://gerrit.cloudera.org:8080/8136 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-11-29 22:37:01 +00:00
Matthew Jacobs	6c12546561	IMPALA-4833: Compute precise per-host reservation size Before this change, the per-host reservation size was computed by the Planner. However, scheduling happens after planning, so the Planner must assume that all fragments run on all hosts, and the reservation size is likely much larger than it needs to be. This moves the computation of the per-host reservation size to the BE where it can be computed more precisely. This also includes a number of plan/profile changes. Change-Id: Idbcd1e9b1be14edc4017b4907e83f9d56059fbac Reviewed-on: http://gerrit.cloudera.org:8080/7630 Reviewed-by: Matthew Jacobs <mj@cloudera.com> Tested-by: Impala Public Jenkins	2017-08-12 08:10:07 +00:00
Tim Armstrong	64fd0115e5	IMPALA-4862: make resource profile consistent with backend behaviour This moves away from the PipelinedPlanNodeSet approach of enumerating sets of concurrently-executing nodes because unions would force creating many overlapping sets of nodes. The new approach computes the peak resources during Open() and the peak resources between Open() and Close() (i.e. while calling GetNext()) bottom-up for each plan node in a fragment. The fragment resources are then combined to produce the query resources. The basic assumptions for the new resource estimates are: * resources are acquired during or after the first call to Open() and released in Close(). * Blocking nodes call Open() on their child before acquiring their own resources (this required some backend changes). * Blocking nodes call Close() on their children before returning from Open(). * The peak resource consumption of the query is the sum of the independent fragments (except for the parallel join build plans where we can assume there will be synchronisation). This is conservative but we don't synchronise fragment Open() and Close() across exchanges so can't make stronger assumptions in general. Also compute the sum of minimum reservations. This will be useful in the backend to determine exactly when all of the initial reservations have been claimed from a shared pool of initial reservations. Testing: * Updated planner tests to reflect behavioural changes. * Added extra resource requirement planner tests for unions, subplans, pipelines of blocking operators, and bushy join plans. * Added single-node plans to resource-requirements tests. These have more complex plan trees inside a single fragment, which is useful for testing the peak resource requirement logic. Change-Id: I492cf5052bb27e4e335395e2a8f8a3b07248ec9d Reviewed-on: http://gerrit.cloudera.org:8080/7223 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-07-12 01:17:24 +00:00
Tim Armstrong	c4d284f3cc	IMPALA-5483: Automatically disable codegen for small queries This is similar to the single-node execution optimisation, but applies to slightly larger queries that should run in a distributed manner but won't benefit from codegen. This adds a new query option disable_codegen_rows_threshold that defaults to 50,000. If fewer than this number of rows are processed by a plan node per impalad, the cost of codegen almost certainly outweighs the benefit. Using rows processed as a threshold is justified by a simple model that assumes the cost of codegen and execution per row for the same operation are proportional. E.g. if x is the complexity of the operation, n is the number of rows processed, C is a constant factor giving the cost of codegen and Ec/Ei are constant factor giving the cost of codegen'd and interpreted execution and d, then the cost of the codegen'd operator is C * x + Ec * x * n and the cost of the interpreted operator is Ei * x * n. Rearranging means that interpretation is cheaper if n < C / (Ei - Ec), i.e. that (at least with the simplified model) it makes sense to choose interpretation or codegen based on a constant threshold. The model also implies that it is somewhat safer to choose codegen because the additional cost of codegen is O(1) but the additional cost of interpretation is O(n). I ran some experiments with TPC-H Q1, varying the input table size, to determine what the cut-over point where codegen was beneficial was. The cutover was around 150k rows per node for both text and parquet. At 50k rows per node disabling codegen was very beneficial - around 0.12s versus 0.24s. To be somewhat conservative I set the default threshold to 50k rows. On more complex queries, e.g. TPC-H Q10, the cutover tends to be higher because there are plan nodes that process many fewer than the max rows. Fix a couple of minor issues in the frontend - the numNodes_ calculation could return 0 for Kudu, and the single node optimization didn't handle the case where for a scan node with conjuncts, a limit and missing stats correctly (it considered the estimate still valid.) Testing: Updated e2e tests that set disable_codegen to set disable_codegen_rows_threshold to 0, so that those tests run both with and without codegen still. Added an e2e test to make sure that the optimisation is applied in the backend. Added planner tests for various cases where codegen should and shouldn't be disabled. Perf: Added a targeted perf test for a join+agg over a small input, which benefits from this change. Change-Id: I273bcee58641f5b97de52c0b2caab043c914b32e Reviewed-on: http://gerrit.cloudera.org:8080/7153 Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins	2017-06-29 21:14:59 +00:00
Alex Behm	e89d7057a6	IMPALA-2373: Extrapolate row counts for HDFS tables. The main idea of this patch is to use table stats to extrapolate the row counts for new/modified partitions. Existing behavior: - Partitions that lack the row count stat are ignored when estimating the cardinality of HDFS scans. Such partitions effectively have an estimated row count of zero. - We always use the row count stats for partitions that have one. The row count may be innaccurate if data in such partitions has changed significantly. Summary of changes: - Enhance COMPUTE STATS to also store the total number of file bytes in the table. - Use the table-level row count and file bytes stats to estimate the number of rows in a scan. - A new impalad startup flag is added to enable/disable the extrapolation behavior. The feature is disabled by default. Note that even with the feature disabled, COMPUTE STATS stores the file bytes so you can enable the feature without having to run COMPUTE STATS again. Testing: - Added new FE unit test - Added new EE test Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4 Reviewed-on: http://gerrit.cloudera.org:8080/6840 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins	2017-05-26 21:06:17 +00:00

5 Commits