Reworks the FK/PK join detection logic to:
- more accurately recognize many-to-many joins
- avoid dim/dim joins for multi-column PKs
The new detection logic maintains our existing philosophy of generally
assuming a FK/PK join, unless there is strong evidence to the
contrary, as follows.
For each set of simple equi-join conjuncts between two tables, we
compute the joint NDV of the right-hand side columns by
multiplication. If the joint NDV is significantly smaller than the
right-hand side row count, we are fairly confident that the
right-hand side is not a PK. Otherwise, we assume the set of
conjuncts could represent a FK/PK relationship.
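A rough sketch of that check, with hypothetical names and an illustrative
threshold (this is not the actual planner code):

  import java.util.List;

  // Illustrative only: names, types and the threshold are made up.
  class FkPkHeuristic {
    interface EqJoinConjunct { long rhsColumnNdv(); }

    static boolean isLikelyFkPkJoin(List<EqJoinConjunct> conjuncts,
        long rhsRowCount) {
      double jointNdv = 1.0;
      for (EqJoinConjunct c : conjuncts) {
        // Multiply the per-column NDVs of the right-hand side columns.
        jointNdv *= Math.max(1, c.rhsColumnNdv());
      }
      // If the joint NDV is much smaller than the rhs row count, the rhs
      // is very unlikely to be a PK; otherwise assume FK/PK.
      double threshold = 0.1;  // illustrative value, not the real constant
      return jointNdv >= rhsRowCount * threshold;
    }
  }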
Extends the explain plan to include the outcome of the FK/PK detection
at EXPLAIN_LEVEL > STANDARD.
Performance testing:
1. Full TPC-DS run on 10TB:
- Q10 improved by >100x
- Q72 improved by >25x
- Q17, Q26, Q29 improved by 2x
- Q64 regressed by 10x
- Total runtime: Improved by 2x
- Geomean: Minor improvement
The regression of Q64 is understood and we will try to address it
in follow-on changes. The previous plan was better by accident and
not because of superior logic.
2. Nightly TPC-H and TPC-DS runs:
- No perf differences
Testing:
- The existing planner tests cover the changes.
- Code/hdfs run passed.
Change-Id: I49074fe743a28573cff541ef7dbd0edd88892067
Reviewed-on: http://gerrit.cloudera.org:8080/7257
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
The main idea of this patch is to use table stats to
extrapolate the row counts for new/modified partitions.
Existing behavior:
- Partitions that lack the row count stat are ignored
when estimating the cardinality of HDFS scans. Such
partitions effectively have an estimated row count
of zero.
- We always use the row count stat for partitions that
have one. The row count may be inaccurate if data in
such partitions has changed significantly.
Summary of changes:
- Enhance COMPUTE STATS to also store the total number
of file bytes in the table.
- Use the table-level row count and file bytes stats
to estimate the number of rows in a scan (sketched below).
- A new impalad startup flag is added to enable/disable
the extrapolation behavior. The feature is disabled by
default. Note that even with the feature disabled,
COMPUTE STATS stores the file bytes so you can enable
the feature without having to run COMPUTE STATS again.
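A rough sketch of the extrapolation, with hypothetical names: the estimated
row count of a scan is the stats row count scaled by the fraction of the
table's file bytes that the scan reads.

  // Hypothetical sketch of row-count extrapolation from table stats.
  class RowCountExtrapolation {
    // 'statsRowCount' and 'statsTotalBytes' are the table-level stats
    // written by COMPUTE STATS; 'scannedBytes' is the size of the data
    // the scan will read. Returns -1 if the stats are unusable.
    static long extrapolateRowCount(long statsRowCount, long statsTotalBytes,
        long scannedBytes) {
      if (statsRowCount < 0 || statsTotalBytes <= 0) return -1;
      double rowsPerByte = (double) statsRowCount / statsTotalBytes;
      return Math.round(rowsPerByte * scannedBytes);
    }
  }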
Testing:
- Added new FE unit test
- Added new EE test
Change-Id: I972c8a03ed70211734631a7dc9085cb33622ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/6840
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Compute the minimum buffer requirement for spilling nodes and
per-host estimates for the entire plan tree.
This builds on top of the existing resource estimation code, which
computes the sets of plan nodes that can execute concurrently. This is
cleaned up so that the process of producing resource requirements is
clearer. It also removes the unused VCore estimates.
Fixes various bugs and other issues:
* computeCosts() was not called for unpartitioned fragments, so
the per-operator memory estimate was not visible.
* Nested loop join was not treated as a blocking join.
* The TODO comment about union was misleading.
* Fix the computation for mt_dop > 1 by distinguishing per-instance and
per-host estimates.
* Always generate an estimate instead of unpredictably returning
-1/"unavailable" in many circumstances - there was little rhyme or
reason to when this happened.
* Remove the special "trivial plan" estimates. With the rest of the
cleanup we generate estimates <= 10MB for those trivial plans through
the normal code path.
I left one bug (IMPALA-4862) unfixed because it is subtle, will affect
estimates for many plans and will be easier to review once we have the
test infra in place.
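For illustration, here is a rough sketch of rolling per-instance operator
estimates up into a per-host estimate when mt_dop > 1. The names are
hypothetical and the combination rule is a simplification, not the exact
logic in this patch:

  import java.util.List;

  // Hypothetical sketch: per-host memory = the largest sum over any set of
  // concurrently-executing operators, with each per-instance estimate
  // scaled by the number of fragment instances per host.
  class PerHostMemEstimator {
    static long perHostBytes(List<List<Long>> concurrentOperatorSets,
        int instancesPerHost) {
      long maxOverSets = 0;
      for (List<Long> set : concurrentOperatorSets) {
        long sum = 0;
        for (long perInstanceBytes : set) {
          sum += perInstanceBytes * Math.max(1, instancesPerHost);
        }
        maxOverSets = Math.max(maxOverSets, sum);
      }
      return maxOverSets;
    }
  }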
Testing:
Added basic planner tests for resource requirements in both the MT and
non-MT cases.
Re-enabled the explain_level tests, which appear to be the only
coverage for many of these estimates. Removed the complex and
brittle test cases and replaced with a couple of much simpler
end-to-end tests.
Change-Id: I1e358182bcf2bc5fe5c73883eb97878735b12d37
Reviewed-on: http://gerrit.cloudera.org:8080/5847
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The runtime profile as we present it is not very useful, and I think its structure
makes it hard to consume. This patch adds a new client-facing, schema-based set of
counters collected from the runtime profiles. For example, with this structure it
would be easy to have the shell get the stats of a running query and print a useful
progress report, or to check the most relevant metrics for diagnosing issues.
Here's an example of the output for one of the tpch queries:
Operator             #Hosts   Avg Time   Max Time    #Rows  Est. #Rows  Peak Mem  Est. Peak Mem  Detail
----------------------------------------------------------------------------------------------------------------------
09:MERGING-EXCHANGE       1   79.738us   79.738us        5           5         0        -1.00 B  UNPARTITIONED
05:TOP-N                  3   84.693us   88.810us        5           5  12.00 KB       120.00 B
04:AGGREGATE              3    5.263ms    6.432ms        5           5  44.00 KB       10.00 MB  MERGE FINALIZE
08:AGGREGATE              3   16.659ms   27.444ms   52.52K     600.12K   3.20 MB       15.11 MB  MERGE
07:EXCHANGE               3    2.644ms      5.1ms   52.52K     600.12K         0              0  HASH(o_orderpriority)
03:AGGREGATE              3  342.913ms  966.291ms   52.52K     600.12K  10.80 MB       15.11 MB
02:HASH JOIN              3    2s165ms    2s171ms  144.87K     600.12K  13.63 MB      941.01 KB  INNER JOIN, BROADCAST
|--06:EXCHANGE            3    8.296ms    8.692ms   57.22K      15.00K         0              0  BROADCAST
|  01:SCAN HDFS           2    1s412ms    1s978ms   57.22K      15.00K  24.21 MB      176.00 MB  tpch.orders o
00:SCAN HDFS              3    8s032ms    8s558ms    3.79M     600.12K  32.29 MB      264.00 MB  tpch.lineitem l
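As a rough illustration of how a client could render one row of such a
summary (purely hypothetical formatting code, not the shell's actual
implementation):

  // Hypothetical sketch: format one operator row of an exec summary.
  class ExecSummaryRow {
    static String format(String operator, int numHosts, String avgTime,
        String maxTime, String numRows, String estRows, String peakMem,
        String estPeakMem, String detail) {
      return String.format("%-20s %6d %10s %10s %9s %11s %10s %14s  %s",
          operator, numHosts, avgTime, maxTime, numRows, estRows,
          peakMem, estPeakMem, detail);
    }
  }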
Change-Id: Iaad4b9dd577c375006313f19442bee6d3e27246a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2964
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Our new build machines (e.g., beefy) have more cores than our other machines,
so scan nodes may have a different memory estimate causing the explain tests
to fail. This patch fixes the num_scanner_threads to 1 for explain tests
to ensure consistent estimates.
Change-Id: Ie6194f3c3b17d04aa141d04fcddb7ac948e92fcf
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1735
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1753
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
This patch cleans up analysis and execution of scalar and aggregate functions
so that there is no difference between how builtins and user functions are
handled. The only difference is that the catalog is populated with the builtins
all the time.
The BE always gets a TFunction object and just executes it (builtins will have
an empty hdfs file location).
This removes the opcode registry; its functionality is subsumed by the catalog,
where most of it was already duplicated anyway.
This also introduces the concept of a system database: a database that the user
cannot modify and that is populated automatically on startup.
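As a toy illustration of that last point (hypothetical type, not the actual
TFunction thrift struct), the only marker that distinguishes a builtin at
execution time is the missing HDFS library location:

  // Hypothetical sketch: a builtin is simply a function entry without an
  // HDFS library location; everything else is handled uniformly.
  class FunctionDesc {
    String name;
    String hdfsLocation;  // null or empty for builtins

    boolean isBuiltin() {
      return hdfsLocation == null || hdfsLocation.isEmpty();
    }
  }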
Change-Id: Iaa3f84dad0a1a57691f5c7d8df7305faf01d70ed
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1386
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1577
The overall goal of this change is to allow table metadata to be loaded in the background
while also allowing loading to be prioritized on an as-needed basis. As part of analysis,
any tables that are not loaded are tracked; if analysis fails, the impalad makes an RPC
to the CatalogServer to request that the metadata loading of these tables be prioritized,
and analysis is restarted.
To support this, the CatalogServer now has a deque of the tables to load. For
background loading, tables to load are added to the tail of the deque. However, a new
CatalogServer RPC was added that can prioritize the loading of one or more tables, in
which case they are added to the head of the deque. The next table to load is always
taken from the head. This helps prioritize loading, but is admittedly not the fairest
approach.
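A minimal sketch of that deque discipline, with hypothetical names (the real
CatalogServer also de-duplicates requests and coordinates with the loading
threads):

  import java.util.ArrayDeque;
  import java.util.Deque;

  // Background loads go to the tail, prioritized loads go to the head, and
  // the loader threads always take the next table from the head.
  class TableLoadQueue {
    private final Deque<String> pending = new ArrayDeque<>();

    synchronized void addBackgroundLoad(String tableName) {
      pending.addLast(tableName);
    }

    synchronized void prioritizeLoad(String tableName) {
      pending.addFirst(tableName);
    }

    synchronized String nextTableToLoad() {
      return pending.pollFirst();  // null if nothing is pending
    }
  }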
To support the prioritized loading, some changes had to be made on the impalad side
during analysis:
- During analysis, any tables that are missing metadata are tracked.
- Analysis now runs in a loop. If it fails due to an AnalysisException AND at least one
  table/view was missing metadata, the impalad calls the CatalogServer to request that
  the missing tables be loaded.
- The impalad then waits until the required tables have been received (it is notified on
  each call to updateCatalog()) and does not re-run analysis until all of them are
  available. Once they are, analysis restarts, as sketched below.
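Roughly, the analysis loop looks like this; the helper methods are
hypothetical stand-ins for the real impalad machinery:

  import java.util.Set;

  // Hypothetical sketch of the analysis retry loop described above.
  abstract class AnalysisLoop {
    // Returns the names of tables missing metadata (empty means analysis
    // succeeded); throws for analysis errors unrelated to missing metadata.
    abstract Set<String> analyzeAndCollectMissingTables(String stmt)
        throws AnalysisException;
    abstract void requestPrioritizedLoad(Set<String> tables);  // CatalogServer RPC
    abstract void waitForTables(Set<String> tables);  // woken by updateCatalog()

    void analyzeWithRetry(String stmt) throws AnalysisException {
      while (true) {
        Set<String> missing = analyzeAndCollectMissingTables(stmt);
        if (missing.isEmpty()) return;      // analysis succeeded
        requestPrioritizedLoad(missing);
        waitForTables(missing);             // then loop and re-analyze
      }
    }

    static class AnalysisException extends Exception {}
  }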
This change also introduces two new flags:
--load_catalog_in_background (bool). When this is true (the default), the catalog server
runs a periodic background thread that queues all unloaded tables for loading. This is
generally the desired behavior, but there may be some cases (very large metastores) where
this may need to be disabled.
--num_metadata_loading_threads (int32). The number of threads to use when loading catalog
metadata (degree of parallelism). The default is 16, but it can be increased to improve
performance at the cost of stressing the Hive metastore/HDFS.
Change-Id: Ib94dbbf66ffcffea8c490f50f5c04d19fb2078ad
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1476
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1538
There are now 4 explain levels summarized as follows:
- Level 0: MINIMAL
Non-fragmented parallel plan only showing plan nodes with minimal attributes
- Level 1: STANDARD
Non-fragmented parallel plan with some details in plan nodes
- Level 2: EXTENDED
Non-fragmented parallel plan with full details in plan nodes including
the table/column stats, row size, #hosts, cardinality,
and estimated per-host memory requirement
- Level 3: VERBOSE
Fragmented parallel plan with full details (like level 2)
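For reference, the four levels form a simple ordered scale; a sketch (the
actual enum in the code may be named differently):

  // Sketch: higher levels include strictly more detail than lower ones.
  enum ExplainLevel {
    MINIMAL,   // 0: plan nodes only, minimal attributes
    STANDARD,  // 1: some details in plan nodes
    EXTENDED,  // 2: full details incl. stats, row size, #hosts, memory
    VERBOSE    // 3: fragmented parallel plan with full details
  }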
This patch also includes several bugfixes related to plan costing and/or
testing of explain plans.
Change-Id: I622310f01d1b3d53ea1031adaf3b3ffdd94eba30
Reviewed-on: http://gerrit.ent.cloudera.com:8080/1211
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins