47 Commits

Riza Suminto
3210ec58c5 IMPALA-14006: Bound max_instances in CreateInputCollocatedInstances
IMPALA-11604 (part 2) changes how many instances to create in
Scheduler::CreateInputCollocatedInstances. This works when the left
child fragment of a parent fragment is distributed across nodes.
However, if the left child fragment instance is limited to only 1
node (the case of UNPARTITIONED fragment), the scheduler might
over-parallelize the parent fragment by scheduling too many instances in
a single node.

This patch attempts to mitigate the issue in two ways. First, it adds
bounding logic in PlanFragment.traverseEffectiveParallelism() to lower
parallelism further if the left (probe) side of the child fragment is
not well distributed across nodes.

Second, it adds TQueryExecRequest.max_parallelism_per_node to relay
information from Analyzer.getMaxParallelismPerNode() to the scheduler.
With this information, the scheduler can do additional sanity checks to
prevent Scheduler::CreateInputCollocatedInstances from
over-parallelizing a fragment. Note that this sanity check can also cap
MAX_FS_WRITERS option under a similar scenario.
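The sanity check amounts to capping the scheduler's requested instance count at what the planner deemed feasible per node. A minimal Python sketch of the idea (function and parameter names are hypothetical, not Impala's actual C++ identifiers):

```python
def bound_instances(requested_instances, max_parallelism_per_node, num_nodes):
    """Cap the instance count the scheduler may create so a fragment is
    never over-parallelized beyond the planner's per-node bound."""
    return min(requested_instances, max_parallelism_per_node * num_nodes)
```

For the UNPARTITIONED case described above, a fragment limited to 1 node with a per-node bound of 12 would be capped at 12 instances even if 48 were requested.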

Added a ScalingVerdict enum and log it at TRACE level to show the
scaling decision steps.

Testing:
- Add planner test and e2e test that exercise the corner case under
  COMPUTE_PROCESSING_COST=1 option.
- Manually comment the bounding logic in traverseEffectiveParallelism()
  and confirm that the scheduler's sanity check still enforces the
  bounding.

Change-Id: I65223b820c9fd6e4267d57297b1466d4e56829b3
Reviewed-on: http://gerrit.cloudera.org:8080/22840
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-05-07 03:34:15 +00:00
Riza Suminto
f1de0c392f IMPALA-13636: Fix target file_format in TestTpcdsInsert
TestTpcdsInsert creates a temporary table to test insert functionality.
It has three problems:

1. It does not use the unique_database parameter, so the temporary
   table is not cleaned up after the test finishes.
2. It ignores file_format from the test vector, causing inconsistency in
   the temporary table's file format: str_insert is always in PARQUET
   format, while store_sales_insert is always in TEXTFILE format.
3. The text file_format dimension is never exercised, because
   --workload_exploration_strategy in run-all-tests.sh does not
   explicitly list the tpcds-insert workload.

This patch fixes all three problems and a few flake8 warnings in
test_tpcds_queries.py.

Testing:
- Run bin/run-all-tests.sh with
  EXPLORATION_STRATEGY=exhaustive
  EE_TEST=true
  EE_TEST_FILES="query_test/test_tpcds_queries.py::TestTpcdsInsert"
  Verified that the temporary table format follows file_format
  dimension.

Change-Id: Iea621ec1d6a53eba9558b0daa3a4cc97fbcc67ae
Reviewed-on: http://gerrit.cloudera.org:8080/22291
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-01-06 22:15:12 +00:00
wzhou-code
3cbb3be5f7 IMPALA-13018: Block push down of conjuncts with implicit casting on base columns for jdbc tables
Query q80a contains a BETWEEN predicate with casts to timestamp in its
WHERE clause:
  d_date between cast('2000-08-23' as timestamp)
    and (cast('2000-08-23' as timestamp) + interval 30 days)
A BETWEEN predicate casts all of its exprs to compatible types, so the
Planner generates predicates for the DataSourceScanNode as:
  CAST(d_date AS TIMESTAMP) >= TIMESTAMP '2000-08-23 00:00:00',
  CAST(d_date AS TIMESTAMP) <= TIMESTAMP '2000-09-22 00:00:00'
However, a cast of a column to DATE/TIMESTAMP currently cannot be pushed
down to a JDBC table. This patch fixes the issue by blocking conjuncts
with an implicit unsafe cast, or a cast to DATE/TIMESTAMP, from being
added to the offered predicate list for a JDBC table.
Note that explicit casts on base columns are also not allowed to be
pushed down.
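The blocking logic can be pictured as a partition over candidate conjuncts. This is an illustrative Python sketch only; the dict shape and names are hypothetical, and the real check operates on Impala's frontend Expr tree:

```python
UNSAFE_CAST_TARGETS = {"DATE", "TIMESTAMP"}

def split_conjuncts_for_jdbc(conjuncts):
    """Partition conjuncts into those offered to the JDBC data source and
    those kept as regular Impala-evaluated predicates. A conjunct whose
    base column carries an implicit unsafe cast, or any cast to
    DATE/TIMESTAMP, is never offered."""
    offered, kept = [], []
    for c in conjuncts:
        blocked = (c.get("cast_target") in UNSAFE_CAST_TARGETS
                   or c.get("implicit_unsafe_cast", False))
        (kept if blocked else offered).append(c)
    return offered, kept
```

A conjunct like `CAST(d_date AS TIMESTAMP) >= ...` would land in the kept list and show up under "predicates" of the scan node rather than "data source predicates".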

Testing:
 - Add new planner unit tests covering explicit casts, implicit casts
   to DATE/TIMESTAMP, built-in functions, and arithmetic expressions.
   Predicates that are accepted for JDBC are shown in the plan under
   "data source predicates" of the DataSourceScanNode; predicates that
   are not accepted are shown in the plan under "predicates" of the
   DataSourceScanNode.
 - Passed all tpcds queries for JDBC tables, including q80a.
 - Passed core test

Change-Id: Iabd7e28b8d5f11f25a000dc4c9ab65895056b572
Reviewed-on: http://gerrit.cloudera.org:8080/21409
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-10 07:18:45 +00:00
wzhou-code
08f8a30025 IMPALA-12910: Support running TPCH/TPCDS queries for JDBC tables
This patch adds a script to create external JDBC tables for the TPCH
and TPCDS datasets, and adds unit tests that run TPCH and TPCDS queries
against external JDBC tables with Impala-Impala federation. Note that
JDBC tables are mapping tables; they don't take additional disk space.
It fixes a race condition in the caching of SQL DataSource objects by
introducing a new DataSourceObjectCache class, which checks the
reference count before closing a SQL DataSource.
Adds a new query option 'clean_dbcp_ds_cache' with default value true.
When it is set to false, a SQL DataSource object is not closed when its
reference count reaches 0; instead it is kept in the cache until it has
been idle for more than 5 minutes. A flag variable
'dbcp_data_source_idle_timeout_s' is added to make the duration
configurable.
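The reference-counting and idle-eviction behavior described above can be sketched as follows. This is a minimal Python sketch of the described semantics, not the actual Java class; thread-safety locking is omitted:

```python
import time

class DataSourceObjectCache:
    """Reference-counted cache of DataSource objects (illustrative)."""

    def __init__(self, idle_timeout_s=300):
        # key -> [datasource, refcount, last_release_time]
        self._entries = {}
        self._idle_timeout_s = idle_timeout_s

    def get(self, key, factory):
        """Return the cached DataSource for key, creating it on a miss,
        and bump its reference count."""
        entry = self._entries.get(key)
        if entry is None:
            entry = [factory(), 0, time.time()]
            self._entries[key] = entry
        entry[1] += 1
        return entry[0]

    def release(self, key, clean_if_unused=True):
        """Drop one reference. Evict immediately only when the count hits
        zero and clean-on-zero is enabled ('clean_dbcp_ds_cache'=true);
        otherwise leave the entry for the idle-cleanup pass."""
        entry = self._entries[key]
        entry[1] -= 1
        entry[2] = time.time()
        if clean_if_unused and entry[1] == 0:
            del self._entries[key]

    def evict_idle(self, now=None):
        """Cleanup-thread pass: evict unreferenced entries idle too long."""
        now = time.time() if now is None else now
        stale = [k for k, e in self._entries.items()
                 if e[1] == 0 and now - e[2] > self._idle_timeout_s]
        for k in stale:
            del self._entries[k]
```

With clean-on-zero disabled, two users of the same key share one DataSource, and the object survives until the cleanup pass finds it idle past the timeout.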
java.sql.Connection.close() sometimes fails to remove a closed
connection from the connection pool, which causes JDBC worker threads to
wait a long time for available connections from the pool. The
workaround is to call the BasicDataSource.invalidateConnection() API to
close a connection.
Two flag variables are added for the DBCP configuration properties
'maxTotal' and 'maxWaitMillis'. Note that the 'maxActive' and 'maxWait'
properties were renamed to 'maxTotal' and 'maxWaitMillis' respectively
in apache.commons.dbcp v2.
Fixes a bug in database type comparison: the type strings specified by
the user could be lower case or a mix of upper and lower case, but the
code compared them against an upper-case string.
Fixes an issue where the SQL DataSource object was not closed in
JdbcDataSource.open() and JdbcDataSource.getNext() when errors were
returned from DBCP APIs or JDBC drivers.

testdata/bin/create-tpc-jdbc-tables.py supports creating JDBC tables
for Impala-Impala, Postgres, and MySQL.
The following sample commands create TPCDS JDBC tables for Impala-Impala
federation with a remote coordinator running at 10.19.10.86, and a
Postgres server running at 10.19.10.86:
  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=IMPALA --database_host=10.19.10.86 --clean

  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=POSTGRES --database_host=10.19.10.86 \
    --database_name=tpcds --clean

TPCDS tests for JDBC tables run only in release/exhaustive builds.
TPCH tests for JDBC tables run in core and exhaustive builds, except
Dockerized builds.

Remaining Issues:
 - tpcds-decimal_v2-q80a failed with returned rows not matching expected
   results for some decimal values. This will be fixed in IMPALA-13018.

Testing:
 - Passed core tests.
 - Passed query_test/test_tpcds_queries.py in release/exhaustive build.
 - Manually verified that only one SQL DataSource object was created for
   test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since query option
   'clean_dbcp_ds_cache' was set as false, and the SQL DataSource object
   was closed by cleanup thread.

Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a
Reviewed-on: http://gerrit.cloudera.org:8080/21304
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-02 02:14:20 +00:00
Riza Suminto
6abfdbc56c IMPALA-12980: Translate CpuAsk into admission control slots
Impala has a concept of "admission control slots": the amount of
parallelism that should be allowed on an Impala daemon. This defaults to
the number of processors per executor and can be overridden with the
--admission_control_slots flag.

Admission control slot accounting is described in IMPALA-8998. It
computes 'slots_to_use' for each backend based on the maximum number of
instances of any fragment on that backend. This can lead to slot
underestimation and query overadmission. For example, assume an executor
node with 48 CPU cores, configured with --admission_control_slots=48,
that is assigned 4 non-blocking query fragments, each with 12 instances
scheduled on this executor. The IMPALA-8998 algorithm will request slots
equal to the max instance count (12) rather than the sum of all
non-blocking fragment instances (48). With the 36 remaining slots free,
the executor can still admit another fragment from a different query,
which will potentially contend for CPU with the one that is currently
running.
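The underestimation in the example comes down to taking the max rather than the sum over per-fragment instance counts on a backend. A tiny Python sketch of the two accountings (helper names are hypothetical):

```python
def slots_largest_fragment(instances_per_fragment):
    """IMPALA-8998 accounting: slots = max instance count of any
    fragment scheduled on the backend."""
    return max(instances_per_fragment)

def slots_sum_of_fragments(instances_per_fragment):
    """Sum over all non-blocking fragments on the backend, which better
    reflects the parallelism they can exert at once."""
    return sum(instances_per_fragment)
```

For the 4 fragments with 12 instances each, the first accounting yields 12 slots while the second yields 48, matching the scenario above.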

When COMPUTE_PROCESSING_COST is enabled, the Planner generates a CpuAsk
number that represents the CPU requirement of that query over a
particular executor group set. This number is an estimate of the
largest number of query fragment instances that can run in parallel
without waiting, given by the blocking operator analysis. Therefore,
the fragment trace that sums into that CpuAsk number can be translated
into 'slots_to_use' as well, which more closely resembles the maximum
parallel execution of fragment instances.

This patch adds a new query option called SLOT_COUNT_STRATEGY to control
which admission control slot accounting to use. There are two possible
values:
- LARGEST_FRAGMENT, which is the original algorithm from IMPALA-8998.
  This is still the default value for the SLOT_COUNT_STRATEGY option.
- PLANNER_CPU_ASK, which follows the fragment trace that contributes
  towards the CpuAsk number. This strategy schedules at least as many
  admission control slots as the LARGEST_FRAGMENT strategy.

To implement the PLANNER_CPU_ASK strategy, the Planner marks fragments
that contribute to CpuAsk as dominant fragments. It also passes the
max_slot_per_executor information that it knows about the executor group
set to the scheduler.

An AvgAdmissionSlotsPerExecutor counter is added to describe what the
Planner thinks the average 'slots_to_use' per backend will be, following
this formula:

  AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)

Actual 'slots_to_use' in each backend may differ from
AvgAdmissionSlotsPerExecutor, depending on what is scheduled on that
backend. 'slots_to_use' is shown as the 'AdmissionSlots' counter under
each executor profile node.
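The formula above, as a one-liner sketch in Python:

```python
import math

def avg_admission_slots_per_executor(cpu_ask, num_executors):
    """AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)"""
    return math.ceil(cpu_ask / num_executors)
```

For instance, a CpuAsk of 48 over 4 executors averages 12 slots per executor, and the ceiling rounds up any remainder.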

Testing:
- Update test_executors.py with AvgAdmissionSlotsPerExecutor assertion.
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost.
- Add EE test test_processing_cost.py.
- Add FE test PlannerTest#testProcessingCostPlanAdmissionSlots.

Change-Id: I338ca96555bfe8d07afce0320b3688a0861663f2
Reviewed-on: http://gerrit.cloudera.org:8080/21257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 21:58:13 +00:00
Riza Suminto
379038f763 IMPALA-12429: Reduce parallelism for TPC-DS q51a and q67a test.
TestTpcdsQueryWithProcessingCost.test_tpcds_q51a and
TestTpcdsQuery.test_tpcds_q67a have been intermittently failing with
memory oversubscription errors. The fact that the test minicluster
starts 3 impalads on a single host probably makes admission control less
effective in preventing these queries from running in parallel with
others.

This patch keeps both tests, but reduces max_fragment_instances_per_node
from 4 to 2 to lower their memory requirements.

Before patch:

q51a
Max Per-Host Resource Reservation: Memory=3.08GB Threads=129
Per-Host Resource Estimates: Memory=124.24GB
Per Host Min Memory Reservation: localhost:27001(2.93 GB) localhost:27002(1.97 GB) localhost:27000(2.82 GB)
Per Host Number of Fragment Instances: localhost:27001(115) localhost:27002(79) localhost:27000(119)
Admission result: Admitted immediately
Cluster Memory Admitted: 33.00 GB
Per Node Peak Memory Usage: localhost:27000(2.84 GB) localhost:27002(1.99 GB) localhost:27001(2.95 GB)
Per Node Bytes Read: localhost:27000(62.08 MB) localhost:27002(45.71 MB) localhost:27001(47.39 MB)

q67a
Max Per-Host Resource Reservation: Memory=2.15GB Threads=105
Per-Host Resource Estimates: Memory=4.48GB
Per Host Min Memory Reservation: localhost:27001(2.13 GB) localhost:27002(2.13 GB) localhost:27000(2.15 GB)
Per Host Number of Fragment Instances: localhost:27001(76) localhost:27002(76) localhost:27000(105)
Cluster Memory Admitted: 13.44 GB
Per Node Peak Memory Usage: localhost:27000(2.24 GB) localhost:27002(2.21 GB) localhost:27001(2.21 GB)
Per Node Bytes Read: localhost:27000(112.79 MB) localhost:27002(109.57 MB) localhost:27001(105.16 MB)

After patch:

q51a
Max Per-Host Resource Reservation: Memory=2.00GB Threads=79
Per-Host Resource Estimates: Memory=118.75GB
Per Host Min Memory Reservation: localhost:27001(1.84 GB) localhost:27002(1.28 GB) localhost:27000(1.86 GB)
Per Host Number of Fragment Instances: localhost:27001(65) localhost:27002(46) localhost:27000(74)
Cluster Memory Admitted: 33.00 GB
Per Node Peak Memory Usage: localhost:27000(1.88 GB) localhost:27002(1.31 GB) localhost:27001(1.88 GB)
Per Node Bytes Read: localhost:27000(62.08 MB) localhost:27002(45.71 MB) localhost:27001(47.39 MB)

q67a
Max Per-Host Resource Reservation: Memory=1.31GB Threads=85
Per-Host Resource Estimates: Memory=3.76GB
Per Host Min Memory Reservation: localhost:27001(1.29 GB) localhost:27002(1.29 GB) localhost:27000(1.31 GB)
Per Host Number of Fragment Instances: localhost:27001(56) localhost:27002(56) localhost:27000(85)
Cluster Memory Admitted: 11.28 GB
Per Node Peak Memory Usage: localhost:27000(1.35 GB) localhost:27002(1.32 GB) localhost:27001(1.33 GB)
Per Node Bytes Read: localhost:27000(112.79 MB) localhost:27002(109.57 MB) localhost:27001(105.16 MB)

Testing:
- Pass test_tpcds_queries.py in local machine.

Change-Id: I6ae5aeb97a8353d5eaa4d85e3f600513f42f7cf4
Reviewed-on: http://gerrit.cloudera.org:8080/20581
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-20 20:31:43 +00:00
Riza Suminto
baddaf2241 IMPALA-12144: Skip TestTpcdsQueryWithProcessingCost if dockerised
There are signs of flakiness in TestTpcdsQueryWithProcessingCost in the
dockerised environment, seemingly due to the tighter per-process memory
limit there. This patch skips TestTpcdsQueryWithProcessingCost in the
dockerised environment.

Testing:
- Hack SkipIfDockerizedCluster.insufficient_mem_limit to return True if
  IS_HDFS and confirm that the whole TestTpcdsQueryWithProcessingCost is
  skipped.

Change-Id: Ibb6b2d4258a2c6613d1954552f21641b42cb3c38
Reviewed-on: http://gerrit.cloudera.org:8080/19892
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-16 20:40:11 +00:00
Riza Suminto
1d0b111bcf IMPALA-12091: Control scan parallelism by its processing cost
Before this patch, Impala still relied on the MT_DOP option to decide
the degree of parallelism of the scan fragment when a query runs with
COMPUTE_PROCESSING_COST=1. This patch adds the scan node's processing
cost as another consideration to raise scan parallelism beyond MT_DOP.

Scan node cost is now adjusted to also consider the number of effective
scan ranges. Each scan range is given a weight of (0.5% *
min_processing_per_thread), which roughly means that one scan node
instance can handle at most 200 scan ranges.
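The range-weight adjustment can be sketched as below. This is an illustrative Python sketch, not the actual Java costing code; the 10M default for min_processing_per_thread is taken from the IMPALA-11604 commit description:

```python
def scan_node_cost(base_cost, num_scan_ranges,
                   min_processing_per_thread=10_000_000):
    """Add a per-scan-range weight of 0.5% of min_processing_per_thread,
    so roughly 200 ranges add up to one instance's worth of processing."""
    range_weight = 0.005 * min_processing_per_thread
    return base_cost + num_scan_ranges * range_weight
```

With the default, 200 scan ranges alone contribute 10M cost units, i.e. one full instance's processing load.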

Query option MAX_FRAGMENT_INSTANCES_PER_NODE is added as an upper
bound on scan parallelism if COMPUTE_PROCESSING_COST=true. If the number
of scan ranges is fewer than the maximum parallelism allowed by the scan
node's processing cost, that processing cost will be clamped down
to (min_processing_per_thread / number of scan ranges). Lowering
MAX_FRAGMENT_INSTANCES_PER_NODE can also clamp down the scan processing
cost in a similar way. For interior fragments, a combination of
MAX_FRAGMENT_INSTANCES_PER_NODE, PROCESSING_COST_MIN_THREADS, and the
number of available cores per node is taken into account to determine
the maximum fragment parallelism per node. For scan fragments, only the
first two are considered, to encourage the Frontend to choose a larger
executor group as needed.

Two new pieces of static state are added to exec-node.h:
is_mt_fragment_ and num_instances_per_node_. The backend code that
refers to the MT_DOP option is replaced with either is_mt_fragment_ or
num_instances_per_node_.

Two new criteria are added during effective parallelism calculation in
PlanFragment.adjustToMaxParallelism():

- If a fragment has UnionNode, its parallelism is the maximum between
  its input fragments and its collocated ScanNode's expected
  parallelism.
- If a fragment only has a single ScanNode (and no UnionNode), its
  parallelism is calculated in the same fashion as the interior fragment
  but will not be lowered anymore since it will not have any child
  fragment to compare with.

Admission control slots remain unchanged. This may cause a query to
fail admission if the Planner selects a scan parallelism that is higher
than the configured admission control slots value. Setting
MAX_FRAGMENT_INSTANCES_PER_NODE equal to or lower than the configured
admission control slots value can help lower scan parallelism and pass
admission control.

The previous workaround to control scan parallelism by IMPALA-12029 is
now removed. This patch also disables IMPALA-10287 optimization if
COMPUTE_PROCESSING_COST=true. This is because IMPALA-10287 relies on a
fixed number of fragment instances in DistributedPlanner.java. However,
effective parallelism calculation is done much later and may change the
final number of instances of hash join fragment, rendering
DistributionMode selected by IMPALA-10287 inaccurate.

This patch is benchmarked using single_node_perf_run.py with the
following parameters:

args="-gen_experimental_profile=true -default_query_options="
args+="mt_dop=4,compute_processing_cost=1,processing_cost_min_threads=1 "
./bin/single_node_perf_run.py --num_impalads=3 --scale=10 \
    --workloads=tpcds --iterations=5 --table_formats=parquet/none/none \
    --impalad_args="$args" \
    --query_names=TPCDS-Q3,TPCDS-Q14-1,TPCDS-Q14-2,TPCDS-Q23-1,TPCDS-Q23-2,TPCDS-Q49,TPCDS-Q76,TPCDS-Q78,TPCDS-Q80A \
    "IMPALA-12091~1" IMPALA-12091

The benchmark result is as follows:
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload  | Query       | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| TPCDS(10) | TPCDS-Q23-1 | parquet / none / none | 4.62   | 4.54        |   +1.92%   |   0.23%    |   1.59%        | 5     |   +2.32%       | 1.15    | 2.67  |
| TPCDS(10) | TPCDS-Q14-1 | parquet / none / none | 5.82   | 5.76        |   +1.08%   |   5.27%    |   3.89%        | 5     |   +2.04%       | 0.00    | 0.37  |
| TPCDS(10) | TPCDS-Q23-2 | parquet / none / none | 4.65   | 4.58        |   +1.38%   |   1.97%    |   0.48%        | 5     |   +0.81%       | 0.87    | 1.51  |
| TPCDS(10) | TPCDS-Q49   | parquet / none / none | 1.49   | 1.48        |   +0.46%   | * 36.02% * | * 34.95% *     | 5     |   +1.26%       | 0.58    | 0.02  |
| TPCDS(10) | TPCDS-Q14-2 | parquet / none / none | 3.76   | 3.75        |   +0.39%   |   1.67%    |   0.58%        | 5     |   -0.03%       | -0.58   | 0.49  |
| TPCDS(10) | TPCDS-Q78   | parquet / none / none | 2.80   | 2.80        |   -0.04%   |   1.32%    |   1.33%        | 5     |   -0.42%       | -0.29   | -0.05 |
| TPCDS(10) | TPCDS-Q80A  | parquet / none / none | 2.87   | 2.89        |   -0.51%   |   1.33%    |   0.40%        | 5     |   -0.01%       | -0.29   | -0.82 |
| TPCDS(10) | TPCDS-Q3    | parquet / none / none | 0.18   | 0.19        |   -1.29%   | * 15.26% * | * 15.87% *     | 5     |   -0.54%       | -0.87   | -0.13 |
| TPCDS(10) | TPCDS-Q76   | parquet / none / none | 1.08   | 1.11        |   -2.98%   |   0.92%    |   1.70%        | 5     |   -3.99%       | -2.02   | -3.47 |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Pass PlannerTest.testProcessingCost
- Pass test_executor_groups.py
- Reenable test_tpcds_q51a in TestTpcdsQueryWithProcessingCost with
  MAX_FRAGMENT_INSTANCES_PER_NODE set to 5
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost
- Pass core tests

Change-Id: If948e45455275d9a61a6cd5d6a30a8b98a7c729a
Reviewed-on: http://gerrit.cloudera.org:8080/19807
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-11 22:46:31 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Riza Suminto
dafc0fb7a8 IMPALA-11604 (part 2): Compute Effective Parallelism of Query
Part 1 of IMPALA-11604 implements the ProcessingCost model for each
PlanNode and DataSink. This second part builds on top of ProcessingCost
model by adjusting the number of instances for each fragment after
considering their production-consumption ratio, and then finally returns
a number representing an ideal CPU core count required for a query to
run efficiently. A more detailed explanation of the CPU costing
algorithm can be found in the three steps below.

I. Compute the total ProcessingCost of a fragment.

The costing algorithm splits a query fragment into several segments
divided by blocking PlanNode/DataSink boundary. Each fragment segment is
a subtree of PlanNodes/DataSink in the fragment with a DataSink or
blocking PlanNode as root and non-blocking leaves. All other nodes in
the segment are non-blocking. PlanNodes or DataSink that belong to the
same segment will have their ProcessingCost summed. A new CostingSegment
class is added to represent this segment.

A fragment that has a blocking PlanNode or blocking DataSink is called a
blocking fragment. Currently, only JoinBuildSink is considered a
blocking DataSink. A fragment without any blocking nodes is called a
non-blocking fragment. Step III discusses blocking and non-blocking
fragments further.

Take as an example the following fragment plan, which is blocking since
it has 3 blocking PlanNodes: 12:AGGREGATE, 06:SORT, and 08:TOP-N.

  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=6 (adjusted from 12)
  fragment-costs=[34974657, 2159270, 23752870, 22]
  08:TOP-N [LIMIT=100]
  |  cost=900
  |
  07:ANALYTIC
  |  cost=23751970
  |
  06:SORT
  |  cost=2159270
  |
  12:AGGREGATE [FINALIZE]
  |  cost=34548320
  |
  11:EXCHANGE [HASH(i_class)]
     cost=426337

In bottom-up direction, there exist four segments in F03:
  Blocking segment 1: (11:EXCHANGE, 12:AGGREGATE)
  Blocking segment 2: 06:SORT
  Blocking segment 3: (07:ANALYTIC, 08:TOP-N)
  Non-blocking segment 4: DataStreamSink of F03

Therefore we have:
  PC(segment 1) = 426337+34548320
  PC(segment 2) = 2159270
  PC(segment 3) = 23751970+900
  PC(segment 4) = 22

These per-segment costs are stored in a CostingSegment tree rooted at
PlanFragment.rootSegment_, and are [34974657, 2159270, 23752870, 22]
respectively after the post-order traversal.
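The per-segment sums above can be checked with a short Python sketch:

```python
# Node costs grouped by the blocking-boundary segments of fragment F03.
segments = [
    [426337, 34548320],  # segment 1: 11:EXCHANGE + 12:AGGREGATE
    [2159270],           # segment 2: 06:SORT
    [23751970, 900],     # segment 3: 07:ANALYTIC + 08:TOP-N
    [22],                # segment 4: DataStreamSink of F03
]
fragment_costs = [sum(seg) for seg in segments]
print(fragment_costs)  # [34974657, 2159270, 23752870, 22]
```

The result matches the fragment-costs annotation shown in the F03 plan snippet.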

This is implemented in PlanFragment.computeCostingSegment() and
PlanFragment.collectCostingSegmentHelper().

II. Compute the effective degree of parallelism (EDoP) of fragments.

The costing algorithm walks the PlanFragments of the query plan tree in
post-order traversal. Upon visiting a PlanFragment, the costing
algorithm attempts to adjust the number of instances (effective
parallelism) of that fragment by comparing the ProcessingCost of its
child's last segment and the production-consumption rate between its
adjacent segments from step I. To simplify this initial implementation,
the parallelism of a PlanFragment containing an EmptySetNode, UnionNode,
or ScanNode remains unchanged (it follows MT_DOP).

This step is implemented at PlanFragment.traverseEffectiveParallelism().

III. Compute the EDoP of the query.

Effective parallelism of a query is the upper bound of the CPU core
count that can work on the query in parallel, considering the overlap
between fragment execution and blocking operators. We compute this in a
similar post-order traversal as step II and split the query tree into
blocking fragment subtrees similar to step I. The following is an
example of a query plan from TPCDS-Q12.

  F04:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
  PLAN-ROOT SINK
  |
  13:MERGING-EXCHANGE [UNPARTITIONED]
  |
  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=3 (adjusted from 12)
  08:TOP-N [LIMIT=100]
  |
  07:ANALYTIC
  |
  06:SORT
  |
  12:AGGREGATE [FINALIZE]
  |
  11:EXCHANGE [HASH(i_class)]
  |
  F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=12
  05:AGGREGATE [STREAMING]
  |
  04:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F05:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  10:EXCHANGE [BROADCAST]
  |  |
  |  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  02:SCAN HDFS [tpcds10_parquet.date_dim, RANDOM]
  |
  03:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F06:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  09:EXCHANGE [BROADCAST]
  |  |
  |  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  01:SCAN HDFS [tpcds10_parquet.item, RANDOM]
  |
  00:SCAN HDFS [tpcds10_parquet.web_sales, RANDOM]

A blocking fragment is a fragment that has a blocking PlanNode or
blocking DataSink in it. The costing algorithm splits the query plan
tree into blocking subtrees divided by blocking fragment boundary. Each
blocking subtree has a blocking fragment as a root and non-blocking
fragments as the intermediate or leaf nodes. From the TPCDS-Q12 example
above, the query plan is divided into five blocking subtrees of
[(F05, F02), (F06, F01), F00, F03, F04].

A CoreCount is a container class that represents the CPU core
requirement of a subtree of a query or the query itself. Each blocking
subtree will have its fragment's adjusted instance count summed into a
single CoreCount. This means that all fragments within a blocking
subtree can run in parallel and should be assigned one core per fragment
instance. The CoreCount for each blocking subtree in the TPCDS-Q12
example is [4, 4, 12, 3, 1].

Upon visiting a blocking fragment, the maximum between current
CoreCount (rooted at that blocking fragment) vs previous blocking
subtrees CoreCount is taken and the algorithm continues up to the next
ancestor PlanFragment. The final CoreCount for the TPCDS-Q12 example is
12.
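The final reduction over blocking subtrees can be sketched in Python (the real CoreCount class lives in the Java frontend; this only illustrates the max-over-subtrees step):

```python
def query_core_count(blocking_subtree_cores):
    """Fragments inside one blocking subtree run concurrently (their
    instance counts were already summed per subtree); across blocking
    boundaries only one subtree is active at a time, so the query's
    CoreCount is the max over the subtrees."""
    return max(blocking_subtree_cores)
```

Applied to the TPCDS-Q12 subtree CoreCounts [4, 4, 12, 3, 1], this yields the final CoreCount of 12.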

This step is implemented at Planner.computeBlockingAwareCores() and
PlanFragment.traverseBlockingAwareCores().

The resulting CoreCount at the root PlanFragment is then taken as the
ideal CPU core count / EDoP of the query. This number is compared
against the total CPU count of an Impala executor group to determine
whether the query fits to run in that group. A backend flag
query_cpu_count_divisor is added to help scale the EDoP of a query down
or up if needed.

Two query options are added to control the entire computation of EDoP.
1. COMPUTE_PROCESSING_COST
   Control whether to enable this CPU costing algorithm or not.
   Must also set MT_DOP > 0 for this query option to take effect.

2. PROCESSING_COST_MIN_THREADS
   Controls the minimum number of fragment instances (threads) that the
   costing algorithm is allowed to adjust. The costing algorithm is in
   charge of increasing the fragment's instance count beyond this
   minimum number through producer-consumer rate comparison. The
   maximum number of fragment instances is the max of
   PROCESSING_COST_MIN_THREADS, MT_DOP, and the number of cores per
   executor.

This patch also adds three backend flags to tune the algorithm.
1. query_cpu_count_divisor
   Divides the CPU requirement of a query to fit the total available
   CPU in the executor group. For example, setting the value 2 fits a
   query with CPU requirement 2X into an executor group with total
   available CPU X. Note that setting a fractional value less than 1
   effectively multiplies the query's CPU requirement. A valid value is
   > 0.0. The default value is 1.
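The divisor's effect, sketched in Python (the helper name is hypothetical; only the arithmetic is from the flag's description):

```python
def effective_cpu_ask(core_count, query_cpu_count_divisor=1.0):
    """Scale the query's CPU requirement: a divisor > 1 shrinks it,
    a fractional divisor < 1 inflates it."""
    assert query_cpu_count_divisor > 0.0
    return core_count / query_cpu_count_divisor
```

A query asking for 24 cores fits a 12-core group with divisor 2, while divisor 0.5 doubles the requirement.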

2. processing_cost_use_equal_expr_weight
   If true, all expression evaluations are weighted equally to 1 during
   the plan node's processing cost calculation. If false, expression
   cost from IMPALA-2805 will be used. Default to true.

3. min_processing_per_thread
   Minimum processing load (in processing cost unit) that a fragment
   instance needs to work on before planner considers increasing
   instance count based on the processing cost rather than the MT_DOP
   setting. The decision is per fragment. Setting this to high number
   will reduce parallelism of a fragment (more workload per fragment),
   while setting to low number will increase parallelism (less workload
   per fragment). Actual parallelism might still be constrained by the
   total number of cores in selected executor group, MT_DOP, or
   PROCESSING_COST_MIN_THREAD query option. Must be a positive integer.
   Currently default to 10M.

As an example, the following are additional ProcessingCost information
printed to coordinator log for Q3, Q12, and Q15 ran on TPCDS 10GB scale,
3 executors, MT_DOP=4, PROCESSING_COST_MAX_THREADS=4, and
processing_cost_use_equal_expr_weight=false.

  Q3
  CoreCount={total=12 trace=F00:12}

  Q12
  CoreCount={total=12 trace=F00:12}

  Q15
  CoreCount={total=15 trace=N07:3+F00:12}

There are a few TODOs which will be done in follow up tasks:
1. Factor in row width in ProcessingCost calculation (IMPALA-11972).
2. Tune the individual expression cost from IMPALA-2805.
3. Benchmark and tune min_processing_per_thread with an optimal value.
4. Revisit cases where cardinality is not available (getCardinality() or
   getInputCardinality() return -1).
5. Bound SCAN and UNION fragments by ProcessingCost as well (need to
   address IMPALA-8081).

Testing:
- Add TestTpcdsQueryWithProcessingCost, which is a similar run of
  TestTpcdsQuery, but with COMPUTE_PROCESSING_COST=1 and MT_DOP=4.
  Setting log level TRACE for PlanFragment and manually running
  TestTpcdsQueryWithProcessingCost in the minicluster shows several
  fragment instance count reductions from 12 to 9, 6, or 3 in the
  coordinator log.
- Add PlannerTest#testProcessingCost
  Adjusted fragment count is indicated by "(adjusted from 12)" in the
  query profile.
- Add TestExecutorGroups::test_query_cpu_count_divisor.

Co-authored-by: Qifan Chen <qchen@cloudera.com>

Change-Id: Ibb2a796fdf78336e95991955d89c671ec82be62e
Reviewed-on: http://gerrit.cloudera.org:8080/19593
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-08 15:32:28 +00:00
Shant Hovsepian
3f1b1476af IMPALA-10034: Add remaining TPC-DS queries to workload.
Include remaining TPC-DS queries to the testdata workload definition.

Q8 and Q38 were using non-standard variants; those have been
replaced by the official query versions. Q35 uses an official
variant. A table alias in Q90 had to be escaped because we treat
'AT' as a reserved keyword.

Change-Id: Id5436689390f149694f14e6da1df624de4f5f7ad
Reviewed-on: http://gerrit.cloudera.org:8080/16280
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-08-24 16:02:45 +00:00
Shant Hovsepian
ea3f073881 IMPALA-9943,IMPALA-4974: INTERSECT/EXCEPT [DISTINCT]
INTERSECT and EXCEPT set operations are implemented as rewrites to
joins. Currently only the DISTINCT-qualified operators are implemented,
not the ALL-qualified ones. The operator MINUS is supported as an alias
for EXCEPT.
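
The DISTINCT-qualified semantics being implemented can be illustrated
with a small Python sketch (a toy model of the join rewrites, not the
planner code; INTERSECT DISTINCT behaves like a distinct semi join and
EXCEPT DISTINCT like a distinct anti join):

```python
def intersect_distinct(left, right):
    # INTERSECT DISTINCT: distinct left rows that find a match (semi join).
    return sorted(set(left) & set(right))

def except_distinct(left, right):
    # EXCEPT DISTINCT: distinct left rows with no match (anti join).
    return sorted(set(left) - set(right))

print(intersect_distinct([1, 2, 2, 3], [2, 3, 4]))  # [2, 3]
print(except_distinct([1, 2, 2, 3], [2, 3, 4]))     # [1]
```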

We mimic Oracle and Hive's non-standard implementation which treats all
operators with the same precedence, as opposed to the SQL Standard of
giving INTERSECT higher precedence.

A new class, SetOperationStmt, was created to encompass the previous
UnionStmt behavior. UnionStmt is preserved as a special case for
union-only operands to ensure compatibility with previous union
planning behavior.

Tests:
* Added parser and analyzer tests.
* Ensured no test failures or plan changes for union tests.
* Added TPC-DS queries 14,38,87 to functional and planner tests.
* Added functional tests test_intersect test_except
* New planner testSetOperationStmt

Change-Id: I5be46f824217218146ad48b30767af0fc7edbc0f
Reviewed-on: http://gerrit.cloudera.org:8080/16123
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-31 17:23:45 +00:00
Tim Armstrong
574fef2a76 IMPALA-9917: grouping() and grouping_id() support
Implements the grouping() and grouping_id() builtins.
grouping_id() has both a no-arg version, which returns
a bit vector over all grouping exprs, and a varargs version,
which returns a bit vector over the provided arguments.
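
The bit-vector behavior can be sketched in Python (an illustrative toy
model, not the builtin implementation; it assumes the first grouping
expr maps to the most significant bit, and that grouping() returns 1
for an expr rolled up in the current aggregation class):

```python
def grouping(expr, active_exprs):
    # 0 if the expr is grouped in this aggregation class, 1 if rolled up.
    return 0 if expr in active_exprs else 1

def grouping_id(all_exprs, active_exprs, args=None):
    # Bit vector over `args`, or over all grouping exprs when no args
    # are given (assumed bit order: first expr = most significant bit).
    exprs = args if args is not None else all_exprs
    gid = 0
    for e in exprs:
        gid = (gid << 1) | grouping(e, active_exprs)
    return gid

# ROLLUP(a, b) at the (a)-only level: b is rolled up.
print(grouping_id(["a", "b"], {"a"}))         # 1 (binary 01)
print(grouping_id(["a", "b"], {"a"}, ["b"]))  # 1
```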

GROUPING is a keyword, so it needs special handling in the
parser to be accepted as a function name.

These functions are implemented in the transpose agg
with a CASE expression similar to other aggregate functions,
but returning the grouping() or grouping_id() value for that
aggregation class instead of an aggregated value.

Testing:
* Added parser test for grouping keyword.
* Added analysis tests for the functions.
* Added basic planner test to show expressions generated
* Added some TPC-DS queries that use grouping() - queries
  80, 70 and 86 using reference .test files from Fang-Yu
  Rao. 27 and 36 were added with reference results from
  https://github.com/cwida/tpcds-result-reproduction
* Add targeted end-to-end tests.
* Added view compatibility test with Hive.

Change-Id: If0b1640d606256c0fe9204d2a21a8f6d06abcdb6
Reviewed-on: http://gerrit.cloudera.org:8080/16140
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-14 03:13:18 +00:00
Tim Armstrong
3e1e7da229 IMPALA-9898: generate grouping set plans
Integrates the parsing and analysis with plan generation.

Testing:
* Add analysis test to make sure we reject unsupported queries.
* Added targeted planner tests to ensure we generate the correct
  aggregation classes for a variety of cases.
* Add targeted end-to-end functional tests.

Added five TPC-DS queries that use ROLLUP, building on some work done
by Fang-Yu Rao. Some tweaks were required for these tests.
* Add an extra ORDER BY clause to q77 to make it fully deterministic.
* Add backticks around `returns` to avoid reserved word.
* Add INTERVAL keyword to date/timestamp arithmetic.

We can run q80, too, but I haven't added or verified results yet -
that can be done in a follow-up.

Change-Id: Ie454c5bf7aee266321dee615548d7f2b71380197
Reviewed-on: http://gerrit.cloudera.org:8080/16128
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-14 03:13:18 +00:00
Tim Armstrong
fea5dffec5 IMPALA-9924: handle single subquery in or predicate
This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.

The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expressions of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
  for each left input row, meaning that for every row
  in the original query before the rewrite, we get
  the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
  the right side of the LEFT OUTER JOIN without affecting
  semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
  or <operator> (<subquery>) becomes <operator> <slot>.
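
The row-count argument above can be checked with a toy example (a
Python sketch with made-up table contents, not the actual rewrite
machinery):

```python
# Toy rewrite of: SELECT * FROM t WHERE c1 = 0 OR id IN (SELECT id FROM s)
t = [{"id": 1, "c1": 0}, {"id": 2, "c1": 9}, {"id": 3, "c1": 9}]
s = [{"id": 2}, {"id": 2}, {"id": 4}]  # duplicates on purpose

# Direct evaluation of the disjunct containing the subquery.
direct = [r for r in t if r["c1"] == 0 or r["id"] in {x["id"] for x in s}]

# Rewrite: many-to-one LEFT OUTER JOIN against the DISTINCT'd subquery,
# then replace the IN subquery with <slot> IS NOT NULL.
sub = {x["id"] for x in s}  # distinct makes the join many-to-one
joined = [dict(r, slot=r["id"] if r["id"] in sub else None) for r in t]
rewritten = [r for r in joined if r["c1"] == 0 or r["slot"] is not None]

# One output row per left input row, so the results agree.
print([r["id"] for r in direct], [r["id"] for r in rewritten])  # [1, 2] [1, 2]
```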

This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.

Correlated and uncorrelated subqueries are both supported, but
various limitations are present.
Limitations:
* Only one subquery per predicate is supported. The rewriting approach
  should generalize to multiple subqueries but other code needs
  refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
  approach can generalize to that, but we would need to add or pick a
  select list item from the subquery to check for NULL/IS NOT NULL,
  and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates plus grouping are not supported because
  we rely on adding DISTINCT to the select list, and we don't
  support DISTINCT plus aggregations because of IMPALA-5098.

Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
  return correct results. These exercise a mix of the supported
  features - correlated/uncorrelated subqueries, aggregate functions,
  EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
  of being rewritten during analysis.

Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2020-07-13 16:02:27 +00:00
Tim Armstrong
24f24b131f IMPALA-9902: add rewrite of TPC-DS q38
I generated the query with dsqgen and then
rewrote it to avoid intersect.

Testing:
Compared results to hive running the original version of the
query.

Change-Id: I81807683aa265a946729e15156bd2e33724103e1
Reviewed-on: http://gerrit.cloudera.org:8080/16118
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-06 23:47:45 +00:00
Shant Hovsepian
2dca55695e IMPALA-9784, IMPALA-9905: Uncorrelated subqueries in HAVING.
Support rewriting subqueries in the HAVING clause by nesting the
aggregation query and pulling up the subquery predicates into the outer
WHERE clause.

Testing:
  * New analyzer tests
  * New functional subquery tests
  * Added Q23, Q24 and Q44 to the tpcds workload
  * Ran subquery rewrite tests

Change-Id: I124a58a09a1a47e1222a22d84b54fe7d07844461
Reviewed-on: http://gerrit.cloudera.org:8080/16052
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Shant Hovsepian
388ad555d7 IMPALA-8954: Uncorrelated scalar subqueries in the select list
Extend StmtRewriter with the ability to rewrite scalar subqueries in the
select list into cross joins. Currently the subquery must pass plan-time
checks to determine that it returns a single row, which may miss cases
that would be valid at runtime or with more complex evaluation of the
predicate expressions in the planner. Support for correlated subqueries
will be a follow-on change.
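
The rewrite can be illustrated with a toy example (a Python sketch with
invented table contents, not the StmtRewriter code):

```python
# Toy rewrite of: SELECT t.id, (SELECT max(v) FROM s) FROM t
t = [{"id": 1}, {"id": 2}]
s = [{"v": 5}, {"v": 9}]

# The uncorrelated scalar subquery yields exactly one row...
scalar = [{"m": max(x["v"] for x in s)}]
assert len(scalar) == 1  # the plan-time single-row check

# ...so a cross join attaches it to every outer row without duplication.
result = [dict(r, **sub) for r in t for sub in scalar]
print(result)  # [{'id': 1, 'm': 9}, {'id': 2, 'm': 9}]
```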

Testing:
  * Added new analyzer tests, updated previous subquery tests
  * test_queries.py::TestQueries::test_subquery
  * Added test_tpcds_q9 to e2e and planner tests

Change-Id: Ibcf55d26889aa01d69bb85f18c9241dda095fb66
Reviewed-on: http://gerrit.cloudera.org:8080/16007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Fang-Yu Rao
2a48f7dd98 IMPALA-9890 (Part 1): Add more TPCDS queries to Impala's test suite
This patch adds the following 12 TPCDS queries to the class of
TestTpcdsDecimalV2Query: Q26, Q30, Q31, Q47, Q48, Q57, Q58, Q59, Q63,
Q83, Q85, and Q89. All the queries except for Q31 are added to the class
of TestTpcdsQuery as well because Impala returns one fewer row than
expected for TestTpcdsQuery::test_tpcds_q31(), which requires further
investigation.

To verify whether the result set returned by Impala for a given query
is correct, we compare it with the result set produced by HiveServer2
(HS2) in Impala's mini-cluster. SQL statements can be executed in HS2
via Beeline, HS2's command-line shell, which can be launched with the
following command.

beeline -u "jdbc:hive2://localhost:11050/default"

We note that among these 12 queries, executing Q31, Q58, and Q83
results in a "Counters limit exceeded" error from Tez. To work around
this problem, for these 3 queries we have to execute the following
statement before running them, to increase the default number of
counters, which is set to 120.

set tez.counters.max=1200

In addition, the 'reason' table is referenced by Q85. This table was
not referenced by any TPCDS query before this patch and thus was not
created. Therefore, this patch also modifies tpcds_schema_template.sql
to create this additional table along with its data.

Testing:
- Verified that this patch passes the exhaustive tests in the DEBUG
  build.

Change-Id: Ib5f260e75a3803aabe9ccef271ba94036f96e5cf
Reviewed-on: http://gerrit.cloudera.org:8080/16119
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-30 13:06:33 +00:00
Tim Armstrong
d4648e87b4 IMPALA-4356,IMPALA-7331: codegen all ScalarExprs
Based on initial draft patch by Pooja Nilangekar.

Codegen'd expressions can be executed in two ways - either by
being called directly from a fully codegend function, or from
interpreted code via a function pointer (previously
ScalarFnCall::scalar_fn_wrapper_).

This change moves the function pointer from ScalarFnCall to its
base class ScalarExpr, so the full expr tree can be codegen'd, not
just the ScalarFnCall subtrees. The key refactoring and improvements
are:
* ScalarExpr::Get*Val() switches between interpreted and the codegen'd
  function pointer code paths in an inline function, avoiding a
  virtual function call to ScalarFnCall::Get*Val().
* Boilerplate logic is moved to ScalarExpr::GetCodegendComputeFn(),
  which calls a virtual function GetCodegenComputeFnImpl().
* ScalarFnCall's logic for deciding whether to interpret or codegen is
  better abstracted and exposed to ScalarExpr as IsInterpretable()
  and ShouldCodegen() methods.
* The ScalarExpr::codegend_compute_fn_ function pointer is only
  populated for expressions that are "codegen entry points". These
  include the roots of expr trees and non-root expressions
  where the parent expression calls Get*Val() from the
  pseudo-codegend GetCodegendComputeFnWrapper().
* ScalarFnCall is always initialised for interpreted execution;
  otherwise the function pointer would be needed for non-root
  expressions, e.g. to support ScalarExprEvaluator::GetConstantVal().
* Latent bugs/gaps for codegen of CollectionVal are fixed. CollectionVal
  is modified to use the StringVal memory layout to allow code sharing
  with StringVal. These fixes allowed simplification of
  IsNotEmptyPredicate codegen (from IMPALA-7657).
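
The inline dispatch described in the first bullet can be sketched as a
Python analogy (the class and member names mirror but are not the
actual Impala C++ classes, and the real dispatch is an inline function
call, not Python attribute lookup):

```python
class ScalarExpr:
    """Toy analogy: the codegen'd function pointer lives on the base
    class, so a whole expr tree can dispatch without a virtual call."""
    def __init__(self):
        self.codegend_compute_fn = None  # set only for codegen entry points

    def get_val(self, row):
        # Switch between the codegen'd pointer and the interpreted path.
        if self.codegend_compute_fn is not None:
            return self.codegend_compute_fn(row)  # codegen'd path
        return self.get_val_interpreted(row)      # interpreted fallback

class AddOne(ScalarExpr):
    def get_val_interpreted(self, row):
        return row + 1

e = AddOne()
print(e.get_val(41))  # 42, via the interpreted path
e.codegend_compute_fn = lambda row: row + 1
print(e.get_val(41))  # 42, via the "codegen'd" function pointer
```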

I chose to tackle two problems in one change - adding support for
generating codegen'd function pointers for all ScalarExprs, and adding
the "entry point" concept - to avoid a blow-up in the number of
codegen'd entry points that could lead to longer codegen times and/or
worse code because of inlining changes.

IMPALA-7331 (CHAR codegen support functions) is also fixed because
it was simpler to enable CHAR codegen within ScalarExpr than to carry
forward the existing CHAR workarounds from ScalarFnCall. The
CHAR-specific codegen support required in the scalar expr subsystem is
very limited.  StringVal intermediates are used everywhere. Only
SlotRef actually operates on the different tuple layout, and the
required codegen support for SlotRef already exists for UDA
intermediates anyway.

Testing:
* Ran exhaustive tests.

Perf:
* Ran a basic insert benchmark, which went from 10.1s to 7.6s
  create table foo stored as parquet as
  select case when l_orderkey % 2 = 0 then 'aaa' else 'bbb' end
  from tpch30_parquet.lineitem;
* Ran a basic CHAR expr test:
  set num_nodes=1;
  set mt_dop=1;
  select count(*) from lineitem
  where cast(l_linestatus as CHAR(2)) = 'O ' and
        cast(l_returnflag as CHAR(2)) = 'N '
  The time spent in the scan went from 520ms to 220ms.
* Added perf regression test to tpcds-insert, similar to the manual
  benchmark.
* Ran single-node TPC-H with large and small scale factors, to estimate
  impact on execution perf and query startup time, respectively.

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 6.84    | -0.18%     | 4.49       | -0.31%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCH(30) | TPCH-Q20 | parquet / none / none | 2.58   | 2.47        |   +4.18%   |   1.29%   |   0.88%        | 5     |   +4.12%       | 2.31    | 5.81   |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 4.81   | 4.61        |   +4.33%   |   2.18%   |   2.15%        | 5     |   +3.91%       | 1.73    | 3.09   |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 26.45  | 26.16       |   +1.09%   |   0.37%   |   0.50%        | 5     |   +1.36%       | 2.02    | 3.94   |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 15.92  | 15.75       |   +1.09%   |   2.87%   |   1.65%        | 5     |   +0.88%       | 0.29    | 0.73   |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.38   | 2.35        |   +1.12%   |   1.64%   |   1.11%        | 5     |   +0.80%       | 1.15    | 1.26   |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.94   | 2.91        |   +1.13%   |   7.68%   |   5.37%        | 5     |   -0.34%       | -0.29   | 0.27   |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 18.10  | 18.02       |   +0.42%   |   2.70%   |   0.56%        | 5     |   +0.28%       | 0.29    | 0.34   |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 4.72   | 4.72        |   -0.04%   |   1.20%   |   1.65%        | 5     |   +0.05%       | 0.00    | -0.04  |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.92   | 3.93        |   -0.26%   |   1.08%   |   2.36%        | 5     |   +0.20%       | 0.58    | -0.23  |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.27   | 1.27        |   -0.28%   |   0.22%   |   0.88%        | 5     |   +0.09%       | 0.29    | -0.68  |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.64   | 2.65        |   -0.45%   |   1.65%   |   0.65%        | 5     |   -0.24%       | -0.58   | -0.57  |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 3.10   | 3.13        |   -0.76%   |   1.47%   |   1.12%        | 5     |   -0.21%       | -0.29   | -0.93  |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 1.20   | 1.21        |   -0.80%   |   2.26%   |   2.47%        | 5     |   -0.82%       | -1.15   | -0.53  |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 1.97   | 1.99        |   -1.37%   |   1.84%   |   3.21%        | 5     |   -0.47%       | -0.58   | -0.83  |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 11.53  | 11.63       |   -0.91%   |   0.46%   |   0.49%        | 5     |   -0.95%       | -2.02   | -3.08  |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 5.13   | 5.21        |   -1.51%   |   2.24%   |   4.05%        | 5     |   -0.94%       | -0.58   | -0.73  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 3.61   | 3.66        |   -1.40%   |   0.66%   |   0.79%        | 5     |   -1.33%       | -1.73   | -3.05  |
| TPCH(30) | TPCH-Q7  | parquet / none / none | 19.42  | 19.71       |   -1.52%   |   1.34%   |   1.39%        | 5     |   -1.22%       | -1.44   | -1.76  |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 5.08   | 5.15        |   -1.49%   |   1.34%   |   0.73%        | 5     |   -1.35%       | -1.44   | -2.20  |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 3.42   | 3.49        |   -1.92%   |   0.93%   |   1.47%        | 5     |   -1.53%       | -1.15   | -2.49  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.15   | 1.19        |   -3.17%   |   2.27%   |   1.95%        | 5     |   -4.21%       | -1.15   | -2.41  |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 9.26   | 9.63        |   -3.85%   |   0.62%   |   0.59%        | 5     |   -3.78%       | -2.31   | -10.25 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

Cluster Name: UNKNOWN
Lab Run Info: UNKNOWN
Impala Version:          impalad version 3.2.0-SNAPSHOT RELEASE ()
Baseline Impala Version: impalad version 3.2.0-SNAPSHOT RELEASE (2019-03-19)

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(2)  | parquet / none / none | 0.90    | -0.08%     | 0.80       | -0.05%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(2)  | TPCH-Q18 | parquet / none / none | 1.22   | 1.19        |   +1.93%   |   3.81%   |   4.46%        | 20    |   +3.34%       | 1.62    | 1.46  |
| TPCH(2)  | TPCH-Q10 | parquet / none / none | 0.74   | 0.73        |   +1.97%   |   3.36%   |   2.94%        | 20    |   +0.97%       | 1.88    | 1.95  |
| TPCH(2)  | TPCH-Q11 | parquet / none / none | 0.49   | 0.48        |   +1.91%   |   6.19%   |   4.64%        | 20    |   +0.25%       | 0.95    | 1.09  |
| TPCH(2)  | TPCH-Q4  | parquet / none / none | 0.43   | 0.43        |   +1.99%   |   6.26%   |   5.86%        | 20    |   +0.15%       | 0.92    | 1.03  |
| TPCH(2)  | TPCH-Q15 | parquet / none / none | 0.50   | 0.49        |   +1.82%   |   7.32%   |   6.35%        | 20    |   +0.26%       | 1.01    | 0.83  |
| TPCH(2)  | TPCH-Q1  | parquet / none / none | 0.98   | 0.97        |   +0.79%   |   4.64%   |   2.73%        | 20    |   +0.36%       | 0.77    | 0.65  |
| TPCH(2)  | TPCH-Q19 | parquet / none / none | 0.83   | 0.83        |   +0.65%   |   3.33%   |   2.80%        | 20    |   +0.44%       | 2.18    | 0.67  |
| TPCH(2)  | TPCH-Q14 | parquet / none / none | 0.62   | 0.62        |   +0.97%   |   2.86%   |   1.00%        | 20    |   +0.04%       | 0.13    | 1.42  |
| TPCH(2)  | TPCH-Q3  | parquet / none / none | 0.88   | 0.87        |   +0.57%   |   2.17%   |   1.74%        | 20    |   +0.29%       | 1.15    | 0.92  |
| TPCH(2)  | TPCH-Q12 | parquet / none / none | 0.53   | 0.53        |   +0.27%   |   4.58%   |   5.78%        | 20    |   +0.46%       | 1.47    | 0.16  |
| TPCH(2)  | TPCH-Q17 | parquet / none / none | 0.72   | 0.72        |   +0.15%   |   3.64%   |   5.55%        | 20    |   +0.21%       | 0.86    | 0.10  |
| TPCH(2)  | TPCH-Q21 | parquet / none / none | 2.05   | 2.05        |   +0.21%   |   1.99%   |   2.37%        | 20    |   +0.01%       | 0.25    | 0.30  |
| TPCH(2)  | TPCH-Q5  | parquet / none / none | 1.28   | 1.27        |   +0.24%   |   1.61%   |   1.80%        | 20    |   -0.02%       | -0.57   | 0.44  |
| TPCH(2)  | TPCH-Q13 | parquet / none / none | 1.27   | 1.27        |   -0.34%   |   1.69%   |   1.83%        | 20    |   -0.20%       | -1.65   | -0.61 |
| TPCH(2)  | TPCH-Q7  | parquet / none / none | 1.72   | 1.73        |   -0.55%   |   2.40%   |   1.69%        | 20    |   -0.03%       | -0.42   | -0.83 |
| TPCH(2)  | TPCH-Q8  | parquet / none / none | 1.27   | 1.28        |   -0.68%   |   3.10%   |   3.89%        | 20    |   -0.06%       | -0.54   | -0.62 |
| TPCH(2)  | TPCH-Q6  | parquet / none / none | 0.36   | 0.36        |   -0.84%   |   0.79%   |   3.51%        | 20    |   -0.07%       | -0.36   | -1.04 |
| TPCH(2)  | TPCH-Q2  | parquet / none / none | 0.65   | 0.65        |   -1.17%   |   4.76%   |   5.99%        | 20    |   -0.05%       | -0.25   | -0.69 |
| TPCH(2)  | TPCH-Q9  | parquet / none / none | 1.59   | 1.62        |   -2.01%   |   1.45%   |   5.12%        | 20    |   -0.16%       | -1.24   | -1.69 |
| TPCH(2)  | TPCH-Q20 | parquet / none / none | 0.68   | 0.69        |   -1.73%   |   4.35%   |   4.43%        | 20    |   -0.49%       | -1.74   | -1.25 |
| TPCH(2)  | TPCH-Q22 | parquet / none / none | 0.38   | 0.40        |   -2.89%   |   7.42%   |   6.39%        | 20    |   -0.21%       | -0.66   | -1.34 |
| TPCH(2)  | TPCH-Q16 | parquet / none / none | 0.59   | 0.62        |   -4.01%   |   6.33%   |   5.83%        | 20    |   -4.72%       | -1.39   | -2.13 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+

Change-Id: I839d7a3a2f5e1309c33a1f66013ef11628c5dc11
Reviewed-on: http://gerrit.cloudera.org:8080/12797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-15 22:34:28 +00:00
Tim Armstrong
30082232ed IMPALA-8410: enable TestTpcdsInsert by default
Fix incorrect output to match current behaviour. The test takes
24s on my system but can be run in parallel with other tests.

Change-Id: Ibf9a279d57ad74de0c77a90dde69e5c4dc563a3f
Reviewed-on: http://gerrit.cloudera.org:8080/13055
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-17 22:10:03 +00:00
Tim Armstrong
f876b320aa IMPALA-5946,IMPALA-5956: add TPC-DS q31,q59,q89
Q31: the substitution variables didn't match the TPC-DS spec.
     After fixing this, the results match up to 4 digits of
     rounding (there is some error introduced in intermediate
     calculations).
Q59: the results match the reference results up to rounding.
Q89: the results match up to 5 digits of rounding.

I verified the matches by using a spreadsheet comparing reference
and actual results.
https://docs.google.com/spreadsheets/d/1MNEqkfYRRAd3xqY6m20tTHquqjtCThDaGdizzRAQ8co/edit?usp=sharing

*https://github.com/gregrahn/tpcds-kit/blob/master/specification/TPC-DS_v2.10.0.pdf
^https://github.com/gregrahn/tpcds-kit/tree/master/answer_sets

Change-Id: I49634e8f63066773c9c78dd5585a0ee69daf720a
Reviewed-on: http://gerrit.cloudera.org:8080/11845
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-02 22:11:03 +00:00
Tim Armstrong
e3a702707c IMPALA-5950: fix TPC-DS Q35a and Q48 queries
The query text (for Q48) and substitution parameters didn't match the
TPC-DS standard for qualification queries. After fixing that, the
queries return the results expected by the TPC-DS standards.

Note that this may affect the performance of perf workloads running
tpcds-unmodified.

Change-Id: Ic7c737f68adf616738d6eb6e5a02593af25bcbaf
Reviewed-on: http://gerrit.cloudera.org:8080/11833
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-01 08:05:48 +00:00
Taras Bobrovytsky
0a1d586d2a IMPALA-4924: Enable Decimal V2 by default
In this commit we enable Decimal_V2 by default. We also update the
expected results in many of our tests.

Testing:
Ran an exhaustive test which almost passed. Updated the few failed
tests in it.

Cherry-pick: not for 2.x

Change-Id: Ibbdd05bf986b7947f106b396017faa3a0bd87fd7
Reviewed-on: http://gerrit.cloudera.org:8080/9062
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-25 04:33:11 +00:00
Taras Bobrovytsky
35a3e186d6 IMPALA-5478: Run TPCDS queries with decimal_v2 enabled
We add new TPCDS .test files that are expected to be run with decimal_v2
enabled. The new expected results were generated using Impala and I
inspected them manually.

Change-Id: Ib867c51a521ec4a087bc127d99aee4b95ba97733
Reviewed-on: http://gerrit.cloudera.org:8080/8985
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-18 03:28:51 +00:00
Tim Wood
f05bd241ea IMPALA-5376: Implement all TPCDS test cases or alternates for Impala.
Main source for TPCDS query and result definitions: https://github.com/gregrahn/tpcds-kit.
TPC-DS v2.5.0 qualification queries from G. Rahn, Cloudera, Inc.
Data set constructed in mini-cluster using $IMPALA_HOME/buildall.sh -testdata....
This commit continues previous work on IMPALA-5376 in the ASF Impala repo
and the Cloudera Gerrit service.

This commit splits multi-query tests in the TPC-DS suite definition into one
query and result set per test file, as the test framework requires.  Names for
such files have -1, -2... inner suffixes.

The portion of the TPC-DS test suite in this commit passes.
It contains no failures, as reflected by runs of
$IMPALA_HOME/tests/run-tests.py query_test/test_tpcds_queries.py ...

IMPALA-6007 addresses the TPC-DS cases that require skipping (because we don't
support them or they flap) or expected-failure (xfail, because we support them
but they fail due to bugs.)  These require some added tooling for non-Pytest
frameworks like the stress test to avoid attempting them until they work.
Tests that flap are marked to skip, with a bug ID, since they don't reliably pass or xfail.

Expected result sets come from the TPC-DS kit. Some TPC-DS test cases
in this commit have been modified in semantically neutral ways so as
to pass on Impala.

The tests/query_test/test_tpcds_queries.py driver file is authoritative for the
active/skip/xfail status for each case and a brief reason.  The following list
describes the current status as:
--- test-name
deviance from TPC-DS spec
changes made

--- tpcds-q22a.test
RESULT MISMATCH in LSD of AVG() values
FIXED, HAND_ROUNDED AVG() VALUES IN RESULT SET
--- tpcds-q26.test
RESULT MISMATCH in LSD of AVG() values
ABSENT, IMPALA-6087
--- tpcds-q28.test
RESULT MISMATCH in LSD of AVG() values
ABSENT, IMPALA-6087
--- tpcds-q30.test
UNRECOGNIZED CHARACTER
ABSENT, IMPALA-5961.
--- tpcds-q31.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-5956.
--- tpcds-q35a.test
RESULT MISMATCH
ABSENT, IMPALA-5950.
--- tpcds-q36a.test
RESULT MISMATCH
ABSENT, IMPALA-4741
--- tpcds-q47.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q48.test
RESULT MISMATCH in scalar value
ABSENT, IMPALA-5950.
--- tpcds-q49.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-5945
--- tpcds-q57.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q58.test
RESULT MISMATCH in DECIMAL values
ABSENT, IMPALA-5946
--- tpcds-q59.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q61.test
RESULT MISMATCH in DECIMAL value
FIXED. CAST RESULT QUOTIENT TO DECIMAL(15, 4), TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q63.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q64.test
RESULT MISMATCH
ADDED ORDER BY COLUMNS.
--- tpcds-q66.test
RESULT MISMATCH
ABSENT, IMPALA-4741
--- tpcds-q77a.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q78.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q83.test
RESULT MISMATCH
ABSENT, IMPALA-5945.
--- tpcds-q85.test
MISSING TABLE "reason"
ABSENT, IMPALA-5960
--- tpcds-q86a.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q89.test
RESULT MISMATCH, DECIMAL values flap
ABSENT, ADDED ROUND(2) TO 8th COLUMN, TAKE ACTUAL RESULTS AS EXPECTED, IMPALA-5956.
--- tpcds-q90.test
RESULT MISMATCH
ABSENT, IMPALA-5945.
--- tpcds-q93.test
MISSING TABLE "reason"
ABSENT, IMPALA-5960
--- tpcds-q98.test
RESULT MISMATCH
FIXED, ADDED ROUND() TO LAST COLUMN

Change-Id: I6e284888600a7a69d1f23fcb7dac21cbb13b7d66
Reviewed-on: http://gerrit.cloudera.org:8080/8102
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-23 19:32:10 +00:00
Michael Ho
f15589573b IMPALA-5376: Loads all TPC-DS tables
This change loads the missing tables in TPC-DS. In addition,
it also fixes up the loading of the partitioned table store_sales
so all partitions will be loaded. The existing TPC-DS queries are
also updated to use the parameters for qualification runs as noted
in the TPC-DS specification. Some hard-coded partition filters were
also removed. They were there due to the lack of dynamic partitioning
in the past. Some missing TPC-DS queries are also added to this change,
including query28 which discovered the infamous IMPALA-5251.

Having all tables in TPC-DS available paves the way for us to include
all supported TPCDS queries in our functional testing. Due to the change
in the data, planner tests and the E2E tests have different results than
before. The results of E2E tests were compared against the run done with
Netezza and Vertica. The divergences were all due to the truncation behavior
of decimal types in DECIMAL_V1.

Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Reviewed-on: http://gerrit.cloudera.org:8080/6877
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-27 05:19:53 +00:00
David Knupp
f590bc0da6 IMPALA-4750: Rename test infra classes so they don't mimic test classes.
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was simply to prefix
the class names with "Impala":

git grep -l 'TestDimension' | xargs \
    sed -i 's/TestDimension/ImpalaTestDimension/g'

git grep -l 'TestMatrix' | xargs \
    sed -i 's/TestMatrix/ImpalaTestMatrix/g'

git grep -l 'TestVector' | xargs \
    sed -i 's/TestVector/ImpalaTestVector/g'

The tests all passed in an exhaustive run on the upstream jenkins
server:

http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/

Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-26 23:40:22 +00:00
Matthew Jacobs
c7fa03286b IMPALA-3718: Support subset of functional-query for Kudu
Adds initial support for the functional-query test workload
for Kudu tables.

There are a few issues that make loading the functional
schema difficult on Kudu:
 1) Kudu tables must have one or more columns that together
    constitute a unique primary key.
   a) Primary key columns must currently be the first columns
      in the table definition (KUDU-1271).
   b) Primary key columns cannot be nullable (KUDU-1570).
 2) Kudu tables must be specified with distribution
    parameters.

(1) limits the tables that can be loaded without ugly
workarounds. This patch only includes important tables that
are used for relevant tests, most notably the alltypes*
family. In particular, alltypesagg is important but it does
not have a set of columns that are non-nullable and form a unique
primary key. As a result, that table is created in Kudu with
a different name and an additional BIGINT column for a PK
that is a unique index and is generated at data loading time
using the ROW_NUMBER analytic function. A view is then
wrapped around the underlying table that matches the
alltypesagg schema exactly. When KUDU-1570 is resolved, this
can be simplified.
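The workaround above can be sketched as follows. This is a minimal, hypothetical illustration in current Impala/Kudu DDL syntax, not the actual functional-schema statements; all table and column names are made up.

```sql
-- Hypothetical sketch (illustrative names and columns, not the real
-- functional schema). Kudu needs a non-nullable unique PK listed first,
-- so a synthetic BIGINT key is prepended.
CREATE TABLE alltypesagg_kudu (
  surrogate_pk BIGINT,     -- synthetic unique key, first column
  id INT,
  day INT,                 -- nullable columns can follow
  PRIMARY KEY (surrogate_pk)
)
PARTITION BY HASH (surrogate_pk) PARTITIONS 3
STORED AS KUDU;

-- Generate the key at load time with the ROW_NUMBER analytic function.
INSERT INTO alltypesagg_kudu
SELECT ROW_NUMBER() OVER (ORDER BY id) AS surrogate_pk, id, day
FROM alltypesagg_text_source;

-- Wrap a view around the table so it matches the original schema exactly.
CREATE VIEW alltypesagg AS
SELECT id, day FROM alltypesagg_kudu;
```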

(2) requires some additional considerations and custom
syntax. As a result, the DDL to create the tables is
explicitly specified in CREATE_KUDU sections in the
functional_schema_constraints.csv, and an additional
DEPENDENT_LOAD_KUDU section was added to specify custom data
loading DML that differs from the existing DEPENDENT_LOAD.

TODO: IMPALA-4005: generate_schema_statements.py needs refactoring

Tests that are not relevant or not yet supported have been
marked with xfail and a skip where appropriate.

TODO: Support remaining functional tables/tests when possible.

Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc
Reviewed-on: http://gerrit.cloudera.org:8080/4175
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:11:04 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Taras Bobrovytsky
609b80410e Clean up Python test import statements
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.

Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-07-15 23:26:18 +00:00
Casey Ching
074e5b4349 Remove hashbang from non-script python files
Many python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes it
very difficult to determine which python files are actually scripts.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.

Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-04 05:26:07 +00:00
Martin Grund
f853eeb2f0 IMPALA-1284: Allow implicit cross joins
This patch adds support for executing implicit cross joins in Impala. An
implicit cross join occurs when two tables are referenced in the FROM
clause of a select statement without specifying a join type and in the
absence of an applicable equi-join predicate.

To convert an implicit join into a cross join, we manually create the
cross join node during join reordering. When two sub-plans are
compared to each other, a hash join plan is always preferred.

As a side effect, explicit cross joins that have equi join conjuncts are
now rewritten to hash joins.
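For illustration, a minimal example of the distinction (hypothetical tables t1 and t2, not taken from the patch):

```sql
-- Implicit cross join: two table references, no join type, and no
-- applicable equi-join predicate.
SELECT * FROM t1, t2 WHERE t1.a < t2.b;

-- The equivalent explicit form:
SELECT * FROM t1 CROSS JOIN t2 WHERE t1.a < t2.b;

-- Conversely, an explicit CROSS JOIN that carries an equi-join conjunct
-- is now rewritten to a hash join:
SELECT * FROM t1 CROSS JOIN t2 WHERE t1.a = t2.b;
```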

This patch enables us to run TPC-DS queries Q61 and Q88 which are added
as planner and query tests.

Change-Id: Ifd53a78e8eb38d553eb039bfeef0216e438790ba
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4695
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 77ff7f09350d028be033d772c5c456ceb8828013)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5319
2014-11-20 11:25:33 -08:00
ishaan
10303ed440 Add partition filters to tpcds-q89 and re-enable tpcds-q47
This patch adds partition filters to tpcds-q89 to account for the lack of
dynamic partition pruning. It also re-enables tpcds-q47, which was blocked
by IMPALA-1238.
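The kind of filter involved can be sketched as follows; this is a hypothetical illustration with made-up predicate values, not the actual q89 text. Without dynamic partition pruning, a predicate on the joined date_dim table cannot prune store_sales partitions at planning time, so a predicate on the partition key itself is added by hand:

```sql
-- Illustrative only: the BETWEEN bounds are invented for this sketch.
SELECT ss_item_sk, SUM(ss_sales_price)
FROM store_sales
JOIN date_dim ON ss_sold_date_sk = d_date_sk
WHERE d_year = 1999
  -- hand-added filter on the partition key, since the d_year predicate
  -- alone cannot statically prune store_sales partitions
  AND ss_sold_date_sk BETWEEN 2451180 AND 2451544
GROUP BY ss_item_sk;
```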

Change-Id: Ied05d80565ebb29cd06b3c38d76bd31f0285028e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4453
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-10-08 16:50:16 -07:00
ishaan
e126a3c8b5 Enable more tpcds queries that use correlated subqueries and analytic functions.
This patch only operates on queries that use store_sales as the fact table.

Change-Id: I763245ef5f68bb1519bcb4d4b26ede96913a1d57
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4312
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4106
2014-09-27 01:15:41 -07:00
Nong Li
bbce0d62c9 [CDH5] Disable tpch/tpcds query and planner tests.
Change-Id: I30ecefe2db9ee7996433cda025f86ef8669284e9
2014-09-20 19:41:42 -07:00
Nong Li
2596e936c7 PHJ: remove maintaining tuple row ptrs.
The level of indirection of TupleRow* (which is just a list of Tuple*) is not
helpful. This memory is wasted, and additional logic would have to be added to
ensure that it triggers spilling. Instead, we store an offset into the tuple stream
(which is always 8 bytes and already part of the hash table structure).

Change-Id: I89eb9474746a39ba1b0543aafabdc4a8ddccf88b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4321
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-20 16:12:10 -07:00
Lenni Kuff
0ac0527643 Reduce test execution time by limiting long running tests to exhaustive exec strategy
I looked at the latest run from master and took the test suites that had long
execution times. This cleans up those test suites to either completely disable them
on 'core' or add constraints to limit the number of test vectors. It shouldn't impact
nightly coverage since we still run the same tests exhaustively.

Change-Id: I10c78c35155b00de0c36d9fc0923b2b1fc6b44de
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3119
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3125
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-06-18 16:18:17 -07:00
ishaan
db97981ab9 [CDH5] Switch the tpcds schemas to use decimal instead of float/double.
This patch converts the tpcds schemas to use decimal instead of float/double. Currently,
Impala can only read/write decimal in text; therefore, the tables are constrained to text. The
schemas were obtained from the official tpc spec:
http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf

Change-Id: I1ef0113dcb48bad52af75ee93b47b08adf9e1a69
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2403
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-06-08 11:47:23 -07:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Nong Li
707a566b5d Add test to tpcds queries to validate table row counts.
I tried to investigate the jenkins issue where we weren't returning any rows.
I set up the cluster on that box manually and noticed there weren't any results
because the store_sales table was empty. Refresh did not fix it. This looks like
a data loading issue. Adding this test would make discovering issues like this
much easier.

Change-Id: I8ccddd43892b279d506371b9de717629815c6a08
Reviewed-on: http://gerrit.ent.cloudera.com:8080/260
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:17 -08:00
Lenni Kuff
a3016cc4d4 Add partitioned tpcds insert workload and tests
Change-Id: Iff45853153bf0830be3e423c994392998385a64f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/256
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:16 -08:00
Lenni Kuff
17ed6ea177 Partition TPC-DS dataset and add additional TPC-DS workload queries
Change-Id: I5410e68fdfd818a8287e0974332c3e36c344c300
Reviewed-on: http://gerrit.ent.cloudera.com:8080/99
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:13 -08:00
Nong Li
f60f2d3e50 Implement support for grouped scan ranges in io mgr and integration with parquet. 2014-01-08 10:49:18 -08:00
Nong Li
0df9476be1 Parquet data loading. 2014-01-08 10:48:48 -08:00
Lenni Kuff
12d18631e3 Test enhancements: dynamic table format data loading, per-workload exploration strategies 2014-01-08 10:47:07 -08:00
Lenni Kuff
1b248d067b Add TPC-DS dataset and workload 2014-01-08 10:46:52 -08:00