47 Commits

Riza Suminto
3210ec58c5 IMPALA-14006: Bound max_instances in CreateInputCollocatedInstances
IMPALA-11604 (part 2) changes how many instances to create in
Scheduler::CreateInputCollocatedInstances. This works when the left
child fragment of a parent fragment is distributed across nodes.
However, if the left child fragment instance is limited to only 1
node (the case of UNPARTITIONED fragment), the scheduler might
over-parallelize the parent fragment by scheduling too many instances in
a single node.

This patch attempts to mitigate the issue in two ways. First, it adds
bounding logic in PlanFragment.traverseEffectiveParallelism() to lower
parallelism further if the left (probe) side of the child fragment is
not well distributed across nodes.

Second, it adds TQueryExecRequest.max_parallelism_per_node to relay
information from Analyzer.getMaxParallelismPerNode() to the scheduler.
With this information, the scheduler can do additional sanity checks to
prevent Scheduler::CreateInputCollocatedInstances from
over-parallelizing a fragment. Note that this sanity check can also cap
MAX_FS_WRITERS option under a similar scenario.
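The sanity check amounts to capping the scheduler's requested instance count at what the planner deemed feasible per node. A minimal Python sketch of the idea (function and parameter names are hypothetical, not Impala's actual C++ identifiers):

```python
def bound_instances(requested_instances, max_parallelism_per_node, num_nodes):
    """Cap the instance count the scheduler may create so a fragment is
    never over-parallelized beyond the planner's per-node bound."""
    return min(requested_instances, max_parallelism_per_node * num_nodes)
```

For the UNPARTITIONED case described above, a fragment limited to 1 node with a per-node bound of 12 would be capped at 12 instances even if 48 were requested.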

Added a ScalingVerdict enum and log it at TRACE level to show the
scaling decision steps.

Testing:
- Add planner test and e2e test that exercise the corner case under
  COMPUTE_PROCESSING_COST=1 option.
- Manually comment the bounding logic in traverseEffectiveParallelism()
  and confirm that the scheduler's sanity check still enforces the
  bounding.

Change-Id: I65223b820c9fd6e4267d57297b1466d4e56829b3
Reviewed-on: http://gerrit.cloudera.org:8080/22840
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-05-07 03:34:15 +00:00
Riza Suminto
f1de0c392f IMPALA-13636: Fix target file_format in TestTpcdsInsert
TestTpcdsInsert creates a temporary table to test insert functionality.
It has three problems:

1. It does not use the unique_database parameter, so the temporary
   table is not cleaned up after the test finishes.
2. It ignores file_format from the test vector, causing inconsistency in
   the temporary table's file format: str_insert is always in PARQUET
   format, while store_sales_insert is always in TEXTFILE format.
3. The text file_format dimension is never exercised, because
   --workload_exploration_strategy in run-all-tests.sh does not
   explicitly list the tpcds-insert workload.

This patch fixes all three problems and a few flake8 warnings in
test_tpcds_queries.py.

Testing:
- Run bin/run-all-tests.sh with
  EXPLORATION_STRATEGY=exhaustive
  EE_TEST=true
  EE_TEST_FILES="query_test/test_tpcds_queries.py::TestTpcdsInsert"
  Verified that the temporary table format follows file_format
  dimension.

Change-Id: Iea621ec1d6a53eba9558b0daa3a4cc97fbcc67ae
Reviewed-on: http://gerrit.cloudera.org:8080/22291
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Riza Suminto <riza.suminto@cloudera.com>
2025-01-06 22:15:12 +00:00
wzhou-code
3cbb3be5f7 IMPALA-13018: Block push down of conjuncts with implicit casting on base columns for jdbc tables
Query q80a contains a BETWEEN predicate with casts to timestamp in its
WHERE clause:
  d_date between cast('2000-08-23' as timestamp)
    and (cast('2000-08-23' as timestamp) + interval 30 days)
A BETWEEN predicate casts all of its exprs to compatible types, so the
Planner generates predicates for the DataSourceScanNode as:
  CAST(d_date AS TIMESTAMP) >= TIMESTAMP '2000-08-23 00:00:00',
  CAST(d_date AS TIMESTAMP) <= TIMESTAMP '2000-09-22 00:00:00'
However, a cast of a column to DATE/TIMESTAMP currently cannot be pushed
down to a JDBC table. This patch fixes the issue by blocking conjuncts
with an implicit unsafe cast, or a cast to DATE/TIMESTAMP, from being
added to the offered predicate list for a JDBC table.
Note that explicit casts on base columns are also not allowed to be
pushed down.
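The blocking logic can be pictured as a partition over candidate conjuncts. This is an illustrative Python sketch only; the dict shape and names are hypothetical, and the real check operates on Impala's frontend Expr tree:

```python
UNSAFE_CAST_TARGETS = {"DATE", "TIMESTAMP"}

def split_conjuncts_for_jdbc(conjuncts):
    """Partition conjuncts into those offered to the JDBC data source and
    those kept as regular Impala-evaluated predicates. A conjunct whose
    base column carries an implicit unsafe cast, or any cast to
    DATE/TIMESTAMP, is never offered."""
    offered, kept = [], []
    for c in conjuncts:
        blocked = (c.get("cast_target") in UNSAFE_CAST_TARGETS
                   or c.get("implicit_unsafe_cast", False))
        (kept if blocked else offered).append(c)
    return offered, kept
```

A conjunct like `CAST(d_date AS TIMESTAMP) >= ...` would land in the kept list and show up under "predicates" of the scan node rather than "data source predicates".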

Testing:
 - Add new planner unit tests covering explicit casts, implicit casts
   to DATE/TIMESTAMP, built-in functions, and arithmetic expressions.
   Predicates that are accepted for JDBC are shown in the plan under
   "data source predicates" of the DataSourceScanNode; predicates that
   are not accepted are shown in the plan under "predicates" of the
   DataSourceScanNode.
 - Passed all tpcds queries for JDBC tables, including q80a.
 - Passed core test

Change-Id: Iabd7e28b8d5f11f25a000dc4c9ab65895056b572
Reviewed-on: http://gerrit.cloudera.org:8080/21409
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-10 07:18:45 +00:00
wzhou-code
08f8a30025 IMPALA-12910: Support running TPCH/TPCDS queries for JDBC tables
This patch adds a script to create external JDBC tables for the TPCH
and TPCDS datasets, and adds unit tests that run TPCH and TPCDS queries
against external JDBC tables with Impala-Impala federation. Note that
JDBC tables are mapping tables; they don't take additional disk space.
It fixes a race condition in the caching of SQL DataSource objects by
introducing a new DataSourceObjectCache class, which checks the
reference count before closing a SQL DataSource.
Adds a new query option 'clean_dbcp_ds_cache' with default value true.
When it is set to false, a SQL DataSource object is not closed when its
reference count reaches 0; instead it is kept in the cache until it has
been idle for more than 5 minutes. A flag variable
'dbcp_data_source_idle_timeout_s' is added to make the duration
configurable.
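The reference-counting and idle-eviction behavior described above can be sketched as follows. This is a minimal Python sketch of the described semantics, not the actual Java class; thread-safety locking is omitted:

```python
import time

class DataSourceObjectCache:
    """Reference-counted cache of DataSource objects (illustrative)."""

    def __init__(self, idle_timeout_s=300):
        # key -> [datasource, refcount, last_release_time]
        self._entries = {}
        self._idle_timeout_s = idle_timeout_s

    def get(self, key, factory):
        """Return the cached DataSource for key, creating it on a miss,
        and bump its reference count."""
        entry = self._entries.get(key)
        if entry is None:
            entry = [factory(), 0, time.time()]
            self._entries[key] = entry
        entry[1] += 1
        return entry[0]

    def release(self, key, clean_if_unused=True):
        """Drop one reference. Evict immediately only when the count hits
        zero and clean-on-zero is enabled ('clean_dbcp_ds_cache'=true);
        otherwise leave the entry for the idle-cleanup pass."""
        entry = self._entries[key]
        entry[1] -= 1
        entry[2] = time.time()
        if clean_if_unused and entry[1] == 0:
            del self._entries[key]

    def evict_idle(self, now=None):
        """Cleanup-thread pass: evict unreferenced entries idle too long."""
        now = time.time() if now is None else now
        stale = [k for k, e in self._entries.items()
                 if e[1] == 0 and now - e[2] > self._idle_timeout_s]
        for k in stale:
            del self._entries[k]
```

With clean-on-zero disabled, two users of the same key share one DataSource, and the object survives until the cleanup pass finds it idle past the timeout.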
java.sql.Connection.close() sometimes fails to remove a closed
connection from the connection pool, which causes JDBC worker threads to
wait a long time for available connections from the pool. The
workaround is to call the BasicDataSource.invalidateConnection() API to
close a connection.
Two flag variables are added for the DBCP configuration properties
'maxTotal' and 'maxWaitMillis'. Note that the 'maxActive' and 'maxWait'
properties were renamed to 'maxTotal' and 'maxWaitMillis' respectively
in apache.commons.dbcp v2.
Fixes a bug in database type comparison: the type strings specified by
the user could be lower case or a mix of upper and lower case, but the
code compared them against an upper-case string.
Fixes an issue where the SQL DataSource object was not closed in
JdbcDataSource.open() and JdbcDataSource.getNext() when errors were
returned from DBCP APIs or JDBC drivers.

testdata/bin/create-tpc-jdbc-tables.py supports creating JDBC tables
for Impala-Impala, Postgres, and MySQL.
The following sample commands create TPCDS JDBC tables for Impala-Impala
federation with a remote coordinator running at 10.19.10.86, and a
Postgres server running at 10.19.10.86:
  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=IMPALA --database_host=10.19.10.86 --clean

  ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
    --jdbc_db_name=tpcds_jdbc --workload=tpcds \
    --database_type=POSTGRES --database_host=10.19.10.86 \
    --database_name=tpcds --clean

TPCDS tests for JDBC tables run only in release/exhaustive builds.
TPCH tests for JDBC tables run in core and exhaustive builds, except
Dockerized builds.

Remaining Issues:
 - tpcds-decimal_v2-q80a failed with returned rows not matching expected
   results for some decimal values. This will be fixed in IMPALA-13018.

Testing:
 - Passed core tests.
 - Passed query_test/test_tpcds_queries.py in release/exhaustive build.
 - Manually verified that only one SQL DataSource object was created for
   test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since query option
   'clean_dbcp_ds_cache' was set as false, and the SQL DataSource object
   was closed by cleanup thread.

Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a
Reviewed-on: http://gerrit.cloudera.org:8080/21304
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-05-02 02:14:20 +00:00
Riza Suminto
6abfdbc56c IMPALA-12980: Translate CpuAsk into admission control slots
Impala has a concept of "admission control slots": the amount of
parallelism that should be allowed on an Impala daemon. This defaults to
the number of processors per executor and can be overridden with the
--admission_control_slots flag.

Admission control slot accounting is described in IMPALA-8998. It
computes 'slots_to_use' for each backend based on the maximum number of
instances of any fragment on that backend. This can lead to slot
underestimation and query overadmission. For example, assume an executor
node with 48 CPU cores, configured with --admission_control_slots=48,
that is assigned 4 non-blocking query fragments, each with 12 instances
scheduled on this executor. The IMPALA-8998 algorithm will request slots
equal to the max instance count (12) rather than the sum of all
non-blocking fragment instances (48). With the 36 remaining slots free,
the executor can still admit another fragment from a different query,
which will potentially contend for CPU with the one that is currently
running.
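The underestimation in the example comes down to taking the max rather than the sum over per-fragment instance counts on a backend. A tiny Python sketch of the two accountings (helper names are hypothetical):

```python
def slots_largest_fragment(instances_per_fragment):
    """IMPALA-8998 accounting: slots = max instance count of any
    fragment scheduled on the backend."""
    return max(instances_per_fragment)

def slots_sum_of_fragments(instances_per_fragment):
    """Sum over all non-blocking fragments on the backend, which better
    reflects the parallelism they can exert at once."""
    return sum(instances_per_fragment)
```

For the 4 fragments with 12 instances each, the first accounting yields 12 slots while the second yields 48, matching the scenario above.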

When COMPUTE_PROCESSING_COST is enabled, the Planner generates a CpuAsk
number that represents the CPU requirement of that query over a
particular executor group set. This number is an estimate of the
largest number of query fragment instances that can run in parallel
without waiting, given by the blocking operator analysis. Therefore,
the fragment trace that sums into that CpuAsk number can be translated
into 'slots_to_use' as well, which more closely resembles the maximum
parallel execution of fragment instances.

This patch adds a new query option called SLOT_COUNT_STRATEGY to control
which admission control slot accounting to use. There are two possible
values:
- LARGEST_FRAGMENT, which is the original algorithm from IMPALA-8998.
  This is still the default value for the SLOT_COUNT_STRATEGY option.
- PLANNER_CPU_ASK, which follows the fragment trace that contributes
  towards the CpuAsk number. This strategy schedules at least as many
  admission control slots as the LARGEST_FRAGMENT strategy.

To implement the PLANNER_CPU_ASK strategy, the Planner marks fragments
that contribute to CpuAsk as dominant fragments. It also passes the
max_slot_per_executor information that it knows about the executor group
set to the scheduler.

An AvgAdmissionSlotsPerExecutor counter is added to describe what the
Planner thinks the average 'slots_to_use' per backend will be, following
this formula:

  AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)

Actual 'slots_to_use' in each backend may differ from
AvgAdmissionSlotsPerExecutor, depending on what is scheduled on that
backend. 'slots_to_use' is shown as the 'AdmissionSlots' counter under
each executor profile node.
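The formula above, as a one-liner sketch in Python:

```python
import math

def avg_admission_slots_per_executor(cpu_ask, num_executors):
    """AvgAdmissionSlotsPerExecutor = ceil(CpuAsk / num_executors)"""
    return math.ceil(cpu_ask / num_executors)
```

For instance, a CpuAsk of 48 over 4 executors averages 12 slots per executor, and the ceiling rounds up any remainder.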

Testing:
- Update test_executors.py with AvgAdmissionSlotsPerExecutor assertion.
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost.
- Add EE test test_processing_cost.py.
- Add FE test PlannerTest#testProcessingCostPlanAdmissionSlots.

Change-Id: I338ca96555bfe8d07afce0320b3688a0861663f2
Reviewed-on: http://gerrit.cloudera.org:8080/21257
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-04-18 21:58:13 +00:00
Riza Suminto
379038f763 IMPALA-12429: Reduce parallelism for TPC-DS q51a and q67a test.
TestTpcdsQueryWithProcessingCost.test_tpcds_q51a and
TestTpcdsQuery.test_tpcds_q67a have been intermittently failing with
memory oversubscription errors. The fact that the test minicluster
starts 3 impalads on a single host probably makes admission control less
effective in preventing these queries from running in parallel with
others.

This patch keeps both tests, but reduces max_fragment_instances_per_node
from 4 to 2 to lower their memory requirements.

Before patch:

q51a
Max Per-Host Resource Reservation: Memory=3.08GB Threads=129
Per-Host Resource Estimates: Memory=124.24GB
Per Host Min Memory Reservation: localhost:27001(2.93 GB) localhost:27002(1.97 GB) localhost:27000(2.82 GB)
Per Host Number of Fragment Instances: localhost:27001(115) localhost:27002(79) localhost:27000(119)
Admission result: Admitted immediately
Cluster Memory Admitted: 33.00 GB
Per Node Peak Memory Usage: localhost:27000(2.84 GB) localhost:27002(1.99 GB) localhost:27001(2.95 GB)
Per Node Bytes Read: localhost:27000(62.08 MB) localhost:27002(45.71 MB) localhost:27001(47.39 MB)

q67a
Max Per-Host Resource Reservation: Memory=2.15GB Threads=105
Per-Host Resource Estimates: Memory=4.48GB
Per Host Min Memory Reservation: localhost:27001(2.13 GB) localhost:27002(2.13 GB) localhost:27000(2.15 GB)
Per Host Number of Fragment Instances: localhost:27001(76) localhost:27002(76) localhost:27000(105)
Cluster Memory Admitted: 13.44 GB
Per Node Peak Memory Usage: localhost:27000(2.24 GB) localhost:27002(2.21 GB) localhost:27001(2.21 GB)
Per Node Bytes Read: localhost:27000(112.79 MB) localhost:27002(109.57 MB) localhost:27001(105.16 MB)

After patch:

q51a
Max Per-Host Resource Reservation: Memory=2.00GB Threads=79
Per-Host Resource Estimates: Memory=118.75GB
Per Host Min Memory Reservation: localhost:27001(1.84 GB) localhost:27002(1.28 GB) localhost:27000(1.86 GB)
Per Host Number of Fragment Instances: localhost:27001(65) localhost:27002(46) localhost:27000(74)
Cluster Memory Admitted: 33.00 GB
Per Node Peak Memory Usage: localhost:27000(1.88 GB) localhost:27002(1.31 GB) localhost:27001(1.88 GB)
Per Node Bytes Read: localhost:27000(62.08 MB) localhost:27002(45.71 MB) localhost:27001(47.39 MB)

q67a
Max Per-Host Resource Reservation: Memory=1.31GB Threads=85
Per-Host Resource Estimates: Memory=3.76GB
Per Host Min Memory Reservation: localhost:27001(1.29 GB) localhost:27002(1.29 GB) localhost:27000(1.31 GB)
Per Host Number of Fragment Instances: localhost:27001(56) localhost:27002(56) localhost:27000(85)
Cluster Memory Admitted: 11.28 GB
Per Node Peak Memory Usage: localhost:27000(1.35 GB) localhost:27002(1.32 GB) localhost:27001(1.33 GB)
Per Node Bytes Read: localhost:27000(112.79 MB) localhost:27002(109.57 MB) localhost:27001(105.16 MB)

Testing:
- Pass test_tpcds_queries.py in local machine.

Change-Id: I6ae5aeb97a8353d5eaa4d85e3f600513f42f7cf4
Reviewed-on: http://gerrit.cloudera.org:8080/20581
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-10-20 20:31:43 +00:00
Riza Suminto
baddaf2241 IMPALA-12144: Skip TestTpcdsQueryWithProcessingCost if dockerised
There are signs of flakiness in TestTpcdsQueryWithProcessingCost in the
dockerised environment, seemingly due to the tighter per-process memory
limit there. This patch skips TestTpcdsQueryWithProcessingCost in the
dockerised environment.

Testing:
- Hack SkipIfDockerizedCluster.insufficient_mem_limit to return True if
  IS_HDFS and confirm that the whole TestTpcdsQueryWithProcessingCost is
  skipped.

Change-Id: Ibb6b2d4258a2c6613d1954552f21641b42cb3c38
Reviewed-on: http://gerrit.cloudera.org:8080/19892
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-16 20:40:11 +00:00
Riza Suminto
1d0b111bcf IMPALA-12091: Control scan parallelism by its processing cost
Before this patch, Impala still relied on the MT_DOP option to decide
the degree of parallelism of the scan fragment when a query runs with
COMPUTE_PROCESSING_COST=1. This patch adds the scan node's processing
cost as another consideration to raise scan parallelism beyond MT_DOP.

Scan node cost is now adjusted to also consider the number of effective
scan ranges. Each scan range is given a weight of (0.5% *
min_processing_per_thread), which roughly means that one scan node
instance can handle at most 200 scan ranges.
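The range-weight adjustment can be sketched as below. This is an illustrative Python sketch, not the actual Java costing code; the 10M default for min_processing_per_thread is taken from the IMPALA-11604 commit description:

```python
def scan_node_cost(base_cost, num_scan_ranges,
                   min_processing_per_thread=10_000_000):
    """Add a per-scan-range weight of 0.5% of min_processing_per_thread,
    so roughly 200 ranges add up to one instance's worth of processing."""
    range_weight = 0.005 * min_processing_per_thread
    return base_cost + num_scan_ranges * range_weight
```

With the default, 200 scan ranges alone contribute 10M cost units, i.e. one full instance's processing load.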

Query option MAX_FRAGMENT_INSTANCES_PER_NODE is added as an upper
bound on scan parallelism if COMPUTE_PROCESSING_COST=true. If the number
of scan ranges is fewer than the maximum parallelism allowed by the scan
node's processing cost, that processing cost will be clamped down
to (min_processing_per_thread / number of scan ranges). Lowering
MAX_FRAGMENT_INSTANCES_PER_NODE can also clamp down the scan processing
cost in a similar way. For interior fragments, a combination of
MAX_FRAGMENT_INSTANCES_PER_NODE, PROCESSING_COST_MIN_THREADS, and the
number of available cores per node is taken into account to determine
the maximum fragment parallelism per node. For scan fragments, only the
first two are considered, to encourage the Frontend to choose a larger
executor group as needed.

Two new pieces of static state are added to exec-node.h:
is_mt_fragment_ and num_instances_per_node_. The backend code that
refers to the MT_DOP option is replaced with either is_mt_fragment_ or
num_instances_per_node_.

Two new criteria are added during effective parallelism calculation in
PlanFragment.adjustToMaxParallelism():

- If a fragment has UnionNode, its parallelism is the maximum between
  its input fragments and its collocated ScanNode's expected
  parallelism.
- If a fragment only has a single ScanNode (and no UnionNode), its
  parallelism is calculated in the same fashion as the interior fragment
  but will not be lowered anymore since it will not have any child
  fragment to compare with.

Admission control slots remain unchanged. This may cause a query to
fail admission if the Planner selects a scan parallelism that is higher
than the configured admission control slots value. Setting
MAX_FRAGMENT_INSTANCES_PER_NODE equal to or lower than the configured
admission control slots value can help lower scan parallelism and pass
admission control.

The previous workaround to control scan parallelism by IMPALA-12029 is
now removed. This patch also disables IMPALA-10287 optimization if
COMPUTE_PROCESSING_COST=true. This is because IMPALA-10287 relies on a
fixed number of fragment instances in DistributedPlanner.java. However,
effective parallelism calculation is done much later and may change the
final number of instances of hash join fragment, rendering
DistributionMode selected by IMPALA-10287 inaccurate.

This patch is benchmarked using single_node_perf_run.py with the
following parameters:

args="-gen_experimental_profile=true -default_query_options="
args+="mt_dop=4,compute_processing_cost=1,processing_cost_min_threads=1 "
./bin/single_node_perf_run.py --num_impalads=3 --scale=10 \
    --workloads=tpcds --iterations=5 --table_formats=parquet/none/none \
    --impalad_args="$args" \
    --query_names=TPCDS-Q3,TPCDS-Q14-1,TPCDS-Q14-2,TPCDS-Q23-1,TPCDS-Q23-2,TPCDS-Q49,TPCDS-Q76,TPCDS-Q78,TPCDS-Q80A \
    "IMPALA-12091~1" IMPALA-12091

The benchmark result is as follows:
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| Workload  | Query       | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%)  | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+
| TPCDS(10) | TPCDS-Q23-1 | parquet / none / none | 4.62   | 4.54        |   +1.92%   |   0.23%    |   1.59%        | 5     |   +2.32%       | 1.15    | 2.67  |
| TPCDS(10) | TPCDS-Q14-1 | parquet / none / none | 5.82   | 5.76        |   +1.08%   |   5.27%    |   3.89%        | 5     |   +2.04%       | 0.00    | 0.37  |
| TPCDS(10) | TPCDS-Q23-2 | parquet / none / none | 4.65   | 4.58        |   +1.38%   |   1.97%    |   0.48%        | 5     |   +0.81%       | 0.87    | 1.51  |
| TPCDS(10) | TPCDS-Q49   | parquet / none / none | 1.49   | 1.48        |   +0.46%   | * 36.02% * | * 34.95% *     | 5     |   +1.26%       | 0.58    | 0.02  |
| TPCDS(10) | TPCDS-Q14-2 | parquet / none / none | 3.76   | 3.75        |   +0.39%   |   1.67%    |   0.58%        | 5     |   -0.03%       | -0.58   | 0.49  |
| TPCDS(10) | TPCDS-Q78   | parquet / none / none | 2.80   | 2.80        |   -0.04%   |   1.32%    |   1.33%        | 5     |   -0.42%       | -0.29   | -0.05 |
| TPCDS(10) | TPCDS-Q80A  | parquet / none / none | 2.87   | 2.89        |   -0.51%   |   1.33%    |   0.40%        | 5     |   -0.01%       | -0.29   | -0.82 |
| TPCDS(10) | TPCDS-Q3    | parquet / none / none | 0.18   | 0.19        |   -1.29%   | * 15.26% * | * 15.87% *     | 5     |   -0.54%       | -0.87   | -0.13 |
| TPCDS(10) | TPCDS-Q76   | parquet / none / none | 1.08   | 1.11        |   -2.98%   |   0.92%    |   1.70%        | 5     |   -3.99%       | -2.02   | -3.47 |
+-----------+-------------+-----------------------+--------+-------------+------------+------------+----------------+-------+----------------+---------+-------+

Testing:
- Pass PlannerTest.testProcessingCost
- Pass test_executor_groups.py
- Reenable test_tpcds_q51a in TestTpcdsQueryWithProcessingCost with
  MAX_FRAGMENT_INSTANCES_PER_NODE set to 5
- Pass test_tpcds_queries.py::TestTpcdsQueryWithProcessingCost
- Pass core tests

Change-Id: If948e45455275d9a61a6cd5d6a30a8b98a7c729a
Reviewed-on: http://gerrit.cloudera.org:8080/19807
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2023-05-11 22:46:31 +00:00
Joe McDonnell
82bd087fb1 IMPALA-11973: Add absolute_import, division to all eligible Python files
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
 1. Python 3 requires absolute imports within packages. This
    can be emulated via "from __future__ import absolute_import"
 2. Python 3 changed division to "true" division that doesn't
    round to an integer. This can be emulated via
    "from __future__ import division"

This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.

I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.

Testing:
 - Ran core tests

Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2023-03-09 17:17:57 +00:00
Riza Suminto
dafc0fb7a8 IMPALA-11604 (part 2): Compute Effective Parallelism of Query
Part 1 of IMPALA-11604 implements the ProcessingCost model for each
PlanNode and DataSink. This second part builds on top of ProcessingCost
model by adjusting the number of instances for each fragment after
considering their production-consumption ratio, and then finally returns
a number representing an ideal CPU core count required for a query to
run efficiently. A more detailed explanation of the CPU costing
algorithm can be found in the three steps below.

I. Compute the total ProcessingCost of a fragment.

The costing algorithm splits a query fragment into several segments
divided by blocking PlanNode/DataSink boundary. Each fragment segment is
a subtree of PlanNodes/DataSink in the fragment with a DataSink or
blocking PlanNode as root and non-blocking leaves. All other nodes in
the segment are non-blocking. PlanNodes or DataSink that belong to the
same segment will have their ProcessingCost summed. A new CostingSegment
class is added to represent this segment.

A fragment that has a blocking PlanNode or blocking DataSink is called a
blocking fragment. Currently, only JoinBuildSink is considered a
blocking DataSink. A fragment without any blocking nodes is called a
non-blocking fragment. Step III discusses blocking and non-blocking
fragments further.

Take as an example the following fragment plan, which is blocking since
it has 3 blocking PlanNodes: 12:AGGREGATE, 06:SORT, and 08:TOP-N.

  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=6 (adjusted from 12)
  fragment-costs=[34974657, 2159270, 23752870, 22]
  08:TOP-N [LIMIT=100]
  |  cost=900
  |
  07:ANALYTIC
  |  cost=23751970
  |
  06:SORT
  |  cost=2159270
  |
  12:AGGREGATE [FINALIZE]
  |  cost=34548320
  |
  11:EXCHANGE [HASH(i_class)]
     cost=426337

In bottom-up direction, there exist four segments in F03:
  Blocking segment 1: (11:EXCHANGE, 12:AGGREGATE)
  Blocking segment 2: 06:SORT
  Blocking segment 3: (07:ANALYTIC, 08:TOP-N)
  Non-blocking segment 4: DataStreamSink of F03

Therefore we have:
  PC(segment 1) = 426337+34548320
  PC(segment 2) = 2159270
  PC(segment 3) = 23751970+900
  PC(segment 4) = 22

These per-segment costs are stored in a CostingSegment tree rooted at
PlanFragment.rootSegment_, and are [34974657, 2159270, 23752870, 22]
respectively after the post-order traversal.
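The per-segment sums above can be checked with a short Python sketch:

```python
# Node costs grouped by the blocking-boundary segments of fragment F03.
segments = [
    [426337, 34548320],  # segment 1: 11:EXCHANGE + 12:AGGREGATE
    [2159270],           # segment 2: 06:SORT
    [23751970, 900],     # segment 3: 07:ANALYTIC + 08:TOP-N
    [22],                # segment 4: DataStreamSink of F03
]
fragment_costs = [sum(seg) for seg in segments]
print(fragment_costs)  # [34974657, 2159270, 23752870, 22]
```

The result matches the fragment-costs annotation shown in the F03 plan snippet.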

This is implemented in PlanFragment.computeCostingSegment() and
PlanFragment.collectCostingSegmentHelper().

II. Compute the effective degree of parallelism (EDoP) of fragments.

The costing algorithm walks the PlanFragments of the query plan tree in
post-order traversal. Upon visiting a PlanFragment, the costing
algorithm attempts to adjust the number of instances (effective
parallelism) of that fragment by comparing the ProcessingCost of its
child's last segment and the production-consumption rate between its
adjacent segments from step I. To simplify this initial implementation,
the parallelism of a PlanFragment containing an EmptySetNode, UnionNode,
or ScanNode remains unchanged (it follows MT_DOP).

This step is implemented at PlanFragment.traverseEffectiveParallelism().

III. Compute the EDoP of the query.

Effective parallelism of a query is the upper bound of the CPU core
count that can work on the query in parallel, considering the overlap
between fragment execution and blocking operators. We compute this in a
similar post-order traversal as step II and split the query tree into
blocking fragment subtrees similar to step I. The following is an
example of a query plan from TPCDS-Q12.

  F04:PLAN FRAGMENT [UNPARTITIONED] hosts=1 instances=1
  PLAN-ROOT SINK
  |
  13:MERGING-EXCHANGE [UNPARTITIONED]
  |
  F03:PLAN FRAGMENT [HASH(i_class)] hosts=3 instances=3 (adjusted from 12)
  08:TOP-N [LIMIT=100]
  |
  07:ANALYTIC
  |
  06:SORT
  |
  12:AGGREGATE [FINALIZE]
  |
  11:EXCHANGE [HASH(i_class)]
  |
  F00:PLAN FRAGMENT [RANDOM] hosts=3 instances=12
  05:AGGREGATE [STREAMING]
  |
  04:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F05:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  10:EXCHANGE [BROADCAST]
  |  |
  |  F02:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  02:SCAN HDFS [tpcds10_parquet.date_dim, RANDOM]
  |
  03:HASH JOIN [INNER JOIN, BROADCAST]
  |
  |--F06:PLAN FRAGMENT [RANDOM] hosts=3 instances=3
  |  JOIN BUILD
  |  |
  |  09:EXCHANGE [BROADCAST]
  |  |
  |  F01:PLAN FRAGMENT [RANDOM] hosts=1 instances=1
  |  01:SCAN HDFS [tpcds10_parquet.item, RANDOM]
  |
  00:SCAN HDFS [tpcds10_parquet.web_sales, RANDOM]

A blocking fragment is a fragment that has a blocking PlanNode or
blocking DataSink in it. The costing algorithm splits the query plan
tree into blocking subtrees divided by blocking fragment boundary. Each
blocking subtree has a blocking fragment as a root and non-blocking
fragments as the intermediate or leaf nodes. From the TPCDS-Q12 example
above, the query plan is divided into five blocking subtrees of
[(F05, F02), (F06, F01), F00, F03, F04].

A CoreCount is a container class that represents the CPU core
requirement of a subtree of a query or the query itself. Each blocking
subtree will have its fragment's adjusted instance count summed into a
single CoreCount. This means that all fragments within a blocking
subtree can run in parallel and should be assigned one core per fragment
instance. The CoreCount for each blocking subtree in the TPCDS-Q12
example is [4, 4, 12, 3, 1].

Upon visiting a blocking fragment, the maximum between current
CoreCount (rooted at that blocking fragment) vs previous blocking
subtrees CoreCount is taken and the algorithm continues up to the next
ancestor PlanFragment. The final CoreCount for the TPCDS-Q12 example is
12.
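The final reduction over blocking subtrees can be sketched in Python (the real CoreCount class lives in the Java frontend; this only illustrates the max-over-subtrees step):

```python
def query_core_count(blocking_subtree_cores):
    """Fragments inside one blocking subtree run concurrently (their
    instance counts were already summed per subtree); across blocking
    boundaries only one subtree is active at a time, so the query's
    CoreCount is the max over the subtrees."""
    return max(blocking_subtree_cores)
```

Applied to the TPCDS-Q12 subtree CoreCounts [4, 4, 12, 3, 1], this yields the final CoreCount of 12.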

This step is implemented at Planner.computeBlockingAwareCores() and
PlanFragment.traverseBlockingAwareCores().

The resulting CoreCount at the root PlanFragment is then taken as the
ideal CPU core count / EDoP of the query. This number is compared
against the total CPU count of an Impala executor group to determine
whether the query fits to run in that group. A backend flag
query_cpu_count_divisor is added to help scale the EDoP of a query down
or up if needed.

Two query options are added to control the entire computation of EDoP.
1. COMPUTE_PROCESSING_COST
   Control whether to enable this CPU costing algorithm or not.
   Must also set MT_DOP > 0 for this query option to take effect.

2. PROCESSING_COST_MIN_THREADS
   Controls the minimum number of fragment instances (threads) that the
   costing algorithm is allowed to adjust. The costing algorithm is in
   charge of increasing the fragment's instance count beyond this
   minimum number through producer-consumer rate comparison. The
   maximum number of fragment instances is the max of
   PROCESSING_COST_MIN_THREADS, MT_DOP, and the number of cores per
   executor.

This patch also adds three backend flags to tune the algorithm.
1. query_cpu_count_divisor
   Divides the CPU requirement of a query to fit the total available
   CPU in the executor group. For example, setting the value 2 fits a
   query with CPU requirement 2X into an executor group with total
   available CPU X. Note that setting a fractional value less than 1
   effectively multiplies the query's CPU requirement. A valid value is
   > 0.0. The default value is 1.
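The divisor's effect, sketched in Python (the helper name is hypothetical; only the arithmetic is from the flag's description):

```python
def effective_cpu_ask(core_count, query_cpu_count_divisor=1.0):
    """Scale the query's CPU requirement: a divisor > 1 shrinks it,
    a fractional divisor < 1 inflates it."""
    assert query_cpu_count_divisor > 0.0
    return core_count / query_cpu_count_divisor
```

A query asking for 24 cores fits a 12-core group with divisor 2, while divisor 0.5 doubles the requirement.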

2. processing_cost_use_equal_expr_weight
   If true, all expression evaluations are weighted equally to 1 during
   the plan node's processing cost calculation. If false, expression
   cost from IMPALA-2805 will be used. Default to true.

3. min_processing_per_thread
   Minimum processing load (in processing cost unit) that a fragment
   instance needs to work on before planner considers increasing
   instance count based on the processing cost rather than the MT_DOP
   setting. The decision is per fragment. Setting this to high number
   will reduce parallelism of a fragment (more workload per fragment),
   while setting to low number will increase parallelism (less workload
   per fragment). Actual parallelism might still be constrained by the
   total number of cores in selected executor group, MT_DOP, or
   PROCESSING_COST_MIN_THREAD query option. Must be a positive integer.
   Currently default to 10M.

As an example, the following are additional ProcessingCost information
printed to coordinator log for Q3, Q12, and Q15 ran on TPCDS 10GB scale,
3 executors, MT_DOP=4, PROCESSING_COST_MAX_THREADS=4, and
processing_cost_use_equal_expr_weight=false.

  Q3
  CoreCount={total=12 trace=F00:12}

  Q12
  CoreCount={total=12 trace=F00:12}

  Q15
  CoreCount={total=15 trace=N07:3+F00:12}

There are a few TODOs which will be done in follow up tasks:
1. Factor in row width in ProcessingCost calculation (IMPALA-11972).
2. Tune the individual expression cost from IMPALA-2805.
3. Benchmark and tune min_processing_per_thread with an optimal value.
4. Revisit cases where cardinality is not available (getCardinality() or
   getInputCardinality() return -1).
5. Bound SCAN and UNION fragments by ProcessingCost as well (need to
   address IMPALA-8081).

Testing:
- Add TestTpcdsQueryWithProcessingCost, which is a similar run of
  TestTpcdsQuery, but with COMPUTE_PROCESSING_COST=1 and MT_DOP=4.
  Setting log level TRACE for PlanFragment and manually running
  TestTpcdsQueryWithProcessingCost in the minicluster shows several
  fragment instance count reductions from 12 to 9, 6, or 3 in the
  coordinator log.
- Add PlannerTest#testProcessingCost
  Adjusted fragment count is indicated by "(adjusted from 12)" in the
  query profile.
- Add TestExecutorGroups::test_query_cpu_count_divisor.

Co-authored-by: Qifan Chen <qchen@cloudera.com>

Change-Id: Ibb2a796fdf78336e95991955d89c671ec82be62e
Reviewed-on: http://gerrit.cloudera.org:8080/19593
Reviewed-by: Wenzhe Zhou <wzhou@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2023-03-08 15:32:28 +00:00
Shant Hovsepian
3f1b1476af IMPALA-10034: Add remaining TPC-DS queries to workload.
Include remaining TPC-DS queries to the testdata workload definition.

Q8 and Q38 were using non-standard variants; those have been
replaced by the official query versions. Q35 uses an official
variant. A table alias in Q90 had to be escaped because we treat
'AT' as a reserved keyword.

Change-Id: Id5436689390f149694f14e6da1df624de4f5f7ad
Reviewed-on: http://gerrit.cloudera.org:8080/16280
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-08-24 16:02:45 +00:00
Shant Hovsepian
ea3f073881 IMPALA-9943,IMPALA-4974: INTERSECT/EXCEPT [DISTINCT]
INTERSECT and EXCEPT set operations are implemented as rewrites to
joins. Currently only the DISTINCT-qualified operators are implemented,
not the ALL-qualified ones. The operator MINUS is supported as an alias
for EXCEPT.
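
The DISTINCT-qualified semantics being implemented can be illustrated
with a small Python sketch (a toy model of the join rewrites, not the
planner code; INTERSECT DISTINCT behaves like a distinct semi join and
EXCEPT DISTINCT like a distinct anti join):

```python
def intersect_distinct(left, right):
    # INTERSECT DISTINCT: distinct left rows that find a match (semi join).
    return sorted(set(left) & set(right))

def except_distinct(left, right):
    # EXCEPT DISTINCT: distinct left rows with no match (anti join).
    return sorted(set(left) - set(right))

print(intersect_distinct([1, 2, 2, 3], [2, 3, 4]))  # [2, 3]
print(except_distinct([1, 2, 2, 3], [2, 3, 4]))     # [1]
```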

We mimic Oracle and Hive's non-standard implementation which treats all
operators with the same precedence, as opposed to the SQL Standard of
giving INTERSECT higher precedence.

A new class, SetOperationStmt, was created to encompass the previous
UnionStmt behavior. UnionStmt is preserved as a special case for
union-only operands to ensure compatibility with previous union
planning behavior.

Tests:
* Added parser and analyzer tests.
* Ensured no test failures or plan changes for union tests.
* Added TPC-DS queries 14,38,87 to functional and planner tests.
* Added functional tests test_intersect test_except
* New planner testSetOperationStmt

Change-Id: I5be46f824217218146ad48b30767af0fc7edbc0f
Reviewed-on: http://gerrit.cloudera.org:8080/16123
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-31 17:23:45 +00:00
Tim Armstrong
574fef2a76 IMPALA-9917: grouping() and grouping_id() support
Implements the grouping() and grouping_id() builtins.
grouping_id() has both a no-arg version, which returns
a bit vector over all grouping exprs, and a varargs version,
which returns a bit vector over the provided arguments.
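
The bit-vector behavior can be sketched in Python (an illustrative toy
model, not the builtin implementation; it assumes the first grouping
expr maps to the most significant bit, and that grouping() returns 1
for an expr rolled up in the current aggregation class):

```python
def grouping(expr, active_exprs):
    # 0 if the expr is grouped in this aggregation class, 1 if rolled up.
    return 0 if expr in active_exprs else 1

def grouping_id(all_exprs, active_exprs, args=None):
    # Bit vector over `args`, or over all grouping exprs when no args
    # are given (assumed bit order: first expr = most significant bit).
    exprs = args if args is not None else all_exprs
    gid = 0
    for e in exprs:
        gid = (gid << 1) | grouping(e, active_exprs)
    return gid

# ROLLUP(a, b) at the (a)-only level: b is rolled up.
print(grouping_id(["a", "b"], {"a"}))         # 1 (binary 01)
print(grouping_id(["a", "b"], {"a"}, ["b"]))  # 1
```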

GROUPING is a keyword, so it needs special handling in the
parser to be accepted as a function name.

These functions are implemented in the transpose agg
with a CASE expression similar to other aggregate functions,
but returning the grouping() or grouping_id() value for that
aggregation class instead of an aggregated value.

Testing:
* Added parser test for grouping keyword.
* Added analysis tests for the functions.
* Added basic planner test to show expressions generated
* Added some TPC-DS queries that use grouping() - queries
  80, 70 and 86 using reference .test files from Fang-Yu
  Rao. 27 and 36 were added with reference results from
  https://github.com/cwida/tpcds-result-reproduction
* Add targeted end-to-end tests.
* Added view compatibility test with Hive.

Change-Id: If0b1640d606256c0fe9204d2a21a8f6d06abcdb6
Reviewed-on: http://gerrit.cloudera.org:8080/16140
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-14 03:13:18 +00:00
Tim Armstrong
3e1e7da229 IMPALA-9898: generate grouping set plans
Integrates the parsing and analysis with plan generation.

Testing:
* Add analysis test to make sure we reject unsupported queries.
* Added targeted planner tests to ensure we generate the correct
  aggregation classes for a variety of cases.
* Add targeted end-to-end functional tests.

Added five TPC-DS queries that use ROLLUP, building on some work done
by Fang-Yu Rao. Some tweaks were required for these tests.
* Add an extra ORDER BY clause to q77 to make it fully deterministic.
* Add backticks around `returns` to avoid reserved word.
* Add INTERVAL keyword to date/timestamp arithmetic.

We can run q80, too, but I haven't added or verified results yet -
that can be done in a follow-up.

Change-Id: Ie454c5bf7aee266321dee615548d7f2b71380197
Reviewed-on: http://gerrit.cloudera.org:8080/16128
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-14 03:13:18 +00:00
Tim Armstrong
fea5dffec5 IMPALA-9924: handle single subquery in or predicate
This patch supports a subset of cases of subqueries
inside OR inside WHERE and HAVING clauses.

The approach used is to rewrite the subquery into
a many-to-one LEFT OUTER JOIN with the subquery and
then replace the subquery in the expression with a
reference to the single select list expressions of
the subquery. This works because:
* A many-to-one LEFT OUTER JOIN returns one output row
  for each left input row, meaning that for every row
  in the original query before the rewrite, we get
  the same row plus a single matched row from the subquery
* Expressions can be rewritten to refer to a slotref from
  the right side of the LEFT OUTER JOIN without affecting
  semantics. E.g. an IN subquery becomes <slot> IS NOT NULL
  or <operator> (<subquery>) becomes <operator> <slot>.
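
The row-count argument above can be checked with a toy example (a
Python sketch with made-up table contents, not the actual rewrite
machinery):

```python
# Toy rewrite of: SELECT * FROM t WHERE c1 = 0 OR id IN (SELECT id FROM s)
t = [{"id": 1, "c1": 0}, {"id": 2, "c1": 9}, {"id": 3, "c1": 9}]
s = [{"id": 2}, {"id": 2}, {"id": 4}]  # duplicates on purpose

# Direct evaluation of the disjunct containing the subquery.
direct = [r for r in t if r["c1"] == 0 or r["id"] in {x["id"] for x in s}]

# Rewrite: many-to-one LEFT OUTER JOIN against the DISTINCT'd subquery,
# then replace the IN subquery with <slot> IS NOT NULL.
sub = {x["id"] for x in s}  # distinct makes the join many-to-one
joined = [dict(r, slot=r["id"] if r["id"] in sub else None) for r in t]
rewritten = [r for r in joined if r["c1"] == 0 or r["slot"] is not None]

# One output row per left input row, so the results agree.
print([r["id"] for r in direct], [r["id"] for r in rewritten])  # [1, 2] [1, 2]
```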

This does not affect SELECT list subqueries, which are
rewritten using a different mechanism that can already
support some subqueries in disjuncts.

Correlated and uncorrelated subqueries are both supported, but
various limitations are present.
Limitations:
* Only one subquery per predicate is supported. The rewriting approach
  should generalize to multiple subqueries but other code needs
  refactoring to handle this case.
* EXISTS and NOT EXISTS subqueries are not supported. The rewriting
  approach can generalize to that, but we would need to add or pick a
  select list item from the subquery to check for NULL/IS NOT NULL,
  and a little more work is required to do that correctly.
* NOT IN is not supported because of the special NULL semantics.
* Subqueries with aggregates plus grouping are not supported because
  we rely on adding DISTINCT to the select list, and we don't
  support DISTINCT plus aggregations because of IMPALA-5098.

Tests:
* Positive analysis tests for IN and binary predicate operators.
* Negative analysis tests for unsupported subquery operators.
* Negative analysis tests for multiple subqueries.
* Negative analysis tests for runtime scalar subqueries.
* Positive and negative analysis tests for aggregations in subquery.
* TPC-DS Query 45 planner and query tests
* Targeted planner tests for various supported queries.
* Targeted functional tests to confirm plans are executable and
  return correct results. These exercise a mix of the supported
  features - correlated/uncorrelated subqueries, aggregate functions,
  EXISTS/comparator, etc.
* Tests for BETWEEN predicate, which is supported as a side-effect
  of being rewritten during analysis.

Change-Id: I64588992901afd7cd885419a0b7f949b0b174976
Reviewed-on: http://gerrit.cloudera.org:8080/16152
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
2020-07-13 16:02:27 +00:00
Tim Armstrong
24f24b131f IMPALA-9902: add rewrite of TPC-DS q38
I generated the query with dsqgen and then
rewrote it to avoid intersect.

Testing:
Compared results to hive running the original version of the
query.

Change-Id: I81807683aa265a946729e15156bd2e33724103e1
Reviewed-on: http://gerrit.cloudera.org:8080/16118
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-07-06 23:47:45 +00:00
Shant Hovsepian
2dca55695e IMPALA-9784, IMPALA-9905: Uncorrelated subqueries in HAVING.
Support rewriting subqueries in the HAVING clause by nesting the
aggregation query and pulling up the subquery predicates into the outer
WHERE clause.

Testing:
  * New analyzer tests
  * New functional subquery tests
  * Added Q23, Q24 and Q44 to the tpcds workload
  * Ran subquery rewrite tests

Change-Id: I124a58a09a1a47e1222a22d84b54fe7d07844461
Reviewed-on: http://gerrit.cloudera.org:8080/16052
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Shant Hovsepian
388ad555d7 IMPALA-8954: Uncorrelated scalar subqueries in the select list
Extend StmtRewriter with the ability to rewrite scalar subqueries in the
select list into cross joins. Currently the subquery must pass plan-time
checks to determine that it returns a single row, which may miss cases
that would be valid at runtime or with more complex evaluation of the
predicate expressions in the planner. Support for correlated subqueries
will be a follow-on change.
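
The rewrite can be illustrated with a toy example (a Python sketch with
invented table contents, not the StmtRewriter code):

```python
# Toy rewrite of: SELECT t.id, (SELECT max(v) FROM s) FROM t
t = [{"id": 1}, {"id": 2}]
s = [{"v": 5}, {"v": 9}]

# The uncorrelated scalar subquery yields exactly one row...
scalar = [{"m": max(x["v"] for x in s)}]
assert len(scalar) == 1  # the plan-time single-row check

# ...so a cross join attaches it to every outer row without duplication.
result = [dict(r, **sub) for r in t for sub in scalar]
print(result)  # [{'id': 1, 'm': 9}, {'id': 2, 'm': 9}]
```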

Testing:
  * Added new analyzer tests, updated previous subquery tests
  * test_queries.py::TestQueries::test_subquery
  * Added test_tpcds_q9 to e2e and planner tests

Change-Id: Ibcf55d26889aa01d69bb85f18c9241dda095fb66
Reviewed-on: http://gerrit.cloudera.org:8080/16007
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
2020-07-05 22:03:42 +00:00
Fang-Yu Rao
2a48f7dd98 IMPALA-9890 (Part 1): Add more TPCDS queries to Impala's test suite
This patch adds the following 12 TPCDS queries to the class of
TestTpcdsDecimalV2Query: Q26, Q30, Q31, Q47, Q48, Q57, Q58, Q59, Q63,
Q83, Q85, and Q89. All the queries except for Q31 are added to the class
of TestTpcdsQuery as well because Impala returns one fewer row than
expected for TestTpcdsQuery::test_tpcds_q31(), which requires further
investigation.

To verify whether the result set returned by Impala for a given query
is correct, we compare it with the result set produced by HiveServer2
(HS2) in Impala's mini-cluster. SQL statements can be executed in HS2
via Beeline, HS2's command-line shell, which can be launched with the
following command.

beeline -u "jdbc:hive2://localhost:11050/default"

We note that among these 12 queries, executing Q31, Q58, and Q83
results in a "Counters limit exceeded" error from Tez. To work around
this problem, for these 3 queries we have to execute the following
statement before running them, to increase the default number of
counters, which is set to 120.

set tez.counters.max=1200

In addition, the 'reason' table is referenced by Q85. This table was
not referenced by any TPCDS query before this patch and thus was not
created. Therefore, this patch also modifies tpcds_schema_template.sql
to create this additional table along with its data.

Testing:
- Verified that this patch passes the exhaustive tests in the DEBUG
  build.

Change-Id: Ib5f260e75a3803aabe9ccef271ba94036f96e5cf
Reviewed-on: http://gerrit.cloudera.org:8080/16119
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2020-06-30 13:06:33 +00:00
Tim Armstrong
d4648e87b4 IMPALA-4356,IMPALA-7331: codegen all ScalarExprs
Based on initial draft patch by Pooja Nilangekar.

Codegen'd expressions can be executed in two ways - either by
being called directly from a fully codegend function, or from
interpreted code via a function pointer (previously
ScalarFnCall::scalar_fn_wrapper_).

This change moves the function pointer from ScalarFnCall to its
base class ScalarExpr, so the full expr tree can be codegen'd, not
just the ScalarFnCall subtrees. The key refactoring and improvements
are:
* ScalarExpr::Get*Val() switches between interpreted and the codegen'd
  function pointer code paths in an inline function, avoiding a
  virtual function call to ScalarFnCall::Get*Val().
* Boilerplate logic is moved to ScalarExpr::GetCodegendComputeFn(),
  which calls a virtual function GetCodegenComputeFnImpl().
* ScalarFnCall's logic for deciding whether to interpret or codegen is
  better abstracted and exposed to ScalarExpr as IsInterpretable()
  and ShouldCodegen() methods.
* The ScalarExpr::codegend_compute_fn_ function pointer is only
  populated for expressions that are "codegen entry points". These
  include the roots of expr trees and non-root expressions
  where the parent expression calls Get*Val() from the
  pseudo-codegend GetCodegendComputeFnWrapper().
* ScalarFnCall is always initialised for interpreted execution;
  otherwise the function pointer would be needed for non-root
  expressions, e.g. to support ScalarExprEvaluator::GetConstantVal().
* Latent bugs/gaps for codegen of CollectionVal are fixed. CollectionVal
  is modified to use the StringVal memory layout to allow code sharing
  with StringVal. These fixes allowed simplification of
  IsNotEmptyPredicate codegen (from IMPALA-7657).
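
The inline dispatch described in the first bullet can be sketched as a
Python analogy (the class and member names mirror but are not the
actual Impala C++ classes, and the real dispatch is an inline function
call, not Python attribute lookup):

```python
class ScalarExpr:
    """Toy analogy: the codegen'd function pointer lives on the base
    class, so a whole expr tree can dispatch without a virtual call."""
    def __init__(self):
        self.codegend_compute_fn = None  # set only for codegen entry points

    def get_val(self, row):
        # Switch between the codegen'd pointer and the interpreted path.
        if self.codegend_compute_fn is not None:
            return self.codegend_compute_fn(row)  # codegen'd path
        return self.get_val_interpreted(row)      # interpreted fallback

class AddOne(ScalarExpr):
    def get_val_interpreted(self, row):
        return row + 1

e = AddOne()
print(e.get_val(41))  # 42, via the interpreted path
e.codegend_compute_fn = lambda row: row + 1
print(e.get_val(41))  # 42, via the "codegen'd" function pointer
```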

I chose to tackle two problems in one change - adding support for
generating codegen'd function pointers for all ScalarExprs, and adding
the "entry point" concept - to avoid a blow-up in the number of
codegen'd entry points that could lead to longer codegen times and/or
worse code because of inlining changes.

IMPALA-7331 (CHAR codegen support functions) is also fixed because
it was simpler to enable CHAR codegen within ScalarExpr than to carry
forward the existing CHAR workarounds from ScalarFnCall. The
CHAR-specific codegen support required in the scalar expr subsystem is
very limited.  StringVal intermediates are used everywhere. Only
SlotRef actually operates on the different tuple layout, and the
required codegen support for SlotRef already exists for UDA
intermediates anyway.

Testing:
* Ran exhaustive tests.

Perf:
* Ran a basic insert benchmark, which went from 10.1s to 7.6s
  create table foo stored as parquet as
  select case when l_orderkey % 2 = 0 then 'aaa' else 'bbb' end
  from tpch30_parquet.lineitem;
* Ran a basic CHAR expr test:
  set num_nodes=1;
  set mt_dop=1;
  select count(*) from lineitem
  where cast(l_linestatus as CHAR(2)) = 'O ' and
        cast(l_returnflag as CHAR(2)) = 'N '
  The time spent in the scan went from 520ms to 220ms.
* Added perf regression test to tpcds-insert, similar to the manual
  benchmark.
* Ran single-node TPC-H with large and small scale factors, to estimate
  impact on execution perf and query startup time, respectively.

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(30) | parquet / none / none | 6.84    | -0.18%     | 4.49       | -0.31%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval   |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+
| TPCH(30) | TPCH-Q20 | parquet / none / none | 2.58   | 2.47        |   +4.18%   |   1.29%   |   0.88%        | 5     |   +4.12%       | 2.31    | 5.81   |
| TPCH(30) | TPCH-Q17 | parquet / none / none | 4.81   | 4.61        |   +4.33%   |   2.18%   |   2.15%        | 5     |   +3.91%       | 1.73    | 3.09   |
| TPCH(30) | TPCH-Q21 | parquet / none / none | 26.45  | 26.16       |   +1.09%   |   0.37%   |   0.50%        | 5     |   +1.36%       | 2.02    | 3.94   |
| TPCH(30) | TPCH-Q9  | parquet / none / none | 15.92  | 15.75       |   +1.09%   |   2.87%   |   1.65%        | 5     |   +0.88%       | 0.29    | 0.73   |
| TPCH(30) | TPCH-Q12 | parquet / none / none | 2.38   | 2.35        |   +1.12%   |   1.64%   |   1.11%        | 5     |   +0.80%       | 1.15    | 1.26   |
| TPCH(30) | TPCH-Q14 | parquet / none / none | 2.94   | 2.91        |   +1.13%   |   7.68%   |   5.37%        | 5     |   -0.34%       | -0.29   | 0.27   |
| TPCH(30) | TPCH-Q18 | parquet / none / none | 18.10  | 18.02       |   +0.42%   |   2.70%   |   0.56%        | 5     |   +0.28%       | 0.29    | 0.34   |
| TPCH(30) | TPCH-Q8  | parquet / none / none | 4.72   | 4.72        |   -0.04%   |   1.20%   |   1.65%        | 5     |   +0.05%       | 0.00    | -0.04  |
| TPCH(30) | TPCH-Q19 | parquet / none / none | 3.92   | 3.93        |   -0.26%   |   1.08%   |   2.36%        | 5     |   +0.20%       | 0.58    | -0.23  |
| TPCH(30) | TPCH-Q6  | parquet / none / none | 1.27   | 1.27        |   -0.28%   |   0.22%   |   0.88%        | 5     |   +0.09%       | 0.29    | -0.68  |
| TPCH(30) | TPCH-Q16 | parquet / none / none | 2.64   | 2.65        |   -0.45%   |   1.65%   |   0.65%        | 5     |   -0.24%       | -0.58   | -0.57  |
| TPCH(30) | TPCH-Q22 | parquet / none / none | 3.10   | 3.13        |   -0.76%   |   1.47%   |   1.12%        | 5     |   -0.21%       | -0.29   | -0.93  |
| TPCH(30) | TPCH-Q2  | parquet / none / none | 1.20   | 1.21        |   -0.80%   |   2.26%   |   2.47%        | 5     |   -0.82%       | -1.15   | -0.53  |
| TPCH(30) | TPCH-Q4  | parquet / none / none | 1.97   | 1.99        |   -1.37%   |   1.84%   |   3.21%        | 5     |   -0.47%       | -0.58   | -0.83  |
| TPCH(30) | TPCH-Q13 | parquet / none / none | 11.53  | 11.63       |   -0.91%   |   0.46%   |   0.49%        | 5     |   -0.95%       | -2.02   | -3.08  |
| TPCH(30) | TPCH-Q10 | parquet / none / none | 5.13   | 5.21        |   -1.51%   |   2.24%   |   4.05%        | 5     |   -0.94%       | -0.58   | -0.73  |
| TPCH(30) | TPCH-Q5  | parquet / none / none | 3.61   | 3.66        |   -1.40%   |   0.66%   |   0.79%        | 5     |   -1.33%       | -1.73   | -3.05  |
| TPCH(30) | TPCH-Q7  | parquet / none / none | 19.42  | 19.71       |   -1.52%   |   1.34%   |   1.39%        | 5     |   -1.22%       | -1.44   | -1.76  |
| TPCH(30) | TPCH-Q3  | parquet / none / none | 5.08   | 5.15        |   -1.49%   |   1.34%   |   0.73%        | 5     |   -1.35%       | -1.44   | -2.20  |
| TPCH(30) | TPCH-Q15 | parquet / none / none | 3.42   | 3.49        |   -1.92%   |   0.93%   |   1.47%        | 5     |   -1.53%       | -1.15   | -2.49  |
| TPCH(30) | TPCH-Q11 | parquet / none / none | 1.15   | 1.19        |   -3.17%   |   2.27%   |   1.95%        | 5     |   -4.21%       | -1.15   | -2.41  |
| TPCH(30) | TPCH-Q1  | parquet / none / none | 9.26   | 9.63        |   -3.85%   |   0.62%   |   0.59%        | 5     |   -3.78%       | -2.31   | -10.25 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+--------+

Cluster Name: UNKNOWN
Lab Run Info: UNKNOWN
Impala Version:          impalad version 3.2.0-SNAPSHOT RELEASE ()
Baseline Impala Version: impalad version 3.2.0-SNAPSHOT RELEASE (2019-03-19)

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(2)  | parquet / none / none | 0.90    | -0.08%     | 0.80       | -0.05%         |
+----------+-----------------------+---------+------------+------------+----------------+

+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| Workload | Query    | File Format           | Avg(s) | Base Avg(s) | Delta(Avg) | StdDev(%) | Base StdDev(%) | Iters | Median Diff(%) | MW Zval | Tval  |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+
| TPCH(2)  | TPCH-Q18 | parquet / none / none | 1.22   | 1.19        |   +1.93%   |   3.81%   |   4.46%        | 20    |   +3.34%       | 1.62    | 1.46  |
| TPCH(2)  | TPCH-Q10 | parquet / none / none | 0.74   | 0.73        |   +1.97%   |   3.36%   |   2.94%        | 20    |   +0.97%       | 1.88    | 1.95  |
| TPCH(2)  | TPCH-Q11 | parquet / none / none | 0.49   | 0.48        |   +1.91%   |   6.19%   |   4.64%        | 20    |   +0.25%       | 0.95    | 1.09  |
| TPCH(2)  | TPCH-Q4  | parquet / none / none | 0.43   | 0.43        |   +1.99%   |   6.26%   |   5.86%        | 20    |   +0.15%       | 0.92    | 1.03  |
| TPCH(2)  | TPCH-Q15 | parquet / none / none | 0.50   | 0.49        |   +1.82%   |   7.32%   |   6.35%        | 20    |   +0.26%       | 1.01    | 0.83  |
| TPCH(2)  | TPCH-Q1  | parquet / none / none | 0.98   | 0.97        |   +0.79%   |   4.64%   |   2.73%        | 20    |   +0.36%       | 0.77    | 0.65  |
| TPCH(2)  | TPCH-Q19 | parquet / none / none | 0.83   | 0.83        |   +0.65%   |   3.33%   |   2.80%        | 20    |   +0.44%       | 2.18    | 0.67  |
| TPCH(2)  | TPCH-Q14 | parquet / none / none | 0.62   | 0.62        |   +0.97%   |   2.86%   |   1.00%        | 20    |   +0.04%       | 0.13    | 1.42  |
| TPCH(2)  | TPCH-Q3  | parquet / none / none | 0.88   | 0.87        |   +0.57%   |   2.17%   |   1.74%        | 20    |   +0.29%       | 1.15    | 0.92  |
| TPCH(2)  | TPCH-Q12 | parquet / none / none | 0.53   | 0.53        |   +0.27%   |   4.58%   |   5.78%        | 20    |   +0.46%       | 1.47    | 0.16  |
| TPCH(2)  | TPCH-Q17 | parquet / none / none | 0.72   | 0.72        |   +0.15%   |   3.64%   |   5.55%        | 20    |   +0.21%       | 0.86    | 0.10  |
| TPCH(2)  | TPCH-Q21 | parquet / none / none | 2.05   | 2.05        |   +0.21%   |   1.99%   |   2.37%        | 20    |   +0.01%       | 0.25    | 0.30  |
| TPCH(2)  | TPCH-Q5  | parquet / none / none | 1.28   | 1.27        |   +0.24%   |   1.61%   |   1.80%        | 20    |   -0.02%       | -0.57   | 0.44  |
| TPCH(2)  | TPCH-Q13 | parquet / none / none | 1.27   | 1.27        |   -0.34%   |   1.69%   |   1.83%        | 20    |   -0.20%       | -1.65   | -0.61 |
| TPCH(2)  | TPCH-Q7  | parquet / none / none | 1.72   | 1.73        |   -0.55%   |   2.40%   |   1.69%        | 20    |   -0.03%       | -0.42   | -0.83 |
| TPCH(2)  | TPCH-Q8  | parquet / none / none | 1.27   | 1.28        |   -0.68%   |   3.10%   |   3.89%        | 20    |   -0.06%       | -0.54   | -0.62 |
| TPCH(2)  | TPCH-Q6  | parquet / none / none | 0.36   | 0.36        |   -0.84%   |   0.79%   |   3.51%        | 20    |   -0.07%       | -0.36   | -1.04 |
| TPCH(2)  | TPCH-Q2  | parquet / none / none | 0.65   | 0.65        |   -1.17%   |   4.76%   |   5.99%        | 20    |   -0.05%       | -0.25   | -0.69 |
| TPCH(2)  | TPCH-Q9  | parquet / none / none | 1.59   | 1.62        |   -2.01%   |   1.45%   |   5.12%        | 20    |   -0.16%       | -1.24   | -1.69 |
| TPCH(2)  | TPCH-Q20 | parquet / none / none | 0.68   | 0.69        |   -1.73%   |   4.35%   |   4.43%        | 20    |   -0.49%       | -1.74   | -1.25 |
| TPCH(2)  | TPCH-Q22 | parquet / none / none | 0.38   | 0.40        |   -2.89%   |   7.42%   |   6.39%        | 20    |   -0.21%       | -0.66   | -1.34 |
| TPCH(2)  | TPCH-Q16 | parquet / none / none | 0.59   | 0.62        |   -4.01%   |   6.33%   |   5.83%        | 20    |   -4.72%       | -1.39   | -2.13 |
+----------+----------+-----------------------+--------+-------------+------------+-----------+----------------+-------+----------------+---------+-------+

Change-Id: I839d7a3a2f5e1309c33a1f66013ef11628c5dc11
Reviewed-on: http://gerrit.cloudera.org:8080/12797
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-05-15 22:34:28 +00:00
Tim Armstrong
30082232ed IMPALA-8410: enable TestTpcdsInsert by default
Fix incorrect output to match current behaviour. The test takes
24s on my system but can be run in parallel with other tests.

Change-Id: Ibf9a279d57ad74de0c77a90dde69e5c4dc563a3f
Reviewed-on: http://gerrit.cloudera.org:8080/13055
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2019-04-17 22:10:03 +00:00
Tim Armstrong
f876b320aa IMPALA-5946,IMPALA-5956: add TPC-DS q31,q59,q89
Q31: the substitution variables didn't match the TPC-DS spec.
     After fixing this, the results match up to 4 digits of
     rounding (there is some error introduced in intermediate
     calculations).
Q59: the results match the reference results up to rounding.
Q89: the results match up to 5 digits of rounding.

I verified the matches by using a spreadsheet comparing reference
and actual results.
https://docs.google.com/spreadsheets/d/1MNEqkfYRRAd3xqY6m20tTHquqjtCThDaGdizzRAQ8co/edit?usp=sharing

*https://github.com/gregrahn/tpcds-kit/blob/master/specification/TPC-DS_v2.10.0.pdf
^https://github.com/gregrahn/tpcds-kit/tree/master/answer_sets

Change-Id: I49634e8f63066773c9c78dd5585a0ee69daf720a
Reviewed-on: http://gerrit.cloudera.org:8080/11845
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-02 22:11:03 +00:00
Tim Armstrong
e3a702707c IMPALA-5950: fix TPC-DS Q35a and Q48 queries
The query text (for Q48) and substitution parameters didn't match the
TPC-DS standard for qualification queries. After fixing that, the
queries return the results expected by the TPC-DS standards.

Note that this may affect the performance of perf workloads running
tpcds-unmodified.

Change-Id: Ic7c737f68adf616738d6eb6e5a02593af25bcbaf
Reviewed-on: http://gerrit.cloudera.org:8080/11833
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2018-11-01 08:05:48 +00:00
Taras Bobrovytsky
0a1d586d2a IMPALA-4924: Enable Decimal V2 by default
In this commit we enable Decimal_V2 by default. We also update the
expected results in many of our tests.

Testing:
Ran an exhaustive test which almost passed. Updated the few failed
tests in it.

Cherry-pick: not for 2.x

Change-Id: Ibbdd05bf986b7947f106b396017faa3a0bd87fd7
Reviewed-on: http://gerrit.cloudera.org:8080/9062
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-25 04:33:11 +00:00
Taras Bobrovytsky
35a3e186d6 IMPALA-5478: Run TPCDS queries with decimal_v2 enabled
We add new TPCDS .test files that are expected to be run with decimal_v2
enabled. The new expected results were generated using Impala and I
inspected them manually.

Change-Id: Ib867c51a521ec4a087bc127d99aee4b95ba97733
Reviewed-on: http://gerrit.cloudera.org:8080/8985
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2018-01-18 03:28:51 +00:00
Tim Wood
f05bd241ea IMPALA-5376: Implement all TPCDS test cases or alternates for Impala.
Main source for TPCDS query and result definitions: https://github.com/gregrahn/tpcds-kit.
TPC-DS v2.5.0 qualification queries from G. Rahn, Cloudera, Inc.
Data set constructed in mini-cluster using $IMPALA_HOME/buildall.sh -testdata....
This commit continues previous work on IMPALA-5376 in the ASF Impala repo
and the Cloudera Gerrit service.

This commit splits multi-query tests in the TPC-DS suite definition into one
query and result set per test file, as the test framework requires.  Names for
such files have -1, -2... inner suffixes.

The portion of the TPC-DS test suite in this commit passes.
It contains no failures, as reflected by runs of
$IMPALA_HOME/tests/run-tests.py query_test/test_tpcds_queries.py ...

IMPALA-6007 addresses the TPC-DS cases that require skipping (because we don't
support them or they flap) or expected-failure (xfail, because we support them
but they fail due to bugs.)  These require some added tooling for non-Pytest
frameworks like the stress test to avoid attempting them until they work.
Tests that flap are marked to skip, with a bug ID, since they don't reliably pass or xfail.

Expected result sets come from the TPC-DS kit. Some TPC-DS test cases
in this commit have been modified in semantically neutral ways so as
to pass on Impala.

The tests/query_test/test_tpcds_queries.py driver file is authoritative for the
active/skip/xfail status for each case and a brief reason.  The following list
describes the current status as:
--- test-name
deviance from TPC-DS spec
changes made

--- tpcds-q22a.test
RESULT MISMATCH in LSD of AVG() values
FIXED, HAND_ROUNDED AVG() VALUES IN RESULT SET
--- tpcds-q26.test
RESULT MISMATCH in LSD of AVG() values
ABSENT, IMPALA-6087
--- tpcds-q28.test
RESULT MISMATCH in LSD of AVG() values
ABSENT, IMPALA-6087
--- tpcds-q30.test
UNRECOGNIZED CHARACTER
ABSENT, IMPALA-5961.
--- tpcds-q31.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-5956.
--- tpcds-q35a.test
RESULT MISMATCH
ABSENT, IMPALA-5950.
--- tpcds-q36a.test
RESULT MISMATCH
ABSENT, IMPALA-4741
--- tpcds-q47.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q48.test
RESULT MISMATCH in scalar value
ABSENT, IMPALA-5950.
--- tpcds-q49.test
RESULT MISMATCH in LSD of DECIMAL values
ABSENT, IMPALA-5945
--- tpcds-q57.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q58.test
RESULT MISMATCH in DECIMAL values
ABSENT, IMPALA-5946
--- tpcds-q59.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q61.test
RESULT MISMATCH in DECIMAL value
FIXED. CAST RESULT QUOTIENT TO DECIMAL(15, 4), TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q63.test
RESULT MISMATCH, excess scale in DECIMAL values
ABSENT, IMPALA-6087
--- tpcds-q64.test
RESULT MISMATCH
ADDED ORDER BY COLUMNS.
--- tpcds-q66.test
RESULT MISMATCH
ABSENT, IMPALA-4741
--- tpcds-q77a.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q78.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q83.test
RESULT MISMATCH
ABSENT, IMPALA-5945.
--- tpcds-q85.test
MISSING TABLE "reason"
ABSENT, IMPALA-5960
--- tpcds-q86a.test
RESULT MISMATCH
FIXED. TAKE ACTUAL RESULT AS EXPECTED
--- tpcds-q89.test
RESULT MISMATCH, DECIMAL values flap
ABSENT, ADDED ROUND(2) TO 8th COLUMN, TAKE ACTUAL RESULTS AS EXPECTED, IMPALA-5956.
--- tpcds-q90.test
RESULT MISMATCH
ABSENT, IMPALA-5945.
--- tpcds-q93.test
MISSING TABLE "reason"
ABSENT, IMPALA-5960
--- tpcds-q98.test
RESULT MISMATCH
FIXED, ADDED ROUND() TO LAST COLUMN

Change-Id: I6e284888600a7a69d1f23fcb7dac21cbb13b7d66
Reviewed-on: http://gerrit.cloudera.org:8080/8102
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2017-10-23 19:32:10 +00:00
Michael Ho
f15589573b IMPALA-5376: Loads all TPC-DS tables
This change loads the missing tables in TPC-DS. In addition,
it also fixes up the loading of the partitioned table store_sales
so all partitions will be loaded. The existing TPC-DS queries are
also updated to use the parameters for qualification runs as noted
in the TPC-DS specification. Some hard-coded partition filters were
also removed. They were there due to the lack of dynamic partitioning
in the past. Some missing TPC-DS queries are also added to this change,
including query28 which discovered the infamous IMPALA-5251.

Having all tables in TPC-DS available paves the way for us to include
all supported TPCDS queries in our functional testing. Due to the change
in the data, planner tests and the E2E tests have different results than
before. The results of E2E tests were compared against the run done with
Netezza and Vertica. The divergences were all due to the truncation behavior
of decimal types in DECIMAL_V1.

Change-Id: Ic5277245fd20827c9c09ce5c1a7a37266ca476b9
Reviewed-on: http://gerrit.cloudera.org:8080/6877
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins
2017-05-27 05:19:53 +00:00
David Knupp
f590bc0da6 IMPALA-4750: Rename test infra classes so they don't mimic test classes.
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was simply to prefix
the class names with "Impala":

git grep -l 'TestDimension' | xargs \
    sed -i 's/TestDimension/ImpalaTestDimension/g'

git grep -l 'TestMatrix' | xargs \
    sed -i 's/TestMatrix/ImpalaTestMatrix/g'

git grep -l 'TestVector' | xargs \
    sed -i 's/TestVector/ImpalaTestVector/g'

The tests all passed in an exhaustive run on the upstream jenkins
server:

http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/

Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
2017-01-26 23:40:22 +00:00
Matthew Jacobs
c7fa03286b IMPALA-3718: Support subset of functional-query for Kudu
Adds initial support for the functional-query test workload
for Kudu tables.

There are a few issues that make loading the functional
schema difficult on Kudu:
 1) Kudu tables must have one or more columns that together
    constitute a unique primary key.
   a) Primary key columns must currently be the first columns
      in the table definition (KUDU-1271).
   b) Primary key columns cannot be nullable (KUDU-1570).
 2) Kudu tables must be specified with distribution
    parameters.

(1) limits the tables that can be loaded without ugly
workarounds. This patch only includes important tables that
are used for relevant tests, most notably the alltypes*
family. In particular, alltypesagg is important but it does
not have a set of columns that are non-nullable and form a unique
primary key. As a result, that table is created in Kudu with
a different name and an additional BIGINT column for a PK
that is a unique index and is generated at data loading time
using the ROW_NUMBER analytic function. A view is then
wrapped around the underlying table that matches the
alltypesagg schema exactly. When KUDU-1570 is resolved, this
can be simplified.
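The workaround above can be sketched as follows. This is a minimal, hypothetical illustration in current Impala/Kudu DDL syntax, not the actual functional-schema statements; all table and column names are made up.

```sql
-- Hypothetical sketch (illustrative names and columns, not the real
-- functional schema). Kudu needs a non-nullable unique PK listed first,
-- so a synthetic BIGINT key is prepended.
CREATE TABLE alltypesagg_kudu (
  surrogate_pk BIGINT,     -- synthetic unique key, first column
  id INT,
  day INT,                 -- nullable columns can follow
  PRIMARY KEY (surrogate_pk)
)
PARTITION BY HASH (surrogate_pk) PARTITIONS 3
STORED AS KUDU;

-- Generate the key at load time with the ROW_NUMBER analytic function.
INSERT INTO alltypesagg_kudu
SELECT ROW_NUMBER() OVER (ORDER BY id) AS surrogate_pk, id, day
FROM alltypesagg_text_source;

-- Wrap a view around the table so it matches the original schema exactly.
CREATE VIEW alltypesagg AS
SELECT id, day FROM alltypesagg_kudu;
```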

(2) requires some additional considerations and custom
syntax. As a result, the DDL to create the tables is
explicitly specified in CREATE_KUDU sections in the
functional_schema_constraints.csv, and an additional
DEPENDENT_LOAD_KUDU section was added to specify custom data
loading DML that differs from the existing DEPENDENT_LOAD.

TODO: IMPALA-4005: generate_schema_statements.py needs refactoring

Tests that are not relevant or not yet supported have been
marked with xfail and a skip where appropriate.

TODO: Support remaining functional tables/tests when possible.

Change-Id: Iada88e078352e4462745d9a9a1b5111260d21acc
Reviewed-on: http://gerrit.cloudera.org:8080/4175
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
2016-09-14 22:11:04 +00:00
Dan Hecht
ffa7829b70 IMPALA-3918: Remove Cloudera copyrights and add ASF license header
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:

http://www.apache.org/legal/src-headers.html#headers

Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
   http://www.apache.org/legal/src-headers.html#notice
   to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
   on the website.

Much of this change was automatically generated via:

git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]

Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.

[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
    modification to ORIG_LICENSE to match Impala's license text.

Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
2016-08-09 08:19:41 +00:00
Taras Bobrovytsky
609b80410e Clean up Python test import statements
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.

Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
2016-07-15 23:26:18 +00:00
Casey Ching
074e5b4349 Remove hashbang from non-script python files
Many python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes it
very difficult to determine which python files are actually scripts.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.

Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
2015-08-04 05:26:07 +00:00
Martin Grund
f853eeb2f0 IMPALA-1284: Allow implicit cross joins
This patch adds support for executing implicit cross joins in Impala. An
implicit cross join occurs when two tables are referenced in the FROM
clause of a select statement without specifying a join type and in the
absence of an applicable equi-join predicate.

To convert an implicit join into a cross join, we manually create the
cross join node during join reordering. When two sub-plans are
compared to each other, a hash join plan is always preferred.

As a side effect, explicit cross joins that have equi join conjuncts are
now rewritten to hash joins.
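For illustration, a minimal example of the distinction (hypothetical tables t1 and t2, not taken from the patch):

```sql
-- Implicit cross join: two table references, no join type, and no
-- applicable equi-join predicate.
SELECT * FROM t1, t2 WHERE t1.a < t2.b;

-- The equivalent explicit form:
SELECT * FROM t1 CROSS JOIN t2 WHERE t1.a < t2.b;

-- Conversely, an explicit CROSS JOIN that carries an equi-join conjunct
-- is now rewritten to a hash join:
SELECT * FROM t1 CROSS JOIN t2 WHERE t1.a = t2.b;
```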

This patch enables us to run TPC-DS queries Q61 and Q88 which are added
as planner and query tests.

Change-Id: Ifd53a78e8eb38d553eb039bfeef0216e438790ba
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4695
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
(cherry picked from commit 77ff7f09350d028be033d772c5c456ceb8828013)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5319
2014-11-20 11:25:33 -08:00
ishaan
10303ed440 Add partition filters to tpcds-q89 and re-enable tpcds-q47
This patch adds partition filters to tpcds-q89 to account for the lack of
dynamic partition pruning. It also re-enables tpcds-q47, which was blocked
by IMPALA-1238.
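The kind of filter involved can be sketched as follows; this is a hypothetical illustration with made-up predicate values, not the actual q89 text. Without dynamic partition pruning, a predicate on the joined date_dim table cannot prune store_sales partitions at planning time, so a predicate on the partition key itself is added by hand:

```sql
-- Illustrative only: the BETWEEN bounds are invented for this sketch.
SELECT ss_item_sk, SUM(ss_sales_price)
FROM store_sales
JOIN date_dim ON ss_sold_date_sk = d_date_sk
WHERE d_year = 1999
  -- hand-added filter on the partition key, since the d_year predicate
  -- alone cannot statically prune store_sales partitions
  AND ss_sold_date_sk BETWEEN 2451180 AND 2451544
GROUP BY ss_item_sk;
```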

Change-Id: Ied05d80565ebb29cd06b3c38d76bd31f0285028e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4453
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-10-08 16:50:16 -07:00
ishaan
e126a3c8b5 Enable more tpcds queries that use correlated subqueries and analytic functions.
This patch only operates on queries that use store_sales as the fact table.

Change-Id: I763245ef5f68bb1519bcb4d4b26ede96913a1d57
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4312
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4106
2014-09-27 01:15:41 -07:00
Nong Li
bbce0d62c9 [CDH5] Disable tpch/tpcds query and planner tests.
Change-Id: I30ecefe2db9ee7996433cda025f86ef8669284e9
2014-09-20 19:41:42 -07:00
Nong Li
2596e936c7 PHJ: remove maintaining tuple row ptrs.
The level of indirection of TupleRow* (which is just a list of Tuple*) is not
helpful. This memory is wasted, and additional logic would have to be added to
ensure that it triggers spilling. Instead, we store an offset into the tuple stream
(which is always 8 bytes and already part of the hash table structure).

Change-Id: I89eb9474746a39ba1b0543aafabdc4a8ddccf88b
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4321
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
2014-09-20 16:12:10 -07:00
Lenni Kuff
0ac0527643 Reduce test execution time by limiting long running tests to exhaustive exec strategy
I looked at the latest run from master and took the test suites that had long
execution times. This cleans up those test suites to either completely disable them
on 'core' or add constraints to limit the number of test vectors. It shouldn't impact
nightly coverage since we still run the same tests exhaustively.

Change-Id: I10c78c35155b00de0c36d9fc0923b2b1fc6b44de
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3119
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3125
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
2014-06-18 16:18:17 -07:00
ishaan
db97981ab9 [CDH5] Switch the tpcds schemas to use decimal instead of float/double.
This patch converts the tpcds schemas to use decimal instead of float/double. Currently,
Impala can only read/write decimal in text; therefore, the tables are constrained to text. The
schemas were obtained from the official tpc spec:
http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf

Change-Id: I1ef0113dcb48bad52af75ee93b47b08adf9e1a69
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2403
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
2014-06-08 11:47:23 -07:00
ishaan
53cd9eadab Treat HBase as a file format for functional tests
Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922
Reviewed-on: http://gerrit.ent.cloudera.com:8080/102
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
2014-01-08 10:52:36 -08:00
Nong Li
707a566b5d Add test to tpcds queries to validate table row counts.
I tried to investigate the jenkins issue where we weren't returning any rows.
I set up the cluster on that box manually and noticed there weren't any results
because the store_sales table was empty. Refresh did not fix it. This looks like
a data loading issue. Adding this test would make discovering issues like this
much easier.

Change-Id: I8ccddd43892b279d506371b9de717629815c6a08
Reviewed-on: http://gerrit.ent.cloudera.com:8080/260
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
2014-01-08 10:52:17 -08:00
Lenni Kuff
a3016cc4d4 Add partitioned tpcds insert workload and tests
Change-Id: Iff45853153bf0830be3e423c994392998385a64f
Reviewed-on: http://gerrit.ent.cloudera.com:8080/256
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
2014-01-08 10:52:16 -08:00
Lenni Kuff
17ed6ea177 Partition TPC-DS dataset and add additional TPC-DS workload queries
Change-Id: I5410e68fdfd818a8287e0974332c3e36c344c300
Reviewed-on: http://gerrit.ent.cloudera.com:8080/99
Tested-by: jenkins <kitchen-build@cloudera.com>
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
2014-01-08 10:52:13 -08:00
Nong Li
f60f2d3e50 Implement support for grouped scan ranges in io mgr and integration with parquet. 2014-01-08 10:49:18 -08:00
Nong Li
0df9476be1 Parquet data loading. 2014-01-08 10:48:48 -08:00
Lenni Kuff
12d18631e3 Test enhancements: dynamic table format data loading, per-workload exploration strategies 2014-01-08 10:47:07 -08:00
Lenni Kuff
1b248d067b Add TPC-DS dataset and workload 2014-01-08 10:46:52 -08:00