impala

mirror of https://github.com/apache/impala.git synced 2026-01-03 06:00:52 -05:00

Taras Bobrovytsky 57d7c614bc IMPALA-5036: Parquet count star optimization

Instead of materializing empty rows when computing count star, we use
the data stored in the Parquet RowGroup.num_rows field. The Parquet
scanner tuple is modified to have one slot into which we will write the
num rows statistic. The aggregate function is changed from count to a
special sum function that gets initialized to 0. We also add a rewrite
rule so that count(<literal>) is rewritten to count(*) in order to make
sure that this optimization is applied in all cases.

Testing:
- Added functional and planner tests

Change-Id: I536b85c014821296aed68a0c68faadae96005e62
Reviewed-on: http://gerrit.cloudera.org:8080/6812
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
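
The idea in the commit message above can be sketched in a few lines of Python. This is only an illustration, not Impala's scanner or aggregation code: `RowGroup`, `ParquetFile`, `scan_num_rows`, and `count_star` are hypothetical names, and emitting one value per row group is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class RowGroup:
    # Hypothetical stand-in for Parquet row group metadata.
    num_rows: int


@dataclass
class ParquetFile:
    # Hypothetical stand-in for a Parquet file footer.
    row_groups: List[RowGroup]


def scan_num_rows(files: Iterable[ParquetFile]) -> Iterator[int]:
    """Yield the num_rows statistic of each row group.

    No rows are materialized; only the metadata is read.
    """
    for parquet_file in files:
        for row_group in parquet_file.row_groups:
            yield row_group.num_rows


def count_star(files: Iterable[ParquetFile]) -> int:
    """Compute count(*) as a sum over num_rows values.

    The sum starts at 0, so an input with no row groups yields 0.
    """
    total = 0
    for num_rows in scan_num_rows(files):
        total += num_rows
    return total
```

For example, two files whose row groups hold 3, 5, and 2 rows give a count of 10 without any row being read, and an empty input gives 0.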

2017-07-06 01:26:44 +00:00

functional-planner

IMPALA-5036: Parquet count star optimization

2017-07-06 01:26:44 +00:00

functional-query

IMPALA-5036: Parquet count star optimization

2017-07-06 01:26:44 +00:00

hive-benchmark

Refactor testing framework to generate Avro tables.

2014-01-08 10:48:45 -08:00

perf-regression

IMPALA-3311: fix string data coming out of aggs in subplans

2016-05-12 23:06:36 -07:00

targeted-perf

IMPALA-5483: Automatically disable codegen for small queries

2017-06-29 21:14:59 +00:00

targeted-stress

BufferedBlockMgr: bug fixes for stress.

2014-10-06 15:09:13 -07:00

tpcds

IMPALA-5376: Loads all TPC-DS tables

2017-05-27 05:19:53 +00:00

tpcds-insert

[CDH5] Modified TPCDS schema and queries to match Impala TPCDS kit

2014-08-08 02:20:40 -07:00

tpch

IMPALA-4895: Memory limit exceeded in test_outer_joins

2017-02-09 00:50:15 +00:00

tpch_nested

Improve the SQL for nested TPCH-Q18.

2016-03-04 04:35:54 +00:00

README

Move functional data loading to new framework + initial changes for workload directory structure

2014-01-08 10:44:18 -08:00

README

This directory contains Impala test workloads. Each workload directory should follow this layout:

workloads/
   <data set name>/<data set name>_dimensions.csv  <- The test dimension file
   <data set name>/<data set name>_core.csv  <- A test vector file
   <data set name>/<data set name>_pairwise.csv
   <data set name>/<data set name>_exhaustive.csv
   <data set name>/queries/<query test>.test <- The queries for this workload
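
As a rough illustration of the layout above, the following Python sketch lists the file paths a workload would be expected to contain. The function name `expected_workload_files` and the example names `tpch` and `tpch-q1` are assumptions for illustration only; they are not part of the Impala test framework.

```python
from pathlib import PurePosixPath
from typing import List

# File suffixes taken from the layout described in this README.
CSV_SUFFIXES = ("dimensions", "core", "pairwise", "exhaustive")


def expected_workload_files(data_set: str, query_tests: List[str]) -> List[str]:
    """Return the relative paths implied by the layout for one data set."""
    root = PurePosixPath("workloads") / data_set
    # <data set name>/<data set name>_<suffix>.csv
    paths = [root / f"{data_set}_{suffix}.csv" for suffix in CSV_SUFFIXES]
    # <data set name>/queries/<query test>.test
    paths += [root / "queries" / f"{name}.test" for name in query_tests]
    return [str(path) for path in paths]


if __name__ == "__main__":
    # "tpch" and "tpch-q1" are example names only.
    for path in expected_workload_files("tpch", ["tpch-q1"]):
        print(path)
```

A helper like this could be used to check that a new workload directory has every file the layout calls for before tests are run against it.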