I am temporarily disabling the TPC-H planner tests that require data to be
pre-loaded in temp tables. This resolves an ordering problem where the TPC-H
query tests had to be run before the TPC-H planner tests. I have filed
"IMP-171" to track the work to re-enable these tests.
"distinctpc" and "distinctpcsa".
We gathered statistics on an internal dataset (all columns) which is part of
our regression data. It is roughly 400 MB, with ~100 columns of
int/bigint/string types.
On Hive, computing the statistics took roughly 64 seconds.
With this Impala implementation it takes 35 seconds; marking the functions in
hash-util.h as inline (which we currently do not) brings that down to 24-26 seconds.
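For context, "distinctpc" and "distinctpcsa" belong to the Flajolet-Martin probabilistic counting family of distinct-value estimators. Below is a minimal Python sketch of probabilistic counting with stochastic averaging; the bitmap count, hash choice, and constants are illustrative, not Impala's actual implementation.

import hashlib

NUM_BITMAPS = 64
PHI = 0.77351  # Flajolet-Martin correction factor

def trailing_zeros(x):
    # Position of the lowest set bit; cap at 32 for x == 0.
    return (x & -x).bit_length() - 1 if x else 32

def estimate_ndv(values):
    bitmaps = [0] * NUM_BITMAPS
    for v in values:
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        idx = h % NUM_BITMAPS                       # stochastic averaging: pick a bitmap
        bitmaps[idx] |= 1 << trailing_zeros(h // NUM_BITMAPS)
    # R = index of the lowest unset bit, averaged over all bitmaps.
    total = sum(next(i for i in range(64) if not (b >> i) & 1) for b in bitmaps)
    return NUM_BITMAPS / PHI * 2 ** (total / NUM_BITMAPS)

print(round(estimate_ndv(range(10000))))  # prints an estimate near 10000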
Change-Id: Ibcba3c9512b49e8b9eb0c2fec59dfd27f14f84c3
Fixed a problem where we did not properly look up the dataset associated with
the given workload when the workload name contained non-word characters
(anything outside a-z and _). Also cut down on the execution time of the
hive-benchmark workload under the "core" vector.
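A hypothetical illustration of the lookup bug: a pattern built from \w only matches [A-Za-z0-9_], so a workload name such as "hive-benchmark" would never match. The patterns below are illustrative, not the script's actual code.

import re

old_pattern = re.compile(r"\w+")      # misses names containing '-'
new_pattern = re.compile(r"[\w-]+")   # also accepts hyphenated names

assert old_pattern.fullmatch("hive-benchmark") is None
assert new_pattern.fullmatch("hive-benchmark") is not None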
This change updates the run-benchmark script to enable it to target one or more
workloads. Now benchmarks can be run like:
./run-benchmark --workloads=hive-benchmark,tpch
We look up the workload in the workloads directory, then read the associated
query .test files and start executing them.
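A minimal sketch of that lookup; the directory layout, queries subdirectory, and helper name are assumptions for illustration, not the run-benchmark script's actual code.

import glob
import os

def find_test_files(workload, root="testdata/workloads"):
    """Return the query .test files associated with a workload."""
    return sorted(glob.glob(os.path.join(root, workload, "queries", "*.test")))

for path in find_test_files("hive-benchmark"):
    print(path)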
To ensure the queries are not duplicated between benchmark and query tests, I
moved all existing queries (under fe/src/test/resources/*) to the workloads
directory. You do NOT need to look through all the .test files; I've just moved
them. The one new file is 'hive-benchmark.test', which contains the hive
benchmark queries.
Also added support for generating schemas for different scale factors, as well
as executing against these scale factors. For example, say we have a dataset
with a scale factor called "SF3". We would first generate the schema using:
./generate_schema_statements --workload=<workload> --scale_factor="SF3"
This will create tables with names that are unique from those of the other
scale factors.
Run the generated .sql file to load the data. Alternatively, the data can be
loaded by running a new python script:
./bin/load-data.py -w <workload1>,<workload2> -e <exploration strategy> -s [scale factor]
For example: ./bin/load-data.py -w tpch -e core -s SF3
Then run the benchmark against this scale factor:
./run-benchmark --workloads=<workload> --scale_factor=SF3
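One plausible way the scale-factor-unique table names could be formed is to fold the scale factor into the table name. This is a sketch under that assumption; it is not the generator's actual naming convention.

def scaled_table_name(base, scale_factor=None, suffix=None):
    # Hypothetical scheme: base name + scale factor + file format suffix.
    parts = [base]
    if scale_factor:
        parts.append(scale_factor.lower())
    if suffix:
        parts.append(suffix)
    return "_".join(parts)

print(scaled_table_name("lineitem", "SF3"))              # lineitem_sf3
print(scaled_table_name("lineitem", "SF3", "seq_snap"))  # lineitem_sf3_seq_snap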
This changeset also includes a few other minor tweaks to some of the test
scripts.
Change-Id: Ife8a8d91567d75c9612be37bec96c1e7780f50d6
Moved this out of the data loading framework because it is kind of a special
case. I will consider how we can update the framework to address mixed format
tables.
This change moves (almost) all the functional data loading to the new data
loading framework. This removes the need for the create.sql, load.sql, and
load-raw-data.sql files. Instead we just have a single schema template file:
testdata/datasets/functional/functional_schema_template.sql
This template can be used to generate the schema for all file formats and
compression variations. It should also help make loading data easier. Now you
can run:
can run:
bin/load-impala-data.sh "query-test" "exhaustive"
And get all data needed for running the query tests.
This change also includes the initial changes for the new dataset/workload
directory structure. The new structure looks like:
testdata/workload <- Will contain query files and test vectors/dimensions
testdata/datasets <- Will contain the data files and schema templates
Note: This is the first part of the change to this directory structure - it's
not yet complete.
Rewrote the TPC-H queries to be more performant by removing subqueries,
re-ordering joins, etc. This work was done referring to the official TPC-H
documentation rather than the work done for Hive, which overused subqueries.
Also added a planner test for TPC-H and made a few changes so the TPC-H tests
can run against ImpalaD.
This adds most of the Hive TPCH queries into the functional Impala tests. This
code review doesn't actually include the TPCH data. The data set is relatively
large. Instead I updated scripts to copy the data from a data host.
This change has a few parts:
1) Update the benchmark schema generation/test vector generation to be more
generic. This way we can use the same schema creation/data loading steps for
TPCH as we do for benchmark tests.
2) Add in schema template for the TPCH workload along with test vectors and
dimensions which are used for schema generation.
3) Add a new test file for each TPC-H query. The Hive TPCH work broke the
queries down to generate some "temp" tables, then execute joins/selects
against these temp tables. Since creating the temp tables does some real work,
it is good to execute these via Impala. Each test (a) runs all the insert
statements to generate the temp tables, then (b) runs the additional TPCH queries.
4) Updated all the TPCH insert statements and queries to be parameterized on
the $TABLE name, so we can run the tests across all combinations of file
format/compression/etc. (see the sketch after this list).
5) Updated data loading.
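The sketch referenced in point 4: a minimal illustration of $TABLE parameterization using Python's string.Template. The query text and suffix values are illustrative, not the actual .test file contents.

from string import Template

# Hypothetical .test file snippet parameterized on $TABLE.
query = Template("select l_returnflag, count(*) from lineitem$TABLE group by l_returnflag")

# Expand for a couple of file format/compression combinations.
print(query.substitute(TABLE=""))           # text/none: no suffix
print(query.substitute(TABLE="_seq_snap"))  # sequence file + snappy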
Change-Id: I6891acc4c7464eaf1dc7dbbb532ddbeb6c259bab
This change updates the Impala performance schema and test vector generation
techniques. It also migrates the existing benchmark scripts from Ruby to
Python. The change has a few parts:
1) Conversion of test vector generation and benchmark statement generation from
Ruby to Python. As a result, the benchmark test vector and dimension files were
also updated to CSV format (Python doesn't have built-in YAML support).
2) Standardize the naming of benchmark tables to (somewhat) match the query
tests. In general the form is:
* If file_format=text and compression=none, do not use a table suffix
* Abbreviate sequence file as (seq), rc file as (rc), etc.
* If using BLOCK compression, don't append anything to the table name; if using
'record' compression, append 'record'
3) Created a new way of adding new schemas: the
benchmark_schema_template.sql file. The generate_benchmark_statements.py script
reads this in and breaks it up into sections (a parsing sketch follows this
list). The section format is:
====
Data Set Name
---
BASE table name
---
CREATE STATEMENT Template
---
INSERT ... SELECT * format
---
LOAD Base statement
---
LOAD STATEMENT Format
Where BASE table is a table the other file formats/compression types can be
generated from. This would generally be a local file.
The thinking is that if the files already exist in HDFS then we can just load
the file directly rather than issue an INSERT ... SELECT * statement. The
generate_benchmark_statements.py script has been updated to use this new
template, as well as to query HDFS for each table to determine how it should be
created. It then outputs a file called load-benchmark-*-generated.sql.
Since this file is generated dynamically, we can remove the old benchmark
statement files.
4) This has been hooked into load-benchmark-data.sh, and run_query has been
updated to use the new format as well.
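The parsing sketch referenced in point 3: a minimal Python sketch of splitting the template into per-dataset sections, following the section format shown above. The field names and helper are assumptions for illustration, not generate_benchmark_statements.py's actual code.

FIELDS = ["data_set", "base_table", "create", "insert", "load_base", "load"]

def parse_template(path):
    """Split benchmark_schema_template.sql into one dict per data set."""
    with open(path) as f:
        text = f.read()
    sections = []
    for block in text.split("====")[1:]:             # '====' starts each data set
        parts = [p.strip() for p in block.split("---")]
        sections.append(dict(zip(FIELDS, parts)))
    return sections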
At the same time, this patch removes the partitionKeyRegex in favour
of explicitly sending a list of literal expressions for each file path
from the front end.
generate and execute the benchmark queries.
Updated to remove Lzo compression and add coverage of 'DefaultCodec'.
Fixed up make_release to more cleanly list queries.
The scripts consist of a few parts:
* generate_test_vectors.rb - This is a ruby script (I will convert it to python in a later checkin) which reads in a dimension file (in this case benchmark_dimensions.yaml) describing the different dimensions we want to explore. Currently the dimensions are: data set, file format, and compression algorithm. The script writes a list of "test vectors" to a file, exploring the input with both a fully exhaustive and a pairwise exploration strategy (a sketch of the two strategies follows this list). The goal is a reduced set of test vectors that still provides coverage but doesn't take as long to run as the exhaustive set. Note that I am checking in the vector outputs, so this script only needs to be run if we want to generate a new set of vectors. I am also checking in a vector called benchmark_core.vector which just describes the current benchmark behavior (no compression, text file). This will allow running benchmarks with the coverage that exists today.
* testdata/bin/generate_benchmark_statements.rb - This script reads in the vector output and generates the statements required to create the benchmark schema and load the data into the benchmark tables with the proper compression/file format settings. It outputs "sql" files "create-benchmark-*.sql" and "load-benchmark-*.sql".
* updated the load-benchmark-data.sh to take parameters for which set of data to load (pairwise or exhaustive). If no parameters are passed, it just loads the "core" data, which is what it does now.
* Updated run_benchmark.py so that you can specify what type of run you want to do - core, exhaustive, or pairwise. It will read in the corresponding vector file and generate the proper table names for the queries to select the proper data. Also added the functionality to specify a results file to compare against.
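The sketch referenced above: a minimal Python illustration of the exhaustive and pairwise exploration strategies, using a greedy cover of all value pairs. The dimension values are examples, not the actual benchmark_dimensions.yaml contents.

from itertools import combinations, product

dimensions = {
    "data_set": ["grep10gb", "web"],
    "file_format": ["text", "seq", "rc"],
    "compression": ["none", "gzip", "snappy"],
}
names = list(dimensions)

def exhaustive(dims):
    """Every combination of dimension values."""
    return [dict(zip(names, combo)) for combo in product(*dims.values())]

def pairwise(dims):
    """Greedy subset of the exhaustive vectors covering every value pair."""
    def pairs_of(vec):
        return {((a, vec[a]), (b, vec[b])) for a, b in combinations(names, 2)}
    candidates = exhaustive(dims)
    uncovered = set().union(*(pairs_of(v) for v in candidates))
    chosen = []
    while uncovered:
        # Pick the candidate covering the most remaining uncovered pairs.
        best = max(candidates, key=lambda v: len(pairs_of(v) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen

print(len(exhaustive(dimensions)), "exhaustive vectors")  # 18
print(len(pairwise(dimensions)), "pairwise vectors")      # far fewer, e.g. ~9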
Overall notes: By default the current behavior and coverage is maintained when running these scripts without any parameters. Everything needed is checked in so the ruby scripts don't need to be run unless we want to add dimensions at a later time.
- adding flag --backends="host:port,host:port,...", which TestEnv uses to create clients for the ImpalaBackendServices
running on those nodes; this is just a hack to be able to use runquery for multi-node execution
- impalad-main.cc: main() of impala daemon, which will export both ImpalaService and
ImpalaBackendService (but at the moment only does the latter; everything related to ImpalaService is commented out)
- com.cloudera.impala.service.Frontend: API to the frontend functionality; invoked by impalad via jni; ignore for now
Adding classes Scheduler, SimpleScheduler, ExecEnv, TestExecEnv, BackendClientCache.