impala

mirror of https://github.com/apache/impala.git synced 2025-12-19 18:12:08 -05:00

Author	SHA1	Message	Date
wzhou-code	08f8a30025	IMPALA-12910: Support running TPCH/TPCDS queries for JDBC tables This patch adds script to create external JDBC tables for the dataset of TPCH and TPCDS, and adds unit-tests to run TPCH and TPCDS queries for external JDBC tables with Impala-Impala federation. Note that JDBC tables are mapping tables, they don't take additional disk spaces. It fixes the race condition when caching of SQL DataSource objects by using a new DataSourceObjectCache class, which checks reference count before closing SQL DataSource. Adds a new query-option 'clean_dbcp_ds_cache' with default value as true. When it's set as false, SQL DataSource object will not be closed when its reference count equals 0 and will be kept in cache until the SQL DataSource is idle for more than 5 minutes. Flag variable 'dbcp_data_source_idle_timeout_s' is added to make the duration configurable. java.sql.Connection.close() fails to remove a closed connection from connection pool sometimes, which causes JDBC working threads to wait for available connections from the connection pool for a long time. The work around is to call BasicDataSource.invalidateConnection() API to close a connection. Two flag variables are added for DBCP configuration properties 'maxTotal' and 'maxWaitMillis'. Note that 'maxActive' and 'maxWait' properties are renamed to 'maxTotal' and 'maxWaitMillis' respectively in apache.commons.dbcp v2. Fixes a bug for database type comparison since the type strings specified by user could be lower case or mix of upper/lower cases, but the code compares the types with upper case string. Fixes issue to close SQL DataSource object in JdbcDataSource.open() and JdbcDataSource.getNext() when some errors returned from DBCP APIs or JDBC drivers. testdata/bin/create-tpc-jdbc-tables.py supports to create JDBC tables for Impala-Impala, Postgres and MySQL. 
The following sample commands create TPCDS JDBC tables for Impala-Impala federation with the remote coordinator running at 10.19.10.86, and for a Postgres server running at 10.19.10.86: ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \ --jdbc_db_name=tpcds_jdbc --workload=tpcds \ --database_type=IMPALA --database_host=10.19.10.86 --clean ${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \ --jdbc_db_name=tpcds_jdbc --workload=tpcds \ --database_type=POSTGRES --database_host=10.19.10.86 \ --database_name=tpcds --clean TPCDS tests for JDBC tables run only for release/exhaustive builds. TPCH tests for JDBC tables run for core and exhaustive builds, except Dockerized builds. Remaining Issues: - tpcds-decimal_v2-q80a failed with returned rows not matching expected results for some decimal values. This will be fixed in IMPALA-13018. Testing: - Passed core tests. - Passed query_test/test_tpcds_queries.py in release/exhaustive build. - Manually verified that only one SQL DataSource object was created for test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since query option 'clean_dbcp_ds_cache' was set as false, and the SQL DataSource object was closed by cleanup thread. Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a Reviewed-on: http://gerrit.cloudera.org:8080/21304 Reviewed-by: Abhishek Rawat <arawat@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2024-05-02 02:14:20 +00:00
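The reference-counted caching described in IMPALA-12910 can be sketched roughly as follows. This is a minimal Python illustration, not Impala's Java DataSourceObjectCache: the class names `RefCountedCache` and `FakeDataSource`, the `clean_on_zero` parameter (standing in for the 'clean_dbcp_ds_cache' query option), and the `evict_idle` method are all hypothetical.

```python
import threading
import time


class FakeDataSource(object):
    """Hypothetical stand-in for a pooled SQL DataSource."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class RefCountedCache(object):
    """Hypothetical sketch; not Impala's DataSourceObjectCache."""

    def __init__(self, idle_timeout_s=300, clean_on_zero=True):
        self._lock = threading.Lock()
        self._entries = {}  # key -> [resource, ref_count, last_released]
        self._idle_timeout_s = idle_timeout_s
        self._clean_on_zero = clean_on_zero

    def acquire(self, key, factory):
        # Create the resource on first use, then bump its reference count.
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = [factory(), 0, time.time()]
                self._entries[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, key):
        # Close only when nobody holds a reference and eager cleanup is on.
        with self._lock:
            entry = self._entries[key]
            entry[1] -= 1
            entry[2] = time.time()
            if entry[1] == 0 and self._clean_on_zero:
                del self._entries[key]
                entry[0].close()

    def evict_idle(self, now=None):
        # Cleanup pass: close unreferenced entries idle past the timeout.
        now = time.time() if now is None else now
        with self._lock:
            for key, entry in list(self._entries.items()):
                if entry[1] == 0 and now - entry[2] > self._idle_timeout_s:
                    del self._entries[key]
                    entry[0].close()


cache = RefCountedCache(idle_timeout_s=300, clean_on_zero=False)
first = cache.acquire("jdbc:example", FakeDataSource)
second = cache.acquire("jdbc:example", FakeDataSource)
cache.release("jdbc:example")
cache.release("jdbc:example")
still_open = not first.closed  # kept in cache because clean_on_zero is False
cache.evict_idle(now=time.time() + 301)  # simulate more than 5 idle minutes
```

Checking the reference count under a lock before closing is what avoids closing an object another thread is still using; the idle pass plays the role the message assigns to the cleanup thread.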
Eyizoha	2f06a7b052	IMPALA-10798: Initial support for reading JSON files Prototype of HdfsJsonScanner implemented based on rapidjson, which supports scanning data from splitting json files. The scanning of JSON data is mainly completed by two parts working together. The first part is the JsonParser responsible for parsing the JSON object, which is implemented based on the SAX-style API of rapidjson. It reads data from the char stream, parses it, and calls the corresponding callback function when encountering the corresponding JSON element. See the comments of the JsonParser class for more details. The other part is the HdfsJsonScanner, which inherits from HdfsScanner and provides callback functions for the JsonParser. The callback functions are responsible for providing data buffers to the Parser and converting and materializing the Parser's parsing results into RowBatch. It should be noted that the parser returns numeric values as strings to the scanner. The scanner uses the TextConverter class to convert the strings to the desired types, similar to how the HdfsTextScanner works. This is an advantage compared to using number value provided by rapidjson directly, as it eliminates concerns about inconsistencies in converting decimals (e.g. losing precision). Added a startup flag, enable_json_scanner, to be able to disable this feature if we hit critical bugs in production. Limitations - Multiline json objects are not fully supported yet. It is ok when each file has only one scan range. However, when a file has multiple scan ranges, there is a small probability of incomplete scanning of multiline JSON objects that span ScanRange boundaries (in such cases, parsing errors may be reported). For more details, please refer to the comments in the 'multiline_json.test'. - Compressed JSON files are not supported yet. - Complex types are not supported yet. Tests - Most of the existing end-to-end tests can run on JSON format. 
- Add TestQueriesJsonTables in test_queries.py for testing multiline, malformed, and overflow in JSON. Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Reviewed-on: http://gerrit.cloudera.org:8080/19699 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2023-09-05 16:55:41 +00:00
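The reasoning in IMPALA-10798 for passing numeric values to the scanner as strings can be illustrated by analogy. The sketch below uses Python's standard json module rather than rapidjson or Impala's TextConverter, and the field name and value are made up; it only shows why converting from the original text preserves digits that a double cannot hold.

```python
import json
from decimal import Decimal

# Hypothetical input with more digits than a double can represent.
raw = '{"amount": 12345678901234567890.123456789}'

# Default parsing converts the number to a binary double first.
as_float = json.loads(raw)["amount"]

# parse_float receives the original text, so the target type does the conversion.
as_decimal = json.loads(raw, parse_float=Decimal)["amount"]

precision_lost = Decimal(as_float) != as_decimal
```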
Joe McDonnell	eb66d00f9f	IMPALA-11974: Fix lazy list operators for Python 3 compatibility Python 3 changes list operators such as range, map, and filter to be lazy. Some code that expects the list operators to happen immediately will fail. e.g. Python 2: range(0,5) == [0,1,2,3,4] True Python 3: range(0,5) == [0,1,2,3,4] False The fix is to wrap locations with list(). i.e. Python 3: list(range(0,5)) == [0,1,2,3,4] True Since the base operators are now lazy, Python 3 also removes the old lazy versions (e.g. xrange, ifilter, izip, etc). This uses future's builtins package to convert the code to the Python 3 behavior (i.e. xrange -> future's builtins.range). Most of the changes were done via these futurize fixes: - libfuturize.fixes.fix_xrange_with_import - lib2to3.fixes.fix_map - lib2to3.fixes.fix_filter This eliminates the pylint warnings: - xrange-builtin - range-builtin-not-iterating - map-builtin-not-iterating - zip-builtin-not-iterating - filter-builtin-not-iterating - reduce-builtin - deprecated-itertools-function Testing: - Ran core job Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f Reviewed-on: http://gerrit.cloudera.org:8080/19589 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
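The lazy-operator behavior described in IMPALA-11974 can be reproduced with a short sketch, assuming a Python 3 interpreter. The variable names are illustrative only.

```python
# Python 3: range, map, and filter return lazy objects, not lists.
lazy_range = range(0, 5)
lazy_map = map(lambda x: x * 2, [1, 2, 3])
lazy_filter = filter(lambda x: x % 2 == 0, [1, 2, 3, 4])

# A range object never compares equal to a list.
range_equals_list = lazy_range == [0, 1, 2, 3, 4]
# Wrapping in list() restores the Python 2 result.
wrapped_equals_list = list(lazy_range) == [0, 1, 2, 3, 4]

doubled = list(lazy_map)
evens = list(lazy_filter)

# A map object is a one-shot iterator: a second pass yields nothing.
second_pass = list(lazy_map)
```

The one-shot behavior in the last line is a common source of bugs in code that iterates over a map or filter result more than once.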
Joe McDonnell	82bd087fb1	IMPALA-11973: Add absolute_import, division to all eligible Python files This takes steps to make Python 2 behave like Python 3 as a way to flush out issues with running on Python 3. Specifically, it handles two main differences: 1. Python 3 requires absolute imports within packages. This can be emulated via "from __future__ import absolute_import" 2. Python 3 changed division to "true" division that doesn't round to an integer. This can be emulated via "from __future__ import division" This changes all Python files to add imports for absolute_import and division. For completeness, this also includes print_function in the import. I scrutinized each old-division location and converted some locations to use the integer division '//' operator if it needed an integer result (e.g. for indices, counts of records, etc). Some code was also using relative imports and needed to be adjusted to handle absolute_import. This fixes all Pylint warnings about no-absolute-import and old-division, and these warnings are now banned. Testing: - Ran core tests Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b Reviewed-on: http://gerrit.cloudera.org:8080/19588 Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com> Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>	2023-03-09 17:17:57 +00:00
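The division change described in IMPALA-11973 can be sketched as follows, assuming Python 3 semantics. On Python 2 the same behavior would require "from __future__ import division" as the first statement of the file. The variable names are illustrative only.

```python
num_rows = 7
num_batches = 2

# True division always produces a float.
true_quotient = num_rows / num_batches
# Floor division keeps an integer result, as needed for indices and counts.
floor_quotient = num_rows // num_batches

rows = ["a", "b", "c", "d", "e", "f", "g"]
middle_row = rows[len(rows) // 2]

# Using '/' for an index fails because list indices must be integers.
try:
    rows[len(rows) / 2]
    float_index_ok = True
except TypeError:
    float_index_ok = False
```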
stiga-huang	818cd8fa27	IMPALA-5717: Support for reading ORC data files This patch integrates the orc library into Impala and implements HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner supplies input needed from the orc-reader, tracks memory consumption of the reader and transfers the reader's output (orc::ColumnVectorBatch) into impala::RowBatch. The ORC version we used is release-1.4.3. A startup option --enable_orc_scanner is added for this feature. It's set to true by default. Setting it to false will fail queries on ORC tables. Currently, we only support reading primitive types. Writing into ORC tables is not supported either. Tests - Most of the end-to-end tests can run on ORC format. - Add tpcds, tpch tests for ORC. - Add some ORC specific tests. - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library is not robust for corrupt files (ORC-315). Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4 Reviewed-on: http://gerrit.cloudera.org:8080/9134 Reviewed-by: Quanlong Huang <huangquanlong@gmail.com> Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com> Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>	2018-04-11 05:13:02 +00:00
Grant Henke	03794389fc	IMPALA-6551: Change Kudu TPCDS and TPCH columns to DECIMAL Before Kudu supported DECIMAL columns, the TPCDS and TPCH columns were adjusted to use DOUBLE in place of DECIMAL. This patch undoes that change now that Kudu supports DECIMAL. Testing: - Updated concurrent_select.py - Updated test_tpch_queries.py - Exercised by the Kudu planner tests Change-Id: I2f7e4464dc6705cadd610a82c459390a9c0dfe4f Reviewed-on: http://gerrit.cloudera.org:8080/9484 Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com> Tested-by: Impala Public Jenkins	2018-03-14 21:38:06 +00:00
David Knupp	f590bc0da6	IMPALA-4750: Rename test infra classes so they don't mimic test classes. This patch addresses warning messages from pytest re: the imported TestMatrix, TestVector, and TestDimension classes, which were being collected as potential test classes. The fix was to simply prepend the class names with Impala- git grep -l 'TestDimension' | xargs \ sed -i 's/TestDimension/ImpalaTestDimension/g' git grep -l 'TestMatrix' | xargs \ sed -i 's/TestMatrix/ImpalaTestMatrix/g' git grep -l 'TestVector' | xargs \ sed -i 's/TestVector/ImpalaTestVector/g' The tests all passed in an exhaustive run on the upstream jenkins server: http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/ Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1 Reviewed-on: http://gerrit.cloudera.org:8080/5794 Reviewed-by: Michael Brown <mikeb@cloudera.com> Reviewed-by: Jim Apple <jbapple-impala@apache.org> Tested-by: Impala Public Jenkins	2017-01-26 23:40:22 +00:00
Michael Ho	1e306211d0	IMPALA-3838, IMPALA-4495: Codegen EvalRuntimeFilters() and fixes filter stats updates This change codegens HdfsParquetScanner::EvalRuntimeFilters() by unrolling its loop, codegen'ing the expression evaluation of the runtime filter and replacing some type information with constants in the hashing function of runtime filter to avoid branching at runtime. This change also fixes IMPALA-4495 by not counting a row as 'considered' in the filter stats before the filter arrives. This avoids unnecessarily marking a runtime filter as ineffective before it's even used. With this change, TPCDS-Q88 improves by 13-14%. primitive_broadcast_join_1 improves by 24%. Change-Id: I27114869840e268d17e91d6e587ef811628e3837 Reviewed-on: http://gerrit.cloudera.org:8080/4833 Reviewed-by: Michael Ho <kwho@cloudera.com> Tested-by: Internal Jenkins	2016-11-23 12:48:47 +00:00
Dan Hecht	ffa7829b70	IMPALA-3918: Remove Cloudera copyrights and add ASF license header For files that have a Cloudera copyright (and no other copyright notice), make changes to follow the ASF source file header policy here: http://www.apache.org/legal/src-headers.html#headers Specifically: 1) Remove the Cloudera copyright. 2) Modify NOTICE.txt according to http://www.apache.org/legal/src-headers.html#notice to follow that format and add a line for Cloudera. 3) Replace or add the existing ASF license text with the one given on the website. Much of this change was automatically generated via: git grep -li 'Copyright.Cloudera' > modified_files.txt cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.Cloudera#i;' cat modified_files_txt | xargs fix_apache_license.py [1] Some manual fixups were performed following those steps, especially when license text was completely missing from the file. [1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor modification to ORIG_LICENSE to match Impala's license text. Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86 Reviewed-on: http://gerrit.cloudera.org:8080/3779 Reviewed-by: Dan Hecht <dhecht@cloudera.com> Tested-by: Internal Jenkins	2016-08-09 08:19:41 +00:00
Dimitris Tsirogiannis	6fbd35fa87	Enable TPC-H workload for Kudu tables With this commit we enable loading of TPC-H data in Kudu tables and running the 22 TPC-H queries against Kudu. Since Kudu doesn't support the decimal data type, we had to modify the queries by using round() function and update the test results. Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289 Reviewed-on: http://gerrit.cloudera.org:8080/3789 Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com> Tested-by: Internal Jenkins	2016-07-28 04:35:11 +00:00
Taras Bobrovytsky	609b80410e	Clean up Python test import statements Many of our test scripts have import statements that look like "from xxx import *". It is a good practice to explicitly name what needs to be imported. This commit implements this practice. Also, unused import statements are removed. Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8 Reviewed-on: http://gerrit.cloudera.org:8080/3444 Reviewed-by: Michael Brown <mikeb@cloudera.com> Tested-by: Internal Jenkins	2016-07-15 23:26:18 +00:00
Casey Ching	074e5b4349	Remove hashbang from non-script python files Many python files had a hashbang and the executable bit set though they were not intended to be run as a standalone script. That makes determining which python files are actually scripts very difficult. A future patch will update the hashbang in real python scripts so they use $IMPALA_HOME/bin/impala-python. Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba Reviewed-on: http://gerrit.cloudera.org:8080/599 Reviewed-by: Casey Ching <casey@cloudera.com> Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com> Tested-by: Internal Jenkins	2015-08-04 05:26:07 +00:00
ishaan	e1ac28c171	Move test_tpch_queries back to query_test and remove an unused test file. Change-Id: I26aa1bbecd0ed4547b60d7e6be7821336b099d21 Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5870 Reviewed-by: Ippokratis Pandis <ipandis@cloudera.com> Tested-by: jenkins	2015-01-28 15:28:13 -08:00
Nong Li	bbce0d62c9	[CDH5] Disable tpch/tpcds query and planner tests. Change-Id: I30ecefe2db9ee7996433cda025f86ef8669284e9	2014-09-20 19:41:42 -07:00
Nong Li	fd35cee887	Reorganize/reduce end to end test time. This patch does a few things: 1) Move the metadata tests into their own folder under tests/. I think it's useful to loosely categorize them so it's easier to run a subset of the tests that are most useful for the changes you are making. 2) Reduce the test vectors for query_tests. We should have identical coverage in the daily exhaustive runs but the normal runs should be much better. In particular, deemphasizing scanner tests since that code is more stable now. 3) Misc test cleanup/consolidate python test files/etc. Change-Id: I03c2f34877aed192c2a50665bd5e15fa85e12f1e Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3831 Tested-by: jenkins Reviewed-by: Nong Li <nong@cloudera.com>	2014-08-17 12:43:57 -07:00
Alex Behm	5798cf7c6f	CDH-18432: Fix assignment of On-clause predicates of semi-/anti-joins. Change-Id: Iad4b772f170ea70ded0747ce55972b0a5194f88a Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3852 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: jenkins	2014-08-17 12:24:51 -07:00
ishaan	dc3dc3dc1e	Enable tpch queries to run on text to unblock the full data load build. Some planner tests depend on data being populated in the tpch tmp tables (in text format). This change re-enables the tpch query tests to run on text so that they pass. Change-Id: I4ed09f55e05cb01978cb6f0808c6395552c0f129 Reviewed-on: http://gerrit.ent.cloudera.com:8080/3176 Reviewed-by: Lenni Kuff <lskuff@cloudera.com> Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-06-19 16:19:13 -07:00
Lenni Kuff	0ac0527643	Reduce test execution time by limiting long running tests to exhaustive exec strategy I looked at the latest run from master and took the tests suites that had long execution times. This cleans those test suites up to either completely disable them on 'core' or add constraints to limit the number of test vectors. It shouldn't impact nightly coverage since we still run the same tests exhaustively. Change-Id: I10c78c35155b00de0c36d9fc0923b2b1fc6b44de Reviewed-on: http://gerrit.ent.cloudera.com:8080/3119 Reviewed-by: Marcel Kornacker <marcel@cloudera.com> Tested-by: jenkins Reviewed-on: http://gerrit.ent.cloudera.com:8080/3125 Reviewed-by: Lenni Kuff <lskuff@cloudera.com>	2014-06-18 16:18:17 -07:00
Lenni Kuff	d698881f71	Improve test run throughput by executing more tests in parallel This updates the tests to run more test cases in parallel and also removes some unneeded "invalidate metadata" calls. This cut down the 'serial' execution time for me by 10+ minutes. Change-Id: I04b4d6db508a26a1a2e4b972bcf74f4d8b9dde5a Reviewed-on: http://gerrit.ent.cloudera.com:8080/757 Tested-by: jenkins Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:53:46 -08:00
ishaan	53cd9eadab	Treat HBase as a file format for functional tests Change-Id: Ia01181a1e10eb108419122d347e9d869a69e8922 Reviewed-on: http://gerrit.ent.cloudera.com:8080/102 Reviewed-by: Ishaan Joshi <ishaan@cloudera.com> Tested-by: Ishaan Joshi <ishaan@cloudera.com>	2014-01-08 10:52:36 -08:00
Nong Li	f60f2d3e50	Implement support for grouped scan ranges in io mgr and integration with parquet.	2014-01-08 10:49:18 -08:00
Lenni Kuff	03f04518d7	Fix planner test failure due to empty tpch temp tables	2014-01-08 10:49:04 -08:00
Lenni Kuff	12d18631e3	Test enhancements: dynamic table format data loading, per-workload exploration strategies	2014-01-08 10:47:07 -08:00
Nong Li	15dfd968fb	Disable tpch-q21 and fix plan output for tpch-q22. We can now generate the temp table for q22 which changes the plan output.	2014-01-08 10:47:03 -08:00
Nong Li	f46c654e01	Enable tpch-q21 and tpch-q22 in tests.	2014-01-08 10:47:03 -08:00
Lenni Kuff	2c12200395	Improve logging verbosity in Python test infrastructure	2014-01-08 10:46:55 -08:00
Lenni Kuff	bed633c1ae	Extract config/metastore creation from buildall + script for loading warehouse snapshot	2014-01-08 10:46:53 -08:00
Lenni Kuff	ef48f65e76	Add test framework for running Impala query tests via Python This is the first set of changes required to start getting our functional test infrastructure moved from JUnit to Python. After investigating a number of options, I decided to go with a python test executor named py.test (http://pytest.org/). It is very flexible, open source (MIT licensed), and will enable us to do some cool things like parallel test execution. As part of this change, we now use our "test vectors" for query test execution. This will be very nice because it means if you load the "core" dataset you know you will be able to run the "core" query tests (specified by --exploration_strategy when running the tests). You will see that now each combination of table format + query exec options is treated like an individual test case. This will make it much easier to debug exactly where something failed. These new tests can be run using the script at tests/run-tests.sh	2014-01-08 10:46:50 -08:00

28 Commits