This patch adds script to create external JDBC tables for the dataset of
TPCH and TPCDS, and adds unit-tests to run TPCH and TPCDS queries for
external JDBC tables with Impala-Impala federation. Note that JDBC
tables are mapping tables, they don't take additional disk spaces.
It fixes the race condition when caching of SQL DataSource objects by
using a new DataSourceObjectCache class, which checks reference count
before closing SQL DataSource.
Adds a new query-option 'clean_dbcp_ds_cache' with default value as
true. When it's set as false, SQL DataSource object will not be closed
when its reference count equals 0 and will be kept in cache until
the SQL DataSource is idle for more than 5 minutes. Flag variable
'dbcp_data_source_idle_timeout_s' is added to make the duration
configurable.
java.sql.Connection.close() fails to remove a closed connection from
connection pool sometimes, which causes JDBC working threads to wait
for available connections from the connection pool for a long time.
The work around is to call BasicDataSource.invalidateConnection() API
to close a connection.
Two flag variables are added for DBCP configuration properties
'maxTotal' and 'maxWaitMillis'. Note that 'maxActive' and 'maxWait'
properties are renamed to 'maxTotal' and 'maxWaitMillis' respectively
in apache.commons.dbcp v2.
Fixes a bug for database type comparison since the type strings
specified by user could be lower case or mix of upper/lower cases, but
the code compares the types with upper case string.
Fixes issue to close SQL DataSource object in JdbcDataSource.open()
and JdbcDataSource.getNext() when some errors returned from DBCP APIs
or JDBC drivers.
testdata/bin/create-tpc-jdbc-tables.py supports to create JDBC tables
for Impala-Impala, Postgres and MySQL.
Following sample commands creates TPCDS JDBC tables for Impala-Impala
federation with remote coordinator running at 10.19.10.86, and Postgres
server running at 10.19.10.86:
${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
--jdbc_db_name=tpcds_jdbc --workload=tpcds \
--database_type=IMPALA --database_host=10.19.10.86 --clean
${IMPALA_HOME}/testdata/bin/create-tpc-jdbc-tables.py \
--jdbc_db_name=tpcds_jdbc --workload=tpcds \
--database_type=POSTGRES --database_host=10.19.10.86 \
--database_name=tpcds --clean
TPCDS tests for JDBC tables run only for release/exhaustive builds.
TPCH tests for JDBC tables run for core and exhaustive builds, except
Dockerized builds.
Remaining Issues:
- tpcds-decimal_v2-q80a failed with returned rows not matching expected
results for some decimal values. This will be fixed in IMPALA-13018.
Testing:
- Passed core tests.
- Passed query_test/test_tpcds_queries.py in release/exhaustive build.
- Manually verified that only one SQL DataSource object was created for
test_tpcds_queries.py::TestTpcdsQueryForJdbcTables since query option
'clean_dbcp_ds_cache' was set as false, and the SQL DataSource object
was closed by cleanup thread.
Change-Id: I44e8c1bb020e90559c7f22483a7ab7a151b8f48a
Reviewed-on: http://gerrit.cloudera.org:8080/21304
Reviewed-by: Abhishek Rawat <arawat@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Prototype of HdfsJsonScanner implemented based on rapidjson, which
supports scanning data from splitting json files.
The scanning of JSON data is mainly completed by two parts working
together. The first part is the JsonParser responsible for parsing the
JSON object, which is implemented based on the SAX-style API of
rapidjson. It reads data from the char stream, parses it, and calls the
corresponding callback function when encountering the corresponding JSON
element. See the comments of the JsonParser class for more details.
The other part is the HdfsJsonScanner, which inherits from HdfsScanner
and provides callback functions for the JsonParser. The callback
functions are responsible for providing data buffers to the Parser and
converting and materializing the Parser's parsing results into RowBatch.
It should be noted that the parser returns numeric values as strings to
the scanner. The scanner uses the TextConverter class to convert the
strings to the desired types, similar to how the HdfsTextScanner works.
This is an advantage compared to using number value provided by
rapidjson directly, as it eliminates concerns about inconsistencies in
converting decimals (e.g. losing precision).
Added a startup flag, enable_json_scanner, to be able to disable this
feature if we hit critical bugs in production.
Limitations
- Multiline json objects are not fully supported yet. It is ok when
each file has only one scan range. However, when a file has multiple
scan ranges, there is a small probability of incomplete scanning of
multiline JSON objects that span ScanRange boundaries (in such cases,
parsing errors may be reported). For more details, please refer to
the comments in the 'multiline_json.test'.
- Compressed JSON files are not supported yet.
- Complex types are not supported yet.
Tests
- Most of the existing end-to-end tests can run on JSON format.
- Add TestQueriesJsonTables in test_queries.py for testing multiline,
malformed, and overflow in JSON.
Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Reviewed-on: http://gerrit.cloudera.org:8080/19699
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3 changes list operators such as range, map, and filter
to be lazy. Some code that expects the list operators to happen
immediately will fail. e.g.
Python 2:
range(0,5) == [0,1,2,3,4]
True
Python 3:
range(0,5) == [0,1,2,3,4]
False
The fix is to wrap locations with list(). i.e.
Python 3:
list(range(0,5)) == [0,1,2,3,4]
True
Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
Most of the changes were done via these futurize fixes:
- libfuturize.fixes.fix_xrange_with_import
- lib2to3.fixes.fix_map
- lib2to3.fixes.fix_filter
This eliminates the pylint warnings:
- xrange-builtin
- range-builtin-not-iterating
- map-builtin-not-iterating
- zip-builtin-not-iterating
- filter-builtin-not-iterating
- reduce-builtin
- deprecated-itertools-function
Testing:
- Ran core job
Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
I scrutinized each old-division location and converted some locations
to use the integer division '//' operator if it needed an integer
result (e.g. for indices, counts of records, etc). Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.
A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.
Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.
Tests
- Most of the end-to-end tests can run on ORC format.
- Add tpcds, tpch tests for ORC.
- Add some ORC specific tests.
- Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
is not robust for corrupt files (ORC-315).
Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang <huangquanlong@gmail.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before Kudu supported DECIMAL columns the TPCDS and TPCH
columns were djusted to use DOUBLE in place of DECIMAL. This
patch undoes that change now that Kudu supports DECIMAL.
Testing:
- Updated concurrent_select.py
- Updated test_tpch_queries.py
- Excersized by the Kudu planner tests
Change-Id: I2f7e4464dc6705cadd610a82c459390a9c0dfe4f
Reviewed-on: http://gerrit.cloudera.org:8080/9484
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prepend
the class names with Impala-
git grep -l 'TestDimension' | xargs \
sed -i 's/TestDimension/ImpalaTestDimension/g'
git grep -l 'TestMatrix' | xargs \
sed -i 's/TestMatrix/ImpalaTestMatrix/g'
git grep -l 'TestVector' | xargs \
sed -i 's/TestVector/ImpalaTestVector/g'
The tests all passed in an exhaustive run on the upstream jenkins
server:
http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/
Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
This change codegens HdfsParquetScanner::EvalRuntimeFilters()
by unrolling its loop, codegen'ing the expression evaluation
of the runtime filter and replacing some type information with
constants in the hashing function of runtime filter to avoid
branching at runtime.
This change also fixes IMPALA-4495 by not counting a row as
'considered' in the filter stats before the filter arrives.
This avoids unnecessarily marking a runtime filter as
ineffective before it's even used.
With this change, TPCDS-Q88 improves by 13-14%.
primitive_broadcast_join_1 improves by 24%.
Change-Id: I27114869840e268d17e91d6e587ef811628e3837
Reviewed-on: http://gerrit.cloudera.org:8080/4833
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
With this commit we enable loading of TPC-H data in Kudu tables and
running the 22 TPC-H queries against Kudu. Since Kudu doesn't support
the decimal data type, we had to modify the queries by using round()
function and update the test results.
Change-Id: I3a5de71fefa92a78970226d8f49ef445d28f9289
Reviewed-on: http://gerrit.cloudera.org:8080/3789
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set though
they were not intended to be run a standalone script. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This patch does a few things:
1) Move the metadata tests into their own folder under tests/. I think it's useful to
loosely categorize them so it's easier to run a subset of the tests that are most
useful for the changes you are making.
2) Reduce the test vectors for query_tests. We should have identical coverage in
the daily exhaustive runs but the normal runs should be much better. In particular,
deemphasizing scanner tests since that code is more stable now.
3) Misc test cleanup/consolidate python test files/etc.
Change-Id: I03c2f34877aed192c2a50665bd5e15fa85e12f1e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3831
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Some planner tests depend on data being populated in the tpch tmp tables (in text format)
. This change re-enables the tpch query tests to run on text so that they pass.
Change-Id: I4ed09f55e05cb01978cb6f0808c6395552c0f129
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3176
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
I looked at the latest run from master and took the tests suites that had long
execution times. This cleans those test suites up to either completely disable them
on 'core' or add constraints to limit the number of test vectors. It shouldn't impact
nightly coverage since we still run the same tests exhaustively.
Change-Id: I10c78c35155b00de0c36d9fc0923b2b1fc6b44de
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3119
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3125
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
This updates the tests to run more test cases in parallel and also removes some
unneeded "invalidate metadata" calls. This cut down the 'serial' execution time
for me by 10+ minutes.
Change-Id: I04b4d6db508a26a1a2e4b972bcf74f4d8b9dde5a
Reviewed-on: http://gerrit.ent.cloudera.com:8080/757
Tested-by: jenkins
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
This is the first set of changes required to start getting our functional test
infrastructure moved from JUnit to Python. After investigating a number of
option, I decided to go with a python test executor named py.test
(http://pytest.org/). It is very flexible, open source (MIT licensed), and will
enable us to do some cool things like parallel test execution.
As part of this change, we now use our "test vectors" for query test execution.
This will be very nice because it means if load the "core" dataset you know you
will be able to run the "core" query tests (specified by --exploration_strategy
when running the tests).
You will see that now each combination of table format + query exec options is
treated like an individual test case. this will make it much easier to debug
exactly where something failed.
These new tests can be run using the script at tests/run-tests.sh