This patch implements min-max filters for runtime filters. Each
runtime filter generates either a bloom filter or a min-max filter,
depending on whether it has HDFS or Kudu targets, respectively.
In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.
Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.
Min-max filters are populated by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.
For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.
Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.
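As an illustration, here is a minimal Python sketch of the min-max
filter semantics described above (hypothetical names; the real
implementation is C++ with codegen):

  class MinMaxFilter(object):
      # Tracks the min and max build-side values for one join predicate.
      # String bounds are additionally truncated at 1024 chars; a
      # truncated max bound must be rounded up to stay conservative
      # (a detail elided here).
      def __init__(self):
          self.min = None
          self.max = None

      def insert(self, val):
          if val is None:
              return  # NULLs never widen the bounds
          if self.min is None or val < self.min:
              self.min = val
          if self.max is None or val > self.max:
              self.max = val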
Functional Testing:
- Added new planner tests and updated the old ones. (In the old tests,
many runtime filters are renumbered because min-max filters are now
always generated and consume runtime filter ids even when they don't
end up getting assigned.)
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.
Perf Testing:
- All tests were run on a Kudu stress cluster (10 nodes) against
tpch_100_kudu; timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
(full self join of lineitem). The difference in running time was
negligible - 24.46s with filters on, 24.15s with filters off for
a ~1% slowdown.
- Ran a contrived query with a filter that eliminates all rows (self
join on lineitem with a join condition that never matches). The
filters resulted in a significant speedup - 0.26s with filters on,
1.46s with filters off for a ~5.6x speedup. This query is added to
targeted-perf.
Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Reviewed-on: http://gerrit.cloudera.org:8080/7793
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
Uses a thread pool to issue many compute stats commands to Impala in
parallel, rather than doing it serially. Where it was obvious, I
combined multiple stats commands into fewer, to reduce the number
of "show databases" and serialized "show tables" commands.
This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:
old, cold: 7m44s
new, cold: 1m42s
old, warm: 1m23s
new, warm: 48s
The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.
Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Reviewed-on: http://gerrit.cloudera.org:8080/8354
Reviewed-by: David Knupp <dknupp@cloudera.com>
Tested-by: Impala Public Jenkins
Seems to have broken with some recent commits.
Change-Id: I9c22e197662228158d7935ebfb12d9b3691eb499
Reviewed-on: http://gerrit.cloudera.org:8080/6151
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
The shell uses Thrift's TSSLSocket to negotiate secure connections to
Impala. This socket uses a variable SSL_VERSION to determine which SSL
and TLS protocol versions it will connect to.
SSL_VERSION was hardcoded to be PROTOCOL_TLSv1, which only supports
TLSv1 servers and no other protocol version. Change the allowed version
to be PROTOCOL_SSLv23, which supports any TLS or SSL protocol. We rely
on the server not to allow SSLv2 or v3 connections.
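The essence of the change, as a sketch (thrift 0.9.x exposes the
protocol as a class attribute):

  import ssl
  from thrift.transport import TSSLSocket

  # Old behavior: SSL_VERSION = ssl.PROTOCOL_TLSv1, so only TLSv1
  # handshakes succeed. Let client and server negotiate any protocol
  # version instead; the server is expected to have SSLv2/v3 disabled.
  TSSLSocket.TSSLSocket.SSL_VERSION = ssl.PROTOCOL_SSLv23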
Testing: Added a new custom cluster test to confirm that the shell can
connect to a TLSv1.2 cluster. Confirmed that the test is correctly
skipped on machines with an old version of OpenSSL that does not support
TLSv1.2.
Change-Id: I5487f82d110676b9c3c7a5305931da00c7f68ca0
Reviewed-on: http://gerrit.cloudera.org:8080/7675
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The PARQUET_FILE_SIZE query option doesn't work with ADLS because the
AdlFileSystem doesn't have a notion of block sizes, and Impala depends
on the filesystem remembering the block size, which is then used as the
target Parquet file size (this is done for HDFS so that the Parquet file
size and block size match even if PARQUET_FILE_SIZE isn't a valid
block size).
We special-case ADLS just as we do for S3 to bypass the
FileSystem block size and instead use the requested
PARQUET_FILE_SIZE as the output partition's block_size (and consequently
the target Parquet file size).
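Conceptually (a pseudocode sketch with hypothetical names; the real
logic lives in the planner and table sink):

  def output_partition_block_size(fs, parquet_file_size, hdfs_block_size):
      if fs in ('S3', 'ADLS'):
          # No real notion of block size: use the requested file size
          # directly as the partition's block size.
          return parquet_file_size
      # On HDFS the file is created with a block size equal to the
      # target file size, so the two always match.
      return hdfs_block_size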
Testing: Re-enabled test_insert_parquet_verify_size() for ADLS.
Also fixed a miscellaneous bug with the ADLS client listing helper function.
Change-Id: I474a913b0ff9b2709f397702b58cb1c74251c25b
Reviewed-on: http://gerrit.cloudera.org:8080/7018
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Impala Public Jenkins
This patch leverages the AdlFileSystem in Hadoop to allow
Impala to talk to the Azure Data Lake Store. This patch has
functional changes as well as adds test infrastructure for
testing Impala over ADLS.
We do not support ACLs on ADLS since the Hadoop ADLS
connector does not integrate ADLS ACLs with Hadoop users/groups.
For testing, we use the azure-data-lake-store-python client
from Microsoft. This client seems to have some consistency
issues. For example, a drop table through Impala will delete
the files in ADLS; however, listing that directory through
the python client immediately after the drop will still show
the files. This behavior is unexpected since ADLS claims to be
strongly consistent. Some tests have been skipped due to this
limitation with the tag SkipIfADLS.slow_client. Tracked by
IMPALA-5335.
The azure-data-lake-store-python client also only works on CentOS 6.6
or later, so the python dependencies for Azure will not be downloaded
when TARGET_FILESYSTEM is not "adls". ADLS tests are expected
to be run on a machine running at least CentOS 6.6.
Note: This is only a test limitation, not a functional one. Clusters
with older OSes like CentOS 6.4 will still work with ADLS.
Added another dependency to bootstrap_build.sh for the ADLS Python
client.
Testing: Ran core tests with and without TARGET_FILESYSTEM as
'adls' to make sure that all tests pass and that nothing breaks.
Change-Id: Ic56b9988b32a330443f24c44f9cb2c80842f7542
Reviewed-on: http://gerrit.cloudera.org:8080/6910
Tested-by: Impala Public Jenkins
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
- Add support for insert, upsert, update, and delete statements.
- Add support for compute stats with mt_dop query options.
- Update impyla version in order to be able to have access to query
error text for DML queries.
- Made flake8 fixes. flake8 on this file is clean.
For every Kudu table in the databases, we make a copy and add a
'_original' suffix to the table name. The DML queries only make
modifications to the non-original table; the original table is never
modified. The original tables can be used to bring the non-original
tables back to their initial state. Two flags were added for doing this:
--reset-databases-before-binary-search and
--reset-databases-after-binary-search.
The DML queries are generated based on the mod values passed in with the
following flag: --dml-mod-values 11 13 17. For each mod value, 4 DML
queries are generated. The DML operations will touch table rows where
primary_key % mod_value = 0. So, the larger the mod value, the more rows
would be affected. The DML queries are generated in such a way that the
data for the insert, upsert, and update queries is taken from the table
with the _original suffix. The stress test generates DML queries for
only kudu databases. For example, --tpch-kudu-db=tpch_100_kudu
--tpch-db=tpch_100 --generate-dml-queries would only generate queries
for the tpch_100_kudu database.
Here's an example of a full call with the new options that runs the
stress test on the local mini cluster:
./concurrent_select.py \
--tpch-kudu-db=tpch_kudu \
--generate-dml-queries \
--dml-mod-values 11 13 17 \
--generate-compute-stats-queries \
--select-probability=0.5 \
--mem-limit-padding-pct=25 \
--mem-limit-padding-abs=50 \
--reset-databases-before-binary-search \
--reset-databases-after-binary-search
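A condensed sketch of how the mod-value scheme maps to queries
(hypothetical query templates and column names; the real generator
differs in detail):

  def generate_dml_queries(table, mod_values):
      # Rows with primary_key % mod_value == 0 are touched; insert,
      # upsert, and update data comes from the pristine <table>_original.
      src = table + '_original'
      queries = []
      for mod in mod_values:
          pred = 'pk %% %d = 0' % mod
          queries.append('DELETE FROM %s WHERE %s' % (table, pred))
          queries.append('INSERT INTO %s SELECT * FROM %s WHERE %s'
                         % (table, src, pred))
          queries.append('UPSERT INTO %s SELECT * FROM %s WHERE %s'
                         % (table, src, pred))
          queries.append('UPDATE %s SET c = c WHERE %s' % (table, pred))
      return queries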
Change-Id: Ia2aafdc6851cc0e1677a3c668d3350e47c4bfe40
Reviewed-on: http://gerrit.cloudera.org:8080/5093
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Impala Public Jenkins
This change fixes a bug in the E2E infrastructure's handling of the
case where an expected exception wasn't thrown. The code expected
test_section['CATCH'] to be a string, but in reality it is a list of
strings. It also clarifies the error message about the missing
exception. This change also enforces that the CATCH subsection in
tests cannot be empty.
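The fix boils down to something like (sketch):

  expected = test_section['CATCH']  # a list of strings, not one string
  assert expected, 'CATCH subsection must not be empty'
  if not caught_exception:
      raise AssertionError(
          'Query was expected to fail with one of %s, but it succeeded'
          % expected)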
Change-Id: I7d83c5db59e8a239e4e70694a1e625af6f21419c
Reviewed-on: http://gerrit.cloudera.org:8080/5260
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
The results in the test files were verified by hand.
This patch also introduces a new test section 'DML_RESULTS', which
takes the name of a table as a comment and the contents of the
table as its body and then verifies that the body matches the
actual contents of the table. This makes it easy to check that a
DML operation has the desired effect on the contents of a table,
rather than always having to add another test case that runs a
select on the table. For now, this section cannot be used in a
test along with the RESULTS or ERRORS sections.
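Roughly, a test case using the new section looks like:

  ---- QUERY
  insert into tdata values (1, 'a')
  ---- DML_RESULTS: tdata
  1,'a'
  ====

(table name 'tdata' and the values here are illustrative).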
TODO: Refactor the DML test case handling (IMPALA-4471)
Change-Id: Ib9e7afbef60186edb00a9d11fbe5a8c64931add6
Reviewed-on: http://gerrit.cloudera.org:8080/4953
Reviewed-by: Matthew Jacobs <mj@cloudera.com>
Tested-by: Internal Jenkins
Logging file or table data is a bad idea, and doing it by default is
particularly bad. This patch changes HdfsScanNode::LogRowParseError() to
log a file and offset only.
Testing: See rewritten tests.
To support testing this change, we also fix IMPALA-3895, by introducing
a canonical string __HDFS_FILENAME__ that all Hadoop filenames in the ERROR
output are replaced with before comparing with the expected
results. This fixes a number of issues with the old way of matching
filenames which purported to be a regex, but really wasn't. In
particular, we can now match the rest of an ERROR line after the
filename, which was not possible before.
In some cases, we don't want to substitute filenames because the ERROR
output is looking for a very specific output. In that case we can write:
$NAMENODE/<filename>
and this patch will not perform _any_ filename substitutions on ERROR
sections that contain the $NAMENODE string.
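For example, an expected error line can now be written as something
like:

  Error parsing row: file: __HDFS_FILENAME__, before offset: 1024

and it will match regardless of the actual filename.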
Finally, this patch fixes a bug where a test that had an ERRORS section
but no RESULTS section would silently pass without testing anything.
Change-Id: I5a604f8784a9ff7b4bf878f82ee7f56697df3272
Reviewed-on: http://gerrit.cloudera.org:8080/4020
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
The test_load() test failed with an error saying that the number
of files in the destination was wrong. This is probably because
the filesystem_client.copy() (for S3) in setup_method() silently
failed to copy one of the files, as a one-off error.
I'm not sure why S3 failed to do the copy, but this patch adds an
assert after the copy to make sure that if s3_client.copy()
fails to do the copy, it will assert instead of continuing with the
rest of the tests.
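A sketch of the added check (hypothetical helper names):

  num_src_files = len(filesystem_client.ls(SRC_DIR))
  filesystem_client.copy(SRC_DIR, DST_DIR)
  # Fail fast here rather than later with a confusing count mismatch.
  assert len(filesystem_client.ls(DST_DIR)) == num_src_files, \
      'copy() did not copy all files to %s' % DST_DIR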
Change-Id: I966a469e94099d3d971e470ae6e992386070c5e9
Reviewed-on: http://gerrit.cloudera.org:8080/3881
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Also fix a stale comment in the avro scanner header.
The main work here is to fix the handling of empty result sets in the
test result verifier. This is a problem because we wanted to verify
that the results in the test file were a superset of the rows
returned, and this was thrown off by superfluous '' rows in the expected
and actual result sets.
The basic problem is that the way test file sections
were parsed conflated an empty result section with a non-empty result
section that had a single empty string. I.e.:
---- RESULTS
====
vs
---- RESULTS

====
both got resolved to [''].
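In other words, the parser needs to make a distinction like this
(sketch):

  def parse_result_rows(text):
      # text is the raw contents between '---- RESULTS' and '===='.
      if text == '':
          return []             # empty section: zero expected rows
      return text.split('\n')   # lone blank line: one empty-string row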
Change-Id: Ia007e558d92c7e4ce30be90446fdbb1f50a0ebc4
Reviewed-on: http://gerrit.cloudera.org:8080/3413
Tested-by: Internal Jenkins
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
The HdfsTableSink usually creates an HDFS connection to the filesystem
that the base table resides in. However, if we create a partition on
a filesystem different from that of the base table and set
S3_SKIP_INSERT_STAGING to "true", the table sink will try to write to
a different filesystem with the wrong filesystem connector.
This patch allows the table sink itself to work with different
filesystems by replacing the single FS connector with a
connector per partition.
This also reenables the multiple_filesystems test and modifies it to
use the unique_database fixture so that parallel runs on the same
bucket do not clash and end up in failures.
This patch also introduces a SECONDARY_FILESYSTEM environment variable
which will be set by the test to allow S3, Isilon and the localFS to
be used as the secondary filesystems.
All jobs with HDFS as the default filesystem need to set the
appropriate environment for S3 and Isilon, i.e. the following:
- export AWS_SECRET_ACCESS_KEY
- export AWS_ACCESS_KEY_ID
- export SECONDARY_FILESYSTEM (to whatever filesystem needs to be
tested)
TODO: SECONDARY_FILESYSTEM and FILESYSTEM_PREFIX and NAMENODE have a
lot of similarities. Need to clean them up in a following patch.
Change-Id: Ib13b610eb9efb68c83894786cea862d7eae43aa7
Reviewed-on: http://gerrit.cloudera.org:8080/3146
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
Some of our tests which are expected to fail due to low
query memory limits can fail non-deterministically with
different error messages. In addition, some tests may
throw different error messages when running with the legacy
join nodes. This change updates the test infrastructure to
allow multiple exception messages to be specified by adding
"ANY_OF" to the "CATCH" subsection.
Change-Id: Ie6d81fd3ae601f565b575edfeefff7c5a6c07974
Reviewed-on: http://gerrit.cloudera.org:8080/3205
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Internal Jenkins
Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
functionally enables LOAD DATA and INSERT on S3 without making major
changes for the sake of improving performance over S3. This patch also
enables both INSERT and LOAD DATA between file systems.
S3 does not support the rename operation, so the staged files in S3
are copied instead of renamed, which contributes to the slow
performance on S3.
The FinalizeSuccessfulInsert() function now does not make any
underlying assumptions of the filesystem it is on and works across
all supported filesystems. This is done by adding a full URI field to
the base directory for a partition in the TInsertPartitionStatus.
Also, the HdfsOp class now does not assume a single filesystem and
gets connections to the filesystems based on the URI of the file it
is operating on.
Added a python S3 client called 'boto3' to access S3 from the python
tests. A new class called S3Client is introduced which creates
wrappers around the boto3 functions with the same function
signatures as PyWebHdfsClient, by deriving from an abstract base class
BaseFileSystem, so that the two can be used interchangeably through a
'generic_client'. test_load.py is refactored to use this generic
client. The ImpalaTestSuite setup creates a client according to the
TARGET_FILESYSTEM environment variable and assigns it to the
'generic_client'.
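The shape of the abstraction, as a sketch (illustrative method
subset):

  from abc import ABCMeta, abstractmethod

  class BaseFileSystem(object):
      # Common interface so tests can target HDFS or S3 interchangeably.
      __metaclass__ = ABCMeta

      @abstractmethod
      def create_file(self, path, file_data, overwrite=True): pass

      @abstractmethod
      def delete_file_dir(self, path, recursive=False): pass

      @abstractmethod
      def ls(self, path): pass

  class S3Client(BaseFileSystem):
      def __init__(self, bucket_name):
          import boto3
          self.bucket = boto3.resource('s3').Bucket(bucket_name)
      # ... boto3-backed implementations of the methods above ...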
P.S.: Currently, test_load.py runs 4x slower on S3 than on
HDFS. Performance needs to be improved in future patches. INSERT
performance is slower than on HDFS too. This is mainly because of an
extra copy that happens between staging and the final location of a
file. However, larger INSERTs come closer to HDFS performance than
smaller inserts.
ACLs are not taken care of for S3 in this patch. It is something
that still needs to be discussed before implementing.
Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
Reviewed-on: http://gerrit.cloudera.org:8080/2574
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
This patch introduces a new query option,
PARQUET_FALLBACK_SCHEMA_RESOLUTION which allows Parquet files' schemas
to be resolved by either name or position. It's "fallback" because
eventually field IDs will be the primary schema resolution scheme, and
we don't want to create an option that we will have to change the name
of later. The default is still by position. I chose a query
option because it will make testing easier and also make it easier to
diagnose resolution problems quickly in the field. If users want to
switch the default behavior to be by name (like Hive), they can use
the --default_query_options flag.
This patch also introduces a new test section, SHELL, which can be
used to execute shell commands in a .test file. This is useful for
copying files into test tables.
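Roughly, a SHELL section looks like:

  ---- SHELL
  hadoop fs -cp /test-warehouse/src.parq /test-warehouse/dst_tbl/
  ---- QUERY
  select count(*) from dst_tbl
  ---- RESULTS
  1
  ====

(paths and table name here are illustrative).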
Change-Id: Id0c715ea23792b2a6872610839a40532aabbb5a6
Reviewed-on: http://gerrit.cloudera.org:8080/2384
Reviewed-by: Skye Wanderman-Milne <skye@cloudera.com>
Tested-by: Internal Jenkins
This patch adds functional tests for runtime filters. It relies on
setting RUNTIME_FILTER_WAIT_TIME_MS high enough to ensure that filters
are received.
To make the test files more readable, this patch also adds a new COMMENT
section to the test syntax, and allows blank spaces between queries so
that the separation of different test cases can be made more obvious.
Currently missing is a test for disabling probe-side filters based on
selectivity, as we lack suitable tables to trigger the disable condition.
Change-Id: I94d617c6d23ffa394a6eb7ead56f1cfb701e0d90
Reviewed-on: http://gerrit.cloudera.org:8080/2603
Reviewed-by: Henry Robinson <henry@cloudera.com>
Tested-by: Internal Jenkins
This option toggles whether the parquet writer will use the UTF8
annotation for string columns. This patch includes a test that writes
a table with or without this option, then verifies that the annotation
is or isn't present using a new get_parquet_metadata Python utility.
Change-Id: I030c9f5c6272e09c1ce133f66234e3cfb26b68d4
Reviewed-on: http://gerrit.cloudera.org:8080/2531
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
This merges the 'feature/kudu' branch with cdh5-trunk as of commit:
055500cc753f87f6d1c70627321fcc825044e183
This patch is not a pure merge in the sense that it goes beyond conflict
resolution to also address reviews of the 'feature/kudu' branch as a whole.
The review items and their resolution can be inspected at:
http://gerrit.cloudera.org:8080/#/c/1403/
Change-Id: I6dd4270cd17a4f5c02811c343726db3504275a92
Test files in testdata/workloads/functional-query/queries/QueryTest
are parsed by test_file_parser.py, which used to ignore everything
before the first ==== line as a file header. This change fixes all
affected files.
This change also modifies the test file parser to forbid headers
starting with what looks like a subsection title ('----'), which
should prevent the reintroduction of similar errors in the future.
Change-Id: Iaa1bc5ffd02782e24289c7843dcb35401c334519
Reviewed-on: http://gerrit.cloudera.org:8080/2220
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
Aggregations are implemented as a distributed pre-aggregation, an
exchange, then a final aggregation that produces the results of the
aggregation. In many cases the pre-aggregation significantly reduces the
amount of data to be exchanged. However, in other cases, the
pre-aggregation does not greatly reduce the amount of data exchanged or
can use a lot of memory and starve other operators that would benefit
more from the additional memory.
In these cases we would be better off "passing through" some input tuples
by transforming them into intermediate tuples without aggregating them.
This patch adds a streaming pre-aggregation mode to
PartitionedAggregationNode that tries to aggregate input rows with a
hash table, but can switch to passing through the input tuples (after
transforming them into the appropriate tuple format). It does this if
it hits a memory limit or if the aggregation is not sufficiently
reducing the node's output (specifically, if the number of aggregated
rows in the hash table is more than half the number of unaggregated rows
consumed by the pre-aggregation). Pre-aggregations never need to spill
because they can pass through rows when under memory pressure.
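The pass-through decision is, in essence (sketch):

  def should_pass_through(ht_rows, input_rows, mem_limit_hit):
      # Stream rows through when the hash table stops paying for
      # itself: under memory pressure, or when the aggregated rows
      # exceed half of the input rows consumed so far.
      return mem_limit_hit or (input_rows > 0 and
                               ht_rows > input_rows / 2.0)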
This initial implementation is quite conservative: it retains the
partitioning of the previous implementation because switching to a
single partition proved to regress performance of some queries while
improving others. It also always keeps hash tables around and updates
them with matching input rows so that reduction statistics are updated
and early decisions to pass through data can be reversed. Future work
could explore different approaches within the new framework to get
larger performance gains. Currently we see significant performance
benefits for queries with a very low reduction factor, e.g. group by on
a nearly unique column.
Includes codegen support for the passthrough streaming.
Adds a query option, disable_streaming_preaggregations, in case a user
wants to revert to the old behaviour.
Adds TPC-H tests to exercise the new passthrough code path and updates
planner tests to include the new [STREAMING] detail added by the planner.
Change-Id: Ia40525340cba89a8c4e70164ae11447e96494664
Reviewed-on: http://gerrit.cloudera.org:8080/1698
Tested-by: Internal Jenkins
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
This patch adds a 'use_ssl' option to the compute_table_stats.py
script. This change is needed so that this script does not fail in
clusters with SSL enabled.
Change-Id: I88b7279d368f59c6eff890b04f629050d1b9c896
Reviewed-on: http://gerrit.cloudera.org:8080/1892
Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
Tested-by: Internal Jenkins
The test file parser is supposed to handle multiple-item comments for a
section. The implementation had an issue where only single-item comments
were handled correctly.
Change-Id: I47f7201044f6f92d10a62bb9d2eada1bd4c47a23
Reviewed-on: http://gerrit.cloudera.org:8080/1819
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Internal Jenkins
The major changes are:
1) Collect backtrace and fatal log on crash.
2) Poll memory usage. The data is only displayed at this time.
3) Support kerberos.
4) Add random queries.
5) Generate random and TPC-H nested data on a remote cluster. The
random data generator was converted to use MR for scaling.
6) Add a cluster abstraction to run data loading for #5 on a
remote or local cluster. This also moves and consolidates some
Cloudera Manager utilities that were in the stress test.
7) Clean up the wrappers around impyla. That stuff was getting
messy.
Change-Id: I4e4b72dbee1c867626a0b22291dd6462819e35d7
Reviewed-on: http://gerrit.cloudera.org:8080/1298
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Allow Impala to start only with a running HMS (and no additional services like HDFS,
HBase, Hive, YARN) and use the local file system.
Skip all tests that need these services, use HDFS caching, or assume
that multiple impalads are running.
To run Impala with the local filesystem, set TARGET_FILESYSTEM to 'local' and
WAREHOUSE_LOCATION_PREFIX to a location on the local filesystem where the current user has
permissions since this is the location where the test data will be extracted.
Test coverage (with core strategy) in comparison with HDFS and S3:
HDFS 1348 tests passed
S3 1157 tests passed
Local Filesystem 1161 tests passed
Change-Id: Ic9718c7e0307273382b1cc6baf203ff2fb2acd03
Reviewed-on: http://gerrit.cloudera.org:8080/1352
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Readability: Alex Behm <alex.behm@cloudera.com>
As part of this change, refactor catalog and frontend functions to return
TDatabase/Db objects instead of just the string names of databases -
this required a lot of method/variable renamings.
Add test for creating database with comment. Modify existing tests
that assumed only a single column in SHOW DATABASES results.
Change-Id: I400e99b0aa60df24e7f051040074e2ab184163bf
Reviewed-on: http://gerrit.cloudera.org:8080/620
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
This patch extends the SHOW statement to also support
user-defined functions and user-defined aggregate functions.
The syntax of the new SHOW statements is as follows:
SHOW CREATE [AGGREGATE] FUNCTION [<db_name>.]<func_name>;
<db_name> and <func_name> are the names of the database
and udf/uda respectively.
Sample outputs of the new SHOW statements are as follows:
Query: show create function fn
+------------------------------------------------------------------+
| result |
+------------------------------------------------------------------+
| CREATE FUNCTION default.fn() |
| RETURNS INT |
| LOCATION 'hdfs://localhost:20500/test-warehouse/libTestUdfs.so' |
| SYMBOL='_Z2FnPN10impala_udf15FunctionContextE' |
| |
+------------------------------------------------------------------+
Query: show create aggregate function agg_fn
+------------------------------------------------------------------------------------------+
| result |
+------------------------------------------------------------------------------------------+
| CREATE AGGREGATE FUNCTION default.agg_fn(INT) |
| RETURNS BIGINT |
| LOCATION 'hdfs://localhost:20500/test-warehouse/libudasample.so' |
| UPDATE_FN='_Z11CountUpdatePN10impala_udf15FunctionContextERKNS_6IntValEPNS_9BigIntValE' |
| INIT_FN='_Z9CountInitPN10impala_udf15FunctionContextEPNS_9BigIntValE' |
| MERGE_FN='_Z10CountMergePN10impala_udf15FunctionContextERKNS_9BigIntValEPS2_' |
| FINALIZE_FN='_Z13CountFinalizePN10impala_udf15FunctionContextERKNS_9BigIntValE' |
| |
+------------------------------------------------------------------------------------------+
Please note that all the overloaded functions which match
the given function name and category will be printed.
This patch also extends the python test infrastructure to
support expected results which include newline characters.
A new subsection comment called 'MULTI_LINE' has been added
for the 'RESULTS' section. With this comment, a test can
include its multi-line output inside [ ], and the content
inside [ ] will be treated as a single line, including the
newline character.
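Roughly:

  ---- RESULTS: MULTI_LINE
  ['CREATE FUNCTION default.fn()
  RETURNS INT
  LOCATION ...
  ']
  ====

(output above is abbreviated).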
Change-Id: Idbe433eeaf5e24ed55c31d905fea2a6160c46011
Reviewed-on: http://gerrit.cloudera.org:8080/1271
Reviewed-by: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Tested-by: Internal Jenkins
Python tests and infra scripts will now use "python" from the virtualenv
via $IMPALA_HOME/bin/impala-python. Some scripts could be simplified now
that python 2.6 and a dependable set of third-party libraries are
available but that is not done as part of this commit.
Change-Id: If1cf96898d6350e78ea107b9026b12ba63a4162f
Reviewed-on: http://gerrit.cloudera.org:8080/603
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set even though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This is general clean up in prep for use with the stress test.
Changes:
1) Failed commands and failure to connect now raise exceptions.
Previously run_cmd() was not guaranteed to do anything at all in
remote mode.
2) Fix scope of 'hosts' which was at the class level but was modified
by instance level functions which makes no sense since different
instances could clash with each other.
3) Remove uses of opaque *args and **kwargs in favor of named args. The
generic forms should be avoided since they impair readability.
4) Stop trying to get the cluster hosts from an environment variable
unconditionally upon construction.
5) Remove the 'local' member variable; it's not needed, and allowing
'local' to be set to False when no 'hosts' are set makes no sense.
6) Simplify and remove unneeded methods and arguments.
Change-Id: Id90bd3b640f2681bb7e82a5e6d5e49ed8c5a7b98
Reviewed-on: http://gerrit.cloudera.org:8080/514
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Internal Jenkins
Previously the test_file_parser would set up the logging
configuration as part of importing the module. The
test_file_parser is not executable and not a logging utility so it
should not have any effect on logging. If some other file relies on this
it should be fixed separately.
Change-Id: Ib7293d152d0c0cd3c8f31533c95e50b2678e927b
Reviewed-on: http://gerrit.cloudera.org:8080/473
Tested-by: Internal Jenkins
Reviewed-by: Casey Ching <casey@cloudera.com>
The goal of the 'only' constraint is that only the explicitly mentioned
tables are created in the specified format. However, there is currently
a bug in the parsing of the file such that only the _last_ table is
actually created. That is, if there is a sequence of statements for a
certain format with the 'only' constraint, only the last one is used.
This patch fixes this by making sure we actually create multiple tables
if we find them.
Change-Id: I28e91aeefc03dcca7de6ad4aa50456dcb90ed95c
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6973
Reviewed-by: Martin Grund <mgrund@cloudera.com>
Tested-by: jenkins
test_load was using /tmp as the staging directory, which was not cleaned up on Isilon,
leading to a build failure. This patch does the following:
- use /test-warehouse as the staging directory.
- replace calls to the hdfs command line with calls to the in-house hdfs client.
- clean up the test file and remove duplicates.
Additionally, a new method is introduced in the hdfs client to simulate hdfs dfs -cp, i.e.,
it does a get and a put to mimic the hdfs command line's semantics.
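A sketch of the new helper (hypothetical signature on the hdfs client):

  def copy(self, src, dst, overwrite=True):
      # pywebhdfs has no server-side copy, so mimic 'hdfs dfs -cp'
      # with a get (read) followed by a put (write).
      data = self.read_file(src)
      self.create_file(dst, data, overwrite=overwrite)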
Change-Id: I0cc27ab00df5f5ec3138b995144ab45ad622605d
Reviewed-on: http://gerrit.cloudera.org:8080/431
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
This starts a kudu mini-cluster with a master and three tablet servers
on a single host. This requires a checkout of the kudu-bin
project to be accessible. By default the location of the checkout is
expected to be $IMPALA_HOME/../kudu-bin.
In addition, this patch enables loading data to kudu via the
load-data.py command. Currently only the "liketbl" table is created for
Kudu, but not loaded with data. This has to be done manually from the
kudu-bin repo for now.
Change-Id: Ia7981b023f119759e5e13e78322a6c89f82bd085
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/6499
Tested-by: jenkins
Reviewed-by: David Alves <david.alves@cloudera.com>
This patch introduces changes to run tests against Isilon, combined with minor cleanup of
the test and client code.
For Isilon, it:
- Populates the SkipIfIsilon class with appropriate pytest markers.
- Introduces a new default for the hdfs client in order to connect to Isilon.
- Cleans up a few test files to take the underlying filesystem into account.
- Cleans up the interface for metadata/test_insert_behaviour, query_test/test_ddl
On the client side, we introduce a wrapper around a few of pywebhdfs's methods, specifically:
- delete_file_dir does not throw an error if the file does not exist.
- get_file_dir_status automatically strips the leading '/'
Change-Id: Ic630886e253e43b2daaf5adc8dedc0a271b0391f
Reviewed-on: http://gerrit.cloudera.org:8080/370
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
This patch encapsulates pytests's skipif markers in classes. It leads to the following
benefits:
- Provide context and grouping for tests being skipped.
- As we improve test reporting, annotations will give us a better idea of coverage.
Change-Id: Ib0557fb78c873047c214bb62bb6b045ceabaf0c9
Reviewed-on: http://gerrit.cloudera.org:8080/297
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/343
This patch also adds a small utility file for supporting different filesystems.
Change-Id: I28b1217b0cb901360e28e8d0ba269c9144117d2e
Reviewed-on: http://gerrit.cloudera.org:8080/124
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Internal Jenkins
Summary of changes:
1) (from Taras) Exercise CTAS and views by creating one from a random
query, then SELECT * FROM table/view.
2) Use bulk loading to generate random data. The old method was to use
INSERTs, which is very slow. Now local data files are generated and
uploaded.
3) Misc schema parsing changes needed to support the simplified type
system in the earlier review (part 1).
Change-Id: I7986b97aa12051dc043faafef34a9540117e889f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5646
Reviewed-by: Casey Ching <casey@cloudera.com>
Tested-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: Ishaan Joshi <ishaan@cloudera.com>
This patch contains the following changes:
- Add a metastore_snapshot_file parameter to build.sh
- Enable skipping loading the metadata.
- create-load-data.sh is refactored into functions.
- A lot of scripts source impala-config, which creates a lot of log spew. This has now
been muted.
- Unnecessary log spew from compute-table-stats has been muted.
- build_thirdparty.sh determines its parallelism from the system; it was
previously hardcoded to 4.
- Only force load data of the particular dataset if a schema change is detected.
Change-Id: I909336451e5c1ca57d21f040eb94c0e831546837
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5540
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
Our .test file parser used to not abort tests when there
was a malformed test/section. This patch changes that behavior
to report an error and treat the test as failed.
Quite a few tests were not well-formed, and were not executed
as a result. This patch fixes those tests.
Arguably, the test file parser should be more flexible in which places
to accept comments, but this patch does not address that problem.
Change-Id: If53358eb0cb958b68e51940b071e64c1d6c3ec6f
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5468
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: jenkins
- Added % change to performance regressions/improvements table
- Automatic extraction of Impala version from runtime profiles
- Execution summary row will not be printed if max time is < 100ms or < 2% of the overall runtime
- Failed queries are ignored
- First result is discarded for each query
- Geometric mean was added to summary
- Improved handling of multiple workloads in a single JSON file
- Improved handling of the case when queries are different in results and reference results
- Works well for single client runs. Additional work is needed to handle multiple client runs well.
Change-Id: Ice7b9cc4fd7502a448d35ace10fbcef183df1769
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4210
Reviewed-by: Ishaan Joshi <ishaan@cloudera.com>
Tested-by: jenkins
(cherry picked from commit c722f6b0a104df54b550978cd222a9af4d39b929)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/5250
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
This patch forces LOAD and INSERT to check ACLs during analysis. We
mimic the behaviour of HDFS's ACL checking by adding code to
FsPermissionChecker.
Change-Id: I42660db1da13ceaef63f582cff2c2078e08f90a1
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4428
Reviewed-by: Marcel Kornacker <marcel@cloudera.com>
Tested-by: jenkins
This patch adds the necessary changes required to authorize SHOW ROLES statements.
This is not as easy as it could be because the Sentry Service doesn't currently
expose the metadata for who is/isn't authorized to execute these statements. To authorize
the statements, we need to first make an RPC to the Sentry Service (via the
Catalog Server) and then only proceed with the SHOW statement if the check succeeds.
We should consider revisiting this approach in the future when more metadata is available
from Sentry.
Additionally, this patch adds support for SHOW CURRENT ROLES which shows all roles
that are currently granted to the current user.
Change-Id: Ia01c20d58ab081f49a85566075836d8c6e25dbd4
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4367
Reviewed-by: Lenni Kuff <lskuff@cloudera.com>
Tested-by: jenkins
This is the first iteration of a kerberized development environment.
All the daemons start and use kerberos, with the sole exception of the
hive metastore. This is sufficient to test impala authentication.
When buildall.sh is run using '-kerberize', it will stop before
loading data or attempting to run tests.
Loading data into the cluster is known to not work at this time, the
root causes being that Beeline -> HiveServer2 -> MapReduce throws
errors, and Beeline -> HiveServer2 -> HBase has problems. These are
left for later work.
However, the impala daemons will happily authenticate using kerberos
both from clients (like the impala shell) and amongst each other.
This means that if you can get data into the mini-cluster, you could
query it.
Usage:
* Supply a '-kerberize' option to buildall.sh, or
* Supply a '-kerberize' option to create-test-configuration.sh, then
'run-all.sh -format', re-source impala-config.sh, and then start
impala daemons as usual. You must reformat the cluster because
kerberizing it will change all the ownership of all files in HDFS.
Notable changes:
* Added clean start/stop script for the llama-minikdc
* Creation of Kerberized HDFS - namenode and datanodes
* Kerberized HBase (and Zookeeper)
* Kerberized Hive (minus the MetaStore)
* Kerberized Impala
* Loading of data very nearly working
Still to go:
* Kerberize the MetaStore
* Get data loading working
* Run all tests
* The unknown unknowns
* Extensive testing
Change-Id: Iee3f56f6cc28303821fc6a3bf3ca7f5933632160
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4019
Reviewed-by: Michael Yoder <myoder@cloudera.com>
Tested-by: jenkins
- Added execution summary to the beeswax client and QueryResult
- Modified report-benchmark-results to handle JSON and perform
execution summary comparison between runs
- Added comments to the new workload runner
Change-Id: I9c3c5f2fdc5d8d1e70022c4077334bc44e3a2d1d
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3598
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
(cherry picked from commit fd0b1406be2511c202e02fa63af94fbbe5e18eee)
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3618