After inserting all of its input into its Aggregators,
StreamingAggregationNode performs some cleanup, such as calling
InputDone() on each Aggregator.
Previously, StreamingAggregationNode only checked that all of the
child's batches had been fetched before doing this cleanup, which
caused problems if the final child batch wasn't processed fully in a
single GetNext() call. In that case, multiple calls to InputDone()
led to a DCHECK failure.
The solution is to only perform the cleanup once the final child batch
has been fully processed.
Testing:
- Added an e2e test with a query that hits this condition.
Change-Id: I851007a60472d0e53081c076c863c866c516677c
Reviewed-on: http://gerrit.cloudera.org:8080/11626
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes one of the tests in test_resource_limits that expects a
query to run for more than 2 seconds but currently fails because it
sometimes completes earlier than that.
Change-Id: I2ba7080f62f0af3e16ef6c304463ebf78dec1b0c
Reviewed-on: http://gerrit.cloudera.org:8080/11741
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
When run on a test build with HDFS erasure coding turned on, a spilling
test hits an out-of-memory error on the HDFS scan node. This happens
because the test is tuned for a regular 3-node minicluster without
HDFS erasure coding. The fix is to simply increase the memory limit on
the test to accommodate this difference while keeping it small enough to
still achieve the desired spilling on the hash join node.
Testing:
Ran it on an EC-enabled minicluster to make sure it works
Change-Id: I207569822ba7388e78936d25e2311fa09c7a1b9a
Reviewed-on: http://gerrit.cloudera.org:8080/11740
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
StringMinMaxFilter uses a MemPool to allocate space for StringBuffers.
Previously, the MemPool was created by RuntimeFilterBank and passed to
each StringMinMaxFilter. In queries with multiple StringMinMaxFilters
being generated by the same fragment instance, this leads to
overlapping use of the MemPool by different threads, which is
incorrect as MemPools are not thread-safe.
The solution is to have each StringMinMaxFilter create its own
MemPool.
This patch also documents MemPool as not thread-safe and introduces a
DFAKE_MUTEX to help enforce correct usage. Doing this requires
modifying our CMakeLists.txt to pass '-DNDEBUG' to clang only in
RELEASE builds, so that the DFAKE_MUTEX will be present in the
compiled IR for DEBUG builds.
Testing:
- I have been unable to repro the actual crash despite trying a large
variety of different things. However, with additional logging added
it's clear that the MemPool is being used concurrently, which is
incorrect.
- Added an e2e test that covers the potential issue. It hits the
DFAKE_MUTEX with a sleep added to MemPool::Allocate.
- Ran a full exhaustive build in both DEBUG and RELEASE.
Change-Id: I751cad7e6b75c9d95d7ad029bbd1e52fe09e8a29
Reviewed-on: http://gerrit.cloudera.org:8080/11650
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon, which toggles whether Impala writes the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
I removed 'num_data_pages_' from BaseColumnWriter since it was rather
confusing and didn't provide any measurable performance improvement.
This commit fixes the ASAN error produced by the first IMPALA-7644
commit which was reverted later.
Change-Id: Ib4a9098a2085a385351477c715ae245d83bf1c72
Reviewed-on: http://gerrit.cloudera.org:8080/11694
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
I used some ideas from Alex Leblang's abandoned patch:
https://gerrit.cloudera.org/#/c/137/ in order to run .test files through
HS2. The advantage of using Impyla is that much of the code will be
reusable for any Python client implementing the standard Python dbapi
and does not require us to implement yet another Thrift client.
This gives us better coverage of non-trivial result sets from HS2,
including handling of NULLs, error logs and more interesting result
sets than the basic HS2 tests.
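As a rough illustration of the approach, a query can be run over HS2
through Impyla's standard dbapi interface roughly as below; the host,
port and query are placeholders, not values from this patch:

  # Minimal sketch of running a query over HS2 via Impyla's dbapi interface.
  from impala.dbapi import connect

  conn = connect(host='localhost', port=21050)  # default Impala HS2 port
  try:
      cursor = conn.cursor()
      cursor.execute(
          "select id, bool_col, timestamp_col from functional.alltypes limit 5")
      # Column metadata and rows come back through the generic dbapi surface,
      # so the same code works for any dbapi-compliant client.
      column_names = [desc[0] for desc in cursor.description]
      for row in cursor.fetchall():
          print(dict(zip(column_names, row)))
  finally:
      conn.close()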
I added HS2 coverage to TestQueries, which has a reasonable variety of
queries and covers the data types in alltypes. I also added
TestDecimalQueries, TestStringQuery and TestCharFormats to get coverage
of DECIMAL, CHAR and VARCHAR that aren't in alltypes. Coverage of
result sets with NULLs was limited so I added a couple of queries.
Places where results differ from Beeswax:
* Impyla is a Python dbapi client, so it must convert timestamps into Python
datetime objects, which only have microsecond precision. Therefore result
timestamps with nanosecond precision are truncated.
* The HS2 interface reports the NULL type as BOOLEAN as a workaround for
IMPALA-914.
* The Beeswax interface reports VARCHAR as STRING, but HS2 reports
VARCHAR.
I dealt with different results by adding additional result sections so
that the expected differences between the clients/protocols were
explicit.
Limitations:
* Not all of the methods implemented for Beeswax are implemented for HS2,
so some tests that have more complicated interactions with the client will
not work with HS2 yet.
* We don't have a way to get the affected row count for inserts.
I also simplified the ImpalaConnection API by removing some unnecessary
methods and moved some generic methods to the base class.
Testing:
* Confirmed that it detected IMPALA-7588 by re-applying the buggy patch.
* Ran exhaustive and CentOS6 tests.
Change-Id: I9908ccc4d3df50365be8043b883cacafca52661e
Reviewed-on: http://gerrit.cloudera.org:8080/11546
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This commit adds the command line flag enable_parquet_page_index_writing
to the Impala daemon, which toggles whether Impala writes the
Parquet page index. By default the flag is false, i.e. Impala doesn't
write the page index.
This flag is only temporary; we plan to remove it once Impala is able to
read the page index and has better testing around it.
Because of this change I had to move test_parquet_page_index.py to the
custom_cluster test suite since I need to set this command line flag
in order to test the functionality. I also merged most of the test cases
because we don't want to restart the cluster too many times.
Change-Id: If9994882aa59cbaf3ae464100caa8211598287bc
Reviewed-on: http://gerrit.cloudera.org:8080/11563
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the broken SHOW GRANT USER ON <object> statement, which
always showed an empty result due to an incorrect comparison between the
TPrivilege used as the filter and the TPrivilege for the actual privilege;
the comparison should not consider the "grantoption" field.
Testing:
- Added new E2E tests
- Ran all FE tests
- Ran all authorization E2E tests
Change-Id: I7adc403caddd18e5a954cf6affd5d1d555b9f5f0
Reviewed-on: http://gerrit.cloudera.org:8080/11598
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
per host memory limit for a query
With this patch the per host memory limit of a query is automatically
set using the mem_limit set in the query options and the mem_estimate
calculated by the planner, based on the following pseudocode (a concrete
sketch follows the pool option descriptions below):
  if mem_limit is set in query options:
    use that, and if 'clamp-mem-limit-query-option' is true:
      enforce the min/max query mem limits defined in the pool config.
  else:
    mem_limit = max(mem_estimate,
        min_mem_limit_required_to_accommodate_largest_initial_reservation)
    finally, enforce the min/max query mem limits defined in the pool
    config on this value.
This calculated mem limit will also be used for admission accounting
and consequently for admission control. Moreover, three new pool
configuration options have been added to enable this behaviour:
"min-query-mem-limit" & "max-query-mem-limit" => help
clamp the per host memory limit for a query. If both these limits
are not configured, then the estimates from planning are not used
as a memory limit and only used for making admission decisions.
Moreover the estimates will no longer have a lower bound based
on the largest initial reservation.
"clamp-mem-limit-query-option" => if false, the mem_limit defined in
the query options is used directly and the max/min query mem limits
are not enforced on it.
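As referenced above, a minimal sketch of the limit computation, assuming
the pool options described here (the helper name and its signature are
illustrative, not actual Impala code):

  # Illustrative sketch of the per-host mem limit computation described above.
  def compute_per_host_mem_limit(query_mem_limit, mem_estimate,
                                 largest_initial_reservation,
                                 min_query_mem_limit, max_query_mem_limit,
                                 clamp_mem_limit_query_option):
      def clamp(value):
          # Enforce the pool's min/max query mem limits, if configured.
          if min_query_mem_limit is not None:
              value = max(value, min_query_mem_limit)
          if max_query_mem_limit is not None:
              value = min(value, max_query_mem_limit)
          return value

      if query_mem_limit is not None:
          # mem_limit set in query options: use it, optionally clamped.
          if clamp_mem_limit_query_option:
              return clamp(query_mem_limit)
          return query_mem_limit
      if min_query_mem_limit is None and max_query_mem_limit is None:
          # Neither pool limit configured: the planner estimate is only used
          # for admission decisions, not enforced as a memory limit.
          return None
      # Otherwise derive the limit from the planner estimate, with a lower
      # bound that accommodates the largest initial reservation, then clamp.
      return clamp(max(mem_estimate, largest_initial_reservation))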
Testing:
Added e2e test cases.
Added frontend tests for changes to RequestPoolService.
Successfully passed exhaustive tests.
Change-Id: Ifec00141651982f5975803c2165b7d7a10ebeaa6
Reviewed-on: http://gerrit.cloudera.org:8080/11157
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds test coverage for some cases where queries are currently
expected to fail with out-of-memory:
* memory limit exceeded in exchange node
* aggregation with large var-len intermediate values
* top N with large limit
* hash join with many duplicates on right side
* analytic with a large window that needs to be buffered in-memory
Note that it's not always totally deterministic where the query hits
'memory limit exceeded' so we don't include the node ID or name in the
expected error message.
Testing:
* ran exhaustive tests
* looped modified tests locally overnight
Change-Id: Icd1a7eb97837b742a967c260cafb5a7f4f45412e
Reviewed-on: http://gerrit.cloudera.org:8080/11564
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds memory estimates for kudu scan nodes based on
empirically derived estimates for the scan's memory consumption
that were added in IMPALA-7096.
Testing:
Modified resource requirements planner test.
Change-Id: If9bb52530271b0bff91311a67d222a2e9fac1229
Reviewed-on: http://gerrit.cloudera.org:8080/11440
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcpy'd as-is into slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00
Different parts of Impala treat these timestamps differently:
- string conversion leads to invalid representation that cannot be
converted back to timestamp
- timezone conversions handle the overflowing time part and give
a valid timestamp result (at least since CCTZ, I did not check
older versions of Impala)
- Parquet writing inserts these timestamps as they are, so the
resulting Parquet file will also contain corrupt timestamps
The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamps outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part are corrupt, then the error about the corrupt time
part is returned.
Testing:
- added a new scanner test that reads a corrupted Parquet file
with edge values
Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Reviewed-on: http://gerrit.cloudera.org:8080/11521
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements the same function as the Hive UDF get_json_object.
We reuse RapidJSON to parse the JSON string. In order to track the
memory used by RapidJSON, we wrap FunctionContext into an allocator.
get_json_object accepts two parameters: a JSON string and a selector
(JSON path). We parse the JSON string into a Document tree and then
perform BFS according to the selector. For example, to process
get_json_object('[{\"a\":1}, {\"a\":2}, {\"a\":3}]', '$[*].a'),
we first apply '$[*]' to extract all the items in the root array.
Then we get a queue consisting of {a:1},{a:2},{a:3} and apply the '.a'
selector to all values in the queue. The final results in the queue are
1,2,3. As there are multiple results, they are encapsulated into
an array. The output is the string '[1,2,3]'.
More examples can be found in expr-test.cc.
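A rough Python model of the selector evaluation described above; the real
implementation is a C++ builtin using RapidJSON, and this toy only handles
the '$', '[*]' and '.key' pieces of the path syntax:

  import json, re

  def get_json_object(json_str, selector):
      # Toy model of the BFS over the parsed document described above.
      queue = [json.loads(json_str)]
      # Split a path like '$[*].a' into tokens: ['[*]', '.a']
      for token in re.findall(r'\[\*\]|\.\w+', selector):
          next_queue = []
          for value in queue:
              if token == '[*]' and isinstance(value, list):
                  next_queue.extend(value)           # expand every array element
              elif token.startswith('.') and isinstance(value, dict):
                  key = token[1:]
                  if key in value:
                      next_queue.append(value[key])  # descend into the object
          queue = next_queue
      if not queue:
          return None
      # Multiple results are encapsulated into an array.
      return json.dumps(queue if len(queue) > 1 else queue[0])

  print(get_json_object('[{"a":1}, {"a":2}, {"a":3}]', '$[*].a'))  # -> [1, 2, 3]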
Test:
* Add unit tests in expr-test
* Add e2e tests in exprs.test
* Add tests in test_alloc_fail.py to check handling of out of memory
Change-Id: I6a9d3598cb3beca0865a7edb094f3a5b602dbd2f
Reviewed-on: http://gerrit.cloudera.org:8080/10950
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch fixes the SHOW GRANT USER statement to show all privileges
granted to a user, either directly via object ownership, or granted
through a role via a group the user belongs to. The output for SHOW
GRANT USER will have two additional columns, principal_type and
principal_name, so the user can know where each privilege comes from.
Truncated sample showing two columns that are different from role:
+----------------+----------------+--------+----------+-...
| principal_type | principal_name | scope | database | ...
+----------------+----------------+--------+----------+-...
| USER | foo | table | foo_db | ...
| ROLE | foo_role | server | | ...
+----------------+----------------+--------+----------+-...
Testing:
- Create new custom cluster test with custom group mapping.
- Ran FE and custom cluster tests.
Change-Id: Ie9f6c88f5569e1c414ceb8a86e7b013eaa3ecde1
Reviewed-on: http://gerrit.cloudera.org:8080/11531
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Currently in DECIMAL V2 mode, typeof(9.9 % 3) is DECIMAL(2,1) and typeof(mod(9.9, 3)) is
DECIMAL(4,1), while both are expected to be DECIMAL(2,1). This patch fixes V2 mode by
replacing "mod" with "%" at the parser stage so that they share the same code path afterwards.
Testing:
Added unit tests and did real cluster testing.
Change-Id: Ib0067da04083859ffbf662a29007154461bea2ba
Reviewed-on: http://gerrit.cloudera.org:8080/11443
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Currently for HBase scan nodes we use a constant estimate of 1GB, which
is generally a gross over-estimation. This patch improves upon those
estimates by using heuristics based on how HBase rows are stored and
fetched and how the scanners interact with the internal memory pool.
Testing:
Added/Modified resource requirements planner test.
Added a junit test for the estimation logic.
Change-Id: I583545c3f5e454854f111871c5fbc4f108ae4bff
Reviewed-on: http://gerrit.cloudera.org:8080/11306
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The estimate for memory consumption for this scan is 9 columns * 384 KB
per column = 3.375 MB. So if we set the mem_limit to 6.5 MB, we should
still not get more than one scanner thread, but we can avoid hitting
out-of-memory.
The issue in the JIRA was queued row batches. With this change, and
num_scanner_threads=2, there should be at most 12 row batches
(10 in the queue, 2 in the scanner threads about to be enqueued),
and based on the column stats I'd estimate that each row batch is
around 200 KB, so this change should provide significantly more headroom.
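A quick back-of-the-envelope check of the numbers above (illustrative
arithmetic only, not code from the test):

  # Back-of-the-envelope check of the headroom reasoning above.
  KB, MB = 1024, 1024 * 1024

  scan_estimate = 9 * 384 * KB          # 9 columns * 384 KB = 3.375 MB
  mem_limit = 6.5 * MB
  queued_batches = 12 * 200 * KB        # ~12 row batches at ~200 KB each

  headroom = mem_limit - scan_estimate  # ~3.125 MB left for row batches
  print(scan_estimate / MB, headroom / MB, queued_batches / MB)
  # -> 3.375 3.125 2.34375 : the queued batches fit comfortably in the headroom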
Change-Id: I6d992cc076bc8678089f765bdffe92e877e9d229
Reviewed-on: http://gerrit.cloudera.org:8080/11513
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds support for having multiple aggregate functions in a
single SELECT block that use DISTINCT over different sets of columns.
Planner design:
- The existing tree-based plan shape with a two-phased
aggregation is maintained.
- Existing plans are not changed.
- Aggregates are grouped into 'aggregation classes' based on their
expressions in the distinct portion which may be empty for
non-distinct aggregates.
- The aggregation framework is generalized to simultaneously process
multiple aggregation classes within the tree-based plan. This
process splits the results of different aggregation classes into
separate rows, so a final aggregation is needed to transpose the
results into the desired form.
- Main challenge: Each aggregation class consumes and produces
different tuples, so conceptually a union-type of tuples flows
through the runtime. The tuple union is represented by a TupleRow
with one tuple per aggregation class. Only one tuple in such a
TupleRow is non-NULL.
- Backend exec nodes in the aggregation plan will be aware of this
tuple-union either explicitly in their implementation or by relying
on expressions that distinguish the aggregation classes.
- To distinguish the aggregation classes, e.g. in hash exchanges,
CASE expressions are crafted to hash/group on the appropriate slots.
Deferred FE work:
- Beautify/condense the long CASE exprs
- Push applicable conjuncts into individual aggregators before
the transposition step
- Added a few testing TODOs to reduce the size of this patch
- Decide whether we want to change existing plans to the new model
Execution design:
- Previous patches separated out aggregation logic from the exec node
into Aggregators. This is extended to support multiple Aggregators
per node, with different grouping and aggregating functions.
- There is a fast path for aggregations with only one aggregator,
which leaves the execution essentially unchanged from before.
- When there are multiple aggregators, the first aggregation node in
the plan replicates its input to each aggregator. The output of this
step is rows where only a single tuple is non-null, corresponding to
the aggregator that produced the row.
- A new expr is introduced, ValidTupleId, which takes one of these
rows and returns which tuple is non-null.
- For additional aggregation nodes, the input is split apart into
'mini-batches' according to which aggregator the row corresponds to.
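A simplified model of that mini-batch splitting, using ValidTupleId to
route rows (Python pseudocode; names other than ValidTupleId are
illustrative, and the real implementation operates on TupleRows in the BE):

  # Each "row" is a list with one slot per aggregation class; exactly one
  # slot is non-None, and valid_tuple_id tells us which one.
  def valid_tuple_id(row):
      for idx, tup in enumerate(row):
          if tup is not None:
              return idx
      raise ValueError("row has no valid tuple")

  def split_into_mini_batches(batch, num_aggregators):
      mini_batches = [[] for _ in range(num_aggregators)]
      for row in batch:
          # Route each row to the aggregator that produced its tuple.
          mini_batches[valid_tuple_id(row)].append(row)
      return mini_batches

  # Example: two aggregation classes; rows carry a tuple for class 0 or class 1.
  batch = [[("a", 1), None], [None, ("b", 2)], [("c", 3), None]]
  print(split_into_mini_batches(batch, 2))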
Testing:
- Added analyzer and planner tests
- Added end-to-end queries tests
- Ran hdfs/core tests
- Added support in the query generator and ran in a loop.
Change-Id: I055402eaef6d81e5f70e850d9f8a621e766830a4
Reviewed-on: http://gerrit.cloudera.org:8080/10771
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, HdfsTableSink::computeResourceProfile() didn't account
for clustering while estimating the memory requirement of an
insert fragment. This change ensures that the resource estimates
produced account for the fact that clustered inserts produce a
single partition at a time.
Testing: Modified testResourceRequirements PlannerTest to account
for clustering while generating insert plans.
Change-Id: I75f8baf5fc3e1c357edf6d0cebd1e5dbafc8a3a8
Reviewed-on: http://gerrit.cloudera.org:8080/11485
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
In every execution of an Impala query, one of the impalad daemons acts
as the coordinator node. In some cases, such as when using a proxy, a
user cannot predict which host will act as the coordinator. To aid in
diagnosis, we provide a SQL function which returns the name of the host
on which the coordinator is running.
EXTERNAL DESCRIPTION:
Add a builtin function called coordinator(), which returns the name of
the host which is running the impalad that is acting as the coordinator
for the current query.
TESTING:
- Added a basic unit test for the new function.
- Added a unit test which simulates the case when coord_address is
unset.
- Added a query that uses coordinator() to exprs.test
- Hand tested in a development deployment.
- Ran regression tests and got a clean run.
Change-Id: I94d6e2664ba659b48df53c5c06f67b502c533e47
Reviewed-on: http://gerrit.cloudera.org:8080/11459
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Running this test on an actual cluster results in a failure,
since the pool name differs. Removing the name allows this
test to pass on clusters. Also tested that minicluster tests
still pass.
Change-Id: I1529c040520a1d8e7ca47960c76028b2579c8d03
Reviewed-on: http://gerrit.cloudera.org:8080/11476
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
GroupingAggregator::Reset() doesn't call Close() on output_partition_,
which can lead to hitting a DCHECK(is_closed) in the Partition
destructor.
Testing:
- Added an e2e regression test.
Change-Id: I6db8ec8479b18b5ed681d7ac438480711ab7a1ba
Reviewed-on: http://gerrit.cloudera.org:8080/11446
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The error message described in IMPALA-6442 incorrectly reported the file offset where the
Parquet footer starts, as if the offset is counted from the file end instead of from the
file beginning. The fix changed the reported file offset to be counted from the beginning
of the Parquet file.
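For reference, a sketch of the corrected offset arithmetic, assuming the
standard Parquet file layout (the file ends with the 4-byte footer length
followed by the 4-byte "PAR1" magic); the helper is illustrative, not the
code touched by this fix:

  # Where the footer starts, counted from the beginning of the file,
  # which is what the fixed error message now reports.
  def footer_start_offset(file_size, footer_size):
      return file_size - 8 - footer_size  # 8 = footer length field + magic

  # E.g. a 1000-byte file with a 200-byte footer: footer starts at byte 792.
  print(footer_start_offset(1000, 200))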
Testing:
Created a small table that contains one row of data with a single BIGINT
column and stored it as Parquet. Manually changed the footer size field to be
1) smaller than the original footer size by 1, to trigger the error message
fixed by this JIRA and verify that the fix functions correctly;
2) bigger than the file size, to trigger another related error message to be
printed.
Change-Id: I35235e99ea9ceb0d31961dd3b8069f7194f5a2de
Reviewed-on: http://gerrit.cloudera.org:8080/11379
Reviewed-by: Lars Volker <lv@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch implements support for primitive type widening on Parquet
tables. It only supports conversions that do not lose any precision:
- tinyint (INT32) -> smallint (INT32), int (INT32), bigint (INT64),
double (DOUBLE)
- smallint (INT32) -> int (INT32), bigint (INT64), double (DOUBLE)
- int (INT32) -> bigint (INT64), double (DOUBLE)
- float (FLOAT) -> double (DOUBLE)
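A compact sketch of the widening rules listed above (the helper and the
type-name strings are illustrative, not Impala code):

  # Allowed lossless widenings from the file's declared type to the table type.
  ALLOWED_WIDENINGS = {
      "tinyint":  {"smallint", "int", "bigint", "double"},
      "smallint": {"int", "bigint", "double"},
      "int":      {"bigint", "double"},
      "float":    {"double"},
  }

  def widening_allowed(file_type, table_type):
      return (table_type == file_type
              or table_type in ALLOWED_WIDENINGS.get(file_type, set()))

  print(widening_allowed("int", "bigint"))  # True
  print(widening_allowed("bigint", "int"))  # False: narrowing would lose precision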
Testing:
- Added BE test
- Added E2E test
- Ran core tests
Change-Id: If93394b035c64cf6fc5f37b54d29c034cc1f86e4
Reviewed-on: http://gerrit.cloudera.org:8080/11268
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
We had quite a few tests that created a table and used
"hdfs dfs -copyFromLocal" to copy data files to the
warehouse directory for this table.
This operation needs some boilerplate code, which I
refactored into the new functions
create_table_from_parquet() and
create_table_and_copy_files().
Change-Id: Ie00a4561825facf8abe2e8e74a6b6e93194f416f
Reviewed-on: http://gerrit.cloudera.org:8080/11127
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
If result.ptr allocation fails for some reason inside the StringVal
constructor, we still overwrite result.len and continue.
This change checks that the StringVal pointer is not NULL before
dereferencing it, and returns NULL if it is.
Testing: Added a test case of the to_date() function to
alloc-fail-init.test to leverage the fault injector
--stress_fn_ctx_alloc.
Change-Id: I14cfb29a592885bb2f39958c8644f93db5220a68
Reviewed-on: http://gerrit.cloudera.org:8080/11286
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
The issue was that the row batch queue could grow a lot if the consumer
was slow.
Also add an additional test to exercise the OOM code path in Kudu for
completeness.
Testing:
Added sleep to kudu-scan-node.cc that reproduced the problem.
Looped modified test to flush out flakiness.
Change-Id: Ic4a95b6b6d96a447df68ef4912a86f1e11f219ca
Reviewed-on: http://gerrit.cloudera.org:8080/11285
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, when creating a new database, the CatalogOpExecutor would
create an HMS Database object, issue the HMS createDatabase call, and
then create a Catalog entry from that same Database object. The
resulting Catalog entry would be missing certain fields that are
auto-created by the HMS itself, most importantly the location field.
The code for CTAS seems to have contained a workaround for this issue
ever since catalogd was first introduced: rather than using the location
stored in the Db object, it would re-fetch the Database from HMS.
Now that this is fixed, that workaround could be removed and some code
simplified.
A new test verifies that a newly-created database has the appropriate
location, and existing CTAS tests verify that functionality didn't
regress.
Change-Id: I13df31cee1e5768b073e0e35c4c16ebf1892be23
Reviewed-on: http://gerrit.cloudera.org:8080/11229
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This restores some of the heuristics removed in IMPALA-4835 that help
prevent scans from hitting OOM conditions. The heuristics are implemented
at the query level rather than in each scan node in isolation.
Introduce a ScannerMemLimiter class, owned by the QueryState, that
tracks the amount of memory estimated to be consumed by all scanner
threads running for the query on the current backend.
Also check soft memory limits to see if scanner threads should be
started or the current scanner thread should stop.
The long-term plan is to switch to the MT scan node implementations.
When that happens this code can be removed. In the meantime this
code is imperfect but will help avoid OOM in many scenarios.
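A rough model of the decision the ScannerMemLimiter makes when a scan wants
to start another scanner thread (method names and bookkeeping are
illustrative, not the actual class interface):

  class ScannerMemLimiter:
      def __init__(self, query_mem_limit):
          self.query_mem_limit = query_mem_limit
          self.reserved_bytes = 0  # estimated consumption of running scanner threads

      def claim_memory_for_thread(self, estimated_thread_bytes, current_consumption):
          # Only start another scanner thread if its estimated consumption,
          # plus what is already reserved and consumed, stays under the limit.
          total = current_consumption + self.reserved_bytes + estimated_thread_bytes
          if total > self.query_mem_limit:
              return False
          self.reserved_bytes += estimated_thread_bytes
          return True

      def release_memory_for_thread(self, estimated_thread_bytes):
          self.reserved_bytes -= estimated_thread_bytes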
Testing:
Added regression tests for HDFS and Kudu where we previously could
run out of memory with a low mem_limit.
Manual testing:
* Ran query tests with --thread_creation_fault_injection=true for a
bit, confirmed no crashes.
* ran single-node stress test for Kudu and Parquet for 10-20 min each.
Change-Id: Ib9907fa8c4d2b0b85f67f4f160899c1c258ad82b
Reviewed-on: http://gerrit.cloudera.org:8080/11103
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The test started running into IMPALA-7446, maybe because of
a timing change. This appears to have always been possible.
The fix is to reduce the memory requirement of the test.
IMPALA-2256 is no longer really possible because the
BufferedTupleStream code was simplified to avoid the 32-bit
row index limitation, so we're not losing important coverage
on the current code with this change.
Testing:
Ran test in a loop to confirm it did not OOM.
Change-Id: I9d9480cad6bf8222abe990e7046498a0531e2849
Reviewed-on: http://gerrit.cloudera.org:8080/11223
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds some informational output to explain plans and
sends the information to the backend.
The idea is that this will make it easier to explain how Impala's
pipelined execution works and also enable future work on profile
analysis that can more intelligently group plan nodes.
Tests:
* Updated planner tests to include new output.
Change-Id: I1d10eb14d997242f445e5c5fc5362d5410370721
Reviewed-on: http://gerrit.cloudera.org:8080/10848
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for aggregate resource limits at runtime, specified
via query options. If a query exceeds a limit it is terminated. The
checks are periodic so the query may go somewhat over the limits.
SCAN_BYTES_LIMIT is exposed as an advanced query option.
CPU_LIMIT_S is hidden as a development query option because it is flawed:
the CPU user/sys time is only updated upon thread completion, so in
many cases the limit will not take effect until well after the resources
have been used. IMPALA-7318 tracks enabling this.
The query profile is updated to include query-wide and per-backend metrics
for CPU and scanned bytes. Example from "select count(*) from
tpch_parquet.lineitem":
Per Node Peak Memory Usage: tarmstrong-box:22000(289.50 KB) tarmstrong-box:22001(249.50 KB) tarmstrong-box:22002(249.50 KB)
Per Node Bytes Read: tarmstrong-box:22000(100.00 KB) tarmstrong-box:22001(100.00 KB) tarmstrong-box:22002(100.00 KB)
Per Node User Time: tarmstrong-box:22000(40.000ms) tarmstrong-box:22001(32.000ms) tarmstrong-box:22002(24.000ms)
Per Node System Time: tarmstrong-box:22000(0.000ns) tarmstrong-box:22001(0.000ns) tarmstrong-box:22002(0.000ns)
- FiltersReceived: 0 (0)
- FinalizationTimer: 0.000ns
- NumBackends: 3 (3)
- NumFragmentInstances: 4 (4)
- NumFragments: 2 (2)
- TotalBytesRead: 300.00 KB (307200)
- TotalCpuTime: 96.000ms
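Conceptually, the periodic enforcement described above amounts to something
like the sketch below; the query option names are from this patch, but the
helper function and its inputs are illustrative:

  # Conceptual model of the periodic resource limit check; because it runs
  # periodically, a query can overshoot a limit somewhat before termination.
  def check_resource_limits(query_options, scan_bytes, cpu_user_s, cpu_sys_s):
      scan_limit = query_options.get("SCAN_BYTES_LIMIT", 0)
      cpu_limit = query_options.get("CPU_LIMIT_S", 0)
      if scan_limit and scan_bytes > scan_limit:
          return "terminated: scan bytes limit exceeded"
      if cpu_limit and cpu_user_s + cpu_sys_s > cpu_limit:
          # user/sys time is only updated on thread completion (IMPALA-7318),
          # so this can trigger much later than the limit was actually exceeded.
          return "terminated: CPU time limit exceeded"
      return None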
Testing:
Added tests for various permutations of CPU_LIMIT_S and
SCAN_BYTES_LIMIT
Based on a previous patch by Mostafa Mokhtar
<mmokhtar@cloudera.com>
Change-Id: I3e85f80b70b3fce47e637e9322ed0316ee84f6a9
Reviewed-on: http://gerrit.cloudera.org:8080/11081
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The new test added by IMPALA-7308 failed on local-filesystem builds
because the warehouse path is not directly at /test-warehouse. This
fix prefixes the path appropriately with $FILESYSTEM_PREFIX.
Additionally, the fix for IMPALA-5542 made a similar mistake
constructing a path on the Python side of a test case. Fixed by using
the get_fs_path function.
Change-Id: I6922e24a74576d0d000e8e2645a235868583c1e1
Reviewed-on: http://gerrit.cloudera.org:8080/11164
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The core of the fix is in ProcessChildBatches(), where we copy
'prev_input_tuple_' to 'prev_input_tuple_pool_' and reset
the child batch *before* the call to child(0)->GetNext(). This
solves a couple of problems:
* prev_input_tuple_ may be referencing memory from the child
that had the needs_deep_copy() flag set and therefore will
be freed or recycled when calling child(0)->GetNext() again.
* 'prev_child_batch_' may have been holding onto resources that
the child had flushed, which means they need to be freed before
the next GetNext() call.
Also refactors the logic around child_cmp_row_ to make the variable
lifetime and data flow clearer.
Testing:
Add regression test. The test passes both with this patch alone and with
IMPALA-7333 reapplied.
Change-Id: I09eb6213d47287f2addb72f8c1304085d2d48c55
Reviewed-on: http://gerrit.cloudera.org:8080/11155
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This adds support for loading Avro-formatted tables in LocalCatalog. In
the case that the table properties indicate a table is Avro-formatted,
the semantics are identical to the existing catalog implementation:
- if an explicit avro schema is specified, it overrides the schema
provided by the HMS
- if no explicit avro schema is specified, one is inferred, and then the
inferred schema takes the place of the one provided by the HMS (thus
promoting columns like TINYINT to INT)
- on COMPUTE STATS, if any discrepancy is discovered between the HMS
schema and the inferred schema, an error is emitted.
The semantics for LocalCatalog are slightly different in the case of
tables which have not been configured as Avro format on the table level:
The existing implementation has the behavior that, when a table is
loaded, all partitions are inspected, and, if any partition is
discovered with Avro format, the above rules are applied. This has some
very unexpected results, described in an earlier email to
dev@impala.apache.org [1]. To summarize that email thread, the existing
behavior was decided to be unintuitive and inconsistent with Hive.
Additionally, this behavior requires loading all partitions up-front,
which gets in the way of the goal of lazy/granular metadata loading in
LocalCatalog.
Thus, the LocalCatalog implementation differs as follows:
- the "schema override" behavior ONLY occurs if the Avro file format has
been selected at a table level.
- if an Avro partition is added to a non-Avro table, and that partition
has a schema that isn't compatible with the table's schema, an error
will occur on read.
The thread additionally discusses adding an error message on "alter" to
prevent users from adding an Avro partition to a table with an
incompatible schema. To keep the scope of this patch minimal, that is
not yet implemented here. I filed IMPALA-7309 to change the behavior of
the existing catalog implementation to match.
A new test verifies the behavior, set to 'xfail' when running on the
existing catalog implementation.
[1] https://lists.apache.org/thread.html/fb68c54bd66a40982ee17f9f16f87a4112220a5df035a311bda310f1@%3Cdev.impala.apache.org%3E
Change-Id: Ie4b86c8203271b773a711ed77558ec3e3070cb69
Reviewed-on: http://gerrit.cloudera.org:8080/10970
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Vuk Ercegovac <vercegovac@cloudera.com>
Adds the resource estimates for key benchmark workloads:
TPC-H, TPC-DS, TPC-H Nested and TPC-H Kudu to the planner
test so that we can track changes in resource requirements
and estimates for these queries.
Also don't show decimal places for MB and KB estimates. The
estimates are not accurate to that level and displaying
extra precision has some disadvantages:
* It communicates to readers that the estimates have a high level of
precision.
* It increases the odds of small variations in file sizes, etc.,
causing test failures.
Also fixed a regex in the stress test that didn't escape the decimal
point correctly.
Testing:
Ran core tests.
Change-Id: I6a9f836699200ea87fb03bf36abad0e23949ac26
Reviewed-on: http://gerrit.cloudera.org:8080/11087
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
As part of IMPALA-110, all of the logic of performing aggregations was
refactored out of the aggregation ExecNode and into Aggregators. Each
Aggregator manages its own memory, so a DCHECK was added in
AggregationNode::Close to ensure that no allocations were
made from the regular ExecNode mem pools.
This DCHECK is violated if the node was assigned conjuncts that
allocate mem in Prepare - even though the conjuncts are evaluated in
the Aggregator, we still initialize them in ExecNode::Prepare.
This patch solves the problem by creating a copy of the TPlanNode
without the conjuncts to pass to ExecNode. In the future, when
TAggregator is introduced, we can get rid of this by directly
assigning conjuncts to Aggregators.
Note that this doesn't affect StreamingAggregationNode, which never
has conjuncts assigned to it, but this patch fixes an incorrect DCHECK
that enforces this.
Testing:
- Added a regression test for this case.
Change-Id: I65ae00ac23a62ab9f4a7ff06ac9ad5cece80c39d
Reviewed-on: http://gerrit.cloudera.org:8080/11132
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change adds a new query option "timezone" which
defines the timezone used for utc<->local conversions.
The main goal is to simplify testing, but I think that
some users may also find it useful so it is added as a
"general" query option.
Examples:
set timezone=UTC;
set timezone="Europe/Budapest"
The timezones are validated, but as query options are not
sent to the coordinator immediately, the error checking
will only happen when running a query.
Leading/trailing " and 'characters are stripped because the
/ character cannot be entered unquoted in some contexts.
Currently the timezone has effect in the following cases:
-function now()
-conversions between unix time and timestamp if flag
use_local_tz_for_unix_timestamp_conversions is true
-reading parquet timestamps written by Hive if flag
convert_legacy_hive_parquet_utc_timestamps is true
In the near future the Parquet timestamp isAdjustedToUTC
property will be supported, which will decide whether
to do utc->local conversion on a per file+column basis.
This conversion will also be affected.
Testing:
- Extended test_local_tz_conversion.py to actually
test utc<->local conversion. Until now the effect
of flag use_local_tz_for_unix_timestamp_conversions
was practically untested.
- Added a shell test to check that the default of the
query option is the system's timezone.
- Added a shell test to check timezone validation.
Change-Id: I73de86eff096e1c581d3b56a0d9330d686f77272
Reviewed-on: http://gerrit.cloudera.org:8080/11064
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The Decimal type in Parquet is a logical type. That means
the Parquet file stores some physical/primitive type that
is annotated by the DECIMAL tag to make it behave like
decimals.
The allowed physical types for decimals are INT32, INT64,
FIXED, and BINARY. Before this commit Impala could only
read decimals stored as FIXED or BINARY.
Spark decided to write decimals as INT32 or INT64 when
their precision allows it:
(1 <= precision <= 9) ==> INT32
(10 <= precision <= 18) ==> INT64
I updated our column readers to accept INT32 and INT64
as valid physical types for decimals.
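For reference, a small sketch of the precision-to-physical-type mapping
above, as Spark applies it when writing (the helper name is illustrative):

  # Physical type Spark chooses for a DECIMAL(precision, scale) column,
  # per the ranges listed above.
  def spark_decimal_physical_type(precision):
      if 1 <= precision <= 9:
          return "INT32"
      if 10 <= precision <= 18:
          return "INT64"
      return "FIXED_LEN_BYTE_ARRAY or BINARY"  # larger precisions

  print(spark_decimal_physical_type(7))   # INT32
  print(spark_decimal_physical_type(15))  # INT64
  print(spark_decimal_physical_type(30))  # FIXED_LEN_BYTE_ARRAY or BINARY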
Testing:
* extended parquet-plain-test.cc
* added Parquet files generated by Spark 2.3.1
and updated test_scanners.py
Change-Id: Ib8c41bfc7c1664bdba5099d3893dc8dbe4304794
Reviewed-on: http://gerrit.cloudera.org:8080/11000
Reviewed-by: Zoltan Borok-Nagy <boroknagyz@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Previously, the planner used getMajorityFormat to estimate
the memory requirements of a scan's partitions. Additionally, before
IMPALA-6625 was merged, the majority format for a multi-format
table with no numerical majority was calculated using a HashMap,
thus producing non-deterministic results. This change ensures that
the memory estimate is deterministic and always based on the partition
that has the maximum memory requirement.
Testing: Ran all PlannerTests. Also, modified plans of scans with
multiple partitions to ensure that the memory estimate produced
corresponds to the partition with the maximum requirement.
Change-Id: I0666ae3d45fbd8615d3fa9a8626ebd29cf94fb4b
Reviewed-on: http://gerrit.cloudera.org:8080/11001
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The handling of sync markers after processing a block was broken: eos_
was set if the sync marker straddled the block boundary. The expected
behaviour (documented by comments) in this case is that the current
scanner should process the next block, if there is one.
If you look at the logic before the IMPALA-3905 change in commit
931bf49cd9, it split the checking
of eosr() and eof() similarly to this patch.
Testing:
Add regression tests that scan a large table with a variety of
different scan range lengths, with some randomisation to exercise
different edge cases. This reliably triggered the bug.
Change-Id: I49a70a4925b0271204b8eea4f980299d7654805a
Reviewed-on: http://gerrit.cloudera.org:8080/11062
Reviewed-by: Michael Ho <kwho@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch removes the use of IMPALA_MINICLUSTER_PROFILE. The code that
uses IMPALA_MINICLUSTER_PROFILE=2 is removed and it defaults to code from
IMPALA_MINICLUSTER_PROFILE=3. To avoid having too many code
changes in this patch, there is no code change for the shims. The shims
for IMPALA_MINICLUSTER_PROFILE=3 automatically become the default
implementation.
Testing:
- Ran core and exhaustive tests
Change-Id: Iba4a81165b3d2012dc04d4115454372c41e39f08
Reviewed-on: http://gerrit.cloudera.org:8080/10940
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Some downstream tests run this with a fair-scheduler.xml set that, while
not changing admission control behavior, does change the name of the
pool. Omit the pool name to permit that downstream test to succeed.
Testing:
- local with change in minicluster
- downstream in environment as well
Change-Id: I3fe6beb169dc6bfefabde9dc7a4632c1a5e63fa7
Reviewed-on: http://gerrit.cloudera.org:8080/10942
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
A recent change, IMPALA-110 (part 2), refactored
PartitionedAggregationNode into several classes, including a new type
'Aggregator'. During this refactor, code that makes local allocations
while evaluating exprs was moved from the ExecNode (now
AggregationNode/StreamingAggregationNode) into the Aggregators, but
code related to cleaning up these allocations (i.e. QueryMaintenance())
was not, resulting in some queries using an excessive amount of
memory.
This patch removes all calls to QueryMaintenance() from the exec nodes
and moves them into the Aggregators.
Testing:
- Added new test cases with a mem limit that fails if the expr
allocations aren't released in a timely manner.
- Passed a full exhaustive run.
Change-Id: I4dac2bb0a15cdd7315ee15608bae409c125c82f5
Reviewed-on: http://gerrit.cloudera.org:8080/10871
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This change ensures that the planner computes parquet conjuncts
only for scans containing parquet files. Additionally, it
also handles PARQUET_DICTIONARY_FILTERING and
PARQUET_READ_STATISTICS query options in the planner.
Testing was carried out independently on parquet and non-parquet
scans:
1. Parquet scans were tested via the existing parquet-filtering
planner test. Additionally, a new test
[parquet-filtering-disabled] was added to ensure that the
explain plan generated skips parquet predicates based on the
query options.
2. Non-parquet scans were tested manually to ensure that the
functions to compute parquet conjuncts were not invoked.
Additional test cases were added to the parquet-filtering
planner test to scan non parquet tables and ensure that the
plans do not contain conjuncts based on parquet statistics.
3. A parquet partition was added to the alltypesmixedformat
table in the functional database. Planner tests were added
to ensure that Parquet conjuncts are constructed only when
the Parquet partition is included in the query.
Change-Id: I9d6c26d42db090c8a15c602f6419ad6399c329e7
Reviewed-on: http://gerrit.cloudera.org:8080/10704
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>