This change aims to decrease back-pressure in the sorter. It offers an
alternative to the in-memory run formation strategy and sorting
algorithm by introducing a new in-memory merge level between the
in-memory quicksort and the external merge phase.
Instead of forming one big run, it produces many smaller in-memory runs
(called miniruns), sorts those with quicksort, then merges them
in memory, before spilling or serving GetNext().
The external merge phase remains the same.
This works with the MAX_SORT_RUN_SIZE development query option, which
determines the maximum number of pages in a 'minirun'. The default
value of MAX_SORT_RUN_SIZE is 0, which keeps the original
implementation of 1 big initial in-memory run. Other valid values are
integers of 2 and above. A value of 10 or more is recommended, to
avoid high fragmentation with large workloads and variable-length
data.
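To illustrate the idea (a Python sketch, not the C++ implementation;
the real option caps pages per minirun rather than rows):

  import heapq

  def sort_with_miniruns(rows, max_run_size):
      # max_run_size == 0 keeps the original behavior: one big run.
      if max_run_size == 0:
          return sorted(rows)
      # Form many smaller runs and quicksort each one individually...
      runs = [sorted(rows[i:i + max_run_size])
              for i in range(0, len(rows), max_run_size)]
      # ...then merge them in memory before spilling or GetNext().
      return list(heapq.merge(*runs))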
Testing:
- added MAX_SORT_RUN_SIZE as an additional test dimension to
test_sort.py with values [0, 2, 20]
- additional partial sort test case (inserting into partitioned
kudu table)
- manual E2E testing
Change-Id: I58c0ae112e279b93426752895ded7b1a3791865c
Reviewed-on: http://gerrit.cloudera.org:8080/18393
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Csaba Ringhofer <csringhofer@cloudera.com>
As a first stage of IMPALA-10939, this change implements support for
including top-level collections that only contain fixed-length types
(including fixed-length structs) in the sorting tuple. For these
types the implementation is almost the same as the existing handling
of strings.
Another limitation is that structs that contain any type of collection
are not yet allowed in the sorting tuple.
Also refactored the RawValue::Write*() functions to have a clearer
interface.
Testing:
- Added a new test table that contains many rows with arrays. This is
queried in a new test added in test_sort.py, to ensure that we handle
spilling correctly.
- Added tests that have arrays and/or maps in the sorting tuple in
test_queries.py::TestQueries::{test_sort,
test_top_n,test_partitioned_top_n}.
Change-Id: Ic7974ef392c1412e8c60231e3420367bd189677a
Reviewed-on: http://gerrit.cloudera.org:8080/19660
Reviewed-by: Csaba Ringhofer <csringhofer@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Python 3 changes the built-in list-producing functions such as range,
map, and filter to be lazy. Code that expects them to produce a list
immediately will fail, e.g.
Python 2:
  >>> range(0,5) == [0,1,2,3,4]
  True
Python 3:
  >>> range(0,5) == [0,1,2,3,4]
  False
The fix is to wrap such locations with list(), i.e.
Python 3:
  >>> list(range(0,5)) == [0,1,2,3,4]
  True
Since the base operators are now lazy, Python 3 also removes the
old lazy versions (e.g. xrange, ifilter, izip, etc). This uses
future's builtins package to convert the code to the Python 3
behavior (i.e. xrange -> future's builtins.range).
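For example:

  # Before (Python 2 only): xrange is gone in Python 3.
  total = sum(i * i for i in xrange(1000))

  # After: the 'future' package's builtins module provides a lazy
  # range on Python 2, matching Python 3 behavior.
  from builtins import range
  total = sum(i * i for i in range(1000))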
Most of the changes were done via these futurize fixes:
- libfuturize.fixes.fix_xrange_with_import
- lib2to3.fixes.fix_map
- lib2to3.fixes.fix_filter
This eliminates the pylint warnings:
- xrange-builtin
- range-builtin-not-iterating
- map-builtin-not-iterating
- zip-builtin-not-iterating
- filter-builtin-not-iterating
- reduce-builtin
- deprecated-itertools-function
Testing:
- Ran core job
Change-Id: Ic7c082711f8eff451a1b5c085e97461c327edb5f
Reviewed-on: http://gerrit.cloudera.org:8080/19589
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
This takes steps to make Python 2 behave like Python 3 as
a way to flush out issues with running on Python 3. Specifically,
it handles two main differences:
1. Python 3 requires absolute imports within packages. This
can be emulated via "from __future__ import absolute_import"
2. Python 3 changed division to "true" division that doesn't
round to an integer. This can be emulated via
"from __future__ import division"
This changes all Python files to add imports for absolute_import
and division. For completeness, this also includes print_function in the
import.
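For example:

  from __future__ import absolute_import, division, print_function

  print(7 / 2)   # 3.5 on both Python 2 and 3 ("true" division)
  print(7 // 2)  # 3 on both - use '//' where an integer is needed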
I scrutinized each old-division location and converted those that
need an integer result (e.g. for indices, counts of records, etc.) to
use the integer division '//' operator. Some code was also using
relative imports and needed to be adjusted to handle absolute_import.
This fixes all Pylint warnings about no-absolute-import and old-division,
and these warnings are now banned.
Testing:
- Ran core tests
Change-Id: Idb0fcbd11f3e8791f5951c4944be44fb580e576b
Reviewed-on: http://gerrit.cloudera.org:8080/19588
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
Result spooling has been relatively stable since it was introduced,
and it has several benefits, described in IMPALA-8656. This patch
enables the result spooling (SPOOL_QUERY_RESULTS) query option by
default.
Furthermore, some tests need to be adjusted to account for result
spooling being on by default. The following are the adjustment
categories and the tests that fall under each category.
Change in assertions:
PlannerTest#testAcidTableScans
PlannerTest#testBloomFilterAssignment
PlannerTest#testConstantFolding
PlannerTest#testFkPkJoinDetection
PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled
PlannerTest#testKuduSelectivity
PlannerTest#testMaxRowSize
PlannerTest#testMinMaxRuntimeFilters
PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled
PlannerTest#testMtDopValidation
PlannerTest#testParquetFiltering
PlannerTest#testParquetFilteringDisabled
PlannerTest#testPartitionPruning
PlannerTest#testPreaggBytesLimit
PlannerTest#testResourceRequirements
PlannerTest#testRuntimeFilterQueryOptions
PlannerTest#testSortExprMaterialization
PlannerTest#testSpillableBufferSizing
PlannerTest#testTableSample
PlannerTest#testTpch
PlannerTest#testKuduTpch
PlannerTest#testTpchNested
PlannerTest#testUnion
TpcdsPlannerTest
custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates
custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs
metadata/test_explain.py::TestExplain::test_explain_level2
metadata/test_explain.py::TestExplain::test_explain_level3
metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation
Increase BUFFER_POOL_LIMIT:
query_test/test_queries.py::TestQueries::test_analytic_fns
query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation
query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output
query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive
query_test/test_udfs.py::TestUdfExecution::test_mem_limits
Increase MEM_LIMIT:
query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling
Increase MAX_ROW_SIZE:
custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config
query_test/test_insert.py::TestInsertQueries::test_insert_large_string
query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit
query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter
query_test/test_scanners.py::TestWideRow::test_wide_row
Disable result spooling to maintain assertion (see the sketch below
the list):
custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query
custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs
query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only)
query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage
query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings
query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit
shell/test_shell_client.py::TestShellClient::test_fetch_size
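The opt-out for this last category is a per-test query option,
roughly as follows (a Python sketch in the style of the test
framework; the test and workload names are hypothetical):

  def test_with_spooling_disabled(self, vector):
      # Pin the option for this test only; its assertions depend on
      # row batches being fetched incrementally, not spooled.
      vector.get_value('exec_option')['spool_query_results'] = '0'
      self.run_test_case('QueryTest/some-workload', vector)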
Testing:
- Pass exhaustive tests.
Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4
Reviewed-on: http://gerrit.cloudera.org:8080/16755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
test_multiple_sort_run_bytes_limits seems to have become flaky in
ubuntu-16.04-dockerised-tests. This flakiness may come from a change
in the accuracy of query estimates (the mem_limit specified in the
test no longer fits), or from query concurrency in the minicluster
disturbing the expected memory allocation. This patch removes the
second test case of test_multiple_sort_run_bytes_limits due to its
variability across several past test runs. This does not compromise
the test itself, because the basic feature of sort_run_bytes_limit is
still verified by the remaining test cases. The assertion is also
changed a bit to allow easier debugging if the test regresses again
in the future.
Testing:
- Run and pass test_sort.py
Change-Id: I84a8b579c943cddba4432cf183f7f002ef8ec6ad
Reviewed-on: http://gerrit.cloudera.org:8080/16301
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The sorter node works by adding row batches to a sort run. After all
batches are added to the current unsorted run, or the memory limit is
hit, the sorter immediately starts sorting the run. In the latter
case, the sorter spills the sorted run to disk after the sort
completes, creates a new unsorted run object, continues adding the
next row batches, and so on.
This algorithm tries to fit as many rows into memory as possible
before starting to sort. However, in the case of a partitioned sort
with a large number of row batches, fitting too many rows into memory
makes the sort slow and blocks the sorter node for a long time before
it can release some memory and continue accepting the next row batch
from the exchange node. One slow sorter node can block the exchange
node from sending row batches to another sorter node that is free.
This patch speeds up the decision to start the sort, without waiting
for the memory limit to be hit first, by capping the intermediate
quicksort run to a lower memory limit, determined by the query option
'sort_run_bytes_limit'. If the total used reservation of the
quicksort run exceeds sort_run_bytes_limit, the current unsorted_run_
is wrapped up, sorted, and then spilled. Thus, the next sort run
overlaps with the spill of the previous sort run.
To reduce regressions for cases where the total input of the sort
node might fully fit into available memory, sort_run_bytes_limit is
not enforced for the first sort run. However, even the first run
stays limited by sort_run_bytes_limit if the planner's estimates hint
that a spill will inevitably happen.
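The decision can be sketched as follows (Python, simplified; the real
logic lives in the C++ sorter):

  def should_start_sort(used_reservation, sort_run_bytes_limit,
                        is_first_run, planner_predicts_spill):
      # 0 means 'not limited'.
      if sort_run_bytes_limit <= 0:
          return False
      # The first run is exempt unless a spill is predicted anyway.
      if is_first_run and not planner_predicts_spill:
          return False
      return used_reservation > sort_run_bytes_limit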
We also add a new summary counter, 'AddBatchTime', summarizing how
much time is spent in Sorter::AddBatch. The max of 'AddBatchTime'
indicates the longest time spent in Sorter::AddBatch, presumably busy
doing the intermediate sort.
Testing:
- Add new e2e test TestQueryFullSort::test_multiple_sort_run_bytes_limits
- Run core tests
- Ran data loading of the 3 largest TPC-DS fact tables at 300GB scale
into a real cluster using 5 backends and a 4GB mem_limit.
sort_run_bytes_limit is varied between unspecified (not limited) vs
512 MB. The performance result is summarized in the following table.
+---------------+---------+--------------+-----------------------+-------------------------+
| Insert table | #Rows | Avg | no limit | 512 MB limit |
| | | SortDataSize +--------+--------------+---------+---------------+
| | | per Node | Query | Max | Query | Max |
| | | | Time | AddBatchTime | Time | AddBatchTime |
+---------------+---------+--------------+--------+--------------+---------+---------------+
| store_sales | 864.00M | 15.29 GB | 30m18s | 53s311ms | 20m | 5s634ms |
+---------------+---------+--------------+--------+--------------+---------+---------------+
| catalog_sales | 431.97M | 11.34 GB | 23m24s | 31s212ms | 15m27s | 3s603ms |
+---------------+---------+--------------+--------+--------------+---------+---------------+
| web_sales | 216.01M | 5.67 GB | 8m16s | 29s250ms | 6m41s | 3s856ms |
+---------------+---------+--------------+--------+--------------+---------+---------------+
Change-Id: I2a0ba7c4bae4f1d300d4d9d7f594f63ced06a240
Reviewed-on: http://gerrit.cloudera.org:8080/15963
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The '.test' files are used to run queries for tests. These files are
run with a vector of default query options. They also sometimes
include SET queries that modify query options. If SET is used on a
query option that is included in the vector, the default value from
the vector will override the value from the SET, leading to tests that
don't actually run with the query options they appear to.
This patch asserts that '.test' files don't use SET for values present
in the default vector. It also fixes various tests that already had
this incorrect behavior.
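The check amounts to something like this (Python sketch; the names
and exact parsing are simplifications):

  import re

  def assert_no_shadowed_options(test_file_text, vector_options):
      # SET statements in a '.test' file must not touch options that
      # the test vector already pins - the vector value silently wins.
      for opt in re.findall(r'(?im)^\s*set\s+(\w+)\s*=', test_file_text):
          assert opt.lower() not in vector_options, \
              "SET %s is shadowed by the test vector" % opt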
Testing:
- Passed a full exhaustive run.
Change-Id: I4e4c0f31bf4850642b624acdb1f6cb8837957990
Reviewed-on: http://gerrit.cloudera.org:8080/12220
Reviewed-by: Thomas Marshall <thomasmarshall@cmu.edu>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
- Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
- The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge, to avoid a heavy
intermediate merge being followed by a light final merge (see the
sketch below).
- Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
- An output run is not allocated anymore for the final merge.
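The run distribution can be sketched as follows (Python; assumes a
single intermediate merge is needed, which the real code
generalizes):

  def plan_last_two_merges(num_runs, max_merge_width):
      # Greedy planning would merge max_merge_width runs in the
      # intermediate step and leave a light final merge. Instead,
      # merge just enough runs so the final merge is full, e.g. for
      # 11 runs and width 10: intermediate merge of 2, final of 10.
      if num_runs <= max_merge_width:
          return (0, num_runs)  # no intermediate merge needed
      assert num_runs <= 2 * max_merge_width - 1, "one intermediate merge"
      intermediate = num_runs - max_merge_width + 1
      return (intermediate, max_merge_width)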
Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that, except for some
exotic queries with an unreasonably small amount of buffer pool
memory available, double-buffering doesn't add to the overall
performance. This is basically because half of the available buffers
have to be sacrificed for double-buffering, and as a result the merge
tree can get deeper. In addition, the amount of I/O wait time does
not reach the level where double-buffering could offset the reduced
number of runs during a particular merge.
Performance measurements were made during manual testing to verify
that this is in fact an optimization:
- When doing a sort on top of a join with a restricted amount of
memory, the Sort node successfully allocates additional memory
right before the merging phase. This is feasible because once the
Join finishes sending new input data and InputDone() is called, it
releases memory that can be picked up by the Sorter. This results
in shallower merge trees (more runs grabbed per merge).
- On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
- The more merging steps are done, the bigger the performance gain
compared to the baseline.
Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Reviewed-on: http://gerrit.cloudera.org:8080/9943
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Moved a number of tests with tuned mem_limits. In some cases
this required separating the tests from non-tuned functional
tests.
TestQueryMemLimit used very high and very low limits only, so seemed
safe to run in all configurations.
Change-Id: I9686195a29dde2d87b19ef8bb0e93e08f8bee662
Reviewed-on: http://gerrit.cloudera.org:8080/10370
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This has two related changes.
IMPALA-6679: defer scanner reservation increases
------------------------------------------------
When starting each scan range, check to see how big the initial scan
range is (the full thing for row-based formats, the footer for
Parquet) and determine whether more reservation would be useful.
For Parquet, base the ideal reservation on the actual column layout
of each file. This avoids reserving memory that we won't use for
the actual files that we're scanning. This also avoids the need to
estimate the ideal reservation in the planner.
We also release scanner thread reservations above the minimum as
soon as threads complete, so that resources can be released slightly
earlier.
IMPALA-6678: estimate Parquet column size for reservation
---------------------------------------------------------
This change also reduces reservation computed by the planner in certain
cases by estimating the on-disk size of column data based on stats. It
also reduces the default per-column reservation to 4MB since it appears
that < 8MB columns are generally common in practice and the method for
estimating column size is biased towards over-estimating. There are two
main cases to consider for the performance implications:
* Memory is available to improve query perf - if we underestimate, we
can increase the reservation so we can do "efficient" 8MB I/Os for
large columns.
* The ideal reservation is not available - query performance is affected
because we can't overlap I/O and compute as much and may do smaller
(probably 4MB I/Os). However, we should avoid pathological behaviour
like tiny I/Os.
When stats are not available, we just default to reserving 4MB per
column, which is typically more memory than required. When stats are
available, the reservation can be reduced below 4MB when conservative
heuristics tell us with high confidence that the column data for most
or all files is smaller than 4MB.
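In effect (Python sketch; the 4MB constant is from this change, the
helper's shape is an assumption):

  DEFAULT_COLUMN_RESERVATION = 4 * 1024 * 1024  # 4MB

  def per_column_reservation(stats_based_estimate):
      # stats_based_estimate is None when no stats are available; it
      # is deliberately biased towards over-estimating column size.
      if stats_based_estimate is None:
          return DEFAULT_COLUMN_RESERVATION
      return min(DEFAULT_COLUMN_RESERVATION, stats_based_estimate)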
The stats-based heuristic could reduce scan performance if both the
conservative heuristics significantly underestimate the column size
and memory is constrained such that we can't increase the scan
reservation at runtime (in which case the memory might be used by
a different operator or scanner thread).
Observability:
Added counters to track when threads were not spawned due to reservation
and to track when reservation increases are requested and denied. These
allow determining if performance may have been affected by memory
availability.
Testing:
Updated test_mem_usage_scaling.py memory requirements and added steps
to regenerate the requirements. Loops test for a while to flush out
flakiness.
Added targeted planner and query tests for reservation calculations and
increases.
Change-Id: Ifc80e05118a9eef72cac8e2308418122e3ee0842
Reviewed-on: http://gerrit.cloudera.org:8080/9757
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This re-applies the following squashed patches, which were reverted.
I will fix the known issues with some follow-on patches.
======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation
In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.
* Remove the free buffer cache (which will be replaced by buffer pool's
own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
interfaces - previously callers of ReturnBuffer() were fudging
the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
'is_cancelled_'.
* Clarify the logic around calling Close() for the last
BufferDescriptor.
-> There appeared to be an implicit assumption that buffers would be
freed in the order they were returned from the scan range, so that
the "eos" buffer was returned last. Instead just count the number
of outstanding buffers to detect the last one.
-> Touching the is_cancelled_ field without holding a lock was hard to
reason about - violated locking rules and it was unclear that it
was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
inline at the callsites.
This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.
Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight
======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront
This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.
One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.
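The buffer recycling can be pictured as follows (Python sketch with a
blocking queue standing in for the fixed buffer set):

  import queue

  class ScanRangeBuffers:
      # Each ScanRange operates on a fixed set of buffers allocated
      # upfront and recycled as the I/O mgr works through the range.
      def __init__(self, num_buffers, buffer_size):
          self.free = queue.Queue()
          for _ in range(num_buffers):
              self.free.put(bytearray(buffer_size))

      def get_buffer(self):
          return self.free.get()   # range blocks when none available

      def return_buffer(self, buf):
          self.free.put(buf)       # ReturnBuffer() unblocks the range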
There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.
One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() not checking for cancellation in that
case. See IMPALA-6564/IMPALA-6588. I changed the logic to not try to
issue ranges for empty lists of files.
I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.
Testing:
* Ran core and exhaustive tests.
======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffer from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
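Divvying up can be sketched as follows (Python; the real algorithm is
non-trivial - see the unit tests mentioned below - this just shows a
proportional split):

  def divvy_reservation(total_reservation, column_sizes_on_disk):
      # Give each column a share proportional to its on-disk size.
      # Assumes at least one non-empty column.
      total = sum(column_sizes_on_disk)
      return [total_reservation * size // total
              for size in column_sizes_on_disk]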
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
When a Sorter::Run is initialized, if it is an initial run and has
varlen data, it requests an extra buffer to have space to sort the
varlen data should it need to spill to disk.
This extra buffer is not needed in the case of partial sorts, which
do not spill, and because this extra buffer was not included in the
calculation of the minimum required reservation, requesting it caused
the partial sort to fail in cases where the partial sort only had its
minimum reservation available to use.
The solution is to not request the extra memory for partial sorts.
Testing:
- Added a test to test_sort.py that ensures the partial sort can
complete successfully even if additional memory requests beyond its
minimum reservation are denied.
Change-Id: I2d9c0863009021340d8b684669b371a2cfb1ecad
Reviewed-on: http://gerrit.cloudera.org:8080/10031
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"
This reverts commit 25bcb258df.
Revert "IMPALA-6588: don't add empty list of ranges in text scan"
This reverts commit d57fbec6f6.
Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"
This reverts commit 24b4ed0b29.
Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"
This reverts commit 5699b59d0c.
Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"
This reverts commit 65680dc421.
Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
Skip most mem_usage_scaling tests, which are tuned for the memory
requirements on 3 daemons. Also skip test_sort_reservation_usage,
which similarly is tuned for 3 daemons.
Fix the other test to wrap the HDFS path correctly.
Testing:
Ran the modified tests by hand against the minicluster to confirm
they still ran as expected. Running full set of local tests.
Change-Id: I76086ed695bf78e3e0f2745c1964dac8330d6c19
Reviewed-on: http://gerrit.cloudera.org:8080/9463
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This is the final patch to switch the Disk I/O manager to allocate all
buffer from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This patch reserves SQL:2016 reserved words, excluding:
1. Impala builtin function names.
2. Time unit words (year, month, etc.).
3. An exception list based on a discussion.
Some test cases are modified to avoid these words. An impalad and
catalogd startup option, reserved_words_version, is added. The words
are reserved if the option is set to "3.0.0".
Change-Id: If1b295e6a77e840cf1b794c2eb73e1b9d2b8ddd6
Reviewed-on: http://gerrit.cloudera.org:8080/9096
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-by: Philip Zeyliger <philip@cloudera.com>
Tested-by: Impala Public Jenkins
When an in-memory blocking aggregation or join is in the GetNext()
phase, where it is outputting accumulated rows, we expect memory
consumption to monotonically decrease because no more rows will be
accumulated in memory.
This change adds support to release unused reservation and makes
use of it for in-memory aggregations and sorts.
We don't release memory for operators with spilled data, since they
may need the reservation to bring it back into memory. We also
don't release memory in subplans, since it will probably be used
in a later iteration of the subplan.
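Conceptually (Python sketch, hypothetical names):

  def maybe_release_unused_reservation(node):
      # Keep the reservation if spilled data may need to be brought
      # back into memory, or if a later subplan iteration will use it.
      if node.has_spilled_data or node.is_in_subplan:
          return
      node.release_reservation(node.reservation - node.used_reservation)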
Testing:
Updated spilling test that now requires less memory.
Ran stress test binary search on tpch_parquet. No changes, except
Q18 now requires 325MB instead of 450MB to execute without spilling.
Ran query with two sorts in the same pipeline and watched /memz to
confirm that the first node in the pipeline was incrementally releasing
memory. Added a regression test based on this experiment.
Added a backend test to directly test reservation decreasing.
Change-Id: I6f4d0ad127d5fcd14b9821a7c127eec11d98692f
Reviewed-on: http://gerrit.cloudera.org:8080/7619
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.
Port ExecNodes to use BufferPool:
* Each ExecNode has to claim its reservation during Open()
* Port Sorter to use BufferPool.
* Switch from BufferedTupleStream to BufferedTupleStreamV2
* Port HashTable to use BufferPool via a Suballocator.
This also makes PAGG memory consumption more efficient (avoiding
wasted buffers) and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)
TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.
Testing:
* Updated tests to reflect new memory requirements
Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
This was a poorly written test that relied on assumptions about the
behavior of 'rand' and about the order in which rows in a table get
processed, neither of which Impala actually guarantees.
The new version is still sensitive to the precise behavior of
'rand()', but shouldn't be flaky unless that behavior is changed.
Change-Id: If1ba8154c2b6a8d508916d85391b95885ef915a9
Reviewed-on: http://gerrit.cloudera.org:8080/6775
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
Previously, exprs used in sorts were evaluated lazily. This can
potentially be bad for performance if the exprs are expensive to
evaluate, and it can lead to crashes if the exprs are
non-deterministic, as this violates assumptions of our sorting
algorithm.
This patch addresses these issues by materializing ordering exprs.
It does so when the expr is non-deterministic (including when it
contains a UDF, since we currently cannot know whether a UDF is
deterministic), or when its cost exceeds a threshold (or the cost is
unknown).
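The decision can be sketched as follows (Python; the threshold name
and expr attributes are assumptions):

  def should_materialize(expr, cost_threshold):
      # Non-deterministic exprs must be materialized for correctness;
      # UDFs count as non-deterministic since we cannot know better.
      if expr.is_non_deterministic or expr.contains_udf:
          return True
      # Materialize expensive exprs, and exprs with unknown cost.
      return expr.cost is None or expr.cost > cost_threshold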
Testing:
- Added e2e tests in test_sort.py.
- Updated planner tests.
Change-Id: Ifefdaff8557a30ac44ea82ed428e6d1ffbca2e9e
Reviewed-on: http://gerrit.cloudera.org:8080/6322
Reviewed-by: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Tested-by: Impala Public Jenkins
This patch addresses warning messages from pytest re: the imported
TestMatrix, TestVector, and TestDimension classes, which were being
collected as potential test classes. The fix was to simply prepend
the class names with Impala-
git grep -l 'TestDimension' | xargs \
sed -i 's/TestDimension/ImpalaTestDimension/g'
git grep -l 'TestMatrix' | xargs \
sed -i 's/TestMatrix/ImpalaTestMatrix/g'
git grep -l 'TestVector' | xargs \
sed -i 's/TestVector/ImpalaTestVector/g'
The tests all passed in an exhaustive run on the upstream jenkins
server:
http://jenkins.impala.io:8080/view/Utility/job/pre-review-test/8/
Change-Id: I06b7bc6fd99fbb637a47ba376bf9830705c1fce1
Reviewed-on: http://gerrit.cloudera.org:8080/5794
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files_txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
Also make test_scratch_disk.py more deterministic, by using
max_block_mgr_memory, which doesn't include scanner memory.
The fixed test_scratch_disk.py exercises the other sorter bugs
that occur when scratch cannot be written.
Testing:
Added a test that does a sort with various memory limits and consumes
the whole output of the sorter (we have many tests of sorts with limits
but limited coverage of sorts without limits). Ran an exhaustive test
run before posting for review.
This added test reproduced one of the sorter bugs, where var-len blocks
were not always attached to the output batch. The other bug was
reproduced by the test change in IMPALA-3669: test_scratch_disk fix.
Change-Id: Ia1a0ddffa0a5b157ab86a376b7b7360a923698d6
Reviewed-on: http://gerrit.cloudera.org:8080/3315
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Clarify relationships between classes; clean up the previous mess
where every class was friends with every other, so that there is an
actual distinction between public and private members. TupleIterator
is now no longer tied to TupleSorter, just Run.
Document and enforce invariants in many cases.
Factor out some functions from large functions.
Simplify and document iterator logic.
Make management of buffers when iterating over output stream more
explicitly correct: either use MarkNeedToReturn() or attach block
to the batch as appropriate. The SortedRunMerger didn't handle
resource transfer correctly, except if all the memory came from
the batch's MemPool. This patch fixes the cases when resources
are attached to the batches, but not the 'need_to_return' case.
Document that SortedRunMerger requires 'deep_copy_input' to be true
if batches can have the 'need_to_return' flag set.
Also use the atomic block exchange operation when moving between
blocks in unpinned runs to prevent pin failures at that point.
I explicitly have avoided changing the hairy block management logic
when allocating buffers for merging, that will need addressing in
a follow-up patch.
Add a SpilledRuns counter so that it's more explicit that spilling
occurred.
Testing:
Added some tests for corner cases with empty and NULL strings.
Fixed a test that previously failed with OOM but now succeeds.
Performance:
Benchmarking against the old code initially revealed some regressions
from changes in inlining. Force-inlining TupleComparator::operator()
and the iterator Next()/Prev() functions helped, and performance seems
similar or slightly better on the targeted order-by benchmarks.
Change-Id: I9c619e81fd1b8ac50e257172c8bce101a112b52a
Reviewed-on: http://gerrit.cloudera.org:8080/2826
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Switch to a median of three random tuples that should be very robust to
a range of inputs. It may be slightly worse than the existing pivot
selection on some inputs where the original algorithm is close to
optimal (e.g. already sorted inputs), but should be typically
better overall.
Always recurse on the smaller partition: this prevents stack
overflow even with bad pivot selection.
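In Python terms, the scheme is roughly the following (the real code
sorts tuples in place inside the sorter's runs):

  import random

  def quicksort(a, lo=0, hi=None):
      if hi is None:
          hi = len(a)
      while hi - lo > 1:
          # Median of three randomly chosen elements as the pivot.
          pivot = sorted(a[random.randrange(lo, hi)]
                         for _ in range(3))[1]
          # Three-way partition: [lo,lt) < pivot, [lt,gt) == pivot,
          # [gt,hi) > pivot.
          lt, i, gt = lo, lo, hi
          while i < gt:
              if a[i] < pivot:
                  a[lt], a[i] = a[i], a[lt]
                  lt += 1
                  i += 1
              elif a[i] > pivot:
                  gt -= 1
                  a[gt], a[i] = a[i], a[gt]
              else:
                  i += 1
          # Recurse only on the smaller partition; loop on the larger.
          # This bounds stack depth even with bad pivot selection.
          if lt - lo < hi - gt:
              quicksort(a, lo, lt)
              lo = gt
          else:
              quicksort(a, gt, hi)
              hi = lt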
The overhead is minimal - in profiles for small sorts I'm seeing pivot
selection take at most 0.5% of CPU time.
The improved pivot selection gives modest improvements of 2-5% on the
targeted perf order by benchmarks on a single node run with TPC-H
scale factor 20.
Change-Id: Iae50112b6deca3d6268e18b6f4daae1af279b452
Reviewed-on: http://gerrit.cloudera.org:8080/2824
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
Small buffers introduced an issue that is exacerbated by the large fanout. A stream can
only be appended to indefinitely once it has grabbed the initial I/O-sized buffer. With
small buffers, we don't grab that at the beginning anymore; before this patch, it was
grabbed when the stream first needed it. This means that by the time one stream needs
it, another stream could have already grabbed it (leaving that stream pinned with
multiple buffers).
This patch has all the streams grab an I/O buffer as soon as the first stream needs
one. This guarantees that all streams get 1 before any get 2.
Change-Id: I1be1219fc5f1fa3ceedd4d5e76ae056c8bb8ff3d
Similar to some of our other resource management objects, the buffered block mgr
will be shared by all fragments within a query.
The memory given to the block mgr is based on the query limit (e.g. 80% of query limit).
We can't have each fragment have a block mgr that uses 80% of the query limit, and
we probably don't want to impose per-fragment limits.
Change-Id: Idcd89f302534b37ed236cdd42784ae8d717ec29e
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/3965
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/4179
This patch does a few things:
1. Moves the buffered block mgr from the sorter to the runtime state. It is now
shared across the query's fragments. The partitioned hash join and agg will use
this as well.
2. Adds a Client interface to the block mgr. Each exec node is a different client
and can reserve a minimum number of buffers. This avoids starvation.
3. Updates the BufferedBlockMgr interface for getting pinned blocks, collapsing
two existing APIs.
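A sketch of the client accounting (Python; names are assumptions, and
'mgr' is any object with a free_buffers count):

  class BlockMgrClient:
      def __init__(self, mgr, min_reserved_buffers):
          # Reserving a minimum per client avoids starvation: one
          # node can never take another node's reserved buffers.
          assert min_reserved_buffers <= mgr.free_buffers
          mgr.free_buffers -= min_reserved_buffers
          self.mgr = mgr
          self.reserved = min_reserved_buffers
          self.pinned = 0

      def try_pin(self):
          if self.pinned < self.reserved:
              self.pinned += 1             # within the reservation
              return True
          if self.mgr.free_buffers > 0:
              self.mgr.free_buffers -= 1   # beyond it, if any are free
              self.pinned += 1
              return True
          return False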
Change-Id: Ibb31fbe480f3726048457f26e24a9e33f7201d86
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3504
Reviewed-by: Nong Li <nong@cloudera.com>
Tested-by: Nong Li <nong@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3574
This patch does two things in preparation for external joins. The
hash table used to contain a directory structure (buckets and nodes)
both of which were contiguous. The nodes contained the tuple ptrs
within it.
This patch changes it so the nodes are not stored contiguously but
allocated in pages. (this structure is dense and does not require
random lookups by index). The bucket structure is still contiguous
since we rely on the doubling property and random lookup by index.
The second change is that the nodes no longer store the tuple ptrs
within them. This makes it easier to build the hash table on top of
existing data.
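The node storage change can be pictured as follows (Python sketch):

  NODES_PER_PAGE = 4096

  class PagedNodes:
      # Nodes are dense, append-only, and never random-accessed by
      # index, so they can live in fixed-size pages rather than one
      # contiguous array (unlike the buckets, which rely on doubling
      # and random lookup by index).
      def __init__(self):
          self.pages = []

      def append(self, node):
          if not self.pages or len(self.pages[-1]) == NODES_PER_PAGE:
              self.pages.append([])
          self.pages[-1].append(node)
          # (page, slot) is a stable reference for bucket entries.
          return (len(self.pages) - 1, len(self.pages[-1]) - 1)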
Here's a quick benchmark doing a self join on tpch lineitem. Both
build and probe times decreased a bit.
Before:
HASH_JOIN_NODE (id=2):(Total: 1s139ms, non-child: 985.939ms, % non-child: 86.50%)
- BuildBuckets: 2.10M (2097152)
- BuildRows: 6.00M (6001215)
- BuildTime: 527.991ms
- LeftChildRows: 6.00M (6001215)
- LeftChildTime: 451.964ms
- LoadFactor: 0.50
- RowsReturned: 30.01M (30012985)
- RowsReturnedRate: 26.33 M/sec
After:
HASH_JOIN_NODE (id=2):(Total: 1s019ms, non-child: 835.350ms, % non-child: 81.97%)
- BuildBuckets: 2.10M (2097152)
- BuildRows: 6.00M (6001215)
- BuildTime: 423.175ms
- LeftChildRows: 6.00M (6001215)
- LeftChildTime: 406.67ms
- LoadFactor: 0.50
- RowsReturned: 30.01M (30012985)
- RowsReturnedRate: 29.45 M/sec
Change-Id: I79e209a24c24fb4f2f99574bcf187746fddadc06
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3245
Tested-by: jenkins
Reviewed-by: Nong Li <nong@cloudera.com>
Re-order union operands descending by their estimated per-host memory,
s.t. parent nodes can gauge the peak memory consumption of a MergeNode after
opening it during execution (a MergeNode opens its first operand in Open()).
Scan nodes are always ordered last because they can dynamically scale down their
memory usage, whereas many other nodes cannot (e.g., joins, aggregations).
One goal is to decrease the likelihood of a SortNode parent claiming too much
memory in its Open(), possibly causing the mem limit to be hit when subsequent
union operands are executed.
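In sketch form (Python; the attribute names are assumptions):

  def order_union_operands(operands):
      # Scans go last - they can dynamically scale down their memory -
      # and the rest are ordered descending by estimated per-host
      # memory, so the peak is seen when the first operand is opened.
      return sorted(operands,
                    key=lambda op: (op.is_scan_node,
                                    -op.per_host_mem_estimate))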
Change-Id: Ia51caaffd55305ea3dbd2146cd55acc7da67f382
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3146
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Alex Behm <alex.behm@cloudera.com>
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3213
Tested-by: jenkins
- Added static order by tests to test_queries.py and QueryTest/sort.test
- test_order_by.py also contains tests with static queries that are run with
multiple memory limits.
- Added stress, scratch disk and failpoints tests
- Incorporated Srinath's change that copied all order by with limit tests into
the top-n.test file
Extra time required:
Serial:
scratch disk: 42 seconds
test queries sort: 77 seconds
test sort: 56 seconds
sort stress: 142 seconds
TOTAL: 5 min 17 seconds
Parallel(8 threads):
scratch disk: 40 seconds
test queries sort: 42 seconds
test sort: 49 seconds
sort stress: 93 seconds
TOTAL: 3 min 44 sec
Change-Id: Ic5716bcfabb5bb3053c6b9cebc9bfbbb9dc64a7c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2820
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3205