We support HDFS scratch space, but only as a test-only feature with
a fixed default HDFS path.
This patch extends the HDFS scratch space to support a user-specified
path. To support this, we add a new format for the HDFS scratch space
path which requires the path to include a port number, resolving the
ambiguity with the current format of the scratch space path.
For example, the previous format for a scratch space path, taking S3
as an example, is s3a://bucketpath:#bytes:#priority. In this case the
bucket path doesn't contain a port number.
In this patch, the new format of the HDFS scratch path is
hdfs://ipaddr:#port:#bytes:#priority. The port number is required,
so there must be at least one colon in the HDFS path; the bytes and
priority fields remain optional as before. For other scratch spaces,
the path format doesn't change.
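For illustration, here is a minimal standalone C++ sketch of parsing
the new format. The helper name and the simplified splitting logic are
assumptions for illustration only, not the actual TmpFileMgr parser,
and size suffixes such as GB are not handled:

#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

struct ScratchDirSpec {
  std::string path;      // e.g. hdfs://10.0.0.1:8020
  long long bytes = -1;  // -1 means no byte limit given
  int priority = -1;     // -1 means no priority given
};

// Splits the text after "hdfs://" on ':'. The first field is the host and
// the second must be the port, so a valid HDFS scratch path always contains
// at least one colon; bytes and priority are optional trailing fields.
std::optional<ScratchDirSpec> ParseHdfsScratchDir(const std::string& spec) {
  const std::string scheme = "hdfs://";
  if (spec.rfind(scheme, 0) != 0) return std::nullopt;
  std::vector<std::string> parts;
  std::stringstream ss(spec.substr(scheme.size()));
  for (std::string tok; std::getline(ss, tok, ':');) parts.push_back(tok);
  if (parts.size() < 2 || parts[1].empty()) return std::nullopt;  // port required
  ScratchDirSpec out;
  out.path = scheme + parts[0] + ":" + parts[1];
  if (parts.size() > 2 && !parts[2].empty()) out.bytes = std::stoll(parts[2]);
  if (parts.size() > 3 && !parts[3].empty()) out.priority = std::stoi(parts[3]);
  return out;
}

int main() {
  auto spec = ParseHdfsScratchDir("hdfs://10.0.0.1:8020:1073741824:1");
  if (spec) {
    std::cout << spec->path << " bytes=" << spec->bytes
              << " priority=" << spec->priority << std::endl;
  }
}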
Also, the option allow_spill_to_hdfs is removed because spilling to
HDFS is no longer a test-only function; the e2e tests involved are
updated accordingly.
Tests:
Added and passed TmpFileMgrTest::TestDirectoryLimitParsingRemotePath.
Ran the Core tests.
Change-Id: I0882ed1e80b02724dd5cb3cdb1fa7b6c2debcbf4
Reviewed-on: http://gerrit.cloudera.org:8080/17720
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the ability to share the per-host stats for locally
admitted queries across all coordinators. This helps to get a more
consolidated view of the cluster for stats like slots_in_use and
mem_admitted when making local admission decisions.
Testing:
Added e2e py test
Change-Id: I2946832e0a89b077d0f3bec755e4672be2088243
Reviewed-on: http://gerrit.cloudera.org:8080/17683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9856 enables result spooling by default. Result spooling depends
on the ability to spill its entire BufferedTupleStream to disk once it
hits maximum memory reservation. However, if the query option
scratch_limit is set lower than max_spilled_result_spooling_mem, the
query might fail in the middle of execution due to insufficient scratch
space. This patch adds a planner change that considers the
scratch_limit and scratch_dirs query options when computing the
resources used by result spooling. The algorithm is as follows
(sketched in code below):
* If scratch_dirs is empty or scratch_limit < minMemReservationBytes
required to use BufferedPlanRootSink, we set spool_query_results to
false and fall back to BlockingPlanRootSink.
* If scratch_limit > minMemReservationBytes but still fairly low, we
lower the max_result_spooling_mem (default is 100MB) and
max_spilled_result_spooling_mem (default is 1GB) to fit scratch_limit.
* If scratch_limit > max_spilled_result_spooling_mem, do nothing.
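A minimal standalone sketch of the decision logic above, written in
C++ purely for illustration (the actual change is in the planner; the
struct and function names below are assumptions):

#include <algorithm>
#include <cstdint>
#include <iostream>

// Defaults mirror the values mentioned above; names here are hypothetical.
struct ResultSpoolingConfig {
  bool spool_query_results = true;
  int64_t max_result_spooling_mem = 100LL << 20;        // 100MB default
  int64_t max_spilled_result_spooling_mem = 1LL << 30;  // 1GB default
};

ResultSpoolingConfig AdjustForScratchLimit(bool has_scratch_dirs,
                                           int64_t scratch_limit,
                                           int64_t min_mem_reservation_bytes) {
  ResultSpoolingConfig cfg;
  if (!has_scratch_dirs || scratch_limit < min_mem_reservation_bytes) {
    // No usable scratch: disable spooling, fall back to BlockingPlanRootSink.
    cfg.spool_query_results = false;
  } else if (scratch_limit < cfg.max_spilled_result_spooling_mem) {
    // Fairly low scratch_limit: shrink the spooling memory bounds to fit it.
    cfg.max_spilled_result_spooling_mem = scratch_limit;
    cfg.max_result_spooling_mem =
        std::min(cfg.max_result_spooling_mem, scratch_limit);
  }
  // Otherwise scratch_limit exceeds max_spilled_result_spooling_mem: do nothing.
  return cfg;
}

int main() {
  ResultSpoolingConfig cfg = AdjustForScratchLimit(true, 64LL << 20, 16LL << 20);
  std::cout << "spool=" << cfg.spool_query_results
            << " max_spilled=" << cfg.max_spilled_result_spooling_mem << std::endl;
}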
Testing:
- Add TestScratchLimit::test_result_spooling_and_varying_scratch_limit
- Verify that spool_query_results query option is disabled in
TestScratchDir::test_no_dirs
- Pass exhaustive tests.
Change-Id: I541f46e6911694e14c0fc25be1a6982fd929d3a9
Reviewed-on: http://gerrit.cloudera.org:8080/17166
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Aman Sinha <amsinha@cloudera.com>
Result spooling has been relatively stable since it was introduced, and
it has several benefits described in IMPALA-8656. This patch enables
the result spooling (SPOOL_QUERY_RESULTS) query option by default.
Furthermore, some tests need to be adjusted to account for result
spooling being on by default. The following are the adjustment
categories and the tests that fall under each category.
Change in assertions:
PlannerTest#testAcidTableScans
PlannerTest#testBloomFilterAssignment
PlannerTest#testConstantFolding
PlannerTest#testFkPkJoinDetection
PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled
PlannerTest#testKuduSelectivity
PlannerTest#testMaxRowSize
PlannerTest#testMinMaxRuntimeFilters
PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled
PlannerTest#testMtDopValidation
PlannerTest#testParquetFiltering
PlannerTest#testParquetFilteringDisabled
PlannerTest#testPartitionPruning
PlannerTest#testPreaggBytesLimit
PlannerTest#testResourceRequirements
PlannerTest#testRuntimeFilterQueryOptions
PlannerTest#testSortExprMaterialization
PlannerTest#testSpillableBufferSizing
PlannerTest#testTableSample
PlannerTest#testTpch
PlannerTest#testKuduTpch
PlannerTest#testTpchNested
PlannerTest#testUnion
TpcdsPlannerTest
custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates
custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs
metadata/test_explain.py::TestExplain::test_explain_level2
metadata/test_explain.py::TestExplain::test_explain_level3
metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation
Increase BUFFER_POOL_LIMIT:
query_test/test_queries.py::TestQueries::test_analytic_fns
query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation
query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output
query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive
query_test/test_udfs.py::TestUdfExecution::test_mem_limits
Increase MEM_LIMIT:
query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling
Increase MAX_ROW_SIZE:
custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config
query_test/test_insert.py::TestInsertQueries::test_insert_large_string
query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit
query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter
query_test/test_scanners.py::TestWideRow::test_wide_row
Disable result spooling to maintain assertion:
custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query
custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs
query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only)
query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage
query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings
query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit
shell/test_shell_client.py::TestShellClient::test_fetch_size
Testing:
- Pass exhaustive tests.
Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4
Reviewed-on: http://gerrit.cloudera.org:8080/16755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The E2E test case emulates the situation where there are two types of
scratch directories and data spills only to the local one as long as
the local directory has sufficient space. The test case works fine for
the debug build; however, in the release build the system runs faster
and more data is spilled from memory, which exceeds the configured
local scratch space limit. To fix this, the size limit of the local
scratch space is changed from 100MB to 2GB, so that all of the
spilled data lands in the local instead of the remote directory.
Tests:
Reran test_scratch_dirs_mix_local_and_remote_dir_spill_local_only in
the release build.
Change-Id: If2dc32196b2554aee9fc94a4ccbbf5803dbcce1d
Reviewed-on: http://gerrit.cloudera.org:8080/17102
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Major Features:
1) Local files as buffers for spilling to S3.
2) Asynchronous upload of remote files.
3) Synchronous deletion of remote files after the query ends.
4) Local buffer file management.
5) Compatibility between spilling to local and to remote storage.
6) All errors from HDFS/S3 terminate the query.
Changes on TmpFile:
* TmpFile is separated into two implementations, TmpFileLocal
and TmpFileRemote.
TmpFileLocal is used for spilling to the local file system.
TmpFileRemote is a new type for spilling to remote storage. It
contains two DiskFiles, one for the local buffer and the other for
the remote file.
* The DiskFile is an object that contains the information of a
physical file and is passed to the DiskIoMgr to execute the IO
operations on that specific file. The DiskFile also tracks the status
of the file, one of DiskFileStatus::INWRITING/PERSISTED/DELETED.
When a DiskFile is initialized, it is in the INWRITING status. Once
the file is persisted to the file system, it moves to the PERSISTED
status. If the file is deleted, for example when the local buffer is
evicted, the DiskFile status of the buffer file becomes DELETED.
After that, if the file is fetched back from the remote, the DiskFile
status of the buffer file becomes INWRITING again, and then PERSISTED
once the fetch finishes successfully (see the sketch below).
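The status transitions described above, as a minimal standalone C++
sketch (the enum values follow the text above; the transition helpers
are hypothetical and not the actual DiskFile API):

#include <iostream>

enum class DiskFileStatus { INWRITING, PERSISTED, DELETED };

struct DiskFileSketch {
  DiskFileStatus status = DiskFileStatus::INWRITING;  // initial status

  void OnPersisted()  { status = DiskFileStatus::PERSISTED; }  // written to the file system
  void OnEvicted()    { status = DiskFileStatus::DELETED; }    // local buffer evicted
  void OnFetchStart() { status = DiskFileStatus::INWRITING; }  // re-fetching from remote
  void OnFetchDone()  { status = DiskFileStatus::PERSISTED; }  // fetch finished successfully
};

int main() {
  DiskFileSketch buffer_file;
  buffer_file.OnPersisted();   // INWRITING -> PERSISTED
  buffer_file.OnEvicted();     // PERSISTED -> DELETED (buffer evicted to free space)
  buffer_file.OnFetchStart();  // DELETED -> INWRITING (page needed again)
  buffer_file.OnFetchDone();   // INWRITING -> PERSISTED
  std::cout << "final status is PERSISTED: "
            << (buffer_file.status == DiskFileStatus::PERSISTED) << std::endl;
}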
Implementation Details:
1) A new enum type is added to specify the disk type of files,
indicating where the file physically resides.
The types include DiskFileType::LOCAL/LOCAL_BUFFER/DFS/S3.
DiskFileType::LOCAL indicates the file is in the local file system.
DiskFileType::LOCAL_BUFFER indicates the file is in the local file
system, and it is the buffer of a remote scratch file.
DiskFileType::DFS/S3 indicates the file is in HDFS/S3.
The local buffer allows the buffer pool to pin (read) pages from it,
but for remote files the buffer pool mostly pins (reads) the page
from the remote file system.
2) Two disk queues have been added to do the file operation jobs.
Queue name: RemoteS3DiskFileOper/RemoteDfsDiskFileOper
File operations on the remote disk like upload and fetch should
be done in these queues. The purpose of the queues is to isolate
the file operations from normal read/write IO operations. This
increases the efficiency of the file operations, which have
relatively long execution times, by not interleaving them with other
IO, and it also gives more accurate control over the number of
threads working on file operation jobs.
RemoteOperRange is the new request type that carries the file
operation jobs. Previously, we had the request types READ and WRITE;
FILE_FETCH and FILE_UPLOAD are now added.
3) The tmp files are physically deleted when the tmp file group is
destructed. For remote files, the entire remote directory is
deleted.
4) Local buffer file management controls the total size of the
local buffer files and evicts files when needed.
A local buffer file can be evicted once the temporary file has
uploaded a copy to the remote disk, or when the query ends.
There are two modes that decide the order in which files are chosen
for eviction: the default is LIFO, the other is FIFO. The mode is
controlled by the startup option remote_tmp_files_avail_pool_lifo
(a sketch of the eviction pool follows below).
Also, a TmpFileSpaceReserveThreadLoop thread in TmpFileMgr reserves
buffer file space asynchronously to avoid deadlocks.
Startup option allow_spill_to_hdfs is added. By default an HDFS path
is not allowed, but for test cases the option can be set to true to
allow the use of an HDFS path as scratch space for testing only.
Startup option wait_for_spill_buffer_timeout_s is added to control
the maximum duration to wait for a buffer in the TmpFileBufferPool.
The default value is 60, which stands for 60 seconds.
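A minimal standalone C++ sketch of a LIFO/FIFO eviction pool as
described above (illustrative only; the class and method names are not
the actual TmpFileMgr types):

#include <deque>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical pool of local buffer files eligible for eviction, i.e. files
// whose contents have already been uploaded to remote storage.
class EvictionPoolSketch {
 public:
  explicit EvictionPoolSketch(bool lifo) : lifo_(lifo) {}

  // A buffer file becomes evictable once its upload to remote storage is done.
  void AddEvictable(const std::string& file) { pool_.push_back(file); }

  // Picks the next victim: last-added first for LIFO, oldest first for FIFO.
  std::optional<std::string> PickVictim() {
    if (pool_.empty()) return std::nullopt;
    std::string victim;
    if (lifo_) {
      victim = pool_.back();
      pool_.pop_back();
    } else {
      victim = pool_.front();
      pool_.pop_front();
    }
    return victim;
  }

 private:
  bool lifo_;
  std::deque<std::string> pool_;
};

int main() {
  EvictionPoolSketch pool(/*lifo=*/true);  // mirrors remote_tmp_files_avail_pool_lifo
  pool.AddEvictable("buffer_0");
  pool.AddEvictable("buffer_1");
  std::cout << "evict: " << *pool.PickVictim() << std::endl;  // buffer_1 under LIFO
}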
5) Spilling to local has higher priority than spilling to remote.
If no local scratch space is available, temporary data will be
spilled to remote.
The first available local directory is used for the local buffer
for spilling to remote if any remote directory is configured.
If a remote directory is configured without any available local
scratch space, an error is returned during initialization.
The purpose of this design is to simplify the implementation in
milestone 1 with fewer changes to the configuration.
Example (setting remote scratch space):
Assume the following directories are available for scratch space:
* Local dir: /tmp/local_buffer, /tmp/local, /tmp/local_sec
* Remote dir: s3a://tmp/remote
The scratch space path is configured in the startup options, and could
have three types of configurations:
1. Pure local scratch space
--scratch_dirs="/tmp/local"
2. Pure remote scratch space
--scratch_dirs="s3a://tmp/remote,/tmp/local_buffer:16GB"
3. Mixed local and remote scratch space
--scratch_dirs="s3a://tmp/romote:200GB,/tmp/local_buffer:1GB,
/tmp/local:2GB, /tmp/local_sec:16GB"
* Type 1: a pure local scratch space with unlimited size.
* Type 2: a pure remote scratch space with a 16GB local buffer.
* Type 3: a mixed local and remote scratch space; the local buffer
for the remote directory is 1GB, the local scratch spaces are 2GB and
16GB, and the remote scratch space bytes limit is 200GB. Remote
scratch space is used only when all of the local spaces are at
capacity.
* Note: The first local directory is used as the local buffer path if
a remote scratch space is registered.
Limitations:
* Only one remote scratch dir is supported.
* The first local scratch dir is used for the buffer of remote scratch
space if remote scratch dir exists.
Testcases:
* Ran pre-review-test
* Unit Tests added to
tmp-file-mgr-test/disk-io-mgr-test/buffer-pool-test.
* E2E Tests added to custom_cluster/test_scratch_disk.py.
* Ran Unit Tests:
$IMPALA_HOME/be/build/debug/runtime/buffered-tuple-stream-test
$IMPALA_HOME/be/build/debug/runtime/tmp-file-mgr-test
$IMPALA_HOME/be/build/debug/runtime/bufferpool/buffer-pool-test
$IMPALA_HOME/be/build/debug/runtime/io/disk-io-mgr-test
* Ran E2E Tests:
custom_cluster/test_scratch_disk.py
Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Reviewed-on: http://gerrit.cloudera.org:8080/16318
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The '--scratch_dirs' configuration option now supports specifying the
priority of a scratch directory. The lower the numeric value, the
higher the priority. If the priority is not specified, the default
priority with value numeric_limits<int>::max() is used.
Valid formats for specifying the priority are:
- <dir-path>:<limit>:<priority>
- <dir-path>::<priority>
The following formats use the default priority:
- <dir-path>
- <dir-path>:<limit>
- <dir-path>:<limit>:
The new logic in TmpFileGroup::AllocateSpace() tries to find a target
file using a prioritized round-robin scheme. Files are ordered in
decreasing order of priority. The priority of a file is the same as
the priority of the related directory. A target file is selected by
always searching the ordered list starting from the file with the
highest priority. If multiple files have the same priority, the
target file is selected among them in a round-robin manner.
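A minimal standalone C++ sketch of such a prioritized round-robin
selection (the data layout and helper below are illustrative
assumptions, not the actual TmpFileGroup code):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Scratch files grouped by priority; a lower value means a higher priority.
// Within one priority level, files are picked round-robin.
struct PriorityGroup {
  std::vector<std::string> files;
  size_t next_idx = 0;  // round-robin cursor within this priority level
};

class TargetFilePicker {
 public:
  void AddFile(int priority, const std::string& file) {
    groups_[priority].files.push_back(file);  // std::map keeps priorities sorted
  }

  // Walks priority levels starting from the highest priority (lowest value)
  // and picks the next file round-robin in the first non-empty level. The
  // real code would also skip files that cannot accept the allocation.
  std::string PickTargetFile() {
    for (auto& entry : groups_) {
      PriorityGroup& group = entry.second;
      if (group.files.empty()) continue;
      std::string chosen = group.files[group.next_idx % group.files.size()];
      ++group.next_idx;
      return chosen;
    }
    return "";
  }

 private:
  std::map<int, PriorityGroup> groups_;
};

int main() {
  TargetFilePicker picker;
  picker.AddFile(0, "/fast_ssd_1/scratch/file0");
  picker.AddFile(0, "/fast_ssd_2/scratch/file1");
  picker.AddFile(10, "/slow_hdd/scratch/file2");
  // Alternates between the two priority-0 files; the priority-10 file is only
  // reached when the higher-priority group has no usable files.
  for (int i = 0; i < 3; ++i) std::cout << picker.PickTargetFile() << std::endl;
}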
Testing:
- Added unit and e2e tests for priority based spilling logic.
Change-Id: I381c3a358e1382e6696325fec74667f1fa18dd17
Reviewed-on: http://gerrit.cloudera.org:8080/16091
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enabled via --disk_spill_compression_codec, which uses
the same syntax as the compression_codec query option.
Recommended codecs are LZ4 and ZSTD. ZSTD supports
specifying a compression level.
The compression is done in TmpFileMgr using a temporary
buffer. Allocation of disk space is reworked slightly
so that the allocation can happen after compression.
The default power-of-two disk block sizes would lead
to a lot of internal fragmentation, so a new strategy
for free space management, similar to that used in
the data cache, can be used with
--disk_spill_punch_holes=true. TmpFileMgr will allocate
a range of the actual compressed size and punch holes
in the file for each range that is no longer needed.
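For reference, a minimal standalone sketch of the hole-punching call
on Linux (this only illustrates the fallocate() usage, not TmpFileMgr's
actual free space management; assumes g++ on Linux/glibc, where
_GNU_SOURCE is predefined):

#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Returns the blocks backing [offset, offset + len) to the filesystem while
// keeping the file size unchanged. Requires a filesystem with hole-punching
// support, e.g. ext4 or xfs.
static int PunchHole(int fd, off_t offset, off_t len) {
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) != 0) {
    std::fprintf(stderr, "fallocate failed: %s\n", std::strerror(errno));
    return -1;
  }
  return 0;
}

int main() {
  int fd = open("/tmp/scratch_hole_example.bin", O_RDWR | O_CREAT, 0600);
  if (fd < 0) return 1;
  if (ftruncate(fd, 1 << 20) != 0) return 1;  // pretend the scratch file is 1MB
  // A compressed spill range was written at offset 0 and has been read back;
  // punch it out so the filesystem can reuse the space.
  PunchHole(fd, 0, 12345);
  close(fd);
  return 0;
}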
UncompressedWriteIoBytes is added to the buffer pool
profiles, so that you can see what degree of compression
is achieved. Typically I saw ratios of 2-3x for LZ4 and
ZSTD (with LZ4 toward the lower end and ZSTD toward
the higher end).
Limitations:
The management of the compression buffer memory could
be improved. Ideally it would be integrated with the
buffer pool and use the buffer pool allocator instead
of being done "on the side". We would probably want to
do this before making this the default, for resource
management and performance reasons (doing a malloc()
directly does not use the caching supported by the
buffer pool).
Testing:
* Run buffer pool spilling tests with different combinations of
the new options.
* Extend existing TmpFileMgr tests for file space allocation to
run with hole punching enabled.
* Switch a couple of spilling tests to use the new option.
* Add a metrics test to check for scratch leaks.
* Enable the new options by default for end-to-end dockerized
tests to get additional coverage.
* Add a unit test where allocating compression memory fails,
both on the read and write path.
* Ran a single-node stress test on TPC-DS SF 1 and TPC-H SF 10
The peak compression buffer usage was ~40MB.
Perf:
I ran this spilling query using an SSD as the scratch disk:
set mem_limit=200m;
select count(distinct l_partkey) from
tpch30_parquet.lineitem;
The time taken for the second run of each query was:
No compression: 19.59s
LZ4: 18.56s
ZSTD: 20.59s
Change-Id: I9c08ff9504097f0fee8c32316c5c150136abe659
Reviewed-on: http://gerrit.cloudera.org:8080/15454
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
This includes some optimisations and a bulk move of tests
to exhaustive.
Move a bunch of custom cluster tests to exhaustive. I selected
these partially based on runtime (i.e. I looked most carefully
at the tests that ran for over a minute) and the likelihood
of them catching a precommit bug. Regression tests for specific
edge cases and tests for parts of the code that are very stable
were prime candidates.
Remove an unnecessary cluster restart in test_breakpad.
Merge test_scheduler_error into test_failpoints to avoid an unnecessary
cluster restart.
Speed up cluster starts by ensuring that the default statestore args are
applied even when _start_impala_cluster() is called directly. This
shaves a couple of seconds off each restart. We made the default args
use a faster update frequency - see IMPALA-7185 - but they did not
take effect in all tests.
Change-Id: Ib2e3e7ebc9695baec4d69183387259958df10f62
Reviewed-on: http://gerrit.cloudera.org:8080/13967
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Delete these wrapper scripts and replace with a generic
start-daemon.sh script that sets environment variables
without the other logic.
Move the logic for setting JAVA_TOOL_OPTIONS into
start-impala-cluster.py.
Remove some options like -jvm_suspend, -gdb, -perf that
may not be used. These can be reintroduced if needed.
Port across the kerberized minicluster logic (which has
probably bitrotted) in case it needs to be revived.
Remove --verbose option that didn't appear to be useful
(it claims to print daemon output to the console,
but output is still redirected regardless).
Removed a level of quoting in custom cluster test argument
handling - this was made unnecessary by properly escaping
arguments with pipes.escape() in run_daemon().
Testing:
* Ran exhaustive tests.
* Ran on CentOS 6 to confirm we didn't reintroduce Popen issue
worked around by kwho.
Change-Id: Ib67444fd4def8da119db5d3a0832ef1de15b068b
Reviewed-on: http://gerrit.cloudera.org:8080/12271
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The previous default was often confusing to users of Impala. It is
simpler to do exactly what is asked instead of trying to fix bad
configurations automatically.
Testing:
Ran core tests.
Change-Id: I23394c9949ae4cd0a21d7bb25551371b3198e76c
Reviewed-on: http://gerrit.cloudera.org:8080/10736
Reviewed-by: anujphadke <aphadke@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the following squashed patches that were reverted.
I will fix the known issues with some follow-on patches.
======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation
In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.
* Remove the free buffer cache (which will be replaced by buffer pool's
own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
interfaces - previously callers of ReturnBuffer() were fudging
the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
'is_cancelled_'.
* Clarify the logic around calling Close() for the last
BufferDescriptor.
-> There appeared to be an implicit assumption that buffers would be
freed in the order they were returned from the scan range, so that
the "eos" buffer was returned last. Instead just count the number
of outstanding buffers to detect the last one.
-> Touching the is_cancelled_ field without holding a lock was hard to
reason about - violated locking rules and it was unclear that it
was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
inline at the callsites.
This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.
Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight
======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront
This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.
One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.
There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.
One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.
I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.
Testing:
* Ran core and exhaustive tests.
======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
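As a rough illustration of that divvying-up step, a standalone sketch
using a purely proportional split is shown below. This is an assumed
simplification, not the actual algorithm, which also applies minimum
and maximum buffer sizes per column:

#include <cstdint>
#include <iostream>
#include <vector>

// Splits a scan node's reservation across Parquet columns in proportion to
// their on-disk sizes. Clamping to min/max buffer sizes is omitted here.
std::vector<int64_t> DivideReservationAcrossColumns(
    int64_t total_reservation, const std::vector<int64_t>& col_bytes_on_disk) {
  int64_t total_bytes = 0;
  for (int64_t b : col_bytes_on_disk) total_bytes += b;
  std::vector<int64_t> per_col;
  for (int64_t b : col_bytes_on_disk) {
    per_col.push_back(total_bytes == 0 ? 0 : total_reservation * b / total_bytes);
  }
  return per_col;
}

int main() {
  // Three columns of 80MB, 15MB and 5MB on disk sharing a 24MB reservation.
  std::vector<int64_t> shares = DivideReservationAcrossColumns(
      24LL << 20, {80LL << 20, 15LL << 20, 5LL << 20});
  for (int64_t s : shares) std::cout << s << " bytes" << std::endl;
}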
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"
This reverts commit 25bcb258df.
Revert "IMPALA-6588: don't add empty list of ranges in text scan"
This reverts commit d57fbec6f6.
Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"
This reverts commit 24b4ed0b29.
Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"
This reverts commit 5699b59d0c.
Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"
This reverts commit 65680dc421.
Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The buffer pool changes reduced the memory requirement for the sort,
which seems to have been enough to get the query to execute without
spilling on S3. Reduce the limit in the test to force it to spill.
Testing:
Ran in a loop locally for an hour. Ran custom cluster tests on S3.
Change-Id: If65fee3e6a4b759d0d18e30a1c30bd48db0f2a54
Reviewed-on: http://gerrit.cloudera.org:8080/7615
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.
Port ExecNodes to use BufferPool:
* Each ExecNode has to claim its reservation during Open()
* Port Sorter to use BufferPool.
* Switch from BufferedTupleStream to BufferedTupleStreamV2
* Port HashTable to use BufferPool via a Suballocator.
This also makes PAGG memory consumption more efficient (avoiding
wasted buffers) and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)
TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.
Testing:
* Updated tests to reflect new memory requirements
Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
test_scratch_disk fails sporadically when trying to assert the presence
of log messages. This is probably caused by log caching, since after
such failures the log files do contain the lines in question.
I manually tested this by running the tests repeatedly for 2 days (10k
runs).
To make future diagnosis of similar problems easier, this change also
adds more output to assert_impalad_log_contains().
Change-Id: I9f21284338ee7b4374aca249b6556282b0148389
Reviewed-on: http://gerrit.cloudera.org:8080/5669
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into
TmpFileMgr, in anticipation of it being shared with BufferPool.
TmpFileMgr now handles:
* Scratch space allocation and recycling
* Read and write I/O
The interface is also greatly changed so that it is built around Write()
and Read() calls, abstracting away the details of temporary file
allocation from clients. This means the TmpFileMgr::File class can
be hidden from clients.
Write error recovery:
Also implement write error recovery in TmpFileMgr.
If an error occurs while writing to scratch and we have multiple
scratch directories, we will try one of the other directories
before cancelling the query. File-level blacklisting is used to
prevent excessive repeated attempts to resize a scratch file during
a single query. Device-level blacklisting is not implemented because
it is problematic to permanently take a scratch directory out of use.
To reduce the number of error paths, all I/O errors are now handled
asynchronously. Previously errors creating or extending the file were
returned synchronously from WriteUnpinnedBlock(). This required
modifying DiskIoMgr to create the file if not present when opened.
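A minimal standalone C++ sketch of the retry-with-blacklisting idea
(hypothetical names and a synchronous shape for brevity; the real
TmpFileMgr integrates this with DiskIoMgr's asynchronous write path):

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical scratch file handle: a file whose write fails is blacklisted
// for the rest of the query.
struct ScratchFileSketch {
  std::string path;
  bool blacklisted = false;
};

// Tries scratch files in turn, skipping blacklisted ones. Only if no scratch
// file can take the write does the operation (and hence the query) fail.
bool WriteWithRecovery(std::vector<ScratchFileSketch>& files,
                       const std::function<bool(const std::string&)>& try_write) {
  for (ScratchFileSketch& file : files) {
    if (file.blacklisted) continue;
    if (try_write(file.path)) return true;
    std::cerr << "write to " << file.path << " failed, blacklisting" << std::endl;
    file.blacklisted = true;  // avoid repeated attempts on this file
  }
  return false;  // no scratch file worked: cancel the query
}

int main() {
  std::vector<ScratchFileSketch> files = {{"/ramdisk/scratch/f0"},
                                          {"/tmp/scratch/f1"}};
  // Simulate the first directory being out of space: only /tmp writes succeed.
  bool ok = WriteWithRecovery(files, [](const std::string& path) {
    return path.rfind("/tmp/", 0) == 0;
  });
  std::cout << (ok ? "spilled successfully" : "query cancelled") << std::endl;
}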
Also set the default max_errors value in the thrift definition file,
so that it is in effect for backend tests.
Future Work:
* Support for recycling variable-length scratch file ranges. I omitted
this to avoid making the patch even larger.
Testing:
Updated BufferedBlockMgr unit test to reflect changes in behaviour:
* Scratch space is no longer permanently associated with a block, and
is remapped every time a new block is written to disk.
* Files are now blacklisted - updated existing tests and enabled the
previously disabled blacklisting test.
Added some basic testing of recycling of scratch file ranges in
the TmpFileMgr unit test.
I also manually tested the code in two ways. First by removing permissions
for /tmp/impala-scratch and ensuring that a spilling query fails cleanly.
Second, by creating a tiny ramdisk (16M) and running with two scratch
directories: one on /tmp and one on the tiny ramdisk. When spilling, an
out of space error is encountered for the tiny ramdisk and impala spills
the remaining data (72M) to /tmp.
Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17
Reviewed-on: http://gerrit.cloudera.org:8080/5141
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
Make the test deterministic by using max_block_mgr_memory instead of
mem_limit, so that the non-deterministic scanner memory usage does not
influence the spilling behaviour of the queries.
Testing:
Ran the test locally to confirm that it succeeded. Also manually
computed the memory requirement. The data size to be sorted is ~220MB,
so with a 64MB block manager limit per node, at least one node must
spill.
Change-Id: I9525a029ac020bb5b8bea210a741c9f9c5ec3c75
Reviewed-on: http://gerrit.cloudera.org:8080/3318
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Silvius Rus <srus@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
All versions of pytest contain various bugs regarding test marking
(including skips) when tests are both:
1. class-level marked
2. inherited
More info is available in IMPALA-3614 and IMPALA-2943, but the gist is
that it's possible for some tests to be skipped when they shouldn't be.
This is happening pretty badly with the custom cluster tests, because
CustomClusterTestSuite has a class level skipif mark.
The easiest workaround for now is to remove the pytest skipif mark in
CustomClusterTestSuite and skip using explicit pytest.skip() in the
setup_class() method. Some CustomClusterTestSuite children implemented
their own setup_* methods, and I made some adjustments to them both to
clean them up and implement proper parent method calling via super().
Testing:
I ran the following combinations of all the custom cluster tests:
DEBUG / HDFS / core
RELEASE / HDFS / exhaustive
DEBUG / LOCAL / core
DEBUG / S3 / core
Before, we'd get situations in which most of the tests were skipped.
Consider the RELEASE/HDFS/exhaustive situation:
custom_cluster/test_admission_controller.py .....
custom_cluster/test_alloc_fail.py ss
custom_cluster/test_breakpad.py sssss
custom_cluster/test_delegation.py sss
custom_cluster/test_exchange_delays.py ss
custom_cluster/test_hdfs_fd_caching.py s
custom_cluster/test_hive_parquet_timestamp_conversion.py ss
custom_cluster/test_insert_behaviour.py ss
custom_cluster/test_legacy_joins_aggs.py s
custom_cluster/test_parquet_max_page_header.py s
custom_cluster/test_permanent_udfs.py sss
custom_cluster/test_query_expiration.py sss
custom_cluster/test_redaction.py ssss
custom_cluster/test_s3a_access.py s
custom_cluster/test_scratch_disk.py ssss
custom_cluster/test_session_expiration.py s
custom_cluster/test_spilling.py ssss
authorization/test_authorization.py ss
authorization/test_grant_revoke.py s
Now, more tests run appropriately:
custom_cluster/test_admission_controller.py .....
custom_cluster/test_alloc_fail.py ss
custom_cluster/test_breakpad.py sssss
custom_cluster/test_delegation.py ...
custom_cluster/test_exchange_delays.py ss
custom_cluster/test_hdfs_fd_caching.py .
custom_cluster/test_hive_parquet_timestamp_conversion.py ..
custom_cluster/test_insert_behaviour.py ..
custom_cluster/test_kudu_not_available.py .
custom_cluster/test_legacy_joins_aggs.py .
custom_cluster/test_parquet_max_page_header.py .
custom_cluster/test_permanent_udfs.py ...
custom_cluster/test_query_expiration.py ...
custom_cluster/test_redaction.py ....
custom_cluster/test_s3a_access.py s
custom_cluster/test_scratch_disk.py ....
custom_cluster/test_session_expiration.py .
custom_cluster/test_spilling.py ....
authorization/test_authorization.py ..
authorization/test_grant_revoke.py .
Change-Id: Ie301b69718f8690322cc3b4130fb1c715344779c
Reviewed-on: http://gerrit.cloudera.org:8080/3265
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Michael Brown <mikeb@cloudera.com>
The problem is that xdist is spinning up multiple processes and each
process tries to create the same temp dirs. This commit fixes the
issue by creating a randomly named dir to avoid the conflict.
Change-Id: Ic0764843ace00aef8c9b01139906e01ab5213047
Reviewed-on: http://gerrit.cloudera.org:8080/2817
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Previously Impala could erroneously decide to use non-writable scratch
directories, e.g. if /tmp/impala-scratch already exists and is not
writable by the current user.
With this change, if we cannot remove and recreate a fresh scratch directory,
it is not used. If we have no valid scratch directories, we log an
error and continue startup.
Add unit test for CreateDirectory to test behavior for success and
failure cases.
Add system tests to check logging and query execution in various
scenarios where we do not have scratch available.
Modify FilesystemUtil to use non-exception-throwing Boost functions to
avoid unhandled exceptions escaping into the rest of the Impala
codebase, which does not expect the use of exceptions.
Change-Id: Icaa8429051942424e1d811c54bde10102ac7f7b3
Reviewed-on: http://gerrit.cloudera.org:8080/565
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
- Added static order by tests to test_queries.py and QueryTest/sort.test
- test_order_by.py also contains tests with static queries that are run with
multiple memory limits.
- Added stress, scratch disk and failpoints tests
- Incorporated Srinath's change that copied all order by with limit tests into
the top-n.test file
Extra time required:
Serial:
scratch disk: 42 seconds
test queries sort : 77 seconds
test sort: 56 seconds
sort stress: 142 seconds
TOTAL: 5 min 17 seconds
Parallel(8 threads):
scratch disk: 40 seconds
test queries sort: 42 seconds
test sort: 49 seconds
sort stress: 93 seconds
TOTAL: 3 min 44 sec
Change-Id: Ic5716bcfabb5bb3053c6b9cebc9bfbbb9dc64a7c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2820
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3205