We support HDFS scratch space, but only as a test-only feature with
a fixed default HDFS path.
This patch extends the HDFS scratch space to support a user-specified
path. To support this, we add a new format for the HDFS scratch space
path which requires the path to include a port number, resolving the
ambiguity with the current format of the scratch space path.
For example, the previous format for a scratch space path, taking S3
as an example, is s3a://bucketpath:#bytes:#priority. In this case the
bucket path doesn't contain a port number.
In this patch, the new format of the HDFS scratch path is
hdfs://ipaddr:#port:#bytes:#priority. The port number is required,
so there must be at least one colon in the HDFS path; the bytes and
priority fields remain optional as before. For other scratch spaces,
the path format doesn't change.
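For illustration, here is a minimal standalone C++ sketch of parsing
the new format. The helper name and the simplified splitting logic are
assumptions for illustration only, not the actual TmpFileMgr parser,
and size suffixes such as GB are not handled:

#include <iostream>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

struct ScratchDirSpec {
  std::string path;      // e.g. hdfs://10.0.0.1:8020
  long long bytes = -1;  // -1 means no byte limit given
  int priority = -1;     // -1 means no priority given
};

// Splits the text after "hdfs://" on ':'. The first field is the host and
// the second must be the port, so a valid HDFS scratch path always contains
// at least one colon; bytes and priority are optional trailing fields.
std::optional<ScratchDirSpec> ParseHdfsScratchDir(const std::string& spec) {
  const std::string scheme = "hdfs://";
  if (spec.rfind(scheme, 0) != 0) return std::nullopt;
  std::vector<std::string> parts;
  std::stringstream ss(spec.substr(scheme.size()));
  for (std::string tok; std::getline(ss, tok, ':');) parts.push_back(tok);
  if (parts.size() < 2 || parts[1].empty()) return std::nullopt;  // port required
  ScratchDirSpec out;
  out.path = scheme + parts[0] + ":" + parts[1];
  if (parts.size() > 2 && !parts[2].empty()) out.bytes = std::stoll(parts[2]);
  if (parts.size() > 3 && !parts[3].empty()) out.priority = std::stoi(parts[3]);
  return out;
}

int main() {
  auto spec = ParseHdfsScratchDir("hdfs://10.0.0.1:8020:1073741824:1");
  if (spec) {
    std::cout << spec->path << " bytes=" << spec->bytes
              << " priority=" << spec->priority << std::endl;
  }
}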
Also, the option allow_spill_to_hdfs is removed because spilling to
HDFS is no longer a test-only function; the e2e tests involved are
updated accordingly.
Tests:
Added and passed TmpFileMgrTest::TestDirectoryLimitParsingRemotePath.
Ran the Core tests.
Change-Id: I0882ed1e80b02724dd5cb3cdb1fa7b6c2debcbf4
Reviewed-on: http://gerrit.cloudera.org:8080/17720
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This patch adds the ability to share the per-host stats for locally
admitted queries across all coordinators. This helps to get a more
consolidated view of the cluster for stats like slots_in_use and
mem_admitted when making local admission decisions.
Testing:
Added e2e py test
Change-Id: I2946832e0a89b077d0f3bec755e4672be2088243
Reviewed-on: http://gerrit.cloudera.org:8080/17683
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
IMPALA-9856 enables result spooling by default. Result spooling depends
on the ability to spill its entire BufferedTupleStream to disk once it
hits maximum memory reservation. However, if the query option
scratch_limit is set lower than max_spilled_result_spooling_mem, the
query might fail in the middle of execution due to insufficient scratch
space. This patch adds a planner change that considers the
scratch_limit and scratch_dirs query options when computing the
resources used by result spooling. The algorithm is as follows
(sketched in code below):
* If scratch_dirs is empty or scratch_limit < minMemReservationBytes
required to use BufferedPlanRootSink, we set spool_query_results to
false and fall back to BlockingPlanRootSink.
* If scratch_limit > minMemReservationBytes but still fairly low, we
lower the max_result_spooling_mem (default is 100MB) and
max_spilled_result_spooling_mem (default is 1GB) to fit scratch_limit.
* If scratch_limit > max_spilled_result_spooling_mem, do nothing.
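A minimal standalone sketch of the decision logic above, written in
C++ purely for illustration (the actual change is in the planner; the
struct and function names below are assumptions):

#include <algorithm>
#include <cstdint>
#include <iostream>

// Defaults mirror the values mentioned above; names here are hypothetical.
struct ResultSpoolingConfig {
  bool spool_query_results = true;
  int64_t max_result_spooling_mem = 100LL << 20;        // 100MB default
  int64_t max_spilled_result_spooling_mem = 1LL << 30;  // 1GB default
};

ResultSpoolingConfig AdjustForScratchLimit(bool has_scratch_dirs,
                                           int64_t scratch_limit,
                                           int64_t min_mem_reservation_bytes) {
  ResultSpoolingConfig cfg;
  if (!has_scratch_dirs || scratch_limit < min_mem_reservation_bytes) {
    // No usable scratch: disable spooling, fall back to BlockingPlanRootSink.
    cfg.spool_query_results = false;
  } else if (scratch_limit < cfg.max_spilled_result_spooling_mem) {
    // Fairly low scratch_limit: shrink the spooling memory bounds to fit it.
    cfg.max_spilled_result_spooling_mem = scratch_limit;
    cfg.max_result_spooling_mem =
        std::min(cfg.max_result_spooling_mem, scratch_limit);
  }
  // Otherwise scratch_limit exceeds max_spilled_result_spooling_mem: do nothing.
  return cfg;
}

int main() {
  ResultSpoolingConfig cfg = AdjustForScratchLimit(true, 64LL << 20, 16LL << 20);
  std::cout << "spool=" << cfg.spool_query_results
            << " max_spilled=" << cfg.max_spilled_result_spooling_mem << std::endl;
}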
Testing:
- Add TestScratchLimit::test_result_spooling_and_varying_scratch_limit
- Verify that spool_query_results query option is disabled in
TestScratchDir::test_no_dirs
- Pass exhaustive tests.
Change-Id: I541f46e6911694e14c0fc25be1a6982fd929d3a9
Reviewed-on: http://gerrit.cloudera.org:8080/17166
Reviewed-by: Aman Sinha <amsinha@cloudera.com>
Tested-by: Aman Sinha <amsinha@cloudera.com>
Result spooling has been relatively stable since it was introduced, and
it has several benefits described in IMPALA-8656. This patch enables
the result spooling (SPOOL_QUERY_RESULTS) query option by default.
Furthermore, some tests need to be adjusted to account for result
spooling being on by default. The following are the adjustment
categories and the tests that fall under each category.
Change in assertions:
PlannerTest#testAcidTableScans
PlannerTest#testBloomFilterAssignment
PlannerTest#testConstantFolding
PlannerTest#testFkPkJoinDetection
PlannerTest#testFkPkJoinDetectionWithHDFSNumRowsEstDisabled
PlannerTest#testKuduSelectivity
PlannerTest#testMaxRowSize
PlannerTest#testMinMaxRuntimeFilters
PlannerTest#testMinMaxRuntimeFiltersWithHDFSNumRowsEstDisabled
PlannerTest#testMtDopValidation
PlannerTest#testParquetFiltering
PlannerTest#testParquetFilteringDisabled
PlannerTest#testPartitionPruning
PlannerTest#testPreaggBytesLimit
PlannerTest#testResourceRequirements
PlannerTest#testRuntimeFilterQueryOptions
PlannerTest#testSortExprMaterialization
PlannerTest#testSpillableBufferSizing
PlannerTest#testTableSample
PlannerTest#testTpch
PlannerTest#testKuduTpch
PlannerTest#testTpchNested
PlannerTest#testUnion
TpcdsPlannerTest
custom_cluster/test_admission_controller.py::TestAdmissionController::test_dedicated_coordinator_planner_estimates
custom_cluster/test_admission_controller.py::TestAdmissionController::test_memory_rejection
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_mem_limit_configs
metadata/test_explain.py::TestExplain::test_explain_level2
metadata/test_explain.py::TestExplain::test_explain_level3
metadata/test_stats_extrapolation.py::TestStatsExtrapolation::test_stats_extrapolation
Increase BUFFER_POOL_LIMIT:
query_test/test_queries.py::TestQueries::test_analytic_fns
query_test/test_runtime_filters.py::TestRuntimeRowFilters::test_row_filter_reservation
query_test/test_sort.py::TestQueryFullSort::test_multiple_mem_limits_full_output
query_test/test_spilling.py::TestSpillingBroadcastJoins::test_spilling_broadcast_joins
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_aggs
query_test/test_spilling.py::TestSpillingDebugActionDimensions::test_spilling_regression_exhaustive
query_test/test_udfs.py::TestUdfExecution::test_mem_limits
Increase MEM_LIMIT:
query_test/test_mem_usage_scaling.py::TestExchangeMemUsage::test_exchange_mem_usage_scaling
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_hdfs_scanner_thread_mem_scaling
Increase MAX_ROW_SIZE:
custom_cluster/test_parquet_max_page_header.py::TestParquetMaxPageHeader::test_large_page_header_config
query_test/test_insert.py::TestInsertQueries::test_insert_large_string
query_test/test_query_mem_limit.py::TestQueryMemLimit::test_mem_limit
query_test/test_scanners.py::TestTextSplitDelimiters::test_text_split_across_buffers_delimiter
query_test/test_scanners.py::TestWideRow::test_wide_row
Disable result spooling to maintain assertion:
custom_cluster/test_admission_controller.py::TestAdmissionController::test_set_request_pool
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_host_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_timeout_reason_pool_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_queue_reasons_memory
custom_cluster/test_admission_controller.py::TestAdmissionController::test_pool_config_change_while_queued
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_fetched_rows
custom_cluster/test_query_retries.py::TestQueryRetries::test_retry_finished_query
custom_cluster/test_scratch_disk.py::TestScratchDir::test_no_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_existing_dirs
custom_cluster/test_scratch_disk.py::TestScratchDir::test_non_writable_dirs
query_test/test_insert.py::TestInsertQueries::test_insert_large_string (the last query only)
query_test/test_kudu.py::TestKuduMemLimits::test_low_mem_limit_low_selectivity_scan
query_test/test_mem_usage_scaling.py::TestScanMemLimit::test_kudu_scan_mem_usage
query_test/test_queries.py::TestQueriesParquetTables::test_very_large_strings
query_test/test_query_mem_limit.py::TestCodegenMemLimit::test_codegen_mem_limit
shell/test_shell_client.py::TestShellClient::test_fetch_size
Testing:
- Pass exhaustive tests.
Change-Id: I9e360c1428676d8f3fab5d95efee18aca085eba4
Reviewed-on: http://gerrit.cloudera.org:8080/16755
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The E2E test case emulates the situation where there are two types of
scratch directories and data spills only to the local one as long as
the local directory has sufficient space. The test case works fine for
the debug build; however, in the release build the system runs faster
and more data is spilled from memory, which exceeds the configured
local scratch space limit. To fix this, the size limit of the local
scratch space is changed from 100MB to 2GB, so that all of the
spilled data lands in the local instead of the remote directory.
Tests:
Reran test_scratch_dirs_mix_local_and_remote_dir_spill_local_only in
the release build.
Change-Id: If2dc32196b2554aee9fc94a4ccbbf5803dbcce1d
Reviewed-on: http://gerrit.cloudera.org:8080/17102
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Major Features:
1) Local files as buffers for spilling to S3.
2) Asynchronous upload of remote files.
3) Synchronous deletion of remote files after the query ends.
4) Local buffer file management.
5) Compatibility between spilling to local and to remote storage.
6) All errors from HDFS/S3 terminate the query.
Changes on TmpFile:
* TmpFile is separated into two implementations, TmpFileLocal
and TmpFileRemote.
TmpFileLocal is used for spilling to the local file system.
TmpFileRemote is a new type for spilling to remote storage. It
contains two DiskFiles, one for the local buffer and the other for
the remote file.
* The DiskFile is an object that contains the information of a
physical file and is passed to the DiskIoMgr to execute the IO
operations on that specific file. The DiskFile also tracks the status
of the file, one of DiskFileStatus::INWRITING/PERSISTED/DELETED.
When a DiskFile is initialized, it is in the INWRITING status. Once
the file is persisted to the file system, it moves to the PERSISTED
status. If the file is deleted, for example when the local buffer is
evicted, the DiskFile status of the buffer file becomes DELETED.
After that, if the file is fetched back from the remote, the DiskFile
status of the buffer file becomes INWRITING again, and then PERSISTED
once the fetch finishes successfully (see the sketch below).
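The status transitions described above, as a minimal standalone C++
sketch (the enum values follow the text above; the transition helpers
are hypothetical and not the actual DiskFile API):

#include <iostream>

enum class DiskFileStatus { INWRITING, PERSISTED, DELETED };

struct DiskFileSketch {
  DiskFileStatus status = DiskFileStatus::INWRITING;  // initial status

  void OnPersisted()  { status = DiskFileStatus::PERSISTED; }  // written to the file system
  void OnEvicted()    { status = DiskFileStatus::DELETED; }    // local buffer evicted
  void OnFetchStart() { status = DiskFileStatus::INWRITING; }  // re-fetching from remote
  void OnFetchDone()  { status = DiskFileStatus::PERSISTED; }  // fetch finished successfully
};

int main() {
  DiskFileSketch buffer_file;
  buffer_file.OnPersisted();   // INWRITING -> PERSISTED
  buffer_file.OnEvicted();     // PERSISTED -> DELETED (buffer evicted to free space)
  buffer_file.OnFetchStart();  // DELETED -> INWRITING (page needed again)
  buffer_file.OnFetchDone();   // INWRITING -> PERSISTED
  std::cout << "final status is PERSISTED: "
            << (buffer_file.status == DiskFileStatus::PERSISTED) << std::endl;
}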
Implementation Details:
1) A new enum type is added to specify the disk type of files,
indicating where the file physically resides.
The types include DiskFileType::LOCAL/LOCAL_BUFFER/DFS/S3.
DiskFileType::LOCAL indicates the file is in the local file system.
DiskFileType::LOCAL_BUFFER indicates the file is in the local file
system, and it is the buffer of a remote scratch file.
DiskFileType::DFS/S3 indicates the file is in HDFS/S3.
The local buffer allows the buffer pool to pin (read) pages from it,
but for remote files the buffer pool mostly pins (reads) the page
from the remote file system.
2) Two disk queues have been added to do the file operation jobs.
Queue name: RemoteS3DiskFileOper/RemoteDfsDiskFileOper
File operations on the remote disk like upload and fetch should
be done in these queues. The purpose of the queues is to isolate
the file operations from normal read/write IO operations. This
increases the efficiency of the file operations, which have
relatively long execution times, by not interleaving them with other
IO, and it also gives more accurate control over the number of
threads working on file operation jobs.
RemoteOperRange is the new request type that carries the file
operation jobs. Previously, we had the request types READ and WRITE;
FILE_FETCH and FILE_UPLOAD are now added.
3) The tmp files are physically deleted when the tmp file group is
destructed. For remote files, the entire remote directory is
deleted.
4) Local buffer file management controls the total size of the
local buffer files and evicts files when needed.
A local buffer file can be evicted once the temporary file has
uploaded a copy to the remote disk, or when the query ends.
There are two modes that decide the order in which files are chosen
for eviction: the default is LIFO, the other is FIFO. The mode is
controlled by the startup option remote_tmp_files_avail_pool_lifo
(a sketch of the eviction pool follows below).
Also, a TmpFileSpaceReserveThreadLoop thread in TmpFileMgr reserves
buffer file space asynchronously to avoid deadlocks.
Startup option allow_spill_to_hdfs is added. By default an HDFS path
is not allowed, but for test cases the option can be set to true to
allow the use of an HDFS path as scratch space for testing only.
Startup option wait_for_spill_buffer_timeout_s is added to control
the maximum duration to wait for a buffer in the TmpFileBufferPool.
The default value is 60, which stands for 60 seconds.
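A minimal standalone C++ sketch of a LIFO/FIFO eviction pool as
described above (illustrative only; the class and method names are not
the actual TmpFileMgr types):

#include <deque>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical pool of local buffer files eligible for eviction, i.e. files
// whose contents have already been uploaded to remote storage.
class EvictionPoolSketch {
 public:
  explicit EvictionPoolSketch(bool lifo) : lifo_(lifo) {}

  // A buffer file becomes evictable once its upload to remote storage is done.
  void AddEvictable(const std::string& file) { pool_.push_back(file); }

  // Picks the next victim: last-added first for LIFO, oldest first for FIFO.
  std::optional<std::string> PickVictim() {
    if (pool_.empty()) return std::nullopt;
    std::string victim;
    if (lifo_) {
      victim = pool_.back();
      pool_.pop_back();
    } else {
      victim = pool_.front();
      pool_.pop_front();
    }
    return victim;
  }

 private:
  bool lifo_;
  std::deque<std::string> pool_;
};

int main() {
  EvictionPoolSketch pool(/*lifo=*/true);  // mirrors remote_tmp_files_avail_pool_lifo
  pool.AddEvictable("buffer_0");
  pool.AddEvictable("buffer_1");
  std::cout << "evict: " << *pool.PickVictim() << std::endl;  // buffer_1 under LIFO
}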
5) Spilling to local has higher priority than spilling to remote.
If no local scratch space is available, temporary data will be
spilled to remote.
The first available local directory is used for the local buffer
for spilling to remote if any remote directory is configured.
If a remote directory is configured without any available local
scratch space, an error is returned during initialization.
The purpose of this design is to simplify the implementation in
milestone 1 with fewer changes to the configuration.
Example (setting remote scratch space):
Assume the following directories are available for scratch space:
* Local dir: /tmp/local_buffer, /tmp/local, /tmp/local_sec
* Remote dir: s3a://tmp/remote
The scratch space path is configured in the startup options, and could
have three types of configurations:
1. Pure local scratch space
--scratch_dirs="/tmp/local"
2. Pure remote scratch space
--scratch_dirs="s3a://tmp/remote,/tmp/local_buffer:16GB"
3. Mixed local and remote scratch space
--scratch_dirs="s3a://tmp/romote:200GB,/tmp/local_buffer:1GB,
/tmp/local:2GB, /tmp/local_sec:16GB"
* Type 1: a pure local scratch space with unlimited size.
* Type 2: a pure remote scratch space with a 16GB local buffer.
* Type 3: a mixed local and remote scratch space; the local buffer
for the remote directory is 1GB, the local scratch spaces are 2GB and
16GB, and the remote scratch space bytes limit is 200GB. Remote
scratch space is used only when all of the local spaces are at
capacity.
* Note: The first local directory is used as the local buffer path if
a remote scratch space is registered.
Limitations:
* Only one remote scratch dir is supported.
* The first local scratch dir is used for the buffer of remote scratch
space if remote scratch dir exists.
Testcases:
* Ran pre-review-test
* Unit Tests added to
tmp-file-mgr-test/disk-io-mgr-test/buffer-pool-test.
* E2E Tests added to custom_cluster/test_scratch_disk.py.
* Ran Unit Tests:
$IMPALA_HOME/be/build/debug/runtime/buffered-tuple-stream-test
$IMPALA_HOME/be/build/debug/runtime/tmp-file-mgr-test
$IMPALA_HOME/be/build/debug/runtime/bufferpool/buffer-pool-test
$IMPALA_HOME/be/build/debug/runtime/io/disk-io-mgr-test
* Ran E2E Tests:
custom_cluster/test_scratch_disk.py
Change-Id: I419b1d5dbbfe35334d9f964c4b65e553579fdc89
Reviewed-on: http://gerrit.cloudera.org:8080/16318
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The '--scratch_dirs' configuration option now supports specifying the
priority of a scratch directory. The lower the numeric value, the
higher the priority. If the priority is not specified, the default
priority with value numeric_limits<int>::max() is used.
Valid formats for specifying the priority are:
- <dir-path>:<limit>:<priority>
- <dir-path>::<priority>
The following formats use the default priority:
- <dir-path>
- <dir-path>:<limit>
- <dir-path>:<limit>:
The new logic in TmpFileGroup::AllocateSpace() tries to find a target
file using a prioritized round-robin scheme. Files are ordered in
decreasing order of priority. The priority of a file is the same as
the priority of the related directory. A target file is selected by
always searching the ordered list starting from the file with the
highest priority. If multiple files have the same priority, the
target file is selected among them in a round-robin manner.
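A minimal standalone C++ sketch of such a prioritized round-robin
selection (the data layout and helper below are illustrative
assumptions, not the actual TmpFileGroup code):

#include <iostream>
#include <map>
#include <string>
#include <vector>

// Scratch files grouped by priority; a lower value means a higher priority.
// Within one priority level, files are picked round-robin.
struct PriorityGroup {
  std::vector<std::string> files;
  size_t next_idx = 0;  // round-robin cursor within this priority level
};

class TargetFilePicker {
 public:
  void AddFile(int priority, const std::string& file) {
    groups_[priority].files.push_back(file);  // std::map keeps priorities sorted
  }

  // Walks priority levels starting from the highest priority (lowest value)
  // and picks the next file round-robin in the first non-empty level. The
  // real code would also skip files that cannot accept the allocation.
  std::string PickTargetFile() {
    for (auto& entry : groups_) {
      PriorityGroup& group = entry.second;
      if (group.files.empty()) continue;
      std::string chosen = group.files[group.next_idx % group.files.size()];
      ++group.next_idx;
      return chosen;
    }
    return "";
  }

 private:
  std::map<int, PriorityGroup> groups_;
};

int main() {
  TargetFilePicker picker;
  picker.AddFile(0, "/fast_ssd_1/scratch/file0");
  picker.AddFile(0, "/fast_ssd_2/scratch/file1");
  picker.AddFile(10, "/slow_hdd/scratch/file2");
  // Alternates between the two priority-0 files; the priority-10 file is only
  // reached when the higher-priority group has no usable files.
  for (int i = 0; i < 3; ++i) std::cout << picker.PickTargetFile() << std::endl;
}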
Testing:
- Added unit and e2e tests for priority based spilling logic.
Change-Id: I381c3a358e1382e6696325fec74667f1fa18dd17
Reviewed-on: http://gerrit.cloudera.org:8080/16091
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Enabled via --disk_spill_compression_codec, which uses
the same syntax as the compression_codec query option.
Recommended codecs are LZ4 and ZSTD. ZSTD supports
specifying a compression level.
The compression is done in TmpFileMgr using a temporary
buffer. Allocation of disk space is reworked slightly
so that the allocation can happen after compression.
The default power-of-two disk block sizes would lead
to a lot of internal fragmentation, so a new strategy
for free space management, similar to that used in
the data cache, can be used with
--disk_spill_punch_holes=true. TmpFileMgr will allocate
a range of the actual compressed size and punch holes
in the file for each range that is no longer needed.
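For reference, a minimal standalone sketch of the hole-punching call
on Linux (this only illustrates the fallocate() usage, not TmpFileMgr's
actual free space management; assumes g++ on Linux/glibc, where
_GNU_SOURCE is predefined):

#include <fcntl.h>
#include <unistd.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Returns the blocks backing [offset, offset + len) to the filesystem while
// keeping the file size unchanged. Requires a filesystem with hole-punching
// support, e.g. ext4 or xfs.
static int PunchHole(int fd, off_t offset, off_t len) {
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) != 0) {
    std::fprintf(stderr, "fallocate failed: %s\n", std::strerror(errno));
    return -1;
  }
  return 0;
}

int main() {
  int fd = open("/tmp/scratch_hole_example.bin", O_RDWR | O_CREAT, 0600);
  if (fd < 0) return 1;
  if (ftruncate(fd, 1 << 20) != 0) return 1;  // pretend the scratch file is 1MB
  // A compressed spill range was written at offset 0 and has been read back;
  // punch it out so the filesystem can reuse the space.
  PunchHole(fd, 0, 12345);
  close(fd);
  return 0;
}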
UncompressedWriteIoBytes is added to the buffer pool
profiles, so that you can see what degree of compression
is achieved. Typically I saw ratios of 2-3x for LZ4 and
ZSTD (with LZ4 toward the lower end and ZSTD toward
the higher end).
Limitations:
The management of the compression buffer memory could
be improved. Ideally it would be integrated with the
buffer pool and use the buffer pool allocator instead
of being done "on the side". We would probably want to
do this before making this the default, for resource
management and performance reasons (doing a malloc()
directly does not use the caching supported by the
buffer pool).
Testing:
* Run buffer pool spilling tests with different combinations of
the new options.
* Extend existing TmpFileMgr tests for file space allocation to
run with hole punching enabled.
* Switch a couple of spilling tests to use the new option.
* Add a metrics test to check for scratch leaks.
* Enable the new options by default for end-to-end dockerized
tests to get additional coverage.
* Add a unit test where allocating compression memory fails,
both on the read and write path.
* Ran a single-node stress test on TPC-DS SF 1 and TPC-H SF 10
The peak compression buffer usage was ~40MB.
Perf:
I ran this spilling query using an SSD as the scratch disk:
set mem_limit=200m;
select count(distinct l_partkey) from
tpch30_parquet.lineitem;
The time taken for the second run of each query was:
No compression: 19.59s
LZ4: 18.56s
ZSTD: 20.59s
Change-Id: I9c08ff9504097f0fee8c32316c5c150136abe659
Reviewed-on: http://gerrit.cloudera.org:8080/15454
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Bikramjeet Vig <bikramjeet.vig@cloudera.com>
This includes some optimisations and a bulk move of tests
to exhaustive.
Move a bunch of custom cluster tests to exhaustive. I selected
these partially based on runtime (i.e. I looked most carefully
at the tests that ran for over a minute) and the likelihood
of them catching a precommit bug. Regression tests for specific
edge cases and tests for parts of the code that are very stable
were prime candidates.
Remove an unnecessary cluster restart in test_breakpad.
Merge test_scheduler_error into test_failpoints to avoid an unnecessary
cluster restart.
Speed up cluster starts by ensuring that the default statestore args are
applied even when _start_impala_cluster() is called directly. This
shaves a couple of seconds off each restart. We made the default args
use a faster update frequency - see IMPALA-7185 - but they did not
take effect in all tests.
Change-Id: Ib2e3e7ebc9695baec4d69183387259958df10f62
Reviewed-on: http://gerrit.cloudera.org:8080/13967
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Delete these wrapper scripts and replace with a generic
start-daemon.sh script that sets environment variables
without the other logic.
Move the logic for setting JAVA_TOOL_OPTIONS into
start-impala-cluster.py.
Remove some options like -jvm_suspend, -gdb, -perf that
may not be used. These can be reintroduced if needed.
Port across the kerberized minicluster logic (which has
probably bitrotted) in case it needs to be revived.
Remove --verbose option that didn't appear to be useful
(it claims to print daemon output to the console,
but output is still redirected regardless).
Removed a level of quoting in custom cluster test argument
handling - this was made unnecessary by properly escaping
arguments with pipes.escape() in run_daemon().
Testing:
* Ran exhaustive tests.
* Ran on CentOS 6 to confirm we didn't reintroduce Popen issue
worked around by kwho.
Change-Id: Ib67444fd4def8da119db5d3a0832ef1de15b068b
Reviewed-on: http://gerrit.cloudera.org:8080/12271
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
The previous default was often confusing to users of Impala. It is
simpler to do exactly what is asked instead of trying to fix bad
configurations automatically.
Testing:
Ran core tests.
Change-Id: I23394c9949ae4cd0a21d7bb25551371b3198e76c
Reviewed-on: http://gerrit.cloudera.org:8080/10736
Reviewed-by: anujphadke <aphadke@cloudera.com>
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
This is the following squashed patches that were reverted.
I will fix the known issues with some follow-on patches.
======================================================================
IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation
In preparation for switching the I/O mgr to the buffer pool, this
removes and cleans up a lot of code so that the switchover patch starts
from a cleaner slate.
* Remove the free buffer cache (which will be replaced by buffer pool's
own caching).
* Make memory limit exceeded error checking synchronous (in anticipation
of having to propagate buffer pool errors synchronously).
* Simplify error propagation - remove the (ineffectual) code that
enqueued BufferDescriptors containing error statuses.
* Document locking scheme better in a few places, make it part of the
function signature when it seemed reasonable.
* Move ReturnBuffer() to ScanRange, because it is intrinsically
connected with the lifecycle of a scan range.
* Separate external ReturnBuffer() and internal CleanUpBuffer()
interfaces - previously callers of ReturnBuffer() were fudging
the num_buffers_in_reader accounting to make the external interface work.
* Eliminate redundant state in ScanRange: 'eosr_returned_' and
'is_cancelled_'.
* Clarify the logic around calling Close() for the last
BufferDescriptor.
-> There appeared to be an implicit assumption that buffers would be
freed in the order they were returned from the scan range, so that
the "eos" buffer was returned last. Instead just count the number
of outstanding buffers to detect the last one.
-> Touching the is_cancelled_ field without holding a lock was hard to
reason about - violated locking rules and it was unclear that it
was race-free.
* Remove DiskIoMgr::Read() to simplify the interface. It is trivial to
inline at the callsites.
This will probably regress performance somewhat because of the cache
removal, so my plan is to merge it around the same time as switching
the I/O mgr to allocate from the buffer pool. I'm keeping the patches
separate to make reviewing easier.
Testing:
* Ran exhaustive tests
* Ran the disk-io-mgr-stress-test overnight
======================================================================
IMPALA-4835: Part 2: Allocate scan range buffers upfront
This change is a step towards reserving memory for buffers from the
buffer pool and constraining per-scanner memory requirements. This
change restructures the DiskIoMgr code so that each ScanRange operates
with a fixed set of buffers that are allocated upfront and recycled as
the I/O mgr works through the ScanRange.
One major change is that ScanRanges get blocked when a buffer is not
available and get unblocked when a client returns a buffer via
ReturnBuffer(). I was able to remove the logic to maintain the
blocked_ranges_ list by instead adding a separate set with all ranges
that are active.
There is also some miscellaneous cleanup included - e.g. reducing the
amount of code devoted to maintaining counters and metrics.
One tricky part of the existing code was that it called
IssueInitialRanges() with empty lists of files and depended on
DiskIoMgr::AddScanRanges() to not check for cancellation in that case.
See IMPALA-6564/IMPALA-6588. I changed the logic to not try to issue
ranges for empty lists of files.
I plan to merge this along with the actual buffer pool switch, but
separated it out to allow review of the DiskIoMgr changes separate from
other aspects of the buffer pool switchover.
Testing:
* Ran core and exhaustive tests.
======================================================================
IMPALA-4835: Part 3: switch I/O buffers to buffer pool
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
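As a rough illustration of that divvying-up step, a standalone sketch
using a purely proportional split is shown below. This is an assumed
simplification, not the actual algorithm, which also applies minimum
and maximum buffer sizes per column:

#include <cstdint>
#include <iostream>
#include <vector>

// Splits a scan node's reservation across Parquet columns in proportion to
// their on-disk sizes. Clamping to min/max buffer sizes is omitted here.
std::vector<int64_t> DivideReservationAcrossColumns(
    int64_t total_reservation, const std::vector<int64_t>& col_bytes_on_disk) {
  int64_t total_bytes = 0;
  for (int64_t b : col_bytes_on_disk) total_bytes += b;
  std::vector<int64_t> per_col;
  for (int64_t b : col_bytes_on_disk) {
    per_col.push_back(total_bytes == 0 ? 0 : total_reservation * b / total_bytes);
  }
  return per_col;
}

int main() {
  // Three columns of 80MB, 15MB and 5MB on disk sharing a 24MB reservation.
  std::vector<int64_t> shares = DivideReservationAcrossColumns(
      24LL << 20, {80LL << 20, 15LL << 20, 5LL << 20});
  for (int64_t s : shares) std::cout << s << " bytes" << std::endl;
}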
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: I3ef471dc0746f0ab93b572c34024fc7343161f00
Reviewed-on: http://gerrit.cloudera.org:8080/9679
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
Revert "IMPALA-6585: increase test_low_mem_limit_q21 limit"
This reverts commit 25bcb258df.
Revert "IMPALA-6588: don't add empty list of ranges in text scan"
This reverts commit d57fbec6f6.
Revert "IMPALA-4835: Part 3: switch I/O buffers to buffer pool"
This reverts commit 24b4ed0b29.
Revert "IMPALA-4835: Part 2: Allocate scan range buffers upfront"
This reverts commit 5699b59d0c.
Revert "IMPALA-4835: Part 1: simplify I/O mgr mem mgmt and cancellation"
This reverts commit 65680dc421.
Change-Id: Ie5ca451cd96602886b0a8ecaa846957df0269cbb
Reviewed-on: http://gerrit.cloudera.org:8080/9480
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Impala Public Jenkins
This is the final patch to switch the Disk I/O manager to allocate all
buffers from the buffer pool and to reserve the buffers required for
a query upfront.
* The planner reserves enough memory to run a single scanner per
scan node.
* The multi-threaded scan node must increase reservation before
spinning up more threads.
* The scanner implementations must be careful to stay within their
assigned reservation.
The row-oriented scanners were most straightforward, since they only
have a single scan range active at a time. A single I/O buffer is
sufficient to scan the whole file but more I/O buffers can improve I/O
throughput.
Parquet is more complex because it issues a scan range per column and
the sizes of the columns on disk are not known during planning. To
deal with this, the reservation in the frontend is based on a
heuristic involving the file size and # columns. The Parquet scanner
can then divvy up reservation to columns based on the size of column
data on disk.
I adjusted how the 'mem_limit' is divided between buffer pool and non
buffer pool memory for low mem_limits to account for the increase in
buffer pool memory.
Testing:
* Added more planner tests to cover reservation calcs for scan node.
* Test scanners for all file formats with the reservation denial debug
action, to test behaviour when the scanners hit reservation limits.
* Updated memory and buffer pool limits for tests.
* Added unit tests for dividing reservation between columns in parquet,
since the algorithm is non-trivial.
Perf:
I ran TPC-H and targeted perf locally comparing with master. Both
showed small improvements of a few percent and no regressions of
note. Cluster perf tests showed no significant change.
Change-Id: Ic09c6196b31e55b301df45cc56d0b72cfece6786
Reviewed-on: http://gerrit.cloudera.org:8080/8966
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
The buffer pool changes reduced the memory requirement for the sort,
which seems to have been enough to get the query to execute without
spilling on S3. Reduce the limit in the test to force it to spill.
Testing:
Ran in a loop locally for an hour. Ran custom cluster tests on S3.
Change-Id: If65fee3e6a4b759d0d18e30a1c30bd48db0f2a54
Reviewed-on: http://gerrit.cloudera.org:8080/7615
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Always create global BufferPool at startup using 80% of memory and
limit reservations to 80% of query memory (same as BufferedBlockMgr).
The query's initial reservation is computed in the planner, claimed
centrally (managed by the InitialReservations class) and distributed
to query operators from there.
min_spillable_buffer_size and default_spillable_buffer_size query
options control the buffer size that the planner selects for
spilling operators.
Port ExecNodes to use BufferPool:
* Each ExecNode has to claim its reservation during Open()
* Port Sorter to use BufferPool.
* Switch from BufferedTupleStream to BufferedTupleStreamV2
* Port HashTable to use BufferPool via a Suballocator.
This also makes PAGG memory consumption more efficient (avoiding
wasted buffers) and improves the spilling algorithm:
* Allow preaggs to execute with 0 reservation - if streams and hash tables
cannot be allocated, it will pass through rows.
* Halve the buffer requirement for spilling aggs - avoid allocating
buffers for aggregated and unaggregated streams simultaneously.
* Rebuild spilled partitions instead of repartitioning (IMPALA-2708)
TODO in follow-up patches:
* Rename BufferedTupleStreamV2 to BufferedTupleStream
* Implement max_row_size query option.
Testing:
* Updated tests to reflect new memory requirements
Change-Id: I7fc7fe1c04e9dfb1a0c749fb56a5e0f2bf9c6c3e
Reviewed-on: http://gerrit.cloudera.org:8080/5801
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
test_scratch_disk fails sporadically when trying to assert the presence
of log messages. This is probably caused by log caching, since after
such failures the log files do contain the lines in question.
I manually tested this by running the tests repeatedly for 2 days (10k
runs).
To make future diagnosis of similar problems easier, this change also
adds more output to assert_impalad_log_contains().
Change-Id: I9f21284338ee7b4374aca249b6556282b0148389
Reviewed-on: http://gerrit.cloudera.org:8080/5669
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
Refactor BufferedBlockMgr/TmpFileMgr to push more I/O logic into
TmpFileMgr, in anticipation of it being shared with BufferPool.
TmpFileMgr now handles:
* Scratch space allocation and recycling
* Read and write I/O
The interface is also greatly changed so that it is built around Write()
and Read() calls, abstracting away the details of temporary file
allocation from clients. This means the TmpFileMgr::File class can
be hidden from clients.
Write error recovery:
Also implement write error recovery in TmpFileMgr.
If an error occurs while writing to scratch and we have multiple
scratch directories, we will try one of the other directories
before cancelling the query. File-level blacklisting is used to
prevent excessive repeated attempts to resize a scratch file during
a single query. Device-level blacklisting is not implemented because
it is problematic to permanently take a scratch directory out of use.
To reduce the number of error paths, all I/O errors are now handled
asynchronously. Previously errors creating or extending the file were
returned synchronously from WriteUnpinnedBlock(). This required
modifying DiskIoMgr to create the file if not present when opened.
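A minimal standalone C++ sketch of the retry-with-blacklisting idea
(hypothetical names and a synchronous shape for brevity; the real
TmpFileMgr integrates this with DiskIoMgr's asynchronous write path):

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical scratch file handle: a file whose write fails is blacklisted
// for the rest of the query.
struct ScratchFileSketch {
  std::string path;
  bool blacklisted = false;
};

// Tries scratch files in turn, skipping blacklisted ones. Only if no scratch
// file can take the write does the operation (and hence the query) fail.
bool WriteWithRecovery(std::vector<ScratchFileSketch>& files,
                       const std::function<bool(const std::string&)>& try_write) {
  for (ScratchFileSketch& file : files) {
    if (file.blacklisted) continue;
    if (try_write(file.path)) return true;
    std::cerr << "write to " << file.path << " failed, blacklisting" << std::endl;
    file.blacklisted = true;  // avoid repeated attempts on this file
  }
  return false;  // no scratch file worked: cancel the query
}

int main() {
  std::vector<ScratchFileSketch> files = {{"/ramdisk/scratch/f0"},
                                          {"/tmp/scratch/f1"}};
  // Simulate the first directory being out of space: only /tmp writes succeed.
  bool ok = WriteWithRecovery(files, [](const std::string& path) {
    return path.rfind("/tmp/", 0) == 0;
  });
  std::cout << (ok ? "spilled successfully" : "query cancelled") << std::endl;
}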
Also set the default max_errors value in the thrift definition file,
so that it is in effect for backend tests.
Future Work:
* Support for recycling variable-length scratch file ranges. I omitted
this to avoid making the patch even larger.
Testing:
Updated BufferedBlockMgr unit test to reflect changes in behaviour:
* Scratch space is no longer permanently associated with a block, and
is remapped every time a new block is written to disk.
* Files are now blacklisted - updated existing tests and enabled the
previously disabled blacklisting test.
Added some basic testing of recycling of scratch file ranges in
the TmpFileMgr unit test.
I also manually tested the code in two ways. First by removing permissions
for /tmp/impala-scratch and ensuring that a spilling query fails cleanly.
Second, by creating a tiny ramdisk (16M) and running with two scratch
directories: one on /tmp and one on the tiny ramdisk. When spilling, an
out of space error is encountered for the tiny ramdisk and impala spills
the remaining data (72M) to /tmp.
Change-Id: I8c9c587df006d2f09d72dd636adafbd295fcdc17
Reviewed-on: http://gerrit.cloudera.org:8080/5141
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Impala Public Jenkins
For files that have a Cloudera copyright (and no other copyright
notice), make changes to follow the ASF source file header policy here:
http://www.apache.org/legal/src-headers.html#headers
Specifically:
1) Remove the Cloudera copyright.
2) Modify NOTICE.txt according to
http://www.apache.org/legal/src-headers.html#notice
to follow that format and add a line for Cloudera.
3) Replace or add the existing ASF license text with the one given
on the website.
Much of this change was automatically generated via:
git grep -li 'Copyright.*Cloudera' > modified_files.txt
cat modified_files.txt | xargs perl -n -i -e 'print unless m#Copyright.*Cloudera#i;'
cat modified_files.txt | xargs fix_apache_license.py [1]
Some manual fixups were performed following those steps, especially when
license text was completely missing from the file.
[1] https://gist.github.com/anonymous/ff71292094362fc5c594 with minor
modification to ORIG_LICENSE to match Impala's license text.
Change-Id: I2e0bd8420945b953e1b806041bea4d72a3943d86
Reviewed-on: http://gerrit.cloudera.org:8080/3779
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Tested-by: Internal Jenkins
Many of our test scripts have import statements that look like
"from xxx import *". It is a good practice to explicitly name what
needs to be imported. This commit implements this practice. Also,
unused import statements are removed.
Change-Id: I6a33bb66552ae657d1725f765842f648faeb26a8
Reviewed-on: http://gerrit.cloudera.org:8080/3444
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Internal Jenkins
Make the test deterministic by using max_block_mgr_memory instead of
mem_limit, so that the non-deterministic scanner memory usage does not
influence the spilling behaviour of the queries.
Testing:
Ran the test locally to confirm that it succeeded. Also manually
computed the memory requirement. The data size to be sorted is ~220MB,
so with a 64MB block manager limit per node, at least one node must
spill.
Change-Id: I9525a029ac020bb5b8bea210a741c9f9c5ec3c75
Reviewed-on: http://gerrit.cloudera.org:8080/3318
Reviewed-by: Dan Hecht <dhecht@cloudera.com>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Silvius Rus <srus@cloudera.com>
Tested-by: Tim Armstrong <tarmstrong@cloudera.com>
All versions of pytest contain various bugs regarding test marking
(including skips) when tests are both:
1. class-level marked
2. inherited
More info is available in IMPALA-3614 and IMPALA-2943, but the gist is
that it's possible for some tests to be skipped when they shouldn't be.
This is happening pretty badly with the custom cluster tests, because
CustomClusterTestSuite has a class level skipif mark.
The easiest workaround for now is to remove the pytest skipif mark in
CustomClusterTestSuite and skip using explicit pytest.skip() in the
setup_class() method. Some CustomClusterTestSuite children implemented
their own setup_* methods, and I made some adjustments to them both to
clean them up and implement proper parent method calling via super().
Testing:
I ran the following combinations of all the custom cluster tests:
DEBUG / HDFS / core
RELEASE / HDFS / exhaustive
DEBUG / LOCAL / core
DEBUG / S3 / core
Before, we'd get situations in which most of the tests were skipped.
Consider the RELEASE/HDFS/exhaustive situation:
custom_cluster/test_admission_controller.py .....
custom_cluster/test_alloc_fail.py ss
custom_cluster/test_breakpad.py sssss
custom_cluster/test_delegation.py sss
custom_cluster/test_exchange_delays.py ss
custom_cluster/test_hdfs_fd_caching.py s
custom_cluster/test_hive_parquet_timestamp_conversion.py ss
custom_cluster/test_insert_behaviour.py ss
custom_cluster/test_legacy_joins_aggs.py s
custom_cluster/test_parquet_max_page_header.py s
custom_cluster/test_permanent_udfs.py sss
custom_cluster/test_query_expiration.py sss
custom_cluster/test_redaction.py ssss
custom_cluster/test_s3a_access.py s
custom_cluster/test_scratch_disk.py ssss
custom_cluster/test_session_expiration.py s
custom_cluster/test_spilling.py ssss
authorization/test_authorization.py ss
authorization/test_grant_revoke.py s
Now, more tests run appropriately:
custom_cluster/test_admission_controller.py .....
custom_cluster/test_alloc_fail.py ss
custom_cluster/test_breakpad.py sssss
custom_cluster/test_delegation.py ...
custom_cluster/test_exchange_delays.py ss
custom_cluster/test_hdfs_fd_caching.py .
custom_cluster/test_hive_parquet_timestamp_conversion.py ..
custom_cluster/test_insert_behaviour.py ..
custom_cluster/test_kudu_not_available.py .
custom_cluster/test_legacy_joins_aggs.py .
custom_cluster/test_parquet_max_page_header.py .
custom_cluster/test_permanent_udfs.py ...
custom_cluster/test_query_expiration.py ...
custom_cluster/test_redaction.py ....
custom_cluster/test_s3a_access.py s
custom_cluster/test_scratch_disk.py ....
custom_cluster/test_session_expiration.py .
custom_cluster/test_spilling.py ....
authorization/test_authorization.py ..
authorization/test_grant_revoke.py .
Change-Id: Ie301b69718f8690322cc3b4130fb1c715344779c
Reviewed-on: http://gerrit.cloudera.org:8080/3265
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Tested-by: Michael Brown <mikeb@cloudera.com>
The problem is that xdist is spinning up multiple processes and each
process tries to create the same temp dirs. This commit fixes the
issue by creating a randomly named dir to avoid the conflict.
Change-Id: Ic0764843ace00aef8c9b01139906e01ab5213047
Reviewed-on: http://gerrit.cloudera.org:8080/2817
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Internal Jenkins
Previously Impala could erroneously decide to use non-writable scratch
directories, e.g. if /tmp/impala-scratch already exists and is not
writable by the current user.
With this change, if we cannot remove and recreate a fresh scratch directory,
it is not used. If we have no valid scratch directories, we log an
error and continue startup.
Add unit test for CreateDirectory to test behavior for success and
failure cases.
Add system tests to check logging and query execution in various
scenarios where we do not have scratch available.
Modify FilesystemUtil to use non-exception-throwing Boost functions to
avoid unhandled exceptions escaping into the rest of the Impala
codebase, which does not expect the use of exceptions.
Change-Id: Icaa8429051942424e1d811c54bde10102ac7f7b3
Reviewed-on: http://gerrit.cloudera.org:8080/565
Reviewed-by: Tim Armstrong <tarmstrong@cloudera.com>
Tested-by: Internal Jenkins
Many python files had a hashbang and the executable bit set though
they were not intended to be run as standalone scripts. That makes
determining which python files are actually scripts very difficult.
A future patch will update the hashbang in real python scripts so they
use $IMPALA_HOME/bin/impala-python.
Change-Id: I04eafdc73201feefe65b85817a00474e182ec2ba
Reviewed-on: http://gerrit.cloudera.org:8080/599
Reviewed-by: Casey Ching <casey@cloudera.com>
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: Internal Jenkins
- Added static order by tests to test_queries.py and QueryTest/sort.test
- test_order_by.py also contains tests with static queries that are run with
multiple memory limits.
- Added stress, scratch disk and failpoints tests
- Incorporated Srinath's change that copied all order by with limit tests into
the top-n.test file
Extra time required:
Serial:
scratch disk: 42 seconds
test queries sort : 77 seconds
test sort: 56 seconds
sort stress: 142 seconds
TOTAL: 5 min 17 seconds
Parallel(8 threads):
scratch disk: 40 seconds
test queries sort: 42 seconds
test sort: 49 seconds
sort stress: 93 seconds
TOTAL: 3 min 44 sec
Change-Id: Ic5716bcfabb5bb3053c6b9cebc9bfbbb9dc64a7c
Reviewed-on: http://gerrit.ent.cloudera.com:8080/2820
Reviewed-by: Taras Bobrovytsky <tbobrovytsky@cloudera.com>
Tested-by: jenkins
Reviewed-on: http://gerrit.ent.cloudera.com:8080/3205