21 Commits

Joe McDonnell
775f73f03e IMPALA-14462: Fix tie-breaking for sorting scan ranges oldest to newest
TestTupleCacheFullCluster.test_scan_range_distributed is flaky on S3
builds. The addition of a single file is changing scheduling significantly
even with scan ranges sorted oldest to newest. This is because modification
times on S3 have a granularity of one second. Multiple files have the
same modification time, and the fix for IMPALA-13548 did not properly
break ties for sorting.

This adds logic to break ties for files with the same modification
time. It compares the path (absolute path or relative path + partition)
as well as the offset within the file. These should be enough to break
all conceivable ties, as it is not possible to have two scan ranges with
the same file at the same offset. In debug builds, this does additional
validation to make sure that when a != b, comp(a, b) != comp(b, a).
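
For illustration, here is a minimal sketch of the tie-breaking comparison;
the struct and field names are hypothetical, not Impala's actual scan range
types:

  // Hypothetical sketch of comparing scan ranges oldest to newest with
  // tie-breaking on path and offset.
  #include <cassert>
  #include <cstdint>
  #include <string>
  #include <tuple>

  struct ScanRangeInfo {
    int64_t mtime;     // file modification time (1s granularity on S3)
    std::string path;  // absolute path, or relative path + partition
    int64_t offset;    // offset of the scan range within the file
  };

  // Returns true if a should be scheduled before b (older first).
  bool ScanRangeOlder(const ScanRangeInfo& a, const ScanRangeInfo& b) {
    // Compare modification time first, then break ties on path and offset.
    // Two scan ranges can never share both the same file and the same
    // offset, so this yields a strict total order.
    return std::tie(a.mtime, a.path, a.offset) <
           std::tie(b.mtime, b.path, b.offset);
  }

  int main() {
    ScanRangeInfo a{1000, "/warehouse/t/p=1/f0.parq", 0};
    ScanRangeInfo b{1000, "/warehouse/t/p=1/f1.parq", 0};
    // Sanity check in the spirit of the debug validation described above.
    assert(ScanRangeOlder(a, b) != ScanRangeOlder(b, a));
    return 0;
  }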

The test requires that adding a single file to the table changes exactly
one cache key. If that final file has the same modification time as
an existing file, scheduling may still mix up the files and change more
than one cache key, even with tie-breaking. This adds a sleep just before
generating the final file to guarantee that it gets a newer modification
time.

Testing:
 - Ran TestTupleCacheFullCluster.test_scan_range_distributed for 15
   iterations on S3

Change-Id: I3f2e40d3f975ee370c659939da0374675a28cd38
Reviewed-on: http://gerrit.cloudera.org:8080/23458
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
2025-09-25 23:30:28 +00:00
Joe McDonnell
e05d92cb3d IMPALA-13548: Schedule scan ranges oldest to newest for tuple caching
Scheduling does not sort scan ranges by modification time. When a new
file is added to a table, its order in the list of scan ranges is
not based on modification time. Instead, it is based on which partition
it belongs to and what its filename is. A new file that is added early
in the list of scan ranges can cause cascading differences in scheduling.
For tuple caching, this means that multiple runtime cache keys could
change due to adding a single file.

To minimize that disruption, this adds the ability to sort the scan
ranges by modification time and schedule scan ranges oldest to newest.
This enables it for scan nodes that feed into tuple cache nodes
(similar to deterministic scan range assignment).

Testing:
 - Modified TestTupleCacheFullCluster::test_scan_range_distributed
   to have stricter checks about how many cache keys change after
   an insert (only one should change)
 - Modified TupleCacheTest#testDeterministicScheduling to verify that
   oldest to newest scheduling is also enabled.

Change-Id: Ia4108c7a00c6acf8bbfc036b2b76e7c02ae44d47
Reviewed-on: http://gerrit.cloudera.org:8080/23228
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-22 20:51:46 +00:00
Joe McDonnell
ca356a8df5 IMPALA-13437 (part 2): Implement cost-based tuple cache placement
This changes the default behavior of the tuple cache to consider
cost when placing the TupleCacheNodes. It tries to pick the best
locations within a budget. First, it eliminates unprofitable locations
via a threshold. Next, it ranks the remaining locations by their
profitability. Finally, it picks the best locations in rank order
until it reaches the budget.

The threshold is based on the ratio of processing cost for regular
execution versus the processing cost for reading from the cache.
If the ratio is below the threshold, the location is eliminated.
The threshold is specified by the tuple_cache_required_cost_reduction_factor
query option. This defaults to 3.0, which means that the cost of
reading from the cache must be less than 1/3 the cost of computing
the value normally. A higher value makes this more restrictive
about caching locations, which pushes in the direction of lower
overhead.

The ranking is based on the cost reduction per byte. This is given
by the formula:
 (regular processing cost - cost to read from cache) / estimated serialized size
This prefers locations with small results or high reduction in cost.

The budget is based on the estimated serialized size per node. This
limits the total caching that a query will do. A higher value allows more
caching, which can increase the overhead on the first run of a query. A lower
value is less aggressive and can limit the overhead at the expense of less
caching. This uses a per-node limit because the limit should scale with the
size of the executor group: each executor brings extra capacity. The budget
is specified by the tuple_cache_budget_bytes_per_executor query option.
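
As a rough illustration of the threshold / rank / budget selection, here is
a sketch; the types, field names, and structure below are hypothetical, not
the planner's actual code:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct Candidate {
    int64_t regular_cost;      // processing cost of normal execution
    int64_t cache_read_cost;   // processing cost of reading from the cache
    int64_t serialized_bytes;  // estimated serialized result size
  };

  std::vector<Candidate> PickLocations(std::vector<Candidate> cands,
                                       double required_reduction_factor,
                                       int64_t budget_bytes) {
    // 1. Threshold: drop locations whose cost ratio is below the factor.
    cands.erase(std::remove_if(cands.begin(), cands.end(),
        [&](const Candidate& c) {
          return c.regular_cost <
                 required_reduction_factor * c.cache_read_cost;
        }), cands.end());
    // 2. Rank by cost reduction per serialized byte, best first.
    std::sort(cands.begin(), cands.end(),
        [](const Candidate& a, const Candidate& b) {
          double ra = double(a.regular_cost - a.cache_read_cost) /
                      a.serialized_bytes;
          double rb = double(b.regular_cost - b.cache_read_cost) /
                      b.serialized_bytes;
          return ra > rb;
        });
    // 3. Budget: take the best locations until the size budget is used up.
    std::vector<Candidate> picked;
    int64_t used = 0;
    for (const Candidate& c : cands) {
      if (used + c.serialized_bytes > budget_bytes) break;
      used += c.serialized_bytes;
      picked.push_back(c);
    }
    return picked;
  }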

The old behavior to place the tuple cache at all eligible locations is
still available via the tuple_cache_placement_policy query option. The
default is the cost_based policy described above, but the old behavior
is available via the all_eligible policy. This is useful for correctness
testing (and the existing tuple cache test cases).

This changes the explain plan output:
 - The hash trace is only enabled at VERBOSE level. This means that the regular
   profile will not contain the hash trace, as the regular profile uses EXTENDED.
 - This adds additional information at VERBOSE to display the cost information
   for each plan node. This can help trace why a particular location was
   not picked.

Testing:
 - This adds a TPC-DS planner test with tuple caching enabled (based on the
   existing TpcdsCpuCostPlannerTest)
 - This modifies existing tests to adapt to changes in the explain plan output

Change-Id: Ifc6e7b95621a7937d892511dc879bf7c8da07cdc
Reviewed-on: http://gerrit.cloudera.org:8080/23219
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-09-18 21:02:51 +00:00
Riza Suminto
1cead45114 IMPALA-13947: Test local catalog mode by default
Local catalog mode has been the default and works well in downstream
Impala for over 5 years. This patch turns on local catalog mode by
default (--catalog_topic_mode=minimal and --use_local_catalog=true) as
the preferred mode going forward.

Implemented LocalCatalog.setIsReady() to facilitate using local catalog
mode for FE tests. Some FE tests fail due to behavior differences in
local catalog mode, such as IMPALA-7539. This is probably OK since Impala
now largely hands over FileSystem permission checks to Apache Ranger.

The following custom cluster tests are pinned to evaluate under legacy
catalog mode because their behavior changed in local catalog mode:

TestCalcitePlanner.test_calcite_frontend
TestCoordinators.test_executor_only_lib_cache
TestMetadataReplicas
TestTupleCacheCluster
TestWorkloadManagementSQLDetailsCalcite.test_tpcds_8_decimal

In TestHBaseHmsColumnOrder.test_hbase_hms_column_order, set the
--use_hms_column_order_for_hbase_tables=true flag for both impalad and
catalogd to get a consistent column order in either local or legacy
catalog mode.

Changed TestCatalogRpcErrors.test_register_subscriber_rpc_error
assertions to be more fine-grained by matching individual query ids.

Moved most of the test methods from TestRangerLegacyCatalog to
TestRangerLocalCatalog, except for some that do need to run in legacy
catalog mode. Also renamed TestRangerLocalCatalog to
TestRangerDefaultCatalog. The table ownership issue in local catalog mode
remains unresolved (see IMPALA-8937).

Testing:
Pass exhaustive tests.

Change-Id: Ie303e294972d12b98f8354bf6bbc6d0cb920060f
Reviewed-on: http://gerrit.cloudera.org:8080/23080
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-08-06 21:42:24 +00:00
Joe McDonnell
07a5716773 IMPALA-13892: Add support for printing STRUCTs
Tuple cache correctness verification is failing as the code in
debug-util.cc used for printing the text version of tuples does
not support printing structs. It hits a DCHECK and kills Impala.

This adds support for printing structs to debug-util.cc, fixing
tuple cache correctness verification for complex types. To print
structs correctly, each slot needs to know its field name. The
ColumnType has this information, but it requires a field idx to
look up the name. This is the last index in the absolute path for
this slot. However, the materialized path can be truncated to
remove some indices at the end. Since we need that information to
resolve the field name, this adds the struct field idx to the
TSlotDescriptor to pass it to the backend.
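
A toy sketch of the field-name lookup idea follows; these types are
illustrative only, not Impala's actual descriptors:

  #include <iostream>
  #include <string>
  #include <vector>

  struct StructTypeSketch {
    std::vector<std::string> field_names;
  };

  struct SlotSketch {
    int field_idx;      // last index of the slot's absolute path
    std::string value;  // slot value already rendered as text
  };

  void PrintStruct(const StructTypeSketch& type,
                   const std::vector<SlotSketch>& slots, std::ostream& os) {
    os << "{";
    for (size_t i = 0; i < slots.size(); ++i) {
      if (i > 0) os << ",";
      // The field name comes from the struct type, keyed by the field idx.
      os << type.field_names.at(slots[i].field_idx) << ":" << slots[i].value;
    }
    os << "}";
  }

  int main() {
    StructTypeSketch addr{{"street", "city", "zip"}};
    std::vector<SlotSketch> slots = {{1, "Sunnyvale"}, {2, "94087"}};
    PrintStruct(addr, slots, std::cout);  // prints {city:Sunnyvale,zip:94087}
    std::cout << "\n";
    return 0;
  }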

This also adds a counter to the profile to track when correctness
verification is on. This is useful for testing.

Testing:
 - Added a custom cluster test using nested types with
   correctness verification
 - Examined some of the text files

Change-Id: Ib9479754c2766a9dd6483ba065e26a4d3a22e7e9
Reviewed-on: http://gerrit.cloudera.org:8080/23075
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-07-23 16:15:30 +00:00
Joe McDonnell
78a27c56fe IMPALA-13898: Incorporate partition information into tuple cache keys
Currently, the tuple cache keys do not include partition
information in either the planner key or the fragment instance
key. However, the partition actually is important to correctness.

First, there are settings defined on the table and partition that
can impact the results. For example, for processing text files,
the separator, escape character, etc. are specified at the table
level. This impacts the rows produced from a given file. There
are other such settings stored at the partition level (e.g.
the JSON binary format).

Second, it is possible to have two partitions pointed at the same
filesystem location. For example, scale_db.num_partitions_1234_blocks_per_partition_1
is a table that has all partitions pointing to the same
location. In that case, the cache can't tell the partitions
apart based on the files alone. This is an exotic configuration.
Incorporating an identifier of the partition (e.g. the partition
keys/values) allows the cache to tell the difference.

To fix this, we incorporate partition information into the
key. At planning time, when incorporating the scan range information,
we also incorporate information about the associated partitions.
This moves the code to HdfsScanNode and changes it to iterate over
the partitions, hashing both the partition information and the scan
ranges. At runtime, the TupleCacheNode looks up the partition
associated with a scan node and hashes the additional information
on the HdfsPartitionDescriptor.
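
As a hedged sketch of folding per-partition settings and identity into the
key (the structures and the hash below are stand-ins, not Impala's real
descriptors or hash function):

  #include <cstdint>
  #include <functional>
  #include <string>
  #include <vector>

  struct PartitionInfoSketch {
    std::string partition_keys_values;  // e.g. "p=1/q=foo"
    std::string location;
    char field_delim;
    char escape_char;
    std::vector<std::string> file_names;  // stand-in for the scan ranges
  };

  void HashCombine(size_t* seed, size_t v) {
    *seed ^= v + 0x9e3779b97f4a7c15ULL + (*seed << 6) + (*seed >> 2);
  }

  size_t HashPartitions(const std::vector<PartitionInfoSketch>& parts) {
    std::hash<std::string> hs;
    size_t key = 0;
    for (const PartitionInfoSketch& p : parts) {
      // Partition-level settings and identity affect the rows produced from
      // a file, so they are hashed even when two partitions share the same
      // filesystem location.
      HashCombine(&key, hs(p.partition_keys_values));
      HashCombine(&key, hs(p.location));
      HashCombine(&key, std::hash<char>{}(p.field_delim));
      HashCombine(&key, std::hash<char>{}(p.escape_char));
      for (const std::string& f : p.file_names) HashCombine(&key, hs(f));
    }
    return key;
  }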

This includes some test-only changes to make it possible to run the
TestBinaryType::test_json_binary_format test case with tuple caching.
ImpalaTestSuite::_get_table_location() (used by clone_table()) now
detects a fully-qualified table name and extracts the database from it.
It only uses the test vector to calculate the database if the table is
not fully qualified. This allows a test to clone a table without
needing to manipulate its test vector to match the right database. This
also changes _get_table_location() so that it does not switch into the
database. This required reworking test_scanners_fuzz.py to use absolute
paths for queries. It turns out that some tests in test_scanners_fuzz.py
were running in the wrong database and running against uncorrupted
tables. After this is corrected, some tests can crash Impala. This
xfails those tests until this can be fixed (tracked by IMPALA-14219).

Testing:
 - Added a frontend test in TupleCacheTest for a table with
   multiple partitions pointed at the same place.
 - Added custom cluster tests testing both issues

Change-Id: I3a7109fcf8a30bf915bb566f7d642f8037793a8c
Reviewed-on: http://gerrit.cloudera.org:8080/23074
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Joe McDonnell <joemcdonnell@cloudera.com>
2025-07-17 01:07:44 +00:00
Joe McDonnell
7b25a7b070 IMPALA-13887: Incorporate column/field information into cache key
The correctness verification for the tuple cache found an issue
with TestParquet::test_resolution_by_name(). The test creates a
table, selects, alters the table to change a column name, and
selects again. With parquet_fallback_schema_resolution=NAME, the
column names determine behavior. The tuple cache key did not
include the column names, so it was producing an incorrect result
after changing the column name.

This change adds information about the column / field name to the
TSlotDescriptor so that it is incorporated into the tuple cache key.
This is only needed when producing the tuple cache key, so it is
omitted for other cases.

Testing:
 - Ran TestParquet::test_resolution_by_name() with correctness
   verification
 - Added custom cluster test that runs the test_resolution_by_name()
   test case with tuple caching. This fails without this change.

Change-Id: Iebfa777452daf66851b86383651d35e1b0a5f262
Reviewed-on: http://gerrit.cloudera.org:8080/23073
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-06-27 18:13:19 +00:00
Michael Smith
f222574f04 IMPALA-13660: Support caching broadcast hash joins
This extends tuple caching to be able to cache above joins. As part
of this, ExchangeNodes are now eligible for caching in the broadcast and
directed cases. This does not yet support partitioned exchanges. Since
an exchange passes data from all nodes, this incorporates all the
scan range information when passing through an exchange.

For joins with a separate build side, a cache hit above the join
means that a probe-side thread will never arrive. If the builder
is not notified, it will wait for that thread to arrive and extend
the latency of the query significantly. This adds code to notify
the builder when a thread will never participate in the probe
phase.
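
Conceptually, the notification could look like the sketch below; this is
not Impala's JoinBuilder API, just an illustration of the synchronization:

  #include <condition_variable>
  #include <mutex>

  class BuildSideSketch {
   public:
    explicit BuildSideSketch(int expected_probe_threads)
        : remaining_(expected_probe_threads) {}

    // Called by a probe-side thread that got a cache hit above the join and
    // will therefore never open the probe side.
    void ProbeThreadWillNotArrive() {
      std::lock_guard<std::mutex> l(lock_);
      --remaining_;
      cv_.notify_all();
    }

    // Called by probe threads that do arrive; same effect on the count here.
    void ProbeThreadArrived() { ProbeThreadWillNotArrive(); }

    // The builder blocks here until every expected probe thread has either
    // arrived or declared that it never will.
    void WaitForProbeThreads() {
      std::unique_lock<std::mutex> l(lock_);
      cv_.wait(l, [this]() { return remaining_ == 0; });
    }

   private:
    std::mutex lock_;
    std::condition_variable cv_;
    int remaining_;
  };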

Testing:
 - Added test cases to TestTupleCache, including with distributed
   plans.
 - Added test cases to test_tuple_cache.py to verify behavior when
   updating the build side table and the timing of a cache hit.
 - Performance tests with TPC-DS at scale

Change-Id: Ic61462702b43175c593b34e8c3a14b9cfe85c29e
Reviewed-on: http://gerrit.cloudera.org:8080/22371
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-10 05:53:44 +00:00
Joe McDonnell
7aa4d50484 IMPALA-13882: Fix Iceberg v2 deletes with tuple caching
A variety of Iceberg statements (including v2 deletes) rely
on getting information from the scan node child of the
delete node. Since tuple caching can insert a TupleCacheNode
above that scan, the logic is currently failing, because
it doesn't know how to bypass the TupleCacheNode and get
to the scan node below.

This modifies the logic in multiple places to detect a
TupleCacheNode and go past it to get the scan node
below it.

Testing:
 - Added a basic Iceberg test with v2 deletes to both the
   frontend tests and the custom cluster tests

Change-Id: I162e738c4e4449a536701a740272aaac56ce8fd8
Reviewed-on: http://gerrit.cloudera.org:8080/22666
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-10 00:20:24 +00:00
Csaba Ringhofer
f98b697c7b IMPALA-13929: Make 'functional-query' the default workload in tests
This change adds get_workload() to ImpalaTestSuite and removes it
from all test suites that already returned 'functional-query'.
get_workload() is also removed from CustomClusterTestSuite which
used to return 'tpch'.

All other changes besides impala_test_suite.py and
custom_cluster_test_suite.py are just mass removals of
get_workload() functions.

The behavior is only changed in custom cluster tests that didn't
override get_workload(). By returning 'functional-query' instead
of 'tpch', exploration_strategy() will no longer return 'core' in
'exhaustive' test runs. See IMPALA-3947 on why workload affected
exploration_strategy. An example of an affected test is
TestCatalogHMSFailures, which was skipped in both core and exhaustive
runs before this change.

get_workload() functions that return a different workload than
'functional-query' are not changed - it is possible that some of
these also don't handle exploration_strategy() as expected, but
individually checking these tests is out of scope in this patch.

Change-Id: I9ec6c41ffb3a30e1ea2de773626d1485c69fe115
Reviewed-on: http://gerrit.cloudera.org:8080/22726
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Daniel Becker <daniel.becker@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-04-08 07:12:55 +00:00
Michael Smith
4a645105f9 IMPALA-13658: Enable tuple caching aggregates
Enables tuple caching on aggregates directly above scan nodes. Caching
aggregates requires that their children are also eligible for caching,
so this excludes aggregates above an exchange, union, or hash join.
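
A toy sketch of the eligibility rule described above; the node kinds and
structure are illustrative, not the planner's actual classes:

  #include <memory>
  #include <vector>

  enum class NodeKind { kScan, kAggregate, kExchange, kUnion, kHashJoin };

  struct PlanNodeSketch {
    NodeKind kind;
    std::vector<std::shared_ptr<PlanNodeSketch>> children;
  };

  bool CacheEligible(const PlanNodeSketch& node) {
    switch (node.kind) {
      case NodeKind::kScan:
        return true;
      case NodeKind::kAggregate:
        // An aggregate is eligible only if all of its inputs are, so an
        // aggregate above an exchange, union, or hash join is excluded.
        for (const auto& c : node.children) {
          if (!CacheEligible(*c)) return false;
        }
        return true;
      default:
        return false;  // exchange, union, hash join: not eligible here
    }
  }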

Testing:
- Adds Planner tests for different aggregate cases to confirm they have
  stable tuple cache keys and are valid for caching.
- Adds custom cluster tests that cached aggregates are used, and can be
  re-used in slightly different statements.

Change-Id: I9bd13c2813c90d23eb3a70f98068fdcdab97a885
Reviewed-on: http://gerrit.cloudera.org:8080/22322
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2025-01-22 23:26:21 +00:00
Michael Smith
2085edbe1c IMPALA-13503: CustomClusterTestSuite for whole class
Allow using CustomClusterTestSuite with a single cluster for the whole
class. This speeds up tests by letting us group together multiple test
cases on the same cluster configuration and only starting the cluster
once.

Updates tuple cache tests as an example of how this can be used. Reduces
test_tuple_cache execution time from 100s to 60s.

Change-Id: I7a08694edcf8cc340d89a0fb33beb8229163b356
Reviewed-on: http://gerrit.cloudera.org:8080/22006
Reviewed-by: Riza Suminto <riza.suminto@cloudera.com>
Reviewed-by: Jason Fehr <jfehr@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-18 23:57:39 +00:00
Yida Wu
493d563590 IMPALA-13510: Unset the environment variable for tuple cache tests
The test_cache_disabled test case would fail in the tuple cache
build because the build enables the tuple cache using the
environment variables, while the test case requires the tuple
cache to remain disabled.

This patch unsets the related environment variable TUPLE_CACHE_DIR
before running the tuple-cache-related tests to ensure the tests
use their own starting flags.

Tests:
Passed the test_tuple_cache.py under the tuple cache build.

Change-Id: I2b551e533c7c69d5b29ed6ad6af90be57f53c937
Reviewed-on: http://gerrit.cloudera.org:8080/22018
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-11-07 10:24:19 +00:00
Joe McDonnell
1267fde57b IMPALA-13497: Add TupleCacheBytesWritten/Read to the profile
This adds counters for the number of bytes written / read
from the tuple cache. This gives visibility into whether
certain locations have enormous result sizes. This will be
used to tune the placement of tuple cache nodes.

Tests:
 - Added checks of the TupleCacheBytesWritten/Read counters
   to existing tests in test_tuple_cache.py

Change-Id: Ib5c9249049d8d46116a65929896832d02c2d9f1f
Reviewed-on: http://gerrit.cloudera.org:8080/21991
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-30 21:50:32 +00:00
Yida Wu
47b638e667 IMPALA-13188: Add test that compute stats does not result in a different tuple cache key
The patch introduces a new test, TestTupleCacheComputeStats, to
verify that compute stats does not change the tuple cache key.
The test creates a simple table with one row, runs an explain
on a basic query, then inserts more rows, computes the stats,
and reruns the same explain query. It compares the two results
to ensure that the cache keys are identical in the planning
phase.

Tests:
Passed the test.

Change-Id: I918232f0af3a6ab8c32823da4dba8f8cd31369d0
Reviewed-on: http://gerrit.cloudera.org:8080/21917
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-22 02:20:34 +00:00
Yida Wu
f5a8771a98 IMPALA-13411: Fix DCHECK fires for scan nodes that produce zero-length tuples
Removed the DCHECK assertion that tuple_data_len must be greater
than zero in tuple-file-writer.cc and tuple-file-reader.cc,
because in certain cases, such as count(*), tuple_data_len can
be zero, as no column data is returned and only the row count
matters.

Tests:
Adds TestTupleCacheCountStar for the regression test.

Change-Id: I264b537f0eb678b65081e90c21726198f254513d
Reviewed-on: http://gerrit.cloudera.org:8080/21953
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-17 23:25:40 +00:00
Michael Smith
b6b953b48e IMPALA-13186: Tag query option scope for tuple cache
Constructs a hash of non-default query options that are relevant to
query results; by default, query options are included in the hash.
Passes this hash to the frontend for inclusion in the tuple cache key
on plan leaf nodes (which will be included in parent key calculation).

Modifies MurmurHash3 to be re-entrant so the backend can construct a
hash incrementally. This is slightly slower but more memory efficient
than accumulating all hash inputs in a contiguous array first.
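
As an illustration of hashing non-default options incrementally (FNV-1a
below is only a stand-in for the re-entrant MurmurHash3, and the option
names and exempt set are hypothetical):

  #include <cstdint>
  #include <map>
  #include <set>
  #include <string>

  struct IncrementalHasher {
    uint64_t state = 1469598103934665603ULL;  // FNV offset basis
    void Add(const std::string& s) {
      for (unsigned char c : s) {
        state ^= c;
        state *= 1099511628211ULL;  // FNV prime
      }
    }
  };

  uint64_t HashQueryOptions(
      const std::map<std::string, std::string>& non_default_opts,
      const std::set<std::string>& exempt) {
    IncrementalHasher h;
    for (const auto& [name, value] : non_default_opts) {
      // Options marked exempt (ones that cannot affect results) are skipped.
      if (exempt.count(name)) continue;
      h.Add(name);
      h.Add(value);
    }
    return h.state;
  }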

Adds TUPLE_CACHE_EXEMPT_QUERY_OPT_FN to mark query options that can be
ignored when calculating a tuple cache hash.

Adds startup flag 'tuple_cache_exempt_query_options' as a safety valve
for query options that might be important to exempt that we missed.

Removes duplicate printing logic for query options from child-query.cc
in favor of re-using TQueryOptionsToMap, which does the same thing.

Cleans up query-options.cc helpers so they're static and reduces
duplicate printing logic.

Adds test that different values for a relevant query option use
different cache entries. Adds startup flag
'tuple_cache_ignore_query_options' to omit query options for testing
certain tuple cache failure modes, where we need to use debug actions.

Change-Id: I1f4802ad9548749cd43df8848b6f46dca3739ae7
Reviewed-on: http://gerrit.cloudera.org:8080/21698
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-09 17:21:10 +00:00
Michael Smith
d81c4db5e1 IMPALA-13185: Include runtime filter source in key
Incorporates the build-side PlanNode of a runtime filter in the tuple
cache key to avoid re-using intermediate results that were generated
using a runtime filter on the same target but different selection
criteria (build-side conjuncts).

We currently don't support caching ExchangeNode, but a common scenario
is a runtime filter produced by a HashJoin, with an Exchange on the
build side. This change looks through the first ExchangeNode when
considering the cache key and eligibility of the build-side source for a
runtime filter.

Testing shows all tests now passing for test_tuple_cache_tpc_queries
except those that hit "TupleCacheNode does not enforce limits itself and
cannot have a limit set."

Adds planner tests covering some scenarios where runtime filters are
expected to match or differ, and custom cluster tests for multi-node
testing.

Change-Id: I0077964be5acdb588d76251a6a39e57a0f42bb5a
Reviewed-on: http://gerrit.cloudera.org:8080/21729
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Joe McDonnell <joemcdonnell@cloudera.com>
2024-10-04 17:10:29 +00:00
Michael Smith
6642b75efc IMPALA-13402: Clean up test_tuple_cache dimensions
Uses exec_option_dimension to specify exec options, and avoids starting
a cluster when the test would just be skipped.

Uses other standard helpers to replace custom methods that were less
flexible.

Change-Id: Ib241f1f1cfaf918dffaddd5aeef3884c70e0a3fb
Reviewed-on: http://gerrit.cloudera.org:8080/21859
Reviewed-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
2024-10-01 21:32:40 +00:00
Joe McDonnell
831585c2f5 IMPALA-12906: Incorporate scan range information into the tuple cache key
This change is accomplishing two things:
1. It incorporates scan range information into the tuple
   cache key.
2. It reintroduces deterministic scheduling as an option
   for mt_dop and uses it for HdfsScanNodes that feed
   into a TupleCacheNode.

The combination of these two things solves several problems:
1. When a table is modified, the list of scan ranges will
   change, and this will naturally change the cache keys.
2. When accessing a partitioned table, two queries may have
   different predicates on the partition columns. Since the
   predicates can be satisfied via partition pruning, they are
   not included at runtime. This means that two queries
   may have identical compile-time keys with only the scan
   ranges being different due to different partition pruning.
3. Each fragment instance processes different scan ranges, so
   each will have a unique cache key.

To incorporate scan range information, this introduces a new
per-fragment-instance cache key. At compile time, the planner
now keeps track of which HdfsScanNodes feed into a TupleCacheNode.
This is passed over to the runtime as a list of plan node ids
that contain scan ranges. At runtime, the fragment instance
walks through the list of plan node ids and hashes any scan ranges
associated with them. This hash is the per-fragment-instance
cache key. The combination of the compile-time cache key produced
by the planner and the per-fragment-instance cache key is a unique
identifier of the result.
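
A hedged sketch of combining the compile-time key with an instance-specific
hash of the assigned scan ranges; the types and names are illustrative, not
Impala's actual runtime structures:

  #include <cstdint>
  #include <functional>
  #include <map>
  #include <string>
  #include <vector>

  struct ScanRangeSketch {
    std::string path;
    int64_t offset;
    int64_t length;
    int64_t mtime;
  };

  std::string FragmentInstanceKey(
      const std::string& compile_time_key,
      const std::vector<int>& cached_scan_node_ids,
      const std::map<int, std::vector<ScanRangeSketch>>& ranges_by_node_id) {
    size_t h = 0;
    auto combine = [&h](size_t v) {
      h ^= v + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
    };
    // Walk the scan nodes the planner flagged as feeding a TupleCacheNode
    // and hash the scan ranges this instance was assigned for each of them.
    for (int node_id : cached_scan_node_ids) {
      auto it = ranges_by_node_id.find(node_id);
      if (it == ranges_by_node_id.end()) continue;
      for (const ScanRangeSketch& r : it->second) {
        combine(std::hash<std::string>{}(r.path));
        combine(std::hash<int64_t>{}(r.offset));
        combine(std::hash<int64_t>{}(r.length));
        combine(std::hash<int64_t>{}(r.mtime));
      }
    }
    // The full cache key is the compile-time key plus the per-instance part.
    return compile_time_key + "-" + std::to_string(h);
  }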

Deterministic scheduling for mt_dop was removed via IMPALA-9655
with the introduction of the shared queue. This revives some of
the pre-IMPALA-9655 scheduling logic as an option. Since the
TupleCacheNode knows which HdfsScanNodes feed into it, the
TupleCacheNode turns on deterministic scheduling for all of those
HdfsScanNodes. Since this only applies to HdfsScanNodes that feed
into a TupleCacheNode, it means that any HdfsScanNode that doesn't
feed into a TupleCacheNode continues using the current algorithm.
The pre-IMPALA-9655 code is modified to make it more deterministic
about how it assigns scan ranges to instances.

Testing:
 - Added custom cluster tests for the scan range information
   including modifying a table, selecting from a partitioned
   table, and verifying that fragment instances have unique
   keys
 - Added basic frontend test to verify that deterministic scheduling
   gets set on the HdfsScanNodes that feed into the TupleCacheNode.
 - Restored the pre-IMPALA-9655 backend test to cover the LPT code

Change-Id: Ibe298fff0f644ce931a2aa934ebb98f69aab9d34
Reviewed-on: http://gerrit.cloudera.org:8080/21541
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Michael Smith <michael.smith@cloudera.com>
Reviewed-by: Yida Wu <wydbaggio000@gmail.com>
2024-09-04 16:31:31 +00:00
Michael Smith
6121c4f7d6 IMPALA-12905: Disk-based tuple caching
This implements on-disk caching for the tuple cache. The
TupleCacheNode uses the TupleFileWriter and TupleFileReader
to write and read back tuples from local files. The file format
uses RowBatch's standard serialization used for KRPC data streams.

The TupleCacheMgr is the daemon-level structure that coordinates
the state machine for cache entries, including eviction. When a
writer is adding an entry, it inserts an IN_PROGRESS entry before
starting to write data. This does not count towards cache capacity,
because the total size is not known yet. This IN_PROGRESS entry
prevents other writers from concurrently writing the same entry.
If the write is successful, the entry transitions to the COMPLETE
state and updates the total size of the entry. If the write is
unsuccessful and a new execution might succeed, then the entry is
removed. If the write is unsuccessful and won't succeed later
(e.g. if the total size of the entry exceeds the max size of an
entry), then it transitions to the TOMBSTONE state. TOMBSTONE
entries avoid the overhead of trying to write entries that are
too large.

Given these states, when a TupleCacheNode is doing its initial
Lookup() call, one of three things can happen:
 1. It can find a COMPLETE entry and read it.
 2. It can find an IN_PROGRESS/TOMBSTONE entry, which means it
    cannot read or write the entry.
 3. It finds no entry and inserts its own IN_PROGRESS entry
    to start a write.
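
An illustrative sketch of the entry states and the three Lookup() outcomes;
this is not the TupleCacheMgr API, just the state logic:

  #include <string>
  #include <unordered_map>

  enum class EntryState { IN_PROGRESS, COMPLETE, TOMBSTONE };
  enum class LookupAction { READ_ENTRY, SKIP_CACHE, START_WRITE };

  class CacheSketch {
   public:
    LookupAction Lookup(const std::string& key) {
      auto it = entries_.find(key);
      if (it == entries_.end()) {
        // 3. No entry: insert our own IN_PROGRESS entry and start a write.
        entries_[key] = EntryState::IN_PROGRESS;
        return LookupAction::START_WRITE;
      }
      switch (it->second) {
        case EntryState::COMPLETE:
          return LookupAction::READ_ENTRY;  // 1. Read the cached result.
        case EntryState::IN_PROGRESS:
        case EntryState::TOMBSTONE:
          return LookupAction::SKIP_CACHE;  // 2. Can neither read nor write.
      }
      return LookupAction::SKIP_CACHE;
    }

    // A successful write becomes COMPLETE; a failure that might succeed on a
    // later run removes the entry; an oversized entry becomes a TOMBSTONE.
    void WriteComplete(const std::string& key) {
      entries_[key] = EntryState::COMPLETE;
    }
    void WriteFailed(const std::string& key) { entries_.erase(key); }
    void WriteTooLarge(const std::string& key) {
      entries_[key] = EntryState::TOMBSTONE;
    }

   private:
    std::unordered_map<std::string, EntryState> entries_;
  };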

The tuple cache is configured using the tuple_cache parameter,
which is a combination of the cache directory and the capacity,
similar to the data_cache parameter. For example, /data/0:100GB
uses directory /data/0 for the cache with a total capacity of
100GB. This currently supports a single directory, but it can
be expanded to multiple directories later if needed. The cache
eviction policy can be specified via the tuple_cache_eviction_policy
parameter, which currently supports LRU or LIRS. The tuple_cache
parameter cannot be specified if allow_tuple_caching=false.
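
A small sketch of splitting a <directory>:<capacity> value such as
/data/0:100GB; this is purely illustrative, not Impala's flag parsing:

  #include <iostream>
  #include <string>

  struct TupleCacheConfigSketch {
    std::string directory;
    std::string capacity;  // e.g. "100GB", left unparsed in this sketch
  };

  TupleCacheConfigSketch ParseTupleCacheFlag(const std::string& value) {
    // Split on the last ':' so the directory portion keeps any earlier ':'.
    size_t pos = value.rfind(':');
    if (pos == std::string::npos) return {value, ""};
    return {value.substr(0, pos), value.substr(pos + 1)};
  }

  int main() {
    TupleCacheConfigSketch cfg = ParseTupleCacheFlag("/data/0:100GB");
    std::cout << cfg.directory << " capacity=" << cfg.capacity << "\n";
    return 0;
  }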

This contains contributions from Michael Smith, Yida Wu,
and Joe McDonnell.

Testing:
 - This adds basic custom cluster tests for the tuple cache.

Change-Id: I13a65c4c0559cad3559d5f714a074dd06e9cc9bf
Reviewed-on: http://gerrit.cloudera.org:8080/21171
Reviewed-by: Michael Smith <michael.smith@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenkins@cloudera.com>
Reviewed-by: Kurt Deschler <kdeschle@cloudera.com>
2024-04-10 03:11:49 +00:00