From 4031dd0e383666e000a2bf922abd50d69376973c Mon Sep 17 00:00:00 2001
From: Martin Atkins
Date: Wed, 1 Oct 2025 15:39:32 -0700
Subject: [PATCH] rfc: A new approach to configuration evaluation, planning,
 and applying

This is a follow-up to our earlier RFC describing some drawbacks and
limitations of the current OpenTofu language runtime and proposing to move
to a new approach. Whereas the previous RFC primarily focused on defining
the problem, this document aims to propose the start of a solution, in the
form of a high-level architectural model that we can hopefully find
consensus on before we move on to discussing the associated implementation
details.

Signed-off-by: Martin Atkins
---
 rfc/20251001-eval-plan-apply-architecture.md | 1340 ++++++++++++++++++
 1 file changed, 1340 insertions(+)
 create mode 100644 rfc/20251001-eval-plan-apply-architecture.md

diff --git a/rfc/20251001-eval-plan-apply-architecture.md b/rfc/20251001-eval-plan-apply-architecture.md
new file mode 100644
index 0000000000..5a42ff0f52
--- /dev/null
+++ b/rfc/20251001-eval-plan-apply-architecture.md
@@ -0,0 +1,1340 @@
+# A new approach to configuration evaluation, planning, and applying
+
+This RFC is a continuation of the design discussion previously started in
+[Revisiting OpenTofu's core Architecture](./20250728-execution-architecture.md).
+That document focused mainly on the problems we were hoping to address, while
+_this_ document describes a specific technical design that addresses many of
+those problems.
+
+At the time of writing we already have an initial practical implementation
+of most (but not all) of these ideas in [opentofu/opentofu#3191](https://github.com/opentofu/opentofu/pull/3191).
+This writeup aims to be at a more conceptual level, discussing the overall
+architectural ideas without getting into the implementation details too much.
+As a result, some of the descriptions and diagrams intentionally gloss over
+some implementation-level complexity, and so what's described here doesn't
+_exactly_ match the initial implementation.
+
+In practice we expect that those implementation details will continue to evolve
+as we iterate. The goal of this RFC is only to document the general direction
+we'd be heading in, so that we can hopefully find consensus on the architectural
+goals and then use that to inform the ongoing implementation work.
+
+This document assumes that the reader is already familiar with the historical
+context described in the previous document. Instead of repeating all of that
+again in different words, we'll just jump directly into describing the proposed
+design direction.
+
+## Table of Contents
+
+- [Overview](#overview)
+- [The Evaluator](#the-evaluator)
+  - [Compilation vs. Evaluation](#compilation-vs-evaluation)
+  - [Concurrent Dynamic Analysis](#concurrent-dynamic-analysis)
+  - [Interaction with Phase-specific Code](#interaction-with-phase-specific-code)
+- [The Planning Engine](#the-planning-engine)
+  - [Provider Instances During the Planning Phase](#provider-instances-during-the-planning-phase)
+  - ["Deposed" Objects](#deposed-objects)
+- [The Apply Engine](#the-apply-engine)
+  - [Saving Execution Graphs to Disk](#saving-execution-graphs-to-disk)
+- [Known Concerns and Open Questions](#known-concerns-and-open-questions)
+  - [Overall Implementation Strategy](#overall-implementation-strategy)
+  - [Provider-defined Functions on Configured Providers](#provider-defined-functions-on-configured-providers)
+
+> [!NOTE]
+>
+> Some of the sections end with callout notes like this one, which refer to
+> concrete implementations of the ideas just discussed, drawn from our initial
+> implementation sketch.
+>
+> At this draft stage those are imprecise references that just name symbols
+> and package paths. We're hoping to land a form of that implementation sketch
+> in the `main` branch in the near future as a starting point for ongoing
+> implementation work, and if we do so then I intend to retroactively update
+> this document to link directly to that merged code for easier navigation.
+> I'm just deferring that for now because the specific commits on that branch
+> are likely to get rewritten and thus garbage collected, so direct
+> permalinks to files in those commits would stop working.
+
+## Overview
+
+Overall, this new design approach draws some more explicit boundaries between
+different parts of the system that are not so strongly delineated today.
+
+There are multiple ways to think about these boundaries. The way we most
+commonly talk about them is to describe OpenTofu as doing work in _phases_. 
This
+new architectural approach has two main phases:
+
+```mermaid
+flowchart LR
+
+classDef artifact stroke:#ddd,fill:#eee;
+classDef subsystem fill:#9fc5e8,stroke:#90c0e0,font-weight:bold;
+
+subgraph Planning
+direction TB
+
+plan_config[Configuration]
+class plan_config artifact
+
+plan_eval([Evaluator])
+class plan_eval subsystem
+plan_engine([Planning
+Engine])
+class plan_engine subsystem
+
+plan_desired_inst@{shape: st-rect, label: "Desired\nResource Instances"}
+class plan_desired_inst artifact
+planned_inst@{shape: st-rect, label: "Resource Instance\nPlanned Values"}
+class planned_inst artifact
+plan_eval-->plan_desired_inst-->plan_engine-->planned_inst-->plan_eval
+
+plan_config-->plan_eval
+
+plan_state([State Storage])
+class plan_state subsystem
+
+plan_prior_states@{shape: st-rect, label: "Resource Instance\nPrior States"}
+class plan_prior_states artifact
+plan_state-->plan_prior_states-->plan_engine
+
+end
+
+execgraph@{label: "Execution\nGraph"}
+class execgraph artifact
+
+Planning-->execgraph-->Applying
+
+subgraph Applying
+direction TB
+
+apply_config[Configuration]
+class apply_config artifact
+
+apply_engine([Apply
+Engine])
+class apply_engine subsystem
+apply_eval([Evaluator])
+class apply_eval subsystem
+
+apply_desired_inst@{shape: st-rect, label: "Desired\nResource Instances"}
+class apply_desired_inst artifact
+
+apply_state([State Storage])
+class apply_state subsystem
+
+final_inst@{shape: st-rect, label: "Resource Instance\nFinal Values"}
+class final_inst artifact
+final_state@{shape: st-rect, label: "Resource Instance\nFinal States"}
+class final_state artifact
+apply_prior_states@{shape: st-rect, label: "Resource Instance\nPrior States"}
+class apply_prior_states artifact
+
+apply_config-->apply_eval
+apply_eval-->apply_desired_inst-->apply_engine-->final_inst-->apply_eval
+apply_state-->apply_prior_states-->apply_engine-->final_state-->apply_state
+
+end
+```
+
+* **Planning:** Broadly, this means using the provided configuration to
+  discover the "desired state" -- which resource instances ought to exist --
+  and the relationships between them and the providers that manage them. It
+  then compares that with the _current_ state and identifies any differences.
+
+  This phase then proposes various specific actions that need to be taken
+  to bring the remote systems closer to the desired state, but instead of
+  immediately performing those actions it describes them as an
+  _execution graph_ for handling in the next phase.
+
+* **Applying:** The planning phase produced a graph of specific actions that
+  should be taken to modify the remote systems, and so the apply phase is
+  responsible for actually performing those actions in a suitable order based
+  on the constraints described in the execution graph.
+
+  In current OpenTofu this phase ends up re-running a lot of the logic from
+  the planning phase in the hope of producing an equivalent result. In this
+  new approach the planning phase instead describes the execution graph in
+  a form that can be saved as part of a plan file, and so the apply engine
+  merely needs to "do what it's told" by reloading that graph and executing
+  the steps exactly as prescribed.
+
+Each of these phases involves different main logic, but they share a significant
+component in common: the _evaluator_. 
The job of the evaluator is to encapsulate +the handling of all of the surface details of the OpenTofu language that exist +to make decentralized development possible, such as calls between modules and +values flowing between them, so that the main phase-specific logic can focus +only on the main objects that actually cause externally-visible side-effects: +resource instances, and the provider instances that manage them. + +This then leads to a different way to think about the system conceptually: +instead of thinking about the different phases of work, we can instead think of +the series of transforms these phases are performing to move from the concepts +that module authors interact with to the concepts that the plan and apply phases +prefer to interact with: + +```mermaid +graph TB + +subgraph exprs[Expression Graph] + direction LR + + var_env([var.environment]) + var_base_cidr([var.base_cidr_block]) + local_tags([local.tags]) + local_subnets([local.subnet_defs]) + output_vpc_id([output.vpc_id]) + output_subnet_ids([output.subnet_ids]) + vpc[aws_vpc.main] + provider@{label: "provider[\"hashicorp/aws\"]"} + + subgraph subnets[aws_subnet.main] + subnet_public@{label: "aws_subnet.main[\"public\"]"} + subnet_private@{label: "aws_subnet.main[\"private\"]"} + end + + var_env-->local_tags + var_base_cidr-->local_subnets + vpc-->local_subnets + local_subnets-->subnets + var_env-->provider + subnets-->output_subnet_ids + vpc-->output_vpc_id + local_tags-->vpc + local_tags-->subnets + provider-->vpc + provider-->subnets + +end + +subgraph ris[Resource Instance Graph] + direction LR + + final_provider@{label: "provider[\"hashicorp/aws\"]"} + final_vpc[aws_vpc.main] + final_subnet_public@{label: "aws_subnet.main[\"public\"]"} + final_subnet_private@{label: "aws_subnet.main[\"private\"]"} + + final_provider-->final_vpc + final_provider-->final_subnet_public + final_vpc-->final_subnet_public + final_vpc-->final_subnet_private + final_provider-->final_subnet_private + +end + +subgraph exec[Execution Graph] + direction LR + + exec_open_provider@{label: "Open Provider\n\"hashicorp/aws\""} + exec_vpc_plan@{label: "Final Plan\naws_vpc.main"} + exec_vpc_apply@{label: "Apply Changes\naws_vpc.main"} + exec_subnet_public_plan@{label: "Final Plan\naws_subnet.main[\"public\"]"} + exec_subnet_private_plan@{label: "Final Plan\naws_subnet.main[\"private\"]"} + exec_subnet_public_apply@{label: "Apply Changes\naws_subnet.main[\"public\"]"} + exec_subnet_private_apply@{label: "Apply Changes\naws_subnet.main[\"private\"]"} + + exec_close_provider@{label: "Close Provider\n\"hashicorp/aws\""} + + exec_open_provider-->exec_vpc_plan + exec_open_provider-->exec_subnet_public_plan + exec_vpc_apply-->exec_subnet_public_plan + exec_vpc_apply-->exec_subnet_private_plan + exec_open_provider-->exec_subnet_private_plan + exec_vpc_plan-->exec_vpc_apply + exec_subnet_public_plan-->exec_subnet_public_apply + exec_subnet_private_plan-->exec_subnet_private_apply + exec_subnet_public_apply-->exec_close_provider + exec_vpc_apply-->exec_close_provider + exec_subnet_private_apply-->exec_close_provider + exec_open_provider-->exec_close_provider + +end + +exprs-->ris +ris-->exec +``` + +- The **expression graph** is the most direct representation of what the module + authors wrote in the configuration: a set of expressions that need to be + evaluated, where some expressions refer to others and therefore all need to + be evaluated in a suitable order. 
+
+  At this level an OpenTofu configuration is perhaps most similar to a
+  spreadsheet, where each cell contains an expression and the expressions in
+  some cells refer to the results of the expressions in other cells. Just
+  like in a spreadsheet engine's implementation, OpenTofu needs to discover
+  the relationships between those expressions and then evaluate them gradually
+  until they have all been transformed into a final value.
+
+  The expression graph is completely encapsulated inside the _evaluator_,
+  because it's calculated identically in all phases and the rest of the
+  system doesn't actually need all of the source-level detail; the
+  evaluator therefore produces the next artifact as its primary result.
+
+- The **resource instance graph** boils the expression graph down to just the
+  essential elements that can potentially cause side-effects during the apply
+  phase: resource instances and the provider instances that manage them.
+
+  This form ignores details such as input variables, output values, and
+  the fact that each module has its own separate namespace. If a resource
+  instance configuration refers to a local value which in turn refers to the
+  results from another resource instance, then in the resource instance graph
+  the local value is disregarded and we instead model just the dependency
+  between the resource instances.
+
+  The planning phase is driven primarily by the resource instance graph: it
+  visits each resource instance reported by the evaluator and asks its
+  associated provider to produce a plan by comparing the effective
+  configuration with the current state. Based on the provider's answer,
+  the planning phase then chooses a set of specific actions that would need
+  to be taken to bring the remote system closer to the desired state, but
+  instead of performing those actions directly it produces the next artifact
+  for the apply phase to use.
+
+- The **execution graph** expands the pure relationship information from the
+  resource instance graph to a graph of specific actions that can cause
+  externally-visible side effects, making sure that those actions happen in
+  a suitable order that respects the dependencies from the resource instance
+  graph and respects OpenTofu's own mechanical constraints.
+
+  For example, if the provider's proposed plan for a resource instance
+  was to replace the existing object with a new one matching the latest
+  configuration then the _execution graph_ must capture the individual steps
+  of destroying the previous object and creating the new object as separate
+  nodes, ordering them appropriately based on the `create_before_destroy`
+  settings in the resource configurations.
+
+  The job of the apply phase is to perform all of the operations described
+  in the execution graph, while ensuring that the dependencies between them
+  are honored and that other constraints, like OpenTofu's configurable
+  concurrency limit, are respected.
+
+The series of phases and the series of graph variants are both valid lenses
+through which to view the proposed new architecture. They are two ways to
+describe the same ideas, rather than two alternative approaches to choose
+between.
+ +Overall then, the main goal of the new approach described in this document is +to draw more explicit boundaries between these phases and between these +artifacts, whereas the current OpenTofu language runtime has all of these +concerns muddled together in a way that makes it hard to follow what role each +component is playing and how the different components depend on one another. + +The subsequent sections describe each of the main components in some more +detail. + +## The Evaluator + +The most central new concept in this proposal is the shared _evaluator_, which +is responsible for interpreting the source language as written by module +authors, evaluating all of the expressions inside of it (which will often refer +to one another), and then ultimately reporting: + +- Which instance keys are declared for each object that supports dynamic + instances using any of the meta-arguments `count`, `for_each`, and `enabled`. +- All of the resource instances that are declared as "desired", the + configuration values associated with each one, the address of the provider + instance that the resource instance is declared to belong to, and the + dependencies between them that are either implied by expression references or + declared explicitly using the `depends_on` meta-argument. +- The configuration values associated with each provider instance, and which + other resource instances the provider instance itself depends on. +- Various other ancillary information that might influence exactly how the + planning engine would ultimately decide to react to a difference between + desired and current state of a resource instance, such as `moved`, `removed`, + and `import` blocks. + + These are all essentially just additional settings belonging to resource + instances, but the surface language allows declaring them in separate blocks + so that those decisions can potentially be made in a different module to the + one where the resources themselves are declared. The evaluator is + therefore responsible for finding those associations and presenting them + as if they were declared as part of the resource instance. + +This intentionally flattens away various details that are useful to module +authors for maintainability and composition, but that don't _directly_ describe +the "desired state": + +- Input variables, local values, and output values: these exist only to allow + factoring out common code and splitting declarations into separate namespaces. + It ultimately doesn't matter to any other part of the system whether one + resource instance refers directly to another or whether that reference is + indirect through an output value across a module boundary. +- The module tree: the rest of the system just sees a flat set of + resource instance addresses, some of which happen to include + module-instance-related prefixes just to honor the separate namespaces between + the different module instances. +- The unexpanded source objects that "expand" into multiple instances during + evaluation: for example, if `aws_instance.foo` uses `count = 2` then + the evaluator announces that `aws_instance.foo[0]` and `aws_instance.foo[1]` + both exist, but doesn't have anything to say about `aws_instance.foo` + _as a whole_ except that its full set of instance keys is `[0, 1]`. 
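+
+To make the evaluator's reporting responsibilities a little more concrete,
+here's a minimal Go sketch of one plausible shape for the record it might emit
+for each desired resource instance. Every name here is an illustrative
+assumption for this document rather than the implementation sketch's actual
+API:
+
+```go
+package sketch
+
+import "github.com/zclconf/go-cty/cty"
+
+// ResourceInstanceAddr and ProviderInstanceAddr stand in for OpenTofu's
+// richer address types; plain strings keep this sketch self-contained.
+type (
+	ResourceInstanceAddr = string
+	ProviderInstanceAddr = string
+)
+
+// DesiredResourceInstance is an illustrative record of what the evaluator
+// might report for each resource instance it discovers in the desired state.
+type DesiredResourceInstance struct {
+	// Addr is the fully-expanded instance address, including any
+	// module-instance prefix needed to keep module namespaces separate.
+	Addr ResourceInstanceAddr
+
+	// ConfigValue is the result of evaluating the instance's configuration,
+	// which may contain unknown values during the planning phase.
+	ConfigValue cty.Value
+
+	// ProviderInstance identifies the provider instance that manages this
+	// resource instance.
+	ProviderInstance ProviderInstanceAddr
+
+	// DependsOn lists other resource instances that must be handled first,
+	// whether implied by expression references or declared with depends_on.
+	DependsOn []ResourceInstanceAddr
+}
+```
+
+The important property is that nothing in such a record mentions input
+variables, local values, or module boundaries: those have already been
+flattened away by the evaluator.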
+
+```mermaid
+flowchart LR
+  classDef artifact stroke:#ddd,fill:#eee
+  classDef subsystem fill:#9fc5e8,stroke:#90c0e0,font-weight:bold
+
+  root_module:::artifact@{ shape: rect, label: "Root Module\nCall" }
+  child_modules:::artifact@{ shape: st-rect, label: "Child Module\nCalls"}
+
+  module_cache:::subsystem@{shape: rounded, label: "Module\nCache"}
+
+
+  module_source:::artifact@{ shape: st-rect, label: "Module\nSource Code"}
+
+subgraph Evaluator
+
+  compiler:::subsystem@{shape: rounded, label: "Module\nCompiler\n(tofu2024)"}
+
+  module_compiled:::artifact@{ shape: st-rect, label: "Compiled\nModule Instances"}
+
+  module_source-->compiler-->module_compiled
+
+  evaluation:::subsystem@{shape: rounded, label: "Evaluation\n(configgraph)"}
+
+  module_compiled-->evaluation
+
+end
+
+  resource_inst:::artifact@{ shape: st-rect, label: "Desired\nResource Instances"}
+  provider_inst:::artifact@{ shape: st-rect, label: "Provider Instance\nConfigurations"}
+
+  evaluation-->resource_inst
+  evaluation-->provider_inst
+
+
+  evaluation-->child_modules-->module_cache-->module_source
+  root_module-->module_cache
+
+  engine:::subsystem@{ shape: rounded, label: "Plan or Apply\nEngine" }
+
+  resource_inst-->engine
+  provider_inst-->engine
+
+  etc@{shape: text, label: "⋯"}
+
+  engine-->etc
+
+  resource_results:::artifact@{ shape: st-rect, label: "Resource Instance\nResult Values"}
+
+  engine-->resource_results-->evaluation
+```
+
+### Compilation vs. Evaluation
+
+The evaluation subsystem is itself split into two parts:
+
+- The "compile" step interacts directly with the HCL library to analyze the
+  source-level language constructs in a particular module, and is responsible
+  for defining the variables and functions available in that module's scope
+  based on what is declared.
+
+  The result is a higher-level description of the relationships between the
+  objects declared in a particular module. In this representation the objects
+  and expressions are represented more abstractly so that the specific
+  interactions with HCL are hidden away.
+
+- The "evaluate" step consumes the result of the "compile" step and actually
+  visits all of the declared objects and evaluates their expressions, collecting
+  any diagnostics for problems that might arise along the way.
+
+  This part of the evaluator is not directly aware of HCL and so could _in
+  principle_ handle objects declared in some other base language, although
+  that remains very theoretical: until we have another concrete language to
+  consider, it's impossible to predict exactly which concepts would be common
+  between that language and HCL.
+
+  More likely in the medium term is to support multiple variations of the
+  HCL-based language, so that we can implement features that require some kind
+  of breaking change to the language while preserving support for older
+  configurations that were written for an earlier edition of the language.
+
+In practice these two parts have a cyclic dependency between them: evaluating
+one module can discover module calls referring to other modules, which must
+then in turn be compiled and evaluated themselves. That causes some complexity
+in the implementation, but it does mean that it's possible in principle for
+different modules in the same configuration to use different implementations
+of "compile", so that we can support multiple editions of the language together
+in the same configuration as long as the differences between those editions
+are private within each module's namespace.
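+
+As a rough illustration of how that compile/evaluate split could allow
+multiple language editions to coexist, consider the following hypothetical
+pair of Go interfaces. The real implementation sketch uses different types in
+`lang/eval/internal/tofu2024` and `lang/eval/internal/configgraph` (see the
+note below), so treat these names purely as assumptions:
+
+```go
+package sketch
+
+// ModuleCompiler is a hypothetical interface implemented once per language
+// edition. The "tofu2024" edition would be one implementation; a future
+// edition with breaking syntax changes could be another.
+type ModuleCompiler interface {
+	// CompileModule analyzes one module's source code and returns an
+	// abstract description of its declared objects and the expressions
+	// that relate them, hiding the HCL-specific details.
+	CompileModule(sources map[string][]byte) (CompiledModule, error)
+}
+
+// CompiledModule is the edition-agnostic result of compilation that the
+// shared evaluation step consumes.
+type CompiledModule interface {
+	// Objects returns the declared objects (resources, locals, module
+	// calls, etc.) whose expressions the evaluator must resolve.
+	Objects() []CompiledObject
+
+	// ChildModuleCalls returns the module calls discovered in this module,
+	// each of which may be compiled by a different ModuleCompiler if it
+	// targets a different language edition.
+	ChildModuleCalls() []ModuleCall
+}
+
+// CompiledObject and ModuleCall are left abstract here; the important point
+// is that nothing in them exposes HCL types directly.
+type (
+	CompiledObject interface{}
+	ModuleCall     interface{}
+)
+```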
+ +> [!NOTE] +> +> If you'd like to learn more about how the compile and evaluate steps might +> be implemented in practice, refer to +> [the current implementation sketch](https://github.com/opentofu/opentofu/pull/3290). +> +> The "compile" step is primarily implemented in +> `lang/eval/internal/tofu2024`, which in turn produces instances +> of types from `lang/eval/internal/configgraph` that interact with +> each other in the evaluation step. + +### Concurrent Dynamic Analysis + +The current OpenTofu language runtime has a number of internal sub-phases of +evaluation, which notably means that analysis of dependencies always happens +completely before any expression evaluation can occur. This means that the +dependency graph for evaluation is built by _static analysis_, where OpenTofu +asks HCL which static references exist in each expression. + +Consider the following example: + +```hcl +variable "instance_count" { + type = number +} + +resource "foo" "example" { + count = var.instance_count +} + +resource "bar" "example" { + count = var.instance_count + + foo_id = foo.example[count.index].id +} +``` + +The current language runtime's static analysis step can detect that the +`foo_id` argument in `bar.example` refers to both `foo.example` and +`count.index`, but it cannot "see" how each of those symbols is used, so +OpenTofu just conservatively assumes that all instances of `bar.example` +depend on all instances of `foo.example`. + +In fact, because the current engine cannot evaluate expressions _at all_ +until the dependency graph has been built, the `count` expressions in each +of the resource blocks cannot be evaluated until the graph has already been +built, and so the plan-time dependency graph is between _whole resources_ +rather than between _resource instances_, and the decision about which instances +exist for each resource isn't made until OpenTofu begins visiting the graph +nodes and evaluating their associated expressions. + +This means that overall OpenTofu's analysis of dependencies is considerably +more conservative than module authors might expect: + +- Even though in practice `bar.example[1]` only needs the id from + `foo.example[1]`, OpenTofu must still wait until `foo.example[0]`'s plan + is complete before planning `bar.example[1]`, reducing opportunities for + concurrent work and thus slowing down the planning phase. +- The `-target` and `-exclude` planning options are documented as accepting + resource instance addresses, but in practice things don't really work properly + when doing so because the graph pruning happens before the `count` expression + has been resolved, and so `-target=bar.example[1]` causes both + `foo.example[0]` and `foo.example[1]` to be included in the plan. +- It isn't possible for one instance of a resource to refer to another instance + of the same resource, because OpenTofu currently wants to evaluate all + instances of a resource before evaluating anything that depends on any + instance of that resource. + +A major design goal for the new evaluator is to perform dependency analysis and +expression evaluation _concurrently_, so that we can use the dynamic results +of evaluating expressions to determine exactly what they refer to. In +particular, the new evaluator can evaluate the `count = var.instance_count` of +each of those resources as soon as the value of `var.instance_count` is +available, without any concern for what any other part of the resource +configuration refers to. 
It then evaluates the `foo_id` argument from
+`bar.example` twice, with `count.index` set to `0` and `1` respectively, and
+then the evaluation system dynamically recognizes that evaluation of those
+expressions must wait until there's a value for `foo.example[0]` or
+`foo.example[1]`, blocking until a concurrent goroutine has provided those
+values.
+
+Finally, the dynamic result of `foo.example[0]` is
+[marked](https://github.com/zclconf/go-cty/blob/main/docs/marks.md) as belonging
+to the zeroth instance of `foo.example`, and so once that expression's result
+is available the evaluator can determine _precisely_ that `bar.example[0]`
+depends only on `foo.example[0]`, and not also on `foo.example[1]`.
+
+This section focused on references between resource instances as the most
+prominent example, but the underlying building blocks that allow us to
+dynamically discover dependencies during evaluation have various other benefits
+too, such as unifying what our current system calls "early evaluation" with
+the main evaluation logic. Instead of having a completely separate evaluator
+for obtaining that early information, we can instead use the same evaluator
+but impose some additional rules on certain parts of the language. For example,
+the logic which evaluates a `source` argument in a `module` block rejects any
+value that we determine (through dynamic analysis) was derived from a resource
+instance attribute, rather than imposing static-analysis constraints about
+which specific _global symbols_ the module source argument is allowed to refer
+to.
+
+> [!NOTE]
+>
+> If you'd like to learn more about how concurrent expression evaluation and
+> dynamic analysis might be implemented in practice, refer to
+> [the current implementation sketch](https://github.com/opentofu/opentofu/pull/3290).
+> The most relevant packages are `lang/eval/internal/configgraph`,
+> `lang/eval/grapheval`, and `lang/exprs`.
+
+### Interaction with Phase-specific Code
+
+Although the evaluator aims to encapsulate as much as possible, there is one
+need that the evaluator cannot solve entirely on its own: deciding the value
+used to represent each resource instance.
+
+Because resource instances are our main vehicle for modelling side-effects,
+their treatment differs a lot between phases:
+
+- During the planning phase, the value representing a resource instance is its
+  "planned new state": a partial prediction of what the object should look like
+  after changes have been applied, decided by the provider itself, with
+  unknown values in any location where the provider cannot predict a result
+  until after the side-effects have occurred.
+- During the applying phase, the value representing a resource instance is its
+  "final new state": the value that the provider returned after making the
+  actual changes to the remote system, which is required to be fully known.
+
+The evaluator deals with these differences by requiring its caller to provide
+an implementation of an interface which has a method that takes a description
+of a "desired resource instance" -- its resource instance address, its
+configuration values, and other meta-information such as which provider instance
+it's associated with -- and returns a value that should be used to represent
+that resource instance in ongoing evaluation.
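+
+A minimal Go sketch of what such an interface might look like follows. The
+implementation sketch calls its equivalent `Glue` (see the note below), but
+the exact method shape shown here is an assumption for illustration only:
+
+```go
+package sketch
+
+import "github.com/zclconf/go-cty/cty"
+
+// DesiredResourceInstance would carry the instance address, configuration
+// value, and associated provider instance, as described earlier.
+type DesiredResourceInstance struct{ /* ... */ }
+
+// EvalGlue is a hypothetical phase-specific hook that the evaluator calls
+// whenever it needs a value to stand in for a resource instance.
+type EvalGlue interface {
+	// ResourceInstanceValue returns the value used to represent the given
+	// resource instance in ongoing expression evaluation:
+	//
+	//   - planning phase: the provider's "planned new state", which may
+	//     contain unknown values for unpredictable attributes;
+	//   - apply phase: the "final new state" returned by the provider
+	//     after the real changes were made, which must be fully known.
+	//
+	// This call may block until upstream work (e.g. planning another
+	// instance that this one depends on) has completed, since the
+	// evaluator and its caller run concurrently.
+	ResourceInstanceValue(inst DesiredResourceInstance) (cty.Value, error)
+}
+```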
+ +The subsequent sections about the plan and apply engines will describe their +different approaches to this in more detail, but the common theme is that -- as +with so much of the behavior described above -- the evaluator and its caller +have a cyclic dependency between them where the configuration for one resource +instance might depend on the result value of another resource instance, and so +again the evaluator runs concurrently with its caller so that the two can +work alongside each other and trade work across the boundary of responsibility +as needed. + +> [!NOTE] +> +> In [the current implementation sketch](https://github.com/opentofu/opentofu/pull/3290), +> the interface used by the evaluator to request information from its caller +> is called `Glue` in the package `lang/eval/internal/evalglue`, +> although there are higher-level wrappers around it to help adapt that +> API to each phase. +> +> For example, the planning engine implements `PlanGlue` instead, which is +> defined in `lang/eval`. + +## The Planning Engine + +The planning engine's job is to combine information produced by the evaluator +with information taken from the prior state and current remote system state to +decide which actions, if any, to take during a subsequent apply phase to change +the remote system to be closer to the desired state. + +```mermaid +flowchart LR + +classDef artifact stroke:#ddd,fill:#eee; +classDef subsystem fill:#9fc5e8,stroke:#90c0e0,font-weight:bold; + +plan_config[Configuration] +class plan_config artifact + +plan_eval([Evaluator]) +class plan_eval subsystem +plan_engine([Planning +Engine]) +class plan_engine subsystem + +plan_desired_inst@{shape: st-rect, label: "Desired\nResource Instances"} +class plan_desired_inst artifact +planned_inst@{shape: st-rect, label: "Resource Instance\nPlanned Values"} +class planned_inst artifact +planned_inst-->plan_eval +plan_eval-->plan_desired_inst-->plan_engine-->planned_inst + +plan_config-->plan_eval + +plan_state([State Storage]) +class plan_state subsystem + +plan_prior_states@{shape: st-rect, label: "Resource Instance\nPrior States"} +class plan_prior_states artifact +plan_state-->plan_prior_states-->plan_engine + + +plan_prior_states2@{shape: st-rect, label: "Resource Instance\nPrior States"} +class plan_prior_states2 artifact +providers@{shape: stadium, label: "Providers"} +class providers subsystem +plan_refreshed_states@{shape: st-rect, label: "Resource Instance\nRefreshed States"} +class plan_refreshed_states artifact + +plan_engine-->plan_prior_states2-->providers-->plan_refreshed_states-->plan_engine + +exec_graph[Execution Graph] +class exec_graph artifact +style exec_graph font-weight:bold; +etc@{shape: text, label: "⋯"} + +plan_engine-->exec_graph-->etc +``` + +The order in which resource instances need to be "planned" is essentially the +same as the order of evaluating the expressions that define their configurations -- +we typically want to plan a resource instance as soon as the needed data is +available -- and so the planning engine uses an inversion-of-control style +where it treats each new piece of information from the evaluator as an event +that can trigger planning work. + +From the perspective of the planning engine there are various different "events" +generated by the evaluator which the planning process reacts to in different +ways: + +- **All resources known for a module instance:** each time a module instance + is compiled, the evaluator reports which resources (of all modes) are declared + in it. 
+
+  The planning engine then checks the prior state for any resources that
+  are not currently declared, and makes plans to delete or forget all
+  instances of them.
+
+- **All child module calls known for a module instance:** similar to the
+  previous point, but for `module` blocks instead of resource declarations.
+
+  In this case, all instances of all resources under the no-longer-desired
+  module calls need to be deleted or forgotten together.
+
+- **Instance keys decided for a resource:** once the `for_each`/etc arguments
+  for a resource block have been resolved, the evaluator reports the full
+  set of desired instance keys for the resource.
+
+  The planning engine then checks the prior state for any instances of that
+  resource which are _not_ desired, and makes plans to delete or forget them.
+
+- **Instance keys decided for a module call:** similar to the previous point,
+  but for the instances of a `module` block rather than for a resource.
+
+  In this case, all instances of all resources under the no-longer-desired
+  module instances need to be deleted or forgotten together.
+
+- **A specific resource instance is desired:** finally, each resource instance
+  that is currently declared in the configuration has its configuration
+  value and associated metadata (i.e. which provider instance it uses)
+  reported.
+
+  This is the main case: the planning engine responds by comparing the
+  desired configuration with the current state (the result of refreshing)
+  and, if necessary, proposes to create, update, or replace the remote
+  object associated with that resource instance.
+
+Each of the above events can therefore cause additional operations to be added
+to the execution graph. The planning engine constructs the execution graph
+gradually, on a per-resource-instance basis, by adding just the
+operations needed for a particular resource instance and then remembering which
+of those operations produces the "final state" of the resource instance and
+which represent opening and closing each provider instance, so that the
+operations for downstream resource instances can be made to await the completion
+of whatever they depend on, without needing to know the fine details of how the
+other parts of the execution graph are shaped.
+ +```mermaid +graph LR + + exec_open_provider@{label: "Open Provider\n\"hashicorp/aws\""} + exec_vpc_plan@{label: "Final Plan\naws_vpc.main"} + exec_vpc_apply@{label: "Apply Changes\naws_vpc.main"} + exec_subnet_public_plan@{label: "Final Plan\naws_subnet.main[\"public\"]"} + exec_subnet_private_plan@{label: "Final Plan\naws_subnet.main[\"private\"]"} + exec_subnet_public_apply@{label: "Apply Changes\naws_subnet.main[\"public\"]"} + exec_subnet_private_apply@{label: "Apply Changes\naws_subnet.main[\"private\"]"} + + exec_close_provider@{label: "Close Provider\n\"hashicorp/aws\""} + + exec_open_provider-->exec_vpc_plan + exec_open_provider-->exec_subnet_public_plan + exec_vpc_apply-->exec_subnet_public_plan + exec_vpc_apply-->exec_subnet_private_plan + exec_open_provider-->exec_subnet_private_plan + exec_vpc_plan-->exec_vpc_apply + exec_subnet_public_plan-->exec_subnet_public_apply + exec_subnet_private_plan-->exec_subnet_private_apply + exec_subnet_public_apply-->exec_close_provider + exec_vpc_apply-->exec_close_provider + exec_subnet_private_apply-->exec_close_provider + exec_open_provider-->exec_close_provider +``` + +Considering again the relatively-simple example from earlier in the document, +the evaluator's report of "`aws_subnet.main["public"]` is desired" would +cause the planning engine to use the associated provider instance to learn if +any changes are needed for that resource instance, and if so the code handling +that specific resource instance would: + +- Look up the graph nodes representing the opening and closing of + `provider["hashicorp/aws"]`, adding them to the graph for the first time + if they aren't already present. + + (In practice, this lazy-population of the two needed nodes is encapsulated in + a helper function that takes a provider instance address and returns references + either to newly-created nodes or nodes created by a previous call to the same + function.) + +- Look up the graph node that provides the final state for `aws_vpc.main`, + which should already have been handled due to the natural order of expression + evaluation in the evaluator. + +- Add a new operation to the graph representing "create final plan" for + `aws_subnet.main["public"]`, which depends on the nodes representing + "open `provider["hashicorp/aws"]`" and the final state of + `aws_vpc.main`. + +- Add a new operation to the graph representing "apply final plan" for + `aws_subnet.main["public"]`, which has the "create final plan" node + and the provider instance open node as dependencies. + +- Add the "apply final plan" node as a dependency of the + "close `provider["hashicorp/aws"]`" node, so that the provider instance will + remain open at least long enough to deal with `aws_subnet.main["public"]`'s + operations. + + (Again in practice this is encapsulated a little so that the code handling + the resource instance can just pass its "apply final plan" result to a + helper function without needing to know exactly how the graph is being + manipulated to recognize that.) + +- Finally, record the "apply final plan" node as the operation that provides + the final state for `aws_subnet.main["public"]`, so that any other downstream + resource instance that might refer to this one (none, in this example) would + be able to find it to record a dependency on it. 
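+
+The following Go sketch illustrates roughly how the graph-building steps
+listed above might read in code. Every type and helper here (`Builder`,
+`ProviderNodes`, and so on) is a hypothetical stand-in for the real
+graph-builder API rather than a copy of it:
+
+```go
+package sketch
+
+// The types below are minimal stand-ins so this sketch is self-contained.
+type (
+	OpNode int // an index identifying one operation in the graph
+	OpCode int // which kind of operation a node performs
+	Addr   = string
+)
+
+const (
+	OpCreateFinalPlan OpCode = iota
+	OpApplyChanges
+)
+
+// DesiredResourceInstance is a trimmed version of the illustrative record
+// shown earlier in this document.
+type DesiredResourceInstance struct {
+	Addr             Addr
+	ProviderInstance Addr
+	DependsOn        []Addr
+}
+
+// Builder accumulates operations and edges for one execution graph.
+type Builder struct{ /* operations, edges, lookup tables, ... */ }
+
+// ProviderNodes lazily creates (or finds) the open/close operations for a
+// provider instance, so that many resource instances can share them.
+func (b *Builder) ProviderNodes(provider Addr) (openOp, closeOp OpNode) { return 0, 1 }
+
+// FinalStateNode returns the operation that produces the final state of
+// the given resource instance, which must already have been planned.
+func (b *Builder) FinalStateNode(inst Addr) OpNode { return 0 }
+
+// SetFinalStateNode records which operation provides inst's final state.
+func (b *Builder) SetFinalStateNode(inst Addr, n OpNode) {}
+
+// AddOperation appends a new operation that depends on the given nodes.
+func (b *Builder) AddOperation(code OpCode, inst Addr, deps ...OpNode) OpNode { return 0 }
+
+// AddDependency makes the "later" operation wait for "earlier" to finish.
+func (b *Builder) AddDependency(later, earlier OpNode) {}
+
+// planManagedInstance sketches how the planning engine might add the
+// operations for one managed resource instance, mirroring the steps above.
+func planManagedInstance(b *Builder, inst DesiredResourceInstance) OpNode {
+	openOp, closeOp := b.ProviderNodes(inst.ProviderInstance)
+
+	// The final plan must wait for the provider to be open and for the
+	// final states of everything this instance depends on.
+	deps := []OpNode{openOp}
+	for _, depAddr := range inst.DependsOn {
+		deps = append(deps, b.FinalStateNode(depAddr))
+	}
+	planOp := b.AddOperation(OpCreateFinalPlan, inst.Addr, deps...)
+
+	// Applying the changes waits for the final plan and the open provider.
+	applyOp := b.AddOperation(OpApplyChanges, inst.Addr, planOp, openOp)
+
+	// Keep the provider instance open at least until our apply completes.
+	b.AddDependency(closeOp, applyOp)
+
+	// Let downstream resource instances find our final state operation.
+	b.SetFinalStateNode(inst.Addr, applyOp)
+	return applyOp
+}
+```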
+
+Overall this means that the code constructing the execution graph is directly
+associated with the code that decides which action to take. That is quite
+different from the current OpenTofu language runtime, where the plan phase
+only decides which action to take and the execution graph is then built
+separately during the apply phase, after having to repeat much of the same work
+to rediscover all of the necessary dependency edges.
+
+After adding the needed nodes and edges to the execution graph, the "plan
+desired resource instance" function then returns the placeholder result value
+for the resource instance, which the evaluator will use to handle any references
+to `aws_subnet.main["public"]` from elsewhere in the configuration. The
+planning engine and evaluator therefore run concurrently and collaborate to
+gradually progress through all of the resource instances declared in the
+configuration, in expression-dependency order.
+
+> [!NOTE]
+>
+> In [the current implementation sketch](https://github.com/opentofu/opentofu/pull/3290),
+> the functionality described above is incomplete and not all connected together
+> into a fully-functional system, because the first attempt at the planning
+> engine reused the current system's model of planning instead of the new
+> execution graph, as a way to reduce scope while prototyping.
+>
+> However, various parts _are_ there to read if you're willing to use your
+> imagination a little to think about how they would connect together in
+> practice:
+>
+> - The evaluator delivers these "events" to the planning engine through
+>   the planning engine's implementation of `eval.PlanGlue` from `lang/eval`,
+>   which is the struct type `planGlue` in the `engine/planning` package.
+> - The above description focused mainly on the case of managed resource
+>   instances that want to plan and apply changes, and the specific
+>   implementations of handling those are in the `plan_managed.go` file in
+>   `engine/planning`.
+>
+>   The code in there is currently stubbed out using existing state and
+>   plan models, like `plans.ChangesSync`, rather than building an execution
+>   graph as the text above describes.
+>
+>   Data and ephemeral resource instances require some different treatment
+>   of course, so they'd have their own separate implementations that are
+>   currently just stubbed out in the sketch tree.
+> - Some machinery for building and running through execution graphs is
+>   sketched out in `engine/internal/execgraph`. The test
+>   `TestBuilder_basics` includes some graph-building logic that illustrates
+>   what a resource instance planning handler might look like.
+> - The evaluator's calls into the `planning.planGlue` methods are in
+>   `lang/eval/config_plan.go`.
+
+### Provider Instances During the Planning Phase
+
+We often think of OpenTofu's phases in an "idealized" way where all of the
+side-effects happen during the apply phase and the planning phase is mainly just
+gathering existing data from elsewhere.
+
+However, there are certain "ephemeral objects" that are active only for the
+duration of a single phase and that must therefore be opened and closed during
+both the plan and apply phases. The two current examples of types of
+"ephemeral object" are provider instances and ephemeral resource instances,
+both of which must be "opened" during the planning phase, stay open until
+various other work has been performed, and then "closed" at some point after
+all of the associated work is complete. 
We call these "ephemeral" because
+they get activated in both the plan and apply phases, and we expect that their
+configurations and behavior might reasonably differ between those phases even
+within the same plan/apply round.
+
+Because the current OpenTofu language runtime has a separate static analysis
+step before it begins dynamic evaluation during planning, it is able to perform
+a conservative analysis of the relationships between resource declarations and
+provider config declarations, predicting when all of the work for a provider
+instance or ephemeral resource instance should definitely have completed, and
+can therefore close each ephemeral object at that point. This analysis is
+imprecise, but "fails open" by sometimes keeping ephemeral objects open longer
+than necessary, with no possibility of closing them too early.
+
+Because the proposed new planning engine runs in a single pass that performs
+analysis and evaluation concurrently, it does _not_ have a
+precomputed approximation of when it would definitely be safe to close an
+ephemeral object early, and so this new engine prioritizes simplicity by
+following the most conservative behavior: opening an ephemeral object the
+first time it is needed, and then leaving it open for the remainder of the
+planning phase until all other work has completed. The ephemeral objects will
+then be closed in the opposite order to how they were opened, so that e.g.
+if a provider instance depends on an ephemeral resource instance then the
+provider instance will be closed before the ephemeral resource instance is
+closed.
+
+(This section was titled "Provider Instances..." because it's a more
+recognizable concept than "Ephemeral Objects", but in practice the same rules
+apply mostly equally to both kinds of ephemeral object.)
+
+> [!NOTE]
+>
+> In [the current implementation sketch](https://github.com/opentofu/opentofu/pull/3290),
+> there is an attempt to mostly-precisely precalculate the relationships between
+> provider instances, ephemeral resource instances, and non-ephemeral resource
+> instances using an additional preparation run of the evaluator, so that the
+> planning phase can sometimes close provider instances and ephemeral resource
+> instances early once all of the users of those ephemeral objects have finished
+> their work. That mechanism is equivalent in _purpose_ to the current language
+> runtime's static analysis work, but implemented as an additional phase of
+> dynamic analysis using the same evaluator instead of as a separate codepath
+> focused only on static analysis.
+>
+> Our current hypothesis is that it's rare in practice for any non-trivial
+> provider to be able to close significantly before the end of the planning
+> phase, even with the current system's attempt at dependency analysis.
+> Although early closing is sometimes possible for smaller utility providers
+> that serve only a limited purpose, those tend to have relatively light
+> resource usage, and so the increased complexity of trying to precisely track
+> the dependencies of provider instances is unlikely to have benefits that
+> exceed the complexity costs in most practical uses of OpenTofu.
+>
+> We're intending to start with the simpler design described above, where
+> ephemeral objects are opened on demand and then remain open until the
+> end of the planning phase, but we have at least two alternate design
+> possibilities -- the one in the current implementation sketch, and another
+> variation that achieves a similar result with slightly less precision based on
+> the static declarations in the configuration -- that we would be able to
+> retrofit if we learn that the simpler approach proposed above is insufficient.
+> Therefore we'll wait until we know of a specific problem to be solved before
+> considering a more complex design to address that problem.
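+
+As a minimal sketch of the simpler "open on first use, close in reverse
+order" behavior proposed above, an ephemeral-object registry in Go might look
+something like this; `EphemeralObject` and the registry shape are assumptions
+for illustration, not the engine's real types:
+
+```go
+package sketch
+
+import "sync"
+
+// EphemeralObject is a hypothetical interface implemented by provider
+// instances and ephemeral resource instances alike.
+type EphemeralObject interface {
+	Open() error
+	Close() error
+}
+
+// ephemeralRegistry opens each ephemeral object at most once, on demand,
+// and remembers the order so everything can be closed LIFO at the end of
+// the planning phase.
+type ephemeralRegistry struct {
+	mu     sync.Mutex
+	opened []EphemeralObject
+	known  map[EphemeralObject]bool
+}
+
+func newEphemeralRegistry() *ephemeralRegistry {
+	return &ephemeralRegistry{known: make(map[EphemeralObject]bool)}
+}
+
+// OpenOnce opens obj the first time it's requested; later calls are no-ops.
+func (r *ephemeralRegistry) OpenOnce(obj EphemeralObject) error {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	if r.known[obj] {
+		return nil
+	}
+	if err := obj.Open(); err != nil {
+		return err
+	}
+	r.known[obj] = true
+	r.opened = append(r.opened, obj)
+	return nil
+}
+
+// CloseAll closes everything in the opposite order to how it was opened,
+// so that a provider instance closes before any ephemeral resource
+// instance that its configuration depended on.
+func (r *ephemeralRegistry) CloseAll() error {
+	r.mu.Lock()
+	defer r.mu.Unlock()
+	var firstErr error
+	for i := len(r.opened) - 1; i >= 0; i-- {
+		if err := r.opened[i].Close(); err != nil && firstErr == nil {
+			firstErr = err
+		}
+	}
+	r.opened, r.known = nil, make(map[EphemeralObject]bool)
+	return firstErr
+}
+```
+
+A real implementation would want to avoid holding the registry lock while an
+object's `Open` call blocks, but that detail is omitted here for brevity.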
+
+### "Deposed" Objects
+
+The descriptions above focused mainly on comparing the current objects from
+the desired state with the current objects from the prior state, but there
+is one more category of objects in the prior state that is completely out
+of scope for the evaluator and entirely the concern of the planning engine:
+the so-called "deposed" objects of a resource instance.
+
+Deposed objects can appear whenever the object associated with a resource
+instance is being replaced in "create before destroy" order. In that mode,
+during the replace process there are briefly two objects associated with the
+same resource instance, and so OpenTofu needs to keep track of both of them.
+
+Therefore during the apply phase the replace process begins by marking the old
+object as "deposed", which in practice moves it to a different part of the state
+for the resource instance so that the storage for the _current_ object is left
+vacant. OpenTofu then creates the new object and stores it as the new current
+object, before attempting to destroy the previous object that is now deposed.
+
+In most cases all of this happens together in a single round's apply phase, and
+after all of the work is done the state is left in a normal shape where the
+resource instance has only one current object and no deposed ones.
+However, if the attempt to delete the deposed object fails then OpenTofu must
+abort the apply run, but it still needs to remember that there's an unwanted
+object that hasn't been destroyed yet, so the deposed object ends up being
+saved as part of the persisted state snapshot at the end of the apply phase.
+
+The next planning phase after that will therefore find the deposed object, and
+regardless of what the desired state might have to say about the associated
+resource instance a deposed object must _always_ be planned for deletion.
+
+Therefore the planning engine deals with deposed objects from the prior state
+as a special case, visiting each one largely independently of the work the
+evaluator is doing. The only exception is that planning to delete a deposed
+object sometimes requires making a request to the provider instance that most
+recently managed it (whose identity is recorded in the state), and so the
+planning engine waits for the configuration of that provider instance to be
+available before planning the deletion. If the associated provider instance
+is no longer declared in the configuration then the planning process fails
+with an error prompting the operator to re-add that provider instance at least
+long enough to deal with destroying all of its remaining undesired objects.
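+
+As a rough Go illustration of that special-case handling, with every name
+here being a hypothetical stand-in rather than the planning engine's real
+API:
+
+```go
+package sketch
+
+import "fmt"
+
+// ProviderInstance stands in for a configured provider instance.
+type ProviderInstance struct{}
+
+// DeposedObject is an illustrative record of a deposed object found in the
+// prior state, including which provider instance most recently managed it.
+type DeposedObject struct {
+	InstanceAddr string // the resource instance the object belonged to
+	DeposedKey   string // distinguishes multiple deposed objects, if any
+	ProviderAddr string // recorded in state when the object was created
+}
+
+// planDeposedObjects sketches the planning engine's independent walk over
+// deposed objects. providerConfig blocks until the evaluator has produced
+// the configuration for the given provider instance, or reports that the
+// provider instance is no longer declared.
+func planDeposedObjects(
+	deposed []DeposedObject,
+	providerConfig func(addr string) (ProviderInstance, bool),
+	addDeleteOperation func(obj DeposedObject, p ProviderInstance),
+) error {
+	for _, obj := range deposed {
+		p, declared := providerConfig(obj.ProviderAddr)
+		if !declared {
+			// Deposed objects must always be planned for deletion, so a
+			// missing provider configuration is a fatal planning error.
+			return fmt.Errorf(
+				"provider %s is no longer declared, but is needed to destroy the deposed object for %s",
+				obj.ProviderAddr, obj.InstanceAddr,
+			)
+		}
+		// Always delete; the desired state is irrelevant for deposed objects.
+		addDeleteOperation(obj, p)
+	}
+	return nil
+}
+```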
+
+## The Apply Engine
+
+```mermaid
+flowchart LR
+
+classDef artifact stroke:#ddd,fill:#eee;
+classDef subsystem fill:#9fc5e8,stroke:#90c0e0,font-weight:bold;
+
+etc@{shape: text, label: "⋯"}
+
+execgraph@{shape: rect, label: "Execution\nGraph"}
+class execgraph artifact
+
+apply_engine([Apply
+Engine])
+class apply_engine subsystem
+
+etc-->execgraph-->apply_engine
+
+apply_config[Configuration]
+class apply_config artifact
+
+apply_eval([Evaluator])
+class apply_eval subsystem
+
+apply_desired_inst@{shape: st-rect, label: "Desired\nResource Instances"}
+class apply_desired_inst artifact
+
+apply_state([State Storage])
+class apply_state subsystem
+
+final_inst@{shape: st-rect, label: "Resource Instance\nFinal Values"}
+class final_inst artifact
+final_state@{shape: st-rect, label: "Resource Instance\nFinal States"}
+class final_state artifact
+apply_prior_states@{shape: st-rect, label: "Resource Instance\nPrior States"}
+class apply_prior_states artifact
+
+apply_config-->apply_eval
+apply_eval-->apply_desired_inst-->apply_engine-->final_inst-->apply_eval
+apply_state-->apply_prior_states-->apply_engine-->final_state-->apply_state
+
+apply_prior_states2@{shape: st-rect, label: "Resource Instance\nFinal Plans"}
+class apply_prior_states2 artifact
+providers@{shape: stadium, label: "Providers"}
+class providers subsystem
+apply_refreshed_states@{shape: st-rect, label: "Resource Instance\nFinal Values"}
+class apply_refreshed_states artifact
+
+apply_engine-->apply_prior_states2-->providers-->apply_refreshed_states-->apply_engine
+```
+
+The responsibility of the apply engine is to faithfully perform the operations
+described in the execution graph, in an order that honors the dependencies
+between those operations. In particular it is _not_ responsible for computing
+or modifying the execution graph itself: the intention is that all of the logic
+for deciding what to do lives in the planning engine, and then the apply
+engine is the runtime that finally performs the work.
+
+Whereas the planning engine is an event-driven system reacting to the arrival
+of new information from the evaluator, the apply engine is instead driven
+primarily by performing the operations in the execution graph in an imperative
+style, without explicit inversion-of-control. If the execution graph was
+constructed correctly then the evaluator should always be able to provide the
+configuration information for a resource instance or provider instance
+immediately on request, because all of the upstream work needed to evaluate
+those expressions would already have been performed when visiting an earlier
+operation in the execution graph.
+
+From the perspective of the apply engine, the set of operations in the execution
+graph is effectively just a set of functions that all get called concurrently,
+with each one blocking until its operand values have been provided by upstream
+nodes. Once all of the operands to a particular operation are available, it
+can begin performing its main work, and the result it returns might then allow
+other operations to start their own main work.
+
+The nodes of the execution graph are pretty high-level building blocks, though:
+operations like "open provider instance", "create final plan for resource
+instance", etc. Each of these operations represents a set of side-effects that
+must always run together as a single unit. 
In particular, an +"apply changes for resource instance" operation includes both the provider +request to apply the changes _and_ the state storage request to save the updated +state because the first should never happen without the second unless the +_state storage_ fails, and even in that case we need to be able to recover in +a clear and robust way to avoid data loss. + +A crucial difference compared to the current OpenTofu language runtime is that, +aside from when requesting information such as desired state from outside of the +apply engine, data flows between the operations as function return values and +arguments, rather than all of the graph nodes collaborating to read and write a +single central mutable data structure. This should therefore make the +implementation of each of the operations easier to write, read, and maintain. + +> [!NOTE] +> +> The apply engine itself doesn't currently exist at all in +> [the current implementation sketch](https://github.com/opentofu/opentofu/pull/3290), +> but there is an initial implementation of building, compiling and executing +> execution graphs in `engine/internal/execgraph`. +> +> The test `TestCompiler_resourceInstanceBasics` performs the full process of +> building, compiling, and executing a graph, but for compilation and execution +> it uses a mock implementation of the interface that the apply engine would +> ultimately need to implement to allow external data like desired state to +> flow into the execution graph. The apply engine itself would therefore consist +> mainly of an entrypoint that compiles and begins execution of the graph, +> and an implementation of `execgraph.ExecContext` that mediates all of the +> interactions with other parts of the system, like the evaluator, providers, +> and state storage. + +### Saving Execution Graphs to Disk + +Although many OpenTofu users run the plan and apply phases consecutively in +a single `tofu apply` command, in automation scenarios it's often necessary +to split the phases across two OpenTofu executions, which we allow through +the `tofu plan` command's ability to save a plan to disk and then pass the +created file to a subsequent `tofu apply` command possibly running elsewhere. + +A major reason why the current language runtime rebuilds its equivalent of the +"execution graph" during the apply phase, even though that means repeating a +bunch of work the planning phase already did, is that the current "graph" model +is designed only to be an in-memory data structure, involving arbitrary pointers +between heap-allocated objects. There is no straightforward way to save that +data structure to disk, not only because it's arbitrarily scattered throughout +memory but because it includes non-serializable data such as function pointers +with associated closures whose data is not directly visible to serialization +code. + +With the luxury of a fresh start, this document proposes a new model of +execution graph that is handled in multiple phases: + +1. The "graph builder" API helps multiple callers collaborate to construct + a single, internally-consistent execution graph, and once complete it + produces that in a "source" format that is pure data, without any + associated code: + + - An array of "operations", where each has an integer opcode defining the + type of operation and its operands can refer to other operations using + their indices into the same array. 
+   - Secondary lookup tables for external data that the operations rely on,
+     such as resource instance addresses that we need desired or prior states
+     for, or "initial planned state" values generated during the planning phase.
+   - A mapping table from resource instance addresses to the operations that
+     will ultimately provide their "final state" objects.
+
+   This structure is designed intentionally to be pure data that can be
+   serialized to a file on disk and then reloaded from disk to produce an
+   equivalent graph, but on the other hand it's purely a description of a
+   set of high-level work to do and not actually directly executable.
+
+2. The "graph _compiler_" takes the graph data structure from step one, along
+   with an implementation of an interface that lets the apply engine interact
+   with the evaluator and other parts of the system, and produces a different
+   form of the graph that consists just of a slice of function pointers that
+   should all run concurrently, each in a separate goroutine.
+
+   The compiler arranges for each of those function pointers to have an
+   associated closure that has accessors for the upstream dependencies in
+   scope, and so they are essentially just normal code which uses those
+   accessors to retrieve the values and then uses them to perform whatever
+   real work the opcode implies, returning diagnostics if any problems occur.
+
+Ultimately this leads to quite a similar situation to the current runtime, where
+each graph node has a separate goroutine that blocks until its dependencies
+are available and then does some work, but arranged in a different way so that
+the work to decide the nodes and edges of the graph is separated from the
+work of preparing that graph for execution. Under this new model we can
+therefore finalize the graph topology during the plan phase and reuse
+_exactly that graph_ during the apply phase, without having to first rediscover
+all of the relationships again using the configuration and state data, with the
+risk of producing a slightly different outcome in that second attempt.
+
+> [!NOTE]
+>
+> [The current implementation sketch](https://github.com/opentofu/opentofu/pull/3290)
+> includes graph building and compilation, but does not yet include
+> serialization and deserialization of source graphs.
+>
+> The most likely way to serialize the "source graph" form is as a set of lists
+> of objects in any serialization format that has such concepts. The references
+> between "nodes" are effectively just indices into those lists, with a
+> separate list per type of node just because the data we'd want to store for
+> each is different.
+>
+> ---
+>
+> One notable concern with the current model is that the graph builder uses
+> generic types to help the author of the planning engine code wire operations
+> together correctly based on the result type expected for each opcode, but
+> that generic type information would be lost in serialization because the
+> references to operations would just be plain indices into the operations
+> list.
+>
+> However, each distinct opcode has a single result type that it always
+> produces, and so with some light rearranging of the code we should be able
+> to recover that type information during deserialization by inferring a
+> result type from the opcode of the target of a reference. The generic type
+> information is primarily for the benefit of the graph _builder_, and is not
+> actually crucial for compilation and execution, although the compiler does
+> nonetheless re-check that the source graph appears to be constructed
+> correctly, primarily for robustness against type-assertion panics during
+> the main apply work.
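+
+To make that "pure data" source form more concrete, here's a hedged Go sketch
+of what such a serializable structure might look like; the field layout and
+names are assumptions for this document rather than the actual `execgraph`
+types:
+
+```go
+package sketch
+
+// OpCode identifies the type of an operation; the integer values would be
+// part of the on-disk format, so they must stay stable across releases.
+type OpCode int
+
+const (
+	OpOpenProvider OpCode = iota + 1
+	OpCloseProvider
+	OpCreateFinalPlan
+	OpApplyChanges
+)
+
+// OpRef refers to another operation by its index in SourceGraph.Operations,
+// which is what makes the whole structure trivially serializable.
+type OpRef int
+
+// Operation is one node of the execution graph in its "source" form:
+// pure data, with no function pointers or closures.
+type Operation struct {
+	Code     OpCode  `json:"code"`
+	Operands []OpRef `json:"operands"` // results of these ops feed this one
+
+	// DataIndex points into one of the secondary lookup tables below,
+	// depending on what Code needs (e.g. a resource instance address).
+	DataIndex int `json:"dataIndex"`
+}
+
+// SourceGraph is the complete plan-time execution graph, ready to be
+// serialized into a plan file and reloaded by a later `tofu apply`.
+type SourceGraph struct {
+	Operations []Operation `json:"operations"`
+
+	// Secondary lookup tables for external data the operations rely on.
+	ResourceInstanceAddrs []string `json:"resourceInstanceAddrs"`
+	ProviderInstanceAddrs []string `json:"providerInstanceAddrs"`
+	PlannedStates         [][]byte `json:"plannedStates"` // opaque encoded values
+
+	// FinalStateOps records which operation produces each resource
+	// instance's "final state" object.
+	FinalStateOps map[string]OpRef `json:"finalStateOps"`
+}
+```
+
+Because such a structure contains only integers, strings, and index-based
+references, it could round-trip through any ordinary serialization format
+without losing the graph topology decided during planning.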
Ultimately this leads to quite a similar situation to the current runtime,
where each graph node has a separate goroutine that blocks until its
dependencies are available and then does some work. The difference is that
the work of deciding the nodes and edges of the graph is separated from the
work of preparing that graph for execution, so under this new model we can
finalize the graph topology during the plan phase and reuse _exactly that
graph_ during the apply phase, without having to rediscover all of the
relationships from the configuration and state data and risk producing a
slightly different outcome in that second attempt.

> [!NOTE]
>
> [The current implementation sketch](https://github.com/opentofu/opentofu/pull/3290)
> includes graph building and compilation, but does not yet include
> serialization and deserialization of source graphs.
>
> The most likely way to serialize the "source graph" form is as a set of lists
> of objects in any serialization format that has such concepts. The references
> between "nodes" are effectively just indices into those lists, with a
> separate list per type of node just because the data we'd want to store for
> each is different.
>
> ---
>
> One notable concern with the current model is that the graph builder uses
> generic types to help the author of the planning engine code wire operations
> together correctly based on the result type expected for each opcode, but
> that generic type information would be lost in serialization because the
> references to operations would just be plain indices into the operations
> list.
>
> However, each distinct opcode has a single result type that it always
> produces, and so with some light rearranging of the code we should be able
> to recover that type information during deserialization by inferring a
> result type from the opcode of the target of a reference. The generic type
> information is primarily for the benefit of the graph _builder_ and is not
> actually crucial for compilation and execution, although the compiler does
> nonetheless re-check that the source graph appears to be constructed
> correctly, primarily for robustness against type-assertion panics during
> the main apply work.
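> As a minimal sketch of that inference, reusing the hypothetical `Opcode`
> type from the earlier source-graph example (this mapping is invented for
> illustration, not the real `execgraph` code):
>
> ```go
> import "reflect"
>
> // StateObject is a hypothetical stand-in for a resource instance state
> // value. resultTypeForOpcode recovers, from an opcode alone, the result
> // type that the graph builder's generic API tracked at build time, so a
> // deserializer can re-check that each reference is well-typed before
> // compilation.
> type StateObject struct{ /* opaque */ }
>
> func resultTypeForOpcode(op Opcode) reflect.Type {
> 	switch op {
> 	case OpDesiredState, OpPriorState, OpApplyChanges:
> 		return reflect.TypeOf(StateObject{}) // always a state object
> 	default:
> 		return nil // e.g. OpSaveState produces no downstream value
> 	}
> }
> ```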
## Known Concerns and Open Questions

The ideas described above are a broad overview that glosses over some details.
Our initial implementation sketch also inevitably covers only a subset of
OpenTofu's full requirements, and has some implementation details that we're
hoping to improve on in subsequent work.

The following sections describe some concerns and questions that we're
intentionally deferring to later work, so that we can revisit them in a more
focused way once the broad skeleton of the system is in place.

### Overall Implementation Strategy

This is an ambitious project that will take some time to become generally
useful, and having essentially two implementations of the same behavior for a
long period is likely to make ongoing maintenance considerably harder.

On the other hand, this significant redesign of existing behavior carries a
high amount of risk, so we will not want to rush to ship it. We will likely
need a transitional period where the old and new implementations are both
available: perhaps initially the new implementation is opt-in, then later it
becomes _opt-out_ for a small cohort experiencing rare edge cases, before we
finally remove the old implementation once we're confident that the new one
is good enough.

We have not yet explored the full extent of the problem space enough to make
a firm plan all the way to the completion of this project, but we do have some
consensus among the maintainers about next steps to move us closer to that:

- We'll merge [the current implementation sketch](https://github.com/opentofu/opentofu/pull/3290)
  _mostly_ as-is, as dead code that the rest of the system won't call into
  yet.

  The main goal here is to allow us to then work in smaller increments
  building on that code without all of the overhead of a long-lived feature
  branch that we'd need to constantly rebase against other concurrent work.

  A secondary goal is that this will allow us to gradually refactor some
  existing code so that it can potentially be shared between the old and new
  implementations, minimizing the amount of duplication during the
  transitional period where possible. We expect that it'll be far simpler to
  perform this refactoring on `main`, where we can update both the old and
  new systems together without creating an effective fork of the current
  system for the duration of this project.

- Our initial focus for subsequent work will be to turn the initial sketch
  into a viable implementation of the basic "default" flow for all three
  of OpenTofu's current resource modes, but initially excluding support for
  lifecycle variations such as `ignore_changes`, `replace_triggered_by`,
  and other things that are not strictly needed to demonstrate an end-to-end
  plan/apply round for a simple but realistic set of resource instances.

  Once we reach this point, we'd consider (although _this is not yet a
  commitment_) making the new implementation accessible to end-users through
  some sort of experimental opt-in, so that we can then more easily test with
  real-world configurations as we gradually add the remaining features left
  out of the initial end-to-end implementation work.

- Each additional customization feature has some different interactions with
  the plan and apply logic, and so as we investigate those further we are
  likely to find that we need to add new concepts and mechanisms to the
  evaluator, planning engine, or apply engine that this initial proposal does
  not yet consider.

  We have made a best effort to run "thought experiments" to identify
  existing behaviors that would be impossible (or incredibly difficult) to
  reproduce under the new architecture described in this document, and we
  have not yet found anything of significant concern, so we have relatively
  high confidence that the high-level architecture is sound; the
  implementation details, however, may well change significantly when we come
  to add support for some of these features.

  Once we are at a point where it's possible to run a subset of real-world
  OpenTofu configurations under the new engine we should be in a better
  position to finalize our plans for completing all of this work and reaching
  a future OpenTofu release where the new implementation is the default (and
  hopefully _only_) implementation.

### Provider-defined Functions on Configured Providers

When implementing the Terraform provider protocol for provider-defined
functions, the OpenTofu project intentionally implemented the protocol
incorrectly to permit an additional feature without directly changing the
protocol: OpenTofu permits attempting to call a function that is not part
of the provider's schema, and when that happens it upgrades the provider
into "configured" mode and asks for the set of available functions again
in the hope that new functions will have appeared.

That intentional deviation from the protocol was primarily to support
[the `opentofu/lua` provider](https://search.opentofu.org/provider/opentofu/lua/latest),
which uses provider configuration in an unusual way: instead of representing
settings for connecting to an external system and credentials to use when
doing so, its configuration describes a Lua source file to load to find
additional functions that are then registered with the provider dynamically.
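In pseudocode terms, the deviation amounts to a fallback path like the
following sketch. The `Provider` interface here is a deliberate
simplification invented for this document, not the real plugin client API:

```go
package funcsketch

import "fmt"

// Provider is a hypothetical simplification of a provider plugin client.
type Provider interface {
	// Functions reports the names of the functions currently offered.
	Functions() map[string]bool
	// Configure moves the provider into "configured" mode, which for
	// providers like opentofu/lua can register additional functions.
	Configure(config map[string]string) error
}

// resolveFunction shows the intentional protocol deviation: if the
// requested function isn't in the provider's schema, configure the
// provider and ask for the function set again before giving up.
func resolveFunction(p Provider, config map[string]string, name string) error {
	if p.Functions()[name] {
		return nil // declared up front, as the protocol expects
	}

	// Deviation: upgrade to configured mode and re-query, in the hope
	// that configuration (e.g. loading a Lua source file) added it.
	if err := p.Configure(config); err != nil {
		return err
	}
	if !p.Functions()[name] {
		return fmt.Errorf("provider has no function named %q", name)
	}
	return nil
}
```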
This design decision had several interesting consequences:

1. Whereas previously only resource instances could depend on configured
   provider instances, in current OpenTofu literally any expression can depend
   on a provider instance, which means that the resource graph alone is
   insufficient to decide when it is okay to close a provider instance.

2. The OpenTofu project now maintains a fork of HCL that is primarily motivated
   by supporting this feature, since upstream HCL does not offer any way to
   perform static analysis of which functions might be called during the
   evaluation of an expression.

3. The provider mocking mechanism in the test framework relies on provider
   schema information in order to provide fake implementations of parts of a
   provider that would normally be available only when that provider has been
   configured, but this OpenTofu design undermined that by making one part of
   the schema -- the set of available functions -- be considered incomplete
   until configuration has happened.

Altogether then, this design decision has broken a number of assumptions in
OpenTofu's own design, the design of HCL, and the design of the Terraform
provider protocol that OpenTofu intends to act as a client of.

With the benefit of hindsight it seems that specifying a local file containing
a set of functions to export is quite a different kind of configuration than
specifying a remote network service to connect to and credentials to use when
communicating with it: an interaction with an external system is something
that OpenTofu must do carefully and only at certain times to ensure that
development aids like validation and testing can work outside of a real
execution environment, whereas specifying additional functions to include in
a configuration is something we'd _want_ to do in an offline validation/testing
environment so that the behavior of those functions can be included in the
tests.

Unfortunately the first and second points above both (to different extents)
undermine some of the assumptions made in the split between "evaluator" and
"planning engine" in this proposed design: the planning engine is responsible
for opening and closing provider instances during the planning phase, yet it
has no awareness of the individual expressions that might call
provider-defined functions, because those expressions are encapsulated within
the evaluator.

Because we are concurrently considering
[designing an additional _OpenTofu-specific_ provider protocol](https://github.com/opentofu/org/blob/main/wg/new-providers/README.md),
and one of the key ideas that working group will consider is making it easier
to write small, single-purpose providers in a variety of programming languages,
our current assumption is that the efforts of that working group will
eventually lead to a superior way to extend OpenTofu with function
implementations written in scripting languages like Lua, JavaScript, and
Python.

We are therefore currently assuming -- but have not yet made a final decision --
that we will ultimately deprecate OpenTofu's slight misuse of Terraform's plugin
protocol for functions in favor of encouraging those with that need to write
an OpenTofu provider directly in their language of choice, and making it as
easy as possible to write such a provider.

As a pragmatic compromise, our intention is to start by implementing a
workaround where the evaluator will start up configured providers itself,
independently of the work of the planning engine, in the currently-rare case
where it detects a call to a provider-defined function that isn't in the
provider's schema. This means that if there were hypothetically a provider
offering both resource types _and_ functions that only appear in the schema
after configuration, then it would be configured separately by the evaluator
and by the planning engine to serve those separate responsibilities. In
practice, all of the providers we know about that rely on OpenTofu's incorrect
protocol implementation have nothing in their schema except functions, and so
this double-instantiation of a provider instance should not typically occur.
However, exactly what to do about this concern remains an open question that
we will revisit further along the development path, hopefully after the
provider protocol working group has made more progress, so that we can make a
more informed decision about whether there will be a viable new approach to
including external functions written in scripting languages.

If we eventually find that it isn't viable to deprecate the current treatment
of provider-defined functions then we do have a design sketch for how to
complicate the evaluator with its own awareness of objects that need to be
"closed", but we'd prefer to avoid that significant additional complexity if
possible.